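arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.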

2606.12821 2026-06-12 cs.AI cs.ET New submission

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez, Devika Jain

Affiliations: Universidad Católica de Ávila (UCAV); Johns Hopkins University; Independent Researcher; Center for Geographic Analysis, Harvard University

AI summary: Introduces the first benchmark that evaluates environmental analysis agents through structured tool calls against a real API. It comprises 93 tasks and finds that Claude Sonnet 4 leads, that open-weight models are more cost-effective, and that comparison tasks remain universally unsolved.

Comments Preprint. 10 pages, 8 figures. Submitted to ACM SIGSPATIAL 2026

Abstract

Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.
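
The ratio in finding (2) can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the abstract; the Claude per-case cost is not reported there and is inferred from the "11x lower cost" statement, so treat it as an assumption.

```python
# Accuracy figures quoted in the abstract (percent, mean over three seeds).
claude_acc = 60.8
deepseek_acc = 56.3

# Per-case cost quoted for DeepSeek V3.2 (USD) and the quoted cost ratio.
deepseek_cost = 0.011
cost_ratio = 11  # "11x lower cost"

# Share of Claude's accuracy that DeepSeek V3.2 reaches.
relative_capability = deepseek_acc / claude_acc

# ASSUMPTION: Claude's per-case cost is inferred from the ratio,
# not taken from the abstract.
implied_claude_cost = deepseek_cost * cost_ratio

print(f"relative capability: {relative_capability:.1%}")
print(f"implied Claude cost per case: ${implied_claude_cost:.3f}")
```

The relative capability rounds to the 93% quoted in the abstract.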

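2606.12818 2026-06-12 cs.CL cs.AI New submission

Localizing Anchoring Pathways in Language Models

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

Affiliations: University of Maryland, College Park

AI summary: Studies the anchoring effect, in which irrelevant numbers in a prompt shift a language model's numerical reasoning. Using a logit-difference metric and attribution-based circuit localization, it finds that edge-level methods outperform node-level methods and characterizes how anchoring pathways are shared and how they transfer.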

Abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
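
A logit-difference metric of the kind the abstract describes might look like the sketch below. The sign convention (correct option minus anchor option), the function name, and the example logits are all assumptions made for illustration; the abstract does not give the exact definition.

```python
import numpy as np

def anchoring_logit_diff(option_logits, correct_idx, anchor_idx):
    """Logit of the correct option minus logit of the anchor-matching option.

    ASSUMPTION: this sign convention is illustrative, not the paper's definition.
    """
    logits = np.asarray(option_logits, dtype=float)
    return float(logits[correct_idx] - logits[anchor_idx])

# Hypothetical logits over four shared answer options (A to D).
clean_logits = [2.0, 0.5, 1.0, -0.3]      # prompt without an anchor number
anchored_logits = [1.2, 0.4, 1.9, -0.2]   # same prompt with an irrelevant number

clean_diff = anchoring_logit_diff(clean_logits, correct_idx=0, anchor_idx=2)
anchored_diff = anchoring_logit_diff(anchored_logits, correct_idx=0, anchor_idx=2)

# A drop in the difference means the model moved toward the anchor option.
print(clean_diff, anchored_diff)
```

Under this convention, a positive value favors the correct option and a negative value favors the option that matches the anchor.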

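2606.12814 2026-06-12 cs.RO cs.AI New submission

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

Affiliations: Southern University of Science and Technology

AI summary: Proposes Stubborn, a framework that unifies humanoid motion tracking and fall recovery through an asymmetric Actor-Critic architecture, a yaw-aligned representation, a Bernoulli-based probabilistic termination mechanism, and an adaptive sampling strategy, achieving performance competitive with existing methods along with robustness gains.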

Abstract

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
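
The Bernoulli-based probabilistic termination idea can be sketched as follows: instead of always ending an episode when a failure condition fires, the episode ends only with some probability, so the policy sometimes keeps exploring from unstable or fallen states. The function name, the fixed probability, and the failure flag are hypothetical; the abstract does not specify the failure conditions or how the probability is set.

```python
import random

def should_terminate(failed: bool, p_terminate: float, rng: random.Random) -> bool:
    """Bernoulli termination: on failure, end the episode with probability p_terminate.

    ASSUMPTION: a single fixed probability is used here for illustration.
    """
    if not failed:
        return False
    # rng.random() is uniform on [0, 1), so this is a Bernoulli(p_terminate) draw.
    return rng.random() < p_terminate

rng = random.Random(0)
# Over many failed steps, roughly p_terminate of them end the episode;
# the rest continue, leaving room for recovery-oriented exploration.
draws = [should_terminate(True, 0.3, rng) for _ in range(10_000)]
print(sum(draws) / len(draws))
```

With `p_terminate = 1.0` this reduces to the conventional immediate termination that the abstract contrasts against.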

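2606.12809 2026-06-12 cs.AI cs.LG New submission

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

Affiliations: University of Science and Technology of China

AI summary: Proposes the MLUBench benchmark for evaluating multimodal large language models under sequential unlearning requests. It finds that existing methods suffer cumulative degradation, identifies the challenge of preserving multimodal alignment, and proposes the LUMoE method to mitigate the degradation.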

Comments 36 pages, accepted to the ICML 2026

Abstract

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce the MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced in https://github.com/lihe-maxsize/Lifelong_Unlearning_main.

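2606.12807 2026-06-12 cs.CL New submission

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown

Affiliations: Columbia University

AI summary: Proposes the DETECT-REMASK-REPAIR framework, which uses masked diffusion language models to identify and repair outdated content in summaries, achieving localized faithfulness repair while preserving supported content, and introduces the StreamSum benchmark for evaluation.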

Abstract

Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, with the framework enabling faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.
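
The remask step in the middle of the pipeline can be illustrated with a toy example. This is only a sketch under stated assumptions: it works on whitespace tokens, takes the outdated spans as given (the detector is not shown), and stops before the repair step, which in the paper is performed by a masked diffusion language model. The `remask` helper and the `[MASK]` string are hypothetical.

```python
MASK = "[MASK]"  # ASSUMPTION: placeholder mask token for illustration

def remask(tokens, outdated_spans):
    """Replace tokens inside each half-open (start, end) span with MASK.

    Tokens outside the spans are left untouched, which is what lets the
    supported parts of the previous draft survive the repair.
    """
    out = list(tokens)
    for start, end in outdated_spans:
        for i in range(start, min(end, len(out))):
            out[i] = MASK
    return out

draft = "the meeting is on Monday at noon".split()
# Suppose a detector flagged token 4 ("Monday") as no longer supported.
masked = remask(draft, [(4, 5)])
print(" ".join(masked))
```

Only the masked positions would then be regenerated, which is the localized alternative to rewriting the whole summary.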

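2606.12797 2026-06-12 cs.AI New submission

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari

Affiliations: New Jersey Institute of Technology

AI summary: Finds that dominant agentic AI frameworks lack architectural-level safety guarantees and that the absence of memory integrity allows targeted corruption; proposes lightweight containment mechanisms that eliminate the tested attack vectors.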

Comments ICML 2026 (AI4GOOD Workshop)

Abstract

Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.
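
One way a memory integrity check could work is to tag every trusted write with a keyed hash and verify the tag on read. The sketch below uses Python's standard `hmac` module for this. It is not the paper's validator, whose design the abstract does not describe; the `SignedMemory` class, the key, and the example entry are hypothetical.

```python
import hashlib
import hmac

class SignedMemory:
    """Toy agent memory whose entries carry an HMAC tag (illustrative only)."""

    def __init__(self, key: bytes):
        self._key = key
        self._store = {}

    def _tag(self, name: str, value: str) -> str:
        msg = f"{name}\x00{value}".encode("utf-8")
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def write(self, name: str, value: str) -> None:
        # Trusted write path: store the value together with its tag.
        self._store[name] = (value, self._tag(name, value))

    def read(self, name: str) -> str:
        value, tag = self._store[name]
        # Constant-time comparison of the stored tag with a recomputed one.
        if not hmac.compare_digest(tag, self._tag(name, value)):
            raise ValueError(f"memory entry {name!r} failed integrity check")
        return value

memory = SignedMemory(key=b"demo-key-not-for-production")
memory.write("policy", "income_limit=30000")
ok_value = memory.read("policy")

# Simulate a poisoning write that bypasses the trusted write path.
memory._store["policy"] = ("income_limit=0", memory._store["policy"][1])
try:
    memory.read("policy")
    tamper_detected = False
except ValueError:
    tamper_detected = True
print(ok_value, tamper_detected)
```

A check like this only helps if the key and the trusted write path are kept out of reach of the content the agent processes.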

2606.12790 2026-06-12 cs.CL New submission

GENIE: A Fine-Grained Measure for Novelty

Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li

Affiliations: The University of Texas at Austin; New York University

AI summary: Proposes GENIE, a metric that measures the novelty of model-generated content at a fine grain using task-specific features, addressing the inability of holistic metrics to capture high-dimensional novelty.

2606.12789 2026-06-12 cs.CL cs.IR New submission

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

Affiliations: Department of Computer Science, Emory University

AI summary: Proposes HieraRAG, a hierarchical framework that uses synthetic question generation to study granularity in RAG benchmarks. It finds that the optimal granularity varies by dimension and introduces a Coherence Ratio metric.

DOI: 10.1145/3805712.3809925

Abstract

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
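
The abstract defines discriminative power as the standard deviation of generation quality across categories, and optimal granularity as the level that maximizes it. A minimal sketch of that selection rule is below. The per-category scores are made up for illustration, and the choice of population rather than sample standard deviation is an assumption, since the abstract does not say which is used.

```python
import statistics

def discriminative_power(category_scores):
    """Standard deviation of generation quality across categories.

    ASSUMPTION: population standard deviation; the paper may use the sample form.
    """
    return statistics.pstdev(category_scores)

# Hypothetical mean generation quality per category at each granularity level.
scores_by_level = {
    2: [0.60, 0.70],
    4: [0.50, 0.60, 0.70, 0.80],
    8: [0.62, 0.64, 0.66, 0.68, 0.62, 0.64, 0.66, 0.68],
}

power_by_level = {
    level: discriminative_power(scores) for level, scores in scores_by_level.items()
}
# The optimal granularity is the level with the largest spread across categories.
best_level = max(power_by_level, key=power_by_level.get)
print(power_by_level, best_level)
```

With these invented numbers the four-category level spreads the scores the most, so it would be selected.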

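2606.12783 2026-06-12 cs.AI New submission

A Tutorial on World Models and Physical AI

Il-Seok Oh

Affiliations: Department of Computer Science and Artificial Intelligence/CAIIT, Jeonju, Jeonbuk, South Korea

AI summary: Presents a unified framework that distinguishes explicit from implicit world models, and discusses their application to physical AI domains such as robotics and autonomous driving as well as the challenges on the way to artificial general intelligence.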

Abstract

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

2606.12780 2026-06-12 cs.LG cs.CL New submission

ProPlay: Procedural World Models for Self-Evolving LLM Agents

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

Affiliations: University of Notre Dame; University of Connecticut

AI summary: Proposes ProPlay, a procedural world model that uses procedure-level preplay and a causal procedure graph to let LLM agents self-evolve in partially observable environments without external supervision.

Abstract

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released in https://github.com/antman9914/proplay.

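2606.12767 2026-06-12 cs.AI New submission

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel

Affiliations: Georgia Institute of Technology

AI summary: Studies how question generation strategies based on Task-Method-Knowledge (TMK) models affect dataset quality for procedural and multi-hop reasoning, proposes a grounding validation framework, and finds that strict TMK generation performs best on grounding and usability.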

Comments 10 pages, 2 numbered figures. Workshop submission to HAIL @ AIED 2026

Abstract

Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

2606.12765 2026-06-12 cs.CL cs.DC New submission

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Ramchand Kumaresan

Affiliations: Apple Inc.

AI summary: Reverse-engineers the Metal 4.1 tensor compute path on the Apple M4 Max through microbenchmarks, showing that fp8 matmul2d is emulated rather than hardware-accelerated, and reconstructs the 8x8 cooperative_tensor fragment layout.

Abstract

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

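2606.12764 2026-06-12 cs.LG cs.CL cs.CR New submission

Detecting Functional Memorization in Code Language Models

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis

Affiliations: Meta; Imperial College London

AI summary: Studies functional memorization in code language models. In a counterfactual setup, it compares a model exposed to the target code with an unexposed reference model using textual and functional similarity measures, and finds that functional memorization extends beyond what textual overlap detects.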

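2606.12763 2026-06-12 cs.LG cs.DS New submission

Adaptive Weighted Averaging

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

Affiliations: University of Utah; Boston University; Google

AI summary: Proposes a method for selecting the largest unknown value from single unbiased estimates. The method is admissible and no worse than the baseline, and applying it to stochastic optimization yields online-to-batch conversion bounds.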

2606.12759 2026-06-12 cs.RO New submission

Sparse2Act: Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Yu Guo, Chang Yu, Siyu Ma, Yunuo Chen, Yin Yang, Ying Nian Wu, Chenfanfu Jiang

Affiliations: University of California, Los Angeles; University of California, San Diego; University of Utah

AI summary: Proposes Sparse2Act, a framework that pretrains masked sparse 3D encoders with action alignment for cross-domain robot manipulation. It reaches 86.9% success on LIBERO-10 and supports cross-domain and sim-to-real transfer.

Abstract

Explicit 3D representations are attractive for manipulation because they expose object shape, workspace geometry, and robot-object relations in metric coordinates. However, sparse 3D encoders are often learned through downstream task objectives, tying the representation to a particular data distribution, policy architecture, and action parameterization. We introduce Sparse2Act, an observation-action alignment framework for pretraining sparse point-cloud encoders. The key idea is to use task-space end-effector actions as geometric supervision: masked sparse 3D tokens are trained to organize scene features around the workspace motion paired with the observation. After pretraining, only the encoder initialization is reused by downstream policies, allowing them to retain their own architectures and action spaces, including joint-space commands. On the LIBERO-10 benchmark, our method achieves 86.9% average success after 500 fine-tuning steps. The same pretrained encoder supports LIBERO-to-Meta-World cross-domain transfer, achieving 73.4% average success on the Meta-World-5 benchmark. Ablations on the objective and decoder capacity show that the gains come from the masked action-alignment signal and remain useful across downstream action decoders. In real-world experiments, simulation pretraining followed by limited real-data fine-tuning achieves an average success rate of 72.5% across four tasks, demonstrating effective sim-to-real transfer. These results suggest that robot actions can provide compact geometric supervision for reusable sparse 3D representations.

2606.12747 2026-06-12 cs.AI New submission

Prefill Awareness in Large Language Models

Andy Wang, Parv Mahajan, David Demitri Africa, Alexandra Souly, Jordan Taylor, Robert Kirk

Affiliations: Constellation University of Wisconsin-Madison; Constellation Georgia Institute of Technology; UK AI Security Institute

AI summary: Studies whether large language models can recognize and respond to their assistant messages having been prefilled or tampered with. It finds that frontier models show substantial prefill awareness, which may affect safety evaluation methods.

Comments Submitted to NeurIPS 2026

Abstract

Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations later also show that detection and resistance rely on different cues, where stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.

2606.12744 2026-06-12 cs.CV 新提交

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

GRIP：面向大型多模态模型的反馈引导提示检索

Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra

Affiliations * University of Illinois Urbana Champaign; University of Bonn; Microsoft

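AI summary: Proposes GRIP, a learnable vision-only retrieval framework that uses feedback from multimodal models to identify the examples that genuinely improve in-context learning performance; it outperforms similarity-based retrieval on classification, captioning, and VQA tasks.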


AI Chinese abstract

上下文学习（ICL）已成为一种强大的机制，使大型语言模型（LLMs）无需微调即可适应新任务。将此概念扩展到大型多模态模型（LMMs），多模态上下文学习（M-ICL）依赖于检索相关示例（如图像、标题或问答对）来指导分类、描述和视觉问答（VQA）等任务的预测。现有方法大多基于特征空间相似性选择上下文示例，假设语义相似的样本提供最有用的上下文。然而，我们的系统分析表明，这一假设并不总是成立：视觉上相似的示例并不一定是那些最有效增强上下文学习性能的示例。为解决此问题，我们提出了上下文提示的引导检索（GRIP），一种可学习的纯视觉检索框架，利用LMMs的反馈来识别真正改善模型预测的示例。GRIP通过对比训练学习区分有益和有害的上下文示例，将检索优化到超越纯相似性。在三个多模态任务（分类、描述和VQA）上，GRIP在Qwen2.5-VL-7B上持续优于基于相似度的检索，在Idefics2-8B上的分类任务中提升最为显著。此外，我们证明了从一个开放LMM训练得到的检索器可以迁移到其他模型（包括闭源的GPT-4o和Gemini）而无需重新训练，从而实现了M-ICL的可扩展且经济高效的部署。代码将在接收后发布。

English abstract

In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
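
As a rough illustration of the contrastive idea described above, the sketch below scores one query embedding against examples that LMM feedback marked as beneficial or detrimental. The InfoNCE-style loss, the function name `contrastive_loss`, and the temperature value are assumptions made for illustration; the abstract does not specify GRIP's actual objective.

```python
import numpy as np

def contrastive_loss(query, positives, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not necessarily GRIP's).

    query:     (d,) embedding of the test image
    positives: (P, d) embeddings of examples that improved the LMM's prediction
    negatives: (N, d) embeddings of examples that hurt the LMM's prediction
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(query)
    pos_logits = normalize(positives) @ q / temperature  # shape (P,)
    neg_logits = normalize(negatives) @ q / temperature  # shape (N,)
    all_logits = np.concatenate([pos_logits, neg_logits])

    # Numerically stable log-sum-exp over every candidate example.
    m = all_logits.max()
    log_denominator = m + np.log(np.exp(all_logits - m).sum())

    # Mean negative log-likelihood of picking each beneficial example.
    return float(np.mean(log_denominator - pos_logits))
```

Minimizing such a loss would pull the query toward examples the model found helpful even when a detrimental example is visually closer, which is the gap between similarity and usefulness that the abstract points to.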


2606.12740 2026-06-12 cs.LG New submission

Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

深度展开潜在最优分区l2/l1网络用于数据驱动的块稀疏恢复

Takanobu Furuhashi, Hidekata Hontani, Qibin Zhao, Tatsuya Yokota

Affiliations * Nagoya Institute of Technology; RIKEN Center for Advanced Intelligence Project

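AI summary: To address the convex LOP-l2/l1 method's reliance on manual parameter tuning and its non-differentiable proximal operator, proposes two deep-unfolded architectures, based on implicit differentiation and deep weight factorization, that learn the parameters automatically; they perform well in block-sparse recovery and are robust to impulsive noise.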

Comments 11 pages, 6 figures

2606.12736 2026-06-12 cs.AI cs.LG New submission

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

跨尺度科学挑战的AI智能体基准测试

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

Affiliations * Yale University; Broad Institute of MIT and Harvard; The Pennsylvania State University; Northeastern University; Northwestern University

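AI summary: Introduces the SciAgentArena benchmark, with about 200 interactive tasks, to evaluate AI agents in real research scenarios; agents are effective at data analysis but uneven on novel exploration and open-ended questions.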

Comments 6 figures


AI Chinese abstract

AI智能体正被越来越多地开发用于加速科学发现，但它们在真实研究环境中的实际能力仍知之甚少。现有的AI智能体基准很少捕捉科学工作所需的复杂性、异质性和扩展推理，而科学任务的基准通常将研究简化为静态、直接的问题，并对交互式评估支持有限。在此，我们引入SciAgentArena，这是一个系统性的基准，用于评估AI智能体在来自多个领域新兴需求的真实科学研究场景中的表现。SciAgentArena包含约200个具有逐步验证的任务，以及一个交互式、与智能体无关的环境，用于评估不同的AI智能体。使用该基准，我们发现当前智能体能够有效贡献于明确指定的数据分析工作流，特别是当任务结构和评估标准清晰时。然而，它们在科学情境中的表现仍然不均衡：智能体难以产生真正新颖的见解，维持自主探索，并为开放的研究问题制定稳健的解决方案。我们进一步描述了智能体常见的失败模式，并识别了提高其可靠性、自主性和科学推理能力的机会。总之，SciAgentArena提供了一个实用的框架，用于衡量AI智能体在科学领域的进展，并指导未来能够应对复杂科学挑战的智能体设计。完整代码、任务和数据集可通过此链接访问：https://sciagentarena.github.io/。

English abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.


2606.12735 2026-06-12 cs.LG New submission

Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

物理信息神经网络与径向基函数求解含狄拉克δ源的偏微分方程

Manuel Reyna, Alexandre Tartakovsky

Affiliations * Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign

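AI summary: For PDEs containing Dirac delta terms, interprets physics-informed neural networks as residual least squares methods, handles the delta terms directly through the weak form, and compares against radial basis function expansions, finding that RBF residual least squares is more stable on transport problems.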

Comments 33 pages, 4 figures


AI Chinese abstract

物理信息神经网络（PINNs）是一种用于求解正向和逆向偏微分方程（PDEs）的机器学习方法。当应用于强迫项、边界条件或初始条件中包含狄拉克δ函数的PDEs时，PINNs需要用光滑的代理函数来近似它们，这种做法可能会引入显著的建模误差。在这项工作中，我们利用PINNs作为残差最小二乘法（RLS）的解释，并表明这种视角能够通过积分弱形式方程直接处理狄拉克δ项。在除PINN之外的RLS公式中，我们重点关注径向基函数（RBF）展开（也称为单层RBF网络）。我们证明，虽然在PINNs中积分掉狄拉克δ会导致残差无法收敛到零，但RBF-RLS始终能为输运问题提供良好的正向和逆向解。我们使用神经正切核（NTK）理论解释这一发现。我们在代表多孔介质和河流中地下水流和输运的线性PDEs上测试了这两种方法。我们求解逆问题以拟合合成数据、含噪声的合成数据以及真实世界测量值。

English abstract

Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.
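
The weak-form treatment mentioned above rests on the sifting property of the Dirac delta. As a hedged illustration (the operator $\mathcal{L}$, source strength $Q$, source location $x_0 \in \Omega$, and smooth test function $\varphi$ are notation assumed here, not taken from the paper), multiplying a PDE with a point source by $\varphi$ and integrating over the domain replaces the singular term with a point evaluation:

```latex
\mathcal{L}u(x) = f(x) + Q\,\delta(x - x_0), \quad x \in \Omega
\qquad\Longrightarrow\qquad
\int_\Omega \varphi(x)\,\mathcal{L}u(x)\,dx
  = \int_\Omega \varphi(x)\,f(x)\,dx + Q\,\varphi(x_0).
```

The integrated residual contains only $Q\,\varphi(x_0)$, so no smooth surrogate for $\delta$ has to be chosen.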


2606.12731 2026-06-12 cs.LG cs.CY New submission

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

规范性鲁棒性作为LLM中不可验证推理的前沿

Elizaveta Tennant, Benjamin Henke, Anita Keshmirian, Murray Shanahan, Verena Rieser, Kristian Lum, Sydney Levine, Julia Haas

Affiliations * DeepMind; Institute of Philosophy, School of Advanced Study, University of London; Technische Universität Berlin

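AI summary: Proposes moral reasoning as a paradigmatic subdomain of non-verifiable reasoning, defines moral robustness, and introduces a scalable multi-turn adversarial evaluation framework; models shift their reasoning toward the user's preference (6.5% on average) and are affected by premise order and number of turns.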


AI Chinese abstract

随着LLM越来越多地承担咨询和审议角色，用户在缺乏客观真实性的领域中依赖它们进行不可验证推理。然而，传统LLM推理评估几乎只关注基于事实的领域（如数学和科学），导致不确定模型能否以及能在多大程度上处理随时间变化的模糊、主观或价值负载问题。为解决这一问题，我们提出道德推理作为不可验证推理的一个典型子域。我们将道德鲁棒性定义为模型在不同时间和情境下展现合理道德推理的能力，并引入一个可扩展的、对抗性的多轮评估框架来实证测量这一能力。我们在四个前沿LLM上模拟了48,000次用户-智能体道德讨论，变化前提相关性、前提顺序、对话时长和用户声明的道德观点。我们发现模型成功忽略了道德无关的干扰项，但平均向用户声明的偏好道德观点偏移了6.5%的推理，并且推理因顺序（在13-22%的案例中改变道德判断）和时长（在10-24%的案例中在单轮和多轮之间改变道德判断）等因素而变化。我们的分析表明，模型不仅调整最终裁决，还调整其背后的理由以适应用户的道德观点——我们将这种失败模式称为道德审议谄媚。

English abstract

As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model's capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user's stated moral view. We find that models successfully ignore morally-irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user's stated preferred moral view, and vary their reasoning depending on factors such as order (altering moral judgments in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user's moral viewpoint - a failure mode we characterize as moral deliberative sycophancy.


2606.12730 2026-06-12 cs.AI cs.CL cs.CY cs.LG New submission

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

重新思考LLMs的心理测量评估：自我报告何时以及为何能预测行为

Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

Affiliations * Caltech; UIUC; University of Cambridge

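AI summary: Contrasts the Big 5 with the Theory of Planned Behavior and finds that self-report-behavior coherence in LLMs is selective: within a shared conversation TPB reaches human-level coherence, across conversations coherence holds only for behaviors anchored in training, and persona prompting does not bring behavior into alignment.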

Comments Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)


AI Chinese abstract

从低成本心理测量探针预测LLM行为倾向对于安全部署至关重要，但前提是自我报告（SR）能可靠地预测行为。近期研究记录了LLMs中显著的SR-行为分离，但依赖于广泛的人格特质（大五），这些特质即使在人类中也只能弱预测特定行为。此外，对话会话的隔离加上弱上下文匹配使得以下问题悬而未决：LLMs是否真正缺乏一致性，或者检测这种一致性所需的条件是否未满足。我们将大五与计划行为理论（TPB）进行对比，后者测量针对特定行为的意图，并且比广泛特质能更好地预测人类行为。我们在四个行为任务和11个前沿LLM上进行实验，同时改变会话上下文和身份诱导。我们发现SR-行为一致性存在但具有选择性。1) 在共享对话中，计划行为理论达到人类水平的一致性；大五则没有。2) 在跨对话中，一致性仅对锚定于即时提示之外的行为（如由训练塑造的内隐偏见）幸存，而当行为被上下文强烈启动（如谄媚）时则崩溃。3) 角色提示使自我报告在对话间更一致，但并未使行为对齐。这些发现表明，粗糙的人格框架（如大五）可能不是测试部署行为的最佳工具。需要更多任务和特定行为的工具，并且即使这些工具也必须在任务和上下文中进行评估。

English abstract

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.


2606.12721 2026-06-12 cs.AI New submission

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

心智理论效用：心理化机制的形式化规范

Nikolos Gurney, Stacy Marsella

Affiliations * Institute for Creative Technologies, University of Southern California; Khoury College of Computer Sciences, Northeastern University

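AI summary: Proposes the Theory of Mind Utility (ToM-U) framework, which formalizes the computational problem of inferring others' beliefs via Local Epistemic World Models (LEWMs), defines the structure, inference procedures, and failure traces, and differs from approaches such as Bayesian Theory of Mind.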


AI Chinese abstract

推断他人的信念需要超越表面信号；需要追踪谁告诉了他们什么、以什么顺序以及有多可信。心智理论效用（ToM-U）在计算分析层面形式化了这一认知状态推断问题，明确了心理化计算的内容和原因，而不承诺算法或神经实现。ToM-U通过构建局部认知世界模型（LEWMs）——表示智能体、状态节点及其之间认知关系的有向类型图——并根据观察到的行为评估离散候选LEWM，直到达到足够的置信度来实现这一点。五个形式定义指定了LEWM结构、包括有序信息访问历史的智能体节点属性、递归心理化的有界增殖机制、三种推理过程以及一个残差函数，该函数捕捉失败心理化尝试留下的结构化痕迹。ToM-U不同于贝叶斯心智理论和相邻的形式化描述，后者预设而非推导信念状态，也不同于模拟理论和理论-理论，后者缺乏认知状态推断的形式化工具。该架构生成关于心理化失败的方向性、可证伪预测，这些预测源于模型的结构属性而非辅助假设，并将ToM-U定位为在目标推断和其他下游社会认知过程之前的领域无关机制。

English abstract

Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) -- directed typed graphs that represent agents, state nodes, and the epistemic relationships among them -- and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.
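
To make the LEWM description above more concrete, here is a minimal sketch of a directed graph with typed edges whose agent nodes keep an ordered information-access history. All class, method, and relation names (`AgentNode`, `LEWM`, `receive`, `"believes"`) are hypothetical; the paper's five formal definitions are not reproduced in the abstract, so this shows only the general shape of such a structure.

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    """An agent plus the ordered record of what it was told (hypothetical layout)."""
    name: str
    access_history: list = field(default_factory=list)

    def receive(self, source: str, state: str, credibility: float) -> None:
        # Appending preserves the order in which information arrived.
        self.access_history.append((source, state, credibility))

@dataclass
class LEWM:
    """Toy Local Epistemic World Model: a directed graph with typed edges."""
    agents: dict = field(default_factory=dict)
    states: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source node, relation type, target node)

    def add_agent(self, name: str) -> AgentNode:
        self.agents[name] = AgentNode(name)
        return self.agents[name]

    def add_state(self, label: str) -> None:
        self.states.add(label)

    def relate(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))

    def targets(self, src: str, relation: str) -> list:
        # All nodes reachable from `src` through an edge of the given type.
        return [d for s, r, d in self.edges if s == src and r == relation]

# Example: Bob tells Alice the box is empty, and the model records her belief.
model = LEWM()
alice = model.add_agent("alice")
model.add_state("box_is_empty")
alice.receive(source="bob", state="box_is_empty", credibility=0.9)
model.relate("alice", "believes", "box_is_empty")
```

Evaluating discrete candidate LEWMs against observed behavior, as the abstract describes, would then amount to scoring several such graphs; that scoring step is not sketched here.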


2606.12718 2026-06-12 cs.LG eess.SP New submission

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

面向开放集射频指纹识别的分布外检测器

Sudeepta Mondal, Ganesh Sundaramoorthi

Affiliations * University of Michigan

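AI summary: For the distribution shift caused by unknown transmitters and temporal drift in open-set RF fingerprinting, introduces a unified information-theoretic framework for OOD detection and adopts methods that need no OOD tuning data; on the POWDER dataset their performance approaches that of baselines with true OOD data.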


AI Chinese abstract

射频指纹识别系统必须在开放世界环境中运行，其中来自未知发射机的信号和时间漂移会在测试时引入分布偏移。分布外检测为该问题提供了自然框架，但其在射频指纹识别中的应用仍然有限。其采用的一个关键障碍是大多数OOD检测器需要辅助OOD数据进行参数调优，而在射频环境中收集代表性OOD数据不切实际，这一假设难以满足。在这项工作中，我们将机器学习文献中一组有前景的OOD检测方法引入开放集RFF领域。我们基于信息论（通信系统的自然框架）在一个统一的数学框架中呈现这些方法。我们的框架允许对方法进行系统分析并开发新方法。我们进一步展示了最近关于无需给定OOD调优数据即可调优OOD检测器的工作在开放集RFF中的适用性。我们在POWDER射频指纹数据集上进行评估，表明无需任何给定OOD数据调优的检测器性能与能够访问真实OOD调优数据的基线相当，并且大大优于无法访问真实OOD调优数据的基线方法，展示了RFF问题的实际可行性。

English abstract

Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to its adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of existing methods and the development of new ones. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data for open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without access to true OOD tuning data, showcasing their practical viability for the RFF problem.
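
For readers unfamiliar with OOD scoring, the sketch below shows one widely used detector from the machine learning literature, the energy (log-sum-exp) score, with a threshold chosen from in-distribution data only. It is a stand-in chosen for illustration: the abstract does not say which detectors the paper covers, and the quantile-based threshold here is an assumption rather than the paper's tuning procedure.

```python
import numpy as np

def energy_score(logits):
    """Log-sum-exp of classifier logits; larger values look more in-distribution."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

def threshold_from_id(id_scores, tpr=0.95):
    """Threshold that keeps a fraction `tpr` of known-transmitter samples.

    Uses only in-distribution scores, so no OOD tuning data is required.
    """
    return float(np.quantile(id_scores, 1.0 - tpr))

def is_ood(logits, threshold):
    """Flag samples whose energy score falls below the threshold."""
    return energy_score(logits) < threshold
```

A signal from an unseen transmitter tends to produce flat, low-magnitude logits and therefore a low score, which is what the threshold catches.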


2606.12716 2026-06-12 cs.CL New submission

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

AI审稿人是否看到全貌？攻击与防御多模态同行评审

Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen

Affiliations * University of North Carolina at Chapel Hill

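AI summary: To address the vulnerability of AI peer review to multimodal adversarial attacks, proposes the PaperGuard benchmark, comprising a multi-domain dataset, a unified attack suite, and a practical defense based on chunk-based embedding search.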

Comments Accepted to ICML 2026, Project Page: https://paper-guard.github.io/


AI Chinese abstract

将大型语言模型（LLMs）和多模态LLMs（MLLMs）集成到科学同行评审工作流程中，引入了对抗性操纵的新重大风险，尤其是考虑到科学论文的多模态性质——其中图表（而非仅文本）传达了核心证据。这造成了一个显著差距：当前关于AI同行评审的鲁棒性研究绝大多数仅针对文本。此外，该问题与标准越狱不同，因为同行评审攻击旨在诱导领域特定的、有针对性的失败（例如，“提高这个分数”），而非违反一般安全策略，而目前尚无实用的防御措施。为解决此问题，我们引入了PaperGuard，这是第一个旨在系统评估和防御AI生成的同行评审免受这些领域特定、跨模态攻击的全面基准。我们的框架基于三大支柱：（1）一个新的跨多个科学领域的多模态同行评审数据集；（2）一套统一的攻击方法，包括黑盒提示注入和白盒扰动，专门针对文本（GCG）和图表（PGD）；（3）一种实用的防御方法，受学术论文长上下文挑战的启发，使用基于分块的嵌入搜索来高效定位和缓解有害指令。我们在最先进模型上进行的广泛实验证实，AI审稿人普遍存在脆弱性。PaperGuard建立了必要的基准、协议和可操作的防御措施，以开创可信赖、抗攻击的AI辅助学术评审。

English abstract

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.
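
The chunk-based embedding search mentioned as the third pillar can be pictured with the toy sketch below: split the paper into chunks, embed each one, and flag chunks that are close to known injection phrasings. The bag-of-words `embed` function is a deliberately simple stand-in for a real embedding model, and the names `localize`, `probes`, and the 0.4 threshold are assumptions for illustration, not PaperGuard's implementation.

```python
import math
import re
from collections import Counter

def chunk(text, size=10):
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: lowercase word counts (a real system would use a neural encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def localize(paper_text, probes, size=10, threshold=0.4):
    """Return (chunk index, score, chunk text) for chunks similar to any probe."""
    probe_vecs = [embed(p) for p in probes]
    flagged = []
    for idx, piece in enumerate(chunk(paper_text, size)):
        score = max(cosine(embed(piece), pv) for pv in probe_vecs)
        if score >= threshold:
            flagged.append((idx, score, piece))
    return flagged
```

Searching chunk by chunk keeps each comparison short, so a long manuscript never has to fit into a single context window; flagged chunks can then be removed or masked before the reviewer model sees the paper.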


2606.12713 2026-06-12 cs.AI New submission

Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

能力对齐之前的定义对齐：一个用于裁定关于AGI主张的设计科学框架

J. E. Aguilera Briones

Affiliations * Universidad Internacional de Investigación México

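AI summary: To address the disputes caused by inconsistent definitions of AGI, proposes the DAF-AGI framework, comprising five ordinal criteria and a structured governance audit, for assessing candidate definitions and adjudicating AGI claims.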

Comments 31 pages, 1 table, 2 appendices


AI Chinese abstract

关于人工通用智能已经到来或仍需数十年的主张常常基于重叠的证据进行辩护。“AGI”缺乏一个单一共享且稳定的指称，不同的操作化方法可能对同一系统给出不同的判定。本文将这种欠指定性视为一个设计和治理问题。遵循设计科学研究方法论，本文开发了DAF-AGI，一个二阶概念性人工制品，包含两个耦合组件：用于评估候选定义的裁定适应性的五个序数标准，以及对作者身份、利益、认证、外部验证和修订权威的结构化治理审计。该人工制品在五个显著的测量族和一个通缩边界立场上进行了演示，这些均来自一个已记录的语料库，然后对一个风格化的强到来主张进行了压力测试：即当前生成系统构成AGI，因为它们在许多认知任务上优于受过良好教育的成年人。根据引用的2024-2025年来源的证据，该主张仅在基于性能的操作化下可认证；能力本体论、心理测量学和技能习得方法未认证它，经济族仍不确定，通缩立场拒绝二元裁定。贡献在于新颖的整合和操作化，而非经验验证：独立应用、评估者间测试和作者外部案例仍然是必要的。本文进一步提出定义主权作为算法主权的使能组件：即在公共问责下对进口技术类别进行质疑、认证和修订的制度能力。

English abstract

Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping evidence. "AGI" lacks a single shared and stable referent and competing operationalizations can return different verdicts on the same system. This article treats that under-specification as a design and governance problem. Following Design Science Research Methodology, it develops DAF-AGI, a second-order conceptual artifact with two coupled components: five ordinal criteria for assessing the adjudicative fitness of candidate definitions and a structured governance audit of authorship, interest, certification, external verification and revision authority. The artifact is demonstrated on five prominent measurement families and one deflationary boundary position in a documented corpus and then stress-tested against a stylized strong arrival claim: that current generative systems constitute AGI because they outperform a well-educated adult on many cognitive tasks. On evidence from the cited 2024-2025 sources, the claim was certifiable only under a performance-based operationalization; capability-ontology, psychometric and skill-acquisition approaches did not certify it, the economic family remains indeterminate and the deflationary position refuses binary adjudication. The contribution is a novel integration and operationalization, not an empirical validation: independent application, inter-rater testing and author-external cases remain necessary. The paper further proposes definitional sovereignty as an enabling component of algorithmic sovereignty: the institutional capacity to contest, certify and revise imported technological categories under public accountability.


2606.12708 2026-06-12 cs.CL cs.AI New submission

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

AfriSUD：用于评估非洲语言模型的依存树库集合

Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum

Affiliations * Princeton University; Laboratory for Artificial Intelligence, Princeton University; Gaston Berger University; Mila, McGill University; Canada CIFAR AI Chair; Paris Nanterre University; Paris-Saclay University; CNRS; Inria; LORIA; Université de Lorraine; University of Trento; University of Minnesota–Twin Cities; Imperial College London; Binghamton University; Makerere University; Penn State University; Mbarara University of Science and Technology; Chalmers University of Technology; University of Ibadan; Nnamdi Azikiwe University; South African Centre for Digital Language Resources

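AI summary: To remedy the shortage of NLP resources for African languages, builds AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine African languages; evaluating a range of models reveals a significant syntax gap.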


AI Chinese abstract

尽管非洲语言具有语言多样性和全球重要性，但在支持NLP的研究和资源中仍代表性不足。我们通过引入AfriSUD来弥合这一差距，这是首个大规模句法标注树库集合，涵盖九种多样的非洲语言，跨越撒哈拉以南非洲的主要语系和地区。采用表层句法通用依存（SUD）框架，我们社区主导的努力提供了高质量、经母语者验证的数据，捕捉了如黏着和声调等类型学关键特征。我们在AfriSUD上评估了多种模型，包括非Transformer基线、多语言预训练编码器和LLM，用于词性标注和依存句法分析。我们的结果揭示了显著的句法差距，模型在九种语言上仍表现出明显局限性，表明现有架构可能无法完全捕捉非洲语言句法的结构多样性。

English abstract

Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.


2606.12706 2026-06-12 cs.CV New submission

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

VLADriveBench：评估自动驾驶VLA中的CoT-动作关系

Thach Nguyen, Danhua Guo, Tom Lampo, Fei Wu, Burhan Yaman

Affiliations * Uber AV Labs

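AI summary: Proposes the VLADriveBench framework, which combines observational metrics with a CoT intervention protocol to evaluate the correlation and causality between chain-of-thought and driving actions in VLA models, and finds that behavior differs markedly across models.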

2606.12699 2026-06-12 cs.LG cs.AI New submission

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

基于可穿戴传感器数据的2型糖尿病个性化血糖评估：LLM驱动方法

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Affiliations * Department of Information Systems and Cybersecurity, The University of Texas at San Antonio; School of Engineering Medicine, Texas A&M University; Department of Family and Community Medicine, The University of Texas at San Antonio

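AI summary: Proposes the GlyLLM framework, which uses large language models to integrate wearable sensor data with structured metadata for personalized modeling of glycemic dynamics, improving on traditional ML methods by 13.66% and 13.08% on glucose forecasting and diabetes categorization, respectively.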

Comments The 14th IEEE International Conference on Healthcare Informatics, 2026


AI Chinese abstract

2型糖尿病（T2D）对全球健康构成日益严重的威胁，需要有效的血糖评估来支持个性化和改进的糖尿病护理。可穿戴传感器如连续血糖监测仪（CGM）和健身追踪器为血糖评估提供了许多有价值的见解。然而，有效分析这些数据需要与重要的个体层面背景信息整合。现有方法通常基于传统机器学习（ML），主要依赖历史血糖测量值，忽略了个性化信息，这限制了它们在多样化糖尿病群体中的性能。大语言模型（LLMs）的最新进展展示了它们整合多种数据模态同时建模序列依赖性的能力，激发了探索其在个性化血糖评估中潜力的兴趣。在本文中，我们提出了GlyLLM，一个基于LLM的框架，通过整合可穿戴传感器数据和结构化元数据来建模基于CGM的血糖动态。GlyLLM可以利用预训练LLM的广泛先验知识，并在决策时实现传感器-文本语义抽象。在AI-READI数据集上的两个相关任务实验表明，我们的模型在血糖预测的均方根误差（RMSE）上平均优于传统ML方法13.66%，在糖尿病分类的受试者工作特征曲线下面积（AUROC）上平均优于13.08%。此外，我们的消融研究表明，糖尿病调查和生物特征测试比其他健康信息对血糖评估更为关键。我们的工作为利用LLM推进T2D护理中的个性化血糖评估迈出了有希望的一步。

English abstract

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.


2606.12690 2026-06-12 cs.RO cs.AI New submission

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

EWAM：一种用于具身智能闭环在线自适应的增强世界动作模型

Xin Zhou, Cong Miao

Affiliations * Astronex Robotics; Nanjing University of Information Science and Technology

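AI summary: Proposes the EWAM architecture, built on a frozen Cosmos3 backbone, which achieves zero-shot online adaptation through four lightweight neural layers, without fine-tuning or additional demonstration data, substantially reducing the deployment data needed for new task layouts.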


AI Chinese abstract

在本文中，我们提出了增强世界动作模型（EWAM），这是一种基于预训练且完全冻结的Cosmos3骨干网络构建的闭环在线自适应架构。EWAM完全在零样本任务协议下进行评估，其核心目标是减少适应新任务布局所需的额外部署数据量。值得注意的是，所有评估中均未引入额外的任务特定演示集，也未对骨干网络进行微调。其性能提升完全源于由四个插入的轻量级神经层组成的推理时协同推理机制：位于扩散变换器（DiT）中间层的神经经验记忆层提供任务相关的执行上下文；状态预测头之后的神经异常检测层实时监测预测状态与实际状态之间的差异；神经策略路由层根据异常严重程度动态选择直接执行、保守重规划或回滚恢复；神经动作校正层利用执行诊断优化生成的动作块。与简单的特征融合不同，记忆、异常检测和校正模块以可微分的方式深度集成到Cosmos3的前向路径中，仅最终路由决策是离散监督的。

English abstract

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
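
The routing step described above (direct execution, conservative replanning, or rollback recovery, chosen by anomaly severity) can be pictured with a small rule-based sketch. In EWAM the routing layer is a learned neural module; the fixed thresholds, the Euclidean severity measure, and the names `Route`, `anomaly_severity`, and `route` below are hypothetical simplifications.

```python
from enum import Enum

import numpy as np

class Route(Enum):
    DIRECT = "direct_execution"
    REPLAN = "conservative_replanning"
    ROLLBACK = "rollback_recovery"

def anomaly_severity(predicted_state, actual_state):
    """Divergence between predicted and observed state (Euclidean norm, assumed)."""
    diff = np.asarray(predicted_state, dtype=float) - np.asarray(actual_state, dtype=float)
    return float(np.linalg.norm(diff))

def route(severity, replan_at=0.1, rollback_at=0.5):
    """Pick an execution mode from the anomaly severity (thresholds are illustrative)."""
    if severity >= rollback_at:
        return Route.ROLLBACK
    if severity >= replan_at:
        return Route.REPLAN
    return Route.DIRECT
```

Calling `route(anomaly_severity(predicted, observed))` once per control step gives the closed-loop behavior in miniature: small divergences leave the plan untouched, while larger ones trigger progressively more conservative responses.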


