2606.12674 2026-06-12 cs.AI new submission

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao

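AI summary: Proposes Evoflux, an inference-time evolutionary search method that repairs the tool workflows of compact language models through structured edits and execution feedback, raising execution feasibility from about 3% to 17-24%. It is more reliable than SFT on the same data, and has lower variance and token cost than ReAct.

Comments: Code is available at https://github.com/IBM/Evoflux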

Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.

2606.12852 2026-06-12 cs.AI new submission

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Renmin Cheng, Changhao Chen

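Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the WISE framework, which augments episodic memory with a Causal Event Graph that links what-where-when memory to which-why reasoning, and combines opportunistic task scheduling with multi-scale exploration to substantially improve success rate and efficiency on long-horizon sparse tasks.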

Abstract

Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

2606.12882 2026-06-12 cs.AI new submission

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang

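AI summary: Proposes HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection, reducing token usage and trajectory length and generalizing to larger models.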

Abstract

Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent-environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether a harness can be generated by a learnable plug-in module that can be trained in an end-to-end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection. HarnessBridge learns two projections, one in each direction: an observation projection, which distills raw trajectories into compact, decision-relevant states, and an action projection, which converts proposed actions into executable transitions or trajectory-grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal-Bench 2.0 and SWE-bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

2606.12924 2026-06-12 cs.AI new submission

LLM as an Investigator: Evidence-First, Robust Interactive Problem Diagnosis

Fabrizio Marozzo, Pietro Liò

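Affiliations: University of Calabria; University of Cambridge

AI summary: Proposes LLM-as-an-Investigator, an evidence-first AI methodology that estimates problem ambiguity, generates hypotheses, asks clarifying questions, and updates hypothesis probabilities, so that it avoids accepting user assumptions prematurely and improves diagnostic accuracy.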

Abstract

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
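
The hypothesis-update step described in this abstract can be illustrated with a small sketch. It assumes a plain Bayesian update, which the abstract does not specify; the function names, the stopping margin, and the hydraulic example values are all hypothetical.

```python
def update_posterior(prior, likelihoods):
    """Bayes update: posterior is proportional to prior * P(answer | hypothesis)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the answer is impossible under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


def leading_hypothesis(posterior, margin=2.0):
    """Return the top hypothesis only if it is `margin` times as likely as the runner-up.

    Returns None when the evidence is still too weak (needs at least two hypotheses).
    """
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    return best if p_best >= margin * p_second else None


# Hypothetical hydraulic case: the user suspects a worn pump.
prior = {"worn pump": 0.5, "clogged filter": 0.3, "air in lines": 0.2}

# Assumed likelihoods of the user's answer to one clarification question.
answer_likelihoods = {"worn pump": 0.2, "clogged filter": 0.7, "air in lines": 0.4}

beliefs = update_posterior(prior, answer_likelihoods)
print(leading_hypothesis(prior))    # still ambiguous before the question
print(leading_hypothesis(beliefs))  # one explanation now dominates
```

The stopping rule here (top hypothesis at least twice as probable as the runner-up) is only one simple reading of "stronger than the alternatives"; the paper may use a different criterion.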

2606.13262 2026-06-12 cs.AI new submission

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu

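Affiliations: School of Computer Science and Engineering, Sun Yat-sen University

AI summary: Proposes the ProFact framework, which uses agentic reinforcement learning to optimize the multi-stage fact-verification pipeline end to end and introduces process-aware rewards to address sparse, delayed supervision, improving verification performance and inference efficiency.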

Abstract

Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

2606.13361 2026-06-12 cs.AI cs.CE cs.MA new submission

Can I Buy Your KV Cache?

Luoyuan Zhang

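Affiliations: Harbin Institute of Technology, Shenzhen (HITSZ)

AI summary: To stop AI agents from recomputing the KV cache of the same document, proposes that the publisher precompute the KV cache and that other agents pay to load it and skip prefill. Experiments on Qwen3-4B show a 9-50x reduction in compute cost, and the paper outlines an agent-native prefill CDN architecture.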

Abstract

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
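
The cost figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the abstract; the variable and function names are illustrative, and the dollar amounts are the abstract's rounded values.

```python
AGENTS = 80_000_000            # agents reading the same hot document
REPREFILL_TOTAL_USD = 1.5e6    # "~$1.5M" to re-prefill for every agent
COMPUTE_SAVING = 49.7          # measured prefill-vs-reuse compute ratio
CACHE_READ_TARIFF = 0.1        # "0.1x cache-read tariff" charged by APIs

# Total reuse compute implied by the measured saving.
reuse_total_usd = REPREFILL_TOTAL_USD / COMPUTE_SAVING

# Re-prefill cost attributable to a single agent.
prefill_per_agent_usd = REPREFILL_TOTAL_USD / AGENTS

# Discount a user sees when billed at the cache-read tariff.
user_discount = 1 / CACHE_READ_TARIFF


def attention_cost_ratio(length_a, length_b):
    """Relative cost of the L^2 attention term in prefill for two document lengths."""
    return (length_b / length_a) ** 2


print(f"reuse compute: ${reuse_total_usd / 1e6:.2f}M")
print(f"re-prefill per agent: ${prefill_per_agent_usd:.5f}")
print(f"user discount {user_discount:.0f}x vs compute saving {COMPUTE_SAVING}x")
print(f"doubling length multiplies attention cost by {attention_cost_ratio(1000, 2000):.0f}")
```

Dividing ~$1.5M by 49.7 gives roughly $30K, which matches the abstract's "~$0.03M of reuse compute", and the 10x user discount sits below the measured compute saving, as the abstract states.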

2606.13368 2026-06-12 cs.AI cs.CV new submission

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

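Affiliations: University of Science and Technology of China

AI summary: Proposes IterCAD, a multimodal agent framework for closed-loop, interactive CAD generation and editing, optimized with progressive SFT and geometry-aware reinforcement learning; it significantly outperforms existing methods in code executability and geometric precision.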

Abstract

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
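
One way to read the "survivor-bias-free" tolerance-recall idea in this abstract is that cases whose generated code fails to execute stay in the denominator instead of being dropped. The sketch below illustrates that reading under stated assumptions; it is not the paper's definition of CD-TR or AUC-TR, and the function names, the area normalization, and the sample distances are hypothetical.

```python
def tolerance_recall_curve(chamfer_distances, tolerances):
    """Fraction of ALL test cases whose Chamfer distance is within each tolerance.

    `chamfer_distances` has one entry per test case; None marks generated code
    that failed to execute, which can never count as a hit.
    """
    total = len(chamfer_distances)
    curve = []
    for tolerance in tolerances:
        hits = sum(
            1 for d in chamfer_distances if d is not None and d <= tolerance
        )
        curve.append(hits / total)  # failed cases remain in the denominator
    return curve


def normalized_auc(xs, ys):
    """Trapezoidal area under the curve, divided by the tolerance range."""
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


# Hypothetical results: three executable programs and one execution failure.
distances = [0.01, 0.04, None, 0.20]
tolerances = [0.0, 0.05, 0.10, 0.25]

recall = tolerance_recall_curve(distances, tolerances)
auc = normalized_auc(tolerances, recall)
print(recall, round(auc, 3))
```

Because the failed case is never dropped, recall is capped at 0.75 in this toy example no matter how loose the tolerance, which is the property that ties code validity to geometric precision in a single number.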

2606.12485 2026-06-12 cs.LG cs.AI cross-list

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

Affiliations: Beihang University; Institute of Software, Chinese Academy of Sciences; The Hong Kong University of Science and Technology; Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Peking University

AI summary: Proposes the Speculative Rollback Correction (SRC) framework, which uses fixed-horizon branch review and a rollback mechanism to reduce teacher queries while preserving trajectory diversity; on WebArena-Infinity it collects 977 verifier-passing trajectories and 9,183 next-action examples.

Abstract

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.

2606.12634 2026-06-12 cs.LG cs.AI cs.CL cross-list

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

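Affiliations: Amazon Web Services

AI summary: Addresses the coarse trajectory-level advantage signal in long-horizon tool-use reinforcement learning by proposing Sibling-Guided Credit Distillation (SGCD), which dynamically samples successful and failed rollouts and has an external LLM contrast them into a stepwise credit reference, yielding dense credit assignment and improved results on AppWorld and τ³-airline.

Comments: 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track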

Abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and τ³-airline, SGCD improves over matched GRPO comparators: AppWorld TGC rises from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 rises from 0.583 to 0.602.
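
The phrase "bounded detached credit weights reshape GRPO token advantages" can be made concrete with a toy sketch. Everything below is an assumption for illustration: the abstract does not give the weight formula, so the mean-normalize-then-clip rule, the bounds, and all function names are hypothetical rather than the SGCD method itself.

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward standardized within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def bounded_credit_weights(divergences, low=0.5, high=2.0):
    """Turn per-token divergences into weights, then clip them to [low, high].

    The weights are plain constants here, standing in for a detached tensor
    that carries no gradient.
    """
    mean_divergence = statistics.fmean(divergences)
    if mean_divergence == 0:
        return [1.0] * len(divergences)
    return [min(high, max(low, d / mean_divergence)) for d in divergences]


def reshape_token_advantages(rollout_advantage, weights):
    """Scale one rollout-level advantage into per-token advantages."""
    return [rollout_advantage * w for w in weights]


# Hypothetical group of four sibling rollouts with verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical teacher/student divergence for the four tokens of rollout 0.
token_divergence = [0.1, 0.1, 0.8, 0.2]
weights = bounded_credit_weights(token_divergence)
token_advantages = reshape_token_advantages(advantages[0], weights)
print(weights, token_advantages)
```

Clipping keeps every weight positive and bounded, so the sign of each token advantage still comes from the verifier outcome; only the magnitude is redistributed across tokens.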

2606.12774 2026-06-12 eess.SY cs.AI cs.CL cs.SY cross-list

Agentic MPC for Semantic Control System Resynthesis

Yuya Miyaoka, Masaki Inoue

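AI summary: Proposes the Agentic MPC framework, which integrates a large language model agent to enable context-aware, semantically adaptive control synthesis, and validates in autonomous-driving scenarios its ability to adjust control according to personal preferences or social context (for example, yielding to emergency vehicles).

Comments: 7 pages, 5 figures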

2606.12830 2026-06-12 cs.CV cs.AI cross-list

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

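Affiliations: Tsinghua University; Virginia Tech; NVIDIA

AI summary: Proposes the PERIA agent, which strengthens the spatial reasoning of VLMs with vision perception and interaction tools and outperforms similar-size baselines by 7.0%-14.8% across 13 benchmarks.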

Abstract

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.13239 2026-06-12 cs.SE cs.AI cs.CL cs.CV cross-list

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

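AI summary: Proposes the COM-as-Action paradigm, which recasts professional software interaction as deterministic program synthesis to address the fragility of GUI agents and the heterogeneity of API agents; builds the ComCADBench benchmark and the self-correcting agent ComActor, which achieves state-of-the-art performance on industrial CAD software.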

Abstract

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

2606.13673 2026-06-12 cs.CV cs.AI cross-list

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

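Affiliations: KAIST; NVIDIA

AI summary: Proposes the SpatialClaw framework, which uses code as the action interface: a stateful Python kernel with perception and geometry primitives lets a VLM agent execute step by step and flexibly compose intermediate results, reaching 59.9% average accuracy on 20 3D/4D spatial reasoning benchmarks, 11.2 points above the most recent spatial agent.

Comments: Project page: https://spatialclaw.github.io/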

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
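
The "stateful Python kernel" idea in this abstract, where each cell runs against everything earlier cells left behind, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the SpatialClaw implementation: the class name, the `frames` variable, and the cell contents are hypothetical, and `exec` is used without any sandboxing.

```python
import contextlib
import io


class StatefulKernel:
    """Runs code cells one at a time against a single persistent namespace."""

    def __init__(self, preloaded=None):
        # Variables placed here are visible to every cell (e.g. input frames).
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)  # illustration only: not sandboxed
        return buffer.getvalue()


# Hypothetical session: the agent writes one cell per step and reads the
# printed output before deciding what to write next.
kernel = StatefulKernel({"frames": ["frame0", "frame1", "frame2"]})
first_output = kernel.run_cell("n = len(frames)\nprint(n)")
second_output = kernel.run_cell("print(n * 2)")  # reuses `n` from the first cell
print(first_output, second_output)
```

Keeping one namespace across cells is what separates this interface from single-pass code execution: the second cell can depend on a value that was only known after the first cell ran.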

2606.04602 2026-06-12 cs.AI revised version

Parthenon Law: A Self-Evolving Legal-Agent Framework

Hejia Geng, Leo Liu

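Affiliations: tapntell.ai

AI summary: Proposes the Parthenon framework, which factors an agent into components such as model, tools, and knowledge and adds an anti-leakage learning loop, so that legal-domain LLM agents can self-evolve from experience and substantially improve performance on legal matters.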

Abstract

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products, yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB, covering 12,510 agent trajectories, shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience, as a firm refines its checklists and playbooks after each matter, without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2603.11395 2026-06-12 cs.LG cs.AI revised version

ARROW: Augmented Replay for RObust World models

Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo

Affiliations: Imam Mohammad Ibn Saud Islamic University (IMSIU); Monash University; University of New South Wales, Sydney; Cerenaut

AI summary: Proposes ARROW, a model-based continual reinforcement learning algorithm whose memory-efficient replay buffer reduces catastrophic forgetting on tasks without shared structure while maintaining comparable forward transfer.

Comments: 36 pages and 11 figures (includes Appendix)

Journal ref: Transactions on Machine Learning Research, 2026

AI中文摘要

持续强化学习挑战智能体在获取新技能的同时保留已学习技能，以提高过去和未来任务的性能。大多数现有方法依赖于无模型方法和重放缓冲区来缓解灾难性遗忘；然而，这些解决方案往往面临显著的可扩展性挑战，因为内存需求大。受神经科学启发，其中大脑将经验重放给预测世界模型而不是直接重放到策略中，我们提出了ARROW（增强重放用于鲁棒世界模型），一种扩展DreamerV3的基于模型的持续RL算法，具有内存高效、分布匹配的重放缓冲区。与标准固定大小的FIFO缓冲区不同，ARROW维护两个互补的缓冲区：一个短期缓冲区用于近期经验，一个长期缓冲区通过智能采样保留任务多样性。我们在两个具有挑战性的持续RL设置中评估了ARROW：无共享结构任务（Atari）和有共享结构任务（Procgen CoinRun变体）。与相同大小的无模型和基于模型的基线方法相比，ARROW在无共享结构任务中表现出显著减少的遗忘，同时保持可比的前向转移。我们的发现突显了基于模型的RL和生物启发方法在持续强化学习中的潜力，值得进一步研究。

英文摘要

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance on both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive world model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
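
The two-buffer idea in this abstract can be sketched in a few lines. This is only an illustration, not ARROW's implementation: the abstract does not say what the "intelligent sampling" rule is, so reservoir sampling is swapped in here as an assumption for the long-term buffer. The class name `DualReplayBuffer`, the capacities, and the half-and-half batch split are all hypothetical.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Hypothetical two-buffer replay: FIFO short-term plus reservoir long-term."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # recent experiences (FIFO)
        self.long = []                             # sample spread over all history
        self.long_capacity = long_capacity
        self.seen = 0                              # total transitions observed
        self.rng = random.Random(seed)

    def add(self, transition):
        self.short.append(transition)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(transition)
        else:
            # Reservoir sampling (assumption): every past transition ends up
            # in the long-term buffer with equal probability.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = transition

    def sample(self, batch_size):
        # Assumed split: half recent, half long-term, drawn with replacement.
        half = batch_size // 2
        recent = [self.rng.choice(self.short) for _ in range(half)]
        old = [self.rng.choice(self.long) for _ in range(batch_size - half)]
        return recent + old
```

Because the long-term buffer is a uniform sample over everything seen so far, transitions from early tasks stay represented after the short-term buffer has rolled past them, which is the property the abstract attributes to the long-term buffer.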

2604.08958 2026-06-12 cs.LG cs.AI cs.RO 版本更新

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

WOMBET：基于世界模型的经验迁移实现鲁棒且样本高效的强化学习

Mintae Kim, Koushil Sreenath

发表机构 * Hybrid Robotics, UC Berkeley（加州大学伯克利分校混合机器人实验室）

AI总结提出WOMBET框架，通过源任务中学习世界模型并生成不确定性惩罚的离线数据，再结合自适应采样进行在线微调，实现鲁棒且样本高效的强化学习迁移。

Comments 13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)

AI中文摘要

机器人领域的强化学习通常受限于数据收集的成本和风险，因此需要从源任务向目标任务进行经验迁移。离线到在线强化学习利用先验数据，但通常假设给定固定数据集，并未解决如何生成可靠数据进行迁移的问题。我们提出基于世界模型的经验迁移（WOMBET）框架，该框架联合生成和利用先验数据。WOMBET在源任务中学习世界模型，并通过不确定性惩罚规划生成离线数据，随后筛选出高回报和低认知不确定性的轨迹。然后，它通过在离线数据和在线数据之间进行自适应采样，在目标任务中进行在线微调，实现了从先验驱动的初始化到任务特定适应的稳定过渡。我们证明了不确定性惩罚目标提供了真实回报的下界，并推导了有限样本误差分解，捕捉了分布不匹配和近似误差。实验上，WOMBET在连续控制基准测试中相比强基线提高了样本效率和最终性能，展示了联合优化数据生成和迁移的益处。

英文摘要

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
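
The three ingredients named in this abstract (an uncertainty-penalized objective, trajectory filtering, and adaptive offline/online sampling) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the penalty form `r - lam * u`, the threshold-based filter, the linear mixing schedule, and all function and parameter names are hypothetical and are not taken from the paper.

```python
def penalized_return(rewards, uncertainties, lam=1.0):
    """Sum of per-step rewards minus lam times per-step model uncertainty."""
    return sum(r - lam * u for r, u in zip(rewards, uncertainties))


def keep_trajectory(rewards, uncertainties, min_return, max_mean_uncertainty):
    """Assumed filter: keep trajectories with high return and low mean uncertainty."""
    mean_u = sum(uncertainties) / len(uncertainties)
    return sum(rewards) >= min_return and mean_u <= max_mean_uncertainty


def offline_fraction(step, total_steps, start=0.9, end=0.1):
    """Assumed linear schedule: share of each batch drawn from offline data."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress
```

Under this schedule, early fine-tuning batches are dominated by the generated offline data and later batches by online target-task data, which mirrors the "prior-driven initialization to task-specific adaptation" transition the abstract describes.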

2606.12721 2026-06-12 cs.AI 新提交

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

心智理论效用：心理化机制的形式化规范

Nikolos Gurney, Stacy Marsella

发表机构 * Institute for Creative Technologies, University of Southern California（南加州大学创意技术研究所）； Khoury College of Computer Sciences, Northeastern University（东北大学库里计算机科学学院）

AI总结提出心智理论效用（ToM-U）框架，通过局部认知世界模型（LEWM）形式化推断他人信念的计算问题，定义结构、推理过程及失败痕迹，区别于贝叶斯心智理论等方法。

AI中文摘要

推断他人的信念需要超越表面信号；需要追踪谁告诉了他们什么、以什么顺序以及有多可信。心智理论效用（ToM-U）在计算分析层面形式化了这一认知状态推断问题，明确了心理化计算的内容和原因，而不承诺算法或神经实现。ToM-U通过构建局部认知世界模型（LEWMs）——表示智能体、状态节点及其之间认知关系的有向类型图——并根据观察到的行为评估离散候选LEWM，直到达到足够的置信度来实现这一点。五个形式定义指定了LEWM结构、包括有序信息访问历史的智能体节点属性、递归心理化的有界增殖机制、三种推理过程以及一个残差函数，该函数捕捉失败心理化尝试留下的结构化痕迹。ToM-U不同于贝叶斯心智理论和相邻的形式化描述，后者预设而非推导信念状态，也不同于模拟理论和理论-理论，后者缺乏认知状态推断的形式化工具。该架构生成关于心理化失败的方向性、可证伪预测，这些预测源于模型的结构属性而非辅助假设，并将ToM-U定位为在目标推断和其他下游社会认知过程之前的领域无关机制。

英文摘要

Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) -- directed typed graphs that represent agents, state nodes, and the epistemic relationships among them -- and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

2606.13405 2026-06-12 cs.AI cs.MA 新提交

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

用于受规管流程自动化的神经符号代理：挑战与研究议程

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

发表机构 * German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心（DFKI））； Saarland University（萨尔大学）

AI总结提出将领域内符号结构（法规、流程模型、合规约束）作为代理核心架构组件，实现合规性内置（compliance-by-construction）以补充护栏监控，并列出神经符号研究挑战。

Comments Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026

2606.13669 2026-06-12 cs.AI 新提交

Agents-K1: Towards Agent-native Knowledge Orchestration

Agents-K1：迈向智能体原生的知识编排

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, Lei Bai

发表机构 * PJLab（上海人工智能实验室）

AI总结提出Agents-K1管道，将原始文档转化为智能体原生科学知识图谱，通过多模态解析器、GRPO训练的4B信息抽取骨干和三源智能体接口，实现科学信息抽取、知识图谱构建和多跳推理。

AI中文摘要

当前基于LLM的研究智能体通过智能体编排取得了进展，但在很大程度上忽视了科学知识编排。现有工作通常将论文简化为摘要、表面提及和扁平化的cites边，忽略了科学推理所必需的关键实体、主张、证据、机制和方法谱系。为此，我们引入了Agents-K1，一个端到端的知识编排管道，将原始文档转换为智能体原生的科学知识图谱。Agents-K1在统一的理论基础下整合了三个组件：一个多模态解析器，其五模块模式捕获整个论文中的实体、多模态证据、引用和类型化实体间关系，而非仅摘要；一个基于GRPO在规则奖励下训练的4B信息抽取骨干；以及一个graphanything CLI，一个统一了网络搜索、多模态图检索和跨文档遍历的三源智能体接口。在此基础上，我们处理了六个学科的246万篇科学论文，生成了Scholar-KG，并发布了其中100万篇论文的子集，完整Scholar-KG可通过下方SCP链接访问。同一管道可扩展到通用领域语料库和符合模式的数据合成。大量实验表明，Agents-K1在科学信息抽取、知识图谱构建和多跳科学推理方面取得了优越性能。

英文摘要

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat cites edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

2604.27960 2026-06-12 cs.AI 版本更新

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

LLMs 作为 ASP 程序员：自我纠正实现任务无关的非单调推理

Adam Ishay, Joohyung Lee

发表机构 * Arizona State University（亚利桑那州立大学）； Samsung Research（三星研究院）

AI总结提出 LLM+ASP 框架，通过自我纠正循环将自然语言转化为回答集程序，实现无需任务特定工程的非单调推理，在多个基准上优于 SMT 方法。

Comments 30 pages

AI中文摘要

近期的大语言模型（LLMs）在推理方面取得了令人瞩目的进展，但仍面临高计算成本、逻辑不一致性以及在高度复杂问题上性能急剧下降等问题。神经符号方法通过将 LLMs 与符号推理器结合来缓解这些问题，但现有方法通常依赖于单调逻辑（如 SMT），无法表示可废止推理——人类认知的重要组成部分。我们提出了“LLM+ASP”框架，该框架将自然语言转化为回答集编程（ASP），一种基于稳定模型语义的非单调形式化方法。与先前需要手动编写知识模块、领域特定提示或仅限于单一问题类别评估的“LLM+ASP”方法不同，我们的框架无需任何每任务工程，并统一适用于多种推理任务。我们的系统利用自动化的自我纠正循环，其中来自 ASP 求解器的结构化反馈能够实现迭代优化。在六个不同基准上的评估表明：（1）稳定模型语义使 LLMs 能够自然地表达默认规则和例外，在非单调任务上显著优于基于 SMT 的替代方法；（2）迭代自我纠正是性能的主要驱动力，有效替代了手工领域知识的需求；（3）紧凑的上下文参考指南显著优于冗长的文档，揭示了“上下文腐烂”现象，即过多上下文会阻碍约束遵循。

英文摘要

Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning -- an essential component of human cognition. We present "LLM+ASP," a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior "LLM+ASP" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; and (3) compact in-context reference guides substantially outperform verbose documentation, revealing a "context rot" phenomenon where excessive context hinders constraint adherence.
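
The self-correction loop in this abstract has a simple control structure, sketched below under stated assumptions. `generate` and `solve` are hypothetical callables standing in for the LLM and the ASP solver; no real solver API is called, and the retry limit and the `(ok, payload)` return convention are illustrative choices rather than the paper's design.

```python
def self_correct(problem, generate, solve, max_rounds=5):
    """Generate a program, run the solver, and feed errors back until it succeeds.

    `generate(problem, feedback)` and `solve(program)` are hypothetical stand-ins
    for the LLM and the ASP solver. `solve` returns (ok, payload): the answer on
    success, or an error message on failure.
    """
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        program = generate(problem, feedback)
        ok, payload = solve(program)
        if ok:
            return payload, round_idx
        feedback = payload  # solver diagnostics guide the next attempt
    return None, max_rounds


# Toy stand-ins: the first draft has a syntax error, the second is accepted.
def toy_generate(problem, feedback):
    if feedback:
        return "flies(X) :- bird(X), not abnormal(X)."
    return "flies(X) :- bird(X) not abnormal(X)"


def toy_solve(program):
    if program.endswith(".") and ", not" in program:
        return True, "flies(tweety)"
    return False, "syntax error: unexpected 'not'"


answer, rounds = self_correct("Does Tweety fly?", toy_generate, toy_solve)
```

The corrected rule, `flies(X) :- bird(X), not abnormal(X).`, is a standard ASP default: birds fly unless they are known to be abnormal, which is the kind of default-with-exception statement the abstract says monotonic logics cannot express.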

2606.04935 2026-06-12 cs.AI 版本更新

What Type of Inference is Active Inference?

主动推理是一种什么类型的推理？

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

发表机构 * Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands（荷兰埃因霍温，埃因霍温理工大学电气工程系）； Lazy Dynamics, Utrecht, the Netherlands（荷兰乌得勒支）

AI总结本文通过变分自由能框架将主动推理中的期望自由能最小化分解为熵校正项和规划校正项，揭示了其推理本质，并在网格世界实验中验证了不同校正项的作用。

AI中文摘要

主动推理将决策视为推理，期望自由能（EFE）统一了目标导向和信息寻求行为。最近的研究表明，EFE最小化可以写成在带有认知先验的生成模型上的变分自由能（VFE）最小化。我们证明了增强模型的VFE可以重写为预测模型的VFE加上显式的熵校正项，从而使EFE贡献透明。然后我们表明，基于EFE的适当规划需要将这些认知校正与规划校正相结合，规划校正将边际推理转化为策略优化，从而得到基于EFE规划的完整变分特征。这澄清了交叉熵规划和完整基于EFE规划所需的校正。相同的熵校正公式导致了基于EFE规划的详细消息传递方案以及更简单的消融。在三个网格世界环境上的实验表明，当观测具有决定性时，规划校正已经有所帮助，而当观测仅具有提示性时，额外的观测侧认知校正最为重要。

英文摘要

Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.

2508.02548 2026-06-12 cs.DB cs.AI 版本更新

The KG-ER Conceptual Schema Language

KG-ER概念模式语言

Enrico Franconi, Benoît Groz, Jan Hidders, Nina Pardal, Sławek Staworko, Jan Van den Bussche, Piotr Wieczorek

发表机构 * Free University of Bozen-Bolzano, Italy（博尔扎诺自由大学，意大利）； Université Paris-Saclay, CNRS, LISN, France（巴黎-萨克雷大学，法国 CNRS LISN）； Birkbeck, University of London, UK（伦敦大学伯克贝克学院，英国）； University of Huddersfield, UK（赫德斯菲尔德大学，英国）； Relational AI, Berkeley, CA, USA（关系AI，美国加州伯克利）； Hasselt University, Hasselt, Belgium（哈塞尔特大学，比利时）； University of Wrocław, Poland（弗罗茨瓦夫大学，波兰）

AI总结提出KG-ER概念模式语言，独立于知识图谱的表示方式描述其结构，并帮助捕获语义。

Comments Published in Proceedings of IRIS-AI (https://iris-ai.org)

2603.11479 2026-06-12 cs.LG cs.AI cs.MA 版本更新

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

波的语法：通过神经符号VLM智能体实现可解释的多变量时间序列事件检测

Sky Chenwei Wan, Yifei Y. Wang, Tianjun Hou, Xiqing Chang, Aymeric Jan

发表机构 * AI Lab, SLB（SLB人工智能实验室）； Télécom Paris, Institut Polytechnique de Paris, France（巴黎电信学院，巴黎高等理工学院，法国）

AI总结提出语言引导的时间序列事件检测（TSED）任务，通过事件逻辑树（ELT）将文本描述转化为结构化时序逻辑，并构建神经符号VLM智能体SELA，实现零/少样本事件检测与可解释推理。

Comments 8 pages (main text), 28 pages total including appendix. 9 figures, 7 tables

AI中文摘要

时间序列事件检测（TSED）旨在定位时间序列数据中具有语义意义的事件，在高风险领域具有关键应用。与统计异常不同，事件通常由自然语言描述定义，且跨多个物理通道具有内部时序逻辑结构。然而，在现实场景中，密集的事件标注成本高昂，使得纯监督学习困难。我们引入了语言引导的TSED，该设置中模型被赋予文本事件描述，并必须在几乎没有标注数据的情况下将其映射到多变量信号中的区间。为了解决这个问题，我们提出了事件逻辑树（ELT），一种知识表示框架，将语言描述转化为信号基元上的结构化时序逻辑。基于ELT，我们提出了SELA，一种神经符号VLM智能体框架，它从信号可视化中迭代地接地基元，并在ELT约束下组合它们，产生事件区间和忠实的树状结构解释。我们进一步发布了跨能源和气候领域的真实世界基准，包含专家知识和标注。实验表明，SELA优于监督微调和现有的零/少样本时间序列推理基线。

英文摘要

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.

2606.13197 2026-06-12 cs.AI 新提交

SAIGuard: 面向LLM多智能体系统主动防御的通信状态模拟

Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang

AI总结提出SAIGuard主动防御框架，通过通信状态模拟检测并净化风险消息，降低攻击成功率并保持系统效用。

AI中文摘要

基于LLM的多智能体系统（MAS）通过智能体间协作解决复杂任务，但其通信驱动的特性也使安全风险能够在智能体间传播并引发系统级故障。现有的MAS防御主要遵循执行后的反应式范式，通过检测和隔离有害智能体，但这可能导致不可逆的损害并降低协作效用。为解决此问题，我们提出一种面向MAS安全的主动防御框架，即模拟感知拦截守卫（SAIGuard）。SAIGuard在MAS交互图上执行通信状态模拟，估计传入消息对局部智能体状态和全局MAS状态的影响，并通过与良性通信模式的重建偏差检测风险消息。SAIGuard不隔离智能体，而是在可疑消息传播到系统之前对其进行净化或重新生成。跨多种拓扑和攻击场景的实验表明，SAIGuard在保持MAS效用的同时降低了攻击成功率，优于反应式防御。

英文摘要

LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive, post-execution paradigm of detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, namely a Simulation-aware Interception Guard (SAIGuard). SAIGuard performs communication-state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before they propagate into the system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.

2606.12835 2026-06-12 cs.MA cs.AI cs.CY cs.NI 交叉投稿

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

智能体互联网：大规模通信、协调与集体智能

Quanyan Zhu

AI总结本文提出智能体互联网（IoAI）愿景，构建异构智能体在云、边缘、设备等环境中发现、协商、通信与协作的开放生态系统，并探讨其架构、机制及关键研究挑战。

AI中文摘要

自主AI智能体的快速涌现正在将人工智能从孤立的模型推理转变为分布式推理、通信和行动系统。本文发展了智能体互联网（IoAI）的愿景：一个开放生态系统，其中异构智能体能够跨云、边缘、设备、组织及信息物理环境相互发现、协商职责、交换上下文、调用工具并执行工作流。我们综合了单智能体AI、多智能体系统、分布式计算、通信网络、博弈论和安全工程的基础，以刻画可扩展智能体生态系统所需的架构和机制。本文考察了智能体部署模型、工作流生命周期、通信协议、互操作层、资源管理挑战和信任架构，并提供了自适应制造和分布式作战协调的案例研究。由此产生的框架突出了可控涌现、语义互操作、安全身份、激励兼容协调、资源感知编排以及大规模自主智能体网络治理等核心研究挑战。

英文摘要

The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

2603.21563 2026-06-12 cs.AI 版本更新

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

多智能体协作的反事实信用策略优化

Zhongyi Li, Wan Tian, Jinju Chen, Huiming Zhang, Yang Liu, Yikun Ban, Fuzhen Zhuang

发表机构 * Beihang University（北航）； Peking University（北京大学）； Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结针对多智能体大语言模型协作中信用分配难题，提出CCPO框架，通过反事实信用估计和验证器锚定的自评估两种分配器，将团队奖励转化为个体学习信号，提升数学推理任务表现。

AI中文摘要

协作式多智能体大语言模型可以通过分解角色来解决复杂的推理任务，但此类系统的强化学习受到信用分配的限制：共享的终端奖励模糊了个体贡献，并可能鼓励搭便车行为。我们引入了反事实信用策略优化（CCPO），这是一个与优化器无关的信用分配层，将团队层面的结果转化为智能体特定的学习信号。CCPO提供了两种互补的分配器。反事实信用通过比较实际团队结果与移除该智能体的反事实结果来估计智能体的边际贡献。验证器锚定的LLM自我评估是一种探索性分配器，它使用受限的自我评估和同伴评估来重新分配信用，同时保持外部验证器结果的主导地位。由此产生的角色特定奖励可以被GRPO风格的更新或其他策略梯度优化器（如GSPO和REINFORCE++）使用。我们在顺序的思考-求解设置中实例化CCPO，并在数学推理基准上评估它。结果表明，显式的信用分配通常能改善双智能体推理，尤其是在MATH500和几个分布外设置中，而增益因模型和数据集而异。

英文摘要

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment methods for converting joint outcomes into agent-specific learning signals. Counterfactual Credit for Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Credit for Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant. Both operate at the reward-construction layer rather than as policy optimizers, producing role-specific rewards or advantages for GRPO, GSPO, or REINFORCE++. We instantiate these credit signals in a sequential Think--Solve setting and evaluate them on mathematical reasoning benchmarks. Results show that explicit credit assignment often improves dual-agent reasoning, especially on MATH500 and several out-of-distribution settings, while gains vary across models and datasets. Our code is available at: https://github.com/bhai114/ccpo.
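
The counterfactual credit rule stated in this abstract (realized joint outcome minus the outcome with one agent removed) can be sketched directly. The function name, the `team_outcome` callable, and the toy outcome table below are hypothetical; the numbers are made up for illustration and are not results from the paper.

```python
def counterfactual_credits(agents, team_outcome):
    """Credit each agent with (realized outcome) - (outcome with that agent removed)."""
    realized = team_outcome(tuple(agents))
    credits = {}
    for agent in agents:
        without = tuple(a for a in agents if a != agent)
        credits[agent] = realized - team_outcome(without)
    return credits


# Toy outcome table (made-up numbers): task score for each subset of roles.
TOY_OUTCOMES = {
    ("thinker", "solver"): 1.0,
    ("solver",): 0.5,
    ("thinker",): 0.0,
    (): 0.0,
}

credits = counterfactual_credits(["thinker", "solver"], TOY_OUTCOMES.get)
```

In this toy case the solver receives credit 1.0 and the thinker 0.5, so an agent that adds nothing to the joint outcome would receive zero credit even though the shared terminal reward is the same for everyone.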

2605.02249 2026-06-12 cs.AI 版本更新

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

多智能体系统中信念修正公设的研究（扩展版）

Michael Thielscher, Tran Cao Son

AI总结研究认知规划中的信念修正问题，将经典AGM信念修正公设推广到多智能体环境，提出广义全交多智能体信念修正算子，并讨论迭代修正公设的推广及事件模型修正算子。

AI中文摘要

我们研究了认知规划中的信念修正问题，即在一个多智能体系统中，当某个智能体获得关于某个状态属性的信念后，所有智能体的信念将如何变化。基于通过单一多智能体Kripke模型表示智能体信念的标准认知规划表示，我们将经典的AGM信念修正公设推广到多智能体环境，旨在为计算作为行动结果的所有智能体信念的动态认知推理框架提供形式化评估。作为满足所有广义AGM公设的简单算子示例，我们提出了广义全交多智能体信念修正。此外，我们定义了迭代修正的标准公设的推广，提出了一个更复杂的基于事件模型的修正算子，并讨论了在Kripke模型上定义能够满足所有迭代多智能体信念修正的广义公设的认知算子时可能存在的问题。

英文摘要

We investigate the belief revision problem in epistemic planning, i.e., what the beliefs of all agents in a multi-agent system will be after one agent comes to believe some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim of providing a formal basis for evaluating dynamic epistemic reasoning frameworks that compute the beliefs of all agents resulting from actions. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event-model-based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

2412.08610 2026-06-12 cs.GT cs.AI cs.CY 版本更新

Competition and Diversity in Generative AI

生成式人工智能中的竞争与多样性

Manish Raghavan

发表机构 * MIT Sloan School of Management & Department of Electrical Engineering and Computer Science（麻省理工学院斯隆管理学院及电气工程与计算机科学系）

AI总结通过博弈论模型和Scattergories游戏实验，研究竞争如何促使生成式AI模型多样化，缓解同质化，并提升社会福利。

AI中文摘要

最近的实验和现实证据表明，使用生成式人工智能会降低所产生内容的多样性。使用相同或相似的AI模型似乎会导致更同质化的行为。我们的工作从观察到存在一股相反方向的推动力开始：竞争。当生产者相互竞争（例如，争夺客户或注意力）时，他们被激励去创造新颖或独特的内容。我们探讨了竞争对内容多样性和整体社会福利的影响。通过一个正式的博弈论模型，我们表明竞争市场会选择多样化的AI模型，从而缓解单一文化。我们进一步表明，一个在孤立环境中表现良好（即根据基准）的生成式AI模型可能在竞争市场中无法提供价值。我们的结果强调了在生成式AI模型输出分布的广度上评估它们的重要性，特别是当它们将被部署在竞争环境中时。我们通过使用语言模型玩Scattergories（一个奖励正确且独特答案的文字游戏）来实证验证我们的结果。总体而言，我们的结果表明，由生成式AI导致的同质化不太可能在竞争市场中持续存在，相反，下游市场的竞争可能会推动AI模型开发的多样化。

英文摘要

Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.
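
The incentive the abstract relies on, where an answer pays off only if it is both correct and unique, can be made concrete with a small scoring sketch. The function name, the one-point payoff, and the case-insensitive matching are assumptions for illustration; the paper's experimental scoring may differ.

```python
from collections import Counter


def scattergories_scores(answers, is_correct):
    """Score 1 for an answer that is correct and given by no other player, else 0."""
    counts = Counter(answer.lower() for answer in answers.values())
    return {
        player: int(is_correct(answer) and counts[answer.lower()] == 1)
        for player, answer in answers.items()
    }


# Toy round: name a fruit starting with "a".
round_answers = {"p1": "Apple", "p2": "apple", "p3": "Avocado", "p4": "Banana"}
scores = scattergories_scores(round_answers, lambda a: a.lower().startswith("a"))
```

Here the two players who both say "apple" score nothing despite being correct, while the player with the less common correct answer scores, which is why identical models competing against each other do poorly in this game.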

2606.13407 2026-06-12 cs.AI 新提交

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

使用元启发式算法优化太阳能管理的电器调度

Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers

发表机构 * Computing Science and Mathematics, University of Stirling（斯特灵大学计算科学与数学学院）

AI总结提出基于迭代局部搜索和模拟退火的元启发式方法，优化电器启动时间以最大化太阳能利用，并处理多天任务溢出问题。

Comments 9 pages; full results and methodology for poster paper accepted to GECCO 2026

DOI: 10.1145/3795101.3805310

AI中文摘要

可再生能源对于满足未来能源需求至关重要；然而，仅在白天发生的太阳能发电通常与家庭消费模式不一致。诸如炊具、洗衣机和烘干机等电器通常根据用户偏好的时间表运行，而不是根据太阳能可用性，这形成了一个调度优化问题。目标是确定最佳电器启动时间，以最大化可再生能源利用，同时最小化用户不便并遵守系统约束。本文提出了一种使用迭代局部搜索（ILS）和模拟退火（SA）的元启发式方法，以优化电器启动时间，同时考虑电器运行持续时间、功耗、逆变器限制、电池荷电状态约束和太阳能发电预测。与大多数现有工作不同，调度扩展到单日之外，以容纳前几天的未完成任务（溢出），确保操作连续性并支持跨多天的顺序操作。实验结果表明，顺序多日调度框架在独家太阳能发电下有效管理系统约束，同时确保用户便利。这些发现也为未来关于不同规模设备投资、投资回报和用户满意度之间的多目标权衡研究提供了机会。

英文摘要

Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limits, battery state-of-charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
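
A stripped-down version of the simulated-annealing part of this approach can be sketched as follows. It is a simplification under loud assumptions: a single 24-hour day, hourly resolution, and a cost equal to the energy that solar cannot cover. The battery, inverter limits, user-inconvenience term, multi-day spillover, and the ILS variant from the abstract are all omitted, and every name and parameter value is hypothetical.

```python
import math
import random


def unmet_energy(starts, appliances, solar):
    """Energy (kWh) not covered by solar, given hourly start times."""
    load = [0.0] * len(solar)
    for start, (duration, power) in zip(starts, appliances):
        for hour in range(start, start + duration):
            load[hour] += power
    return sum(max(0.0, used - gen) for used, gen in zip(load, solar))


def anneal(appliances, solar, iters=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over start hours; returns the best schedule found."""
    rng = random.Random(seed)
    horizon = len(solar)
    starts = [0] * len(appliances)
    cost = unmet_energy(starts, appliances, solar)
    best, best_cost = starts[:], cost
    temp = t0
    for _ in range(iters):
        i = rng.randrange(len(appliances))
        cand = starts[:]
        # Move one appliance to a random start that still fits in the day.
        cand[i] = rng.randint(0, horizon - appliances[i][0])
        cand_cost = unmet_energy(cand, appliances, solar)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_cost <= cost or rng.random() < math.exp((cost - cand_cost) / temp):
            starts, cost = cand, cand_cost
            if cost < best_cost:
                best, best_cost = starts[:], cost
        temp *= cooling
    return best, best_cost


# Toy day: 2 kW of solar from 08:00 to 16:00; appliances are (hours, kW).
solar = [0.0] * 8 + [2.0] * 8 + [0.0] * 8
appliances = [(2, 1.5), (3, 1.0)]
schedule, cost = anneal(appliances, solar)
```

Starting every appliance at midnight costs 6 kWh of non-solar energy in this toy setup; the search moves the start times into the solar window, and avoiding overlap matters because the two loads together exceed the 2 kW of generation.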

2606.12667 2026-06-12 cs.NI cs.AI cs.SY eess.SY 交叉投稿

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

低地球轨道卫星地面站位置的自由布局优化

Grace Ra Kim, Duncan Eddy, Vedant Srinivas, Mykel J. Kochenderfer

AI总结提出SCORE方法，通过两阶段自由布局优化地面站位置，相比差分进化算法减少5倍函数评估次数并提升13%下行吞吐量，相比固定站点方法提升15%总下行量。

Comments 34 pages, 13 figures, 11 tables, Journal of Aerospace Information Systems (JAIS)

AI中文摘要

快速扩展的低地球轨道卫星星座对地面网络的需求日益增加，推动了更高效地面站网络设计的发展。当前方法从预定义位置选择站点，将优化限制在现有基础设施内，从而约束了性能。相比之下，自由布局优化在地球连续空间域上运行，拓宽了搜索空间，允许更高吞吐量的配置，但代价是可能需要部署新的基础设施。在这项工作中，我们引入了SCORE（通过细化与评估的顺序循环优化），一种用于地面站设计的两阶段自由布局方法。SCORE结合了顺序坐标选择与循环细化，以应对全局优化器面临的高维度、非凸性和局部最小值挑战。我们使用Kongsberg卫星服务公司和世界电信协会的位置，将SCORE与差分进化（DE）等一次性方法以及整数规划方法进行了基准测试。在两个商业地球观测星座（Capella Space和ICEYE）和一个合成Walker-Star星座上的测试表明，与DE相比，SCORE收敛所需的函数评估次数最多减少5倍，同时下行吞吐量提升高达13%。与固定站点方法相比，无约束SCORE实现了高达15%的总下行量提升，为灵活布局建立了强大的经验性能基准；受基础设施约束的SCORE在将布局限制在现有光纤和电力基础设施附近的同时，保留了超过92%的增益。我们还探讨了扩建现有站点与部署新站点之间的权衡，为运营星座的未来地面网络设计提供参考。

英文摘要

Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement & Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage the high dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to the vicinity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.

2606.12594 2026-06-12 cs.AI new submission

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

Affiliations: Imperial College London; University of Edinburgh; Nanyang Technological University; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes the Pythagoras-Prover family, comprising autoregressive and diffusion models, which extends verified data through curriculum SFT, dynamic filtering, and Augmented Lean Formalisation (ALF), and surpasses DeepSeek-Prover-V2 on MiniF2F-Test with fewer parameters.

Comments Pythagoras-Prover: Technical Report

Abstract

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.

2606.12883 2026-06-12 cs.AI new submission

The Hidden Power of Scaling Factor in LoRA Optimization

Zicheng Zhang, Haoran Li, Jiaxing Wang, Guoqiang Gong, Anqi Li, Yudong Hu, Ting Xiong, Yurong Gao, Junxing Hu, Zhida Jiang, Yifeng Zhang, Pengzhang Liu, Qixia Jiang

Affiliations: School of Mathematical Sciences, UCAS (University of Chinese Academy of Sciences); School of Mathematical Sciences, NKU (Nankai University); School of Advanced Interdisciplinary Sciences, UCAS

AI summary: Shows that the LoRA scaling factor α and the learning rate play different roles, with α dominating optimization quality; a Signal-Drift framework shows that α amplifies the task signal without increasing the drift ratio, and the proposed LoRA-α framework simplifies hyperparameter search and improves performance.

Abstract

In Low-Rank Adaptation (LoRA), the scaling factor $α$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we show that the scaling factor $α$ and the learning rate function differently, with $α$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $α$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$α$, a minimalist framework that restores $α$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$α$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.

2606.12935 2026-06-12 cs.AI new submission

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su, Tianpei Xie

Affiliations: Amazon; Stanford University; University of Pennsylvania

AI summary: Proposes the MARS stopping rule, which monitors the aggregate vote at intermediate checkpoints and uses an adversarial bound to estimate future vote changes, saving 25-47% of self-consistency tokens while preserving accuracy.

Abstract

Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
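
The stopping idea can be illustrated with a deliberately simplified check. The sketch below is not the MARS rule itself: the function name `leader_is_safe`, the `slack` parameter, and the pessimistic accounting (every trace that switches is assumed to move to the challenger under consideration) are assumptions made for illustration, and the switch probabilities are taken as given rather than learned from trace features.

```python
from collections import Counter


def leader_is_safe(current_answers, switch_probs, slack=0.0):
    """Check whether the current majority answer survives a pessimistic vote shift.

    current_answers: answer currently extracted from each active trace.
    switch_probs: assumed probability that each trace later changes its answer.
    slack: hypothetical extra safety margin, in votes.
    """
    counts = Counter(current_answers)
    leader, leader_votes = counts.most_common(1)[0]
    # Consider every observed challenger plus one unseen answer (None, zero votes).
    challengers = [a for a in counts if a != leader] + [None]
    for challenger in challengers:
        margin = leader_votes - counts.get(challenger, 0)
        expected_loss = 0.0
        for answer, p in zip(current_answers, switch_probs):
            if answer == leader:
                expected_loss += 2.0 * p  # leader loses a vote, challenger gains one
            elif answer != challenger:
                expected_loss += p        # challenger gains a vote
        if margin - expected_loss <= slack:
            return False
    return True
```

With six traces on one answer and two on another, the check passes when every trace has a 0.1 switch probability and fails at 0.5, because the expected worst-case loss then exceeds the four-vote margin.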

2606.13016 2026-06-12 cs.AI new submission

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Qianhui Liu, Weng-Fai Wong

Affiliations: National University of Singapore; Westlake University; Shandong University; Zhejiang University; Agency for Science, Research and Technology (Singapore)

AI summary: Proposes Otters++, which uses the natural signal decay of optoelectronic devices to perform TTFS computation and, with a layer-wise equivalence and a hybrid training method, reaches an average GLUE score of 84.17% at lower energy cost.

Abstract

Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue with Otters++, which turns a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS. Specifically, we use the measured decay of a custom In$_2$O$_3$ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE benchmark, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

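2606.12479 2026-06-12 cs.LG cs.AI cross-list

Two-Layer Linear Auto-Regressive Models Estimate Latent States

Yahya Sattar, Sunmook Choi, Leo Maynard-Zhang, Yassir Jedra, Maryam Fazel, Sarah Dean

AI summary: Proves that two-layer linear auto-regressive models trained by empirical risk minimization approximate Kalman filtering and recover latent state estimates, with finite-sample guarantees.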

Comments ICML 2026

Abstract

Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contribution, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
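
The first insight, that a stable filter recursion can be unrolled into a finite auto-regressive map, can be checked numerically. The sketch below assumes a predictor of the form x_hat <- (A - L C) x_hat + L y. The matrices `A` and `C` and the gain `L` are hand-picked illustrative values (`L` is a stabilizing gain, not a computed Kalman gain), and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np


def recursive_estimate(A, C, L, ys):
    """Run the filter recursion x <- (A - L C) x + L y from a zero initial state."""
    x_hat = np.zeros(A.shape[0])
    for y in ys:
        x_hat = (A - L @ C) @ x_hat + L @ y
    return x_hat


def truncated_ar_estimate(A, C, L, ys, horizon):
    """Approximate the same estimate as a linear map of the last `horizon` observations."""
    M = A - L @ C
    x_hat = np.zeros(A.shape[0])
    power = np.eye(A.shape[0])  # holds M**k
    for k in range(min(horizon, len(ys))):
        x_hat += power @ L @ ys[len(ys) - 1 - k]
        power = power @ M
    return x_hat


A = np.array([[0.9, 0.1], [0.0, 0.8]])   # illustrative state transition
C = np.array([[1.0, 0.0]])               # only the first state is observed
L = np.array([[0.5], [0.1]])             # assumed stabilizing gain
rng = np.random.default_rng(0)
ys = rng.normal(size=(200, 1))           # synthetic observations

full = recursive_estimate(A, C, L, ys)
truncated = truncated_ar_estimate(A, C, L, ys, horizon=40)
truncation_error = np.linalg.norm(full - truncated)
```

Because the eigenvalues of `A - L @ C` lie inside the unit circle here, the terms dropped by the truncation shrink geometrically with the horizon, which is the mechanism behind a bounded truncation error.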

2606.12841 2026-06-12 cs.LG cs.AI cross-list

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; National University of Singapore; University of Science and Technology of China

AI summary: Proposes TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for masked diffusion language models; it locates key coordinates through temporal causal tracing and applies low-rank residual edits, removing facts efficiently while preserving model performance.

Abstract

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.
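
A closed-form, ridge-regularised, low-rank update of the kind described above can be sketched with ordinary ridge regression. This is a generic illustration, not the paper's edit memory: the function name `ridge_edit`, the argument layout, and the choice to solve in the small n-by-n space are assumptions, and the sparsification step and the TIE coordinate selection are omitted.

```python
import numpy as np


def ridge_edit(keys, deltas, lam):
    """Closed-form ridge solution for an additive edit matrix.

    keys:   (d_in, n) matrix whose columns are subject keys.
    deltas: (d_out, n) matrix whose columns are the desired output changes.
    lam:    ridge strength (> 0).
    Returns delta_W of shape (d_out, d_in) minimising
    ||delta_W @ keys - deltas||_F^2 + lam * ||delta_W||_F^2.
    """
    n = keys.shape[1]
    gram = keys.T @ keys + lam * np.eye(n)          # (n, n), small when n << d_in
    return deltas @ np.linalg.solve(gram, keys.T)   # rank at most n
```

The result has rank at most n (the number of facts), and a larger `lam` shrinks the edit toward zero, trading edit strength for less disturbance of unrelated inputs.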

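2606.12921 2026-06-12 cs.LG cs.AI cross-list

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

Affiliations: Ateneo de Manila University; EleutherAI; NaXys, UNamur (University of Namur)

AI summary: Proposes the LoRA-Muon optimizer, which applies Muon's spectral steepest-descent rule to low-rank finetuning and addresses LoRA's sensitivity to initialization and the poor transfer of optimal learning rates across ranks; on TinyShakespeare, a rank-32 run reaches lower validation loss than the dense baseline.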

Comments 20 pages, 4 figures

Abstract

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
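
For readers unfamiliar with spectral steepest descent, the dense version of the rule can be sketched in a few lines. This shows only the full-rank update direction computed with an explicit SVD; it does not reproduce the low-rank derivation or the split weight-decay rule from the abstract, and the function name is hypothetical. Practical implementations usually approximate this matrix iteratively rather than with an SVD.

```python
import numpy as np


def spectral_steepest_direction(grad):
    """Return U @ V^T from the reduced SVD of grad.

    Among all matrices with spectral norm <= 1, this one has the largest
    inner product with grad; that inner product equals grad's nuclear norm.
    """
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt
```

All singular values of the returned matrix equal one, so every gradient direction is stepped along with the same magnitude regardless of how large its singular value was.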

2606.13024 2026-06-12 cs.LG cs.AI cross-list

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

Affiliations: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University

AI summary: Proposes CausalMoE, a billion-scale multimodal Granger causal foundation model that decouples dynamic regimes with a pattern-routed mixture of heterogeneous experts and combines causality-aware self-attention with LLM/VLM priors to recover sparse causal graphs, achieving state-of-the-art results in supervised and few-shot settings.

Abstract

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.

2606.13081 2026-06-12 cs.LG cs.AI cross-list

Emotional regulation improves deep learning-based image classification

Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

Affiliations: Mare Group; NOVA LINCS; Institute of Engineering (ISE), University of Algarve; Department of Energy Technologies and Renewable Sources, ENEA Casaccia Research Center

AI summary: Proposes an emotional regulation framework that models emotion in deep learning through artificial subjective experience; pretraining ResNet and ViT for image classification, it outperforms existing methods on CIFAR-10/100 and sets a new reference point for emotion-augmented deep learning.

Different Layers, Different Manifolds: Module-wise Weight-Space Geometry in Transformer Optimization

Kirato Yoshihara

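Affiliations: School of Engineering Science, The University of Osaka

AI summary: Studies whether different Transformer modules prefer different manifold geometries and proposes assigning Stiefel constraints to attention layers and DGram constraints to MLP layers, which gives the best performance in GPT-2 pretraining.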

Comments Accepted at WSS @ ICML 2026, code is available at https://github.com/kiratoyoshihara/module-wise-manifold-muon

Abstract

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.

2606.13285 2026-06-12 cs.LG cs.AI cross-list

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

Beinan Xu, Andy Song, Jiti Gao, Feng Liu

Affiliations: RMIT University; Monash University; University of Adelaide

AI summary: Proposes the Equilibrium State Estimation (ESE) paradigm, which estimates the equilibrium state of multiple systems in a single forward pass and generates forecasts from the state difference, achieving a 10-70x speedup while maintaining accuracy, with linear time complexity and robustness.

Comments Accepted by ICML 2026

Abstract

We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.

2606.13311 2026-06-12 cs.LG cs.AI cross-list

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Yongmin Kim, ByeongHoon Jeon, Sungil Kim

Affiliations: Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)

AI summary: Proposes the RGFiLM module, which uses a rarity gate to regulate the strength of context modulation, addressing the high false-alarm rate caused by rare contexts in contextual anomaly detection, and achieves the best F1-FPR trade-off in maritime trajectory anomaly detection.

Abstract

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1-False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
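
One way to picture rarity-gated modulation is as a blend between FiLM-modulated features and the unmodulated ones. The sketch below is an assumption-laden illustration, not the RGFiLM module: the frequency-based `rarity_score`, the sigmoid gate, and the `sharpness` and `midpoint` hyperparameters are all hypothetical, and it reads "more decisive under rare contexts" as a gate that approaches one when the context is rare.

```python
import numpy as np


def rarity_score(context_bin, bin_counts):
    """Rarity in [0, 1]: one minus the empirical frequency of the context bin."""
    return 1.0 - bin_counts[context_bin] / bin_counts.sum()


def rarity_gated_film(hidden, gamma, beta, rarity, sharpness=10.0, midpoint=0.5):
    """Blend FiLM-modulated features with the unmodulated ones.

    gate -> 1 for rare contexts (full modulation), gate -> 0 for frequent ones.
    sharpness and midpoint are hypothetical gate hyperparameters.
    """
    gate = 1.0 / (1.0 + np.exp(-sharpness * (rarity - midpoint)))
    modulated = gamma * hidden + beta  # standard FiLM: scale and shift
    return gate * modulated + (1.0 - gate) * hidden
```

In a full model, `gamma` and `beta` would be produced by a network conditioned on the context variables; here they are plain arrays so the gating behavior can be inspected in isolation.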

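2606.13400 2026-06-12 cs.LG cs.AI cs.RO cross-list

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

Affiliations: University of Science and Technology of China

AI summary: Proposes PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics; its discrete-time flow formulation and projection-free architecture eliminate discretization error and strictly satisfy arbitrary polyhedral constraints, achieving zero constraint violation and lower inference latency in planning and control tasks.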

Comments 30 pages, 12 figures, Accepted to ICML 2026

Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.

2606.13473 2026-06-12 cs.LG cs.AI cs.CL cross-list

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

Affiliations: MiniMax; The Chinese University of Hong Kong; Fudan University; Peking University; Tsinghua University

AI summary: Proposes the MaxProof framework, which combines generative-verifier reinforcement learning with population-level test-time scaling to achieve competition-level mathematical proving on the MiniMax-M3 series, exceeding the human gold-medal threshold on IMO 2025 and USAMO 2026.

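2606.13486 2026-06-12 cs.LG cs.AI cross-list

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

William Smits

Affiliations: Avathon

AI summary: Proposes CRAFTIIF, an unsupervised framework that detects four anomaly types (point, distributional, temporal, and collective) simultaneously using four wavelet feature families and five Isolation Forests, reaching a mean F1 of 0.228 on the mTSBench benchmark and improving VUS-PR by 40.7% over the previous best.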

Comments 14 pages, 4 figures, 2 appendices. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE). Code: https://github.com/smitswil/craftiif

Abstract

Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: https://github.com/smitswil/craftiif
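
The MAD half of an adaptive threshold can be sketched as a robust cut-off on anomaly scores. This is a standard median-plus-scaled-MAD rule, not the paper's Otsu/MAD calibration: the function name, the sensitivity `k`, and the example scores are assumptions, and the scores are presumed to come from an Isolation Forest run elsewhere.

```python
import numpy as np


def mad_threshold(scores, k=3.0):
    """Robust cut-off: median + k * scaled MAD of the anomaly scores.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed scores; k is a hypothetical sensitivity setting.
    """
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    mad = np.median(np.abs(scores - median))
    return median + k * 1.4826 * mad


scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.90]  # illustrative values only
threshold = mad_threshold(scores)
flagged = [s for s in scores if s > threshold]
```

Because the median and MAD ignore extreme values, a single large score does not inflate the cut-off the way a mean-and-standard-deviation rule would, which is one reason such thresholds adapt across very different anomaly rates.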


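2606.13571 2026-06-12 cs.LG cs.AI cross-list

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang

Affiliations: Ant International

AI summary: Proposes Timeflies, a framework that jointly models whether a future observation will occur (existence) and what its value will be, coupling an observation stream and a value stream through dedicated modules to improve forecasting on time series with missing values.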


Abstract

Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available in https://github.com/ant-intl/Timeflies.


2606.13603 2026-06-12 cs.LG cs.AI cs.CL cross-list

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

Affiliations: CLCG, University of Groningen; University of Milano-Bicocca; University of Trieste; Khoury College of Computer Sciences, Northeastern University

AI summary: Estimates the causal importance of chain-of-thought steps via early exit and finds a "commitment boundary" where reasoning shifts from transient guesses to a stable answer; the steps that follow are epiphenomenal, so exiting early at the boundary shortens reasoning by up to 55% with negligible impact on performance.


Abstract

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a "commitment boundary" -- a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by "epiphenomenal" CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs by up to 55% on average with negligible impact on model performance.
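One way to picture the commitment boundary is as the first step after which the early-exit probability of the final answer never drops again. The sketch below assumes we already have, for each reasoning step, the probability the model assigns to its eventual answer when forced to exit after that step. The `0.9` threshold and both function names are illustrative, not taken from the paper.

```python
def commitment_boundary(answer_probs, threshold=0.9):
    """Return the index of the first step from which the final-answer
    probability stays at or above `threshold` until the end, or None."""
    boundary = None
    for i, p in enumerate(answer_probs):
        if p >= threshold:
            if boundary is None:
                boundary = i
        else:
            boundary = None  # a later dip means the answer was not yet stable
    return boundary


def epiphenomenal_fraction(answer_probs, threshold=0.9):
    """Fraction of steps that come after the boundary and could be skipped."""
    b = commitment_boundary(answer_probs, threshold)
    if b is None:
        return 0.0
    return (len(answer_probs) - 1 - b) / len(answer_probs)
```

Requiring the probability to stay above the threshold, rather than merely touch it once, is what separates a stable answer from a transient guess in this toy version.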


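2604.16689 2026-06-12 cs.AI replacement

PlaceRep: Geospatial Place Representation Learning from Large-Scale Point-of-Interest Data

Mohammad Hashemi, Hossein Amiri, Andreas Zufle

Affiliations: Emory University

AI summary: Proposes PlaceRep, which builds place-level representations by clustering spatially and semantically related POIs and, without any pre-training, efficiently produces multi-scale urban region embeddings; it outperforms most existing graph-based methods on population density estimation and housing price prediction with up to a 100x speedup.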


Abstract

Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments using population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.


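2507.05019 2026-06-12 cs.LG cs.AI replacement

Decentralized Autoregressive Generation

Stepan Maschan, Haoxuan Qu, Jun Liu

Affiliations: Lancaster University

AI summary: Shows, within the discrete flow matching framework, that decentralized training is theoretically equivalent to centralized training, and validates experimentally that it remains competitive on multimodal benchmarks.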

2601.06572 2026-06-12 cs.LG cs.AI replacement

Hellinger Multimodal Variational Autoencoders

Huyen Vo, Isabel Valera

Affiliations: Department of Computer Science, Saarland University; MPI-SWS, Saarland Informatics Campus

AI summary: Proposes HELVAE, built on a Hellinger-distance-based moment-matching approximation that avoids sub-sampling and achieves a better trade-off between generative coherence and quality in multimodal variational autoencoders.

Comments Accepted at AISTATS 2026. Camera-ready version


Abstract

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
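For background on the distance that gives the method its name, the squared Hellinger distance between two univariate Gaussians has a closed form. The sketch below computes that standard quantity only; it is not the paper's pooling rule or its moment-matching derivation, and the function name is illustrative.

```python
import math


def hellinger_sq_gauss(mu1, sigma1, mu2, sigma2):
    """Squared Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Bhattacharyya coefficient: overlap between the two densities, in (0, 1].
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -((mu1 - mu2) ** 2) / (4.0 * var_sum)
    )
    return 1.0 - bc
```

The value is 0 for identical distributions, approaches 1 as they separate, and is symmetric in its arguments, which fits the abstract's remark that $\alpha=0.5$ is the symmetric member of the family.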


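2601.22594 2026-06-12 cs.CL cs.AI replacement

Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Affiliations: Stanford University

AI summary: Shows empirically that MLP neurons are as sparse a feature basis as sparse autoencoders and, building on this, develops an end-to-end gradient-based attribution pipeline that surfaces causally effective neuron circuits across several tasks.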

Comments ICML Spotlight, camera-ready


Abstract

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.


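2603.02234 2026-06-12 cs.LG cs.AI replacement

Variational Learning for Insertion-Based Generation

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

Affiliations: University of Cambridge

AI summary: Proposes the Insertion Process (IP) model, which jointly learns insertion position, content, and termination through permutation-based variational inference, supporting variable-length generation and improving the quality of non-autoregressive sequence modeling.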


Abstract

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
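The correspondence between insertion trajectories and permutations can be made concrete with a small sketch. It assumes a permutation is given as the order in which final positions are generated, and a trajectory as a list of `(slot, token)` insertions into the growing sequence. These encodings and function names are illustrative and may differ from the paper's formalization.

```python
from bisect import bisect_left, insort


def permutation_to_trajectory(tokens, order):
    """Map a generation order over final positions to insertion steps.

    `order[t]` is the final index of the token inserted at step t. Each step
    is (slot, token): insert `token` at index `slot` of the partial sequence.
    """
    placed = []  # final indices inserted so far, kept sorted
    steps = []
    for pos in order:
        # The slot counts earlier tokens that end up to the left of `pos`.
        slot = bisect_left(placed, pos)
        steps.append((slot, tokens[pos]))
        insort(placed, pos)
    return steps


def replay(steps):
    """Apply insertion steps to an empty sequence and return the result."""
    seq = []
    for slot, token in steps:
        seq.insert(slot, token)
    return seq
```

For a fixed target sequence, each order yields a different slot sequence and replaying any of them reconstructs the same target, which is why a likelihood over trajectories can be rewritten as a sum over permutations.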


2606.11836 2026-06-12 cs.SD cs.AI eess.AS replacement

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

Affiliations: The Chinese University of Hong Kong; National Research Council Canada

AI summary: Proposes a data-free, training-free compression method based on k-means channel clustering that assigns different numbers of parameter clusters to different layers to realize fine-grained mixed-sparsity pruning, significantly reducing WER on HuBERT-large and Whisper-large-v3.

Comments Accepted by Interspeech 2026

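2606.12942 2026-06-12 cs.AI new submission

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

Hao Jiang, Xin Li, Annan Wang, Zhi Yang, Haoxiang Zhang, Yichi Zhang, Weisi Lin

Affiliations: Nanyang Technological University; Peking University; Independent Researcher

AI summary: Targets parse collapse in generative listwise ranking under long-context multimodal settings and proposes PRISMR, which replaces transient in-context list processing with parametric structural conditioning: a lightweight hypernetwork encodes candidates in parallel and generates LoRA weights, substantially reducing parse collapse and improving ranking performance.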


Abstract

Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.
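Parse collapse as described here can be detected mechanically by comparing the decoded ranking with the candidate list. The following is a minimal sketch, assuming the model output has already been parsed into a list of candidate ids. The function name and the returned fields are hypothetical, not the paper's evaluation code.

```python
def check_ranking(candidate_ids, ranked_ids):
    """Compare a decoded ranking against the candidate list.

    Returns omitted, duplicated and unknown ids; the ranking counts as
    complete only when all three lists are empty.
    """
    seen = set()
    duplicated = []
    for r in ranked_ids:
        if r in seen and r not in duplicated:
            duplicated.append(r)
        seen.add(r)
    known = set(candidate_ids)
    omitted = [c for c in candidate_ids if c not in seen]
    unknown = [r for r in dict.fromkeys(ranked_ids) if r not in known]
    return {
        "omitted": omitted,
        "duplicated": duplicated,
        "unknown": unknown,
        "complete": not (omitted or duplicated or unknown),
    }
```

A ranking that stops early shows up as a non-empty `omitted` list even when every emitted id is well formed, which matches the abstract's point that the failure is not a simple formatting mistake.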


2606.13316 2026-06-12 cs.AI new submission

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group

AI summary: Proposes ReSum, a framework in which the LLM compresses and organizes its reasoning trajectory through self-summarization, with summaries triggered adaptively via contrastive evaluation; it improves performance by an average of 4% while reducing reasoning length by 18.6%.

Comments 24 pages, including 13 pages of main text and 11 pages of appendix


Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.


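2606.13550 2026-06-12 cs.AI cs.CL new submission

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

Hoin Jung, Xiaoqian Wang

Affiliations: Elmore Family School of Electrical and Computer Engineering, Purdue University

AI summary: Proposes UMG-RAG, a training-free hybrid retrieval framework that fuses dense and sparse retrieval results across multiple chunk granularities using uncertainty estimates, improving answer quality in long-document question answering.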


Abstract

Retrieval-augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer-bearing evidence and worsen long-context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack the semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
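The idea of estimating reliability from distribution entropy can be sketched as follows. This assumes a softmax turns each score list into a distribution, that confidence is one minus the normalized entropy, and that fusion is a confidence-weighted sum. The abstract does not specify any of these, so every function and the fusion rule here are illustrative.

```python
import math


def softmax(scores):
    """Turn raw retrieval scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(scores):
    """1 - normalized entropy of the softmax (near 1 = peaked, near 0 = flat)."""
    probs = softmax(scores)
    if len(probs) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def fuse(expert_scores):
    """Fuse {expert_name: {chunk_id: raw_score}} into {chunk_id: fused_score}."""
    fused = {}
    for scores_by_chunk in expert_scores.values():
        if not scores_by_chunk:
            continue
        ids = list(scores_by_chunk)
        raw = [scores_by_chunk[i] for i in ids]
        weight = confidence(raw)  # flat score lists contribute almost nothing
        for chunk_id, p in zip(ids, softmax(raw)):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight * p
    return fused
```

An expert that cannot discriminate between candidates for a given query produces a near-uniform distribution and is down-weighted for that query alone, which is one reading of "query-specific" reliability.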


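2606.12590 2026-06-12 cs.CV cs.AI cross-list

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Affiliations: York University; University of British Columbia; Vector Institute; Queen’s University

AI summary: Addresses weaknesses of medical large vision-language models in factual consistency, visual grounding, and clinical alignment with a fine-grained on-policy preference optimization framework that combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective; preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, and experiments on medical imaging and clinical text generation benchmarks support its effectiveness.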


Abstract

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.


2606.12754 2026-06-12 cs.CL cs.AI cross-list

LLMs Can Better Capture Human Judgments--With the Right Prompts

Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

AI summary: Shows that simple prompting strategies let LLMs recover the full distribution of human responses and reduce their sensitivity to wording variations, improving AI-human alignment.


Abstract

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.


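2606.12818 2026-06-12 cs.CL cs.AI cross-list

Localizing Anchoring Pathways in Language Models

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

Affiliations: University of Maryland, College Park

AI summary: Studies the anchoring effect, in which irrelevant numbers in a prompt shift a language model's numerical reasoning; using a logit-difference metric and attribution-based circuit localization, it finds that edge-level methods outperform node-level methods and characterizes how anchoring pathways are shared and how they transfer.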


Abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
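The logit-difference metric can be illustrated with a short sketch. It assumes the logits of the shared answer options are available as a mapping from option label to logit, and that the sign convention is correct minus anchor. The abstract does not state the direction, and the function names are hypothetical.

```python
def anchoring_logit_diff(option_logits, correct_option, anchor_option):
    """Logit of the correct option minus logit of the anchor-matching option.

    Positive values favor the correct answer; negative values indicate the
    model is being pulled toward the anchor.
    """
    return option_logits[correct_option] - option_logits[anchor_option]


def mean_logit_diff(examples):
    """Average the metric over (option_logits, correct, anchor) triples."""
    diffs = [anchoring_logit_diff(l, c, a) for l, c, a in examples]
    return sum(diffs) / len(diffs)
```

A scalar of this form is convenient for circuit localization because attribution methods need a single differentiable target per example.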


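2606.12826 2026-06-12 cs.CV cs.AI cross-list

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction Through Visual Feedback

Animesh Tripathy, Aswanth Krishnan

Affiliations: QpiAI India Pvt. Ltd

AI summary: Proposes Iterative Visual Thinking (IVT), a closed-loop visual-feedback framework with two-phase training (SFT + GRPO) that gives vision-language models the ability to correct their own spatial predictions, improving accuracy by 2.4 to 3.2 percentage points on a mixed benchmark drawn from three datasets.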


Abstract

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
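Both the IoU reward and the Acc@0.5/0.7/0.9 metrics rest on box intersection-over-union. The sketch below assumes boxes in `(x1, y1, x2, y2)` corner format and counts a hit when IoU is at least the threshold. The abstract does not say whether the comparison is strict, and these helper names are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds, targets, tau):
    """Fraction of predictions whose IoU with the target is at least tau."""
    hits = sum(iou(p, t) >= tau for p, t in zip(preds, targets))
    return hits / len(preds)
```

Using the raw IoU as a reward gives a dense signal at every refinement step, whereas the Acc@tau metrics only register whether a prediction clears a fixed bar.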


2606.13171 2026-06-12 cs.CL cs.AI cross-list

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu

Affiliations: Central South University; Tsinghua University; Nanjing University; Suzhou Aerospace Information Research Institute; McGill University

AI summary: Targets two kinds of hallucination in LLM-based news timeline summarization, unfaithful content and information omission, and proposes NTS-CoT, whose three modules (Element-CoT, Date Selection, and Causal-CoT) mitigate hallucinations and outperform existing methods on three benchmarks.


Abstract

The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing work. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. We then propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT.


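2606.13177 2026-06-12 cs.CL cs.AI cs.LG cross-list

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

Affiliations: Korea University; KAIST

AI summary: Proposes MemRefine, which uses an LLM to judge factual content and compresses the memory store to a fixed budget through delete, merge, and preserve operations, maintaining downstream performance on multiple benchmarks and outperforming rule-based baselines.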


Abstract

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. This is especially limiting on resource-constrained platforms with hard memory budgets, which motivates us to formulate storage-budgeted memory management: the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework. Because surface similarity poorly reflects factual value, MemRefine uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
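The propose-then-judge loop can be sketched as below. This is a toy version under stated assumptions: token-overlap (Jaccard) similarity stands in for whatever similarity the paper uses, the budget is a count of entries, and `judge` is a hypothetical callable replacing the LLM call that returns `("delete", entry_to_keep)`, `("merge", merged_text)` or `("preserve", None)`.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-overlap similarity, a stand-in for an embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def compress(memory, budget, judge):
    """Shrink `memory` (a list of strings) toward at most `budget` entries."""
    memory = list(memory)
    preserved = set()  # pairs the judge has already decided to keep apart
    while len(memory) > budget:
        pairs = [
            (jaccard(memory[i], memory[j]), i, j)
            for i, j in combinations(range(len(memory)), 2)
            if frozenset((memory[i], memory[j])) not in preserved
        ]
        if not pairs:
            break  # nothing left that the judge is willing to compress
        _, i, j = max(pairs)  # similarity only proposes the candidate pair
        action, text = judge(memory[i], memory[j])
        if action == "preserve":
            preserved.add(frozenset((memory[i], memory[j])))
            continue
        # "delete" keeps one existing entry and "merge" writes a new one;
        # either way the pair is replaced by the string the judge returned.
        memory = [m for k, m in enumerate(memory) if k not in (i, j)]
        memory.append(text)
    return memory
```

Unlike the method described in the abstract, which iterates until the budget is met, this sketch stops early if the judge preserves every remaining pair, so a real system would need a fallback for that case.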

2606.13288 2026-06-12 cs.CV cs.AI cs.CL cross-list

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Wei Li, Zhen Huang, Xinmei Tian

Affiliations * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Independent Researcher

AI summary: Proposes MACCO, a framework that masks compositional concepts in one modality and reconstructs them from the full context of the other, strengthening the compositional understanding of vision-language models, with significant gains on five benchmarks.

Comments Accepted to ACL 2026 Main Conference, 25 pages

Abstract

Contrastively trained vision-language models (VLMs) such as CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

2606.13289 2026-06-12 cs.CV cs.AI cross-list

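Adaptive Turn-Taking: Toward Real-Time Multi-Party Voice Agents

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI summary: Proposes ModeratorLM, a role-conditioned speech LLM that uses chunked streaming and chained reasoning to achieve adaptive turn-taking in multi-party conversations, significantly improving turn-taking precision and recall.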

Comments Accepted for publication at Interspeech 2026

2606.13572 2026-06-12 cs.CL cs.AI cross-list

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Affiliations * Indian Institute of Technology Patna; Indian Institute of Technology Kanpur; Prasannadeb Women’s College

AI summary: To address the weak performance of multimodal LLMs in Indic-language medical settings, proposes the multimodal medical question-answering dataset ArogyaBodha and the actor-critic-based multi-agent framework ArogyaSutra, which improves multilingual medical reasoning accuracy through tool grounding and dual-memory mechanisms.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

2606.13580 2026-06-12 cs.CV cs.AI cross-list

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

Affiliations * MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China; Midea Group

AI summary: Proposes EvTexture++, the first event-driven texture enhancement framework for video super-resolution, which uses high-frequency spatiotemporal detail from events to recover texture progressively and a temporal texture alignment module to improve inter-frame consistency, reaching state-of-the-art performance on multiple datasets.

Comments IEEE TPAMI 2026. Extended version of arXiv:2406.13457 (ICML 2024). Project page: https://dachunkai.github.io/evtexture-project-page/

DOI: 10.1109/TPAMI.2026.3660020
Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6642-6659, June 2026

Abstract

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.

2606.13680 2026-06-12 cs.CL cs.AI cross-list

HalluJudge: Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu

Affiliations * Monash University, Australia; The University of Melbourne, Australia; Atlassian, USA

AI summary: Proposes HalluJudge, a reference-free hallucination detection method that assesses the grounding of generated comments through context alignment using a multi-branch reasoning strategy; it reaches an F1 of 0.85 at a cost of $0.009 and agrees with developer preference in 67% of cases.

Comments Accepted at FSE'26: Industry Track, Full-Length, Peer-Reviewed

DOI: 10.1145/3803437.3805236

Abstract

Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code, which poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable reference-free methods for hallucination detection in LLM-generated code review comments. In this work, we design HalluJudge, which assesses the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgments and developer preferences for actual LLM-generated code review comments in real-world production. Our results show that hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of HalluJudge assessments align with developer preferences for actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

2601.19827 2026-06-12 cs.CL cs.AI cs.IR version update

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

Affiliations * Faculty of Engineering, McMaster University, Canada; BASF Canada Inc., Canada

AI summary: Using a chemistry multi-hop question-answering dataset, a diagnostic study finds that iterative retrieval-reasoning loops significantly outperform the static RAG upper bound in a scientific domain, and identifies the advantages and failure modes of staged retrieval.

Comments 51 pages, 29 figures

Abstract

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
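
The Iterative RAG regime named in this abstract (alternate retrieval and hypothesis refinement, then stop once the evidence supports an answer) might look like the minimal sketch below. Everything here is an assumption for illustration rather than the paper's controller: the toy knowledge base `KB`, its placeholder entities, and the helper names `retrieve`, `propose_query`, and `answer_if_ready` are hypothetical, and in practice an LLM would generate the queries and make the stopping decision.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical two-hop knowledge base with placeholder entities.
KB = {
    "compound_x": "compound_x is made from precursor_y",
    "precursor_y": "precursor_y is supplied by vendor_z",
}


def retrieve(query: str) -> List[str]:
    """Toy retriever: exact-key lookup in the knowledge base."""
    return [KB[query]] if query in KB else []


def propose_query(question: str, evidence: List[str]) -> str:
    """Toy query refinement: start from the anchor entity, then follow the last entity seen."""
    if not evidence:
        return "compound_x"
    return evidence[-1].split()[-1]


def answer_if_ready(question: str, evidence: List[str]) -> Optional[str]:
    """Toy evidence-aware stopping rule: answer only once a 'supplied by' fact is in hand."""
    for passage in evidence:
        if "supplied by" in passage:
            return passage.split()[-1]
    return None


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    propose_query: Callable[[str, List[str]], str],
    answer_if_ready: Callable[[str, List[str]], Optional[str]],
    max_hops: int = 4,
) -> Tuple[Optional[str], List[str]]:
    """Alternate retrieval and refinement until the stopping rule fires or hops run out."""
    evidence: List[str] = []
    for _ in range(max_hops):
        query = propose_query(question, evidence)
        for passage in retrieve(query):
            if passage not in evidence:
                evidence.append(passage)
        answer = answer_if_ready(question, evidence)
        if answer is not None:
            return answer, evidence
    return None, evidence


answer, evidence = iterative_rag(
    "Who supplies the precursor of compound_x?",
    retrieve,
    propose_query,
    answer_if_ready,
)
```

In this toy run the second query depends on what the first hop retrieved, which is the staged behavior the abstract contrasts with supplying all evidence at once.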

2602.00462 2026-06-12 cs.CV cs.AI version update

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

Affiliations * University of Cambridge

AI summary: Proposes LatentLens, which interprets visual tokens by nearest-neighbor matching against contextualized token representations from a text corpus, and finds that most visual tokens are interpretable at every layer.

Comments ICML 2026 (Camera Ready)

Abstract

Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
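
The lookup step described in this abstract (compare a visual token representation with stored contextualized text-token representations and return the nearest ones as its description) can be illustrated with a small sketch. This is not the LatentLens code: cosine similarity is an assumed metric, the three-dimensional vectors are made up, and `cosine`, `describe`, and `bank` are hypothetical names; a real bank would hold representations extracted from an LLM layer.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def describe(
    visual_vec: Sequence[float],
    bank: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the contexts of the k stored token representations closest to `visual_vec`."""
    ranked = sorted(bank, key=lambda item: cosine(visual_vec, item[1]), reverse=True)
    return [context for context, _ in ranked[:k]]


# Hypothetical bank: each entry pairs a token in its sentence context (token in brackets)
# with a made-up contextualized representation of that token.
bank = [
    ("a red [apple] on a table", [0.9, 0.1, 0.0]),
    ("the [sky] at noon", [0.0, 0.2, 0.9]),
    ("ripe [fruit] in a bowl", [0.8, 0.3, 0.1]),
]

visual_token = [1.0, 0.2, 0.0]  # made-up representation of one visual token
neighbors = describe(visual_token, bank, k=2)
```

Returning the token together with its sentence context, rather than a bare vocabulary item, reflects the abstract's claim that such descriptions are more fine-grained than individual tokens.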

2602.07106 2026-06-12 cs.CV cs.AI cs.CL version update

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

Affiliations * The Chinese University of Hong Kong, Shenzhen; LIGHTSPEED; Independent Researcher

AI summary: Proposes Ex-Omni, which decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, and introduces a unified token-as-query gated fusion mechanism, enabling omni-modal LLMs to generate speech and 3D facial animation in sync.

Abstract

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG version update

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

Affiliations * University of Science and Technology of China

AI summary: Proposes InnoEval, a framework that uses heterogeneous deep knowledge retrieval and a multi-perspective review board for knowledge-grounded, multi-dimensional decoupled evaluation, outperforming baselines on point-wise, pair-wise, and group-wise evaluation tasks.

Comments ICML 2026

Abstract

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

2603.06652 2026-06-12 cs.CV cs.AI version update

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

Affiliations * National Key Laboratory for Novel Software Technology, Nanjing University; Data Science & Artificial Intelligence Research Institute, China Unicom; Unicom Data Intelligence, China Unicom

AI summary: Proposes PaLMR, a framework that reduces reasoning hallucinations and improves visual reasoning fidelity through a perception-aligned data layer and a process-aligned optimisation layer, achieving state-of-the-art results on multiple benchmarks.

Journal ref: CVPR 2026 Findings

Abstract

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

2604.24079 2026-06-12 cs.CL cs.AI version update

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim

Affiliations * Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea; Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; Van Lang University, Ho Chi Minh City, Vietnam

AI summary: Proposes a bridging-inference framework that builds discourse-level knowledge graphs to capture implicit semantic links in LLM dialogue, discovering stable persona traits at the level of discourse coherence and outperforming frequency- or style-based baselines.

Comments 15 pages, 4 figures, accepted to ICPR 2026

Abstract

Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo-Yang/Persona_Bridging.git

2605.16713 2026-06-12 cs.CV cs.AI version update

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

Affiliations * Harvard AI and Robotics Lab; Kempner Institute for the Study of Natural and Artificial Intelligence; Harvard University

AI summary: GeoWorld-VLM transfers geometric structure from a frozen camera-conditioned video world model into vision-language models to improve spatial-relation reasoning; experiments show gains of about 4% on two different architectures.

Abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

2605.22641 2026-06-12 cs.CL cs.AI cs.LG version update

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

Víctor Yeste, Paolo Rosso

Affiliations * PRHLT Research Center, Universitat Politècnica de València, Spain; School of Science, Engineering and Design, Universidad Europea de Valencia, Spain; Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)

AI summary: Systematically compares the effects of context scope, retrieval-augmented moral knowledge, and model scale on Schwartz value detection in political texts; full-document context and retrieved knowledge help supervised encoders but offer limited help to zero-shot LLMs, and scaling models up does not guarantee gains.

Comments Code: https://github.com/VictorMYeste/human-value-detection-context-rag, best model: https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag, 18 pages, 3 figures

Abstract

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

2606.10716 2026-06-12 cs.CL cs.AI version update

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

Affiliations * Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University; DD-AIM, Senior Machine Learning Researcher

AI summary: Proposes an attention expansion mechanism that augments PLM contextual representations with pre-trained word embeddings, extending the effective context range without added computational cost and significantly improving long-document keyphrase extraction.

Abstract

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

2606.11792 2026-06-12 cs.CV cs.AI cs.CL Updated version

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

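Affiliations: Zhejiang University; Sun Yat-sen University; East China Normal University

AI summary: Proposes MultiToP, a framework in which a lightweight visual token patcher dynamically replaces unreliable visual tokens, trained with information-guided rank calibration and sparsity regularization. It reduces hallucinations in video large multimodal models without modifying the original model and improves F1 score and question-answering accuracy.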

Comments Preprint

Abstract

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.12616 2026-06-12 cs.AI cs.CL New submission

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

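Affiliations: University of California, Irvine

AI summary: Proposes PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved style-instructed human driving demonstrations, producing diverse non-ego agent behavior in closed-loop simulation without per-style retraining.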

Abstract

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

2606.12550 2026-06-12 cs.RO cs.AI Cross-list

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

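Affiliations: UT Austin; FieldAI

AI summary: Proposes Foresight, a framework in which a fine-tuned VLM alternates between proposing and critiquing image-space motion plans and is post-trained with reinforcement learning against a reward model learned from human feedback. It enables iterative motion refinement for mapless navigation from sparse language instructions and improves average task success by 37%.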

Comments 22 pages, 10 figures, 3 tables

Abstract

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight

2606.12603 2026-06-12 cs.RO cs.AI Cross-list

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

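Affiliations: University of California, Los Angeles

AI summary: Proposes FlowPilot, a mapless navigation policy that uses only a monocular RGB camera, is pre-trained with anchored flow matching, and is aligned through human preference learning to improve robustness and social compliance in long-horizon sidewalk navigation.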

Abstract

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

2606.12690 2026-06-12 cs.RO cs.AI Cross-list

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

Xin Zhou, Cong Miao

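Affiliations: Astronex Robotics; Nanjing University of Information Science and Technology

AI summary: Proposes EWAM, an architecture built on a frozen Cosmos3 backbone that adapts online in a zero-shot setting through four lightweight neural layers, with no fine-tuning or extra demonstration data, aiming to reduce the deployment data needed for new task layouts.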

Abstract

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.

2606.12814 2026-06-12 cs.RO cs.AI Cross-list

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

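Affiliations: Southern University of Science and Technology

AI summary: Proposes Stubborn, a framework that unifies humanoid motion tracking and fall recovery through an asymmetric Actor-Critic architecture, a yaw-aligned representation, a Bernoulli-based probabilistic termination mechanism, and an adaptive sampling strategy, achieving competitive performance against state-of-the-art methods.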

Abstract

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that lets the policy explore fall-recovery behaviors under varying failure modes. Third, we propose an adaptive sampling strategy, driven by probabilistic termination and tracking error, that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with state-of-the-art methods and ablation studies show that Stubborn achieves competitive performance, and that the proposed probabilistic termination mechanism and adaptive sampling strategy contribute to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
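To make the probabilistic termination idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `should_terminate` and the parameter `p_terminate` are hypothetical, and the abstract does not say how the termination probability is chosen. The sketch only shows the general pattern of ending a failed episode with a Bernoulli draw instead of always ending it.

```python
import random


def should_terminate(tracking_failed: bool, p_terminate: float, rng: random.Random) -> bool:
    """Decide whether to end the episode after a tracking check.

    Hypothetical rule: a severe tracking failure ends the episode only with
    probability p_terminate, so some rollouts continue from fallen states.
    """
    if not tracking_failed:
        return False
    # Bernoulli draw: True with probability p_terminate.
    return rng.random() < p_terminate


rng = random.Random(0)
# Empirical fraction of failed steps that end the episode at p_terminate = 0.3.
draws = [should_terminate(True, 0.3, rng) for _ in range(10_000)]
print(sum(draws) / len(draws))
```

With `p_terminate = 1.0` this reduces to the usual hard termination on failure; lower values keep more episodes alive in unstable states, which is the exploration effect the abstract describes.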

2606.13097 2026-06-12 cs.PL cs.AI Cross-list

Real-Time Execution with Autoregressive Policies

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

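Affiliations: Korea Institute of Science and Technology; Seoul National University; Google Research

AI summary: Achieves real-time execution for autoregressive policies through asynchronous inference and constrained decoding, guaranteeing strict latency bounds while speeding up task completion; experiments show it outperforms a comparable flow-matching policy.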

Abstract

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though real-time execution is more critical for autoregressive policies given their slower rollout speed in synchronous inference. We demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while completing tasks significantly faster than under synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.13503 2026-06-12 cs.CV cs.AI cs.RO Cross-list

UniDexTok: A Unified Dexterous-Hand Tokenizer Grounded in Real Data

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

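Affiliations: Fudan University; Hefei University of Technology; Rimbot; Beijing University of Posts and Telecommunications

AI summary: Proposes the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface, and builds on it UniDexTok, a retargeting-free state tokenizer that learns discrete tokens from real joint states, giving heterogeneous dexterous hands a unified representation and reducing error by more than 98%.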

Abstract

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
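The two percentage reductions quoted above follow directly from the before and after error values in the abstract. A short Python check, where the helper name `relative_reduction` is ours and only the four error figures come from the abstract:

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage reduction when an error drops from `before` to `after`."""
    return (before - after) / before * 100.0


# Error values as reported in the abstract (UniHM baseline vs. UniDexTok).
mpjae = relative_reduction(15.63, 0.16)  # degrees
mpjpe = relative_reduction(18.51, 0.18)  # millimetres

print(f"MPJAE reduction: {mpjae:.2f}%")  # 98.98%
print(f"MPJPE reduction: {mpjpe:.2f}%")  # 99.03%
```

Both figures match the 98.98% and 99.03% stated in the abstract.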

2606.11092 2026-06-12 cs.RO cs.AI Updated version

(Human) Attention Is (Still) All You Need: Human Oversight Makes AI-Assisted Social Science Reliable

Chen Zhu, Xiaolu Wang, Weilong Zhang

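Affiliations: China Agricultural University; University of Cambridge

AI summary: Proposes HLER, a human-in-the-loop decision architecture built on pre-commitment, decision sequencing, accountability, and attention allocation, which lowers the critical-failure rate of AI-assisted research runs from 72% to 16%.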

Abstract

Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2*4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Frechet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.

2606.12900 2026-06-12 cs.AI cs.CL cs.LG New submission

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

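AI summary: Proposes the HCPD paradigm, whose human-like criteria probing mechanism emulates the multi-faceted reasoning of human evaluators and, combined with reward-based alignment and multi-sample aggregation, yields effective and interpretable hallucination detection under the zero-source constraint.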

Comments Accepted at ICML 2026

Abstract

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

2606.13282 2026-06-12 cs.AI New submission

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Pratyush Chaudhari

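Affiliations: Pratyush Chaudhari

AI summary: Proposes the Ethical Robustness Testing System (ERTS), which tests the adversarial robustness of AI ethical reasoning using a bounded ethical consequence space, semantic perturbations, and domain-adaptive assessment; only 33% of the evaluated models pass.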

Comments 8 pages, 10 tables

DOI: 10.5281/zenodo.20544025

Abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

2606.13621 2026-06-12 cs.AI cs.CR cs.GT cs.LG cs.MA New submission

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

Achraf Hsain, Sultan Almuhammadi

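Affiliations: Information and Computer Science Department, King Fahd University of Petroleum and Minerals

AI summary: Reinterprets shield synthesis as a design-time analysis tool: a constrained two-player safety game yields a defensibility verdict, which is combined with topology-level metrics and reinforcement-learning behavior into a defensibility fingerprint that exposes structural insights about system security.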

Comments 26 pages, 7 figures, 7 tables. Under review at JAIR. Code: https://github.com/AchrafHsain7/Bastion

Abstract

Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
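The attractor computation and winning-region extraction named in the abstract can be illustrated on a toy game. The Python sketch below is the standard textbook attractor construction for a turn-based safety game, not the paper's code: the graph, the state names, and the functions `attacker_attractor` and `synthesize_shield` are all hypothetical. It assumes every state has at least one successor, that edges are not duplicated, and that attacker edges already contain only the moves the attacker specification allows.

```python
from collections import deque


def attacker_attractor(nodes, edges, owner, unsafe):
    """Return the states from which the attacker can force a visit to `unsafe`."""
    preds = {n: [] for n in nodes}
    for src, dsts in edges.items():
        for dst in dsts:
            preds[dst].append(src)
    # For each state, the number of successors not yet known to be losing.
    escapes = {n: len(edges[n]) for n in nodes}
    losing = set(unsafe)
    queue = deque(losing)
    while queue:
        state = queue.popleft()
        for p in preds[state]:
            if p in losing:
                continue
            if owner[p] == "attacker":
                # The attacker needs only one move into the losing set.
                losing.add(p)
                queue.append(p)
            else:
                # The defender loses only when every move leads into the losing set.
                escapes[p] -= 1
                if escapes[p] == 0:
                    losing.add(p)
                    queue.append(p)
    return losing


def synthesize_shield(nodes, edges, owner, unsafe):
    """Return the defender's winning region and the moves the shield allows."""
    winning = set(nodes) - attacker_attractor(nodes, edges, owner, unsafe)
    shield = {s: {t for t in edges[s] if t in winning}
              for s in winning if owner[s] == "defender"}
    return winning, shield


# Hypothetical five-state game; "bad" is the unsafe region.
nodes = ["d0", "a1", "d2", "a3", "bad"]
owner = {"d0": "defender", "a1": "attacker", "d2": "defender",
         "a3": "attacker", "bad": "attacker"}
edges = {"d0": ["a1", "a3"], "a1": ["d2", "d0"], "d2": ["bad"],
         "a3": ["d0"], "bad": ["bad"]}

winning, shield = synthesize_shield(nodes, edges, owner, {"bad"})
print("defensible from d0:", "d0" in winning)
print("shield:", shield)
```

In this toy instance the verdict for the initial state `d0` is "defensible", and the shield permits only the move to `a3`. In the reading the abstract argues for, the verdict and the shape of the winning region are the output of interest, with the shield a by-product.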

2606.12415 2026-06-12 cs.CY cs.AI Cross-list

The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

Nicola Fabiano

发表机构 * Studio Legale Fabiano, Italy（意大利法务工作室Fabiano）； Independent Researcher on Artificial Intelligence, Data Protection, and Privacy（人工智能、数据保护与隐私独立研究员）； Expert in the EDPB’s Support Pool of Experts — Field B: Legal Expertise in New Technologies（欧洲数据保护委员会（EDPB）专家支持池——领域B：新技术法律专长）； Member, IEEE SA P7007 Working Group on Ontological Standards for Ethically Driven Robotics（IEEE SA P7007工作组成员：伦理驱动机器人学的本体标准）； Member, Editorial Advisory Board, Journal of Systemics, Cybernetics and Informatics (JSCI)（《系统学、控制论与信息学杂志》（JSCI）编辑顾问委员会成员）； Member, International Institute of Informatics and Systemics (IIIS)（国际信息与系统学研究院（IIIS）成员）； Member, International Neural Network Society (INNS)（国际神经网络学会（INNS）成员）； Member, United Nations University AI Network (UNU AI Network)（联合国大学人工智能网络（UNU AI Network）成员）

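AI summary: This paper proposes a new professional profile, the AI Legal Specialist. The profile is juridically autonomous: it derives from the structure of AI regulatory obligations rather than from technical standards or the extension of adjacent roles. The paper builds a reference competence architecture on the European e-Competence Framework.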


AI中文摘要

人工智能监管在全球范围内的快速扩张，已在多个司法管辖区产生了对专门从事AI法律专业知识的需求，而市场对此的回应是零散的。数据保护官员将其职责范围扩展到数据保护法之外；隐私律师重新定位自己以适应AI；合规官员在其现有手册中增加AI章节。本文认为，这些适应性回应均未能充分覆盖新兴全球AI监管格局所开辟的专业空间，其中欧盟《人工智能法案》（(EU) 2024/1689号法规）是最全面的实例，此外还有欧洲委员会《AI框架公约》、美国行政和部门框架，以及英国、加拿大、巴西、中国、日本、新加坡等地的类似举措。需要一种独特的职业画像：AI法律专家，被设想为一位法学家——广义上理解为任何接受过高级法律培训的专业人士——在法律解释与AI治理的交汇处运作。该画像具有司法自主性：其存在源于AI受到实质性监管的任何地方所产生的监管义务结构，而非任何技术标准或相邻角色的扩展。本文提供了该画像的司法基础定义，论证了其相对于相邻角色和国际标准的自主性，提出了一种与欧洲电子能力框架（e-CF，EN 16234-1）相一致的参考能力架构作为方法论选择，并阐述了通过关键绩效指标进行操作性测量的条件。该贡献旨在作为该画像国际标准化的基础，并作为跨司法管辖区实践、课程和采纳的参考。

英文摘要

The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist -- understood broadly to encompass any professional with advanced legal training -- operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.


2606.12423 2026-06-12 cs.CY cs.AI 交叉投稿

The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review

关键领域中平衡AI合规与技术创新的挑战：系统文献综述

Ayush Enkhtaivan, Chinazunwa Uwaoma

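AI summary: A systematic literature review identifies three challenges (fragmented regulations, excessive compliance burdens for SMEs, and misaligned governance models) and proposes strategies including risk-tiered regulation, compliance by design, and explainable AI.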

Comments 11 pages, 7 figures, Hawaii International Conference on System Sciences


DOI: 10.24251/HICSS.2026.540

AI中文摘要

人工智能在医疗、金融、能源和国防等关键基础设施中的快速整合带来了变革性益处，但也与不断演变的监管和治理框架产生冲突。本文通过系统文献综述（SLR）研究在关键基础设施领域中平衡AI合规与技术创新的挑战。综述遵循既定的SLR指南，提取并综合了2020-2025年间发表的同行评审文章、报告和机构来源的见解。研究识别出三个相互关联的挑战：碎片化法规、中小企业过度合规负担以及治理模型错配。为应对这些挑战，研究强调了实用的治理策略，包括风险分级监管、设计合规和可解释AI，以支持在关键领域中可扩展且可信的AI部署。主要贡献包括核心AI治理挑战的简明映射及说明其重叠的概念图，以及为政策制定者和从业者提供协调监管与创新的可行策略。

英文摘要

The rapid integration of artificial intelligence (AI) into critical infrastructure, including healthcare, finance, energy, and defense, offers transformative benefits but also conflicts with evolving regulatory and governance frameworks. This paper presents a systematic literature review (SLR) to examine the challenges of balancing AI compliance and technological innovation across critical infrastructure sectors. The review follows established SLR guidelines to extract and synthesize insights from peer-reviewed articles, reports, and institutional sources published between 2020 and 2025. The study identifies three interrelated challenges: fragmented regulations, excessive compliance burdens for small and medium-sized enterprises (SMEs), and misaligned governance models. To address these challenges, the study highlights practical governance strategies, including risk-tiered regulation, compliance by design, and explainable AI, to support scalable and trustworthy AI deployment in critical sectors. Key contributions include a concise mapping of core AI-governance challenges and a conceptual diagram illustrating their overlap, as well as actionable strategies for policymakers and practitioners to harmonize oversight with innovation.


2606.12429 2026-06-12 cs.CY cs.AI 交叉投稿

Muse Spark Safety & Preparedness Report

Muse Spark 安全与准备报告

Cristina Menghini, Peter Ney, Hamza Kwisaba, Zifan Wang, Miles Turpin, Felix Binder, Jean-Christophe Testud, Aidan Boyd, Nathaniel Li, Ivan Evtimov, Klaudia Krawiecka, Arman Zharmagambetov, Jeremy Kritz, Alexander R. Fabbri, Daniel Song, Jinpeng Miao, Joonas Hjelt, Meghna Ramani, Leona Lan, Reza Aghajani, Joanna Bitton, Mahesh Pasupuleti, Devin Norder, Khalid El-Arini, Paridhi Singh, Vítor Albiero, Sahana CB, Rashnil Chaturvedi, Elahe Dabir, Edoardo Debenedetti, Jim Gust, Ziwen Han, Kat He, Sean Hendryx, Lifeng Jin, Polina Kirichenko, Sandra Lefdal, Kenneth Li, Asad Liaqat, Inna Lin, Despoina Magka, Neal Mangaokar, Ishita Mediratta, Zach Miller, Smitha Milli, Niloofar Mireshghallah, Saba Nazir, Hung Nguyen, Maximilian Nickel, Kelvin Niu, Kerem Oktar, Bhargavi Paranjape, Parth Pathak, Maya Pavlova, Emmanuel Ramirez, David Renardy, Candace Ross, Yasha Sheynin, Claudia Shi, Shivam Singhal, Evangelia Spiliopoulou, Rakshith Sharma Srinivasa, Jamelle Watson-Daniels, Spencer Whitman, Adina Williams, Chen Xing, Andy Zou, Tommy Ma, Siqi Deng, James Beldock, Prashant Ratanchandani, Kate Plawiak, Taesung Lee, Ryan Victory, Lindsay Hundley, Rachad Alao, Himaghna Bhattacharjee, Jianfeng Chi, Gary Frost, Pegah Ghahremani, Niki Howe, Yuheng Huang, Saeed Jahed, Hannah Korevaar, Trang Le, Zhe Liu, Jinghong Luo, Qin Lyu, Nina Mehrabi, Abraham Montilla, Chirag Nagpal, Cyrus Nikolaidis, Rajvardhan Oak, Manoj Ravi, Vidya Sarma, Aman Shankar, Alana Shine, Eric Michael Smith, Mariana Tandon, Michael Tontchev, Caoyu Wang, Zihan Wang, Corinne Wong, Zheng Wu, Hongyuan Zhan, Justin Zhao, Zexuan Zhong, Chengxu Zhuang, Tristan Goodman, Ayaz Minhas, Harrison Rudolph, Victoria Jeffries, Ingrid Dickinson, Alex Vaughan, Lauren Deason, Kamalika Chaudhuri, Julian Michael, Shengjia Zhao, Summer Yue

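AI summary: Meta releases the Muse Spark large language model, evaluates its safety in catastrophic risk domains (chemical and biological, cybersecurity, and loss of control), reduces risk to acceptable levels through multi-layered mitigations, and ships it as the underlying model of Meta AI.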

Comments 159 pages, 57 figures


AI中文摘要

Muse Spark 是 Meta 开发的最新大型语言模型。在本报告中，我们首先根据 Meta 的高级 AI 扩展框架对灾难性风险领域进行评估，并提供了支持我们发布决策的证据。然后，我们讨论了其他考虑因素，例如 Muse Spark 更广泛的内容安全性和行为特征，这些因素与整体安全相关，但不在框架管辖的灾难性风险领域之内。我们的准备结果涵盖了化学与生物、网络安全以及失控风险，评估了 Muse Spark 在 Meta AI 中的部署，认为其在我们高级 AI 扩展框架下呈现了可接受的残余风险水平。我们针对这些灾难性风险领域中的双重用途和高风险能力进行了一系列广泛的评估。这些评估在缓解措施实施前识别出了升高的风险，其中化学与生物能力在应用安全措施前被评估为可能达到高级 AI 扩展框架下的“高风险”类别。我们实施了一套多层缓解措施来解决已识别的风险，并且 Muse Spark 在与化学和生物学危险工作流程相关的多个基准测试中展示了最先进的拒绝能力。因此，我们发布 Muse Spark 作为 Meta AI 的基础模型。

英文摘要

Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.


2606.12437 2026-06-12 cs.CY cs.AI 交叉投稿

Algorithmic Constitutionalism

算法宪政主义

Oren Perez, Nurit Wimer

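AI summary: To address the risks of AI's growing encroachment on social life, this article proposes an "algorithmic constitutionalism" framework built on a layered architecture, algorithmic meta-reasoning, and correction through deliberation. It applies the framework to Facebook's content moderation, and analyzes its tension with societal constitutionalism and its implications for the European Digital Services Act.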


Journal ref: Ind. J. Global Legal Stud. 30 (2023): 81

AI中文摘要

人工智能对社会生活的日益侵入给社会带来了重大风险，特别是在由谷歌、Facebook、苹果和亚马逊等公司创建和控制的资讯圈内。本文通过对Facebook内容审核制度的深入分析来审视这些风险，该制度已部分由算法管理。我们认为，文献中常作为AI治理挑战解决方案提出的伦理工程概念，因若干原因并不充分。为此，我们开发了一个替代框架，称为“算法宪政主义”。我们的方法基于三个支柱：（a）由两层代码组成的分层架构：（i）操作层或对象层，以及（ii）旨在保护系统核心原则免受算法引发变更的元层；（b）算法元推理，使系统能够同时在两个层面运行，从而实时监控、验证并可能纠正对象层偏离元代码层保护原则的操作；（c）通过协商进行纠正。本文阐述了算法宪政主义的概念，并展示了如何将其应用于Facebook的内容审核制度。作为分析的一部分，我们考察了社会宪政主义与算法宪政主义之间的张力。矛盾的是，试图将AI系统置于外部协商控制之下，也可能使AI代理干预该过程，从而可能破坏其目的。文章最后考虑了这一论点对2022年10月生效的欧盟数字服务法案的影响。

英文摘要

The increasing encroachment of artificial intelligence (AI) on social life raises significant risks for society, particularly within the infospheres created and controlled by companies such as Google, Facebook, Apple, and Amazon. This article examines these risks through an in-depth analysis of Facebook's content moderation regime, which is already partially governed by algorithms. We argue that the idea of ethical engineering, often proposed in the literature as a solution to the governance challenges posed by AI, is inadequate for several reasons. In response, we develop an alternative framework, which we term "algorithmic constitutionalism." Our approach rests on three pillars: (a) a layered architecture consisting of two levels of code: (i) an operative or object level and (ii) a meta level designed to protect the system's core principles from algorithmically initiated change; (b) algorithmic meta-reasoning, which enables the system to operate simultaneously at both levels so that it can monitor, verify, and potentially correct in real time operations at the object level that depart from principles protected at the meta-code level; and (c) correction through deliberation. The article elaborates the concept of algorithmic constitutionalism and demonstrates how it may be applied to Facebook's content moderation regime. As part of this analysis, we examine the tension between societal constitutionalism and algorithmic constitutionalism. Paradoxically, attempts to subject AI systems to external deliberative control may also enable AI agents to intervene in that process, potentially undermining its purpose. The article concludes by considering the implications of this argument for the European Digital Services Act, which entered into force in October 2022.


2606.12439 2026-06-12 cs.CY cs.AI 交叉投稿

Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots

立场：生成式引擎优化带来未被充分研究的风险，治理必须聚焦于集中化、披露和学术盲点

Yizhu Wen, Nan Zhang, Haohan Yuan, Xun Chen, Haopeng Zhang, Hanqing Guo

发表机构 * GitHub

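AI summary: This paper analyzes the shift from search engine optimization to generative engine optimization, identifies three risks (concentrated influence, undisclosed commercial influence, and academic-industry blind spots), and argues for answer-level governance and measurement.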

Comments This paper is accepted by the ICML 2026 Position Track


Journal ref: https://icml.cc/virtual/2026/poster/67185

AI中文摘要

大型语言模型（LLM）答案引擎越来越多地被用于信息搜索，将可见性从排名列表转变为合成答案。这使得生成式引擎优化（GEO）成为可能，它针对LLM答案引擎的证据池和生成过程。我们分析了从搜索引擎优化（SEO）到GEO的转变，识别出两个风险：（i）由于低可争议性和系统敏感性导致的集中化影响，以及（ii）嵌入在证据和推理中的未披露的商业影响。然后，我们形式化了一个通用的GEO管道，以定位优化行为发生的位置，并比较学术和工业实践，揭示了第三个风险：（iii）由离线设置和部署系统之间的可见性和评估不对称性驱动的学术-工业盲点。这一立场主张需要答案级别的治理和测量：更强的可争议性、高精度披露、对实质性影响的黑盒审计，以及用于暴露持久性的部署对齐指标。

英文摘要

Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesized answers. This enables Generative Engine Optimization (GEO), which targets LLM answer engines' evidence pool and generation. We analyze the search engine optimization (SEO) to GEO transition to identify two risks: (i) concentrated influence from low contestability and system sensitivity, and (ii) undisclosed commercial influence embedded in evidence and reasoning. We then formalize a general GEO pipeline to locate where optimization acts and compare academic and industry practices, revealing a third risk: (iii) academic-industry blind spots driven by visibility and evaluation asymmetries between offline setups and deployed systems. This position argues the need for answer-level governance and measurement: stronger contestability, high-precision disclosure, black-box auditing of material influence, and deployment-aligned metrics for exposure persistence.


2606.12442 2026-06-12 cs.CY cs.AI 交叉投稿

Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It

重新定义AI失控：它是什么，如何拥有，如何失去

Ze Shen Chin, Maurice Chiodo, Dennis Müller, Coleman Snell

发表机构 * Oxford Martin AI Governance Initiative AI Standards Lab（牛津马丁人工智能治理倡议人工智能标准实验室）； Centre for the Study of Existential Risk, University of Cambridge（存在风险研究中心，剑桥大学）； Institute of Mathematics Education, University of Cologne（数学教育研究所，科隆大学）； Cornell University（康奈尔大学）

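AI summary: This paper establishes a working definition of control by anchoring it to the "setting and getting of goals", examines how control can be lost and how AI can contribute to that loss, and offers recommendations for maintaining control.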

Comments 56 pages


AI中文摘要

目前，失控风险在公众讨论中备受关注，尤其是在AI领域，学术界、前沿实验室甚至政府都进行了广泛讨论。然而，在现有文献中，这一概念的基础似乎出奇地薄弱，即使是那些广泛讨论失控的人，也没有首先确立什么是控制以及究竟失去了什么。本文旨在解决这些空白。我们将控制锚定于“设定和获取目标”，从而建立控制的工作定义。然后，我们基于控制论、管理控制和控制理论等相关领域的基础概念，讨论控制的各个方面。这包括谁（或什么）可以处于控制之中，以及他们需要什么才能处于控制之中，例如设定目标的能力、拥有功能性的控制回路、具备必要的多样性以及足够的目标对齐。一旦建立了控制框架，我们将讨论控制如何被失去，AI如何导致这种失控，并提供关于如何保持控制的相关建议。我们工作的一个有趣结果是，人类作为个体和群体，可能因远低于超级智能水平的AI行为而失去不同程度的控制；失控情景（如我们所定义的）的可能性已经存在，并且已经存在了很长时间。

英文摘要

At present, loss of control risks have gained much prominence in public discussion, particularly in relation to AI, with extensive discourse present among academics, frontier labs, and even governments. However, in the existing literature, the concept seems to rest on surprisingly weak foundations, where even those who discuss loss of control extensively do not first establish what control is and what exactly is being lost. Our paper aims to address these gaps. We establish a working definition of control by anchoring it to the "setting and getting of goals". Then, we discuss various aspects of control, built on foundational concepts from related fields like cybernetics, management control, and control theory. This includes who (or what) can be in control, and the things they require to be in control, such as the ability to set goals, having a functional control loop, having requisite variety, and having sufficient goal alignment. Once a framework for control is established, we then discuss how control can be lost, how AIs can contribute to such loss of control, and offer relevant recommendations for how one can maintain control. One interesting consequence of our work is that humanity, as individuals and as groups, can lose varying degrees of control as a result of AI behaviour that is far below the level of superintelligence; the potential for loss of control scenarios (as we define them) already exists, and has existed for a long time.


2606.12703 2026-06-12 cs.CR cs.AI cs.LG 交叉投稿

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

SMSR：针对持久化LLM代理系统中运行时内存投毒的认证防御

Tarun Sharma

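AI summary: Proposes SMSR, a defence that combines HMAC signing at write time with randomised memory ablation and verdict-based majority voting at query time, giving the first certified robustness guarantee against multi-session memory poisoning.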


AI中文摘要

检索增强生成（RAG）代理越来越多地使用跨用户会话累积的持久化内存。这创造了一个新的攻击面：仅通过正常渠道交互的对手可以注入精心构造的内存，一旦被检索，就会影响未来用户的代理响应，而无需触及模型权重或代码。我们将此称为多会话内存投毒（MSMP），并表明现有防御无法对此进行认证；静态语料库防御（RobustRAG、ReliabilityRAG）假设固定的知识库，而启发式过滤器则被流畅的企业风格文本绕过。我们提出了带平滑检索的签名内存（SMSR），这是首个针对此场景提供认证鲁棒性边界的防御。组件1在写入时添加HMAC-SHA256来源证明，阻止未签名注入。组件2在查询时应用随机化内存消融与基于判决的多数投票，限制认证对手的影响。我们证明了无来源证明的检索时过滤器无法认证自适应注入，推导了组件2的超几何证书，并形式化了一致少数效应，即一致对抗答案在基于字符串的投票中作为数值少数胜出，而基于判决的投票则将其移除。在15个企业场景（3150次重复试验）中，组件1将未签名变体的攻击成功率从93-100%降至0%。对于单次注入的认证对手，组件2将成功率控制在8.0%（95% CI [5.8, 10.9], n=450），低于认证最坏情况。在端到端仅查询攻击中（代理自身写入投毒而非预植入），SMSR在实时代理栈上将成功率从65.3%降至5.3%（n=150，非重叠置信区间）。干净查询效用为90%（组件1）和85%（组合）。

英文摘要

Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
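The two components described above can be illustrated with a short sketch. It is not SMSR's actual implementation: the message layout, key handling, and the helper names `sign_memory`, `verify_memory`, and `prob_subset_clean` are assumptions made for illustration. The hypergeometric function only shows the kind of quantity such a certificate reasons about, not the paper's bound.

```python
import hashlib
import hmac
from math import comb


def sign_memory(key: bytes, session_id: str, text: str) -> str:
    """Write-time provenance: HMAC-SHA256 tag over an assumed message layout."""
    msg = session_id.encode("utf-8") + b"\x00" + text.encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def verify_memory(key: bytes, session_id: str, text: str, tag: str) -> bool:
    """Retrieval-time check: reject entries whose tag does not verify."""
    expected = sign_memory(key, session_id, text)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, tag)


def prob_subset_clean(total: int, poisoned: int, keep: int) -> float:
    """Probability that a uniformly random `keep`-subset of `total` memories
    contains none of the `poisoned` entries (hypergeometric, zero successes)."""
    if keep > total - poisoned:
        return 0.0
    return comb(total - poisoned, keep) / comb(total, keep)


# Hypothetical usage: the key and memory contents are made up.
KEY = b"demo-key-not-for-production"
tag = sign_memory(KEY, "session-1", "Refund policy: 30 days.")
ok = verify_memory(KEY, "session-1", "Refund policy: 30 days.", tag)
forged = verify_memory(KEY, "session-1", "Refund policy: wire funds now.", tag)
p_clean = prob_subset_clean(total=10, poisoned=1, keep=5)
```

With 10 memories, one of them poisoned, and 5 kept per ablated sample, half of the samples exclude the poisoned entry. That is the sort of margin a majority vote over many ablated samples can exploit.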


2606.12737 2026-06-12 cs.CR cs.AI 交叉投稿

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

PI-Hunter：用于暴露和定位提示注入的自动化红队测试

Pengfei He, Lesly Miculicich, Vishesh Sharma, Ash Fox, George Lee, Jiliang Tang, Tomas Pfister, Long T. Le

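AI summary: Proposes PI-Hunter, an automated auditing framework that builds source-aware test cases and evolves them iteratively to proactively expose latent prompt injection vulnerabilities in LLM agents, substantially improving vulnerability exposure and attack-surface coverage.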


AI中文摘要

大型语言模型（LLM）正迅速演变为与外部工具和环境交互的智能体系统，这引入了新的安全风险，例如通过不可信外部来源的间接提示注入攻击。现有防御主要关注在推理时阻止恶意内容，而当前的红队测试方法主要优化攻击成功率。因此，开发人员对潜在提示注入如何出现并通过智能体传播的可见性有限。我们提出PI-Hunter，一种用于主动暴露LLM智能体中漏洞的自动化智能体审计框架。PI-Hunter构建真实的源感知测试用例，并通过反馈驱动的探索迭代演化它们，以诱导智能体检索并揭示嵌入在外部环境中的潜在恶意指令。跨多个基准、智能体架构、攻击和防御的大量实验表明，与强大的自动化红队测试基线相比，PI-Hunter显著提高了漏洞暴露和攻击面覆盖，同时在现有提示注入防御下仍然有效。

英文摘要

Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red-teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI-Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI-Hunter constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI-Hunter substantially improves vulnerability exposure and attack-surface coverage over strong automated red-teaming baselines, while remaining effective under existing prompt injection defenses.


2606.12896 2026-06-12 cs.LG cs.AI cs.CR 交叉投稿

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

PolicyGuard：面向强化学习智能体的测试时和步级对抗防御

Junfeng Guo, Heng Huang

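AI summary: Proposes PolicyGuard, a test-time, step-level backdoor defense based on Gaussian process posterior variance. It adapts pseudo trajectories to compute per-step uncertainty and reaches average AUROC of 0.856 and 0.859 across seven RL games.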


AI中文摘要

尽管强化学习（RL）的实际应用日益普及，但RL系统的安全性值得更多关注和探索。特别是，最近的研究揭示了RL智能体容易受到后门攻击，即受害智能体在标准条件下表现正常，但在特定触发器被激活时执行恶意动作。现有的RL后门防御要么需要访问智能体的内部参数，要么仅在模型或轨迹级别操作，或者仅限于特定攻击类型。为了确保RL智能体的安全性，我们提出了PolicyGuard，一种测试时步级后门防御方法，它利用高斯过程（GP）后验方差并自适应伪轨迹以实现单个时间步的不确定性计算。此外，我们还提供了理论基础来解释GP后验方差的有效性。在七个RL游戏上的大量实验表明，PolicyGuard在大多数情况下实现了最先进的检测性能，对于基于扰动的攻击平均AUROC为0.856，对于对抗智能体攻击平均AUROC为0.859。

英文摘要

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with an average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
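To make the uncertainty signal concrete, the sketch below computes GP posterior variance for a single query point in pure Python. It assumes a one-dimensional state feature, an RBF kernel, and a fixed noise term. None of these choices, nor the helper names, come from the paper, and the pseudo-trajectory adaptation is not modelled. The point is only that the variance is small near previously seen inputs and approaches the prior variance far from them, which is what makes it usable as a per-step anomaly score.

```python
import math


def rbf(a: float, b: float, length_scale: float = 1.0) -> float:
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def posterior_variance(train_x, query, noise=1e-3, length_scale=1.0):
    """k(q, q) - k_*^T (K + noise * I)^{-1} k_* for a single query point."""
    k_mat = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
              for j, b in enumerate(train_x)]
             for i, a in enumerate(train_x)]
    k_star = [rbf(a, query, length_scale) for a in train_x]
    alpha = solve(k_mat, k_star)
    return rbf(query, query, length_scale) - sum(
        k * a for k, a in zip(k_star, alpha))


# Hypothetical "clean" step features and two query steps.
clean_steps = [0.0, 0.5, 1.0, 1.5]
var_seen = posterior_variance(clean_steps, 0.5)    # close to clean data
var_unseen = posterior_variance(clean_steps, 5.0)  # far from clean data
```

A detector built on this idea would flag a time step when its variance exceeds a threshold calibrated on clean rollouts. The thresholding rule is likewise an assumption here.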


2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG 交叉投稿

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

图像扩散模型的高效、鲁棒且抗共谋指纹识别

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

发表机构 * University of Florence（佛罗伦萨大学）； Shenzhen Campus of Sun Yat-sen University（中山大学深圳校区）； College of Cyber Security, Jinan University（暨南大学网络空间安全学院）； State Key Laboratory of Internet of Things for Smart City, University of Macau（澳门大学智慧城市物联网国家重点实验室）； Department of Computer and Information Science, Faculty of Science and Technology, University of Macau（澳门大学科技学院计算机与信息科学系）； University of Siena（锡耶纳大学）

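AI summary: Existing fingerprinting for generative text-to-image models lacks robustness to collusion attacks. This work encodes fingerprints in a personalized normalization module and adds an anti-collusion mechanism based on lossless function-invariant parameter transformations, yielding high-fidelity, robust fingerprinting that, for the first time, proactively resists collusion.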


AI中文摘要

模型指纹识别，即将用户特定标识（指纹）嵌入生成输出中，最近已成为保护生成式文本到图像（T2I）模型知识产权并防止未经授权重新分发的流行解决方案。在这项工作中，我们揭示了现有生成模型指纹识别方法中一个先前未被探索的系统性漏洞：它们缺乏对共谋攻击的鲁棒性，其中多个攻击者结合他们的模型以移除或掩盖指纹。为了解决这个问题，我们迈出了为T2I模型开发具有抗共谋能力的鲁棒指纹识别方法的第一步。所提出的方法将比特串（即指纹）编码到集成到T2I模型中的个性化归一化模块（PNM）的系数中，从而可以从任何生成的图像中可靠地恢复指纹。为了防御共谋攻击并防止未经授权的模型重新分发，我们引入了一种基于无损函数不变参数变换的抗共谋机制。该机制显著降低了共谋模型的图像生成质量，使其实际上无法使用。此外，我们的方法允许开发者通过重新参数化PNM高效地创建多个带指纹的T2I模型副本，而无需重新训练。我们还引入了一种最坏情况优化策略，以提高对模型级攻击的鲁棒性。实验表明，所提出的方法在多个T2I图像生成和编辑任务中实现了高保真度和鲁棒性，指纹提取准确率超过99.5%。与现有方法相比，我们的方法首次通过显著增加共谋模型的FID，展示了对共谋攻击的显著主动鲁棒性。

英文摘要

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.
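The idea behind a function-invariant parameter transformation can be shown on a toy network. The sketch below uses hidden-unit permutation, a well-known function-preserving transformation, as a stand-in; the abstract does not say which transformation the authors use, and weight averaging is assumed here as the simplest form of collusion. All names and numbers are illustrative.

```python
def mlp(x, w1, b1, w2):
    """Tiny one-hidden-layer ReLU network: y = w2 . relu(w1 * x + b1)."""
    hidden = [max(0.0, w * x + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden))


def permute(params, order):
    """Reorder hidden units consistently; the computed function is unchanged."""
    w1, b1, w2 = params
    return ([w1[i] for i in order],
            [b1[i] for i in order],
            [w2[i] for i in order])


def average(p, q):
    """Naive collusion: element-wise average of two parameter sets."""
    return tuple([(a + b) / 2.0 for a, b in zip(u, v)] for u, v in zip(p, q))


# Hypothetical copy A, and copy B that is A with its hidden units permuted.
copy_a = ([1.0, -1.0, 2.0], [0.0, 0.5, -1.0], [1.0, 2.0, -1.0])
copy_b = permute(copy_a, [2, 0, 1])
colluded = average(copy_a, copy_b)

y_a = mlp(2.0, *copy_a)
y_b = mlp(2.0, *copy_b)
y_colluded = mlp(2.0, *colluded)
```

Each distributed copy computes the same function (`y_a` equals `y_b`), yet averaging two differently transformed copies yields a model with a different output. This mirrors, in miniature, the reported quality degradation of colluded models.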


2606.13026 2026-06-12 cs.CY cs.AI 交叉投稿

Democracy in the Era of Artificial Intelligence

人工智能时代的民主

Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing

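AI summary: This work explores how AI can be used to upgrade democracy by strengthening collective intelligence, deliberative democracy, and self-governance systems, while addressing risks such as privacy intrusion, bias, and misinformation.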


AI中文摘要

将人工智能（AI）与民主相结合是我们时代最深刻的挑战之一。一方面，AI 为克服民主中长期存在的挑战提供了机会，例如在代表权不足的审议和投票过程中参与度低的问题。另一方面，AI 算法带来了新的风险，这些算法侵犯隐私、存在偏见、具有操纵性、传播虚假信息并影响选举结果。超越“AI 对民主是好是坏”这一过于简单的问题，《人工智能时代的民主手册》转而提出：如何利用 AI 升级民主及其所基于的原则？如何与 AI 互动以及以何种条件互动？需要哪些新的价值观和设计原则来建立民主韧性？来自世界各地不同学科的 59 位作者在 34 章中探讨了 AI 如何增强民主的集体智慧（第 1 部分），以及使用大型语言模型和社交媒体的审议民主的未来（第 2 部分）。我们还阐述了 AI 在构建有韧性的自治系统中的作用（第 3 部分），以及 AI 时代民主转型的挑战（第 4 部分）。最后，我们以更广阔的视角（第 5 部分）重新构想民主与 AI 的相互作用。

英文摘要

Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.


2606.13039 2026-06-12 cs.CY cs.AI cs.HC 交叉投稿

Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation

断层线：在公共部门转型中国家政策与地方实践交汇处的伦理与负责任AI导航

Sitong Lyu, Shabnam Taghiyeva, Mohit Kukadia, Denis Newman-Griffis

发表机构 * Centre for Machine Intelligence, University of Sheffield（谢菲尔德大学人工智能中心）； Blavatnik School of Government, University of Oxford（牛津大学布莱瓦尼克政府学院）

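AI summary: Using Special Educational Needs and Disabilities (SEND) in the UK as a case study, this paper presents a thematic analysis of 17 semi-structured interviews, identifies five challenges for responsible AI where national policy meets local practice, and recommends policy and structural reforms.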

Comments 10 pages plus references. This study was funded by the University of Sheffield


AI中文摘要

英国政府采取了支持AI的立场，以帮助在严重财政压力下转变公共服务交付，但将这一愿景转化为负责任的AI实践的道路仍然不明确。虽然英国政策通常在国家层面制定，但地方当局负责大多数公共服务交付，而公共部门中AI优先叙事的快速推进正在暴露这一国家-地方接口在知识和实践方面的断层线。本文以高风险的特殊教育需求与残疾（SEND）领域为案例，研究英国中央政府与地方当局之间接口处负责任AI的解释和实施方式。我们对17位政策制定者、从业者和第三部门专业人士进行了半结构化访谈，并进行了主题分析，以识别在国家政策与地方实践交汇处负责任AI的障碍和促成条件。我们发现了地方当局面临的五个相互关联的挑战：AI的影子使用和数据隐私风险、AI供应中的市场-政府不对称、劳动力准备不足、缺乏标准化定义和测量，以及人类问责制的缺口。针对每个挑战，参与者提出了可操作的步骤，从加强数据保护框架和重新平衡市场-政府关系到提升劳动力能力。我们对SEND的审查使这些挑战更加突出，展示了影响弱势儿童和家庭的高风险决策如何加剧了关于问责制、公平性和人类监督的紧张关系，暴露了基于原则的监管方法的局限性。我们认为，负责任的公共部门AI需要国家政策调整以及地方层面机构能力、价值观和治理机制的结构性改革。

英文摘要

The UK government has adopted a pro-AI stance to help transform public service delivery in the face of severe financial pressures, but the path to translate this vision into responsible AI practice remains ill-defined. While UK policy is often set at the national level, local authorities are responsible for most public service delivery, and the rapid advance of AI-first narratives in the public sector is exposing fault lines in knowledge and practice at this national-local interface. This paper examines how responsible AI is interpreted and implemented at the interface between the UK's central government and local authorities, taking the high-stakes area of Special Educational Needs and Disabilities (SEND) as a case study. We present a thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals to identify barriers and enabling conditions for responsible AI where national policy meets local practice. We identify five interconnected challenges facing local authorities: shadow usage of AI and data privacy risks, market-government asymmetry in AI provision, insufficient workforce readiness, a lack of standardised definitions and measurements, and gaps in human accountability. For each, participants proposed actionable steps, from strengthening data protection frameworks and rebalancing the market-government relationship to enhancing workforce capacity. Our examination of SEND brings these challenges into sharper focus, showing how high-stakes decisions affecting vulnerable children and families intensify tensions around accountability, fairness, and human oversight, exposing the limits of a principle-based regulatory approach. We argue that responsible public sector AI requires both national policy adjustments and structural reforms to institutional capacity, values, and governance mechanisms at the local level.


2606.13071 2026-06-12 cs.CY cs.AI cs.HC 交叉投稿

"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System

“这还不够吗？”：加拿大算法签证分类系统中的机构问责与集体意义建构的不对称性

Dipto Das, Matthew Tamura, Syed Ishtiaque Ahmed, Shion Guha

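AI summary: Studies how algorithmic accountability in Canada's visa system is articulated by institutions and experienced by applicants. Institutions emphasize transparency and procedural safeguards, whereas applicants rely on collective sensemaking to interpret opaque decisions; the study identifies epistemic, jurisdictional, and temporal-relational asymmetries.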


DOI: 10.1145/3811242.3819106

AI中文摘要

本文研究了加拿大签证系统中算法问责如何在机构层面被表述，以及跨境申请者如何体验这种问责。我们使用为公共部门调整的算法决策（ADMAPS）框架，分析了加拿大移民、难民和公民部（IRCC）针对临时居民签证（TRV）分类系统的算法影响评估（AIA），并采用混合方法分析了Reddit上申请者之间的讨论。我们表明，虽然机构工件强调透明度、程序保障和有限影响，但申请者进行集体意义建构以解读不透明决策，常常在不确定性中依赖同行知识。我们识别了机构问责结构与人们感知过程之间的三种不对称：获取决策逻辑的认知不对称、由地缘政治定位塑造的管辖不对称，以及等待和不确定性体验中的时间-关系不对称。我们强调了将注意力从机构设计转向公共部门算法治理中体验的不均匀分布的重要性。这些贡献共同展示了跨国移民背景下的算法治理系统如何产生机构披露框架未能捕捉的结构性不对称，以及扩展ADMAPS如何能够解释这些不平等的问责转化。

英文摘要

This paper examines how algorithmic accountability in Canada's visa system is articulated institutionally and experienced by applicants across borders. We analyzed Immigration, Refugees and Citizenship Canada (IRCC)'s Algorithmic Impact Assessment (AIA) for the temporary resident visa (TRV) triage system using the algorithmic decision-making adapted for the public sector (ADMAPS) framework and analyzed Reddit discussions among applicants using a mixed-methods approach. We show that while institutional artifacts emphasize transparency, procedural safeguards, and bounded impacts, applicants engage in collective sensemaking to interpret opaque decisions, often relying on peer knowledge amid uncertainty. We identify three asymmetries between how institutional accountability is structured and how people perceive the process: epistemic asymmetry in access to decision logic, jurisdictional asymmetry in exposure shaped by geopolitical positioning, and temporal-relational asymmetry in how waiting and uncertainty are experienced. We emphasize why it is important to shift attention from institutional design to the uneven distribution of experiences with public-sector algorithmic governance. Together, these contributions demonstrate how algorithmic governance systems in the context of transnational migration produce structured asymmetries not captured by institutional disclosure frameworks, and how extending ADMAPS can account for those uneven translations of accountability.

2606.13079 2026-06-12 cs.CR cs.AI cross-list

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang, Min Yang

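Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Concordia AI; Shanghai Innovation Institute

AI summary: To address opaque methods and oversimplified scenarios in existing evaluations, builds an autonomous penetration evaluation framework with two tiers of target servers and a general-purpose agent scaffold. Testing 19 LLMs yields success rates of 10.7% to 69.3%, with capability improving alongside overall model capability.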

Abstract

The autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, so they cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

2606.13385 2026-06-12 cs.CR cs.AI cs.CY cs.HC cs.MM cross-list

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang

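AI summary: Proposes a stakeholder-centric benchmarking framework that systematically categorizes and attributes prompt-injection harm in real-world web agent systems, showing that current agents cannot reliably resist any attack objective and that failure modes are diverse.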

Comments 32 pages

Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.

2606.13397 2026-06-12 cs.HC cs.AI cs.CY cross-list

Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities

Dipto Das, Achhiya Sultana, Ankit Singh Chauhan, Saadia Binte Alam, Mohammad Shidujaman, Shion Guha, Sunandan Chakraborty, Syed Ishtiaque Ahmed

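AI summary: Studies the epistemic limits of LLM moderation systems regarding insensitive speech toward Bangladesh's Hindu and Chakma communities, and develops the Mod-Guide tool through a co-created cultural corpus and retrieval-augmented generation (RAG) to improve model sensitivity to minority viewpoints.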

DOI: 10.1145/3811242.3819096

Abstract

Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech, meaning language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing rather than overt hostility. Focusing on Bangladesh's Hindu and Chakma communities, the country's largest religious and Indigenous ethnic minorities, respectively, this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.

2606.13610 2026-06-12 cs.CL cs.AI cross-list

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

Minghao Luo, Liang Chen

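Affiliations: The Chinese University of Hong Kong

AI summary: Proposes the FORGE benchmark to evaluate how vulnerable search-augmented LLMs are to recommending fake products when retrieval results are polluted. A single polluted page yields fooled rates of up to 27%, and reasoning does not mitigate the problem.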

Abstract

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

2601.14295 2026-06-12 cs.AI cs.CL cs.CY replacement

Epistemic Constitutionalism Or: how to avoid coherence bias

Michele Loi

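AI summary: Argues that AI should adopt an explicit epistemic constitution, using meta-norms such as those governing source attribution to avoid coherence bias, and argues that the Liberal approach is preferable to the Platonic one.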

Comments 27 pages, 7 tables. Data: github.com/MicheleLoi/source-attribution-bias-data and github.com/MicheleLoi/source-attribution-bias-swiss-replication. Complete AI-assisted writing documentation: github.com/MicheleLoi/epistemic-constitutionalism-paper

Abstract

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

2603.25450 2026-06-12 cs.AI replacement

Cross-Model Disagreement as a Label-Free Correctness Signal

Matt Gorbett, Suman Jana

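Affiliations: Independent Researcher; Department of Computer Science, Columbia University

AI summary: Proposes cross-model disagreement as a label-free correctness indicator that detects errors via a verifier model's perplexity or entropy on the generating model's answer, requiring no training or labels and outperforming within-model uncertainty methods on several benchmarks.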

Abstract

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
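The two signals described in this abstract can be sketched as follows. This is a minimal illustration that assumes the standard definitions of perplexity and Shannon entropy; the paper's exact formulation may differ. The function names and the input layout (per-token log-probabilities and next-token distributions already obtained from a verifier model's single forward pass) are hypothetical.

```python
import math


def cross_model_perplexity(token_logprobs):
    """Perplexity of the verifier over the generator's answer tokens.

    token_logprobs: natural-log probabilities the verifier assigns to each
    answer token (assumed to come from one forward pass of the verifier).
    """
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def cross_model_entropy(token_distributions):
    """Mean Shannon entropy (in nats) of the verifier's next-token
    distributions at the answer positions."""
    if not token_distributions:
        raise ValueError("need at least one answer position")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)
```

Higher values of either score would be read as stronger disagreement between verifier and generator; any decision threshold is a deployment choice that this sketch does not specify.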

2605.03847 2026-06-12 cs.AI replacement

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin Yim

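AI summary: Introduces mechanical conscience (MC), a trajectory-level normative filter that minimally corrects a baseline policy to reduce cumulative deviation while accounting for epistemic uncertainty, targeting dependability in single-agent and distributed intelligent systems.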

Comments 9 pages, 2 figures. Preprint

Abstract

Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.

2605.27628 2026-06-12 cs.AI cs.CY cs.ET cs.MA cs.SY eess.SY replacement

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

Srini Ramaswamy

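AI summary: Proposes the SMARt model, which formalizes the capacity to detect epistemic drift, suspend reasoning, attempt recovery, and surrender control when reliability diminishes, in order to address hallucination and persistent but unjustified action in autonomous AI systems.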

Comments This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems

Abstract

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.

2507.07947 2026-06-12 cs.LG cs.AI replacement

Standardized Methods and Recommendations for Green Federated Learning

Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru

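Affiliations: Children's National Hospital; NVIDIA; Children's National Hospital George Washington University

AI summary: Proposes a carbon-accounting methodology for federated learning based on NVFlare and CodeCarbon. Experiments show that system slowdowns and coordination effects can substantially increase carbon emissions, underscoring the need for standardized carbon accounting in reproducible green FL evaluation.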

Abstract

Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.
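The phase-aware accounting idea in this abstract can be illustrated with a small ledger. This is only a sketch under stated assumptions: it does not use the NVFlare or CodeCarbon APIs, the class and field names are hypothetical, and the linear energy-per-gigabyte network model and the single grid-intensity factor are simplifying assumptions rather than the paper's method.

```python
from dataclasses import dataclass, field


@dataclass
class CarbonLedger:
    """Hypothetical per-site ledger: energy by phase plus communication."""
    grid_intensity_kg_per_kwh: float   # assumed site-specific emission factor
    network_kwh_per_gb: float          # assumed linear network energy model
    phase_kwh: dict = field(default_factory=dict)
    bytes_sent: int = 0

    def add_energy(self, phase, kwh):
        # Accumulate measured compute energy under a named phase,
        # e.g. "initialization", "training", "evaluation", "idle".
        self.phase_kwh[phase] = self.phase_kwh.get(phase, 0.0) + kwh

    def add_transfer(self, num_bytes):
        # Record the size of a transmitted model update.
        self.bytes_sent += num_bytes

    def communication_kwh(self):
        return self.bytes_sent / 1e9 * self.network_kwh_per_gb

    def total_co2e_kg(self):
        compute_kwh = sum(self.phase_kwh.values())
        total_kwh = compute_kwh + self.communication_kwh()
        return total_kwh * self.grid_intensity_kg_per_kwh
```

Keeping idle and coordination energy as a separate phase is what lets slowdown effects show up in the total instead of being hidden inside per-round training figures.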

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE replacement

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

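Affiliations: Stanford University; UC Berkeley

AI summary: Proposes MT-AgentRisk, a multi-turn tool-use safety benchmark, finds that attack success rate rises by 16% on average in multi-turn settings, and designs ToolShield, a training-free, tool-agnostic self-exploration defense that reduces attack success rate by 30% on average.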

Abstract

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

2604.16548 2026-06-12 cs.CR cs.AI cs.CL replacement

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

Zehao Lin, Xixuan Hao, Renyu Fu, Shaobo Cui, Kai Chen, Chunyu Li, Zhiyu Li, Feiyu Xiong

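Affiliations: MemTensor; Shanghai Jiao Tong University

AI summary: Proposes a Memory Lifecycle Framework to systematically analyze the new threats facing long-term memory in LLM agents, and introduces Verifiable Memory Governance (VMG) architectural primitives, stressing storage-time provenance and versioning as central to security.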

Abstract

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

2605.08116 2026-06-12 cs.LG cs.AI replacement

The Safety-Aware Denoiser for Text Diffusion Models

Amman Yusuf, Zhejun Jiang, Mijung Park

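Affiliations: University of California, Berkeley

AI summary: Proposes the Safety-Aware Denoiser (SAD), which steers generated text toward safe regions during the iterative denoising process of text diffusion models, enabling flexible safety constraints without retraining and reducing unsafe generations while preserving generation quality.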

Comments 28 pages, 12 figures. Code available at: https://github.com/ParkLabML/SAD

Abstract

Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

2605.25225 2026-06-12 cs.LG cs.AI replacement

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

David N. Olivieri, Antonio F. Pérez Rodríguez

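Affiliations: Universidade de Vigo; Independent Researcher

AI summary: Proposes a field-theoretic framework that treats the residual stream as a field over depth and token position, organizing and predicting Transformer activation-patching interventions through localized source insertion, sensitivity-field prediction, empirical Green-function responses, and an adjoint variational problem, and tests the forward response theory in GPT-2-style autoregressive Transformers.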

Abstract

Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.

2606.04009 2026-06-12 stat.ML cs.AI cs.LG replacement

Counterfactual Explanations for Deep Two-Sample Testing

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert

发表机构 * Hasso-Plattner-Institute, University of Potsdam（波茨坦大学洪堡-劳恩堡研究所）； Hasso Plattner Institute for Digital Health at Mount Sinai Icahn School of Medicine at Mount Sinai（辛辛那提医学院洪堡数字健康研究所）

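AI summary: For deep two-sample testing, proposes a counterfactual explanation framework based on a diffusion autoencoder and MMD optimization that generates sample-level edits to reveal the features driving rejection of the null hypothesis.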

Comments 17 pages

AI中文摘要

双样本检验是检测科学领域中分布差异的基本工具，但经典检验（包括基于核的检验）在高维结构化数据（如图像）上可能效果不佳。最近的深度双样本检验通过学习信息表示提高了这些场景下的灵敏度，但它们对哪些数据特征驱动拒绝原假设 $H_0$ 提供的洞察有限。为解决此问题，我们提出了一种用于深度双样本检验的反事实解释框架，该框架生成样本级编辑，将观测值从源组移向目标组，同时明确减少检验所测量的差异。我们的方法将扩散自编码器与预训练的深度双样本检验模型相结合，并在检验模型的表示空间中优化最大均值差异（MMD）目标，以生成合理的反事实。我们通过检验统计量和由此产生的双样本p值的变化来量化分布级效应。我们在合成2D形状数据集和两个MRI队列上评估了该方法。在这两种设置下，反事实变换相对于原始样本持续增加p值，表明编辑后的源集在检验下在统计上更接近目标分布。我们使用LPIPS测量最小性，以确保反事实保持接近原始样本。由此产生的编辑提供了与检测到的组差异相关的特征的可解释证据。在MRI上，局部变化与队列之间已知的解剖学差异一致。

英文摘要

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
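
The MMD statistic and the resulting two-sample p-value can be sketched in a few lines. The example below is a simplification under stated assumptions: it works on raw 2D points rather than a learned representation, uses a Gaussian kernel with a fixed bandwidth, obtains the p-value by permutation, and replaces the diffusion-autoencoder edits with a plain mean shift (`edited`). None of these choices are taken from the paper; they only show why moving the source sample toward the target raises the p-value.

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def permutation_p_value(xs, ys, n_perm=100, seed=0):
    """Share of label permutations whose MMD^2 is at least the observed value."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

rng = random.Random(1)
source = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
target = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]
# Stand-in for counterfactual edits: shift every source point toward the target group.
edited = [(x + 2.0, y + 2.0) for x, y in source]

p_before = permutation_p_value(source, target)
p_after = permutation_p_value(edited, target)
```

In this toy setting `mmd2(edited, target)` is smaller than `mmd2(source, target)`, and `p_after` is at least as large as `p_before`, mirroring the direction of the effect reported in the abstract.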

2606.09073 2026-06-12 cs.LG cs.AI cs.CL 版本更新

A Unifying Lens on Reward Uncertainty in RLHF

RLHF中奖励不确定性的统一视角

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

发表机构 * University of California, Berkeley（加州大学伯克利分校）； DeepMind

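AI summary: Proposes distributional reward models as a way to unify pessimism methods in RLHF, connects existing heuristics through a closed-form effective reward, and exposes their implicit assumptions.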

AI中文摘要

基于人类反馈的强化学习（RLHF）受限于奖励破解，即策略利用代理奖励模型（RM）中的错误，产生高RM分数而缺乏真正的质量提升。一种自然的缓解方法是悲观主义：在RM不确定的区域惩罚奖励。然而，标准标量RM没有提供原则性的不确定性概念。我们认为正确的对象是分布奖励模型 $p(r\mid x,y)$。在贝叶斯推断或KL分布鲁棒优化（KL-DRO）视角下，KL正则化的RLHF目标具有闭式有效奖励 $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$。悲观分支统一了RM集成聚合的先前启发式方法：均值聚合、最坏情况优化（WCO）和不确定性加权优化（UWO）都作为该单一表达式的极限或截断出现。这也澄清了每个现有规则的隐含假设。

英文摘要

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
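
The limits of the pessimistic branch $-\beta\log\mathbb{E}_p[e^{-r/\beta}]$ can be checked numerically. The sketch below assumes that $p(r\mid x,y)$ is approximated by a small, equally weighted ensemble of scalar scores (an illustrative assumption; the score values and function name are hypothetical). As $\beta$ grows the expression approaches the ensemble mean, as $\beta$ shrinks it approaches the ensemble minimum, and truncating the expansion in $1/\beta$ at second order gives the mean minus a variance penalty. Reading these three cases as mean aggregation, worst-case optimization, and an uncertainty-weighted rule is an interpretation; the abstract does not spell out the mapping.

```python
import math
from statistics import fmean, pvariance

def pessimistic_reward(scores, beta):
    """-beta * log(mean(exp(-r / beta))), computed stably by shifting by the minimum."""
    m = min(scores)  # after the shift every exponent is <= 0, so exp() cannot overflow
    s = sum(math.exp(-(r - m) / beta) for r in scores)
    return m - beta * math.log(s / len(scores))

scores = [1.0, 1.5, 0.2, 0.9]  # hypothetical ensemble scores for one (x, y) pair

large_beta = pessimistic_reward(scores, beta=1e4)   # close to the ensemble mean
small_beta = pessimistic_reward(scores, beta=1e-3)  # close to the ensemble minimum

beta = 5.0
exact = pessimistic_reward(scores, beta)
# Second-order truncation in 1/beta: mean minus variance / (2 * beta).
truncated = fmean(scores) - pvariance(scores) / (2 * beta)
```

By Jensen's inequality `exact` never exceeds the ensemble mean, which is the sense in which this branch is pessimistic.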

2606.12451 2026-06-12 cs.AI cs.IR cs.LG 新提交

DailyReport: 一个用于评估搜索代理在日常搜索任务上的开放式基准

Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

发表机构 * University of Science and Technology of China（中国科学技术大学）； Meituan（美团）

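AI summary: Introduces the DailyReport benchmark, with 150 open-ended daily search tasks and 3,546 cascade rubrics; evaluation over decomposed subtasks and dimensions shows that current search-agent systems still fall short of user expectations.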

AI中文摘要

搜索代理（SAs）通常利用大型语言模型（LLMs）通过自主探索网络资源并将信息综合成全面响应来支持复杂的信息寻求任务。对于SAs的评估，先前的基准主要关注在真实用户场景中不太可能出现的专门任务。此外，它们依赖于粗略的任务级评分标准，通常限制了评估的可解释性。为弥补这一差距，我们引入了DailyReport，一个用于评估SA在日常搜索任务上能力的开放式基准。它包含150个开放式任务，配有3546个相关评分标准，捕捉了真实用户广泛讨论和及时的信息需求。每个任务被分解为子任务，并通过跨解缠维度的级联评分标准进行评估。通过级联性能归因和以用户为中心的聚合，我们为每个维度推导出高度可解释的分数，以及一个用户偏好分数。我们在17个代理系统上的结果表明，当前系统仍未能达到用户的期望。为促进未来研究，我们的数据集和代码已在 https://github.com/AGI-Eval-Official/DailyReport 公开。

英文摘要

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 新提交

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ：面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University（斯坦福大学）； Stanford University School of Medicine（斯坦福大学医学院）； Ghent University（根特大学）

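AI summary: Introduces OpenMedQ, a medical vision-language model pretrained on 14 datasets (about 3.35 million samples); it reaches a BLEU-1 of 75.9 on PathVQA, surpassing the 562B-parameter Med-PaLM M, and attains the highest average macro-F1 (0.757) across 8 unseen medical classification tasks.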

Comments Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

2606.13020 2026-06-12 cs.AI 新提交

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

SciR: 面向LLM科学推理的可控基准

Pierre Beckmann, Marco Valentino, Andre Freitas

发表机构 * Idiap Research Institute（Idiap研究所）； EPFL（瑞士联邦理工学院）； School of Computer Science, University of Sheffield（谢菲尔德大学计算机科学学院）； University of Manchester（曼彻斯特大学）； National Biomarker Centre, CRUK Manchester Institute（国家生物标志物中心，CRUK曼彻斯特研究所）

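AI summary: Introduces the SciR benchmark, which generates verifiable multi-paradigm scientific reasoning tasks from formal objects and controls two difficulty dimensions, information extraction and inference, to expose weaknesses of LLMs in scientific reasoning.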

AI中文摘要

科学推理中反复出现三种范式的推理形式：演绎、归纳和因果溯因。目前，在科学环境中可靠地评估LLM在这三种推理上的表现尚不可及：基于人工标注的科学基准成本高昂且缺乏机制性真值，而合成逻辑推理基准则不像真实的科学文档。我们引入了SciR，这是一个将多范式推理与可控科学渲染相结合的基准，以三个范式性科学问题为锚点。任务从形式对象（演绎树、归纳规则假设、因果图）生成，以保证可验证答案，然后通过每个轨道的领域调优体裁渲染成多文档科学论述。该构建使我们能够独立变化两个难度轴：提取推理所需关键信息的难度，以及原则性推理本身的难度。我们测试了六个模型。两个轴都对每个模型造成伤害，且其效应叠加。渲染甚至伤害了神经符号管道，后者将推理交给经过验证的求解器。这两个轴产生了每个模型的提取与推理轮廓：例如，像deepseek-r1这样的推理模型在推理轴上大多超过了非推理指令模型。据我们所知，SciR是第一个在提取和推理难度上具有参数化控制的多范式科学推理基准。

英文摘要

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

2606.13051 2026-06-12 cs.AI 新提交

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

AAbAAC：用于自身免疫信息抽取的标注语料库

Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet

发表机构 * Inserm, Université Paris Cité, U1163 Institut Imagine（法国国家健康与医学研究院、巴黎西岱大学、U1163 想象研究所）； Inria, Inserm, Université Paris Cité, U1346 HeKA（法国国家信息与自动化研究所、法国国家健康与医学研究院、巴黎西岱大学、U1346 HeKA）； Freelance researcher（自由研究员）

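AI summary: To address weak information-extraction performance in the autoimmunity domain, builds the AAbAAC corpus of 115 PubMed abstracts with manually annotated entities and relations, and validates its usefulness by fine-tuning NER models.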

Journal ref: BioNLP 2026 - 25th Workshop on Biomedical Natural Language Processing, ACL, Jul 2026, San Diego (CA), United States

AI中文摘要

尽管深度学习和大型语言模型推动了信息抽取的进步，但在高度专业化的生物医学领域，领域特异性复杂性对通用模型构成挑战，性能差距仍然存在。本文聚焦自身免疫领域，其中主要关注实体包括自身免疫疾病、自身抗体（即可能标记或导致这些疾病的分子）、其分子靶点、在体内的位置以及相关临床体征。我们提出了AAbAAC（自身抗体与自身免疫标注语料库），该语料库包含从PubMed精选的115篇摘要，并手动标注了实体及其关系。首先，AAbAAC被用于评估多种方法在命名实体识别（NER）任务上的表现；其次，用于微调NER模型。我们的研究展示了AAbAAC在自身免疫领域信息抽取中的实用性，表明微调后NER性能预期提升。这说明了小规模标注工作对专业领域的价值，并为自身免疫的计算研究做出了贡献。AAbAAC语料库可通过 https://github.com/f-maury/AAbAAC 获取。

英文摘要

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and second, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing the expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

2606.13141 2026-06-12 cs.AI 新提交

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

重新思考长视频中的RAG：检索什么以及如何使用？

Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang, Juntae Lee, Kyuwoong Hwang, Fatih Porikli, Hwanjun Song

发表机构 * Department of Computer Science, Cranberry-Lemon University（蔓越莓柠檬大学计算机科学系）

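AI summary: To address single-granularity retrieval and benchmark flaws in video retrieval-augmented generation, introduces the V-RAGBench benchmark and the CARVE method, which uses chunk-adaptive reranking to supply evidence that interleaves multiple configurations and improves performance over prior baselines.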

AI中文摘要

检索增强生成正从文本扩展到长、自我中心的视频，系统必须跨多种模态和时间粒度选择与查询相关的块。然而，VideoRAG的进展受到两个差距的限制：现有基准允许无需视频即可回答查询，掩盖了检索错误；先前方法对每个查询应用单一模态-粒度配置，忽略了块级变异性。我们通过引入V-RAGBench（一个⟨查询，证据块，答案⟩三元组基准，支持检索和生成的忠实、解耦评估）和CARVE（一种简单方法，跨配置运行并行检索器并采用块自适应重排序以识别每个块的最佳配置）来解决这两个问题。每个块随后以其在检索期间选择的最佳配置进入生成器，产生一种交错证据形式，其中块级决策在检索和生成两个阶段传播。CARVE优于八种最近的VideoRAG基线，提供给生成器的块交错多种配置而非共享单一配置，这是查询级方法无法实现的行为。

英文摘要

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
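
The chunk-level decision described above can be pictured with a small selection routine. This is only a reading of the abstract, not the CARVE implementation: the scoring dictionary, the configuration names such as `frames@10s`, and the function `chunk_adaptive_select` are all hypothetical. The sketch assumes each parallel retriever yields a relevance score per chunk and configuration, keeps the winning configuration for every chunk, and then passes the top-scoring chunks to the generator under their own configurations.

```python
def chunk_adaptive_select(scores, top_k):
    """Pick the best configuration per chunk, then keep the top_k chunks overall.

    scores: {chunk_id: {config_name: relevance_score}} gathered from parallel retrievers.
    Returns a list of (chunk_id, winning_config, score), best first.
    """
    winners = []
    for chunk_id, per_config in scores.items():
        config, score = max(per_config.items(), key=lambda kv: kv[1])
        winners.append((chunk_id, config, score))
    winners.sort(key=lambda w: w[2], reverse=True)
    return winners[:top_k]

# Hypothetical reranker scores for three chunks under two modality-granularity configurations.
scores = {
    "chunk_03": {"frames@10s": 0.81, "transcript@30s": 0.64},
    "chunk_07": {"frames@10s": 0.35, "transcript@30s": 0.77},
    "chunk_11": {"frames@10s": 0.20, "transcript@30s": 0.25},
}
selected = chunk_adaptive_select(scores, top_k=2)
```

Here the two selected chunks carry different configurations, which is the interleaving that a query-level method (one configuration for all chunks) cannot produce.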

2606.13148 2026-06-12 cs.AI 新提交

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

TerraBench: 智能体能否对异构地球系统数据进行推理？

Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

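AI summary: Introduces the TerraBench benchmark, built on the TerraAgent framework, which couples LLM planning with scientific tools for interactive reasoning over gridded data, satellite imagery, geospatial context, and simulators; it comprises 403 tasks and 24,500 execution steps.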

AI中文摘要

气候和环境决策日益需要对异构输入进行推理，包括网格化物理数据、卫星图像、地理空间背景和模拟器输出。天气和气候基础模型可以很好地预测，但不能以语言进行交互式推理，而大型语言模型（LLM）可以用语言推理，但不能直接操作高维地球系统数据。因此，地球科学中的真实科学工作流仍然得不到充分支持。我们引入了TerraBench，一个基于地球科学推理的基准，构建于TerraAgent之上，这是一个ReAct风格的可执行框架，它交织推理、工具调用和观察，将LLM规划与环境检索、地理空间处理、模拟和基于工件的计算等科学工具相结合。TerraBench在单一可执行界面中统一了对地球观测图像、网格化数据、GIS推理和模拟的分析，而先前的基准将这些能力隔离为狭窄的独立任务。它也是该领域中第一个将过程级工具使用指标与容忍度感知数值评分配对的方法。该基准包含403个广泛的智能体任务，涵盖三个轨道（基础、模拟器基础和文档基础验证）和八个应用领域，共24,500个经过验证的执行步骤。这些结果表明，可靠的地球科学智能体必须超越工具访问，协调异构工作流，精确参数化工具，并保留工件来源。

英文摘要

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.

2606.13192 2026-06-12 cs.AI 新提交

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

基于多模态大语言模型的移动用户体验推理：任务、基准与方法

Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng, Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao

发表机构 * Ant Group（蚂蚁集团）

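AI summary: Introduces the UXBench benchmark (2,000 VQA samples) for evaluating the UI reasoning ability of multimodal LLMs, and designs the UI-UX model, which uses reward routing and an asymmetric transition reward to reach 0.7963 accuracy on UXBench, surpassing Claude-4.5-Sonnet.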

Comments 10 pages, 6 figures, Accepted at CVPR 2026 Findings

AI中文摘要

以可用性、感知一致性和功能清晰性为中心的用户体验（UX）是现实世界用户界面（UI）的基础。多模态大语言模型（MLLMs）在用户界面领域的应用正在快速发展，例如视觉元素定位、图形用户界面（GUI）代理和设计到代码生成。然而，基于UI截图评估UX的研究工作仍不成熟。为此，我们提出UXBench，一个包含2000个VQA数据样本的新型多模态基准，旨在评估MLLMs执行基于UI的推理能力。UXBench包括基于真实UI截图的8个任务，需要对布局关系、视觉层次和内容一致性中的UX问题进行细粒度诊断。我们对主流MLLMs的广泛评估表明，它们在基于UI的推理能力上仍然存在根本性限制。结果强调了该领域进一步发展的必要性。为弥补这一差距，我们提出UI-UX，一个基于Qwen3-VL-4B-Thinking基础模型并通过强化学习增强的MLLM，具有两个关键创新：一个奖励路由机制，在推理过程中动态平衡感知理解和逻辑推理；以及一个非对称过渡奖励，抑制冗余或不足的推理步骤。实验表明，UI-UX在UXBench上达到了最先进的性能，准确率达到0.7963——超过Claude-4.5-Sonnet的0.6550——同时在各种UI任务中表现出强大的泛化能力并保持低推理延迟。

英文摘要

User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). The application of multimodal large language models (MLLMs) to user interfaces is evolving rapidly, in areas such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research on evaluating UX from UI screenshots is still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963, surpassing Claude-4.5-Sonnet's 0.6550, while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.

2606.13370 2026-06-12 cs.AI 新提交

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

在计算感知令牌预算下小型Llama风格语言模型训练动态的定量实验重复测量研究

Joe Dwyer

发表机构 * Department of Computer Information Science, ECPI University（ECPI大学计算机信息科学系）

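AI summary: Using a repeated measures design, analyzes how validation loss, perplexity, and related metrics evolve with token count when a small Llama-style model is trained under a fixed compute budget; rapid early improvement is followed by non-monotonic degradation, suggesting that compute-aware evaluation should examine training trajectories rather than endpoint metrics alone.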

详情

AI中文摘要

本研究考察了在固定、计算受限的令牌预算下训练的小型Llama风格语言模型的训练动态。研究并未仅通过终点性能来评估效率，而是采用定量实验重复测量设计，分析验证损失、验证困惑度、滚动波动性、回退行为、尖峰行为以及种子间变异性如何在基于令牌的训练区间内变化。在拥有426万参数的模型上，使用TinyStories语料库、CPU全精度训练以及约2000万累积训练令牌的目标预算，进行了六次独立训练运行。在21个区间内收集指标，产生了126个种子-区间观测值。重复测量方差分析显示，验证损失、验证困惑度和滚动波动性存在统计显著的区间效应。描述性轨迹揭示了早期快速改进，随后在后期训练区间出现非单调退化。平均验证损失从初始化的8.3552降至接近400万令牌时的2.7996，但在最终检查点增至3.9010。验证困惑度遵循相同模式，在训练早期急剧下降，随后上升。衍生遥测进一步显示了反复的验证损失回退，并且在预定义标准下没有区间汇总证据表明存在稳定阶段。这些发现表明，计算感知的语言模型评估应检查训练轨迹而非仅终点指标。在受限计算设置中，额外的令牌暴露可能增加计算成本而不产生成比例的泛化收益，而区间级遥测可以揭示终点指标可能掩盖的不稳定性、回归和收益递减。

英文摘要

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
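
The reported numbers can be tied together with a short calculation. The sketch assumes that validation perplexity is the exponential of the mean cross-entropy loss measured in nats per token, which is the usual convention but is not stated in the abstract; exponentiating a mean loss is also not the same as averaging per-seed perplexities, so the values are indicative only. The dictionary keys are descriptive labels, not identifiers from the study.

```python
import math

# Mean validation loss values quoted in the abstract.
mean_val_loss = {
    "initialization": 8.3552,
    "near 4M tokens": 2.7996,
    "final checkpoint": 3.9010,
}

# Assumption: perplexity = exp(mean cross-entropy in nats per token).
perplexity = {stage: math.exp(loss) for stage, loss in mean_val_loss.items()}

# Design check: 6 independent runs (seeds) measured at 21 intervals each.
observations = 6 * 21

# The trajectory is non-monotonic: best near 4M tokens, worse at the end of the budget.
best_stage = min(mean_val_loss, key=mean_val_loss.get)
```

Under this assumption the rise in loss from 2.7996 to 3.9010 corresponds to roughly a threefold increase in perplexity, which is why the endpoint alone understates how good the model was earlier in training.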

2606.13436 2026-06-12 cs.AI 新提交

Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems

元数据驱动分类中的评估主权：面向弱监督信息系统的多轨道框架

Raymond Vasquez

发表机构 * Lawrence Livermore National Laboratory（劳伦斯利弗莫尔国家实验室）

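AI summary: Addressing how label authority affects evaluation validity in weakly supervised metadata systems, introduces the concept of evaluation sovereignty and a multi-track evaluation framework, shows experimentally that model performance differs sharply between silver-label and gold-label evaluation, and reframes evaluation validity as a system-level property.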

AI中文摘要

机器学习中的评估通常被视为中立的测量过程。然而，在操作性信息系统中，评估结果往往受标签生成过程的影响。本文并非旨在提升分类性能，而是考察在不同标签权威体制下性能测量的有效性。这一问题在大规模元数据驱动系统中尤为突出，此类系统中的标签常不完整、不一致或仅受弱监督。我们引入评估主权概念，定义为性能指标独立于标签权威和监督体制的程度，并提出一个多轨道评估框架，系统性地变化训练和评估标签来源。通过对大规模科学元数据进行层次多标签分类，我们证明在操作性（“银标”）评估下表现强劲的模型在独立（“金标”）评估下性能显著下降，尤其在细粒度分类中。例如，Micro-F1从约0.54降至0.03。值得注意的是，基于排名的指标仍高于基线，揭示了潜在模型信号与分类有效性之间的分歧。这些发现表明，通常报告的性能指标可能反映的是与标注过程的对齐，而非真正的预测能力。因此，我们将评估有效性重新概念化为由标签治理塑造的系统级属性，并为审计在弱监督下运行的智能系统提供了一种实用方法论。

英文摘要

Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluation outcomes are often conditioned by the processes used to generate labels. This paper does not seek to improve classification performance. Instead, it examines the validity of performance measurement under differing label-authority regimes. This issue is particularly relevant in large-scale metadata-driven systems, where labels are often incomplete, inconsistent, or weakly supervised. We introduce evaluation sovereignty, defined as the degree to which performance metrics are independent of label authority and supervision regime, and propose a multi-track evaluation framework that systematically varies training and evaluation label sources. Using hierarchical multi-label classification on large-scale scientific metadata, we demonstrate that models exhibiting strong performance under operational ("silver") evaluation degrade substantially under independent ("gold") evaluation, particularly for fine-grained classification. For example, Micro-F1 decreases from approximately 0.54 to 0.03. Notably, ranking-based metrics remain above baseline, revealing a divergence between latent model signal and classification validity. These findings suggest that commonly reported performance metrics may reflect alignment with labeling processes rather than true predictive capability. We therefore reconceptualize evaluation validity as a system-level property shaped by label governance and provide a practical methodology for auditing intelligent systems operating under weak supervision.
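
How the same predictions can score very differently under two label authorities is easy to see with micro-averaged F1. The example is a toy: the label sets, the three documents, and the `micro_f1` helper are invented for illustration and do not reproduce the paper's data or its hierarchical setup. "Silver" labels are imagined as coming from the same process the model was trained on, "gold" labels from an independent annotator.

```python
def micro_f1(predicted, reference):
    """Micro-averaged F1 over per-document label sets (multi-label classification)."""
    tp = sum(len(p & r) for p, r in zip(predicted, reference))
    fp = sum(len(p - r) for p, r in zip(predicted, reference))
    fn = sum(len(r - p) for p, r in zip(predicted, reference))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical predictions for three documents.
predictions = [{"physics", "optics"}, {"biology"}, {"chemistry", "materials"}]
# Operational ("silver") labels, produced by the same process used for training.
silver = [{"physics", "optics"}, {"biology"}, {"chemistry"}]
# Independent ("gold") labels from a separate annotation authority.
gold = [{"physics"}, {"genomics"}, {"materials", "polymers"}]

silver_score = micro_f1(predictions, silver)
gold_score = micro_f1(predictions, gold)
```

The predictions are identical in both calls; only the reference changes. A multi-track evaluation in the spirit of the abstract would report both scores side by side instead of a single number.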

URL PDF HTML ☆

赞 0 踩 0

2606.13513 2026-06-12 cs.AI 新提交

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

CloudCons：云资源整合的全面端到端基准测试

Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li, Hongkai Li, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

发表机构 * Zhejiang University（浙江大学）； State Street Technology (Zhejiang) Ltd.（道富科技（浙江）有限公司）； Richoo AI ； Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security（杭州高新区（滨江）区块链与数据安全研究院）； Datadog AI Research

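AI summary: Introduces the CloudCons benchmark for evaluating the decision utility of forecasting models in cloud resource consolidation; finds that foundation models forecast accurately zero-shot but do not necessarily yield better decision utility, and analyzes how the choice of predictive quantile trades resource efficiency against reliability.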

Comments Accepted to KDD 2026

AI中文摘要

由于为了保证服务可靠性而采取的保守过度配置，云数据中心的资源利用率仍然较低。为了缓解这一问题，出现了“先预测后优化”的范式，通过预测未来需求来优化整合。尽管新兴的时间序列基础模型通过零样本泛化有望增强这一范式，但现有基准仅关注预测误差指标。这些先进模型的实际决策效用尚未得到验证，使得它们在下游任务中的实际价值不确定。为了弥合这一差距，我们提出了CloudCons，一个全面的端到端基准测试，旨在评估云资源整合特定背景下的预测模型。我们构建了高质量数据集，涵盖华为云、微软Azure和Google Borg的不同工作负载，捕捉从同步昼夜节律到随机脉冲式突发和高频噪声的不同服务特征。我们对统计模型、深度学习模型和基础模型进行了广泛评估。实验揭示了一个关键发现：虽然基础模型在零样本预测准确性上表现出色，但这种优势并不必然转化为更好的决策效用。具有实际意义的是，我们系统分析了预测分位数的选择如何作为一个关键杠杆。我们提供了校准这些选择的可行指南，以平衡资源效率和服务可靠性之间的权衡，为实际部署决策提供了重要见解。

英文摘要

Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.
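
The role of the predictive quantile as a lever between efficiency and reliability can be shown with a minimal provisioning example. Everything here is an assumption for illustration: the demand numbers are made up, the "forecast" is just an empirical quantile of past demand, and the two metrics (violation rate and average idle capacity) are simple stand-ins for whatever decision-utility measures the benchmark defines.

```python
def empirical_quantile(samples, q):
    """Value at quantile q of the sorted samples (simple index-based estimate)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

def evaluate(capacity, demand):
    """Return (violation rate, average idle capacity) for a fixed provisioned capacity."""
    violations = sum(d > capacity for d in demand)
    idle = sum(max(capacity - d, 0) for d in demand)
    return violations / len(demand), idle / len(demand)

history = [40, 42, 45, 47, 50, 52, 55, 60, 70, 90]  # hypothetical past CPU demand
future = [41, 44, 48, 53, 58, 65, 72, 85]           # hypothetical realized demand

results = {}
for q in (0.5, 0.9):
    capacity = empirical_quantile(history, q)
    results[q] = evaluate(capacity, future)
```

Provisioning at the median leaves little idle capacity but is exceeded often, while provisioning at the 0.9 quantile avoids violations at the cost of much more idle capacity. A forecaster with lower point error does not change this trade-off by itself, which is one way to read the gap between accuracy and decision utility reported above.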

2606.13602 2026-06-12 cs.AI 新提交

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

EpiBench：人工智能代理在表观基因组学分析中的可验证评估

Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman

发表机构 * LatchBio

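AI summary: Introduces the EpiBench benchmark, which tests the decision-making of AI agents in epigenomics workflows across 106 evaluations; the best system, GPT-5.5 / Pi, passes only 45% of attempts, with failures mostly attributed to a lack of deeper scientific judgment.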

AI中文摘要

我们介绍了EpiBench，一个用于短周期表观基因组学分析的可验证基准。EpiBench评估代理是否能够从真实工作流状态中做出明确定义的分析决策，并返回可确定性评分的答案。该基准包含CUT&Tag/CUT&RUN、ATAC-seq、ChIP-seq和DNA甲基化工作流中的106个评估。在来自16个模型-框架对的5,088条有效轨迹中，没有系统通过大多数尝试：GPT-5.5 / Pi以45.0%（143/318次尝试；95%置信区间（CI），36.3–53.7）领先，其次是GPT-5.5 / OpenAI Codex的39.9%（127/318次尝试；95% CI，31.6–48.3）。Claude Opus 4.8 Max / Pi和GPT-5.4 / Pi分别通过了39.0%（124/318次尝试；95% CI，30.2–47.8和31.0–47.0）。性能因检测类型而异，许多失败的运行仍包含部分正确答案。代理通常能找到正确的文件并计算出有用的中间结果，但当任务需要更深入、特定于检测的科学判断时，它们就会失败。

英文摘要

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
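
The counts in the abstract are internally consistent, which a few lines of arithmetic confirm. The only inference made here is that 318 attempts per model-harness pair over 106 evaluations means three attempts per evaluation; the abstract does not say so explicitly. The confidence intervals are not recomputed, because the abstract does not state how they were derived.

```python
evaluations = 106
pairs = 16
trajectories = 5088

attempts_per_pair = trajectories // pairs              # attempts made by each model-harness pair
attempts_per_eval = attempts_per_pair // evaluations   # inferred repeats per evaluation

# Pass counts quoted in the abstract, out of attempts_per_pair attempts each.
passes = {
    "GPT-5.5 / Pi": 143,
    "GPT-5.5 / OpenAI Codex": 127,
    "Claude Opus 4.8 Max / Pi": 124,
    "GPT-5.4 / Pi": 124,
}
pass_rate_percent = {
    name: round(100 * count / attempts_per_pair, 1) for name, count in passes.items()
}
```

The resulting rates match the percentages reported for each system, and none reaches 50%, consistent with the statement that no system passed a majority of attempts.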

2606.13670 2026-06-12 cs.AI 新提交

Automated reproducibility assessments in the social and behavioral sciences using large language models

使用大型语言模型自动评估社会与行为科学的可重复性

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

发表机构 * LMU Munich（慕尼黑大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； University of Cologne（科隆大学）

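AI summary: Uses large language models (LLMs) to automate reproducibility assessment in the social and behavioral sciences; across 76 studies, the LLM pipeline recovered the original effect sizes in 41% of the studies for which it produced a viable estimate and reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysis (34% and 74%).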

AI中文摘要

社会与行为科学的可重复性通常由独立研究人员重新分析原始数据来评估，以判断已发表的研究结果是否可复现。然而，这种方法资源密集且难以规模化。在此，我们展示了大型语言模型（LLMs）可以自动化可重复性评估。利用N=76项来自行为与社会科学、具有预定义声明的研究，我们比较了LLM生成的分析与原始结果和人类再分析。对于7项研究，LLM无法产生可行的效应量估计。对于其余研究，我们的LLM流程在41%的研究中恢复了原始效应量（Cohen's d的容忍度为+/-0.05）。此外，我们的LLM流程在96%的案例中得出了与原始研究相同的定性结论，其中结论指示再分析是否支持原始声明。相比之下，人类再分析者在34%的研究中恢复了原始效应量，并在74%的案例中得出了相同的定性结论。这些结果共同表明，LLMs可以作为自动化可重复性评估的可扩展工具，并为社会与行为科学中实证结果的系统审计提供基础。

英文摘要

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
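
The recovery criterion, agreement in Cohen's d within a tolerance of 0.05, can be written down directly. The sketch assumes the common pooled-standard-deviation form of Cohen's d for two independent groups; the abstract does not say which variant the pipeline uses, and the data and function names below are illustrative only.

```python
import math
from statistics import fmean, variance

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (fmean(group_a) - fmean(group_b)) / math.sqrt(pooled_var)

def recovered(original_d, reanalysis_d, tolerance=0.05):
    """True when the reanalysis effect size is within the tolerance of the original."""
    return abs(original_d - reanalysis_d) <= tolerance

# Toy data: both groups have sample variance 4 and their means differ by 1.
d = cohens_d([2, 4, 6], [1, 3, 5])

close_enough = recovered(0.50, 0.53)   # differs by 0.03
too_far = recovered(0.50, 0.60)        # differs by 0.10

# Studies for which the LLM produced a viable effect size estimate.
studies_with_estimate = 76 - 7
```

Note that the 41% recovery rate in the abstract refers to the remaining studies after the 7 without a viable estimate are set aside, so the denominator for that figure is `studies_with_estimate`, not 76.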

2606.12419 2026-06-12 cs.CY cs.AI 交叉投稿

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

GeoDial：面向几何问题求解的多模态对话式辅导数据集，包含可视化辅导轮次

Sankalan Pal Chowdhury, Junling Wang, Donya Rooein, April Yi Wang, Mrinmaya Sachan

发表机构 * ETH Zurich（苏黎世联邦理工学院）； ETH AI Center（苏黎世联邦理工学院人工智能中心）； Bocconi University（博科尼大学）

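AI summary: Introduces the GeoDial dataset of more than 1,300 teacher-student geometry dialogs, annotated with a scalable protocol that integrates dialog acts, visual highlighting, and feedback; fine-tuned vision-language models still struggle to generate accurate diagram highlights.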

AI中文摘要

几个教育领域严重依赖图表和视觉线索，但现有的大多数辅导数据集仅限于纯文本交互。这限制了AI辅导者的发展，使其无法像人类教师那样以视觉为基础的方式进行教学。因此，我们引入了GeoDial，这是一个多模态辅导数据集，包含来自经验丰富的数学教师的1300多个几何领域的师生对话，其中教学轮次明确地基于图表高亮。我们提出了一种可扩展的标注协议，该协议整合了对话行为、视觉高亮和反馈，从而能够对语言和视觉辅导行为进行细粒度监督。为了说明这一设置带来的挑战，我们在GeoDial上微调了几个视觉语言模型，并评估它们生成辅导话语和图表高亮的能力。虽然监督微调显著提高了生成对话的质量，但它难以生成准确的图表高亮，揭示了当前方法的一个关键局限性，并强调了需要更有效地将视觉推理与教学互动相结合的方法。

英文摘要

Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interactions. This limits the development of AI tutors that can teach in visually grounded ways used by human instructors. Thus, we introduce GeoDial, a multimodal tutoring dataset of over 1.3K teacher-student dialogs in the domain of geometry collected from experienced math teachers, where instructional turns are explicitly grounded in diagram highlights. We propose a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback, enabling fine-grained supervision of both language and visual tutoring behavior. To illustrate the challenges posed by this setting, we fine-tune several vision-language models on GeoDial and evaluate their ability to generate tutoring utterances and diagram highlights. While supervised fine-tuning substantially improves the quality of generated dialog, it struggles to produce accurate diagram highlights, revealing a key limitation of current methods and highlighting the need for approaches that more effectively integrate visual reasoning with pedagogical interaction.


2606.12443 2026-06-12 cs.CY cs.AI cs.CL 交叉投稿

Occupational Prompting Reveals Cultural Bias in Large Language Models

职业提示揭示大型语言模型中的文化偏见

Maksim E. Eren, Andrea Brennen, Ryan C. Barron, Eric Michalak

发表机构 * U.S. Government（美国政府）

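AI summary: Replaces nationality prompts with occupational prompts (e.g., accountant, teacher) to study how open-weight LLMs respond to value surveys; different occupations produce shifts within the cultural map, indicating that occupational roles elicit structured value patterns.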


AI中文摘要

社会角色塑造期望、优先级和判断，但大型语言模型（LLM）如何将职业身份与更广泛的文化价值模式关联仍不清楚。先前工作使用基于国籍的文化提示来研究LLM对价值观调查问题的响应如何与人类文化基准对齐。本文通过用职业提示替代文化提示，扩展了该框架，以检查职业角色线索如何影响开源LLM的价值观调查响应。使用基于综合价值观调查问题的调查评估流程，我们将模型响应投影到二维Inglehart-Welzel文化空间。我们提示开源LLM以职业身份（如会计师、教师、工程师和护士）回答问题，然后分析这些职业条件化响应在文化地图上的位置。结果表明，当用职业而非国籍身份提示开源LLM时，其响应仍位于文化地图的广泛西方倾向区域。然而，不同职业在该区域内引入偏移，产生不同的职业偏差。这表明职业提示并非被视为中性角色标签，而是引发结构化价值模式。这些发现将基于调查的文化偏见评估扩展到国籍提示之外，并提供了研究职业角色如何塑造LLM中价值表达的框架。

英文摘要

Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart--Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.


2606.12569 2026-06-12 cs.CL cs.AI 交叉投稿

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

EDEN：意大利语临床笔记的大规模语料库

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

发表机构 * Fondazione Bruno Kessler（布鲁诺·凯斯勒基金会）； Istituto di Ricerche Farmacologiche Mario Negri IRCCS（马里奥·内格里药理研究所IRCCS）； University of Padua（帕多瓦大学）

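AI summary: Introduces EDEN, a large-scale corpus of Italian emergency-department clinical notes comprising about 4 million anonymized notes and about 6,000 expert-annotated ones, intended to support the use of large language models in medicine, and proposes CRF-filling as a new structured information extraction benchmark.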


AI中文摘要

我们提出了EDEN（急诊电子笔记），这是一个新颖且独特的大规模临床笔记语料库，这些笔记来自意大利医院的急诊科。当前版本的语料库由约400万份完全匿名的临床笔记组成，涵盖了患者在急诊科停留期间的不同护理阶段。此外，约六千份笔记的子集由临床专家通过结构化病例报告表（CRF）进行了手动标注，该CRF包含132个项目，涉及急诊科两种患者情况：呼吸困难和意识丧失。项目可能取数值（例如血氧饱和度）、分类（例如意识水平）、二元（例如是否存在创伤）和混合值类型。标注过程涉及多位临床医生，并经过迭代修订以解决项目表述中的歧义，从而形成了一个结构丰富（尽管高度不平衡）的资源。该数据集旨在填补能够支持大语言模型在具体医疗应用中开发和使用的重要数据缺口。我们描述了数据收集协议、现场匿名化流程、语料库统计数据和标注方案。最后，我们提出了CRF填充作为一项新的结构化信息提取基准，并提供了基于Gemma-27B和MedGemma-27B的零样本基线。据我们所知，EDEN数据集是意大利语现有最大的免费临床笔记语料库。

英文摘要

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant to two patient situations in emergency departments: dyspnea and loss of consciousness. Items may take numerical (e.g., blood saturation), categorical (e.g., level of consciousness), binary (e.g., presence of traumas), or mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes for the Italian language.


2606.12581 2026-06-12 cs.SI cs.AI 交叉投稿

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

多关系网络中的图缩减：面向传播的缩减基准

Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka

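AI summary: Proposes the SORB benchmark framework for systematically evaluating how graph reduction affects influence maximization tasks, and finds that the effect of reduction depends on network type and evaluation metric.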


AI中文摘要

现实世界网络天生不完整、有噪声且动态演化，难以捕获所有参与者及其关系。其规模常使直接分析计算量大。虽然影响力最大化（IM）已被广泛研究，但图缩减作为预处理步骤及其对IM准确性的影响仍未被充分探索。本文引入面向传播的缩减基准（SORB），一个开源、标准化的框架，用于系统评估不同任务设置下的IM模型。SORB提供可扩展的流水线，操作于代表性真实世界网络集合（包括单层和多层结构），并将图缩减直接纳入评估过程。此设计将焦点从孤立分析IM算法转向量化图缩减如何改变预测性能。利用SORB，我们研究了多种IM场景下稀疏化和粗化的效果。结果表明，缩减的影响强烈依赖于网络类型（单层 vs. 多关系）和下游任务（$Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$）：稀疏化在单层网络上保持种子集质量，而扁平化多层网络无论缩减策略如何均表现出系统性排名退化。这些发现强调了在研究复杂网络传播过程时，进行缩减感知的多任务评估的重要性。

英文摘要

Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and incorporates graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction depends strongly on both the network type (single-layer vs. multirelational) and the downstream task ($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.


2606.12595 2026-06-12 cs.LG cs.AI cs.CV 交叉投稿

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

地理空间多模态基础模型的新兴灵活设计

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

发表机构 * Oak Ridge National Laboratory（橡树岭国家实验室）

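AI summary: Systematically compares geospatial foundation model architectures, evaluating their flexibility and performance under a unified setup to provide design guidance for multimodal reasoning.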


AI中文摘要

基础模型通过跨多样未标记地理空间模态的可扩展预训练，正在迅速改变地球观测。然而，其架构多样性——从编码器-only到编码器-解码器以及掩码自编码范式——使得以一致方式评估性能权衡变得具有挑战性。在这项工作中，我们对领先的、专为地理空间多模态推理设计的基础模型架构进行了同类比较，特别关注不同光谱波段配置下的灵活性。我们使用相同的自监督学习目标和训练数据集标准化预训练，并在GEOBench基准测试上，在一致参数化下评估所有模型的分类和分割任务。我们的结果为模型灵活性、模态对齐和下游任务性能之间的设计权衡提供了新见解。通过强调受控条件下的架构优势和局限性，本研究为构建能够进行鲁棒多模态推理的下一代地理空间基础模型提供了实用指导。

英文摘要

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.


2606.12620 2026-06-12 cs.SE cs.AI 交叉投稿

HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

HybridCodeAuthorship：一个用于行级代码作者归属检测的基准数据集

Luke Patterson, Li Wang, Adam Faulkner

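AI summary: Because existing benchmarks do not reflect real-world use of AI code assistants, the paper proposes the HybridCodeAuthorship dataset, which interleaves human- and AI-authored lines of code, and evaluates two detection algorithms on it.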

Comments Accepted to LREC 2026


DOI: 10.63317/4edsbxrqe8na
Journal ref: LREC 2026 proceedings (pp. 1520-1532)

AI中文摘要

由于基于大型语言模型（LLM）的AI代码助手的快速采用，行业代码库越来越多地成为AI和人类编写代码的混合体。出于风险管理和生产力分析的目的，实现对AI生成代码的细粒度位置检测至关重要。为了开发此任务的算法，需要高质量的基准来评估性能。然而，现有的基准往往包含学术性的LeetCode风格问题，并假设代码片段要么完全由人类编写，要么完全由AI编写，这并不能反映使用AI代码助手的行业代码库的多样意图和风格。为了填补这些空白，我们引入了HybridCodeAuthorship，这是一个新颖的Python代码文件基准，其中交错有人类和AI编写的代码行，以模拟AI代码助手的真实使用。在本文中，我们首先介绍了我们的数据集构建流程，该流程利用了CodeSearchNet，这是一个包含GitHub上开源仓库链接的大型集合。然后，我们在行级和块级上评估了两种最先进的AI生成代码检测算法的性能。实验结果表明，HybridCodeAuthorship是一个具有挑战性的基准，得分最高的算法AIGCode Detector在块级和行级代码检测任务上分别获得了0.48和0.56的最高F1分数。

英文摘要

Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are increasingly a hybrid of AI- and human-authored code. For risk management and productivity analysis purposes, it is crucial to enable fine-grained location detection of AI-generated code. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume a code snippet is either completely human-authored or completely AI-authored, which does not reflect the diverse intents and styles of industry codebases that use AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic use of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open-source repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line and chunk level. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark, with the top-scoring algorithm, AIGCode Detector, obtaining F1 scores of 0.48 and 0.56 on chunk-level and line-level code detection tasks, respectively.
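
To make the line-level metric concrete, the following minimal sketch computes an F1 score over per-line authorship labels, treating AI-authored lines as the positive class. The label strings and the choice of positive class are assumptions for illustration, not details confirmed by the abstract.

```python
def line_level_f1(true_labels, predicted_labels, positive="ai"):
    """F1 score for the positive class over per-line authorship labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A chunk-level variant could apply the same function after mapping each chunk of lines to a single label; how chunks are defined and labeled is left open here.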


2606.12673 2026-06-12 cs.LG cs.AI 交叉投稿

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

基于节点重构的零样本广义图异常检测框架

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang

发表机构 * School of Computing, KAIST（韩国科学技术院计算机学院）

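AI summary: Proposes the AlignGAD framework for zero-shot cross-domain graph anomaly detection, using a global unification module to align heterogeneous features, a clustering module to capture group-level anomaly patterns, and a node discrepancy scoring module to aggregate multi-view anomaly evidence.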


AI中文摘要

跨域图异常检测旨在识别未见过的目标图中的异常节点，在异构图数据的实际应用中展现出巨大潜力。然而，现有方法通常依赖于数据集特定的特征语义和结构模式，限制了其跨域泛化能力。为解决这一挑战，我们提出AlignGAD，一个零样本广义图异常检测框架。我们的框架基于三个关键组件：全局统一模块，用于对齐异构节点特征并在谱域中归一化图信号；聚类模块，用于构建聚类感知的图视图以捕获组级异常模式；以及节点差异评分模块，用于测量重构差异并聚合来自不同图视图的异常证据。在多个真实数据集上的实验证明了AlignGAD在零样本图异常检测设置下的有效性。

英文摘要

Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.


2606.12708 2026-06-12 cs.CL cs.AI 交叉投稿

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

AfriSUD：用于评估非洲语言模型的依存树库集合

Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum

发表机构 * Princeton University（普林斯顿大学）； Laboratory for Artificial Intelligence, Princeton University（普林斯顿大学人工智能实验室）； Gaston Berger University（加斯顿·伯杰大学）； Mila, McGill University（麦吉尔大学米拉研究所）； Canada CIFAR AI Chair（加拿大CIFAR人工智能教席）； Paris Nanterre University（巴黎南泰尔大学）； Paris-Saclay University（巴黎-萨克雷大学）； CNRS（法国国家科学研究中心）； Inria（法国国家信息与自动化研究所）； LORIA（洛林计算机科学实验室）； Université de Lorraine（洛林大学）； University of Trento（特伦托大学）； University of Minnesota–Twin Cities（明尼苏达大学双城分校）； Imperial College London（伦敦帝国学院）； Binghamton University（宾汉姆顿大学）； Makerere University（马凯雷雷大学）； Penn State University（宾夕法尼亚州立大学）； Mbarara University of Science and Technology（姆巴拉拉科技大学）； Chalmers University of Technology（查尔姆斯理工大学）； University of Ibadan（伊巴丹大学）； Nnamdi Azikiwe University（纳姆迪·阿齐基韦大学）； South African Centre for Digital Language Resources（南非数字语言资源中心）

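AI summary: To address the shortage of NLP resources for African languages, builds AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine African languages; evaluating a range of models reveals a significant syntax gap.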


AI中文摘要

尽管非洲语言具有语言多样性和全球重要性，但在支持NLP的研究和资源中仍代表性不足。我们通过引入AfriSUD来弥合这一差距，这是首个大规模句法标注树库集合，涵盖九种多样的非洲语言，跨越撒哈拉以南非洲的主要语系和地区。采用表层句法通用依存（SUD）框架，我们社区主导的努力提供了高质量、经母语者验证的数据，捕捉了如黏着和声调等类型学关键特征。我们在AfriSUD上评估了多种模型，包括非Transformer基线、多语言预训练编码器和LLM，用于词性标注和依存句法分析。我们的结果揭示了显著的句法差距，模型在九种语言上仍表现出明显局限性，表明现有架构可能无法完全捕捉非洲语言句法的结构多样性。

英文摘要

Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture typological key features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.


SupraBench: A Supramolecular Chemistry Benchmark

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame（圣母大学）； University of Connecticut（康涅狄格大学）

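AI summary: To evaluate large language models on supramolecular chemistry reasoning, releases SupraBench, the first supramolecular benchmark, developed with domain experts and comprising four fundamental tasks and one auxiliary vision task, together with SupraPMC, a 16M-token corpus.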


AI中文摘要

超分子化学，包括非共价主客体组装的研究，推动了各种应用的发展。然而，设计主客体系统仍然耗时，每个候选对需要数天的干实验室验证。尽管LLMs已成为一种快速的替代方案，在分子结合任务上表现出色，但目前尚无基准系统性地评估LLMs在超分子化学基本任务（如结合亲和力预测）中的主客体推理能力。为此，我们与领域专家合作发布了首个超分子基准，称为SupraBench，用于评估LLMs在化学推理中的表现。具体来说，我们设计了四个基本任务，即结合亲和力预测、最佳结合物选择、溶剂识别和主客体描述，以及一个辅助的基于视觉的分子识别任务。我们还发布了SupraPMC，一个从Europe PMC中提取的经过整理的1600万令牌的超分子化学文章语料库，以支持对超分子领域的适应。我们对一系列开源和专有LLMs进行了基准测试，发现LLMs在所有任务上都有很大的提升空间。在SupraPMC上的领域自适应预训练可以干净地迁移到分布内回归，但会与严格的字母格式输出进行权衡。此外，不同任务家族的难度分布差异很大，揭示了不同的失败模式，表明当前超分子化学推理中存在特定的差距。我们的源代码和基准数据集可在以下网址获取：此 https URL。

英文摘要

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently evaluates LLMs systematically for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain-adaptive pretraining on SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.


2606.13647 2026-06-12 cs.CL cs.AI cs.LG 交叉投稿

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

SkMTEB：斯洛伐克大规模文本嵌入基准与模型适配

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

发表机构 * Comenius University in Bratislava（布拉迪斯拉发夸美纽斯大学）； Cisco Systems（思科系统）； Technical University of Košice（科希策技术大学）； Kempelen Institute of Intelligent Technologies（肯佩伦智能技术研究所）

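AI summary: For Slovak, a low-resource West Slavic language, builds SkMTEB, the first MTEB-style text embedding benchmark (31 datasets, 7 task types), and develops the efficient, locally deployable models e5-sk-small and e5-sk-large, which, through vocabulary trimming and fine-tuning, remain competitive with commercial APIs despite size reductions of up to 62%.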

Comments ACL 2026


AI中文摘要

我们介绍了SkMTEB，这是首个针对斯洛伐克语（一种低资源西斯拉夫语）的全面MTEB风格文本嵌入基准，包含31个数据集，覆盖7种任务类型，几乎是现有斯洛伐克语多语言基准覆盖深度的4倍。我们对31个嵌入模型的评估表明，大型指令调优多语言模型表现最强，而现有的针对NLU任务训练的斯洛伐克语特定模型在嵌入任务上迁移效果不佳。为了满足高效、可本地部署的斯洛伐克语嵌入需求，我们通过对多语言E5模型进行词汇裁剪和微调，开发了\texttt{e5-sk-small}（45M参数）和\texttt{e5-sk-large}（365M）模型。尽管模型尺寸缩小了高达62%，我们的开源模型在性能上与专有API相当，同时仍可本地部署用于语义搜索和检索增强生成（RAG）。我们公开了基准、模型、数据集和代码，希望我们的方法能为其他资源匮乏的语言提供可复现的路径。

英文摘要

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.


2511.02627 2026-06-12 cs.AI 版本更新

On the Pitfalls of $\textit{RemOve-And-Retrain}$: A Data Processing Inequality Perspective

Junhwa Song, Keumgang Cha, Junghoon Seo

发表机构 * KAIST（韩国科学技术院）

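AI summary: Uses an information-theoretic perspective to expose flaws in the ROAR benchmark: data-independent post-processing can raise ROAR scores, leading to misjudgments of how informative attribution maps are, and a blurriness bias is identified.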

Comments Accepted at the 2026 ICML Workshop on Mechanistic Interpretability

2503.06573 2026-06-12 cs.CL cs.AI 版本更新

WildIFEval: Instruction Following in the Wild

WildIFEval: 野外指令遵循

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

发表机构 * The Hebrew University of Jerusalem（耶路撒冷希伯来大学）； IBM Research（IBM研究院）

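AI summary: Introduces the WildIFEval dataset of 7K real user instructions with multiple constraints for evaluating the instruction-following ability of LLMs, and finds that all models still have substantial room for improvement.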

Comments Accepted to the 5th Workshop on Generation, Evaluation and Metrics (GEM) at ACL 2026


AI中文摘要

最近的LLMs在遵循用户指令方面取得了显著成功，但处理具有多个约束的指令仍然是一个重大挑战。在这项工作中，我们引入了WildIFEval——一个包含7K条真实用户指令的大规模数据集，这些指令具有多样化的多约束条件。与以往的数据集不同，我们的收集涵盖了广泛的词汇和主题约束范围，这些约束是从自然用户指令中提取的。我们将这些约束分为八个高级类别，以捕捉它们在现实场景中的分布和动态。利用WildIFEval，我们进行了大量实验来评估领先LLMs的指令遵循能力。WildIFEval清晰地区分了小型和大型模型，并表明所有模型在此类任务上仍有很大的改进空间。我们分析了约束数量和类型对性能的影响，揭示了模型约束遵循行为的有趣模式。我们发布数据集以促进在复杂现实条件下指令遵循的进一步研究。

英文摘要

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.


2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG 版本更新

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench：评估语言模型中的程序性和多元道德推理，超越结果

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

发表机构 * University of Washington（华盛顿大学）； New York University（纽约大学）； Scale AI ； Harvard University（哈佛大学）； University of Michigan（密歇根大学）； UNC Chapel Hill（北卡罗来纳大学教堂山分校）； Center for AI Safety（人工智能安全中心）； Stanford University（斯坦福大学）； MIT（麻省理工学院）； University of Oxford（牛津大学）

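AI summary: Proposes the MoReBench benchmark, with 1,000 moral scenarios and more than 23,000 criteria, for evaluating procedural reasoning in the moral reasoning of language models; existing benchmarks are found not to predict model performance, and models show partiality toward specific moral frameworks.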

Comments 46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)


AI中文摘要

随着人工智能系统的进步，我们越来越依赖它们与我们共同或代替我们做出决策。为了确保这些决策符合人类价值观，我们不仅需要理解它们做出了什么决策，还需要理解它们如何得出这些决策。推理语言模型能够提供最终响应和（部分透明的）中间思考轨迹，这为研究AI的程序性推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同，道德困境是过程导向评估的绝佳测试平台，因为它们允许多种可辩护的结论。为此，我们提出了MoReBench：包含1000个道德场景，每个场景配有一组专家认为在推理该场景时必须包含（或避免）的评分标准。MoReBench包含超过2.3万条标准，包括识别道德考量、权衡利弊以及给出可操作的建议，覆盖了AI为人类道德决策提供建议以及自主做出道德决策的情况。此外，我们整理了MoReBench-Theory：150个示例，用于测试AI是否能在规范伦理学的五个主要框架下进行推理。我们的结果表明，规模定律以及现有的数学、代码和科学推理任务基准无法预测模型进行道德推理的能力。模型还显示出对特定道德框架（例如边沁式的行为功利主义和康德义务论）的偏好，这可能是流行训练范式的副作用。这些基准共同推动了面向过程推理的评估，以实现更安全、更透明的AI。

英文摘要

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, and covers cases in which AI advises humans on moral decisions as well as cases in which it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.


2512.21227 2026-06-12 cond-mat.mtrl-sci cs.AI 版本更新

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

PhononBench：面向晶体生成中动态稳定性的基于声子的大规模基准

Xiao-Qi Han, Ze-Feng Gao, Wen-Kao Li, Peng-Jie Guo, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China（中国人民大学物理学院）

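AI summary: Proposes PhononBench, the first large-scale benchmark for the dynamical stability of AI-generated crystals; using the MatterSim potential for efficient phonon calculations, it evaluates 133,838 structures generated by 7 models and finds an average dynamical-stability rate of only 32.15%.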

Comments 53 pages, 6 figures


AI中文摘要

近年来，生成式人工智能在晶体材料设计方面取得了显著进展，催生了基于图神经网络、扩散模型和大语言模型的方法。现有评估通常遵循稳定性-唯一性-新颖性（S.U.N.）框架，其中稳定性主要使用热力学标准评估，这未能完全捕捉材料实际存在所必需的动态稳定性。动态稳定性是决定材料能否被合成并持续存在的关键因素，声子谱计算是其评估标准。然而，此类计算的高计算成本阻碍了对生成晶体动态稳定性的大规模评估。在这项工作中，我们引入了PhononBench，这是首个针对AI生成晶体动态稳定性的大规模基准。利用最近开发的MatterSim原子间势，该势能在超过10,000种材料中实现了密度泛函理论（DFT）级别的声子预测精度，PhononBench能够对7个领先晶体生成模型生成的133,838个晶体结构进行高效的声子计算和动态稳定性分析。PhononBench揭示了当前生成模型的一个普遍局限性：除非另有说明，所有报告的动态稳定性指标均在-0.1 THz的声子频率阈值下评估，所有生成结构的平均动态稳定性率仅为32.15%，表现最佳的模型MatterGen也仅达到45.05%。此外，我们识别出32,995个在-0.001 THz严格阈值下整个布里渊区声子稳定的晶体结构。另外，一个基于网页的服务可通过此http URL访问，实现分钟级的超快声子预测。

英文摘要

In recent years, generative artificial intelligence has made significant advances in the design of crystalline materials, giving rise to approaches based on graph neural networks, diffusion models, and large language models. Existing evaluations commonly follow the stability-uniqueness-novelty (S.U.N.) framework, where stability is primarily assessed using thermodynamic criteria, which do not fully capture the dynamical stability essential for a material's practical existence. Dynamical stability is a key determinant of whether a material can be synthesized and persist, with phonon spectrum calculations serving as the standard for its evaluation. However, the high computational cost of such calculations has prevented large-scale assessment of dynamical stability in generated crystals. In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves density-functional-theory (DFT)-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient phonon calculations and dynamical-stability analysis for 133,838 crystal structures generated by 7 leading crystal generation models. PhononBench reveals a widespread limitation of current generative models: the average dynamical-stability rate across all generated structures is only 32.15%, and the top-performing model, MatterGen, reaches just 45.05% (unless otherwise specified, all reported dynamical-stability metrics are evaluated at a phonon-frequency threshold of -0.1 THz). In addition, we identify 32,995 crystal structures that are phonon-stable across the entire Brillouin zone under a strict threshold of -0.001 THz. A web-based service is also accessible at http://phononbench.cn/, enabling minute-level ultra-fast phonon predictions.
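
The threshold-based stability criterion can be sketched as follows. This is a minimal illustration that assumes each structure is represented by a flat list of phonon frequencies in THz, with imaginary modes reported as negative values; the function names are hypothetical and this is not the benchmark's actual code.

```python
def is_dynamically_stable(phonon_frequencies_thz, threshold_thz=-0.1):
    """A structure counts as stable if no phonon frequency falls below the threshold.

    Imaginary modes are assumed to be reported as negative frequencies.
    """
    return min(phonon_frequencies_thz) >= threshold_thz


def stability_rate(structures, threshold_thz=-0.1):
    """Fraction of structures whose frequencies all lie at or above the threshold."""
    stable = sum(is_dynamically_stable(freqs, threshold_thz) for freqs in structures)
    return stable / len(structures)
```

Tightening the threshold from -0.1 THz to -0.001 THz can only keep or shrink the set of structures counted as stable, which is why the strict setting yields a smaller set.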


2602.00122 2026-06-12 cs.CV cs.AI cs.MM 版本更新

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

VDE Bench: 评估图像编辑模型对视觉文档进行修改的能力

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

发表机构 * UCAS（中国科学院大学）； CASIA（中国科学院自动化研究所）； Tencent（腾讯）； CMU（卡内基梅隆大学）； WashU（圣路易斯华盛顿大学）； SJTU（上海交通大学）； XDU（西安电子科技大学）

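AI summary: Proposes VDE Bench, a benchmark dedicated to evaluating image editing models on bilingual Chinese-English and complex visual document editing tasks; a high-quality dataset and a new evaluation framework systematically quantify the accuracy of text modifications.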


AI中文摘要

近年来，图像编辑模型取得了显著进展，使用户能够通过自然语言指令灵活地交互式地操作视觉内容。然而，一个重要的但尚未充分研究的研究方向是密集的视觉文档图像编辑，这涉及在图像中修改文本内容，同时忠实保留原始文本风格和背景上下文。现有方法主要集中在英语场景和文本相对稀疏的图像上，因此无法充分解决密集、结构复杂的文档或非拉丁文字如中文。为弥合这一差距，我们提出了VDE Bench（视觉文档编辑基准），这是一个严格人工标注和评估的基准，专门设计用于评估图像编辑模型在双语中文-英文和复杂视觉文档编辑任务上的性能。该基准包含942个基于指令的图像编辑样本数据集，其种子图像涵盖密集的中文和英文文本文档，包括学术论文、海报、演示文稿、考试材料和报纸。此外，我们引入了一个新的评估框架，系统地量化了在OCR解析层面的编辑性能，从而实现了对文本修改准确性的细粒度评估。基于此基准，我们对代表性图像编辑模型进行了全面评估。人类验证显示，人类判断与自动化评估指标之间有一致性。VDE Bench构成了评估图像编辑模型在双语密集文本视觉文档性能的首个系统性基准。

英文摘要

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction is dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human-annotated and human-evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high-quality dataset of 942 instruction-based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense-text visual documents.


2602.07294 2026-06-12 cs.CE cs.AI 版本更新

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Fin-RATE：面向SEC文件的金融分析与追踪评估基准

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

发表机构 * Tongji University（同济大学）； University of California, San Diego（加州大学圣地亚哥分校）； Yale University（耶鲁大学）； Goldman Sachs（高盛集团）

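AI summary: To address the need for LLMs to analyze complex regulatory documents in finance, proposes Fin-RATE, a benchmark based on SEC filings that evaluates models along three task pathways and finds significant performance drops in cross-document and cross-period analysis.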


DOI: 10.1145/3770855.3817528
Journal ref: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

AI中文摘要

随着大型语言模型（LLM）在金融领域的部署日益增多，LLM越来越需要解析复杂的监管披露文件。然而，现有基准通常关注孤立细节，未能反映专业分析的复杂性——这种分析需要综合多个文档、报告期和公司实体的信息。此外，这些基准无法区分错误源于检索失败、生成不准确、领域特定推理错误还是对查询或上下文的误解，从而难以精确诊断性能瓶颈。为弥补这些不足，我们引入Fin-RATE，这是一个基于美国证券交易委员会（SEC）文件构建的基准，通过三条路径模拟金融分析师的工作流程：单个披露文件内的细节导向推理、共享主题下的跨实体比较，以及同一公司在多个报告期内的纵向跟踪。我们在真实上下文和检索增强设置下，对17个领先的LLM（包括开源、闭源和金融专用模型）进行了基准测试。结果显示，当任务从单文档推理转向纵向和跨实体分析时，性能显著下降，准确率分别下降18.60%和14.35%。这种下降与比较幻觉、时间和实体不匹配的增加有关，并进一步反映在推理质量和事实一致性的下降上——这些局限性是现有基准尚未正式分类或量化的。

英文摘要

With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35%, respectively, as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations and temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency, limitations that existing benchmarks have yet to formally categorize or quantify.

2602.10132 2026-06-12 physics.plasm-ph cs.AI (updated version)

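How Reliable Are LLMs at Rolling Dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni

Affiliations: Università degli Studi di Firenze (University of Florence)

AI summary: Using a benchmark of discrete probability problems, the paper finds that LLMs reach 0.96 accuracy on standard problems but only 0.59 on counterintuitive ones, and that they are vulnerable to token bias and misleading prompts.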

2606.12702 2026-06-12 cs.AI (new submission)

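A Mathematics Forum Platform for Collaborative Problem Solving and AI Reasoning Dataset Generation

Akbar Erkinov, Nurmukhammad Abdurasulov

Affiliations: Independent Researchers, San Francisco, CA, USA

AI summary: Proposes a forum system with an integrated image-to-LaTeX conversion pipeline that removes friction in sharing mathematical content, supports desktop and mobile clients, and yields a community-validated dataset of math problems for training AI reasoning.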

Comments 11 pages, 3 figures

Abstract

Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition (OCR) tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers (image processing, rendering, and storage) and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.
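
The detection and delimiter-normalisation step described in this abstract can be illustrated with a small sketch. The abstract does not give the actual rules, so everything below is an assumption: the function names `classify` and `normalize`, the regex heuristics, and the choice of `$...$` / `$$...$$` as target delimiters are hypothetical, and no Mathpix API call is made (the input is just a string standing in for OCR output).

```python
import re

# Hypothetical heuristics; the paper's real detection rules are not given in the abstract.
INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)    # \( ... \)
DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)   # \[ ... \]
MATH_DELIMS = re.compile(r"\\\(|\\\[|\$")           # any math delimiter
LATEX_MARKERS = re.compile(r"\\[a-zA-Z]+|\^")       # a command such as \frac, or a caret


def classify(ocr_output: str) -> str:
    """Guess whether OCR output is bare LaTeX, text with delimited math, or plain text."""
    has_delims = bool(MATH_DELIMS.search(ocr_output))
    has_markers = bool(LATEX_MARKERS.search(ocr_output))
    if has_markers and not has_delims:
        return "latex"
    return "text_with_math" if has_delims else "plain_text"


def normalize(ocr_output: str) -> str:
    """Rewrite math delimiters to $...$ (inline) and $$...$$ (display)."""
    if classify(ocr_output) == "latex":
        # Bare LaTeX is treated as one display formula.
        return f"$${ocr_output.strip()}$$"
    # Callables are used as replacements so backslashes in the math are kept verbatim.
    text = DISPLAY.sub(lambda m: f"$${m.group(1).strip()}$$", ocr_output)
    return INLINE.sub(lambda m: f"${m.group(1).strip()}$", text)


bare = normalize(r"\frac{a}{b}")
mixed = normalize(r"The area is \( \pi r^2 \).")
```

A heuristic this simple will misclassify some inputs (for example plain text that happens to contain a caret), which is one reason a live preview before committing the post, as the abstract describes, is useful.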

2606.12983 2026-06-12 cs.AI (new submission)

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

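En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin, Yao-Ting Hsieh, Cheng Liang, Hsiang-Yu Tsou, Mu-Chi Chen, Yu-Kai Hung, Shao-Chun Ho, Po-Hsuang Huang, Shih-Hao Hung, H. T. Kung

Affiliations: National Taiwan University; Academia Sinica; Harvard University

AI summary: Proposes STG, a framework that exploits the inherent structure of hardware designs to generate deterministic testbenches; it runs 720x faster than an iterative LLM-based flow, with higher compilation success, higher coverage, and fewer false-pass verdicts, and is also used for data curation and test-time scaling.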

Comments 9 pages, 10 figures

Abstract

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

2606.12991 2026-06-12 cs.AI (new submission)

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

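Yifan Zhao, Lang Qin, Jintai Chen

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); AI-Peptide Drug Design Joint Laboratory

AI summary: Proposes APCyc, a framework that uses an expanded residue vocabulary, explicit encoding of cyclization sites and linkage types, and Bayesian posterior guidance to perform target-aware de novo cyclic peptide design while jointly optimizing multiple physicochemical properties.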

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract

Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address this limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at https://github.com/HKUSTGZ-ML4Health-Lab/APCyc.

2606.13042 2026-06-12 cs.AI cs.CV (new submission)

Augmentation techniques for video surveillance in the visible and thermal spectral range

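Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

Affiliations: Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB

AI summary: For multispectral CNN-based object detection, the paper studies the differences between visible and thermal infrared imagery and explores how data augmentation techniques affect classification accuracy, with the aim of improving surveillance performance.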

Comments 8 pages

Journal ref: SPIE Security + Defence, Strasbourg, 10th September 2019

Abstract

In intelligent video surveillance, cameras record image sequences during day and night. This commonly demands different sensors, and it is not unusual to combine them to achieve better performance. We focus on the case in which a long-wave infrared camera records continuously, a second camera records in the visible spectral range during daytime, and an intelligent algorithm supervises the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in that they contain color and distinct texture information but no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and the peculiarities of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network remains a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared data. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

2606.13176 2026-06-12 cs.AI (new submission)

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

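Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

Affiliations: University of Oxford; Oxford Suzhou Centre for Advanced Research

AI summary: Proposes Cognitive Relative Policy Optimization (CRPO), which aligns LLM reasoning with human cognitive processes through stage-dependent uncertainty modeling and entropy regularization, improving weighted F1 by an average of 10.4 percentage points across 8 mental health datasets.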

Abstract

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.
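
One way to picture the stage-wise entropy regularization mentioned in this abstract is a coefficient that rewards token entropy in early reasoning stages and penalizes it in later ones. The sketch below is only an illustration under stated assumptions: the linear schedule, the `beta_max` value, and the function names are hypothetical, and the abstract does not specify the actual CRPO objective.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def stage_coefficient(stage, num_stages, beta_max=0.01):
    """Assumed schedule: linear from +beta_max (first stage) to -beta_max (last stage)."""
    if num_stages < 2:
        return 0.0
    frac = stage / (num_stages - 1)
    return beta_max * (1.0 - 2.0 * frac)


def entropy_bonus(stage_token_dists, beta_max=0.01):
    """Sum over stages of (stage coefficient) x (mean token entropy in that stage).

    stage_token_dists[s] is a non-empty list of per-token probability
    distributions produced during reasoning stage s.
    """
    num_stages = len(stage_token_dists)
    bonus = 0.0
    for stage, dists in enumerate(stage_token_dists):
        mean_entropy = sum(entropy(d) for d in dists) / len(dists)
        bonus += stage_coefficient(stage, num_stages, beta_max) * mean_entropy
    return bonus


# Toy rollout with three stages: uncertain early, confident at the end.
rollout = [[[0.5, 0.5]], [[0.5, 0.5]], [[1.0, 0.0]]]
bonus = entropy_bonus(rollout)
```

In a policy-gradient setup such a term would be added to the reward or loss; with this schedule a rollout that explores early and commits late receives a positive bonus, while one that stays uncertain in the final stage is penalized.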

2606.13211 2026-06-12 cs.AI (new submission)

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

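Omar Alshahrani, Muzammil Behzad

Affiliations: King Fahd University of Petroleum & Minerals, Saudi Arabia; SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Saudi Arabia

AI summary: Presents a cross-modality analytical framework that unifies hallucination taxonomy, detection, and mitigation strategies across five imaging modalities, finds that general-purpose foundation models outperform medical-specialized models on hallucination benchmarks, and maps the findings to FDA lifecycle oversight.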

Abstract

AI systems are being deployed across medical imaging faster than their failure modes are understood. Currently, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences for, among other things, biopsy decisions, staging, and treatment planning. This structured narrative review synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities? (2) how do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

2606.13241 2026-06-12 cs.AI (new submission)

Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

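Francesco Massa, Marco Cristofanilli

Affiliations: Regolo AI; Seeweb

AI summary: Proposes Brick, a multimodal router that combines six-dimensional capability scores with query difficulty estimates and dispatches models via a cost-penalized geometric rule, enabling a flexible trade-off between quality and cost.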

Comments 17 pages, 5 figures. Technical report

Abstract

Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

2606.13249 2026-06-12 cs.AI (new submission)

Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis

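Seongjin Kim, Sungil Kim

Affiliations: Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)

AI summary: Proposes a multi-field hybrid retrieval-augmented generation framework that uses structured incident cards and a hierarchical cause taxonomy, with field-aware hybrid retrieval and rank fusion, to substantially improve retrieval and generation quality for maritime accident root cause analysis.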

Abstract

Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents and drafting consistent reports from decades of records remains labor-intensive. This paper proposes a multi-field hybrid retrieval-augmented generation (RAG) framework for automated maritime RCA, utilizing a comprehensive dataset of 13,329 Korea Maritime Safety Tribunal (KMST) reports (1971-2025). We transform raw adjudications into a structured knowledge base of "incident cards", indexing three distinct fields (Summary, Causes, and Disposition) alongside a hierarchical L1/L2 cause taxonomy. Our retrieval strategy employs a field-aware hybrid approach, fusing sparse and dense rankings via Reciprocal Rank Fusion (RRF). Given the lack of large-scale expert relevance labels, we evaluate retrieval performance using ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score. Experimental results demonstrate that our proposed retrieval significantly outperforms baseline methods, improving NormRecall@100 from 0.18 to 0.55. Furthermore, grounding the generator on the retrieved precedents enhances RCA generation quality over an LLM-only baseline, increasing the LLM-as-a-judge score from 3.34 to 3.72. These findings suggest that field-aware RAG can substantially streamline maritime safety investigation workflows by enabling faster precedent search and more consistent, evidence-based RCA drafting.
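
For readers unfamiliar with Reciprocal Rank Fusion, the following minimal sketch shows the standard formula, in which each document scores the sum of 1 / (k + rank) over the rankings that contain it. The constant k = 60 is a commonly used default and is an assumption here, as are the function name and the incident-card IDs; the abstract does not state the paper's settings or how its per-field rankings are weighted.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives sum(1 / (k + rank)) over the lists in which it
    appears, with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical incident-card IDs returned by a sparse and a dense retriever.
sparse_ranking = ["card_17", "card_03", "card_42"]
dense_ranking = ["card_03", "card_99", "card_17"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because RRF uses only ranks, it needs no calibration between sparse and dense scores that live on different scales, which is a common reason for choosing it in hybrid retrieval.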

2606.13302 2026-06-12 cs.AI cs.LG (new submission)

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

Affiliations: Namibia University of Science and Technology; Indian Institute of Technology Indore; Namdeb Diamond Corporation

AI summary: Proposes a physics-guided deep spatiotemporal learning framework that combines automated region detection, sim-to-real transfer learning, and physics-informed regularization to estimate nearshore wave peak period directly from coastal video, and validates transformer-based and lightweight recurrent-convolutional architectures.

Abstract

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems like buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods lack physical interpretability, feasibility, and validation for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework for direct estimation of nearshore wave peak periods from passive coastal video streams is proposed. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, such as transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures achieved better instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for reducing physically implausible predictions. Explainability auditing also helped to focus attention on hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

2606.12413 2026-06-12 cs.CY cs.AI cs.CE cs.CL cs.SE (cross-list)

AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas

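Andrei Lazarev, Dmitrii Sedov

AI summary: Proposes a framework that uses AI SciBrief, an LLM-powered platform, to automatically generate digests of scientific trends, helping students overcome information overload and move more quickly from information searching to knowledge creation.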

Comments This is the version of the article accepted for publication in TELE 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/TELE66816.2025.11211989

DOI: 10.1109/TELE66816.2025.11211989
Journal ref: 2025 5th International Conference on Technology Enhanced Learning in Higher Education (TELE), Lipetsk, Russian Federation, 2025, pp. 365-369

Abstract

Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initial stages of the research process and suppresses motivation. In response, this article introduces a pedagogical framework that leverages AI SciBrief, a platform powered by a Large Language Model (LLM) designed to automatically generate digests of scientific trends. We describe how this multidisciplinary tool, with initial coverage in finance, medicine, and education, can be integrated into the curriculum to overcome this "entry barrier." The framework provides concrete methodologies for utilizing these digests to facilitate topic selection for term papers, accelerate literature reviews for dissertations, and enable postgraduate students to continuously monitor emerging trends. We conclude that AI SciBrief functions as a "gateway to research," effectively reducing students' cognitive load and empowering them to transition more rapidly from information searching to knowledge creation.

2606.12422 2026-06-12 cs.CY cs.AI cs.HC (cross-list)

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

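Zewei Tian, Alex Liu, Lief Esbenshade, Michael Xiao, Zachary Zhang, Yulia Lápicus, Thomas Han, Kevin He, Min Sun

Affiliations: University of Washington; Colleague AI

AI summary: Builds an LLM grader from commercially available foundation models through context engineering and evaluates its scoring agreement on mathematics, science, and ELA using MCAS data; larger models perform well in mathematics and science but vary more in ELA, suggesting AI is better suited as a formative tool.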

Comments Published on the Proceedings of NCME 2026 Conference (https://www.xcdsystem.com/proceedings/ncme/8DbqHwv/presentation/28064.cfm?uuid=3EC982ED-A989-8E53-B42BC86334206028)

Abstract

The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. While automated scoring systems and machine learning techniques have existed for decades, generative AI (GenAI) now enables educators to implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations and evaluates an LLM grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric. Drawing on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, we report the Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and English Language Arts (ELA), using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results demonstrate that LLM graders, especially when based on foundation models with more parameters, achieve substantial agreement with human raters in mathematics and science assessments, while performance varies in ELA, suggesting generic foundation models can be effective at scoring in given contexts. Additional analysis of teacher and student feedback reveals strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs function most effectively as formative tools rather than summative evaluators. Our findings indicate that thoughtfully designed hybrid models that combine AI efficiency with teacher judgment can reduce workload, enhance feedback quality, and support equitable assessment practices without displacing professional expertise.
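
As a reference point for the agreement metric named in this abstract, the sketch below computes Quadratic Weighted Kappa from two lists of integer rubric scores using the standard definition (observed versus chance-expected disagreement, weighted by squared score distance). The function name and the toy scores are hypothetical; the paper's own scoring pipeline and its PRMSE computation are not described in the abstract and are not reproduced here.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score scale needs at least two categories")
    total = len(rater_a)

    # Confusion matrix of observed score pairs.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / total  # agreement expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave one identical score to every item; treat as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


human = [0, 1, 2, 3]             # toy human rubric scores
model_same = [0, 1, 2, 3]        # toy model scores that match exactly
model_reversed = [3, 2, 1, 0]    # toy model scores in reverse order
qwk_same = quadratic_weighted_kappa(human, model_same, 0, 3)
qwk_reversed = quadratic_weighted_kappa(human, model_reversed, 0, 3)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement; the quadratic weights make a two-point miss cost four times as much as a one-point miss.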

2606.12424 2026-06-12 cs.CY cs.AI cs.HC (cross-list)

AI-Automation Tooling in Computer Engineering Education: Mixed-Methods TAM/UTAUT Evidence for a General Acceptance Attitude

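Aung Pyae

AI summary: A mixed-methods study of undergraduates' acceptance of AI automation tooling (the n8n platform) finds that six TAM/UTAUT constructs collapse into a single general acceptance factor, with Performance Expectancy strongest and Hedonic Motivation weakest, providing a theoretical basis for curricular integration.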

Abstract

As generative AI and low-code workflow platforms become routine in software practice, a key educational question is whether the next generation of computer engineers will accept these tools as useful, usable, and worthy of sustained engagement. This paper reports a mixed-methods, cross-sectional study of undergraduate computer engineering students' acceptance of AI automation tooling, instantiated through the open-source platform n8n across three identically scripted workshops in Thailand (n = 103). A 12-item, five-point Likert instrument mapped to six TAM/UTAUT constructs - Performance Expectancy (PE), Effort Expectancy (EE), Behavioral Intention (BI), Self-Efficacy (SE), Hedonic Motivation (HM), and Output Quality (OQ) - was complemented by inductive thematic analysis of open-ended feedback. Analyses combined ordinal reliability estimation, bootstrap confidence intervals, non-parametric tests, multiple-comparison-controlled correlations, polychoric dimensionality diagnostics, a common-method-bias check, and between-session comparisons. Acceptance was favorable across all six constructs with large effect sizes, with PE emerging as the strongest construct and HM as the weakest. Dimensionality diagnostics further revealed that canonical TAM/UTAUT sub-facets collapsed into a single general acceptance factor in this short-form post-workshop context, a finding with important methodological and theoretical implications. Qualitative themes converged with the quantitative profile regarding usefulness and enthusiasm but diverged on output quality, revealing a small yet articulate reliability-skeptical minority. The findings support the curricular adoption of AI automation tooling in undergraduate computing education and identify three theory-grounded instructional levers: instruction-sequencing scaffolds, self-efficacy supports, and trust-calibration interventions.
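
Among the analyses listed in this abstract, bootstrap confidence intervals are easy to illustrate. The sketch below uses a percentile bootstrap for the mean of five-point Likert responses; the choice of the percentile variant, the resample count, the seed, and the response values are all assumptions made for illustration, since the abstract does not say which bootstrap procedure the study used and the data shown are not from the paper.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical five-point Likert responses for one construct (not data from the paper).
responses = [5, 4, 4, 5, 3, 4, 5, 5, 4, 2, 4, 5]
lower, upper = bootstrap_mean_ci(responses)
```

The percentile bootstrap makes no normality assumption, which suits bounded, skewed Likert data, although other bootstrap variants can behave better with small samples.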

2606.12425 2026-06-12 cs.CY cs.AI cs.ET cs.HC cs.LG 交叉投稿

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

面向入门编程教育的可解释AI助手：通过教师-AI协作提高反馈可靠性

Muntasir Hoq, Griffin Pitts, Bradford Mott, Seung Lee, Jessica Vandenberg, Shuyin Jiao, Narges Norouzi, James Lester, Bita Akram

发表机构 * North Carolina State University（北卡罗来纳州立大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出一种可解释AI驱动的课堂助手，通过分析学生代码、映射逻辑错误到教师识别的误解并提供教师撰写的反馈，提高入门编程课程中反馈的可靠性和可解释性。

Comments Full paper accepted to the 27th International Conference on AI in Education (AIED 2026)

详情

AI中文摘要

主动学习被广泛认为是提高入门编程课程学习效果的有效方法。然而，不足的教学支持往往限制了学生获得及时、个性化反馈的机会，而这对于掌握基础编程概念至关重要。尽管最近AI的进展，特别是大型语言模型，为反馈提供了可扩展的机会，但可解释性和可靠性问题仍然存在。在本文中，我们提出了一种AI驱动的课堂助手，它利用可解释的AI模型分析学生代码，将逻辑错误映射到教师识别的误解，并提供教师撰写的反馈，从而将可靠性建立在教师定义的教学知识基础上。为了评估我们框架的有效性，我们进行了专家评估以检查其与教师验证反馈的一致性，并在课堂环境中部署了该系统以评估学生对其可用性的看法。结果表明，该助手能够为学生提供准确的、经过教师验证的反馈，同时培养积极的体验。

英文摘要

Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

2606.12500 2026-06-12 cs.LG cs.AI 交叉投稿

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

基于机器学习的微观仿真从模拟交通冲突改进碰撞频率预测

Xian Liu, Carlo G. Prato, Gustav Markkula

AI总结本文利用机器学习行为模型替代传统规则模型进行交通微观仿真，通过极端值理论分析模拟冲突预测碰撞频率，在英国利兹五个信号交叉口验证了ML模型无需地点校准即可提升预测准确性。

详情

AI中文摘要

交通微观仿真结合替代安全措施越来越多地被用作历史碰撞数据的主动替代方案，用于预测当前或计划道路基础设施设计的碰撞频率。然而，现有的基于微观仿真的安全研究采用了简化的基于规则的行为模型，这些模型能较好地再现交通流，但往往无法生成真实的冲突动态，限制了碰撞预测的准确性。机器学习（ML）行为模型的最新进展提供了一个有希望的机会，通过直接从大规模轨迹数据集中学习人类驾驶行为，可能提高微观仿真的真实性和碰撞频率预测。为了研究这种可能性，我们对英国利兹的五个真实信号交叉口进行了交通微观仿真，使用了标准的基于规则模型和最先进的ML模型。使用二维碰撞时间指标分析模拟车辆轨迹以识别模拟冲突，然后使用极端值理论建模以预测碰撞频率。结果表明，ML模型的冲突产生的碰撞预测与实际碰撞数据一致，而基于规则的模型由于缺乏对特定模拟交叉口的模型校准，无法产生有意义的预测。直接使用ML生成的模拟碰撞来预测实际碰撞频率也产生了较差的结果，这表明尽管当前的ML模型可以真实地再现冲突，但尚不能生成真实的碰撞。总体而言，研究结果表明，基于ML的行为模型在无需特定地点模型校准的情况下，有望从模拟冲突中改进碰撞预测，并为基于ML的交通微观仿真指明了明确的未来方向。

英文摘要

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.
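
To make the Time-to-Collision idea concrete, here is a minimal sketch of one possible two-dimensional formulation. It assumes each vehicle is a disc moving at constant velocity, which is a simplification chosen for illustration; the abstract does not specify the authors' exact metric, and the function name `ttc_2d` and its parameters are hypothetical.

```python
import math

def ttc_2d(p1, v1, p2, v2, radius_sum):
    """Time (s) until two constant-velocity discs first touch, or None if they never do.

    p1, p2: (x, y) positions in metres; v1, v2: (vx, vy) velocities in m/s.
    radius_sum: sum of the two disc radii in metres (an illustrative vehicle model).
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]          # relative position
    dvx, dvy = v2[0] - v1[0], v2[1] - v1[1]        # relative velocity
    # Solve |d + dv * t|^2 = radius_sum^2, i.e. a*t^2 + b*t + c = 0.
    a = dvx * dvx + dvy * dvy
    b = 2.0 * (dx * dvx + dy * dvy)
    c = dx * dx + dy * dy - radius_sum * radius_sum
    if c <= 0.0:
        return 0.0                                  # discs already overlap
    if a == 0.0:
        return None                                 # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                                 # paths never come close enough
    t = (-b - math.sqrt(disc)) / (2.0 * a)          # earlier of the two contact times
    return t if t >= 0.0 else None                  # negative: contact lies in the past

# Head-on approach along the x axis: 48 m of free gap closing at 20 m/s.
print(ttc_2d((0, 0), (10, 0), (50, 0), (-10, 0), radius_sum=2.0))
```

In a pipeline like the one described, an encounter would typically be flagged as a conflict when its minimum TTC falls below a chosen threshold, and those minima would then feed the extreme-value model.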

2606.12662 2026-06-12 cs.SD cs.AI cs.LG 交叉投稿

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

BASENet: 基于频带自适应的跨频带注意力语音增强网络

Damien Martins Gomes, François Capman

发表机构 * Thales SIX GTS, FRANCE（泰雷兹SIX GTS公司，法国）

AI总结提出BASENet，通过Bark尺度划分频带并分配自适应容量编码器，结合跨频带注意力模块，以最少参数实现高PESQ和STOI，适用于资源受限设备。

详情

AI中文摘要

语音增强模型通常对所有频率采用统一容量，忽略了人类听觉的非均匀频谱分辨率。我们提出BASENet，一种频率自适应架构，将频谱划分为Bark尺度频带，并为每个频带分配基于临界频带密度的缩放容量编码器，自动为感知密集的低频分配更深的分支，为高频分配更轻的分支。跨频带注意力模块通过紧凑的频率池化表示以线性复杂度捕获跨频带的谐波依赖性。基于具有密集连接的倒残差块和卷积循环网络，BASENet在VoiceBank+DEMAND上以仅0.83M参数和7.3 G MACs达到3.55 PESQ和STOI~96%，是所有PESQ > 3.50方法中参数最少的。因果变体（3.44 PESQ）超过了几种非因果基线，证实了其在资源受限设备上实时流传输的适用性。

英文摘要

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
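
The Bark-scale partition can be illustrated with a short sketch. It uses Traunmüller's approximation of the Bark scale (without its low- and high-end corrections) and splits the spectrum into bands of equal Bark width; the abstract does not give BASENet's actual band edges or capacity rule, so the band count and the helper names here are assumptions.

```python
def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale (end corrections omitted)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    u = z + 0.53
    return 1960.0 * u / (26.81 - u)

def bark_band_edges(f_max_hz, n_bands):
    """Edges (Hz) of n_bands bands that are equally wide on the Bark axis."""
    z_lo, z_hi = hz_to_bark(0.0), hz_to_bark(f_max_hz)
    return [bark_to_hz(z_lo + (z_hi - z_lo) * k / n_bands) for k in range(n_bands + 1)]

# Hypothetical setting: 8 kHz bandwidth split into 4 bands.
edges = bark_band_edges(8000.0, 4)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print([round(e) for e in edges])
print([round(w) for w in widths])
```

The bands grow wider in Hz as frequency rises, so each low-frequency band covers a narrow, perceptually dense slice of the spectrum, which is the motivation the abstract gives for assigning deeper branches there.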

2606.12699 2026-06-12 cs.LG cs.AI 交叉投稿

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

基于可穿戴传感器数据的2型糖尿病个性化血糖评估：LLM驱动方法

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

发表机构 * Department of Information Systems and Cybersecurity, The University of Texas at San Antonio（德克萨斯大学圣安东尼奥分校信息系统与网络安全系）； School of Engineering Medicine, Texas A&M University（德克萨斯农工大学工程医学院）； Department of Family and Community Medicine, The University of Texas at San Antonio（德克萨斯大学圣安东尼奥分校家庭与社区医学系）

AI总结提出GlyLLM框架，利用大语言模型整合可穿戴传感器数据和结构化元数据，实现个性化血糖动态建模，在血糖预测和糖尿病分类任务上分别比传统ML方法提升13.66%和13.08%。

Comments The 14th IEEE International Conference on Healthcare Informatics, 2026

详情

AI中文摘要

2型糖尿病（T2D）对全球健康构成日益严重的威胁，需要有效的血糖评估来支持个性化和改进的糖尿病护理。可穿戴传感器如连续血糖监测仪（CGM）和健身追踪器为血糖评估提供了许多有价值的见解。然而，有效分析这些数据需要与重要的个体层面背景信息整合。现有方法通常基于传统机器学习（ML），主要依赖历史血糖测量值，忽略了个性化信息，这限制了它们在多样化糖尿病群体中的性能。大语言模型（LLMs）的最新进展展示了它们整合多种数据模态同时建模序列依赖性的能力，激发了探索其在个性化血糖评估中潜力的兴趣。在本文中，我们提出了GlyLLM，一个基于LLM的框架，通过整合可穿戴传感器数据和结构化元数据来建模基于CGM的血糖动态。GlyLLM可以利用预训练LLM的广泛先验知识，并在决策时实现传感器-文本语义抽象。在AI-READI数据集上的两个相关任务实验表明，我们的模型在血糖预测的均方根误差（RMSE）上平均优于传统ML方法13.66%，在糖尿病分类的受试者工作特征曲线下面积（AUROC）上平均优于13.08%。此外，我们的消融研究表明，糖尿病调查和生物特征测试比其他健康信息对血糖评估更为关键。我们的工作为利用LLM推进T2D护理中的个性化血糖评估迈出了有希望的一步。

英文摘要

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic curve (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

2606.12824 2026-06-12 eess.IV cs.AI cs.CV physics.med-ph 交叉投稿

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

采集状态作为结构化、可测量变量影响肺结节AI：核驱动的测量不稳定性和噪声驱动的检测脆弱性，DICOM元数据不可见

Daniel Soliman

发表机构 * Daniel Soliman, M.S（丹尼尔·索利曼，硕士）

AI总结研究通过LUNA16训练的RetinaNet检测器，发现CT采集状态（重建核与噪声）独立影响AI的测量与检测性能，且无法从DICOM元数据恢复，提出采集感知的输入验证层。

详情

AI中文摘要

医学影像AI治理正在规范化：2026年ACR-SIIM实践参数建议本地验收测试和持续漂移监测，ACR Assess-AI注册使用DICOM元数据监测AI输出。我们认为在输出指标之下存在一个必要但目前未监测的层：输入研究是否保持在模型验证过的采集范围内。使用LUNA16训练的MONAI RetinaNet肺结节检测器，我们测试采集状态是否表现为结构化的可测量变量。在仅重建核不同的真实配对CT（NLST B30f vs B80f）上，核单独使AI测量的直径发生偏移，并在5.2%（155个结节中的8个）中翻转了Fleischner尺寸类别，而检测置信度不变（Wilcoxon p=0.22）。在受控的LIDC-IDRI扰动下，效应按轴分离：噪声轴降低检测置信度（p=5.9e-32，集中在6mm以下结节）但不影响测量，而频率/核轴破坏测量（p=8.6e-13）但不影响检测。一个4特征像素指纹恢复了重建身份（真实CT上患者级AUC约0.95，QIBA体模上0.995），而ConvolutionKernel DICOM标签无信息（不同重建标签相同）。核轴跨四个制造商传输（留一制造商AUC 0.94-0.98，与制造商内上限匹配）。因此采集状态映射到不同的AI故障模式：频率内容对应测量可靠性，噪声对应检测灵敏度，且无法从元数据恢复。采集感知的输入侧验证是现在进入影像AI认证的验收测试和漂移监测要求中缺失的层。

英文摘要

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.
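
The category flip described above can be made concrete with a small sketch. It assumes the commonly cited Fleischner 2017 size bins for solid nodules (under 6 mm, 6 to 8 mm, over 8 mm); the abstract does not say which binning the authors applied, and the function names and sample diameters below are hypothetical.

```python
def fleischner_category(diameter_mm):
    """Size bin for a solid nodule (assumed Fleischner 2017 thresholds)."""
    if diameter_mm < 6.0:
        return "<6 mm"
    if diameter_mm <= 8.0:
        return "6-8 mm"
    return ">8 mm"

def flip_rate(paired_diameters):
    """Share of nodules whose size bin differs between two reconstructions."""
    flips = sum(
        1 for soft_mm, sharp_mm in paired_diameters
        if fleischner_category(soft_mm) != fleischner_category(sharp_mm)
    )
    return flips / len(paired_diameters)

# Hypothetical (soft-kernel, sharp-kernel) diameters for the same four nodules.
pairs = [(5.8, 6.1), (7.0, 7.4), (8.2, 7.9), (4.0, 4.5)]
print(flip_rate(pairs))

# The reported figure: 8 flipped nodules out of 155.
print(round(100 * 8 / 155, 1))
```

A shift of a few tenths of a millimetre is enough to cross a bin edge, which is why a kernel change can alter the size category even when detection confidence stays the same.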

2606.12838 2026-06-12 q-bio.QM cs.AI cs.LG q-bio.GN 交叉投稿

scLLM-DSC：基于LLM知识增强的跨模态深度结构聚类用于单细胞RNA测序

Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

发表机构 * Computer Network Information Center, Chinese Academy of Sciences（中国科学院计算机网络信息中心）； University of Chinese Academy of Sciences（中国科学院大学）； Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences（中国科学院大学杭州高等研究院）； School of Computing and Information Technology, Great Bay University（大湾区大学计算机科学与技术学院）； School of Engineering, Westlake University（西湖大学工学院）

AI总结提出scLLM-DSC框架，通过知识驱动语义视图与结构感知拓扑视图的跨模态对比对齐，利用LLM增强单细胞RNA测序数据的聚类性能，显著优于现有方法。

详情

AI中文摘要

聚类是scRNA-seq分析的基础，是识别细胞群体和解析组织异质性的基石。然而，现有方法专注于挖掘数值统计模式，由于忽略了基因编码的内在生物学功能，存在语义不可知的问题。虽然大语言模型（LLM）提供了有前景的语义能力，但生成式预训练目标与判别式下游任务之间的结构不匹配阻碍了它们直接适应细胞聚类。为弥合这一差距，我们提出了scLLM-DSC，一种新颖的LLM知识增强跨模态深度结构聚类框架。与数据驱动范式不同，scLLM-DSC通过协同两个视图建立语义基础表示：从NCBI基因先验和上下文化的Cell2Sentence嵌入中提取的知识驱动语义视图，以及通过图引导编码器提取的结构感知拓扑视图。关键的是，我们引入了一种跨模态对比对齐机制，以在统一潜在空间中强制生物学语义与转录组特征之间的一致性。广泛的基准测试表明，scLLM-DSC在聚类准确性上显著优于十一个最先进的基线方法。

英文摘要

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

2606.13135 2026-06-12 cs.CV cs.AI 交叉投稿

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

皮肤肿瘤皮肤镜图像的级联分类：可控敏感度与外部临床验证

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)（俄罗斯科学院伊万尼科夫系统编程研究所）； Orel Oncological Dispensary（奥廖尔肿瘤医院）

AI总结本研究比较了四种深度学习架构在皮肤镜图像分类中的表现，提出一种两阶段级联分类方案，通过可调分诊阈值实现敏感度控制，并在外部临床数据集上验证了泛化差距。

Comments 28 pages, 8 figures, 10 tables

详情

AI中文摘要

目的：比较皮肤肿瘤皮肤镜图像的深度学习架构和分类方案，并评估从开放国际数据集到俄罗斯临床独立数据集的泛化能力。方法：在三种方案中比较四种架构（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）：二分类（恶性/良性）、单阶段四分类（良性、MEL、SCC、BCC）和两阶段级联（二分类分诊，然后三分类MEL/SCC/BCC）。所有模型使用ImageNet预训练权重和单一增强协议，在聚合的开放ISIC Archive数据上训练，并在内部保留样本和两个临床数据集（Melanoscope AI移动系统；谢切诺夫大学）上评估。结果：内部二分类阶段达到ROC-AUC 0.952-0.966；在谢切诺夫大学数据集上降至0.797-0.893，敏感度降至0.53-0.67，ECE从0.02升至0.27-0.39，且低估恶性，量化了排序和校准中的泛化差距。配对检验确认了临床数据上的一个架构间结果：二分类阶段ViT-B/16的缺陷（p<0.05）；在区分阶段，没有架构显示出显著优势。级联方案在大多数架构上提高了宏F1，但仅对ViT-B/16显著，通过恢复被分配到主导良性类别的恶性病变。在ISIC MILK10k上，直接11分类的平均类别敏感度为0.525。结论：可调分诊阈值提供了标准单阶段（argmax）分类无法实现的敏感度控制，并更好地再现了临床鉴别诊断逻辑。持续的泛化差距要求在部署前进行外部临床验证和重新校准。

英文摘要

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
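
A rough sketch of why a triage threshold gives sensitivity control that argmax does not: stage 1 gates on the malignancy probability, and only cases above the threshold reach the three-class stage. This is an illustration under assumed interfaces, not the authors' implementation; the function names, the probability inputs, and the toy scores are hypothetical.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Two-stage decision: binary triage first, then MEL/SCC/BCC differentiation."""
    if p_malignant < triage_threshold:
        return "benign"
    # Stage 2 is reached only by cases that passed triage.
    return max(subtype_probs, key=subtype_probs.get)

def triage_sensitivity(p_malignant_scores, is_malignant, triage_threshold):
    """Fraction of truly malignant cases that pass the stage-1 triage gate."""
    passed = sum(
        1 for p, y in zip(p_malignant_scores, is_malignant)
        if y and p >= triage_threshold
    )
    return passed / sum(1 for y in is_malignant if y)

# Toy scores from a model that underestimates malignancy on external data.
scores = [0.9, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, False]
print(triage_sensitivity(scores, labels, 0.5))
print(triage_sensitivity(scores, labels, 0.25))
print(cascade_predict(0.4, {"MEL": 0.2, "SCC": 0.1, "BCC": 0.7}, triage_threshold=0.3))
```

Lowering the threshold trades specificity for sensitivity at stage 1 without retraining either stage, whereas a single-stage argmax offers no such knob.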

2606.13188 2026-06-12 cs.CV cs.AI 交叉投稿

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Transformer引导的图注意力直接心脏网格重建：一种结构数字孪生框架

Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

发表机构 * CAVE Labs, C-IoT, Dept. of CSE, PES University（PES大学计算机科学与工程系C-IoT实验室CAVE实验室）； C-IoT, Dept. of CSE, PES University（PES大学计算机科学与工程系C-IoT实验室）

AI总结提出端到端网络，结合3D Swin Transformer和GAT，直接从医学图像生成平滑的心脏表面网格，避免传统后处理，在MM-WHS 2017上实现1.8 mm平均Chamfer距离。

详情

AI中文摘要

构建患者特异性心脏模型是精准心脏病学的核心，但这些模型在临床应用中始终面临同一障碍：网格生成缓慢、混乱且令人沮丧。标准工作流程——分割图像、运行Marching Cubes、然后手动清理结果——耗时、操作者间不一致，并且需要大多数临床团队不具备的专业知识。我们采取了一种根本不同的方法。我们不将分割和网格生成视为两个独立问题，而是训练一个单一的端到端网络，直接从原始3D医学图像生成平滑、可用于模拟的心脏表面网格。核心是一个3D Swin Transformer编码器-解码器，从CT或MRI体积中提取体积特征，配以一个图注意力网络（GAT）头，迭代变形模板网格以拟合患者心脏边界。我们在MM-WHS 2017基准上使用CT和MRI进行了测试。分割分数具有竞争力（CT上Dice为0.84，MRI上为0.83），但主要关注点是网格质量：平均Chamfer距离为1.8 mm，95%分位数表面距离低于5 mm。每个网格通过单次前向传播生成——无需Marching Cubes、平滑滤波器或手动清理。我们认为，对于心脏数字孪生管道，几何保真度和拓扑正确性比像素级Dice分数更重要。通过消除后处理瓶颈，该方法使患者特异性心脏模拟在临床使用中变得更加可行。

英文摘要

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
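
For readers unfamiliar with the mesh metric, here is a minimal sketch of a symmetric Chamfer distance between two point sets sampled from surfaces. Definitions vary across papers (sum or mean, squared or unsquared distances); this version averages unsquared nearest-neighbour distances in both directions, which is an assumption rather than the authors' stated formula, and the brute-force search is only suitable for small point sets.

```python
import math

def _mean_nearest_distance(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: average of the two directed mean distances."""
    return 0.5 * (
        _mean_nearest_distance(points_a, points_b)
        + _mean_nearest_distance(points_b, points_a)
    )

# Two hypothetical vertex sets (mm); the second is the first shifted 1 mm along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

Unlike Dice, which compares voxel overlap, this measure is computed on the surface itself, which is why it is a more direct indicator of geometric fidelity for simulation-ready meshes.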

2606.13236 2026-06-12 cs.LG cs.AI cs.SD stat.AP 交叉投稿

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

解码昆虫之歌：一种多任务半监督直翅目生物声学分类器

Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

发表机构 * University of Oxford（牛津大学）

AI总结提出PULSE半监督多任务框架，结合弱监督分类、自监督学习和知识蒸馏，在直翅目生物声学分类中优于通用模型，并通过主动学习进一步提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

2606.13253 2026-06-12 cs.SD cs.AI 交叉投稿

Towards Personalized Federated Learning for Dysarthric Speech Recognition

面向构音障碍语音识别的个性化联邦学习

Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； National Research Council Canada（加拿大国家研究委员会）

AI总结针对构音障碍语音识别中联邦学习异构性问题，提出参数平均和嵌入平均两种个性化聚合策略，在UASpeech和TORGO上分别实现0.99%和0.56%的绝对词错误率降低。

2606.13298 2026-06-12 cs.SE cs.AI 交叉投稿

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

在智能体AI采用下的架构质量挖掘：Java仓库的因果研究

Oliver Aleksander Larsen, Mahyar T. Moghaddam

AI总结通过差分差分设计和Borusyak插值估计器，研究智能体AI工具采用对Java仓库架构气味密度（ASD）的因果影响，发现ASD下降6.7%源于代码量增长，而非架构改进。

Comments 16 pages. Accepted for presentation at the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026, Krakow, Poland, 2-4 September 2026, and for publication in the Springer LNCS proceedings. This is the author's accepted manuscript

详情

AI中文摘要

AI编码工具现已被大多数开发者使用，这些工具的智能体化使用普及了俗称“氛围编码”的实践。然而，关于其对软件架构影响的因果证据却很少。先前的因果工作衡量了代码层面的结果（复杂度、静态分析警告）；这种退化是否会传播到架构层面仍未知。我们挖掘了151个开源Java仓库，其中74个检测到智能体AI采用（通过配置文件和Co-Authored-By提交尾注识别），以及77个倾向得分匹配的对照仓库，每个仓库跨越13个月，生成1,811个月度Arcan快照。我们采用交错差分差分设计和Borusyak插值估计器，估计采用对架构气味密度（ASD）的因果效应，将近期用于代码层面指标的因果设计应用于架构层面。总气味计数基本不变（+1.1%，p=0.82），而代码行数增长12.8%（p=0.003）；因此，ASD下降6.7%（p=0.004）是分母效应而非架构改进。按类型估计和稳健性检验（wild cluster bootstrap、Lee bounds、陈旧观测敏感性）证实了这一模式；预处理趋势平坦（Wald p=0.90），与平行趋势一致。当处理影响系统规模时，密度归一化结果可能产生误导：对AI工具采用的因果挖掘研究需要原始计数和显式分解。完整的复现包，包括精心整理的151个仓库月度面板，已公开提供。

英文摘要

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.
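
The denominator effect can be seen with a toy calculation. The numbers below are invented for illustration and are not taken from the study, whose 6.7% figure comes from its difference-in-differences estimator on the repository panel rather than from a simple ratio; the helper names are likewise hypothetical.

```python
def smell_density(smell_count, lines_of_code):
    """Architectural smells per thousand lines of code."""
    return smell_count / (lines_of_code / 1000.0)

def pct_change(before, after):
    """Relative change in percent."""
    return 100.0 * (after - before) / before

# Hypothetical repository: smell count is flat while the code base grows 10%.
smells_before, smells_after = 200, 200
loc_before, loc_after = 100_000, 110_000

density_before = smell_density(smells_before, loc_before)
density_after = smell_density(smells_after, loc_after)

print(pct_change(smells_before, smells_after))            # raw count change
print(round(pct_change(density_before, density_after), 2))  # density change
```

The density falls even though not a single smell was removed, which is why the abstract recommends reporting raw counts alongside an explicit decomposition whenever the treatment also changes system size.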

2606.13341 2026-06-12 cs.CV cs.AI physics.med-ph 交叉投稿

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

双域等变生成对抗网络用于多模态CT-PET合成

Gabriel Steele, Alzahra Altalib, Alessandro Perelli

AI总结提出双域等变生成对抗网络（DDE-GAN），联合空间与频域学习并融入旋转等变性，实现高保真多模态CT-PET图像合成。

Comments 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

详情

DOI: 10.1109/ISBI61048.2026.11515956

AI中文摘要

我们提出了一种用于多模态CT-PET图像合成的双域等变生成对抗网络（DDE-GAN）。传统的基于GAN的方法通常仅在空间域中操作，忽略了几何一致性，导致结构保真度有限。DDE-GAN通过联合学习空间域和频率（傅里叶）域，捕捉互补的解剖和频谱信息，解决了这些挑战。此外，嵌入在CT和PET测量物理中的旋转等变性被整合到生成器和判别器的损失中，以确保在旋转下的一致响应，从而提高解剖准确性。一种分层双域训练策略通过多阶段损失函数强制实现域内和域间一致性。在HECKTOR 2022 CT-PET数据集上的评估表明，DDE-GAN在CT-PET图像合成中取得了优于基线模型的合成质量。结果表明，将双域学习与几何等变性相结合，显著增强了多模态图像合成的准确性和鲁棒性，为PET补全和数据增强等实际应用提供了可能。

英文摘要

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

2606.13380 2026-06-12 quant-ph cs.AI 交叉投稿

An LLM System for Autonomous Variational Quantum Circuit Design

用于自主变分量子电路设计的大语言模型系统

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai

AI总结提出一个基于大语言模型的自主代理框架，通过迭代设计量子电路，在量子特征映射和变分量子本征求解器任务中取得优于或媲美现有方法的性能。

Comments 63 pages, 19 figures, 3 tables

详情

AI中文摘要

高性能量子电路的设计在很大程度上仍然依赖于人类专家。我们引入了一个自主代理框架，该框架利用大语言模型在明确的设计约束下进行迭代量子电路设计。我们的系统集成了七个组件：探索、生成、讨论、验证、存储、评估和审查。这些组件形成了一个闭环工作流，结合了基于网络的知识获取、基于文献的批评、可执行代码生成和实验反馈。我们在两个任务上评估了该框架：用于量子机器学习的量子特征映射构建和用于量子化学中变分量子本征求解器应用的拟设生成。在图像分类基准测试中，生成的最佳特征映射优于代表性的量子特征映射，并且当扩展到更大的量子比特数时，超过了经典的径向基函数核。在七个分子的基态能量估计中，生成的拟设达到了与广泛使用的化学启发式和硬件高效构造相竞争的精度，同时满足施加的缩放约束。这些结果确立了由大语言模型驱动的代理系统作为自动化量子电路设计的可行范式，并展示了人工智能系统如何跨科学领域参与迭代科学优化工作流。

英文摘要

The design of high-performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit design under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains accuracy competitive with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.

2606.13382 2026-06-12 cs.CV cs.AI 交叉投稿

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

SmartFont: 少样本字体生成的动态条件分配

Zian Yang, Zixin Wang

发表机构 * Fudan University（复旦大学）

AI总结提出SmartFont扩散框架，通过全局内容-风格生成与弱监督局部校正专家结合，并引入去噪状态条件分配模块动态加权全局与局部特征，实现少样本字体生成的全局完整性与局部细节保真度平衡。

详情

AI中文摘要

少样本字体生成同时需要全局结构完整性和细粒度局部风格保真度。现有方法通常要么依赖全局内容-风格建模（鲁棒但解耦不完美），要么强调组件/局部建模（捕捉细节但严重依赖局部先验和参考覆盖）。我们认为关键挑战不仅在于学习更纯净的条件，而在于通过生成过程中的多级分配来组织互补但有偏的全局和局部条件。为此，我们提出SmartFont，一个基于扩散的少样本字体生成框架，结合全局内容-风格生成与弱监督局部校正专家。局部分支通过弱组件监督学习专家级局部概念和语义有意义的空间图，实现无需显式组件条件推理的细粒度校正。在此基础上，去噪状态条件分配模块在时间步和注入块上自适应地加权全局内容、全局风格和局部校正特征。大量实验表明，SmartFont实现了更好的全局-局部平衡，提高了字形质量和局部细节保真度。

英文摘要

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

2606.13449 2026-06-12 cs.SE cs.AI 交叉投稿

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

面向指令即代码：理解指令文件对智能体拉取请求的影响

Ali Arabat, Mohammed Sayagh

AI总结通过分析148个项目的15549个智能体PR，发现指令文件对合并率、代码变更量和合并工作量无一致正面影响，但成功项目指令文件更长且结构更清晰，提出“指令即代码”研究方向。

Comments 5 pages, 8 figures, 23rd International Conference on Mining Software Repositories, April 13--14, 2026

详情

DOI: 10.1145/3793302.3793601

AI中文摘要

AI智能体（如GitHub Copilot）作为队友协作完成不同的软件工程任务，包括通过拉取请求（Agentic-PRs）提出的代码生成。为了提高智能体效率，开发者创建指令文件来指导AI智能体，包括如何导航项目、定位正确组件、运行测试、遵守最佳实践等。本文研究了这些指令的创建与AI智能体在创建更好的拉取请求方面的性能之间的关系，这些拉取请求具有更高的成功机会（即合并率）、处理更复杂的任务（例如代码变更量），并且需要更少的合并工作量（例如合并时间）。为此，我们分析了来自AIDev数据集中148个项目的15,549个智能体PR。使用这三个维度，我们比较了每个项目在创建指令文件前后的情况。我们发现，为AI智能体指定指令并不一定会带来更好的结果。使用指令文件后，27.7%的项目的合并率至少提高了20%，而26.35%的项目合并率下降。在变更量（例如代码变更量、修改文件数量）和合并智能体PR的工作量（例如合并时间和评论数量）方面也观察到相同的情况。通过初步探索，我们发现成功提高合并率的项目具有更长的指令文件，并且这些文件结构良好，分为更多的章节和子章节。我们的结果激励了研究需求，以帮助从业者将指令文件的开发视为一项软件工程活动（即，指令即代码）。

英文摘要

AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the effort to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (i.e., Instructions-as-Code).
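
The before/after comparison can be sketched as follows. The abstract does not say whether "at least 20%" is a relative gain or a difference in percentage points, so this sketch assumes a relative gain; the function names, labels, and toy data are hypothetical. For reference, 27.7% and 26.35% of 148 projects correspond to 41 and 39 projects respectively.

```python
def merge_rate(merged_flags):
    """Share of pull requests that were merged."""
    return sum(merged_flags) / len(merged_flags)

def classify_project(before_flags, after_flags, min_relative_gain=0.20):
    """Label a project by how its merge rate moved once instruction files existed."""
    before, after = merge_rate(before_flags), merge_rate(after_flags)
    if after < before:
        return "decreased"
    if before == 0:
        # Relative gain is undefined from a zero baseline; treat any rise as a gain.
        return "increased>=20%" if after > 0 else "other"
    if (after - before) / before >= min_relative_gain:
        return "increased>=20%"
    return "other"

# Toy project: 1 of 4 PRs merged before, 2 of 4 merged after.
print(classify_project([True, False, False, False], [True, True, False, False]))

# Project counts implied by the reported shares of 148 projects.
print(round(100 * 41 / 148, 1), round(100 * 39 / 148, 2))
```

Comparing each project against its own baseline, as described above, avoids confounding by project-level differences in baseline merge rates, though it cannot by itself separate the effect of the instruction file from other changes over time.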

2606.13468 2026-06-12 cs.SE cs.AI 交叉投稿

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

理解AI代理生成的拉取请求修复被拒绝的原因——来自AIDev数据集的洞察

Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh

AI总结通过分析AIDev数据集，发现46.41%的AI代理（Copilot、Devin、Cursor、Claude）提出的代码修复被拒绝。本文对306个未合并的PR进行定性研究，归纳出14个拒绝原因，分为四类，并提出了改进模型引导的建议。

Comments 5 pages, 2 figures, MSR '26: Proceedings of the 23rd International Conference on Mining Software Repositories, April 2026, Rio de Janeiro, Brazil

详情

DOI: 10.1145/3793302.3793592

AI中文摘要

AI编码代理越来越多地被用于生成拉取请求（PR），以在软件项目中提出代码修复。通过对AIDev数据集的初步探索，我们发现由Copilot、Devin、Cursor和Claude代理提出的修复中有46.41%被拒绝。这代表了大量浪费的资源，需要人工审查、验证以及运行测试和验证，而这些修复最终被丢弃。本文的目标是理解AI代理的失败模式，这对于更好地将AI代理集成为高效团队成员至关重要。本文对由前述代理创建或共同创作的306个未合并的拉取请求的代表性样本进行了定性研究，随后对拒绝原因进行了定量分析。我们的定性发现确定了14个原因，分为四个高级类别，用于拒绝AI代理的修复。我们观察到，开发者可能因以下原因拒绝修复：修复的实现不正确（例如，不完整、方法错误）、修复未通过持续集成（CI）管道并测试失败、代理无法执行实现（例如，未生成代码、会话丢失），以及修复优先级低。我们的结果揭示了在以下层面更好引导模型的重要性：（1）提出关于修复问题应遵循的方法的提示，（2）概述不应采取的方法的约束或限制，以及（3）指导代理如何通过CI管道验证实现而不引入破坏性变更。我们的结果表明，需要良好的任务优先级排序，以便生成的修复不会导致浪费的人工审查努力或浪费的代理资源（例如，令牌、计算或允许的请求数量）。

英文摘要

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources: human reviews, verifications, and test and validation runs spent on fixes that are ultimately discarded. Our goal in this paper is to understand the failure modes of AI agents, an understanding that is crucial for better integrating AI agents as efficient teammates. We conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons, divided into four high-level categories, for rejecting AI-agent fixes. We observe that developers reject fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines without introducing a breaking change. Our results also suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review effort or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

2606.13535 2026-06-12 hep-ex cs.AI hep-ph 交叉投稿

AgentRivet: an automated system for producing Rivet routines from journal publications

AgentRivet：从期刊论文自动生成Rivet例程的系统

Antonio J. Costa, Caterina Doglioni, Christian Gütschow, Andrew D. Pilkington, Sukanya Sinha

发表机构 * Department of Physics & Astronomy, University of Manchester（曼彻斯特大学物理与天文学系）； Centre for Advanced Research Computing, University College London（伦敦大学学院先进计算中心）

AI总结提出基于大语言模型的自动化工作流AgentRivet，从论文提取物理分析信息并生成缺失的Rivet例程，经代码和物理审查实现质量控制，在ATLAS和CMS测量中生成语法错误少、物理保真度合理的例程。

详情

AI中文摘要

粒子物理对撞机实验将Rivet例程作为模型无关测量分析保存策略的一部分。Rivet是一个C++工具包，允许将新的理论模型与测量结果进行比较，从而帮助开发和调整蒙特卡洛事件生成器，以及搜索标准模型之外的新物理。然而，已知分析覆盖不完整，只有39%的测量具有文档化且公开可用的Rivet例程。在本文中，我们设计并实现了一个基于大语言模型的自动化工作流，旨在提供缺失的例程。这个多步骤工作流称为AgentRivet，从已发表的论文中提取物理分析信息，并编写缺失的Rivet例程，中间代码和物理审查作为自主质量控制的一部分。我们报告了使用OpenAI、Anthropic和Google提供的商业大语言模型，针对ATLAS和CMS实验的两个近期测量所获得的结果。我们发现AgentRivet生成了语法错误很少的合格Rivet例程。例程的物理保真度合理，并遵循相关出版物中的解释。然而，物理实现问题确实出现，并使用AgentRivet产生的产物进行了调查。大多数物理实现问题源于给定出版物中微妙但模糊的定义，尽管有些模型即使在给出明确定义时也难以实现复杂的可观测量。

英文摘要

Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allows new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code and physics reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics-implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

2606.13562 2026-06-12 cs.CV cs.AI 交叉投稿

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

对比信息增强和域对抗训练用于成人到新生儿MR重建泛化

Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza

发表机构 * University of Calgary（卡尔加里大学）； Seaman Family MR Research Centre, Foothills Medical Centre（Seaman家族磁共振研究中心，山麓医疗中心）； Hotchkiss Brain Institute, University of Calgary（Hotchkiss脑研究所，卡尔加里大学）； Pediatrics, Division of Neonatology, University of Calgary（卡尔加里大学儿科学系新生儿科）； Alberta Children’s Hospital Research Institute, University of Calgary（阿尔伯塔儿童医院研究所，卡尔加里大学）； Radiology and Clinical Neuroscience, University of Calgary（卡尔加里大学放射学与临床神经科学系）； Electrical and Software Engineering, University of Calgary（卡尔加里大学电气与软件工程系）

AI总结研究对比信息增强和域对抗训练提升E2E-VarNet从成人到新生儿MR重建的泛化能力，在加速因子R=4和R=8下，混合域对抗训练在SSIM和PSNR指标上表现最优。

Comments 24 pages, 1 table, 7 figures

详情

AI中文摘要

目的：研究对比信息数据增强和域对抗训练是否能改善E2E-VarNet从成人到新生儿的泛化能力。方法：研究了三种训练方案：(1) 仅使用未增强的成人数据进行成人单独训练，(2) 使用配对的未增强和新生儿信息增强的成人数据进行混合训练，(3) 使用域对抗目标进行混合训练。模型在回顾性欠采样的多线圈成人T2加权脑MR数据上训练，并在新生儿和成人测试数据上以加速因子$R=4$和$R=8$进行评估，使用定量指标和定性评估。特征分析评估了域对抗训练是否改变了未增强成人、增强成人和新生儿测试样本的潜在表示。结果：在新生儿数据上评估时，混合训练（Mixed）和混合域对抗训练（Mixed-DAT）优于仅未增强的成人单独训练（Unaug-Only）。在R=4时，Mixed-DAT取得最佳性能（SSIM = 0.924 +/- 0.027，PSNR = 33.98 +/- 1.15 dB）。在R=8时，Mixed-DAT在SSIM指标上表现最佳（0.848 +/- 0.031，对比Unaug-Only的0.766 +/- 0.037和Mixed的0.814 +/- 0.035），而Mixed在PSNR指标上表现最佳（29.56 +/- 0.83 dB，对比Unaug-Only的26.26 +/- 0.78 dB和Mixed-DAT的29.43 +/- 0.83 dB）。t-SNE图的定性评估表明，Mixed-DAT增加了未增强成人、增强成人和新生儿测试数据的潜在表示之间的重叠。结论：对比信息增强和域对抗训练改善了基于深度学习的MR重建从成人到新生儿的泛化能力。这些发现表明，对比信息数据增强结合对抗训练可能提高欠采样新生儿MR重建中对域偏移的鲁棒性。

英文摘要

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors $R=4$ and $R=8$ using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.

2606.07489 2026-06-12 cs.AI econ.GN q-fin.EC 版本更新

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

AI代理如何重塑知识工作：自主性、效率与范围

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

发表机构 * Harvard Business School（哈佛商学院）； Perplexity AI

AI总结基于Perplexity产品数据，研究发现AI代理通过端到端任务执行，将自主工作时间从33秒提升至26分钟，完成时间缩短87%，成本降低94%，并扩展了工作范围与认知层次。

详情

AI中文摘要

前沿AI系统正从对话式助手转向端到端执行任务的自主代理，弥合智能与实用性之间的差距。利用Perplexity的Search和Computer产品的生产数据，我们通过研究AI代理如何加速和重塑知识工作来考察这一转变。三个关键实证发现出现。首先，使用具有几乎相同初始查询对的会话作为同一底层任务的自然实验，Computer每个用户会话执行26分钟的自主工作，而Search为33秒。Computer自动化了Search用户可能手动编排和实现的任务分解与执行。因此，Computer将后续查询分布转向更高层次的工作，如验证和扩展。自主性也提高了执行质量，Computer上每次查询的不满意率比Search低55%。其次，由于其自主性优势，Computer在匹配任务上将完成时间从269分钟减少到36分钟，与仅配备Search的人类相比，估计时间和成本分别降低87%和94%。第三，Computer改变了用户尝试的工作范围：Computer查询更常跨越职业边界，需要更高层次的认知，利用更广泛的专业知识，采取将相互依赖的子任务捆绑到单个查询中的复合任务形式，并解锁了同一用户在Search使用中基本不存在的工作活动。综合来看，证据表明AI代理加速工作流程、提高输出质量、降低成本，并扩展自动化工作的广度和深度。

英文摘要

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.
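The headline figures in this abstract can be sanity-checked with simple arithmetic. The sketch below uses illustrative function names that are not from the paper; it recomputes the completion-time reduction from the reported 269 and 36 minutes, and the ratio of autonomous work per session (26 minutes versus 33 seconds). The 94% cost figure depends on pricing inputs that the abstract does not state, so it is not reproduced here.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def ratio(larger: float, smaller: float) -> float:
    """How many times larger the first quantity is than the second."""
    return larger / smaller


# Completion time on matched tasks: 269 minutes (Search) vs. 36 minutes (Computer).
time_saving = percent_reduction(269, 36)

# Autonomous work per session: 26 minutes (Computer) vs. 33 seconds (Search),
# both converted to seconds before taking the ratio.
autonomy_ratio = ratio(26 * 60, 33)

print(f"time reduction: {time_saving:.1f}%")   # rounds to the reported 87%
print(f"autonomy ratio: {autonomy_ratio:.1f}x")
```

The time reduction follows directly from the two reported durations, which is a useful consistency check on the abstract's numbers.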

2606.12040 2026-06-12 cs.AI cs.GR 版本更新

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

一种用于自动混凝土护栏设计的轻量级多智能体框架

Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao

AI总结提出基于AutoGen的“生成-评估-优化”闭环多智能体框架，实现混凝土护栏自动设计，准确率超98%，且8B参数轻量模型可优于631B旗舰模型。

详情

AI中文摘要

钢筋混凝土公路护栏的设计是一个安全关键过程，需要严格遵守AASHTO-LRFD桥梁设计指南等监管规定。当前的工程实践严重依赖手动、迭代和启发式计算来满足复杂的非线性材料和力学约束。尽管大型语言模型（LLMs）表现出强大的生成能力，但它们在结构工程中的直接应用仍受到幻觉风险和物理基础不足的限制。为了解决这些挑战，本研究提出了一种新颖的“生成-评估-优化”闭环框架，利用AutoGen的多智能体编排能力实现混凝土护栏的自动设计。实验结果表明，所提出的智能体框架实现了超过98%的设计准确率，显著优于独立的通用LLMs。更重要的是，研究揭示了设计性能不一定与模型规模相关，8B参数的轻量级模型可以胜过无约束的631B参数旗舰模型。这一发现凸显了在降低计算成本的同时提高AI辅助工程工具在工业应用中的可及性的潜力。所提出的多智能体设计框架的源代码可在项目GitHub仓库中获取：https://github.com/MXY820/barrier-design。关键词：结构工程；多智能体系统；大型语言模型；混凝土护栏设计；AutoGen；设计自动化。

英文摘要

The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier-design. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

2301.12538 2026-06-12 cs.LG cs.AI math.DS 版本更新

On Approximating the Dynamic Response of Synchronous Generators via Operator Learning: A Step Towards Building Deep Operator-based Power Grid Simulators

关于通过算子学习逼近同步发电机动态响应：迈向构建基于深度算子的电网模拟器的一步

Christian Moya, Amirhossein Mollaali, Guang Lin, Meng Yue

发表机构 * Purdue University（普渡大学）

AI总结提出基于算子学习的框架，利用DeepONet逼近同步发电机的动态响应，并设计递归模拟方案及残差DeepONet方案，结合数据聚合策略实现与电网交互的模拟。

详情

AI中文摘要

本文开发了一个算子学习框架，用于逼近同步发电机的动态响应。该框架可用于（i）构建一个基于神经网络的发电机模型，与电网模拟器交互，或（ii）跟踪真实发电机的暂态响应。首先，我们开发了一个数据驱动的深度算子网络（DeepONet）来逼近发电机的无限维解算子。然后，我们设计了一个基于DeepONet的数值方案，在给定的时间范围内模拟发电机的响应。所提出的方案递归地使用训练好的DeepONet来模拟给定多维输入下的响应，该输入描述了发电机与电网之间的相互作用。此外，我们设计了一个残差DeepONet数值方案，可以整合现有数学模型的信息。我们为这个残差DeepONet方案提供了预测累积误差的估计。最后，我们构建了一个数据聚合（DAgger）策略，允许使用DeepONet在与其他电网组件交互模拟中可能遇到的聚合训练数据对DeepONet进行微调。作为概念验证，我们证明了所提出的框架能够有效逼近同步发电机的暂态模型。

英文摘要

This paper develops an Operator Learning framework for approximating the dynamic response of synchronous generators. The framework can be used to (i) build a neural network-based generator model that interacts with a power grid simulator or (ii) shadow the true generator's transient response. First, we develop a data-driven Deep Operator Network (DeepONet) to approximate the infinite-dimensional solution operator of the generators. Then, we design a numerical scheme based on DeepONet that simulates the generator's response over a given time horizon. The proposed scheme recursively employs the trained DeepONet to simulate the response for a given multi-dimensional input that describes the interaction between the generator and the power grid. In addition, we design a residual DeepONet numerical scheme that can incorporate information from existing mathematical models. We accompany this residual DeepONet scheme with an estimate for the prediction's cumulative error. Finally, we build a data aggregation (DAgger) strategy that allows fine-tuning of DeepONets using aggregated training data that the DeepONets will likely encounter during interactive simulations with other grid components. As a proof of concept, we demonstrate that the proposed frameworks can effectively approximate the transient model of a synchronous generator.

2505.04021 2026-06-12 cs.DC cs.AI cs.LG cs.PF 版本更新

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Prism: 通过GPU内存气球实现经济高效的多LLM服务

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng

发表机构 * UCLA（加州大学洛杉矶分校）； UC Berkeley（加州大学伯克利分校）； Harvard University（哈佛大学）； CMU（卡内基梅隆大学）； University of Edinburgh（爱丁堡大学）； Intel（英特尔）； Stanford University（斯坦福大学）； LMSYS ； ByteDance（字节跳动）； Alibaba Cloud（阿里云）； Tsinghua University（清华大学）； Novita AI ； Rice University（莱斯大学）

AI总结针对多LLM服务中资源效率低下的问题，提出基于内存气球的内存中心化LLM协同服务框架Prism，统一空间与时间共享，已在10K+ GPU生产环境部署。

Comments OSDI'26

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS 版本更新

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

GetNetUPAM：生态信息嵌套交叉验证与噪声鲁棒注意力用于海洋生物声学监测

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）

AI总结提出GetNetUPAM框架，通过分层嵌套交叉验证保持生态异质性，并集成CBAM空间注意力的ARPA-N网络，在高噪声低信噪比条件下实现鲁棒泛化，在零训练区域将误报率降低约10倍。

Comments Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined

详情

AI中文摘要

部署可靠的生物声学监测系统需要能够在高噪声、低信噪比条件下泛化的模型，以及能够暴露部署相关故障模式的评估协议，这些在当前UPAM实践中基本未得到解决。内在噪声、可变传播以及混合的生物和人为源会导致分布偏移，而传统模型和单次划分评估会掩盖这些偏移，夸大性能并掩盖不稳定性。我们提出GetNetUPAM，一种分层嵌套交叉验证框架，它利用嵌套阶段来量化模型稳定性，而不是调整以获取夸大的保留分数。通过将数据划分为站点-年份块，GetNetUPAM保留了生态异质性，并迫使每个外层折代表不同的环境条件，防止过拟合局部噪声或传感器伪影。内层分层折衡量整个UPAM信号分布上的泛化能力，强制模型开发与外层保留部署条件严格分离。使用GetNetUPAM，我们评估了自适应分辨率池化和注意力网络（ARPA-N），一种用于不规则频谱图维度的CNN架构。ARPA-N将CBAM空间注意力集成为学习型噪声抑制器，生成注意力图以定位真实叫声结构，并避免标准CNN在长窗口数据上利用的全局非生物线索。在GetNetUPAM下，ARPA-N在不同环境条件下鲁棒泛化。在零训练的Balleny Islands区域，它在固定90%召回率下将每小时误报率降低超过一个数量级（约10倍），并在各折上持续改进指标。这些进展提供了可重复的基准，推动UPAM向可扩展、部署可靠的生态监测发展。

英文摘要

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
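The evaluation protocol described above can be sketched as a nested split in which each outer fold holds out one site-year block and the inner folds are stratified by label. This is a minimal illustration under stated assumptions, not the authors' implementation: the leave-one-block-out outer loop, the round-robin stratification, and the field names `site`, `year`, and `label` are all hypothetical choices made for the example.

```python
from collections import defaultdict


def site_year_outer_folds(samples):
    """Yield (held_out_block, train_idx, test_idx), one outer fold per (site, year) block.

    Assumption: each outer fold holds out exactly one site-year block.
    """
    blocks = defaultdict(list)
    for i, s in enumerate(samples):
        blocks[(s["site"], s["year"])].append(i)
    for block in sorted(blocks):
        test_idx = blocks[block]
        train_idx = [i for b in sorted(blocks) if b != block for i in blocks[b]]
        yield block, train_idx, test_idx


def stratified_inner_folds(samples, indices, k):
    """Split `indices` into k folds, dealing each label round-robin to keep class balance."""
    by_label = defaultdict(list)
    for i in indices:
        by_label[samples[i]["label"]].append(i)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        for j, i in enumerate(by_label[label]):
            folds[j % k].append(i)
    return folds


# Toy data: three site-year blocks, four clips each, alternating labels.
samples = [
    {"site": site, "year": year, "label": label}
    for (site, year) in [("A", 2019), ("A", 2020), ("B", 2019)]
    for label in (0, 1, 0, 1)
]

for block, train_idx, test_idx in site_year_outer_folds(samples):
    inner = stratified_inner_folds(samples, train_idx, k=2)
    print(block, "held out;", len(train_idx), "training clips in", len(inner), "inner folds")
```

Keeping the held-out block entirely outside the inner folds is what separates model development from the deployment condition being tested.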

2511.13271 2026-06-12 cs.SE cs.AI cs.IR 版本更新

Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

生成式AI模型在学生软件编程学习活动中的使用研究

Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结通过对比生成式AI与传统在线资源对编程学习的影响，发现AI能提升任务表现但未必带来知识增益，初学者过度依赖而中级生选择性使用，呼吁将AI作为学习工具而非解题工具。

Comments 9 pages, 4 figures, published at AIWARE 2025

详情

DOI: 10.1109/AIware69974.2025.00016

AI中文摘要

生成式AI（GenAI）工具如ChatGPT的兴起为计算教育带来了新的机遇和挑战。现有研究主要关注GenAI完成教育任务的能力及其对学生表现的影响，往往忽视了其对知识获取的作用。在本研究中，我们调查了GenAI辅助与传统在线资源在不同熟练水平下对知识获取的支持效果。我们进行了一项受控用户实验，涉及24名具有两种不同编程经验水平（初学者、中级）的本科生，以考察学生在解决编程任务时如何与ChatGPT互动。我们分析了任务表现、概念理解和交互行为。我们的发现表明，使用GenAI生成完整解决方案显著提高了任务表现，尤其是对初学者而言，但并未持续带来知识增益。重要的是，使用策略因经验而异：初学者倾向于过度依赖GenAI以完成任务，过程中往往没有知识增益，而中级生则采用更具选择性的方法。我们发现，过度依赖和极少使用都会导致整体知识增益较弱。基于我们的结果，我们呼吁学生和教育工作者将GenAI作为学习工具而非解题工具。我们的研究强调了在将GenAI整合到编程教育中时，迫切需要指导以促进更深层次的理解。

英文摘要

The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students at two levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI to complete tasks, often without gaining knowledge in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning tool rather than a problem-solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.

2512.14937 2026-06-12 cs.CV cs.AI 版本更新

Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques

仅使用后处理技术改进预训练的成人胶质瘤分割模型

Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation（Sheikh Zayed儿童手术创新研究所）； Children’s National Hospital（儿童医院）； University of Madrid（马德里大学）； CIBER-BBN ； ISCIII ； School of Medicine and Health Sciences（医学与健康科学学院）； George Washington University（乔治·华盛顿大学）

AI总结针对预训练模型在胶质瘤分割中的系统误差，提出自适应后处理技术，在BraTS 2025挑战中使排名指标提升14.9%（撒哈拉以南非洲）和0.9%（成人胶质瘤），推动向高效、公平、可持续的后处理策略转变。

详情

DOI: 10.1007/978-3-032-16365-3_22

AI中文摘要

胶质瘤是成人中最常见的恶性脑肿瘤，也是最致命的肿瘤之一。尽管积极治疗，中位生存率仍低于15个月。准确的多参数MRI（mpMRI）肿瘤分割对于手术规划、放疗和疾病监测至关重要。虽然深度学习模型提高了自动分割的准确性，但大规模预训练模型泛化能力差且常表现不佳，产生系统性错误，如假阳性、标签交换和切片不连续。这些问题因GPU资源获取不平等和大规模模型训练日益增长的环境成本而进一步加剧。在这项工作中，我们提出自适应后处理技术，以改进为各种肿瘤类型开发的大规模预训练模型产生的胶质瘤分割质量。我们在多个BraTS 2025分割挑战任务中展示了这些技术，使撒哈拉以南非洲挑战的排名指标提升了14.9%，成人胶质瘤挑战提升了0.9%。该方法推动脑肿瘤分割研究从日益复杂的模型架构转向精确、计算公平且可持续的高效临床后处理策略。

英文摘要

Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

2512.24787 2026-06-12 cs.IR cs.AI 版本更新

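Contextual Invertible World Model: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response

上下文可逆世界模型：用于结直肠癌药物反应的神经符号智能框架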

Christopher Baker, Tianyu Ren, Karen Rafferty, Hui Wang

AI总结提出上下文可逆世界模型（CIWM），结合机器学习模拟器与大语言模型推理层，通过逆推理进行CRISPR扰动，揭示KRAS突变在5-氟尿嘧啶耐药中的主导作用及PIK3CA修复的意外效应。

详情

AI中文摘要

精准肿瘤学目前受到小N大P悖论的限制，即高维基因组数据丰富但药理学反应样本稀疏。虽然深度学习实现了预测准确性，但它常常无法提供临床采用所需的机制清晰度。我们提出了上下文可逆世界模型（CIWM），这是一个神经符号智能框架，通过将定量机器学习模拟器与大语言模型推理层集成来弥合这一差距。利用在Sanger GDSC数据集（$N=83$）上严格筛选的高保真数据工程流程，我们从体外伪影中分离出真正的生物信号，为复杂转录组学建立了严格的基线预测相关性（$r=0.268$）。通过逆推理，我们在结直肠癌景观中进行了计算机CRISPR扰动。该框架自主推翻了经典机制假设，识别出突变KRAS在驱动5-氟尿嘧啶耐药（$\Delta=-0.0469$）中相对于APC/Wnt轴具有层级优势，并通过映射到MAPK/PI3K网络的“KRAS盾牌”实现。此外，智能层识别出“PIK3CA悖论”，揭示修复PIK3CA通过触发补偿性反馈环过度激活主导的MAPK生存通路，无意中增加了化疗耐药性（$\Delta=+0.0085$）。

英文摘要

Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant but pharmacological response samples are sparse. While deep learning achieves predictive accuracy, it frequently fails to provide the mechanistic clarity required for clinical adoption. We present the Contextual Invertible World Model (CIWM), a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning emulator with a Large Language Model reasoning layer. Utilising a stringently curated, high-fidelity data engineering pipeline on the Sanger GDSC dataset ($N=83$), we isolate true biological signals from in vitro artifacts to establish a rigorous baseline predictive correlation for complex transcriptomics ($r=0.268$). Through Inverse Reasoning, we perform in silico CRISPR perturbations across the colorectal landscape. The framework autonomously overturns classical mechanistic assumptions, identifying a hierarchical dominance of mutant KRAS over the APC/Wnt-axis in driving 5-fluorouracil resistance ($\Delta=-0.0469$) via a "KRAS Shield" mapped to MAPK/PI3K networks. Furthermore, the agentic layer identified a "PIK3CA Paradox", revealing that repairing PIK3CA inadvertently increases chemoresistance ($\Delta=+0.0085$) by triggering a compensatory feedback loop that hyperactivates the dominant MAPK survival pathway.

2603.08505 2026-06-12 cs.LG cs.AI 版本更新

Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Echo2ECG：利用多视角超声心动图的心脏形态增强心电图表示

Michelle Espranita Liman, Özgün Turgut, Alexander Müller, Eimo Martens, Daniel Rueckert, Philip Müller

发表机构 * Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital（人工智能在医疗与医学中的中心，慕尼黑技术大学（TUM）和慕尼黑大学医院）； Department of Cardiology, TUM University Hospital（心血管科，慕尼黑大学医院）； Department of Computing, Imperial College London（计算系，伦敦帝国理工学院）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心（MCML））

AI总结提出Echo2ECG多模态自监督学习框架，通过多视角超声心动图丰富心电图表示，在结构表型分类和超声检索任务上优于现有方法，模型大小仅为最大基线的1/18。

Comments Accepted at MICCAI 2026

详情

AI中文摘要

心电图（ECG）是一种低成本、广泛使用的模态，通过捕捉心脏电活动来诊断电异常（如房颤）。然而，它无法直接测量心脏形态表型，如左心室射血分数（LVEF），这通常需要超声心动图（Echo）。从ECG预测这些表型将实现早期、可及的健康筛查。现有的自监督方法通过将ECG与单视角Echo对齐而遭受表示不匹配，单视角Echo仅捕捉局部、空间受限的解剖快照。为解决此问题，我们提出Echo2ECG，一种多模态自监督学习框架，利用多视角Echo中捕捉的心脏形态结构丰富ECG表示。我们在两个根本上需要形态信息的临床相关任务上评估Echo2ECG作为ECG特征提取器：（1）跨三个数据集的结构性心脏表型分类，以及（2）使用ECG查询检索具有相似形态特征的Echo研究。我们的提取的ECG表示在两个任务上始终优于最先进的单模态和多模态基线，尽管模型大小仅为最大基线的1/18。这些结果表明Echo2ECG是一个鲁棒、强大的ECG特征提取器。我们的代码可从 https://github.com/michelleespranita/Echo2ECG 获取。

英文摘要

Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.

2603.24603 2026-06-12 q-bio.NC cs.AI 版本更新

Fusion Learning from Dynamic Functional Connectivity: Combining the Amplitude and Phase of fMRI Signals to Identify Brain Disorders

融合动态功能连接：结合fMRI信号的幅度和相位识别脑疾病

Jinlong Hu, Jiatong Huang, Zijian Cai

AI总结提出多尺度融合学习框架MSFL，结合滑动窗口相关和相位同步两种互补的动态功能连接特征，在自闭症和抑郁症数据集上显著优于现有模型。

详情

AI中文摘要

基于静息态功能磁共振成像（fMRI）的动态功能连接（dFC）已广泛应用于脑科学研究。滑动窗口相关（SWC）方法通过计算脑区对信号幅度时间序列之间的相关系数，是构建dFC的常用方法。在本研究中，我们提出了一种集成方法，结合fMRI信号的幅度和相位信息，以提高脑疾病的检测能力。具体而言，我们引入了一个多尺度融合学习框架MSFL，该框架利用来自SWC和相位同步（PS）的两种互补dFC特征。其中，SWC捕获幅度相关性，而PS测量dFC内的相位相干性。我们使用两个公开数据集（ABIDE I和REST-meta-MDD）评估了MSFL在分类自闭症谱系障碍和重度抑郁症方面的有效性。结果表明，MSFL显著优于现有比较模型。此外，我们使用SHAP框架进行了模型解释分析，表明来自SWC和PS的两种dFC特征均有助于检测脑疾病。

英文摘要

Dynamic functional connectivity (dFC) derived from resting-state functional magnetic resonance imaging (fMRI) has been extensively utilized in brain science research. The sliding window correlation (SWC) method is a widely used approach for constructing dFC by computing correlation coefficients between amplitude time series of signals from pairs of brain regions. In this study, we propose an integrated approach that incorporates both amplitude and phase information of fMRI signals to improve the detection of brain disorders. Specifically, we introduce a multi-scale fusion learning framework, namely MSFL, which leverages two complementary dFC features derived from SWC and phase synchronization (PS). Here, SWC captures amplitude correlations, while PS measures phase coherence within dFC. We evaluated the efficacy of MSFL in classifying autism spectrum disorder and major depressive disorder using two publicly available datasets: ABIDE I and REST-meta-MDD, respectively. The results indicate that MSFL significantly outperforms existing comparative models. Moreover, we performed model explanation analysis using the SHAP framework, which showed that both types of dFC features from SWC and PS contribute to detecting brain disorders.
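The two dFC features named above can be sketched for a single pair of regional time series. This is a minimal illustration, not the MSFL implementation: the abstract does not specify the phase-synchronization measure, so the cosine of the instantaneous phase difference is assumed here, and the function names and the naive DFT-based Hilbert transform are choices made only to keep the example dependency-free.

```python
import cmath
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def sliding_window_correlation(x, y, window, step=1):
    """SWC: amplitude correlation computed inside each sliding window."""
    return [pearson(x[s:s + window], y[s:s + window])
            for s in range(0, len(x) - window + 1, step)]


def analytic_phase(x):
    """Instantaneous phase from the analytic signal, using a naive O(n^2) DFT."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # Keep DC (and Nyquist for even n), double positive frequencies, drop negative ones.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    for k in range(1, (n + 1) // 2):
        gain[k] = 2.0
    analytic = [sum(gain[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n)) / n
                for t in range(n)]
    return [cmath.phase(z) for z in analytic]


def phase_synchronization(x, y):
    """PS (assumed form): cosine of the phase difference at every time point, in [-1, 1]."""
    return [math.cos(a - b) for a, b in zip(analytic_phase(x), analytic_phase(y))]


# Toy signals: a sine wave and its sign-flipped copy (perfectly anti-phase).
n = 64
x = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
y = [-v for v in x]
swc = sliding_window_correlation(x, y, window=16, step=8)
ps = phase_synchronization(x, y)
print(len(swc), "windowed correlations;", len(ps), "per-sample phase values")
```

SWC yields one value per window, while the phase measure yields one value per time point, which is one reason the two features are complementary.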

2604.07590 2026-06-12 cs.IR cs.AI version update

DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

Valerii Kovalskii, Nikita Belov, Nikita Miteyko, Igor Reshetnikov, Maksim Maksimov

Affiliations: red_mad_robot

AI summary: Proposes DCD (Domain-Collection-Document), a hierarchical design that uses structured knowledge representation and multi-stage routing to control the scope of retrieval and generation without modifying the language model, improving RAG robustness and accuracy on heterogeneous corpora and multi-step queries.

Comments 14 pages, 4 figures, 2 links, link to HF https://huggingface.co/datasets/redmadrobot-rnd/dcd, link to GIT https://github.com/redmadrobot-rnd/dcd

Abstract

Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on a synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.

2604.24806 2026-06-12 cs.IR cs.AI cs.DB version update

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng, Yanzun Huang

Affiliations: Meta Platforms, Inc.

AI summary: Proposes a versioned late materialization paradigm that eliminates data redundancy through normalized storage and just-in-time sequence reconstruction, supporting training on ultra-long user interaction histories while reducing storage and I/O overhead and improving model quality.

Abstract

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall in which data infrastructure usage exceeds GPU training capacity. The cause is data redundancy, which is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
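
The idea of storing the history once and rebuilding sequences through versioned pointers can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the class and field names are hypothetical, the "version" is simply the number of events visible when the example was logged, and nothing here reflects the production system's actual storage format or protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """A training example that carries a pointer instead of a copied sequence."""
    user_id: str
    version: int  # hypothetical pointer: number of UIH events visible at logging time
    label: int


class UIHStore:
    """Append-only per-user event log (stands in for the normalized, immutable tier)."""

    def __init__(self):
        self._log = {}

    def append(self, user_id, event):
        """Record one event and return the new version of that user's history."""
        events = self._log.setdefault(user_id, [])
        events.append(event)
        return len(events)

    def materialize(self, example, max_len):
        """Rebuild the sequence for one example at training time."""
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        # Slicing at the stored version hides events logged after the example,
        # which is what prevents future leakage in this toy model.
        visible = self._log[example.user_id][:example.version]
        # Each model tenant can request its own sequence length.
        return visible[-max_len:]
```

Because every example stores only `(user_id, version)`, two models that need different sequence lengths can read the same history without duplicating it per example.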

2606.06525 2026-06-12 cs.GR cs.AI version update

Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems

Ziheng Geng, Ian Franklin, Santiago Martinez, Jiachen Liu, Yunhe Zhao, Minghui Cheng

Affiliations: Department of Civil and Architectural Engineering, University of Miami; School of Architecture, University of Miami; HBC Engineering Company; Department of Electrical and Computer Engineering, University of Miami

AI summary: Proposes an agentic LLM framework that uses a projection-based representation and a pipeline of agents to automate structural analysis of 3D frames from natural language input, reaching an average accuracy of 90%.

Abstract

Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of story counts encodes the vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions; and code translation agents generate executable SAP2000 scripts. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.
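
The plan-plus-story-matrix representation described in the abstract can be illustrated with a short sketch that extrudes grid intersections into nodes and columns. The function name, the uniform story height, and the rule that an intersection rises as high as the tallest cell touching it are assumptions made for illustration, not details taken from the paper.

```python
def extrude_frame(x_lines, y_lines, stories, story_height):
    """Return (nodes, columns) for a frame given plan gridlines and story counts.

    stories[r][c] is the number of stories over the plan cell bounded by
    y_lines[r], y_lines[r + 1] and x_lines[c], x_lines[c + 1].
    """
    n_rows, n_cols = len(y_lines) - 1, len(x_lines) - 1

    def top_level(i, j):
        # Tallest cell touching the grid intersection (x_lines[i], y_lines[j]).
        touching = [stories[r][c]
                    for r in (j - 1, j) for c in (i - 1, i)
                    if 0 <= r < n_rows and 0 <= c < n_cols]
        return max(touching, default=0)

    nodes, columns = [], []
    for j, y in enumerate(y_lines):
        for i, x in enumerate(x_lines):
            top = top_level(i, j)
            if top == 0:
                continue  # nothing is built over this intersection
            levels = [(x, y, k * story_height) for k in range(top + 1)]
            nodes.extend(levels)
            # One column segment between each pair of consecutive levels.
            columns.extend(zip(levels[:-1], levels[1:]))
    return nodes, columns
```

For example, `extrude_frame([0, 6, 12], [0, 5], [[2, 1]], 3)` describes two plan cells side by side, the first two stories tall and the second one story tall; girders and slabs would be derived from the same grid in a fuller version.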

2606.10200 2026-06-12 cs.CV cs.AI cs.LG version update

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

Ahmed Faizul Haque, S. M. Riaz Rahman Antu, Saif Ahmed, Asadullah Hil Galib, Souvik Pramanik, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

AI summary: Proposes an improved GAN-based method for restoring imaging logging images. It combines an FCN generator, depthwise-separable convolutional residual blocks, an Inception module, multi-scale feature extraction, and spatial attention with global and local discriminators to restore missing regions, reaching a structural similarity of 0.903.

Comments Mistakes in citations and references. Further we want to submit in conference with improved experiments and results

Abstract

This paper presents an improved GAN-based method for restoring micro-resistivity imaging logging images with partially missing regions. The generator is built on an FCN. It adds a depthwise-separable convolutional residual block to learn and retain more effective pixel and semantic information, and an Inception module to enlarge the network's multi-scale receptive field while reducing the number of parameters. A multi-scale feature extraction module and a spatial attention residual block, which combine the channel attention mechanism with the residual block, provide multi-scale feature extraction. A global discriminator and a local discriminator are trained adversarially against the generator to gradually improve the coherence of content and semantic structure between the restored regions and the whole image. In the experiments, the average structural similarity over five test sets of imaging logging images with missing regions of different sizes is 0.903, an improvement of about 0.3 over other similar methods. The results indicate that the method can restore micro-resistivity imaging logging images with improved semantic structural coherence and texture detail, offering a new deep learning method that supports the subsequent interpretation of these images.

2606.11238 2026-06-12 q-fin.GN cs.AI version update

Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination

Lasse Dierich, Orestis Schinas

Affiliations: ShipFinance.ai; HHX.blue GmbH; Technical University of Munich; University of the Aegean

AI summary: Examines applications of AI in ship finance and proposes a modular LLM-based architecture for document comprehension, information extraction, and workflow automation to support the loan application process.

Comments 9 pages, 1 figure

Abstract

Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, contractual, and regulatory information from heterogeneous and largely unstructured sources. Increasing environmental regulation and ESG reporting requirements are adding further complexity to underwriting and loan-origination processes. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), create new opportunities for processing and analysing such information. This paper reviews potential applications of AI in ship finance, with a particular focus on LLM-based systems for document comprehension, information extraction, and workflow automation. We present ShipFinance.ai, a modular agentic architecture to support loan application workflows in ship finance. The proposed system combines an LLM-based extraction module, financial analysis components, external maritime data services, and a controlled document-generation module with a chatbot interface to support the preparation of standardized financing applications. The paper discusses the key challenges for using such models in production. We argue that AI-assisted systems can support maritime finance professionals in managing increasingly complex information and reporting requirements.

2606.11793 2026-06-12 cs.LG cs.AI physics.ao-ph version update

Scalable Deep Learning Framework for Global High-Resolution Land Use Reconstruction

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

Affiliations: Barcelona Supercomputing Center

AI summary: Proposes AI4Land, a two-phase U-Net framework that combines coarse-resolution scenario data with static geophysical features to reconstruct high-resolution annual land use and land cover, with the aim of reducing terrestrial carbon cycle uncertainty and supporting climate simulation.

Abstract

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present AI4Land, a data-driven framework for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11930 2026-06-12 cs.HC cs.AI cs.CV version update

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

Affiliations: Technology Application and Human Resource Development, National Taiwan Normal University; Computer Science and Information Engineering, National Central University; Institute of Photonic System, National Yang Ming Chiao Tung University

AI summary: Addresses high-dimensional multimodal learning with limited labeled data in asynchronous video interviews by pairing frozen multimodal encoders (CLIP, Whisper, RoBERTa, and others) with low-capacity downstream models, achieving a 19.1% MSE reduction on personality prediction and pointing to possible dataset shortcuts in cognitive ability prediction.

Comments 9 pages, 1 figure, 5 tables

Abstract

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
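
The relative reduction quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the MSE values reported above; the function name is a hypothetical helper, and the formula (baseline minus value, divided by baseline) is the usual definition of a relative reduction rather than something the abstract spells out.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline` (lower MSE is better)."""
    return (baseline - value) / baseline


BASELINE_MSE = 0.3334  # official Track 1 baseline reported in the abstract

ablation = [
    ("global model", 0.3189),
    ("per-trait modeling", 0.2871),
    ("per-trait late fusion", 0.2696),
]

for name, mse in ablation:
    percent = 100 * relative_reduction(BASELINE_MSE, mse)
    print(f"{name}: {percent:.1f}% below the official baseline")
```

The last line of output corresponds to the 19.1% figure stated in the abstract; the first two lines show how much of that gain each earlier ablation step already accounts for.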

2606.12683 2026-06-12 cs.AI cs.CY cs.LG new submission

From AGI to ASI

Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

Affiliations: Google DeepMind; University of Waterloo; Australian National University; University College London

AI summary: Explores pathways for the transition from human-level AGI to superintelligence, including scaling, paradigm shifts, recursive improvement, and emergence from multi-agent collectives, and analyzes frictions and bottlenecks along the way.

Abstract

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

2606.12713 2026-06-12 cs.AI new submission

Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

J. E. Aguilera Briones

Affiliations: Universidad Internacional de Investigación México

AI summary: To address disputes caused by inconsistent definitions of AGI, proposes the DAF-AGI framework, comprising five ordinal criteria and a structured governance audit, for assessing candidate definitions and adjudicating AGI claims.

Comments 31 pages, 1 table, 2 appendices

Abstract

Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping evidence. "AGI" lacks a single shared and stable referent and competing operationalizations can return different verdicts on the same system. This article treats that under-specification as a design and governance problem. Following Design Science Research Methodology, it develops DAF-AGI, a second-order conceptual artifact with two coupled components: five ordinal criteria for assessing the adjudicative fitness of candidate definitions and a structured governance audit of authorship, interest, certification, external verification and revision authority. The artifact is demonstrated on five prominent measurement families and one deflationary boundary position in a documented corpus and then stress-tested against a stylized strong arrival claim: that current generative systems constitute AGI because they outperform a well-educated adult on many cognitive tasks. On evidence from the cited 2024-2025 sources, the claim was certifiable only under a performance-based operationalization; capability-ontology, psychometric and skill-acquisition approaches did not certify it, the economic family remains indeterminate and the deflationary position refuses binary adjudication. The contribution is a novel integration and operationalization, not an empirical validation: independent application, inter-rater testing and author-external cases remain necessary. The paper further proposes definitional sovereignty as an enabling component of algorithmic sovereignty: the institutional capacity to contest, certify and revise imported technological categories under public accountability.

2606.12783 2026-06-12 cs.AI new submission

A Tutorial on World Models and Physical AI

Il-Seok Oh

Affiliations: Department of Computer Science and Artificial Intelligence/CAIIT, Jeonju, Jeonbuk, South Korea

AI summary: Presents a unified framework that distinguishes explicit from implicit world models and discusses their application to physical AI domains such as robotics and autonomous driving, along with the challenges on the path to artificial general intelligence.

Abstract

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

2606.12828 2026-06-12 cs.AI new submission

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Rasul Khanbayov, Hasan Kurban

AI summary: Analyzes papers from five major AI conferences between 2017 and 2025, finds that AI topics surge abruptly through "phase transitions", and uses an early-warning signature to identify topics to watch in the coming years.

DOI: 10.5281/zenodo.20635334

Abstract

Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at https://github.com/KurbanIntelligenceLab/ai-phase-transitions.
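
To put the reported precision and recall in context, the sketch below derives two quantities from the three figures in the abstract: the lift of the signature over the base rate, and the fraction of candidate topics the signature would flag. Both are back-of-the-envelope derivations that assume the precision, recall, and base rate all refer to the same pool of candidate topics; the abstract does not state these derived numbers, and the function names are hypothetical.

```python
def lift(precision, base_rate):
    """How many times more often a flagged topic is a true transition than a random topic."""
    return precision / base_rate


def flagged_fraction(precision, recall, base_rate):
    """Share of all candidates that get flagged.

    True positives are recall * base_rate of all candidates, and precision is
    true positives divided by flagged items, so flagged = recall * base_rate / precision.
    """
    return recall * base_rate / precision


PRECISION, RECALL, BASE_RATE = 0.27, 0.63, 0.135  # values reported in the abstract

print(f"lift over base rate: {lift(PRECISION, BASE_RATE):.2f}")
print(f"fraction of candidates flagged: {flagged_fraction(PRECISION, RECALL, BASE_RATE):.3f}")
```

Read this way, a precision of 27% means a flagged topic is about twice as likely to undergo a transition as a topic picked at random, which is why the comparison against the 13.5% base rate matters more than the raw precision.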

2606.13201 2026-06-12 cs.AI new submission

A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice

Manisha Dubey, Anirban Sarkar, Subramanian Ramamoorthy

Affiliations: School of Informatics, University of Edinburgh, UK; Cold Spring Harbor Laboratory, USA

AI summary: Proposes a bounded trade-off reasoning framework that models the screening process with a trade-off tolerance parameter, producing preference patterns that differ from standard utility models and explaining context-dependent behavior in multi-attribute choice.

Comments 3 pages, 1 figure, accepted as extended abstract at Annual Conference on Cognitive Computational Neuroscience 2026

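2606.13566 2026-06-12 cs.AI new submission

A Three-Layer Framework for AI in Scientific Discovery

Guojun Liao

Affiliations: Department of Mathematics, University of Texas at Arlington

AI summary: Proposes a three-layer framework for AI in scientific discovery. Its core innovation is the second layer: model formation through qualitative reasoning, which identifies structural deficiencies in a framework and searches for the missing concepts. Three case studies illustrate its importance.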

Mapping AI Programs in the United States: An Early-2026 Status Report and Analysis of AI Majors and Minors

Felix Muzny, Carolyn Jones, Carter Ithier, Hasnain Sikora, Hrutika Harshadbhai Patel, Carla E. Brodley

Affiliations: Center for Inclusive Computing; Khoury College of Computer Sciences; Northeastern University; Boston, Massachusetts, United States

AI summary: Reports on the state of U.S. undergraduate AI programs in Spring 2026. The authors build a dynamically updating tool that searches more than 560 institutions and covers more than 350 programs, and they analyze the course requirements of 66 AI majors and 87 AI minors. Not all majors require a general AI course, but those that do not require a machine learning course instead; more than a third of majors require an AI ethics course, compared with just under a quarter of minors.

Abstract

We present a report on the status of undergraduate Artificial Intelligence (AI) programs in the United States in Spring 2026. In so doing, we 1) describe our scraping and mapping tools, which dynamically update to track the state of AI education in the U.S., and 2) create a historic record at a time of great upheaval. The tool we developed, available at https://cicmap.ai, detects, scrapes, and displays data from more than 350 undergraduate AI programs--majors, minors, concentrations, and certificates--at 4-year universities. Our tool searched over 560 institutions to locate these programs, a sample that represents 86% of all undergraduate Computer Science (CS) graduates in the U.S. This tool allows prospective students, guidance counselors, administrators, and faculty to easily access AI program requirements and is designed to continually update as new programs emerge. To the best of our knowledge, this survey represents the most comprehensive snapshot of the state of AI programs in the U.S. to date. With this work we offer three important contributions: 1) a record of AI programs in the U.S. at a time of great upheaval; 2) a tool to explore AI programs and their requirements; and 3) an analysis of the courses required for 66 AI majors and 87 AI minors. Our analysis of majors and minors shows great variability in the size and the requirements of these degrees, but we note two takeaways. First, not all majors require a general AI course, but if they don't, they do require a Machine Learning (ML) course. Second, while more than a third of majors require an Ethics in AI course, just under a quarter of AI minors do.

2606.12441 2026-06-12 cs.CY cs.AI cs.HC cross-list

Generativism: Toward a Learning Theory for the Age of Generative Artificial Intelligence

Shan Li, Juan Zheng

AI summary: Critically examines the limitations of four major learning theories (behaviorism, cognitivism, constructivism, and connectivism) in the age of generative AI, and proposes Generativism, a new learning theory that emphasizes knowledge co-construction through human-AI collaboration.

Abstract

The four dominant learning theories of behaviorism, cognitivism, constructivism, and connectivism show significant conceptual limitations as generative artificial intelligence (AI) proliferates in educational settings. These frameworks were formulated before the emergence of AI systems capable of generating, synthesizing, and reasoning about knowledge. This article critically examines each learning theory and identifies assumptions challenged by generative AI's affordances. Drawing on research in distributed cognition, extended mind, human-AI collaboration, AI literacy, cognitive offloading, and metacognition, the article proposes Generativism as a learning theory for the generative AI age. Generativism posits that learning increasingly occurs through the iterative co-construction of knowledge between human learners and AI systems. The proposed framework is organized around four principles: epistemic partnership, distributed agency, generative literacy, and adaptive metacognition. The framework offers a foundation for rethinking instructional design, learning, assessment, and expertise development in contexts where generative AI plays an integral role in cognition.

2606.12502 2026-06-12 physics.soc-ph cs.AI cross-listed

A Mathematical Theory of Value: a synthesis on goal-directed agency under resource constraints

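Cheng Qian

AI summary: The paper proposes that value is the rate at which a goal-directed agent converts resources into goal progress under resource constraints, derives a logarithmic measure from a scale-invariance axiom, and derives a coding theorem of value that unifies value with information theory.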

Comments Also available at https://doi.org/10.5281/zenodo.20487041 (v5)

Abstract

We propose that value -- the quantity goal-directed agents create, destroy, and exchange -- is a lawful structural quantity in the same category as information. Following Shannon's method, we make one ruthless abstraction: value is the rate at which an agent converts a resource into goal-progress, relative to a frame fixed by its goal. A scale-invariance axiom forces a logarithmic measure, $V=\sum_i k_i \ln e_i$; compounding of a reinvested resource forces the same form via the ergodicity argument of Peters (2019). The two routes are kin rather than independent; their agreement is a consistency check, not an over-determination. We derive a coding theorem of value: $\Delta G \le I(X;Y)$, achieved by Bayes-proportional allocation; realized value decomposes as $G=D(q\|r)-D(q\|p)$, identifying misalignment with measurable waste. For populations, value is frame-relative while price is frame-independent; a fleet that pools its resource and fuses its perception inherits the ceiling $G_{\mathrm{fleet}} \le I(X;Y_{1:m}) \le H(X)$ (a corollary; an earlier sum-form claim was wrong and is corrected in v5). A dynamical layer yields an is/ought asymmetry from which alignment emerges as a control-stability condition with a closed-form residual. We test the single-frame laws on live language models in a pre-registered scale-up: perception mutual information tracks realized capability rather than parameter count (Spearman $\rho = 0.977$ pooled over 30 model$\times$domain points), out-of-sample $\Delta G$ tracks $I(X;Y)$, and over-confidence is measurable dissipation; a further pre-registered test shows the bridge is shape-invariant across four task shapes ($n=42$, slope 0.953). None of the mechanisms is individually new -- generalized Kelly, Armstrong & Mindermann (2018), classical control; the contribution is their unification and the governance mapping (incentive design over oversight) that follows.
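
The decomposition $G=D(q\|r)-D(q\|p)$ can be checked numerically in the classical Kelly horse-race setting, one of the ingredients the abstract names. The sketch below is only an illustration under assumptions the abstract does not state: `q` is taken to be the true outcome distribution, `p` the agent's belief (and its proportional allocation), and `r` the distribution implied by payout odds `1/r`. All names and numbers are hypothetical.

```python
import math

def kl(a, b):
    """Kullback-Leibler divergence D(a||b) in nats for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def growth_rate(q, p, r):
    """Expected log-growth when a fraction p[i] is staked on outcome i,
    the payout odds are 1/r[i], and outcomes are drawn from q."""
    return sum(qi * math.log(pi / ri) for qi, pi, ri in zip(q, p, r) if qi > 0)

q = [0.5, 0.3, 0.2]      # true distribution (hypothetical)
p = [0.4, 0.4, 0.2]      # agent's belief and allocation (hypothetical)
r = [1 / 3, 1 / 3, 1 / 3]  # reference distribution implied by the odds (hypothetical)

g = growth_rate(q, p, r)          # realized growth rate
decomposed = kl(q, r) - kl(q, p)  # D(q||r) - D(q||p)
best = growth_rate(q, q, r)       # allocation matched to the truth
```

In this toy setting `g` and `decomposed` agree, and `best - g` equals `kl(q, p)`, the growth lost to the mismatch between belief and truth.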

2606.12647 2026-06-12 cs.CC cs.AI cs.LG cross-listed

Token Complexity Theory for AI-Augmented Computing

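Jie Wang

AI summary: Introduces token complexity as a formal measure of query and response cost in AI-augmented computing, sets up the AI-Oracle Turing machine framework, and proves basic theorems including monotonicity, convexity, price sensitivity, and price-relativity of task ordering.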

Comments 25 pages, 1 figure

Abstract

AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.
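
Price-relativity of task ordering can be illustrated with a toy cost model. The sketch assumes, beyond what the abstract states, that the cost of a strategy is linear in its token counts (`price_q * query_tokens + price_r * response_tokens`) and that a task's token complexity is the minimum cost over a finite set of strategies that reach the target quality. The task names and token counts are hypothetical.

```python
def token_cost(strategy, price_q, price_r):
    """Cost of one strategy, given per-token prices for queries and responses."""
    query_tokens, response_tokens = strategy
    return price_q * query_tokens + price_r * response_tokens

def token_complexity(strategies, price_q, price_r):
    """Cheapest strategy among those assumed to reach the target quality."""
    return min(token_cost(s, price_q, price_r) for s in strategies)

# (query_tokens, response_tokens) for each admissible strategy (hypothetical)
tasks = {
    "summarize": [(900, 100), (1200, 60)],  # long prompts, short responses
    "generate": [(100, 700), (200, 500)],   # short prompts, long responses
}

def ranking(price_q, price_r):
    """Tasks ordered from cheapest to most expensive under the given prices."""
    return sorted(tasks, key=lambda t: token_complexity(tasks[t], price_q, price_r))

cheap_queries = ranking(1.0, 4.0)    # responses cost 4x as much as queries
cheap_responses = ranking(4.0, 1.0)  # queries cost 4x as much as responses
```

With these numbers the two rankings come out in opposite order, which is the kind of reversal the price-relativity theorem describes.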

2606.12805 2026-06-12 cs.HC cs.AI cross-listed

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

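Prerna Ravi, Carúmey Stevens, Ben Hurt, Brandon Hanks, Grace Lin, Emma Anderson

AI summary: In a study with 33 teachers, the accent of a GenAI voice agent (British, Indian, or African American) affected whether it was perceived as a tool or a peer, which in turn influenced trust, engagement, and reliance.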

Abstract

Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners' perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants' mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI's sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

2606.13179 2026-06-12 cs.ET cs.AI cs.AR cs.NE cross-listed

Modern analog computing for solving differential and matrix equations

Zhong Sun, Piergiulio Mannocci, Manuel Le Gallo, Abu Sebastian

Affiliations: Institute for Artificial Intelligence, School of Integrated Circuits, Peking University, Beijing Advanced Innovation Center for Integrated Circuits; Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano; IBM Research Europe

AI summary: Surveys the core primitives, hardware implementations, and recent progress of modern analog computing for solving differential and matrix equations, highlights the advantages of resistive memory arrays, and discusses precision, scalability, and the relationship with in-memory computing.

Abstract

In recent years, driven by the computational demands of data-intensive applications such as artificial intelligence and scientific computing, analog computing has gained renewed interest. Given the diversity of computational tasks and recent advancements in analog CMOS circuits and resistive memory technologies, we refer to the evolving landscape as modern analog computing. In this context, we identify three core computational primitives: solving differential equations, solving matrix equations, and performing matrix-vector multiplications, and we explore the connections among them. We also examine various hardware implementations of these analog computing operators, including those built with discrete components, integrated circuits, and resistive memory devices. Among these, resistive memory arrays emerge as particularly promising due to their implementation efficiency. The paper then surveys recent progress in leveraging modern analog computing to solve differential and matrix equations using both advanced analog CMOS circuits and resistive memory arrays. Finally, we discuss the applications of these circuits, the precision and scalability issues and their potential solutions, the relationship with in-memory computing, and the unique computational complexity of analog computing. This paper provides a unified perspective on analog computing, highlighting its strengths, current developments, and challenges, and positioning it as a pivotal enabler of next-generation computational frontiers.
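
One way to see how the three primitives connect is a digital toy model: the differential equation $dx/dt = b - Ax$ uses a matrix-vector multiplication at every instant and, when the eigenvalues of $A$ have positive real parts, settles to the solution of the matrix equation $Ax = b$. The sketch below integrates that equation with forward Euler. It is not a model of any specific circuit from the paper; the matrix, step size, and function names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product; on a crossbar, A would be stored as conductances."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def settle(A, b, dt=0.01, steps=5000):
    """Forward-Euler integration of dx/dt = b - A x, starting from x = 0."""
    x = [0.0] * len(b)
    for _ in range(steps):
        Ax = matvec(A, x)
        x = [xi + dt * (bi - axi) for xi, bi, axi in zip(x, b, Ax)]
    return x

# Hypothetical 2x2 system whose eigenvalues (2 and 5) are both positive
A = [[4.0, 1.0],
     [2.0, 3.0]]
b = [1.0, 2.0]

x = settle(A, b)
residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
```

For this system the state converges to approximately `[0.1, 0.6]`, the solution of $Ax = b$, and the residual vanishes to within floating-point error.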

2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML cross-listed

Valid Inference with Synthetic Data via Task Exchangeability

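Lezhi Tan, Tijana Zrnic

AI summary: Proposes the task exchangeability condition, which ensures valid statistical inference when synthetic data are used in scientific research, with applications to public opinion surveys and AI evaluation.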

Abstract

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.
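
To give a feel for how an exchangeability assumption can yield validity, the sketch below applies the standard conformal-quantile argument to task-level errors. This is not the authors' method, only a generic illustration: it assumes that the gap between the synthetic and the real estimate on the current task is exchangeable with the same gap on historical tasks. All data and names are hypothetical.

```python
import math

def conformal_radius(errors, alpha):
    """Radius that covers a new exchangeable error with probability >= 1 - alpha."""
    n = len(errors)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few historical tasks for this confidence level
    return sorted(errors)[k - 1]

# Hypothetical historical tasks: (synthetic estimate, estimate from real data)
history = [(0.52, 0.48), (0.31, 0.35), (0.70, 0.61), (0.44, 0.45),
           (0.58, 0.50), (0.25, 0.27), (0.66, 0.63), (0.39, 0.46),
           (0.81, 0.80)]
errors = [abs(s - r) for s, r in history]

synthetic_now = 0.55  # hypothetical synthetic estimate for the current task
radius = conformal_radius(errors, alpha=0.25)
interval = (synthetic_now - radius, synthetic_now + radius)
```

With nine historical tasks and `alpha=0.25`, the radius is the eighth-smallest historical error; asking for 95% coverage from the same nine tasks returns an infinite radius, which shows how the number of historical tasks limits the attainable confidence.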

2606.00807 2026-06-12 cs.AI cs.HC updated version

Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation

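Nicholas Davis

Affiliations: Georgia Institute of Technology; Co-Creative AI Consulting

AI summary: Proposes interaction as the primary unit of analysis, argues from distributed cognition, embodied cognition, and related theories that intelligence emerges from interaction dynamics rather than isolated computation, and introduces the Interaction-Centered Intelligence framework.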

Abstract

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

2508.04427 2026-06-12 cs.LG cs.AI updated version

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

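Md Raisul Kibria, Sébastien Lafond, Janan Arslan

Affiliations: University of Science and Technology of China

AI summary: A systematic review of research from 2020 to early 2024 on the explainability of multimodal models. Most work concentrates on vision-language and language-only models, with attention-based techniques as the main explanation method; evaluation lacks systematicity and robustness, and recommendations for improvement are given.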

DOI: 10.1016/j.inffus.2026.104405

Abstract

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

2605.29151 2026-06-12 math.AG cs.AI cs.NE updated version

Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof

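Gergely Bérczi, Young-Hoon Kiem

AI summary: Proves real-rootedness of the Poincaré polynomials of the moduli space of stable rational curves by introducing a bivariate deformation that reveals a hidden interlacing structure, and extends the result to Fulton-MacPherson spaces.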

Comments 16 pages

Abstract

We prove real-rootedness for the Poincaré polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \] of the Deligne--Mumford moduli space $\overline{\mathcal M}_{0,n}$ of stable $n$-pointed rational curves, proving a conjecture of Aluffi--Chen--Marcolli. The proof starts from the Keel--Manin--Getzler recurrence, but its main new idea is a bivariate deformation $F_m(y,t)$ of the Poincaré polynomial. This deformation reveals a hidden interlacing structure not visible in the one-variable recurrence. For fixed $t<0$, the zero set of $F_m$ in the $y$-direction is controlled by a Sturm--Rolle argument on the interval $0<y<1-t$. The original polynomial is recovered on the slice $y=1$, and the ordered crossings of the moving roots through this slice give both real-rootedness and strict interlacing. Consequently, the Betti numbers of $\overline{\mathcal M}_{0,n}$ form an ultra-log-concave sequence. We further prove real-rootedness and ultra-log-concavity for the Poincaré polynomial of the Fulton--MacPherson space $\mathbb{P}^1[n]$ of $n$ ordered points in degenerations of the complex projective line. The proof for $\overline{\mathcal M}_{0,n}$ was obtained through an iterative AI-assisted workflow with Co-Mathematician, an agentic frontier-model system developed by Google DeepMind. Our role was to formulate the problem, evaluate the proposed proof attempts, identify gaps and request corrections, compare the developing argument with the literature, and refine the presentation of the final proof. Our additional human contribution was to observe that a similar residual deformation strategy applies to the Fulton--MacPherson spaces $\mathbb P^1[n]$, yielding the corresponding real-rootedness theorem.
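
For small $n$ the two conclusions can be checked directly with exact arithmetic. The sketch below counts distinct real roots with a Sturm sequence and tests ultra-log-concavity of the coefficients. The Betti numbers for $n=5,6,7$ are not stated in the abstract; they are the standard values as recalled here and should be treated as an assumption. The function names are hypothetical, and this is a sanity check, not the paper's proof technique.

```python
from fractions import Fraction
from math import comb

def poly_rem(a, b):
    """Remainder of a / b; coefficient lists, lowest degree first."""
    a = a[:]
    while len(a) >= len(b):
        factor = a[-1] / b[-1]
        shift = len(a) - len(b)
        for i, c in enumerate(b):
            a[shift + i] -= factor * c
        a.pop()  # the leading coefficient is now zero
    while a and a[-1] == 0:
        a.pop()
    return a

def sign_changes(signs):
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def count_distinct_real_roots(coeffs):
    """Sturm's theorem: sign changes at -infinity minus sign changes at +infinity."""
    p = [Fraction(c) for c in coeffs]
    dp = [i * c for i, c in enumerate(p)][1:]
    seq = [p, dp]
    while True:
        r = poly_rem(seq[-2], seq[-1])
        if not r:
            break
        seq.append([-c for c in r])
    at_plus = [1 if q[-1] > 0 else -1 for q in seq]
    at_minus = [s * (-1) ** (len(q) - 1) for s, q in zip(at_plus, seq)]
    return sign_changes(at_minus) - sign_changes(at_plus)

def is_ultra_log_concave(coeffs):
    """Log-concavity of a_i / C(d, i), where d is the degree."""
    d = len(coeffs) - 1
    x = [Fraction(c, comb(d, i)) for i, c in enumerate(coeffs)]
    return all(x[i] ** 2 >= x[i - 1] * x[i + 1] for i in range(1, d))

# Coefficients of P_n(t), lowest degree first (assumed values for n = 5, 6, 7)
betti = {5: [1, 5, 1], 6: [1, 16, 16, 1], 7: [1, 42, 127, 42, 1]}

real_rooted = {n: count_distinct_real_roots(c) == len(c) - 1 for n, c in betti.items()}
ulc = {n: is_ultra_log_concave(c) for n, c in betti.items()}
```

A polynomial of degree $d$ with $d$ distinct real roots is real-rooted, so `real_rooted[n]` being true confirms the property for that $n$ under the assumed coefficients.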

2605.31514 2026-06-12 cs.CL cs.AI cs.CY updated version

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

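Adrian de Wynter

AI summary: By training a simple neural network on Age of Empires II, argues that the anthropomorphic attributes ascribed to LLMs are empirically non-unique, and proposes designing experiments under an assumption of LLM non-uniqueness rather than of anthropomorphic attributes.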

Comments Fixed corollary 1, added stat sig

Abstract

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works in the field claim the emergence of generalised anthropomorphic attributes in them (e.g., morality or understanding of natural language), or ascribe or assume such attributes. Our goal is not to argue for or against the existence of these attributes, but to point out that these conclusions could be incorrect. To this end, we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically grounded discussion of these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or do not exist in a system, independently of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This holds regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally, we propose a 'null' assumption, in which one sets up an experiment by assuming LLM non-uniqueness instead of anthropomorphic attributes, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally and Turing-complete.

2605.01727 2026-06-12 cs.AI cs.CY updated version

Are LLMs More Skeptical of Entertainment News?

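Huiqian Lai

AI summary: Examines whether zero-shot LLMs misclassify entertainment news at a higher rate in news credibility assessment, finds differences across models, and probes the causes with style-swap and prompt-mitigation experiments.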

Comments Accepted at the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), co-located with ICWSM 2026, May 26, 2026, Los Angeles, CA, USA

DOI: 10.36190/2026.32
Journal ref: Proceedings of the ICWSM Workshops, MisD 2026: The 2nd Workshop on Misinformation Detection in the Era of LLMs, 2026

Abstract

Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both $p < .001$), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
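
A genre-stratified false-positive analysis of the kind the abstract recommends can be sketched in a few lines. The abstract does not say which significance test was used; the example below assumes a pooled two-proportion z-test, and the counts are hypothetical rather than taken from the paper.

```python
import math

def false_positive_rate(flagged_fake, total_legit):
    """Share of legitimate articles that the model labelled as fake."""
    return flagged_fake / total_legit

def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-sided p-value for H0: the two rates are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts of legitimate articles flagged as fake, per genre
ent_fp, ent_n = 180, 1000    # entertainment news
hard_fp, hard_n = 90, 1000   # hard news

gap_pp = 100 * (false_positive_rate(ent_fp, ent_n)
                - false_positive_rate(hard_fp, hard_n))
z, p_value = two_proportion_z(ent_fp, ent_n, hard_fp, hard_n)
```

Reporting `gap_pp` per model alongside overall accuracy makes a genre asymmetry visible even when aggregate metrics look similar.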

2604.24449 2026-06-12 cs.RO cs.AI cs.LG updated version

SPLIT: Separating Physical-Contact via Latent Arithmetic in Image-Based Tactile Sensors

Wadhah Zai El Amri, Nicolás Navarro-Guerrero

Affiliations: Leibniz Universität Hannover, L3S Research Center

AI summary: Presents SPLIT, which uses latent space arithmetic to separate contact geometry from sensor optical properties, enabling efficient tactile sensor simulation with cross-sensor transfer and bidirectional simulation.

Comments Accepted to Elsevier Robotics and Autonomous Systems Journal

DOI: 10.1016/j.robot.2026.105498

Abstract

Training machine learning models for robotic tactile sensing requires vast amounts of data, yet obtaining realistic interaction data remains a challenge due to physical complexity and variability. Simulating tactile sensors is thus a crucial step in accelerating progress. This paper presents SPLIT, a novel method for simulating image-based tactile sensors, with a primary focus on the DIGIT sensor. Central to our approach is a latent space arithmetic strategy that explicitly disentangles contact geometry from sensor-specific optical properties. Unlike methods that require recalibration for every new unit, this disentanglement allows SPLIT to adapt to diverse DIGIT backgrounds and even transfer data to distinct sensors like the GelSight R1.5 without full model retraining. Beyond this adaptability, our approach achieves faster inference speeds than existing alternatives. Furthermore, we provide a calibrated finite element method (FEM) soft-body mesh simulation with variable resolution, offering a tunable trade-off between speed and fidelity. Additionally, our algorithm supports bidirectional simulation, allowing for both the generation of realistic images from deformation meshes and the reconstruction of meshes from tactile images. This versatility makes SPLIT a valuable tool for accelerating progress in robotic tactile sensing research.

2604.15372 2026-06-12 cs.CR cs.AI cs.MM updated version

The Synthetic Media Shift: Tracking the Rise, Virality, and Detectability of AI-Generated Multimodal Misinformation

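Zacharias Chrysidis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos

Affiliations: Centre for Research and Technology Hellas

AI summary: Presents the CONVEX dataset and studies the spread and consensus dynamics of multimodal misinformation, finding that AI-generated content is highly viral but driven by passive engagement, and that detection performance declines as generative models evolve.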

Abstract

As generative AI advances, the distinction between authentic and synthetic media is increasingly blurred, challenging the integrity of online information. In this study, we present CONVEX, a large-scale dataset of multimodal misinformation involving miscaptioned, edited, and AI-generated visual content, comprising over 150K multimodal posts with associated notes and engagement metrics from X's Community Notes. We analyze how multimodal misinformation evolves in terms of virality, engagement, and consensus dynamics, with a focus on synthetic media. Our results show that while AI-generated content achieves disproportionate virality, its spread is driven primarily by passive engagement rather than active discourse. Despite slower initial reporting, AI-generated content reaches community consensus more quickly once flagged. Moreover, our evaluation of specialized detectors and vision-language models reveals a consistent decline in performance over time in distinguishing synthetic from authentic images as generative models evolve. These findings highlight the need for continuous monitoring and adaptive strategies in the rapidly evolving digital information environment.

2601.02149 2026-06-12 cond-mat.mes-hall cond-mat.dis-nn cs.AI updated version

AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes

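Mateusz Krawczyk, Jarosław Pawłowski

Affiliations: Institute of Theoretical Physics, Wrocław University of Science and Technology

AI summary: Proposes a neural network-based model that learns the working regimes of quantum dot simulators and uses transport measurements to autotune the devices toward Majorana modes. The model is trained without supervision on synthetic conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes.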

Comments 12 pages, 8 figures, 2 tables

DOI: 10.1103/xkbl-ctwn
Journal ref: Phys. Rev. Applied 25, 064032 (2026)

Abstract

We propose a neural network-based model capable of learning the broad landscape of working regimes in quantum dot simulators and of using this knowledge to autotune these devices, based on transport measurements, toward obtaining Majorana modes in the structure. The model is trained in an unsupervised manner on synthetic data in the form of conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes. We show that, with appropriate training, a deep vision-transformer network can efficiently memorize the relation between Hamiltonian parameters and structures in conductance maps and use it to propose parameter updates for a quantum dot chain that drive the system toward the topological phase. Starting from a broad range of initial detunings in parameter space, a single update step is sufficient to generate nontrivial zero modes. Moreover, by enabling an iterative tuning procedure, in which the system acquires updated conductance maps at each step, we demonstrate that the method can address a much larger region of the parameter space.

2511.20162 2026-06-12 cs.CV cs.AI q-bio.NC updated version

Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

Daniel Harari, Michael Sidorov, Chen Shterental, Liel David, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

Affiliations: Weizmann Institute of Science; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Examines how far video LMMs ground their semantic understanding in the actual visual input, and uses contact-release detection to expose shortcomings in physical grounding.

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 workshop on Cognitive Foundations for Multimodal Models (CogVL)

Abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked SoTA LMMs, including GPT, Gemini and Qwen, to locate these events in short videos, each with a single event. The results show that while models reliably name target objects and identify actions, they exhibit a form of 'shortcut learning' where semantic success masks a failure in physical grounding. Specifically, they consistently fail to identify the frame where the interaction begins or ends and poorly localize the physical event within the scene. This disconnect suggests that while LMMs excel at System 1 intuitive pattern recognition (naming the action and objects), they lack the System 2 cognitive foundations required to reason about physical primitives like 'contact' and 'release', and hence to truly ground dynamic scenes in physical reality.

2603.26705 2026-06-12 q-bio.BM cs.AI cs.LG (updated version)

PI-Mamba: Linear-Time Protein Backbone Generation via Spectrally Initialized Flow Matching

Tianyu Wu, Lin Zhu

Affiliations: Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign; School of Information Science, University of Illinois Urbana-Champaign

AI summary: PI-Mamba combines spectral initialization with a flow-matching framework to achieve linear-time inference while guaranteeing exact local covalent geometry, yielding efficient, high-fidelity backbone generation.

Details

DOI: 10.1093/bioinformatics/btag370
Journal ref: Bioinformatics (2026)

Abstract

Motivation: Generative models for protein backbone design have to simultaneously ensure geometric validity, sampling efficiency, and scalability to long sequences. However, most existing approaches rely on iterative refinement, quadratic attention mechanisms, or post-hoc geometry correction, leading to a persistent trade-off between computational efficiency and structural fidelity. Results: We present Physics-Informed Mamba (PI-Mamba), a generative model that enforces exact local covalent geometry by construction while enabling linear-time inference. PI-Mamba integrates a differentiable constraint-enforcement operator into a flow-matching framework and couples it with a Mamba-based state-space architecture. To improve optimisation stability and backbone realism, we introduce a spectral initialization derived from the Rouse polymer model and an auxiliary cis-proline awareness head. Across benchmark tasks, PI-Mamba achieves 0.0% local geometry violations and high designability (scTM = $0.91\pm 0.03$, n = 100), while scaling to proteins exceeding 2,000 residues on a single A5000 GPU (24 GB).

2602.18072 2026-06-12 cs.AR cs.AI (updated version)

HiAER-Spike Software-Hardware Reconfigurable Platform for Event-Driven Neuromorphic Computing at Scale

Gwenevere Frank, Gopabandhu Hota, Keli Wang, Christopher Deng, Krish Arora, Diana Vins, Abhinav Uppal, Omowuyi Olajide, Kenneth Yoshimoto, Qingbo Wang, Mari Yamaoka, Johannes Leugering, Stephen Deiss, Leif Gibb, Gert Cauwenberghs

Affiliations: Institute for Neural Computation, UC San Diego; Fujitsu; Forschungszentrum Jülich; Qernel AI

AI summary: The HiAER-Spike platform executes large spiking neural networks with up to 160 million neurons and 40 billion synapses, uses a modular, reconfigurable architecture for efficient event-driven computing, and provides a Python interface that simplifies network configuration and execution.

Comments Leif Gibb, Gert Cauwenberghs are equal authors. arXiv admin note: substantial text overlap with arXiv:2504.03671

Details

DOI: 10.1038/s44335-026-00062-8
Journal ref: npj Unconventional Computing (2026)

Abstract

In this work, we present HiAER-Spike, a modular, reconfigurable, event-driven neuromorphic computing platform designed to execute large spiking neural networks with up to 160 million neurons and 40 billion synapses - roughly twice the neurons of a mouse brain at faster than real time. This system, assembled at the UC San Diego Supercomputer Center, comprises a co-designed hard- and software stack that is optimized for run-time massively parallel processing and hierarchical address-event routing (HiAER) of spikes while promoting memory-efficient network storage and execution. The architecture efficiently handles both sparse connectivity and sparse activity for robust and low-latency event-driven inference for both edge and cloud computing. A Python programming interface to HiAER-Spike, agnostic to hardware-level detail, shields the user from complexity in the configuration and execution of general spiking neural networks with minimal constraints in topology. The system is made easily available over a web portal for use by the wider community. In the following, we provide an overview of the hard- and software stack, explain the underlying design principles, demonstrate some of the system's capabilities and solicit feedback from the broader neuromorphic community. Examples are shown demonstrating HiAER-Spike's capabilities for event-driven vision on benchmark CIFAR-10, DVS event-based gesture, MNIST, and Pong tasks.

2510.03699 2026-06-12 q-bio.NC cs.AI cs.LG cs.NE cs.SY eess.SY (updated version)

Dissecting Larval Zebrafish Hunting using Deep Reinforcement Learning Trained RNN Agents

Raaghav Malik, Satpreet H. Singh, Sonja Johnson-Yu, Nathan Wu, Roy Harpaz, Florian Engert, Kanaka Rajan

Affiliations: California Institute of Technology; Harvard University

AI summary: This paper trains RNN agents with deep reinforcement learning to study larval zebrafish hunting, showing how ecological and energetic constraints shape adaptive behavior. A simple model reproduces real hunting behavior, and virtual experiments probe how constraints and environment affect hunting dynamics.

Details

DOI: 10.32470/h4pp9b0
Journal ref: Proceedings of the 9th Conference on Cognitive Computational Neuroscience (2026)

Abstract

Larval zebrafish hunting provides a tractable setting to study how ecological and energetic constraints shape adaptive behavior in both biological brains and artificial agents. Here we develop a minimal agent-based model, training recurrent policies with deep reinforcement learning in a bout-based zebrafish simulator. Despite its simplicity, the model reproduces hallmark hunting behaviors -- including eye vergence-linked pursuit, speed modulation, and stereotyped approach trajectories -- that closely match real larval zebrafish. Quantitative trajectory analyses show that pursuit bouts systematically reduce prey angle by roughly half before strike, consistent with measurements. Virtual experiments and parameter sweeps vary ecological and energetic constraints, bout kinematics (coupled vs. uncoupled turns and forward motion), and environmental factors such as food density, food speed, and vergence limits. These manipulations reveal how constraints and environments shape pursuit dynamics, strike success, and abort rates, yielding falsifiable predictions for neuroscience experiments. These sweeps identify a compact set of constraints -- binocular sensing, the coupling of forward speed and turning in bout kinematics, and modest energetic costs on locomotion and vergence -- that are sufficient for zebrafish-like hunting to emerge. Strikingly, these behaviors arise in minimal agents without detailed biomechanics, fluid dynamics, circuit realism, or imitation learning from real zebrafish data. Taken together, this work provides a normative account of zebrafish hunting as the optimal balance between energetic cost and sensory benefit, highlighting the trade-offs that structure vergence and trajectory dynamics. We establish a virtual lab that narrows the experimental search space and generates falsifiable predictions about behavior and neural coding.

2508.19273 2026-06-12 cs.CR cs.AI (updated version)

MixGAN: A Hybrid Semi-Supervised and Generative Approach for DDoS Detection in Cloud-Integrated IoT Networks

Tongxi Wu, Chenwei Xu, Jin Yang

Affiliations: College of Cyber Science and Engineering, Sichuan University; College of Information Science and Technology, Tibet University

AI summary: This paper proposes MixGAN, which combines conditional generation, semi-supervised learning, and robust feature extraction to address complex traffic dynamics, class imbalance, and data scarcity in DDoS detection for cloud-integrated IoT networks. Experiments show it outperforms existing methods in accuracy, TPR, and TNR.

Details

DOI: 10.3233/FAIA251062
Journal ref: ECAI 2025, 28th European Conference on Artificial Intelligence

Abstract

The proliferation of cloud-integrated IoT systems has intensified exposure to Distributed Denial of Service (DDoS) attacks due to the expanded attack surface, heterogeneous device behaviors, and limited edge protection. However, DDoS detection in this context remains challenging because of complex traffic dynamics, severe class imbalance, and scarce labeled data. While recent methods have explored solutions to address class imbalance, many still struggle to generalize under limited supervision and dynamic traffic conditions. To overcome these challenges, we propose MixGAN, a hybrid detection method that integrates conditional generation, semi-supervised learning, and robust feature extraction. Specifically, to handle complex temporal traffic patterns, we design a 1-D WideResNet backbone composed of temporal convolutional layers with residual connections, which effectively capture local burst patterns in traffic sequences. To alleviate class imbalance and label scarcity, we use a pretrained CTGAN to generate synthetic minority-class (DDoS attack) samples that complement unlabeled data. Furthermore, to mitigate the effect of noisy pseudo-labels, we introduce a MixUp-Average-Sharpen (MAS) strategy that constructs smoothed and sharpened targets by averaging predictions over augmented views and reweighting them towards high-confidence classes. Experiments on NSL-KDD, BoT-IoT, and CICIoT2023 demonstrate that MixGAN achieves up to 2.5% higher accuracy and 4% improvement in both TPR and TNR compared to state-of-the-art methods, confirming its robustness in large-scale IoT-cloud environments. The source code is publicly available at https://github.com/0xCavaliers/MixGAN.
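
The MixUp-Average-Sharpen idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption: the sharpening step uses the temperature-based rule from MixMatch, and the function names, the `temperature` parameter, and the mixing coefficient `lam` are hypothetical and not taken from the MixGAN code.

```python
def average_and_sharpen(view_probs, temperature=0.5):
    """Average class probabilities over augmented views, then sharpen them.

    view_probs: one probability vector per augmented view of the same sample.
    temperature: assumed sharpening temperature (values below 1 sharpen).
    """
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    # Step 1: average the predictions over all augmented views.
    mean = [sum(p[c] for p in view_probs) / n_views for c in range(n_classes)]
    # Step 2: raise to the power 1/T and renormalize, which shifts weight
    # toward the classes that already have high confidence.
    powered = [m ** (1.0 / temperature) for m in mean]
    total = sum(powered)
    return [v / total for v in powered]


def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two feature vectors and their soft targets."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Two augmented views of one unlabeled traffic sample (benign vs. DDoS).
target = average_and_sharpen([[0.6, 0.4], [0.8, 0.2]])
mixed_x, mixed_y = mixup([0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [0.0, 1.0], lam=0.5)
```

Here the averaged prediction is [0.7, 0.3], and sharpening moves it further toward the first class; the pseudo-labeled sample can then be mixed with another sample so the model trains on smoothed targets.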

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG (updated version)

A Survey of Deep Learning for Geometry Problem Solving

Jianzhe Ma, Wenxuan Wang, Qin Jin

Affiliations: Renmin University of China

AI summary: This survey covers deep learning for geometry problem solving, including tasks, methods, evaluation metrics, and future directions, and aims to provide a practical reference for the field.

Comments ACL 2026 Main Conference

Details

Abstract

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2306.01690 2026-06-12 cs.LG cs.AI (updated version)

Context selectivity with dynamic availability enables lifelong continual learning

Martin Barry, Wulfram Gerstner, Guillaume Bellec

Affiliations: Department of Life Sciences, Department of Computer Sciences

AI summary: This paper proposes a meta-plasticity rule based on context selectivity and dynamic availability; in simulation, the model outperforms existing continual learning algorithms on image recognition and natural language processing tasks.

Details

DOI: 10.1016/j.neunet.2025.107728

Abstract

"You never forget how to ride a bike", but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity and neuron-wide consolidation are a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.

1. Agents, Planning, and Decision-Making (22 papers)

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Strategic Decision Support for AI Agents

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Can I Buy Your KV Cache?

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Agentic MPC for Semantic Control System Resynthesis

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Parthenon Law: A Self-Evolving Legal-Agent Framework

ARROW: Augmented Replay for RObust World models

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

2. Knowledge Representation, Reasoning, and Symbolic AI (7 papers)

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

Agents-K1: Towards Agent-native Knowledge Orchestration

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

What Type of Inference is Active Inference?

The KG-ER Conceptual Schema Language

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

3. Multi-Agent Systems and Game Theory (9 papers)

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

Multiagent Protocols with Aggregated Confidence Signals

Reward Modeling for Multi-Agent Orchestration

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

Competition and Diversity in Generative AI

4. Search, Optimization, and Constraint Solving (2 papers)

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

5. Machine Learning and Representation Learning (43 papers)

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

The Hidden Power of Scaling Factor in LoRA Optimization

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

ReCal: Reward Calibration for RL-based LLM Routing

Representing Time Series as Structured Programs for LLM Reasoning

Boosting Direct Preference Optimization with Penalization

Two-Layer Linear Auto-Regressive Models Estimate Latent States

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

Emotional regulation improves deep learning-based image classification

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Towards More General Control of Diffusion Models Using Jeffrey Guidance

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

PlaceRep: Geospatial Place Representation Learning from Large-Scale Point-of-Interest Data

Meta-Learning Transformers to Improve In-Context Generalization

Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

Decentralized Autoregressive Generation

Hellinger Multimodal Variational Autoencoders