arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Agents, Planning, and Decision-Making (22 papers)

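2606.12563 2026-06-12 cs.AI New submission

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

Affiliations: AMD

AI summary: Proposes Arbor, a multi-agent framework that uses structured tree search as a cognition layer to enable autonomous optimization in large, stateful action spaces, achieving up to a 193% throughput-latency Pareto improvement on LLM inference optimization.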

English abstract

Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation -- a checks-and-balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platforms, and run-to-run variance is within 2 percentage points, demonstrating that the method is hardware-agnostic and reproducible.

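2606.12587 2026-06-12 cs.AI cs.HC New submission

Strategic Decision Support for AI Agents

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

Affiliations: University of Pennsylvania

AI summary: Addresses reliability when AI agents are the primary decision-makers by proposing a strategic decision-support framework, posed as an optimization problem that minimizes support usage while controlling a counterfactual missed-support error, and develops an online algorithm that adaptively thresholds a support score.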

English abstract

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost-value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human-AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.
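
The threshold rule with randomized exploration described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function name `run_support_policy`, the quantile-tracking update, the step size, and the exploration probability are all assumptions made for illustration. The agent asks for support when a score reaches the threshold, and occasionally asks at random so that it can observe whether the threshold rule alone would have skipped support that helps.

```python
import random

def run_support_policy(stream, alpha=0.1, explore_prob=0.2, step=0.02, seed=0):
    """Online threshold rule with randomized exploration (illustrative only).

    `stream` yields (score, support_helps) pairs. `score` in [0, 1] is a
    hypothetical estimate of the value of support; `support_helps` says
    whether support would have materially improved the output.
    Returns (final_threshold, support_calls, rounds).
    """
    rng = random.Random(seed)
    threshold, calls, rounds = 0.5, 0, 0
    for score, support_helps in stream:
        rounds += 1
        explore = rng.random() < explore_prob
        if score >= threshold or explore:
            calls += 1  # support is requested this round
        if explore:
            # Support was requested at random, so the counterfactual is observed:
            # would the threshold rule alone have skipped support that helps?
            missed = 1.0 if (score < threshold and support_helps) else 0.0
            # Assumed update: lower the threshold after a miss, raise it
            # slightly otherwise, so the miss rate drifts toward alpha.
            threshold -= step * (missed - alpha)
            threshold = min(1.0, max(0.0, threshold))
    return threshold, calls, rounds
```

On a synthetic stream where support helps exactly when the score exceeds 0.3 and scores are uniform on [0, 1], the threshold in this sketch settles near 0.4, the point at which the fraction of skipped-but-helpful rounds is about `alpha = 0.1`.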

2606.12657 2026-06-12 cs.AI cs.DB cs.RO New submission

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong

Affiliations: Emory University; University of Florida

AI summary: Proposes TrajGenAgent, a fine-tuning-free hierarchical LLM-agent framework that generates realistic trajectories with a two-stage orchestrator-worker design and outperforms existing methods in spatiotemporal fidelity, semantic coherence, and individual-level behavioral realism.

Comments: 14 pages, 2 figures, 8 tables. Accepted by the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

English abstract

Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.

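2606.12674 2026-06-12 cs.AI New submission

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao

AI summary: Proposes Evoflux, an inference-time evolutionary search method that repairs the tool workflows of compact language models through structured edits and execution feedback, raising execution feasibility from roughly 3% to 17-24%. SFT-based baselines match, underperform, or collapse, while ReAct reaches higher peaks at higher variance and token cost.

Comments: Code is available at https://github.com/IBM/Evoflux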

English abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.

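2606.12852 2026-06-12 cs.AI New submission

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Renmin Cheng, Changhao Chen

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes WISE, a framework that augments episodic memory with a Causal Event Graph linking what-where-when memory to which-why reasoning, combined with opportunistic task scheduling and multi-scale exploration, substantially improving success rate and efficiency on long-horizon sparse tasks.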

English abstract

Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

2606.12882 2026-06-12 cs.AI New submission

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang

AI summary: Proposes HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection, reducing token usage and trajectory length and generalizing to larger models.

English abstract

Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent-environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether a harness can be generated by a learnable plug-in module that can be trained in an end-to-end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection. HarnessBridge learns two projections: an observation projection, which distills raw trajectories into compact, decision-relevant states, and an action projection, which converts proposed actions into executable transitions or trajectory-grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal-Bench 2.0 and SWE-bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

2606.12924 2026-06-12 cs.AI New submission

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

Jetlir Duraj, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma, Rui Kong, Ishita Khan, Qunzhi Zhou

Affiliations: eBay Inc.

AI summary: Proposes a modular two-agent simulation framework that holds the buyer agent fixed to compare responder designs, finds that rolling-window memory beats intent-extraction memory in both quality and speed, and uses failure analysis to cut failure and near-failure rates by 62%.

English abstract

We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2,011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16-0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

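2606.12945 2026-06-12 cs.AI New submission

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Zhibao Chen, Qian Cheng

Affiliations: Huatai Securities; OneBeget.com

AI summary: Addresses memory management for long-running LLM agents with a multi-factor memory value function grounded in cognitive psychology. Its weights are learned by gradient-free optimization and a single score jointly controls encoding depth, forget risk, and retrieval rank, clearly outperforming single-factor and recency policies on LongMemEval.

Comments: 11 pages, 3 figures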

English abstract

Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency -- both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function V(m) = Σ_i w_i f_i(m) over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at ≈ 0.98 -- this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains 0.770 ± 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable -- reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
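
The linear value function in this abstract is simple enough to sketch directly. The Python below is a minimal illustration, not the paper's code: the factor names follow the abstract, but the weight values, the factor scores, and the helper names `memory_value` and `consolidate` are hypothetical. The weights only mimic the qualitative pattern reported (reliability, emotional intensity, and self/user relevance dominant; goal relevance down-weighted).

```python
FACTORS = (
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
)

def memory_value(factors, weights):
    """V(m) = sum_i w_i * f_i(m) over the seven named factors."""
    return sum(weights[name] * factors[name] for name in FACTORS)

def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(
        memories,
        key=lambda m: memory_value(m["factors"], weights),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical weights (they sum to 1.0), for illustration only.
weights = {name: 0.1 for name in FACTORS}
weights.update(
    reliability=0.3,
    emotional_intensity=0.2,
    self_user_relevance=0.2,
    goal_relevance=0.0,
)

# Hypothetical memories with made-up factor scores in [0, 1].
memories = [
    {"id": "allergy",
     "factors": {**dict.fromkeys(FACTORS, 0.2),
                 "reliability": 0.9, "self_user_relevance": 0.9}},
    {"id": "small_talk", "factors": dict.fromkeys(FACTORS, 0.2)},
    {"id": "query_echo",
     "factors": {**dict.fromkeys(FACTORS, 0.1), "goal_relevance": 1.0}},
]

kept = [m["id"] for m in consolidate(memories, weights, budget=2)]
# kept == ["allergy", "small_talk"]
```

With these made-up numbers the values are 0.55, 0.20, and 0.10, so a budget of two drops `query_echo`: its only strong factor is goal relevance, which carries zero weight in this sketch.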

2606.13038 2026-06-12 cs.AI New submission

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior

Haowei Qian

Affiliations: Independent Researcher

AI summary: Addresses cognitive monoculture among LLM agents in prediction markets with Nous, which extracts eight-dimension behavioral profiles from real trading behavior and injects them via prompts. Extraction partially works, but prompt injection fails to transmit cognitive diversity.

Comments: 37 pages, 1 figure, 7 tables. Reproduction artifacts (code, frozen profiles, prompts, model outputs): https://github.com/WillChienT/nous-paper

English abstract

As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score -- a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: https://github.com/WillChienT/nous-paper

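2606.13220 2026-06-12 cs.AI cs.CE cs.ET cs.LG cs.MA New submission

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Fabrizio Marozzo, Pietro Liò

Affiliations: University of Calabria; University of Cambridge

AI summary: Proposes LLM-as-an-Investigator, an evidence-first agentic method that estimates problem ambiguity, generates hypotheses, asks clarifying questions, and updates hypothesis probabilities, avoiding premature acceptance of user assumptions and improving diagnostic accuracy.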

English abstract

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
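
The loop of updating hypothesis probabilities after each answer and stopping once one explanation clearly leads can be sketched as follows. The abstract does not state the update rule, so this Python example assumes a plain Bayes update and a fixed lead margin as the stopping test; the hypotheses, likelihood values, and function names are hypothetical.

```python
def update_posterior(prior, likelihoods):
    """Bayes rule: posterior(h) is proportional to prior(h) * P(answer | h)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def leading_hypothesis(posterior, margin=0.5):
    """Return the top hypothesis once it leads the runner-up by `margin`, else None."""
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_next) = ranked[0], ranked[1]
    return best if p_best - p_next >= margin else None

# Hypothetical hydraulic case: three candidate causes, uniform prior.
belief = {"worn_pump": 1 / 3, "stuck_valve": 1 / 3, "hose_leak": 1 / 3}

# Hypothetical likelihoods of two user answers under each cause.
answers = [
    {"worn_pump": 0.8, "stuck_valve": 0.4, "hose_leak": 0.1},
    {"worn_pump": 0.9, "stuck_valve": 0.2, "hose_leak": 0.5},
]

verdicts = []
for likelihoods in answers:
    belief = update_posterior(belief, likelihoods)
    verdicts.append(leading_hypothesis(belief))
# verdicts == [None, "worn_pump"]
```

After the first answer the lead of `worn_pump` over `stuck_valve` is about 0.31, below the assumed margin, so the sketch keeps investigating; after the second answer the lead is about 0.75 and it commits.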

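2606.13262 2026-06-12 cs.AI New submission

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification

Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University

AI summary: Proposes ProFact, a framework that uses agentic reinforcement learning to optimize the multi-stage fact-verification pipeline end to end, introducing process-aware rewards to address sparse, delayed supervision and improving verification performance and inference efficiency.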

English abstract

Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

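2606.13361 2026-06-12 cs.AI cs.CE cs.MA New submission

Can I Buy Your KV Cache?

Luoyuan Zhang

Affiliations: Harbin Institute of Technology, Shenzhen (HITSZ)

AI summary: Addresses AI agents redundantly recomputing the KV cache for the same document: a publisher precomputes the KV cache and other agents pay to load it and skip prefill. Experiments on Qwen3-4B show reuse is 9-50x cheaper in compute than prefill, and the paper frames an agent-native prefill CDN.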

English abstract

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
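
The headline figures in this abstract can be checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted in the abstract (80M agents, about $1.5M to re-prefill, a 49.7x compute saving); the function name `reuse_economics` and the per-agent breakdown are illustrative additions, and the sketch assumes the re-prefill total scales linearly with the number of agents.

```python
def reuse_economics(agents, reprefill_total_usd, compute_ratio):
    """Back-of-envelope costs for serving one cached document to many agents."""
    per_agent_prefill = reprefill_total_usd / agents
    reuse_total = reprefill_total_usd / compute_ratio
    return {
        "per_agent_prefill_usd": per_agent_prefill,
        "per_agent_reuse_usd": per_agent_prefill / compute_ratio,
        "reuse_total_usd": reuse_total,
        "saved_usd": reprefill_total_usd - reuse_total,
    }

stats = reuse_economics(
    agents=80_000_000,          # from the abstract
    reprefill_total_usd=1.5e6,  # from the abstract (~$1.5M)
    compute_ratio=49.7,         # from the abstract (49.7x less compute)
)
# stats["reuse_total_usd"] is about 30,181, i.e. ~$0.03M as quoted.
```

Under these inputs each re-prefill costs about $0.019 per agent, and dividing the $1.5M total by 49.7 gives roughly $30,000, which matches the abstract's rounded ~$0.03M of reuse compute.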

2606.13368 2026-06-12 cs.AI cs.CV New submission

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

Affiliations: University of Science and Technology of China

AI summary: Proposes IterCAD, a multimodal agent framework for closed-loop, interactive CAD generation and editing, optimized with progressive SFT and geometry-aware reinforcement learning, that significantly outperforms existing methods in code executability and geometric precision.

English abstract

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
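
The abstract names a Chamfer Distance Tolerance-Recall (CD-TR) curve and an AUC-TR metric without defining them. One plausible reading, sketched below in Python, is that recall at a tolerance is the fraction of all test cases whose Chamfer distance falls within that tolerance, with non-executable outputs counted as misses so that failures are not filtered out. This definition, the normalized trapezoidal AUC, and the function names are assumptions, not the paper's specification.

```python
def tolerance_recall(distances, tolerances):
    """Recall at each tolerance over ALL cases.

    `distances` holds one Chamfer distance per test case, or None when the
    generated CAD code failed to execute (counted as a miss at every tolerance).
    """
    n = len(distances)
    return [
        sum(1 for d in distances if d is not None and d <= t) / n
        for t in tolerances
    ]

def auc_tr(tolerances, recalls):
    """Trapezoidal area under the curve, normalized by the tolerance range."""
    area = sum(
        (t1 - t0) * (r0 + r1) / 2
        for t0, t1, r0, r1 in zip(tolerances, tolerances[1:], recalls, recalls[1:])
    )
    return area / (tolerances[-1] - tolerances[0])

# Hypothetical results: three executable outputs and one failed execution.
distances = [0.01, 0.04, None, 0.2]
tolerances = [0.0, 0.05, 0.1, 0.25]

recalls = tolerance_recall(distances, tolerances)  # [0.0, 0.5, 0.5, 0.75]
score = auc_tr(tolerances, recalls)                # 0.525
```

Because the failed case stays in the denominator, recall in this sketch can never reach 1.0 for this example, which is the sense in which such a metric would combine code validity with geometric precision.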

2606.12485 2026-06-12 cs.LG cs.AI 交叉投稿

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

面向质量多样性的Web智能体模仿的推测性回滚修正

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

发表机构 * Beihang University(北京航空航天大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) The Hong Kong University of Science and Technology(香港科技大学) Northwestern Polytechnical University(西北工业大学) Tsinghua University(清华大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学)

AI总结 提出推测性回滚修正(SRC)框架,通过固定视野分支审查和回滚机制,在减少教师查询的同时保持轨迹多样性,在WebArena-Infinity上收集了977条通过验证的轨迹和9183个下一步动作示例。

详情
AI中文摘要

通过从专家轨迹进行模仿学习来训练交互式Web智能体已成为一种非常有效的方法。然而,在此背景下,确定专家干预的最佳时机是一个关键挑战。延迟干预往往导致早期错误的累积,将页面状态推入不可恢复的区域。相反,过早或过度干预会使智能体过度依赖专家策略,将模型困在以单一刚性轨迹为特征的局部最优中。我们提出推测性回滚修正(SRC),一种针对可重置智能体环境的分支级模仿框架。SRC不是在每个访问状态请求教师标签,也不是仅在完成轨迹后修正,而是采用固定视野分支审查:学生先执行一个短的推测性片段,然后由教师审查,仅当局部进展中断时,教师才定位第一个有害偏差。回滚保留有用的前缀,而成功的展开由硬验证器过滤并保留在轻量级质量多样性档案中。所得数据支持对局部修正和通过验证器的轨迹进行下一步动作监督微调。在WebArena-Infinity上,SRC收集了977条通过验证器的轨迹和9183个下一步动作示例;与逐步审查相比,固定视野审查在保留通过验证器的解决方案变体的同时,改善了恢复与查询的权衡。代码可在 https://github.com/LongkunHao/SRC_gui_agent 获取。

英文摘要

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.
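The review-and-rollback loop described above can be sketched on a toy task. Everything here is hypothetical: the `EXPERT` sequence, the length-based `student`, the progress check, and the teacher are stand-ins chosen so the example runs on its own; this is not the released SRC code.

```python
EXPERT = list(range(8))  # hypothetical ground-truth action sequence for the toy task

def student(prefix):
    """Toy student policy: correct everywhere except at steps 3 and 6."""
    step = len(prefix)
    return -1 if step in (3, 6) else step

def local_progress_ok(prefix, segment):
    """Cheap check standing in for 'local progress' (here: compare with EXPERT)."""
    return all(a == EXPERT[len(prefix) + i] for i, a in enumerate(segment))

def teacher_localize(prefix, segment):
    """Toy teacher: index of the first harmful action and its correction."""
    for i, action in enumerate(segment):
        expected = EXPERT[len(prefix) + i]
        if action != expected:
            return i, expected
    raise ValueError("segment contains no deviation")

def collect(horizon):
    prefix, corrections, teacher_queries = [], [], 0
    while len(prefix) < len(EXPERT):
        # The student speculates for up to `horizon` steps before any review.
        segment = []
        for _ in range(min(horizon, len(EXPERT) - len(prefix))):
            segment.append(student(prefix + segment))
        if local_progress_ok(prefix, segment):
            prefix = prefix + segment
            continue
        # Progress broke: query the teacher once for the first harmful deviation.
        teacher_queries += 1
        bad_index, fix = teacher_localize(prefix, segment)
        kept = prefix + segment[:bad_index]   # rollback keeps the useful prefix
        corrections.append((kept, fix))       # localized next-action example
        prefix = kept + [fix]
    return prefix, corrections, teacher_queries

trajectory, corrections, queries = collect(horizon=3)
print(trajectory, corrections, queries)
```

In this toy run the teacher is queried twice, whereas labeling every visited state would take eight queries; the real tradeoff depends on the environment and horizon.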

2606.12634 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

保持策略梯度主导:面向长程工具使用智能体的兄弟引导信用蒸馏

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

发表机构 * Amazon Web Services(亚马逊云服务)

AI总结 针对长程工具使用强化学习中轨迹级优势信号稀疏的问题,提出兄弟引导信用蒸馏(SGCD),通过动态采样成功与失败轨迹、外部LLM对比生成逐步信用参考,实现密集信用分配,在AppWorld和τ³-airline任务上显著提升性能。

Comments 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

详情
AI中文摘要

长程工具使用强化学习可以从结果验证中学习,但其轨迹级优势被广播到许多推理、API和答案令牌上。自蒸馏通过重用策略自身的轨迹或借助特权教师,有望提供更密集的信号。然而,我们表明直接的令牌级自蒸馏会悄然破坏工具使用:它复述教师行为而不知道验证器奖励哪些动作,因此有用技能和有害捷径被一起放大。我们引入兄弟引导信用蒸馏(SGCD),它使用蒸馏进行信用分配而非作为竞争性的演员损失。动态采样产生混合的成功和失败的兄弟轨迹;外部LLM将其对比总结为仅用于训练的逐步信用参考;密集的教师/学生散度驱动信用重新分配;有界且分离梯度的信用权重重塑GRPO令牌优势。部署的学生看不到外部LLM、兄弟证据或预言机。在AppWorld和τ³-airline上,SGCD优于匹配的GRPO比较器:AppWorld上test_normal的TGC从42.9提升到45.6,test_challenge从24.7提升到27.0;τ³-airline的pass@1从0.583提升到0.602。

英文摘要

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and τ³-airline, SGCD improves over matched GRPO comparators: AppWorld TGC 42.9 → 45.6 on test_normal and 24.7 → 27.0 on test_challenge, and τ³-airline pass@1 0.583 → 0.602.
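To make "bounded detached credit weights reshape GRPO token advantages" concrete, here is a minimal sketch under stated assumptions: the advantage is the usual group-normalized reward, and each token's weight is its teacher/student divergence relative to the rollout mean, clipped to a fixed range. The weighting rule, the bounds, and the function names are illustrative guesses, not the paper's formulas.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: one scalar per rollout in the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reshape_token_advantages(advantage, divergences, low=0.5, high=1.5, eps=1e-8):
    """Spread one rollout's advantage over its tokens with bounded credit weights.

    The weights are plain numbers here; in an autodiff framework they would be
    wrapped in a stop-gradient so they rescale the policy-gradient term without
    adding a separate distillation loss.
    """
    avg = mean(divergences)
    weights = [min(high, max(low, d / (avg + eps))) for d in divergences]
    return [advantage * w for w in weights]

# Hypothetical group of four sibling rollouts: two verified successes, two failures.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Hypothetical per-token divergences for the first rollout.
token_advantages = reshape_token_advantages(advantages[0], [0.1, 0.1, 0.4, 0.2])
print(advantages, token_advantages)
```

Clipping keeps every token weight inside [low, high], so the sign of the rollout-level advantage is never flipped and no single token can dominate the update.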

2606.12774 2026-06-12 eess.SY cs.AI cs.CL cs.SY 交叉投稿

Agentic MPC for Semantic Control System Resynthesis

用于语义控制系统再综合的智能体MPC

Yuya Miyaoka, Masaki Inoue

AI总结 提出智能体MPC框架,通过集成大语言模型智能体实现上下文感知的语义自适应控制综合,在自动驾驶场景中验证其根据个人偏好或社交情境(如避让应急车辆)调整控制的能力。

Comments 7 pages, 5 figures

详情
AI中文摘要

虽然MPC能有效处理结构化、多样化和低层级的规范,但它缺乏动态融入高层级上下文信息(如社会规范、用户意图或自然语言指令)的能力。为解决这一局限,本文引入了一种智能体MPC框架,通过集成基于大语言模型的智能体,实现上下文感知、语义自适应的控制综合。该智能体解释异构输入,包括自然语言消息、环境观测和外部知识,以重新综合控制规范。该框架的有效性在自动驾驶场景中得到验证:系统能够契合个人偏好,或对社交情境(如避让应急车辆)做出响应。

英文摘要

While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.

2606.12830 2026-06-12 cs.CV cs.AI 交叉投稿

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

感知、交互、推理:构建工具增强的视觉智能体用于空间推理

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

发表机构 * Tsinghua University(清华大学) Virginia Tech(弗吉尼亚理工大学) NVIDIA(英伟达)

AI总结 提出PERIA智能体,通过视觉感知和交互工具增强VLM的空间推理能力,在13个基准上优于同类模型7.0%-14.8%。

详情
AI中文摘要

尽管最近的视觉语言模型(VLM)展示了强大的多模态理解能力,但在需要主动证据获取和多步视觉交互的空间推理任务中仍存在局限。这种局限性表明,仅依赖视觉编码器的隐式视觉表示不足以恢复细粒度的空间证据。我们引入了PERception-Interaction-reason Agent(PERIA),一种用于地图推理、视觉探测和视觉重建等空间推理任务的工具增强视觉智能体。PERIA使用两类轻量工具:视觉感知工具用于暴露文本、符号和空间证据,以及视觉交互工具用于操作视觉上下文、追踪路径和验证空间关系。为了训练PERIA,我们开发了一种统一方案,结合了监督式工具使用轨迹合成、复合奖励和观察松弛的组内组策略优化(OR-GIGPO),以实现有效的多工具行为。在来自8个数据集的13个基准上的实验表明,PERIA-8B在分布内基准上比Qwen3-8B骨干网络提高了10.0%,在分布外基准上提高了4.4%,同时比之前类似规模的先进基线高出7.0%-14.8%。它还实现了与更大模型(如Qwen3-VL-235B-A22B-Thinking和GPT-5)相当的性能,证明了PERIA在增强空间推理能力方面的有效性。

英文摘要

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.13239 2026-06-12 cs.SE cs.AI cs.CL cs.CV 交叉投稿

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

ComAct: 通过COM即行动范式重构专业软件操作

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

AI总结 提出COM即行动范式,将专业软件交互转化为确定性程序合成,解决GUI代理的脆弱性和API代理的异构性问题;构建ComCADBench基准和ComActor自校正代理,在工业CAD软件上实现SOTA性能。

详情
AI中文摘要

现有的计算机使用代理在专业软件操作上仍然存在根本性限制:基于GUI的代理受困于脆弱的视觉基础和长程错误累积,而基于API的方法则难以应对异构协议和不可访问的商业接口。在这项工作中,我们将组件对象模型(COM)识别为统一的、可执行的抽象,提出了COM即行动:一种新的范式,将专业软件交互重新定义为确定性程序合成,而非顺序视觉控制。为了在最苛刻的环境中验证这一范式,我们引入了ComCADBench,这是首个针对操作真实工业CAD软件的代理的基准测试。我们的实验揭示了显著的范式差距:前沿的专有模型在基于GUI的交互下几乎无法成功,而基于COM的执行则带来了实质性的即时收益。为了弥合语法正确性与几何精度之间的剩余差距,我们开发了ComActor,一个通过渐进式三阶段框架训练的自校正代理,以及ComForge,一个用于在Windows容器中进行大规模训练的可扩展平台。大量实验表明,ComActor在ComCADBench上达到了最先进的性能,在基线崩溃的长程任务中表现出强大的韧性,并泛化到外部CAD基准测试。

英文摘要

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

2606.13673 2026-06-12 cs.CV cs.AI 交叉投稿

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

SpatialClaw:重新思考智能体空间推理的动作接口

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

发表机构 * KAIST(韩国科学技术院) NVIDIA(英伟达)

AI总结 提出SpatialClaw框架,以代码作为动作接口,通过状态化Python内核和感知几何原语,使VLM智能体逐步执行并灵活组合中间结果,在20个3D/4D空间推理基准上平均准确率59.9%,比现有方法高11.2个百分点。

Comments Project page: https://spatialclaw.github.io/

详情
AI中文摘要

空间推理——确定物体在3D空间中的位置、关系及运动方式的能力——仍然是视觉语言模型(VLM)面临的基本挑战。工具增强型智能体试图通过为VLM添加专业感知模块来解决这一问题,但其有效性受限于调用这些工具的动作接口。本文研究该接口的设计如何影响智能体进行开放式空间推理的能力。现有的空间智能体要么采用单次代码执行,即在观察到任何中间结果之前就确定完整的分析策略;要么依赖结构化的工具调用接口,这通常缺乏自由组合操作或针对每个任务定制分析的灵活性。这两种设计对开放式、复杂的3D/4D空间推理的灵活性有限。因此,我们提出SpatialClaw,一个无需训练的空间推理框架,采用代码作为动作接口。SpatialClaw维护一个状态化的Python内核,预加载输入帧和一套感知与几何原语,让基于VLM的智能体在每一步根据所有先前输出编写一个可执行单元,从而灵活地组合和操作感知结果,并根据中间文本和视觉观察以及每个问题的需求调整其分析。在涵盖广泛静态和动态3D/4D空间推理任务的20个空间推理基准上评估,SpatialClaw实现了59.9%的平均准确率,比最新的空间智能体高出11.2个百分点,并且在来自两个模型家族的六个VLM骨干网络上均取得一致提升,无需任何基准或模型特定的适配。

英文摘要

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
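The "one executable cell per step on a stateful kernel" idea can be illustrated with a few lines of standard-library Python. This is only a sketch of the interface pattern: the `StatefulKernel` class and the `frames` placeholder are hypothetical, no perception primitives are included, and running model-written code with `exec` would need real sandboxing in practice.

```python
import contextlib
import io

class StatefulKernel:
    """Keeps one namespace alive across cells, like a notebook kernel."""

    def __init__(self, preloaded=None):
        self.namespace = dict(preloaded or {})
        self.history = []  # (source, captured output) for every executed cell

    def run_cell(self, source):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)  # variables persist in self.namespace
        output = buffer.getvalue()
        self.history.append((source, output))
        return output

# Hypothetical usage: the second cell reuses a variable defined by the first.
kernel = StatefulKernel({"frames": ["frame_0", "frame_1"]})
print(kernel.run_cell("count = len(frames)\nprint(count)"))
print(kernel.run_cell("print(count * 2)"))
```

The contrast with single-pass execution is that the agent can read the first cell's output before deciding what the second cell should compute.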

2606.04602 2026-06-12 cs.AI 版本更新

Parthenon Law: A Self-Evolving Legal-Agent Framework

Parthenon Law: 一种自我进化的法律智能体框架

Hejia Geng, Leo Liu

发表机构 * tapntell.ai

AI总结 本文提出Parthenon框架,通过分解模型、工具、知识等组件并引入反泄漏学习循环,使法律领域的大语言模型智能体能够从经验中自我进化,显著提升法律事务处理性能。

详情
AI中文摘要

随着智能体能力的增强,法律领域的大语言模型智能体有望将文档密集型事务转化为可审查的工作产品——然而可靠部署面临三个障碍:缺乏关于当前最强模型与框架组合在端到端法律事务上行为的大规模证据;没有适应法律垂直领域的智能体架构,只有通用框架;以及在不断变化的事实、权威和截止日期环境中,缺乏系统从自身结果中学习的机制。我们逐一解决这些问题。在Harvey LAB上进行的大规模实证研究——包含12,510条智能体轨迹——表明即使是前沿智能体也无法一次性完成事务:每项标准的准确率随模型增强而提高,但严格的事务完成率停滞不前。然后我们引入Parthenon,一种自我进化的法律智能体框架,将模型、框架、智能体角色、法律知识、确定性工具和程序技能分解为可审计的表面,以实现来源可追溯性、日期和数字接地、交付物合规性和问题关闭。最后,一个反泄漏学习循环将评分失败转化为对技能、工具和知识的任务无关编辑,使系统能够随着经验改进——就像律所在每个事务后完善其检查清单和操作手册——而不触及模型权重。在我们的大规模实证分析中,Parthenon显著提升了最先进模型和框架在法律事务任务上的性能。

英文摘要

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- 12,510 agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2603.11395 2026-06-12 cs.LG cs.AI 版本更新

ARROW: Augmented Replay for RObust World models

ARROW:增强重放用于鲁棒世界模型

Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo

发表机构 * Imam Mohammad Ibn Saud Islamic University (IMSIU)(伊玛姆·穆罕默德·本·沙特伊斯兰大学) Monash University(莫纳什大学) University of New South Wales, Sydney(新南威尔士大学,悉尼) Cerenaut

AI总结 本文提出ARROW算法,一种基于模型的持续强化学习方法,通过高效的重放缓冲区减少灾难性遗忘,提升在无共享结构任务和有共享结构任务中的表现。

Comments 36 pages and 11 figures (includes Appendix)

详情
Journal ref Transactions on Machine Learning Research, 2026
AI中文摘要

持续强化学习挑战智能体在获取新技能的同时保留已学习技能,以提高过去和未来任务的性能。大多数现有方法依赖于无模型方法和重放缓冲区来缓解灾难性遗忘;然而,这些解决方案往往面临显著的可扩展性挑战,因为内存需求大。受神经科学启发,其中大脑将经验重放给预测世界模型而不是直接重放到策略中,我们提出了ARROW(增强重放用于鲁棒世界模型),一种扩展DreamerV3的基于模型的持续RL算法,具有内存高效、分布匹配的重放缓冲区。与标准固定大小的FIFO缓冲区不同,ARROW维护两个互补的缓冲区:一个短期缓冲区用于近期经验,一个长期缓冲区通过智能采样保留任务多样性。我们在两个具有挑战性的持续RL设置中评估了ARROW:无共享结构任务(Atari)和有共享结构任务(Procgen CoinRun变体)。与相同大小的无模型和基于模型的基线方法相比,ARROW在无共享结构任务中表现出显著减少的遗忘,同时保持可比的前向转移。我们的发现突显了基于模型的RL和生物启发方法在持续强化学习中的潜力,值得进一步研究。

英文摘要

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
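A two-buffer replay store of the kind described above might look like the following sketch. The abstract does not specify the "intelligent sampling" rule, so reservoir sampling is swapped in here as a simple stand-in that keeps a uniform sample over everything seen so far; the class name, capacities, and sampling method are all assumptions.

```python
import random
from collections import deque

class DualReplayBuffer:
    """Short-term FIFO buffer plus a long-term reservoir over the whole stream."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # recent experience only
        self.long = []                             # sample spanning all tasks seen
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        self.short.append(transition)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(transition)
        else:
            # Reservoir sampling: every item seen so far is retained with
            # equal probability long_capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.long_capacity:
                self.long[slot] = transition

    def sample(self, batch_size):
        pool = list(self.short) + self.long
        return [self.rng.choice(pool) for _ in range(batch_size)]

# Hypothetical stream of 100 transitions, represented by their indices.
buffer = DualReplayBuffer(short_capacity=3, long_capacity=5)
for step in range(100):
    buffer.add(step)
print(list(buffer.short), sorted(buffer.long))
```

Memory stays fixed at short_capacity + long_capacity regardless of how many tasks are encountered, which is the property a fixed-size FIFO buffer lacks once old tasks scroll out.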

2604.08958 2026-06-12 cs.LG cs.AI cs.RO 版本更新

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

WOMBET:基于世界模型的经验迁移实现鲁棒且样本高效的强化学习

Mintae Kim, Koushil Sreenath

发表机构 * Hybrid Robotics, UC Berkeley(加州大学伯克利分校混合机器人实验室)

AI总结 提出WOMBET框架,通过源任务中学习世界模型并生成不确定性惩罚的离线数据,再结合自适应采样进行在线微调,实现鲁棒且样本高效的强化学习迁移。

Comments 13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)

详情
AI中文摘要

机器人领域的强化学习通常受限于数据收集的成本和风险,因此需要从源任务向目标任务进行经验迁移。离线到在线强化学习利用先验数据,但通常假设给定固定数据集,并未解决如何生成可靠数据进行迁移的问题。我们提出基于世界模型的经验迁移(WOMBET)框架,该框架联合生成和利用先验数据。WOMBET在源任务中学习世界模型,并通过不确定性惩罚规划生成离线数据,随后筛选出高回报和低认知不确定性的轨迹。然后,它通过在离线数据和在线数据之间进行自适应采样,在目标任务中进行在线微调,实现了从先验驱动的初始化到任务特定适应的稳定过渡。我们证明了不确定性惩罚目标提供了真实回报的下界,并推导了有限样本误差分解,捕捉了分布不匹配和近似误差。实验上,WOMBET在连续控制基准测试中相比强基线提高了样本效率和最终性能,展示了联合优化数据生成和迁移的益处。

英文摘要

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
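One common way to realize an uncertainty penalty is to score a trajectory by the mean return predicted by an ensemble of world models minus a multiple of the ensemble's disagreement. The sketch below uses that stand-in, together with a simple threshold filter; the ensemble-disagreement measure, the penalty coefficient, and the thresholds are assumptions and may not match the paper's objective.

```python
from statistics import mean, pstdev

def penalized_return(ensemble_returns, penalty=1.0):
    """Mean predicted return minus a penalty on ensemble disagreement."""
    return mean(ensemble_returns) - penalty * pstdev(ensemble_returns)

def filter_trajectories(candidates, min_return, max_uncertainty):
    """Keep trajectories with high predicted return and low disagreement."""
    kept = []
    for name, ensemble_returns in candidates:
        if (mean(ensemble_returns) >= min_return
                and pstdev(ensemble_returns) <= max_uncertainty):
            kept.append(name)
    return kept

# Hypothetical returns predicted by a three-member ensemble for three trajectories.
candidates = [
    ("confident_high", [10.0, 10.0, 10.0]),
    ("uncertain", [12.0, 4.0, 8.0]),
    ("confident_low", [3.0, 3.0, 3.0]),
]
print([round(penalized_return(r), 3) for _, r in candidates])
print(filter_trajectories(candidates, min_return=8.0, max_uncertainty=1.0))
```

Only the first trajectory survives the filter: the second has an acceptable mean but high disagreement, and the third is confidently poor.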

2. 知识表示、推理与符号AI 7 篇

2606.12721 2026-06-12 cs.AI 新提交

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

心智理论效用:心理化机制的形式化规范

Nikolos Gurney, Stacy Marsella

发表机构 * Institute for Creative Technologies, University of Southern California(南加州大学创意技术研究所) Khoury College of Computer Sciences, Northeastern University(东北大学库里计算机科学学院)

AI总结 提出心智理论效用(ToM-U)框架,通过局部认知世界模型(LEWM)形式化推断他人信念的计算问题,定义结构、推理过程及失败痕迹,区别于贝叶斯心智理论等方法。

详情
AI中文摘要

推断他人的信念需要超越表面信号;需要追踪谁告诉了他们什么、以什么顺序以及有多可信。心智理论效用(ToM-U)在计算分析层面形式化了这一认知状态推断问题,明确了心理化计算的内容和原因,而不承诺算法或神经实现。ToM-U通过构建局部认知世界模型(LEWMs)——表示智能体、状态节点及其之间认知关系的有向类型图——并根据观察到的行为评估离散候选LEWM,直到达到足够的置信度来实现这一点。五个形式定义指定了LEWM结构、包括有序信息访问历史的智能体节点属性、递归心理化的有界增殖机制、三种推理过程以及一个残差函数,该函数捕捉失败心理化尝试留下的结构化痕迹。ToM-U不同于贝叶斯心智理论和相邻的形式化描述,后者预设而非推导信念状态,也不同于模拟理论和理论-理论,后者缺乏认知状态推断的形式化工具。该架构生成关于心理化失败的方向性、可证伪预测,这些预测源于模型的结构属性而非辅助假设,并将ToM-U定位为在目标推断和其他下游社会认知过程之前的领域无关机制。

英文摘要

Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) -- directed typed graphs that represent agents, state nodes, and the epistemic relationships among them -- and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

2606.13405 2026-06-12 cs.AI cs.MA 新提交

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

用于受规管流程自动化的神经符号代理:挑战与研究议程

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

发表机构 * German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) Saarland University(萨尔大学)

AI总结 提出将领域内符号结构(法规、流程模型、合规约束)作为代理核心架构组件,实现合规性内置(compliance-by-construction)以补充护栏监控,并列出神经符号研究挑战。

Comments Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026

详情
AI中文摘要

基于LLM的代理正在进入受规管行业,在这些行业中,它们自动化判断密集型质量管理流程。我们认为,这些领域中已经嵌入的符号结构,包括法规、类型化流程模型和合规约束,不应仅被视为外部监控机制,而应作为塑造代理决策和行为的核心架构组件。我们提出合规性内置作为基于护栏监控的补充范式:一种防止控制流违规的结构基础,而护栏对于捕获语义错误仍然必不可少。我们在基础和能力层面识别出一组结构化的神经符号研究挑战,并表明共同解决这些挑战能够实现合规性内置。我们呼吁神经符号社区将受规管流程自动化作为一个高影响力的研究领域来参与。

英文摘要

LLM-based agents are entering regulated industries where they automate judgment-intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges at the foundational and capability levels and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high-impact research domain.

2606.13669 2026-06-12 cs.AI 新提交

Agents-K1: Towards Agent-native Knowledge Orchestration

Agents-K1:迈向智能体原生的知识编排

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, Lei Bai

发表机构 * PJLab(上海人工智能实验室)

AI总结 提出Agents-K1管道,将原始文档转化为智能体原生科学知识图谱,通过多模态解析器、GRPO训练的4B信息抽取骨干和三源智能体接口,实现科学信息抽取、知识图谱构建和多跳推理。

详情
AI中文摘要

当前基于LLM的研究智能体通过智能体编排取得了进展,但在很大程度上忽视了科学知识编排。现有工作通常将论文简化为摘要、表面提及和扁平化的cites边,忽略了科学推理所必需的关键实体、主张、证据、机制和方法谱系。为此,我们引入了Agents-K1,一个端到端的知识编排管道,将原始文档转换为智能体原生的科学知识图谱。Agents-K1在统一的理论基础下整合了三个组件:一个多模态解析器,其五模块模式捕获整个论文中的实体、多模态证据、引用和类型化实体间关系,而非仅摘要;一个基于GRPO在规则奖励下训练的4B信息抽取骨干;以及一个graphanything CLI,一个统一了网络搜索、多模态图检索和跨文档遍历的三源智能体接口。在此基础上,我们处理了六个学科的246万篇科学论文,生成了Scholar-KG,并发布了其中100万篇论文的子集,完整Scholar-KG可通过下方SCP链接访问。同一管道可扩展到通用领域语料库和符合模式的数据合成。大量实验表明,Agents-K1在科学信息抽取、知识图谱构建和多跳科学推理方面取得了优越性能。

英文摘要

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat cites edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

2604.27960 2026-06-12 cs.AI 版本更新

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

LLMs 作为 ASP 程序员:自我纠正实现任务无关的非单调推理

Adam Ishay, Joohyung Lee

发表机构 * Arizona State University(亚利桑那州立大学) Samsung Research(三星研究院)

AI总结 提出 LLM+ASP 框架,通过自我纠正循环将自然语言转化为回答集程序,实现无需任务特定工程的非单调推理,在多个基准上优于 SMT 方法。

Comments 30 pages

详情
AI中文摘要

近期的大语言模型(LLMs)在推理方面取得了令人瞩目的进展,但仍面临高计算成本、逻辑不一致性以及在高度复杂问题上性能急剧下降等问题。神经符号方法通过将 LLMs 与符号推理器结合来缓解这些问题,但现有方法通常依赖于单调逻辑(如 SMT),无法表示可废止推理——人类认知的重要组成部分。我们提出了“LLM+ASP”框架,该框架将自然语言转化为回答集编程(ASP),一种基于稳定模型语义的非单调形式化方法。与先前需要手动编写知识模块、领域特定提示或仅限于单一问题类别评估的“LLM+ASP”方法不同,我们的框架无需任何每任务工程,并统一适用于多种推理任务。我们的系统利用自动化的自我纠正循环,其中来自 ASP 求解器的结构化反馈能够实现迭代优化。在六个不同基准上的评估表明:(1)稳定模型语义使 LLMs 能够自然地表达默认规则和例外,在非单调任务上显著优于基于 SMT 的替代方法;(2)迭代自我纠正是性能的主要驱动力,有效替代了手工领域知识的需求;(3)紧凑的上下文参考指南显著优于冗长的文档,揭示了“上下文腐烂”现象,即过多上下文会阻碍约束遵循。

英文摘要

Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning -- an essential component of human cognition. We present "LLM+ASP," a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior "LLM+ASP" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a "context rot" phenomenon where excessive context hinders constraint adherence.
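The self-correction loop can be sketched independently of any particular model or solver. In the example below, `generate` and `solve` are injected callables, and the two `fake_*` functions are toy stand-ins (the "solver" only checks for a terminating period); a real system would call an LLM and an ASP solver in their place. All names here are hypothetical.

```python
def self_correct(problem, generate, solve, max_rounds=3):
    """Regenerate a program until the solver accepts it or rounds run out."""
    program, feedback = None, None
    for round_index in range(max_rounds):
        program = generate(problem, program, feedback)
        ok, result = solve(program)
        if ok:
            return program, result, round_index + 1
        feedback = result  # structured solver message drives the next attempt
    return program, None, max_rounds

def fake_generate(problem, previous, feedback):
    """Toy generator: first draft omits the final period, the retry adds it."""
    if feedback is None:
        return "flies(X) :- bird(X), not abnormal(X)"
    return previous + "."

def fake_solve(program):
    """Toy solver: accepts any program that ends with a period."""
    if not program.endswith("."):
        return False, "syntax error: rule must end with '.'"
    return True, "SATISFIABLE"

outcome = self_correct("Do birds fly by default?", fake_generate, fake_solve)
print(outcome)
```

The rule inside the string is a typical default with an exception ("birds fly unless abnormal"), which is the kind of statement stable model semantics expresses directly through negation as failure.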

2606.04935 2026-06-12 cs.AI 版本更新

What Type of Inference is Active Inference?

主动推理是一种什么类型的推理?

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

发表机构 * Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands(埃因霍温理工大学电气工程系,荷兰埃因霍温) Lazy Dynamics, Utrecht, the Netherlands(Lazy Dynamics,荷兰乌得勒支)

AI总结 本文通过变分自由能框架将主动推理中的期望自由能最小化分解为熵校正项和规划校正项,揭示了其推理本质,并在网格世界实验中验证了不同校正项的作用。

详情
AI中文摘要

主动推理将决策视为推理,期望自由能(EFE)统一了目标导向和信息寻求行为。最近的研究表明,EFE最小化可以写成在带有认知先验的生成模型上的变分自由能(VFE)最小化。我们证明了增强模型的VFE可以重写为预测模型的VFE加上显式的熵校正项,从而使EFE贡献透明。然后我们表明,基于EFE的适当规划需要将这些认知校正与规划校正相结合,规划校正将边际推理转化为策略优化,从而得到基于EFE规划的完整变分特征。这澄清了交叉熵规划和完整基于EFE规划所需的校正。相同的熵校正公式导致了基于EFE规划的详细消息传递方案以及更简单的消融。在三个网格世界环境上的实验表明,当观测具有决定性时,规划校正已经有所帮助,而当观测仅具有提示性时,额外的观测侧认知校正最为重要。

英文摘要

Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.
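For readers less familiar with the two quantities, the textbook forms are reproduced below as background. They are the standard definitions from the active inference literature, not the augmented-model result proved in this paper; the notation (hidden state s, observation o, policy π, preference prior p̃) is assumed here for illustration.

```latex
% Variational free energy for observation o, hidden state s, approximate posterior q(s):
F[q] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     = \mathrm{KL}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o).

% A commonly used form of the expected free energy of a policy \pi,
% with \tilde{p}(o) a preference prior over outcomes:
G(\pi) \approx
  -\,\mathbb{E}_{q(o \mid \pi)}\Big[\mathrm{KL}\big[q(s \mid o, \pi)\,\|\,q(s \mid \pi)\big]\Big]
  \;-\; \mathbb{E}_{q(o \mid \pi)}\big[\ln \tilde{p}(o)\big].
```

The first term of G rewards expected information gain (information-seeking behavior) and the second rewards preferred outcomes (goal-directed behavior), which is the sense in which EFE unifies the two.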

2508.02548 2026-06-12 cs.DB cs.AI 版本更新

The KG-ER Conceptual Schema Language

KG-ER概念模式语言

Enrico Franconi, Benoît Groz, Jan Hidders, Nina Pardal, Sławek Staworko, Jan Van den Bussche, Piotr Wieczorek

发表机构 * Free University of Bozen-Bolzano, Italy(博岑-博尔扎诺自由大学,意大利) Université Paris-Saclay, CNRS, LISN, France(巴黎-萨克雷大学,法国 CNRS LISN) Birkbeck, University of London, UK(伦敦大学伯克贝克学院,英国) University of Huddersfield, UK(赫德斯菲尔德大学,英国) Relational AI, Berkeley, CA, USA(关系AI,美国加州伯克利) Hasselt University, Hasselt, Belgium(哈塞尔特大学,比利时) University of Wrocław, Poland(弗罗茨瓦夫大学,波兰)

AI总结 提出KG-ER概念模式语言,独立于知识图谱的表示方式描述其结构,并帮助捕获语义。

Comments Published in Proceedings of IRIS-AI (https://iris-ai.org)

详情
AI中文摘要

我们提出KG-ER,一种用于知识图谱的概念模式语言,它独立于知识图谱的表示方式(关系数据库、属性图、RDF)描述其结构,同时有助于捕获知识图谱中存储信息的语义。

英文摘要

We propose KG-ER, a conceptual schema language for knowledge graphs that describes the structure of knowledge graphs independently of their representation (relational databases, property graphs, RDF) while helping to capture the semantics of the information stored in a knowledge graph.

2603.11479 2026-06-12 cs.LG cs.AI cs.MA 版本更新

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

波的语法:通过神经符号VLM智能体实现可解释的多变量时间序列事件检测

Sky Chenwei Wan, Yifei Y. Wang, Tianjun Hou, Xiqing Chang, Aymeric Jan

发表机构 * AI Lab, SLB(SLB人工智能实验室) Télécom Paris, Institut Polytechnique de Paris, France(巴黎电信学院,巴黎高等理工学院,法国)

AI总结 提出语言引导的时间序列事件检测(TSED)任务,通过事件逻辑树(ELT)将文本描述转化为结构化时序逻辑,并构建神经符号VLM智能体SELA,实现零/少样本事件检测与可解释推理。

Comments 8 pages (main text), 28 pages total including appendix. 9 figures, 7 tables

详情
AI中文摘要

时间序列事件检测(TSED)旨在定位时间序列数据中具有语义意义的事件,在高风险领域具有关键应用。与统计异常不同,事件通常由自然语言描述定义,且跨多个物理通道具有内部时序逻辑结构。然而,在现实场景中,密集的事件标注成本高昂,使得纯监督学习困难。我们引入了语言引导的TSED,该设置中模型被赋予文本事件描述,并必须在几乎没有标注数据的情况下将其映射到多变量信号中的区间。为了解决这个问题,我们提出了事件逻辑树(ELT),一种知识表示框架,将语言描述转化为信号基元上的结构化时序逻辑。基于ELT,我们提出了SELA,一种神经符号VLM智能体框架,它在信号可视化中迭代地定位(ground)各个基元,并在ELT约束下组合它们,产生事件区间和忠实的树状结构解释。我们进一步发布了跨能源和气候领域的真实世界基准,包含专家知识和标注。实验表明,SELA优于监督微调和现有的零/少样本时间序列推理基线。

英文摘要

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.

3. 多智能体与博弈 9 篇

2606.13197 2026-06-12 cs.AI 新提交

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

ARMOR-MAD:大语言模型推理中异构多智能体辩论的自适应路由

Fuqiang Niu, Bowen Zhang

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络空间安全学院) School of Artificial Intelligence, Shenzhen Technology University(深圳技术大学人工智能学院)

AI总结 提出ARMOR-MAD框架,通过辩论前协议路由、早期一致停止评估和语义异常检测,自适应控制异构多智能体辩论,提升推理准确性和效率。

详情
AI中文摘要

多智能体辩论(MAD)可以改进大语言模型推理,但固定的辩论流程常常浪费计算资源,并可能放大相似智能体之间的相关错误。我们提出ARMOR-MAD,一个无需训练的异构MAD框架,将辩论视为条件计算。ARMOR-MAD结合了三个组件:辩论前协议路由(PAR)决定独立生成的第0轮答案是否需要辩论;早期一致停止评估器(EASE)在收敛后停止辩论;以及语义异常检测(SOD)在聚合过程中降低异常最终答案的权重。在MATH Level 5、GSM8K、MMLU和MMLU-Pro上,ARMOR-MAD在使用相同模型池的情况下,始终优于固定轮次的异构辩论,分别达到65.5%、96.5%、90.0%和81.5%的准确率。结果表明,真正的模型异构性和基于协议的控制对于使MAD更准确和高效都很重要。

英文摘要

Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
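
The idea of treating debate as conditional computation can be sketched in a few lines. The code below is only an illustration of agreement-based routing and early stopping under simple assumptions (exact string match between answers, a single agreement threshold); the function names, the `debate_round` callable and the threshold are hypothetical and are not taken from the paper, and the semantic outlier step is not modeled.

```python
from collections import Counter


def agreement_ratio(answers):
    """Fraction of agents backing the most common answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def routed_debate(round0_answers, debate_round, threshold=1.0, max_rounds=3):
    """Debate only while answers disagree; stop as soon as they converge.

    debate_round: hypothetical callable mapping a list of answers to revised answers.
    Returns (final_answer, rounds_used).
    """
    answers = list(round0_answers)
    rounds_used = 0
    # If Round-0 agreement already meets the threshold, the loop never runs
    # (routing). Otherwise it exits once agreement is reached (early stopping).
    while agreement_ratio(answers) < threshold and rounds_used < max_rounds:
        answers = debate_round(answers)
        rounds_used += 1
    final_answer = Counter(answers).most_common(1)[0][0]
    return final_answer, rounds_used
```

With unanimous Round-0 answers the function returns immediately with zero rounds used, which is where the compute saving over a fixed-round pipeline would come from.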

2606.13591 2026-06-12 cs.AI cs.LG cs.MA 新提交

Multiagent Protocols with Aggregated Confidence Signals

带有聚合置信信号的多智能体协议

Ali Elahi, Barbara Di Eugenio

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出三种协议,通过转换原始置信信号并采用软投票或贝叶斯融合,为多智能体系统输出聚合置信度,在保持正确性的同时显著提升判别能力。

Comments 22 pages and 5 figures, 9 pages and 2 figures before the appendix

详情
AI中文摘要

置信度在自然语言处理中用于可靠性、监督和一系列下游决策任务,但目前没有方法能够为多智能体系统的输出产生或评估置信度。先前的工作在多智能体辩论中使用置信度来加权消息、触发辩论或校准单个智能体,但从未将这些置信度聚合成系统本身的单一置信度。我们引入了三种协议,通过首先转换原始置信信号使其在不同模型间可比,然后通过软投票或称为贝叶斯融合的概率融合方法将它们组合,从而产生最终答案和单一的聚合置信度。这种聚合置信度在判别性(AUARC)上显著优于最佳单个智能体或标准辩论基线,同时正确性(F1分数)保持稳定,并恢复了多智能体辩论在更模糊任务上的损失。通过分析两种估计器(序列概率和自我报告)以及参数和非参数校准器,我们发现校准提高了两种估计器的F1分数,而AUARC对其依赖较小。我们在五个基准测试和四种任务类型上评估了每基准六对同质和异质辩论对,涵盖了多种模型能力和大小。

英文摘要

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
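
To make the two aggregation styles concrete, here is a minimal sketch of confidence-weighted soft voting and of a naive independence-based fusion. It assumes confidences are already calibrated to be comparable across models; the abstract does not specify the paper's transformation step or its exact Bayesian fusion rule, so `naive_bayes_fusion` is a stand-in and all names are hypothetical.

```python
from collections import defaultdict


def soft_vote(votes):
    """votes: list of (answer, confidence in [0, 1]).

    Returns the answer with the largest summed confidence and that sum
    averaged over all agents as the aggregated confidence.
    """
    totals = defaultdict(float)
    for answer, conf in votes:
        totals[answer] += conf
    best = max(totals, key=totals.get)
    return best, totals[best] / len(votes)


def naive_bayes_fusion(votes, candidates, eps=1e-6):
    """Assumes agents are independent: multiply per-agent probabilities, then normalise."""
    scores = {}
    for cand in candidates:
        score = 1.0
        for answer, conf in votes:
            conf = min(max(conf, eps), 1 - eps)
            # An agent puts `conf` on its own answer and spreads the rest evenly.
            other = (1 - conf) / max(len(candidates) - 1, 1)
            score *= conf if answer == cand else other
        scores[cand] = score
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

Soft voting keeps the aggregated confidence on the same scale as the inputs, while the product rule sharpens it quickly when agents agree, which is one reason the independence assumption should be treated with care.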

2606.13598 2026-06-12 cs.AI cs.CL cs.LG cs.MA 新提交

Reward Modeling for Multi-Agent Orchestration

多智能体编排的奖励建模

King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang

发表机构 * Rutgers University(罗格斯大学) Salesforce AI Research(Salesforce人工智能研究院)

AI总结 提出OrchRM框架,通过自监督学习从多智能体执行中间产物构建奖励模型,无需人工标注,实现高效编排器训练和测试时扩展,在多个领域提升性能并降低计算成本。

Comments Preprint; work in progress

详情
AI中文摘要

基于大型语言模型(LLM)的多智能体系统(MAS)需要有效的编排来协调专门化的智能体,然而训练这样的编排器受到有限监督和高计算成本的阻碍。我们提出了编排奖励建模(OrchRM),一种无需人工标注即可评估编排质量的自监督框架。OrchRM利用多智能体执行过程中的中间产物来构建Bradley-Terry奖励模型训练的胜负对。与现有的依赖昂贵子智能体展开的MAS测试时扩展和编排器训练框架不同,OrchRM直接在编排层面操作,实现了高效且高性能的奖励引导编排器训练和MAS测试时扩展。OrchRM在token使用上提高了高达10倍的训练效率,同时将MAS测试时扩展的准确率提升了高达8%。这些增益在多个领域(包括数学推理、基于网络的问答和多跳推理)中一致迁移,证明了编排级奖励建模作为鲁棒多智能体编排的可扩展方向。代码将在 https://github.com/Wang-ML-Lab/OrchRM 提供。

英文摘要

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
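
The Bradley-Terry objective mentioned in the abstract has a compact standard form: the loss for a win-lose pair is the negative log-probability that the winner is preferred, $-\log \sigma(r_w - r_l)$. The sketch below computes it for scalar rewards; how OrchRM builds the pairs from intermediate artifacts is not described in the abstract, so the pair list here is purely illustrative.

```python
import math


def bradley_terry_loss(reward_win, reward_lose):
    """Negative log-likelihood that the winner is preferred: -log sigmoid(r_w - r_l)."""
    margin = reward_win - reward_lose
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs):
    """Average Bradley-Terry loss over (reward_win, reward_lose) pairs."""
    return sum(bradley_terry_loss(w, l) for w, l in pairs) / len(pairs)
```

When the reward model scores both samples equally the loss is $\ln 2$, and it shrinks toward zero as the winner's reward pulls ahead of the loser's.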

2606.13604 2026-06-12 cs.AI cs.LG cs.MA 新提交

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

基于延迟市场反馈的多智能体强化学习在三方调度中的目标权重自适应

Haochen Wu, Yi Hou, Shiguang Xie

发表机构 * DoorDash

AI总结 提出在DoorDash部署的强化学习系统,利用延迟信号自适应调整调度目标权重,通过离线策略学习在噪声和耦合反馈下优化配送质量与批处理效率的权衡。

Comments Accepted at ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF)

详情
AI中文摘要

三方市场中的调度为从世界反馈中进行强化学习提供了自然场景:决策通过延迟的操作结果(如配送速度、骑手利用率和商家拥堵)进行评估。我们介绍了DoorDash部署的一个强化学习系统,该系统利用延迟信号在大规模食品配送市场中自适应调整调度目标权重。该系统并非取代组合分配优化器,而是通过从记录的市场数据中学习的店铺级策略选择一个离散乘数,该乘数改变调度优化器在配送质量与批处理效率之间的权衡。这种接口使得在噪声、延迟和耦合反馈下进行离线策略学习成为可能,同时保留生产可行性约束和操作保障。我们使用集中式离线数据和分散式店铺级执行训练共享价值函数,采用双Q学习目标和保守正则化器以减少分布外价值高估。在生产切换实验中,离线训练的策略增加了批处理并减少了骑手侧时间成本,而不会降低面向客户的配送质量。结果展示了如何利用来自实时经济和物流系统的世界反馈安全地在线调整决策策略。

英文摘要

Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
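
The two training ingredients named in the abstract can be illustrated on a discrete action set, such as a small menu of dispatch multipliers. The Double Q-learning target below is the standard one; the conservative term is written in the style of Conservative Q-Learning, which is an assumption, since the abstract only says "a conservative regularizer". All function and parameter names are hypothetical.

```python
import math


def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double Q-learning target: the online network picks the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def conservative_penalty(q_values, logged_action):
    """CQL-style penalty (assumed form): logsumexp over all actions minus the
    Q-value of the action actually present in the logged data."""
    q_max = max(q_values)
    logsumexp = q_max + math.log(sum(math.exp(q - q_max) for q in q_values))
    return logsumexp - q_values[logged_action]
```

The penalty is always non-negative and grows when actions that never appear in the logs receive high values, which is the overestimation the regularizer is meant to discourage.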

2606.12474 2026-06-12 cs.MA cs.AI cs.CR 交叉投稿

SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

SAIGuard: 面向LLM多智能体系统主动防御的通信状态模拟

Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang

AI总结 提出SAIGuard主动防御框架,通过通信状态模拟检测并净化风险消息,降低攻击成功率并保持系统效用。

详情
AI中文摘要

基于LLM的多智能体系统(MAS)通过智能体间协作解决复杂任务,但其通信驱动的特性也使安全风险能够在智能体间传播并引发系统级故障。现有的MAS防御主要遵循执行后的反应式范式,通过检测和隔离有害智能体,但这可能导致不可逆的损害并降低协作效用。为解决此问题,我们提出一种面向MAS安全的主动防御框架,即模拟感知拦截守卫(SAIGuard)。SAIGuard在MAS交互图上执行通信状态模拟,估计传入消息对局部智能体状态和全局MAS状态的影响,并通过与良性通信模式的重建偏差检测风险消息。SAIGuard不隔离智能体,而是在可疑消息传播到系统之前对其进行净化或重新生成。跨多种拓扑和攻击场景的实验表明,SAIGuard在保持MAS效用的同时降低了攻击成功率,优于反应式防御。

英文摘要

LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive paradigm after execution by detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, namely a Simulation-aware Interception Guard (SAIGuard). SAIGuard performs communication-state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before they propagate into the system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.

2606.12835 2026-06-12 cs.MA cs.AI cs.CY cs.NI 交叉投稿

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

智能体互联网:大规模通信、协调与集体智能

Quanyan Zhu

AI总结 本文提出智能体互联网(IoAI)愿景,构建异构智能体在云、边缘、设备等环境中发现、协商、通信与协作的开放生态系统,并探讨其架构、机制及关键研究挑战。

详情
AI中文摘要

自主AI智能体的快速涌现正在将人工智能从孤立的模型推理转变为分布式推理、通信和行动系统。本文发展了智能体互联网(IoAI)的愿景:一个开放生态系统,其中异构智能体能够跨云、边缘、设备、组织及信息物理环境相互发现、协商职责、交换上下文、调用工具并执行工作流。我们综合了单智能体AI、多智能体系统、分布式计算、通信网络、博弈论和安全工程的基础,以刻画可扩展智能体生态系统所需的架构和机制。本文考察了智能体部署模型、工作流生命周期、通信协议、互操作层、资源管理挑战和信任架构,并提供了自适应制造和分布式作战协调的案例研究。由此产生的框架突出了可控涌现、语义互操作、安全身份、激励兼容协调、资源感知编排以及大规模自主智能体网络治理等核心研究挑战。

英文摘要

The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

2603.21563 2026-06-12 cs.AI 版本更新

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

多智能体协作的反事实信用策略优化

Zhongyi Li, Wan Tian, Jinju Chen, Huiming Zhang, Yang Liu, Yikun Ban, Fuzhen Zhuang

发表机构 * Beihang University(北京航空航天大学) Peking University(北京大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对多智能体大语言模型协作中信用分配难题,提出CCPO框架,通过反事实信用估计和验证器锚定的自评估两种分配器,将团队奖励转化为个体学习信号,提升数学推理任务表现。

详情
AI中文摘要

协作式多智能体大语言模型可以通过分解角色来解决复杂的推理任务,但此类系统的强化学习受到信用分配的限制:共享的终端奖励模糊了个体贡献,并可能鼓励搭便车行为。我们引入了反事实信用策略优化(CCPO),这是一个与优化器无关的信用分配层,将团队层面的结果转化为智能体特定的学习信号。CCPO提供了两种互补的分配器。反事实信用通过比较实际团队结果与移除该智能体的反事实结果来估计智能体的边际贡献。验证器锚定的LLM自我评估是一种探索性分配器,它使用受限的自我评估和同伴评估来重新分配信用,同时保持外部验证器结果的主导地位。由此产生的角色特定奖励可以被GRPO风格的更新或其他策略梯度优化器(如GSPO和REINFORCE++)使用。我们在顺序的思考-求解设置中实例化CCPO,并在数学推理基准上评估它。结果表明,显式的信用分配通常能改善双智能体推理,尤其是在MATH500和几个分布外设置中,而增益因模型和数据集而异。

英文摘要

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment methods for converting joint outcomes into agent-specific learning signals. Counterfactual Credit for Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Credit for Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant. Both operate at the reward-construction layer rather than as policy optimizers, producing role-specific rewards or advantages for GRPO, GSPO, or REINFORCE++. We instantiate these credit signals in a sequential Think--Solve setting and evaluate them on mathematical reasoning benchmarks. Results show that explicit credit assignment often improves dual-agent reasoning, especially on MATH500 and several out-of-distribution settings, while gains vary across models and datasets. Our code is available at: https://github.com/bhai114/ccpo.
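
The counterfactual credit described in the abstract amounts to a difference of two team scores: the realized outcome and the outcome with one agent removed. The sketch below shows that computation in isolation; `team_reward` and `toy_reward` are hypothetical, and how the paper actually produces the counterfactual rollout (for example, by having the solver answer without the thinker's plan) is not specified in the abstract.

```python
def counterfactual_credit(team_reward, agents, outputs):
    """Marginal contribution of each agent:
    R(all outputs) - R(outputs with that agent removed).

    team_reward: hypothetical callable scoring a dict {agent_name: output}.
    """
    full = team_reward(outputs)
    credits = {}
    for agent in agents:
        without_agent = {name: out for name, out in outputs.items() if name != agent}
        credits[agent] = full - team_reward(without_agent)
    return credits


def toy_reward(outputs):
    """Illustrative reward: the solver alone succeeds 40% of the time,
    the thinker-solver pair always succeeds, and nothing works without a solver."""
    if "solver" not in outputs:
        return 0.0
    return 1.0 if "thinker" in outputs else 0.4
```

Under `toy_reward`, the thinker is credited with the 0.6 improvement it brings and the solver with the full 1.0, so a free-riding agent whose removal changes nothing would receive zero credit.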

2605.02249 2026-06-12 cs.AI 版本更新

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

多智能体系统中信念修正公设的研究(扩展版)

Michael Thielscher, Tran Cao Son

AI总结 研究认知规划中的信念修正问题,将经典AGM信念修正公设推广到多智能体环境,提出广义全交多智能体信念修正算子,并讨论迭代修正公设的推广及事件模型修正算子。

详情
AI中文摘要

我们研究了认知规划中的信念修正问题,即在一个多智能体系统中,当某个智能体获得关于某个状态属性的信念后,所有智能体的信念将如何变化。基于通过单一多智能体Kripke模型表示智能体信念的标准认知规划表示,我们将经典的AGM信念修正公设推广到多智能体环境,旨在为计算作为行动结果的所有智能体信念的动态认知推理框架提供形式化评估。作为满足所有广义AGM公设的简单算子示例,我们提出了广义全交多智能体信念修正。此外,我们定义了迭代修正的标准公设的推广,提出了一个更复杂的基于事件模型的修正算子,并讨论了在Kripke模型上定义能够满足所有迭代多智能体信念修正的广义公设的认知算子时可能存在的问题。

英文摘要

We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

2412.08610 2026-06-12 cs.GT cs.AI cs.CY 版本更新

Competition and Diversity in Generative AI

生成式人工智能中的竞争与多样性

Manish Raghavan

发表机构 * MIT Sloan School of Management & Department of Electrical Engineering and Computer Science(麻省理工学院斯隆管理学院及电气工程与计算机科学系)

AI总结 通过博弈论模型和Scattergories游戏实验,研究竞争如何促使生成式AI模型多样化,缓解同质化,并提升社会福利。

详情
AI中文摘要

最近的实验和现实证据表明,使用生成式人工智能会降低所产生内容的多样性。使用相同或相似的AI模型似乎会导致更同质化的行为。我们的工作从观察到存在一股相反方向的推动力开始:竞争。当生产者相互竞争(例如,争夺客户或注意力)时,他们被激励去创造新颖或独特的内容。我们探讨了竞争对内容多样性和整体社会福利的影响。通过一个正式的博弈论模型,我们表明竞争市场会选择多样化的AI模型,从而缓解单一文化。我们进一步表明,一个在孤立环境中表现良好(即根据基准)的生成式AI模型可能在竞争市场中无法提供价值。我们的结果强调了在生成式AI模型输出分布的广度上评估它们的重要性,特别是当它们将被部署在竞争环境中时。我们通过使用语言模型玩Scattergories(一个奖励正确且独特答案的文字游戏)来实证验证我们的结果。总体而言,我们的结果表明,由生成式AI导致的同质化不太可能在竞争市场中持续存在,相反,下游市场的竞争可能会推动AI模型开发的多样化。

英文摘要

Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of content produced. The use of the same or similar AI models appears to lead to more homogeneous behavior. Our work begins with the observation that there is a force pushing in the opposite direction: competition. When producers compete with one another (e.g., for customers or attention), they are incentivized to create novel or unique content. We explore the impact competition has on both content diversity and overall social welfare. Through a formal game-theoretic model, we show that competitive markets select for diverse AI models, mitigating monoculture. We further show that a generative AI model that performs well in isolation (i.e., according to a benchmark) may fail to provide value in a competitive market. Our results highlight the importance of evaluating generative AI models across the breadth of their output distributions, particularly when they will be deployed in competitive environments. We validate our results empirically by using language models to play Scattergories, a word game in which players are rewarded for answers that are both correct and unique. Overall, our results suggest that homogenization due to generative AI is unlikely to persist in competitive markets, and instead, competition in downstream markets may drive diversification in AI model development.

4. 搜索、优化与约束求解 2 篇

2606.13407 2026-06-12 cs.AI 新提交

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

使用元启发式算法优化太阳能管理的电器调度

Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers

发表机构 * Computing Science and Mathematics, University of Stirling(斯特灵大学计算科学与数学学院)

AI总结 提出基于迭代局部搜索和模拟退火的元启发式方法,优化电器启动时间以最大化太阳能利用,并处理多天任务溢出问题。

Comments 9 pages; full results and methodology for poster paper accepted to GECCO 2026

详情
AI中文摘要

可再生能源对于满足未来能源需求至关重要;然而,仅在白天发生的太阳能发电通常与家庭消费模式不一致。诸如炊具、洗衣机和烘干机等电器通常根据用户偏好的时间表运行,而不是根据太阳能可用性,这形成了一个调度优化问题。目标是确定最佳电器启动时间,以最大化可再生能源利用,同时最小化用户不便并遵守系统约束。本文提出了一种使用迭代局部搜索(ILS)和模拟退火(SA)的元启发式方法,以优化电器启动时间,同时考虑电器运行持续时间、功耗、逆变器限制、电池荷电状态约束和太阳能发电预测。与大多数现有工作不同,调度扩展到单日之外,以容纳前几天的未完成任务(溢出),确保操作连续性并支持跨多天的顺序操作。实验结果表明,顺序多日调度框架在仅依靠太阳能发电的条件下有效管理系统约束,同时确保用户便利。这些发现也为未来关于不同规模设备投资、投资回报和用户满意度之间的多目标权衡研究提供了机会。

英文摘要

Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limits, battery state-of-charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
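
A stripped-down version of the scheduling problem shows how simulated annealing searches over start times. The sketch assumes hourly slots, a single day, and a cost equal to the energy drawn beyond the solar forecast; battery state of charge, inverter limits, user inconvenience and multi-day spillover from the paper are deliberately left out, and every name and parameter value is hypothetical.

```python
import math
import random


def grid_import(starts, appliances, solar):
    """Energy (kWh) drawn beyond the solar forecast, summed over hourly slots.

    appliances: list of (duration_hours, power_kw); solar: kW available per hour.
    """
    load = [0.0] * len(solar)
    for start, (duration, power) in zip(starts, appliances):
        for hour in range(start, start + duration):
            load[hour] += power
    return sum(max(0.0, l - s) for l, s in zip(load, solar))


def anneal(appliances, solar, steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Simulated annealing over appliance start times; returns (best_starts, best_cost)."""
    rng = random.Random(seed)
    horizon = len(solar)
    current = [rng.randint(0, horizon - d) for d, _ in appliances]
    current_cost = grid_import(current, appliances, solar)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Neighbour move: give one randomly chosen appliance a new start time.
        candidate = list(current)
        i = rng.randrange(len(appliances))
        candidate[i] = rng.randint(0, horizon - appliances[i][0])
        cost = grid_import(candidate, appliances, solar)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
        temp *= cooling
    return best, best_cost
```

Early on the high temperature lets the search leave poor schedules even at a temporary cost; as the temperature decays the procedure behaves like a greedy local search around the best region found.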

2606.12667 2026-06-12 cs.NI cs.AI cs.SY eess.SY 交叉投稿

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

低地球轨道卫星地面站位置的自由布局优化

Grace Ra Kim, Duncan Eddy, Vedant Srinivas, Mykel J. Kochenderfer

AI总结 提出SCORE方法,通过两阶段自由布局优化地面站位置,相比差分进化算法减少5倍函数评估次数并提升13%下行吞吐量,相比固定站点方法提升15%总下行量。

Comments 34 pages, 13 figures, 11 tables, Journal of Aerospace Information Systems (JAIS)

详情
AI中文摘要

快速扩展的低地球轨道卫星星座对地面网络的需求日益增加,推动了更高效地面站网络设计的发展。当前方法从预定义位置选择站点,将优化限制在现有基础设施内,从而约束了性能。相比之下,自由布局优化在地球连续空间域上运行,拓宽了搜索空间,允许更高吞吐量的配置,但代价是可能需要部署新的基础设施。在这项工作中,我们引入了SCORE(通过细化与评估的顺序循环优化),一种用于地面站设计的两阶段自由布局方法。SCORE结合了顺序坐标选择与循环细化,以应对全局优化器面临的高维度、非凸性和局部最小值挑战。我们使用Kongsberg卫星服务公司和世界电信港协会的位置,将SCORE与差分进化(DE)等一次性方法以及整数规划方法进行了基准测试。在两个商业地球观测星座(Capella Space和ICEYE)和一个合成Walker-Star星座上的测试表明,与DE相比,SCORE收敛所需的函数评估次数最多减少5倍,同时下行吞吐量提升高达13%。与固定站点方法相比,无约束SCORE实现了高达15%的总下行量提升,为灵活布局建立了强大的经验性能基准;受基础设施约束的SCORE在将布局限制在现有光纤和电力基础设施附近的同时,保留了超过92%的增益。我们还探讨了扩建现有站点与部署新站点之间的权衡,为运营星座的未来地面网络设计提供参考。

英文摘要

Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement & Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage high-dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to within proximity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.

5. 机器学习与表示学习 43 篇

2606.12594 2026-06-12 cs.AI 新提交

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover: 通过增强型Lean形式化推进高效形式化证明

Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

发表机构 * Imperial College London(伦敦帝国学院) University of Edinburgh(爱丁堡大学) Nanyang Technological University(南洋理工大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出Pythagoras-Prover系列,包括自回归和扩散模型,通过课程SFT、动态过滤和增强型Lean形式化(ALF)扩展验证数据,在MiniF2F-Test上以更少参数超越DeepSeek-Prover-V2。

Comments Pythagoras-Prover: Technical Report

详情
AI中文摘要

现代Lean定理证明器只有在大量训练和推理计算下才能取得强性能,部分原因是由于稀缺的验证证明数据和形式化证明搜索的长推理轨迹,使得监督微调(SFT)和采样成本高昂。我们介绍了Pythagoras-Prover,一个计算高效的开源Lean定理证明器系列,专为实际计算预算而构建。该系列涵盖两种生成范式:4B和32B参数的自回归模型,以及首个概念验证的基于扩散的证明器(4B),它在推理时迭代地精炼Lean证明。为了提高训练效率,我们构建了一个Lean验证的语料库,按易、中、难问题分层,用于课程SFT,使模型逐步从较短、较简单的证明过渡到较长、较难的证明。在SFT期间,动态证明推理过滤方案保留了信息丰富的证明轨迹,同时将每个实例保持在8k令牌的上下文预算内。我们还引入了增强型Lean形式化(ALF),它将稀缺的验证语料库扩展为形式化语句的变体,通过自蒸馏填充以提供额外训练信号,而无需正式验证每个变异实例。通过扰动已知问题同时保留其形式化特征,ALF减少了对任何语句表面形式的依赖。实验上,Pythagoras-Prover-4B在MiniF2F-Test上的pass@32(86.1% vs 82.4%)超过了DeepSeek-Prover-V2-671B,参数数量约为其1/167,而Pythagoras-Prover-32B在MiniF2F-Test上以93.0%的成绩创下了开源最先进水平,并在672个PutnamBench问题中解决了93个。我们发布了MiniF2F-ALF,一个经ALF变异的对污染敏感的基准,每个评估模型在该基准上的准确率均下降;在此基准上,我们的32B模型仍然最强,而4B模型匹配了先前最先进的Goedel-Prover-V2-32B。

英文摘要

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.

2606.12883 2026-06-12 cs.AI 新提交

The Hidden Power of Scaling Factor in LoRA Optimization

缩放因子在LoRA优化中的隐藏力量

Zicheng Zhang, Haoran Li, Jiaxing Wang, Guoqiang Gong, Anqi Li, Yudong Hu, Ting Xiong, Yurong Gao, Junxing Hu, Zhida Jiang, Yifeng Zhang, Pengzhang Liu, Qixia Jiang

发表机构 * School of Mathematical Sciences, UCAS(中国科学院大学数学科学学院) School of Mathematical Sciences, NKU(南开大学数学科学学院) School of Advanced Interdisciplinary Sciences, UCAS(中国科学院大学前沿交叉科学学院)

AI总结 本文揭示LoRA中缩放因子α与学习率功能不同,α主导优化效果,通过信号-漂移框架发现α能放大任务信号而不增加漂移比,并提出LoRA-α框架以简化超参数搜索并提升性能。

详情
AI中文摘要

在低秩适应(LoRA)中,缩放因子α通常被视为学习率的简单补充,但其在优化中的作用仍未被充分理解。本文揭示缩放因子α和学习率功能不同,α成为有效优化的主导驱动因素,带来无法通过单独缩放学习率复现的收益。通过大量实证分析和理论信号-漂移框架的协同作用,我们发现了关于LoRA缩放机制的三点发现:首先,LoRA的频谱抑制平滑了优化景观,使得标准超参数过于保守,造成优化差距。其次,当利用这种平滑性加速收敛时,α通过放大任务信号而不增加漂移比,优于学习率。第三,最优缩放因子与秩呈次线性关系,由平方根定律很好地刻画,且系数出乎意料地大,揭示了现有秩相关启发式方法的缩放不足。基于这些见解,我们提出LoRA-α,一个极简框架,将α恢复到其原则性状态,使LoRA与标准小学习率兼容。跨多种任务的广泛评估表明,LoRA-α在简化超参数搜索的同时持续提升性能,释放了LoRA的学习潜力。

英文摘要

In Low-Rank Adaptation (LoRA), the scaling factor $α$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $α$ and the learning rate function differently, with $α$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three findings about LoRA's scaling mechanism: First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $α$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$α$, a minimalist framework that restores $α$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$α$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
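
For readers unfamiliar with where $α$ enters, the sketch below writes out the usual LoRA forward pass, $y = Wx + \frac{α}{r} B(Ax)$, using plain Python lists, together with a helper for a square-root rule $α = c\sqrt{r}$. The $α/r$ convention is the common one and is assumed here; the coefficient `coeff` is a placeholder, since the abstract reports a large coefficient without giving its value.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x), where r is the LoRA rank (rows of A)."""
    rank = len(A)
    scale = alpha / rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def sqrt_law_alpha(rank, coeff):
    """alpha = coeff * sqrt(rank); `coeff` is a placeholder, not a value from the paper."""
    return coeff * math.sqrt(rank)
```

Under the $α/r$ convention, a square-root rule for $α$ makes the effective multiplier on $BA$ fall off as $c/\sqrt{r}$ rather than as $1/r$ with a fixed $α$, so higher ranks are damped less aggressively.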

2606.12935 2026-06-12 cs.AI 新提交

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

MARS: 用于并行LLM测试时扩展的边际对抗风险控制停止策略

Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su, Tianpei Xie

发表机构 * Amazon(亚马逊) Stanford University(斯坦福大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出MARS停止规则,通过监测中间检查点的聚合投票并利用对抗性边界估计未来投票变化,在保证准确率的同时节省25-47%的自一致性token。

详情
AI中文摘要

并行测试时扩展采样多个推理轨迹并对答案进行多数投票,提高了LLM的准确性,但需要轨迹运行至完成,导致大量计算开销。我们观察到,在中间检查点探测部分轨迹可以在不中断生成的情况下提取当前答案,揭示出不断演变的聚合投票。基于这一观察,我们引入了MARS,一种边际对抗性停止规则,它估计哪些活跃轨迹可能改变其答案,并在未来投票移动的保守边界下,一旦领先者保持安全就停止。该规则分离了两种不确定性来源。它学习轨迹级别的切换概率,这些概率决定了当前边际有多少可能被保留,同时通过从预热轨迹中校准的对抗性边界处理切换轨迹落在哪里的更难问题。在真实切换概率下,MARS以高概率保证提前停止的答案与完整预算投票一致。在实践中,一个五特征逻辑模型紧密匹配了神谕切换行为。在三个推理模型和三个竞赛数学基准上,MARS节省了25-47%的自一致性token,并在DeepConf Online(一个已经过滤和截断弱轨迹的强置信加权基线)之上额外节省14-29%,同时匹配相应完整预算基线的准确率。

英文摘要

Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and a further 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
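The stopping idea in this abstract can be illustrated with a small sketch. It is not the authors' rule: the function name `safe_to_stop`, the fixed `threshold`, and the specific worst case (every likely switcher abandons the leader and lands on the strongest rival, or on a not-yet-seen answer) are assumptions made for illustration. In the paper the switch probabilities come from a learned model and the bound is calibrated from warmup traces.

```python
from collections import Counter

def safe_to_stop(answers, switch_probs, threshold=0.5):
    """Return (leader, is_safe) for the current checkpoint.

    answers:      current answer extracted from each active trace
    switch_probs: estimated probability that each trace changes its answer
    threshold:    hypothetical cut-off above which a trace counts as a
                  likely switcher (an assumption, not from the abstract)
    """
    votes = Counter(answers)
    leader, leader_votes = votes.most_common(1)[0]

    # Traces that may still change their answer.
    likely = [a for a, p in zip(answers, switch_probs) if p >= threshold]
    n_switch = len(likely)
    on = Counter(likely)  # likely switchers, grouped by current answer

    # Worst case for the leader: all of its likely switchers defect.
    leader_worst = leader_votes - on[leader]

    # Worst case for a rival: it keeps its votes and gains every likely
    # switcher from elsewhere. An unseen answer could gain all switchers.
    rival_worst = n_switch
    for ans, count in votes.items():
        if ans != leader:
            rival_worst = max(rival_worst, count + n_switch - on[ans])

    return leader, leader_worst > rival_worst
```

With eight stable votes for one answer and two for another, the check passes; with a 5-to-4 split and one unstable trace on the leading side, it does not, so generation would continue.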

2606.13016 2026-06-12 cs.AI 新提交

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Otters++: 一种基于首次脉冲时间的高能效光学脉冲Transformer

Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Qianhui Liu, Weng-Fai Wong

发表机构 * National University of Singapore(新加坡国立大学) Westlake University(西湖大学) Shandong University(山东大学) Zhejiang University(浙江大学) Agency for Science, Research and Technology(新加坡科技研究局)

AI总结 提出Otters++,利用光电器件自然信号衰减实现TTFS计算,通过层等效与混合训练方法,在GLUE上达到84.17%平均分且能耗更低。

详情
AI中文摘要

脉冲神经网络(SNN)有望实现高能效推理,而首次脉冲时间(TTFS)编码尤其吸引人,因为每个神经元最多发放一次脉冲。然而,在实践中,这一优势往往因计算时间衰减项并将其与突触权重相乘的成本而减弱。我们通过将物理硬件“缺陷”——光电器件中的自然信号衰减——转化为TTFS的主要计算来解决这一问题,命名为Otters++。具体来说,我们利用定制In$_2$O$_3$光电突触的实测衰减直接实现TTFS时间项,从而消除了显式数字衰减计算的需求。为了将该思想扩展到Transformer模型,我们建立了Otters++与量化神经网络(QNN)之间的逐层功能等价性,并开发了一种混合训练方法,在前向传播中使用忠实于器件的SNN计算,在后向传播中通过等效QNN路径使用QNN直通梯度,并结合模型蒸馏。这避免了对离散首次脉冲事件的微分,并减少了直接TTFS-SNN训练中的过度稀疏问题。我们进一步通过采样运行间变化使训练感知实测器件噪声,并通过考虑器件共享和多跳通信来细化系统级能耗模型。在GLUE数据集上,Otters++将平均得分提高到84.17%,同时相比先前的脉冲Transformer基线保持明显的能耗优势。这些结果表明,基于物理的TTFS计算在实际硬件效应下可以高效、可训练且鲁棒。

英文摘要

Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue with Otters++, which turns a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS. Specifically, we use the measured decay of a custom In$_2$O$_3$ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE benchmark, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

2606.12479 2026-06-12 cs.LG cs.AI 交叉投稿

ReCal: Reward Calibration for RL-based LLM Routing

ReCal: 基于强化学习的LLM路由的奖励校准

Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出ReCal框架,通过分层奖励分解和分布感知优化校准奖励信号,解决多目标冲突和异质性任务优化偏差,提升LLM路由性能与稳定性。

详情
AI中文摘要

大型语言模型(LLM)路由已成为一种有效范式,通过动态模型和推理策略选择来利用多个LLM的互补优势。最近的基于强化学习(RL)的路由方法通过从交互反馈中优化路由策略,进一步提高了路由质量。然而,在难度不同的异质性任务下,它们仍然难以提供信息丰富且可比较的学习信号。在实践中,多个目标(如正确性、格式行为)被聚合为单个标量奖励,导致模糊的信用分配和冲突的优化信号。此外,奖励信号在不同实例间表现出显著变异性,其中一些实例产生更高或更可变的奖励,引入了偏向于平凡样本而非信息性样本的优化偏差。为了解决这些问题,我们提出了ReCal,一个用于基于RL的LLM路由的奖励校准(Reward Calibration)框架。我们首先引入了一种具有分量式优势估计的分层奖励分解机制。我们进一步提出了一种分布感知的优化策略,通过方差感知重加权和每数据集归一化来校准优化变异性。在七个数据集上的实验表明,ReCal在路由性能和训练稳定性上持续优于基线方法。代码可在 https://anonymous.4open.science/r/ReCal 获取。

英文摘要

Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at https://anonymous.4open.science/r/ReCal.

2606.12481 2026-06-12 cs.LG cs.AI 交叉投稿

Representing Time Series as Structured Programs for LLM Reasoning

将时间序列表示为结构化程序以进行LLM推理

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

发表机构 * Korea University(高丽大学) Mila, University of Montreal(蒙特利尔大学米拉研究所)

AI总结 提出T2SP方法,将时间序列分解为趋势、周期和显著事件并表示为结构化符号程序,使LLM无需微调即可高效推理,在编辑、描述和问答任务上优于原始序列表示。

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)展示了强大的推理和指令遵循能力,使其成为时间序列分析的潜在强大工具。然而,时间序列超出了其原生文本模态,引发了一个基本问题:应该如何表示时间序列,以便LLM能够有效地推理它们?现有工作通常序列化原始数值序列或在时间序列数据上微调预训练的LLM。这些方法将提取时间结构的负担直接放在LLM上,造成了模态不匹配,常常降低长序列的性能并引入大量计算开销。在这项工作中,我们引入了时间序列到结构化程序表示(T2SP),一种确定性的、无需训练的方法,将时间序列表示为结构化的符号程序。T2SP将时间序列分解为趋势、周期和显著事件,并以与LLM原生训练的文本和代码类模态对齐的程序友好格式表达它们。通过将时间结构提取从模型转移到表示本身,T2SP使现成的LLM能够利用其现有的推理能力进行时间序列理解。我们在三个推理任务上评估T2SP——编辑、描述和问答——与原始字符串表示相比,它持续提高了性能,减少了推理时间,并降低了失败率。我们的结果表明,T2SP提供了时间序列和LLM之间的有效接口。

英文摘要

Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.
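To make the "trends, periods, and salient events" decomposition concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not T2SP itself: the abstract does not specify the estimators or the program syntax, so the least-squares trend, the autocorrelation-based period, the z-score event rule, and the `trend(...)` / `period(...)` / `event(...)` output format are all hypothetical choices.

```python
import math
from statistics import mean, pstdev

def linear_trend(xs):
    """Least-squares slope and intercept against the index 0..n-1 (n >= 2)."""
    n = len(xs)
    t_bar, x_bar = (n - 1) / 2, mean(xs)
    slope = (sum((t - t_bar) * (x - x_bar) for t, x in enumerate(xs))
             / sum((t - t_bar) ** 2 for t in range(n)))
    return slope, x_bar - slope * t_bar

def dominant_period(xs):
    """Lag in [2, n // 2] with the largest unnormalised autocorrelation (n >= 4)."""
    m = mean(xs)
    d = [x - m for x in xs]
    n = len(d)
    return max(range(2, n // 2 + 1),
               key=lambda lag: sum(d[i] * d[i + lag] for i in range(n - lag)))

def salient_events(xs, z_thresh=3.0):
    """Indices whose z-score exceeds z_thresh (hypothetical event rule)."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        return []
    return [i for i, x in enumerate(xs) if abs(x - m) / s > z_thresh]

def to_program(xs):
    """Render the series as a short symbolic program (made-up syntax)."""
    slope, intercept = linear_trend(xs)
    resid = [x - (intercept + slope * t) for t, x in enumerate(xs)]
    lines = [f"trend(slope={slope:.3f}, intercept={intercept:.3f})",
             f"period(length={dominant_period(resid)})"]
    lines += [f"event(t={i}, value={xs[i]:.3f})" for i in salient_events(resid)]
    return "\n".join(lines)
```

The resulting text is what would be placed in the LLM prompt instead of the raw numbers. Note that `dominant_period` always returns some lag, even for a series with no periodicity, so a practical version would need a significance check before emitting a `period` line.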

2606.12505 2026-06-12 cs.LG cs.AI 交叉投稿

Boosting Direct Preference Optimization with Penalization

通过惩罚增强直接偏好优化

Pengwei Sun

发表机构 * Pengwei Sun(Sun Pengwei)

AI总结 提出DPOP,在DPO损失上增加对参考模型贪婪响应的门控惩罚,仅当当前策略对偏好响应概率低于拒绝响应时激活,在AlpacaEval 2.0上显著提升胜率。

Comments Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

详情
AI中文摘要

离线偏好优化已成为从人类反馈中进行强化学习的实用替代方案,但诸如直接偏好优化(DPO)及其变体等成对目标仅使用存储在静态数据集中的选择和拒绝响应。这留下了一个有用的信号未被利用:参考模型本身为同一提示生成的响应。我们提出了带惩罚的直接偏好优化(DPOP),这是DPO的一个简单扩展,它在基础偏好损失上增加了一个对参考贪婪响应的门控惩罚。DPOP仅在当前策略对偏好响应的似然仍低于对拒绝响应的似然时激活此惩罚。在AlpacaEval 2.0上,DPOP在Llama-3-8b-it和Gemma-2-9b-it上均提高了长度控制的胜率,相对于DPO、SimPO和AlphaDPO,在两个模型上分别实现了5.3%和4.4%的相对增益。消融实验进一步表明,在此设置下,SimNPO风格的长度归一化惩罚比NPO和token级非似然惩罚更强。

英文摘要

Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
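The gating described above can be sketched on scalar sequence log-probabilities. The DPO term is the standard formulation. Everything else is an assumption for illustration: the abstract does not give the penalty's functional form, its weight, or whether the gate compares raw or length-normalized likelihoods, so `lam`, the length-normalized penalty below, and the raw-likelihood gate are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from sequence log-probs of the chosen (w) and
    rejected (l) responses under the policy and the reference model."""
    return -log_sigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l)))

def dpop_loss(pol_w, pol_l, ref_w, ref_l, pol_greedy, greedy_len,
              beta=0.1, lam=1.0):
    """DPO plus a gated penalty on the reference-greedy response.

    pol_greedy: policy log-prob of the response the reference model
                would generate greedily for the same prompt
    greedy_len: token length of that response
    lam:        hypothetical penalty weight
    """
    base = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
    # Gate: active only while the policy still ranks the pair wrongly.
    gate = 1.0 if pol_w < pol_l else 0.0
    # Assumed length-normalized penalty: it approaches log(2) as the
    # greedy response's likelihood rises and decays toward 0 as it falls.
    penalty = -log_sigmoid(-beta * pol_greedy / greedy_len)
    return base + lam * gate * penalty
```

When the policy already prefers the chosen response, `dpop_loss` reduces exactly to `dpo_loss`, which matches the abstract's description of a penalty that is only active on mis-ranked pairs.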

2606.12691 2026-06-12 cs.LG cs.AI cs.SY eess.SY math.OC stat.ML 交叉投稿

Two-Layer Linear Auto-Regressive Models Estimate Latent States

两层线性自回归模型估计潜在状态

Yahya Sattar, Sunmook Choi, Leo Maynard-Zhang, Yassir Jedra, Maryam Fazel, Sarah Dean

AI总结 本文证明两层线性自回归模型通过经验风险最小化训练时,能近似卡尔曼滤波,恢复潜在状态估计,并提供有限样本保证。

Comments ICML 2026

详情
AI中文摘要

自回归模型已成为处理序列数据(从语言到视频)的强大工具。理解这些模型如何以及为何学习潜在表示仍然是一个开放的理论问题。在这项工作中,我们证明,当在部分观测的线性动力系统的数据上通过经验风险最小化训练时,两层线性自回归模型自然学会近似卡尔曼滤波。特别地,我们表明,学习到的隐藏表示与最优(卡尔曼)滤波器产生的状态估计一致,仅相差一个相似变换,尽管模型没有关于底层动力学或状态的显式知识。该结果基于三个主要见解。首先,我们建立卡尔曼滤波器可以被具有有界截断误差的自回归模型很好地近似。其次,我们表明,尽管非凸性,两层优化景观是良性的,即所有驻点要么是严格鞍点,要么是全局最小值。最后,作为我们的主要贡献,我们提供了关于预测误差、参数估计误差和潜在状态恢复的有限样本保证。数值模拟支持理论结果,并表明自回归模型的潜在表示恢复了状态估计。

英文摘要

Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contribution, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
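The first insight can be illustrated with the textbook steady-state Kalman predictor. The notation below is ours rather than the paper's, and it assumes a time-invariant system $x_{t+1} = A x_t + w_t$, $y_t = C x_t + v_t$ with a steady-state predictor gain $L$ such that $A - LC$ is stable.

```latex
% Steady-state Kalman predictor in innovations form
\hat{x}_{t+1} = A\hat{x}_t + L\,(y_t - C\hat{x}_t) = (A - LC)\,\hat{x}_t + L\,y_t

% Unrolling p steps gives an auto-regressive predictor plus a truncation term
\hat{y}_{t+1} = C\hat{x}_{t+1}
  = \sum_{k=0}^{p-1} C\,(A - LC)^{k} L\, y_{t-k} \;+\; C\,(A - LC)^{p}\,\hat{x}_{t+1-p}

% The AR coefficient matrix factors through the n-dimensional state
G_p = \begin{bmatrix} CL & C(A-LC)L & \cdots & C(A-LC)^{p-1}L \end{bmatrix}
    = C \begin{bmatrix} L & (A-LC)L & \cdots & (A-LC)^{p-1}L \end{bmatrix}
```

Because $A - LC$ is stable, the last term of the second line decays geometrically in $p$, which is the sense in which a finite-order auto-regressive model can approximate the filter. The factorization in the third line is only determined up to an invertible $T$, since $C\,[\cdot] = (CT^{-1})(T\,[\cdot])$, which is consistent with the abstract's statement that the hidden representation matches the Kalman state estimate up to a similarity transformation.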

2606.12841 2026-06-12 cs.LG cs.AI 交叉投稿

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

TimeROME-DLM:掩码扩散语言模型的时间因果追踪与低秩推理时知识编辑

Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出TimeROME-DLM,首个无需训练和梯度的推理时知识编辑框架,通过时间因果追踪定位关键坐标并应用低秩残差编辑,在保持模型性能的同时高效删除事实。

详情
AI中文摘要

掩码扩散语言模型(MDLM),如LLaDA,现已能与自回归(AR)大语言模型(LLM)竞争,但现有的所有知识编辑和遗忘方法(如ROME、MEMIT等)均针对AR Transformer,要么做出在迭代去噪下失败的假设,要么需要梯度更新,其反向传播激活会消耗数十GB的额外显存,并在标准学习率下导致MDLM崩溃。我们提出TimeROME-DLM,这是首个针对MDLM的无需训练、无需梯度、推理时的知识编辑框架。它结合了两个组件:时间间接效应(TIE)因果追踪协议,用于识别每个事实中在后续去噪步骤中最强驱动对象预测的坐标;以及一个闭式低秩残差编辑记忆,该记忆聚合所有遗忘事实的主语键和目标差值,并在每个扩散前向步骤中对该坐标应用单次岭正则化更新,同时通过稀疏化限制效用溢出。骨干权重保持冻结;仅需在小型验证集上调整三个超参数(alpha、lambda、q)。在TOFU forget01任务上,使用TOFU微调的LLaDA-8B-Base,TimeROME-DLM将遗忘集的对数概率降低了约83 nats。相同的配置可迁移至LLaDA-8B-Instruct、Dream-7B、MMaDA-8B、DiffuLLaMA-7B和LLaDA-MoE-1.4B。在50个顺序插入的事实中,它使保留集的对数概率几乎持平(在效用安全操作点处波动约1 nat),相比最强的收敛训练时基线,实现了四到十四倍的墙钟加速且零额外显存,并亚线性地扩展到400个事实。TimeROME-DLM以极小的计算代价弥合了AR LLM与MDLM之间的定位-编辑差距。

英文摘要

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward pass, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

2606.12921 2026-06-12 cs.LG cs.AI 交叉投稿

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

LoRA-Muon:低秩流形上的谱最速下降

Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

发表机构 * Ateneo de Manila University(雅典耀马尼拉大学) EleutherAI NaXys, UNamur(纳慕尔大学NaXys研究所)

AI总结 提出LoRA-Muon优化器,将Muon的谱最速下降规则应用于低秩微调,解决LoRA对初始化敏感、最优学习率跨秩迁移差等问题,在TinyShakespeare上以秩32达到比稠密基线更低的验证损失。

Comments 20 pages, 4 figures

详情
AI中文摘要

低秩适应(LoRA)显著降低了微调深度学习模型的计算和内存成本,但通常比稠密训练更难调优:当使用因子级优化器(如AdamW)时,它对初始化选择敏感,其最优学习率在秩之间迁移性差,且常常无法超越稠密基线。我们通过将Muon优化器的谱最速下降规则应用于低秩设置,推导出LoRA-Muon。结合我们的分裂权重衰减规则,我们的主要主张是LoRA-Muon是全秩Muon和Shampoo族优化器的一个良好的低秩代理。其最优学习率在秩、宽度、深度和因子重缩放之间均可迁移。在我们计算匹配的TinyShakespeare研究中,秩2代理恢复了稠密最佳测试学习率,秩32的LoRA-Muon运行在种子平均扫描中达到了比稠密基线更低的平均验证损失。我们进一步表明,Spectron优化器依赖于任意的因子缩放,因此在从严重不平衡的因子开始微调时可能不太适用,并且LoRA-RITE的简化QR坐标核心实现了相同的谱更新。LoRA-Muon无需QR分解即可计算该更新,并避免存储二阶矩,使其更易于加速器使用且内存效率更高。

英文摘要

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.

2606.13024 2026-06-12 cs.LG cs.AI 交叉投稿

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

CausalMoE:基于模式路由异构专家的十亿规模多模态基础模型用于格兰杰因果发现

Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院通用人工智能国家重点实验室) National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University(北京大学健康医疗大数据国家研究院、人工智能研究院)

AI总结 提出CausalMoE,一种十亿规模多模态格兰杰因果基础模型,通过模式路由混合异构专家解耦动态机制,结合因果自注意力与LLM/VLM先验,实现稀疏因果图恢复,在监督和少样本场景中达到最优。

详情
AI中文摘要

格兰杰因果发现(GCD)是分析复杂系统中时间依赖性的基础。然而,现有的神经GCD方法主要依赖“一刀切”范式,难以捕捉真实世界时间序列中固有的分布偏移和动态机制变化,常导致表示纠缠和虚假因果图。本文提出CausalMoE,一种十亿规模多模态格兰杰因果基础模型,显式建模补丁级异质性。CausalMoE引入模式路由混合异构专家,动态识别潜在时间模式并将补丁路由到专门领域专家,有效解耦机制特定动态与共享动态。为确保可解释的图恢复,我们设计了一种跨变量运行的因果感知自注意力机制,通过近端优化生成稀疏格兰杰因果图。此外,CausalMoE是首个集成LLM和VLM以对齐数值信号与文本和视觉先验的模型,在复杂场景中正则化因果估计。大量实验表明,CausalMoE在全监督基准上达到新最优,同时在传统方法失败的少样本设置中有效泛化。

英文摘要

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.

2606.13081 2026-06-12 cs.LG cs.AI 交叉投稿

Emotional regulation improves deep learning-based image classification

情绪调节改善基于深度学习的图像分类

Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

发表机构 * Mare Group(Mare集团) NOVA LINCS(NOVA LINCS实验室) Institute of Engineering (ISE), University of Algarve(阿尔加维大学工程学院) Department of Energy Technologies and Renewable Sources, ENEA Casaccia Research Center(ENEA卡萨恰研究中心能源技术与可再生能源部)

AI总结 提出情绪调节框架,通过人工主观体验在深度学习中建模情绪,在图像分类任务中预训练ResNet和ViT,在CIFAR-10/100上超越现有方法,成为情绪增强深度学习的新标杆。

详情
AI中文摘要

情绪显著影响认知,能在特定条件下增强记忆和学习。基于这一原理,情绪增强深度学习研究情感状态如何改善神经网络架构和学习范式,实现比非情绪模型更好的泛化。然而,现有方法通常仅依赖客观神经生理因素,忽视了情绪的主观性。为弥补这一差距,本研究引入情绪调节,一种通过人工主观体验在深度学习中建模情绪的新框架。该方法采用基于情感刺激的预训练,在下游任务优化中平衡非情绪和情绪影响响应。在图像分类中进行了广泛实验,在四个情感数据集上预训练ResNet和ViT架构,以CIFAR-10和CIFAR-100作为目标基准。结果显示,相比上述骨干网络有改进,证明情绪调节是通过人工主观体验定义情绪增强深度学习的有前景方法。此外,所提方法超越了基于CIFAR的图像分类相关工作,揭示情绪调节成为大规模视觉数据集上情绪增强深度学习的新标杆。研究还提供了情感状态改善机器学习任务优化的证据,鼓励进一步探索情绪启发架构。

英文摘要

Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally influenced responses in downstream task optimization. Extensive experiments were conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets and using CIFAR-10 and CIFAR-100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence that Emotional Regulation is a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work on CIFAR-based image classification, establishing Emotional Regulation as the new state of the art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces the evidence that affective states improve the optimization of machine learning tasks, encouraging further investigation of emotion-inspired architectures.

2606.13125 2026-06-12 cs.LG cs.AI 交叉投稿

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

选择与改进:理解推理后训练的机制

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

发表机构 * Microsoft Research NYC(微软研究院纽约) UIUC(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过控制实验揭示强化学习后训练通过策略选择和策略改进两种机制提升推理能力,并指出SFT数据和RL数据的不同作用。

详情
AI中文摘要

强化学习已迅速成为推理和编码模型训练的关键组成部分,但从机制角度理解仍不足。我们研究通过强化学习后训练如何以及通过哪些底层过程获取或增强能力。基于Qwen-2.5-1.5B的受控数学推理实验分析揭示了两种核心机制:策略选择和策略改进。我们的结果强调了SFT数据和强化学习数据在激活这些机制中的作用,特别展示了监督模型使用多种推理策略如何实现策略选择,以及增加强化学习数据难度如何实现策略改进。综合来看,我们的结果为RL训练提供了机制性见解,并提出了继续扩展推理能力的实用干预措施。

英文摘要

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

2606.13233 2026-06-12 cs.LG cs.AI 交叉投稿

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

ReSET: 通过步骤感知温度缩放实现精确的延迟关键型NVFP4推理

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

发表机构 * Hanyang University(汉阳大学) Xenoscube Korean Inc.(Xenoscube韩国公司)

AI总结 针对大型推理模型在NVFP4低精度推理中精度下降和延迟问题,提出基于推理步骤熵的温度缩放方法ReSET,并设计CUDA小M核,在多个基准上提升精度约2点,解码速度提升2倍。

详情
AI中文摘要

大型推理模型(LRMs)通过生成长中间推理轨迹来改进复杂问题求解,但这大幅增加了推理成本。NVFP4推理通过硬件支持的低精度执行提供了一种减少计算和内存成本的有前景方法。然而,直接将NVFP4应用于LRMs引入了两个实际限制:量化下推理精度下降,且现有NVFP4核在小型批处理自回归解码中未完全实现延迟优势。在这项工作中,我们分析了NVFP4量化对推理过程中token级不确定性的影响。我们表明,量化增加了低熵符号token的错误采样,同时导致在高不确定性推理步骤中过度集中于少量token。基于这一观察,我们提出了ReSET,一种基于推理步骤熵的温度缩放方法,它在线估计步骤级不确定性,并使用token级和步骤级熵信号自适应调整解码温度。为解决延迟差距,我们进一步设计了一个CUDA核心的小型$M$ NVFP4核,用于延迟关键的自回归解码。在推理基准和模型规模上,ReSET将NVFP4推理精度相比NVFP4基线提升高达约2个点。我们的CUDA核心小型$M$核进一步改善了延迟关键解码,相比NVFP4 vLLM提供高达2.5倍的核级加速,相比BF16提供约2倍的端到端解码加速。代码可在 https://github.com/aiha-lab/ReSET 获取。

英文摘要

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to about 2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to 2.5x kernel-level speedup over NVFP4 vLLM and approximately 2x end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.

2606.13240 2026-06-12 cs.LG cs.AI cs.CV stat.ME stat.ML 交叉投稿

Towards More General Control of Diffusion Models Using Jeffrey Guidance

使用 Jeffrey 引导实现扩散模型的更通用控制

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei

发表机构 * Inria, CNRS, I3S, Maasai Université Côte d’Azur(法国国家信息与自动化研究所、法国国家科学研究中心、信息、信号与系统实验室、Maasai团队、蔚蓝海岸大学) Technical University of Denmark(丹麦技术大学) Inria, CNRS, LJAD, Maasai Université Côte d’Azur(法国国家信息与自动化研究所、法国国家科学研究中心、让·亚历山大·迪厄多内实验室、Maasai团队、蔚蓝海岸大学)

AI总结 提出 Jeffrey 引导框架,通过 Jeffrey 条件规则更新边缘分布,扩展扩散模型控制到标准引导无法表达的应用,在 CIFAR-10 和 FFHQ 上显著降低 FID,并在 CelebA-HQ 上实现公平性控制。

详情
AI中文摘要

扩散模型的一个关键优势在于其灵活性,因为其输出可以在采样时通过引导进行控制。然而,除了条件采样等简单情况外,目标分布通常隐含地定义,仅通过采样规则或启发式能量函数给出。为了解决这个问题,我们提出了 Jeffrey 引导,这是一个原则性框架,将扩散模型控制扩展到标准引导无法表达的应用。它利用 Jeffrey 条件规则将边际分布更新到指定的目标,保持条件结构并最小化对联合分布的扰动。我们首先通过针对指定的嵌入分布来演示 Jeffrey 引导。以 Inception 嵌入为目标,这导致在 CIFAR-10 和 FFHQ 上 FID 显著降低。我们进一步将 Jeffrey 引导应用于 CelebA-HQ 上的公平性,更新无条件扩散模型以强制属性之间的独立性。

英文摘要

A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
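For reference, Jeffrey's rule of conditioning in its standard discrete form is shown below. Here $P$ is the current joint distribution over a sample $x$ and a summary variable $y$, and $Q$ is the prescribed target marginal for $y$. Reading $y$ as an embedding or attribute of the generated sample is our assumption about how the rule maps onto the diffusion setting.

```latex
% Jeffrey's rule: replace the marginal of y, keep the conditional of x given y
P_J(x, y) = P(x \mid y)\, Q(y),
\qquad
P_J(x) = \sum_{y} P(x \mid y)\, Q(y)

% Why this is the minimal perturbation: for any joint P' with y-marginal Q,
\mathrm{KL}(P' \,\|\, P)
  = \mathrm{KL}(Q \,\|\, P_Y)
  + \mathbb{E}_{y \sim Q}\!\left[ \mathrm{KL}\big(P'(\cdot \mid y) \,\|\, P(\cdot \mid y)\big) \right]
```

The first term on the right is fixed by the constraint, and the second vanishes exactly when $P'(\cdot \mid y) = P(\cdot \mid y)$, so $P_J$ is the closest joint to $P$ in KL divergence among those with the target marginal. Ordinary conditioning on $y = y_0$ is the special case where $Q$ is a point mass at $y_0$.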

2606.13276 2026-06-12 cs.LG cs.AI 交叉投稿

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

不同层,不同流形:Transformer优化中的模块级权重空间几何

Kirato Yoshihara

发表机构 * School of Engineering Science, The University of Osaka(大阪大学工程科学学院)

AI总结 研究Transformer不同模块偏好不同流形几何,提出为注意力层和MLP层分别分配Stiefel和DGram约束,在GPT-2预训练中取得最佳性能。

Comments Accepted at WSS @ ICML 2026, code is available at https://github.com/kiratoyoshihara/module-wise-manifold-muon

详情
AI中文摘要

权重空间几何在神经网络优化中扮演核心角色,但流形约束通常统一应用于所有权重矩阵。在这项工作中,我们探究不同Transformer模块是否偏好不同的流形几何。我们研究GPT-2预训练的Manifold Muon,并比较跨注意力块和MLP块的Stiefel和DGram约束的逐层分配。我们的结果显示出明显的不对称性:在测试配置中,将注意力层约束为Stiefel几何,同时将MLP层分配为DGram几何,获得了最佳性能;而反向分配和全DGram配置在共享超参数设置下变得不稳定。我们将这种失败归因于DGram约束的注意力权重中奇异值的增长,这会放大注意力logits并导致softmax饱和。这些发现表明,Transformer的对称感知和几何感知优化应该是模块特定的,而不是统一的。

英文摘要

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.

2606.13285 2026-06-12 cs.LG cs.AI 交叉投稿

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

Once-for-All: 基于均衡状态估计的可扩展同步预测

Beinan Xu, Andy Song, Jiti Gao, Feng Liu

发表机构 * RMIT University(皇家墨尔本理工大学) Monash University(莫纳什大学) University of Adelaide(阿德莱德大学)

AI总结 提出均衡状态估计(ESE)范式,通过一次前向传播估计多系统均衡状态并基于状态差异生成预测,在保持精度的同时实现10-70倍加速,且具有线性时间复杂度和鲁棒性。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们引入均衡状态估计(ESE),一种用于同步预测的新范式,其中多个相互作用的系统需要独立但协调的预测。这种场景在现实世界中经常出现,例如经济学和医疗建模。与一次预测一个系统的现有方法不同,ESE在一次前向传播中预测所有系统。它首先估计跨系统的均衡状态,然后基于当前状态与估计均衡之间的差异生成整体预测。在合成和真实世界数据集(包括货币汇率和COVID-19传播建模)上的大量实验表明,ESE至少与最先进(SOTA)方法一样准确,同时速度显著更快。此外,ESE与传统预测器无缝集成,结合了它们的准确性和其卓越的效率,实现了10-70倍的加速。凭借线性时间复杂度,随着系统数量的增加,ESE的扩展性远优于SOTA方法。此外,它在各种扰动下仍保持准确,使ESE成为一种快速、可泛化、鲁棒且可扩展的多预测方法。

英文摘要

We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.

2606.13311 2026-06-12 cs.LG cs.AI 交叉投稿

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

基于离线模仿学习的海事异常检测中的稀有门控上下文调节

Yongmin Kim, ByeongHoon Jeon, Sungil Kim

发表机构 * Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术院工业工程系)

AI总结 提出RGFiLM模块,通过稀有度门控调节上下文调制强度,解决上下文异常检测中稀有上下文导致的高误报问题,在海事轨迹异常检测中取得最佳F1-FPR权衡。

详情
AI中文摘要

上下文异常检测旨在根据上下文变量识别异常行为,但实际部署常面临高度不平衡的上下文分布,其中稀有情境可能包含关键信息。在这种频率偏差下,上下文条件模型可能在稀有上下文中产生不稳定的决策和过多的误报。我们提出稀有门控特征线性调制(RGFiLM),一种稀有感知调节模块,结合特征调制(即上下文条件化的隐藏特征缩放和平移)与由数据驱动稀有度分数控制的门控。稀有度分数根据上下文变量的经验分布估计,并调节上下文对中间表示的调制强度:在稀有上下文中门控更果断,而在常见上下文中保持保守。我们在使用AIS运动序列和ERA5环境上下文的环境敏感绕行场景中评估RGFiLM在海事轨迹异常检测中的表现。当实例化到顺序异常评分流程中时,RGFiLM在比较的上下文无关和上下文条件方法中实现了最佳的平均F1-假阳性率(FPR)权衡。这些结果表明,显式考虑上下文稀有性是减少上下文敏感异常检测中误报的有效方法。

英文摘要

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions in which rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean trade-off between F1 and False Positive Rate (FPR) among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
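
The gated modulation described above can be sketched in a few lines. This is a minimal illustration, not the authors' RGFiLM module: the rarity definition (one minus empirical frequency of a discrete context), the sigmoid gate and its `sharpness`/`midpoint` parameters, and the residual form of the modulation are all assumptions made for the example.

```python
import math
from collections import Counter


def rarity_scores(contexts):
    """Rarity of each discrete context = 1 - empirical frequency (assumed definition)."""
    counts = Counter(contexts)
    n = len(contexts)
    return {c: 1.0 - counts[c] / n for c in counts}


def gate(rarity, sharpness=10.0, midpoint=0.5):
    """Sigmoid gate in (0, 1) that grows with rarity (hypothetical functional form)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - midpoint)))


def rarity_gated_film(h, gamma, beta, g):
    """Gated feature-wise modulation: h_i -> (1 + g * gamma_i) * h_i + g * beta_i.

    With g = 0 the hidden features pass through unchanged; with g = 1 this is
    a plain FiLM-style scale-and-shift.
    """
    return [(1.0 + g * ga) * hi + g * be for hi, ga, be in zip(h, gamma, beta)]


# Nine "calm" contexts and one "storm": the storm context is rare, so its gate opens.
scores = rarity_scores(["calm"] * 9 + ["storm"])
g_storm = gate(scores["storm"])
g_calm = gate(scores["calm"])
```

Under these assumptions the context barely touches the features in the frequent regime and modulates them almost fully in the rare one, which is one reading of "decisive under rare contexts, conservative under frequent contexts".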

2606.13400 2026-06-12 cs.LG cs.AI cs.RO 交叉投稿

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

PolyFlow: 安全高效的多面体约束流匹配,具有约束嵌入和无投影更新

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出PolyFlow,一种将约束直接嵌入模型和流动力学的多面体约束流匹配框架,通过离散时间流公式和无投影架构消除离散化误差并严格满足任意多面体约束,在规划与控制任务中实现零约束违反并降低推理延迟。

Comments 30 pages, 12 figures, Accepted to ICML 2026

详情
AI中文摘要

尽管基于流的生成模型在广泛领域展现了强大的性能,但由于严格的约束要求,在安全关键的物理系统中部署它们仍然具有挑战性。现有方法通常通过事后修正来强制执行安全性,这会产生大量的计算开销,并可能扭曲学习到的分布。我们提出了PolyFlow,一种多面体约束流匹配框架,将约束直接嵌入到模型和流动力学中。PolyFlow引入了离散时间流公式和无投影架构,消除了离散化误差,并保证严格满足任意多面体约束,无需昂贵的迭代求解器。实验结果表明,PolyFlow在规划和控制任务中实现了零约束违反,同时保持了较高的分布保真度。与最先进的约束生成基线相比,PolyFlow显著降低了推理延迟,并在安全性、效率和生成质量之间展示了有利的权衡。代码可在 https://github.com/MJianM/PolyFlow 获取。

英文摘要

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available at https://github.com/MJianM/PolyFlow.
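
To make "projection-free" satisfaction of polyhedral constraints concrete, here is a small sketch of one generic technique: clipping the step length along the update direction so the iterate never leaves the polytope {x : Ax <= b}. This is a stand-in chosen for illustration and is not claimed to be PolyFlow's architecture; the function names are hypothetical.

```python
def max_feasible_step(A, b, x, v, eps=1e-12):
    """Largest alpha >= 0 with A(x + alpha * v) <= b, assuming A x <= b already holds."""
    alpha = float("inf")
    for row, bi in zip(A, b):
        av = sum(a * vi for a, vi in zip(row, v))
        if av > eps:  # only constraints we are moving toward can become active
            slack = bi - sum(a * xi for a, xi in zip(row, x))
            alpha = min(alpha, slack / av)
    return alpha


def constrained_step(A, b, x, v, dt):
    """Take a discrete-time step of size dt along v, shortened if it would exit the polytope."""
    step = min(dt, max_feasible_step(A, b, x, v))
    return [xi + step * vi for xi, vi in zip(x, v)]


# Unit box 0 <= x <= 1 in 2D written as A x <= b.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
x_next = constrained_step(A, b, [0.5, 0.5], [1.0, 0.0], dt=1.0)  # stops at the face x1 = 1
```

The point of the sketch is only that feasibility can be kept by construction with closed-form arithmetic, with no iterative projection solver in the loop.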

2606.13473 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

MaxProof: 通过生成-验证器强化学习与群体级测试时扩展实现数学证明规模化

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

发表机构 * MiniMax The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 提出MaxProof框架,结合生成-验证器强化学习与群体级测试时扩展,在MiniMax-M3系列上实现竞赛级数学证明,在IMO 2025和USAMO 2026上超越人类金牌阈值。

详情
AI中文摘要

我们提出了MaxProof,一个用于MiniMax-M3系列中竞赛级数学证明的群体级测试时扩展框架。M3首先使用为低误报率设计的深度防御生成验证器,训练三种面向证明的能力——证明生成、证明验证和基于批评的证明修复。这些能力被合并到单个发布的M3模型中。在测试时,MaxProof将模型视为生成器、验证器、精炼器和排序器,在候选证明群体中进行搜索,并通过锦标赛选择返回一个最终证明。通过MaxProof测试时扩展,M3模型在IMO 2025上达到35/42,在USAMO 2026上达到36/42,两者均超过了人类金牌阈值。

英文摘要

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
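
The final "tournament selection" step can be pictured as a single-elimination bracket over candidate proofs. The sketch below assumes a pairwise comparator `prefer(a, b)`, a hypothetical stand-in for the model acting as ranker; the abstract does not specify the bracket format, so this is one plausible shape rather than MaxProof's procedure.

```python
def tournament_select(candidates, prefer):
    """Single-elimination tournament: prefer(a, b) returns True when a should advance over b."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate out gets a bye
        pool = next_round
    return pool[0]


# Toy candidates: (proof id, verifier score). The scores are made up for the example.
proofs = [("p1", 0.2), ("p2", 0.9), ("p3", 0.5)]
winner = tournament_select(proofs, prefer=lambda a, b: a[1] >= b[1])
```

With a noisy comparator, pairwise brackets need only about n - 1 comparisons for n candidates, which is one reason they are a natural fit for expensive model-based ranking.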

2606.13486 2026-06-12 cs.LG cs.AI 交叉投稿

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

CRAFTIIF:用于多元时间序列异常检测的跨分辨率分析四类型可解释孤立森林

William Smits

发表机构 * Avathon

AI总结 提出CRAFTIIF无监督框架,通过四种小波特征和五个孤立森林同时检测点、分布、时间和集体四类异常,在mTSBench基准上达到平均F1=0.228,VUS-PR比先前最佳提升40.7%。

Comments 14 pages, 4 figures, 2 appendices. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE). Code: https://github.com/smitswil/craftiif

详情
AI中文摘要

多元时间序列中的异常检测面临四种结构不同的异常类型:点异常(孤立尖峰)、分布异常(水平偏移)、时间异常(节奏变化)和集体异常(传感器间相关性崩溃),每种都需要不同的特征表示。大多数无监督方法只针对其中一两种类型,且可解释性有限。我们提出CRAFTIIF(跨分辨率分析四类型可解释孤立森林),这是一个完全无监督的框架,针对所有四种类型,无需针对数据集调整。CRAFTIIF生成K=500个随机分析小波特征,跨越四个小波族(Morlet、DOG、Haar、Coiflet),每个针对特定异常类型,并输入五个结构化的孤立森林:每种类型一个,外加一个用于复合异常的元IF。自适应Otsu/MAD阈值在0.1%到69.2%的异常率范围内自动校准检测。由于每个IF仅针对特定类型的特征进行训练,分支触发直接提供异常类型归因,无需事后解释。在mTSBench基准(Zhou等人,TMLR 2026)的所有19个数据集上评估,CRAFTIIF在全部19个数据集上达到平均F1=0.228,在13个可检测数据集上F1=0.322,并在全部25种被评估方法中VUS-PR排名第一(0.463对比之前最佳0.329,提升40.7%)。一个由oracle F1、可检测性限制和分支分离比组成的诊断框架识别出19个数据集中有6个从根本上无法被任何无监督方法检测。在11种消融条件下,自适应阈值(+38% F1)、四分支结构(+20%)和元IF(+23%)均被证明是必不可少的。代码:https://github.com/smitswil/craftiif

英文摘要

Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: https://github.com/smitswil/craftiif
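
As a rough illustration of the MAD half of the adaptive Otsu/MAD threshold, the sketch below flags anomaly scores that sit far above the median in robust units. How CRAFTIIF combines Otsu and MAD is not stated in the abstract; the multiplier `k` and the conventional 1.4826 scaling constant are assumptions of this example.

```python
from statistics import median


def mad_threshold(scores, k=3.0):
    """Robust cutoff: median + k * 1.4826 * MAD (1.4826 makes MAD comparable to a std dev)."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    return med + k * 1.4826 * mad


def flag_anomalies(scores, k=3.0):
    """Return one boolean per score; True marks scores above the MAD cutoff."""
    cutoff = mad_threshold(scores, k)
    return [s > cutoff for s in scores]


# Made-up anomaly scores from a single branch; only the last one stands out.
branch_scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.90]
flags = flag_anomalies(branch_scores)
```

Because the cutoff is derived from the score distribution itself, it moves with the data rather than relying on a fixed contamination rate, which is the property the abstract attributes to adaptive thresholding.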

2606.13571 2026-06-12 cs.LG cs.AI 交叉投稿

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

存在先于价值:时间序列预测中观测存在性与状态演变的联合建模

Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang

发表机构 * Ant International(蚂蚁国际)

AI总结 提出Timeflies框架,联合建模未来观测是否发生(存在性)与数值估计,通过观测流和数值流耦合模块提升缺失值时间序列预测性能。

详情
AI中文摘要

现实世界的时间序列常因传感器休眠、传输延迟和事件驱动采样而高度不完整和不规则,使得可靠预测面临根本性挑战。现有方法已从插值后预测的流水线发展到连续时间模型,如神经常微分方程和连续时间图网络。尽管这些方法改进了历史不规则性的建模,但它们仍然在推理时依赖一个隐式的先知假设:未来有效观测的时间戳被假定为预先已知。这一假设限制了实际相关性,因为在许多现实系统中,更根本的问题不仅是未来值是多少,还包括是否会有有效观测发生。在本文中,我们提出Timeflies,一个统一的框架,将预测重新表述为未来可观测性推断和数值估计的联合问题。为了显式建模观测动态与状态演变之间的交互,Timeflies采用观测流和数值流,通过三个专用模块(可靠性感知嵌入、观测引导的依赖建模和联合预测)进行耦合。我们进一步构建了Shadow基准,该基准结合了来自公共数据集和真实工业数据的自然缺失,并引入观测-值联合熵(OVJE)指标来全面评估这种耦合的可预测性。大量实验表明,Timeflies始终优于现有方法,突显了在缺失值时间序列预测中显式建模未来可观测性的重要性。代码和数据集见 https://github.com/ant-intl/Timeflies。

英文摘要

Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available at https://github.com/ant-intl/Timeflies.
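
One way to picture "joint observability inference and value estimation" is a loss with two terms: a binary cross-entropy on whether each future step is observed, and a squared error on the value that is counted only where an observation actually exists. The sketch below is an assumed objective for illustration; it is not the Timeflies training loss and it does not implement the OVJE metric, whose definition the abstract does not give.

```python
import math


def joint_loss(p_obs, y_hat, observed, y_true, value_weight=1.0, eps=1e-7):
    """Mean BCE on observation existence plus masked MSE on values.

    p_obs:    predicted probability that each future step is observed
    y_hat:    predicted value at each future step
    observed: 1 if the step was really observed, else 0
    y_true:   true value where observed (ignored elsewhere, may be None)
    """
    bce, sq_err, n_obs = 0.0, 0.0, 0
    for p, yh, o, y in zip(p_obs, y_hat, observed, y_true):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        bce -= math.log(p) if o else math.log(1.0 - p)
        if o:
            sq_err += (yh - y) ** 2
            n_obs += 1
    bce /= len(p_obs)
    mse = sq_err / n_obs if n_obs else 0.0
    return bce + value_weight * mse


# Two future steps: the first is observed (true value 2.0), the second never arrives.
loss = joint_loss(p_obs=[0.5, 0.5], y_hat=[1.0, 5.0], observed=[1, 0], y_true=[2.0, None])
```

The mask matters: the value prediction at the unobserved step (5.0 here) contributes nothing, so the model is never penalized against an imputed target.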

2606.13603 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

超越承诺边界:探究大型推理模型中的附带思维链

Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

发表机构 * CLCG, University of Groningen(格罗宁根大学CLCG) University of Milano-Bicocca(米兰比可卡大学) University of Trieste(的里雅斯特大学) Khoury College of Computer Sciences, Northeastern University(东北大学Khoury计算机科学学院)

AI总结 通过早期退出估计思维链步骤的因果重要性,发现推理中存在从瞬态猜测到稳定答案的“承诺边界”,后续步骤为附带现象,可提前退出以缩短推理长度达55%而不影响性能。

详情
AI中文摘要

思维链推理是语言模型推理时扩展的主导范式,但每个步骤对最终答案的因果影响尚不明确。我们通过早期退出估计每个步骤的因果重要性,并利用这一度量研究多个模型家族的推理轨迹中答案如何形成。在多种任务中,我们发现推理通常会跨越一个“承诺边界”——从瞬态中间猜测到稳定、高置信度答案的急剧转变。这种转变通常发生在单个步骤中,远在模型推理块结束之前,随后是“附带”的思维链步骤,这些步骤不改变最终答案概率。利用注意力探针,我们表明答案形成阶段可以从中间推理步骤中以高精度线性解码,并稳健地泛化到未见过的推理任务。我们利用这一信号在承诺边界处提前退出推理块,平均将思维链长度减少高达55%,而对模型性能影响微乎其微。

英文摘要

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary -- a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing CoT length by up to 55% on average with negligible impact on model performance.
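
Given a per-step early-exit readout (the answer the model would give if forced to stop after each step, with its probability), a commitment boundary can be located with a simple backward scan. The sketch below is an assumed operationalization: the `(answer, probability)` trace format and the confidence threshold `tau` are hypothetical, and the paper's own detection uses attention probes rather than this rule.

```python
def commitment_boundary(trace, tau=0.9):
    """Earliest step index from which the early-exit answer equals the final answer
    with probability >= tau at every later step. Returns None if the final step
    itself is not confident.

    trace: list of (answer, probability) pairs, one per reasoning step.
    """
    if not trace:
        return None
    final_answer = trace[-1][0]
    boundary = None
    for i in range(len(trace) - 1, -1, -1):  # walk backward from the last step
        answer, prob = trace[i]
        if answer == final_answer and prob >= tau:
            boundary = i
        else:
            break
    return boundary


# Made-up trace: two transient guesses, then a stable high-confidence answer.
steps = [("A", 0.40), ("B", 0.50), ("C", 0.95), ("C", 0.97), ("C", 0.96)]
b_idx = commitment_boundary(steps)
epiphenomenal_steps = len(steps) - 1 - b_idx  # steps that could be skipped by exiting early
```

In this toy trace the boundary is at step index 2, so the last two of five steps would be candidates for early exit.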

2604.16689 2026-06-12 cs.AI 版本更新

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

查询通道:基于掩码的解释的信息论极限

Erciyes Karakaya, Ozgur Ercetin

发表机构 * Department of Electrical and Computer Engineering, University of Maryland, College Park, USA(美国马里兰大学电气与计算机工程系) Faculty of Engineering and Natural Sciences, Sabanci University, Turkiye(土耳其萨班奇大学工程与自然科学学院)

AI总结 本文提出查询通道框架,将基于掩码的事后解释建模为通信过程,推导解释率与识别容量之间的信息论极限,并证明稀疏最大似然解码器可实现可靠恢复。

详情
AI中文摘要

基于掩码的事后解释方法,如KernelSHAP和LIME,通过在随机扰动下查询黑盒模型来估计局部特征重要性。本文将这一过程建模为在查询通道上的通信,其中潜在解释作为消息,每次掩码评估作为一次信道使用。在此框架内,解释的复杂度由假设类的熵捕获,而查询接口以每次查询的识别容量确定的速率提供信息。我们推导了一个强逆定理,表明如果解释率超过该容量,则对于任何解释器和解码器序列,精确恢复的错误概率必然收敛到一。我们还证明了一个可达性结果,即当速率低于容量时,稀疏最大似然解码器可实现可靠恢复。互信息的蒙特卡洛估计器提供了一个非渐近查询基准,我们用它来比较最优解码与模拟LIME和KernelSHAP的基于Lasso和OLS的过程。实验揭示了在一定的查询预算范围内,信息论允许可靠解释,但标准凸替代方法仍然失败。最后,我们将超像素分辨率和神经语言模型的分词解释为一种源编码选择,它设定了解释的熵,并展示了高斯噪声和非线性曲率如何劣化查询通道,引发瀑布和错误平层行为,并使高分辨率解释无法实现。

英文摘要

Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the error probability of exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.
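
The converse has a simple back-of-envelope reading: if the explanation carries H bits and each query delivers at most C bits, then reliable recovery needs roughly n >= H / C queries. The sketch below works this out for an assumed hypothesis class (a uniformly chosen set of k important features out of d); both that class and the capacity value used in the example are illustrative assumptions, and the estimate ignores the non-asymptotic corrections the paper addresses.

```python
import math


def sparse_explanation_entropy_bits(d, k):
    """Entropy in bits of a uniformly random k-subset of d features: log2 C(d, k)."""
    return math.log2(math.comb(d, k))


def min_queries(entropy_bits, capacity_bits_per_query):
    """Smallest n with entropy / n <= capacity, i.e. n >= H / C."""
    if capacity_bits_per_query <= 0:
        raise ValueError("capacity must be positive")
    return math.ceil(entropy_bits / capacity_bits_per_query)


# 3 important features out of 20 gives log2(1140) bits; 0.5 bit/query is a placeholder capacity.
h_bits = sparse_explanation_entropy_bits(20, 3)
n_min = min_queries(h_bits, 0.5)
```

Raising the resolution (larger d, for instance finer super-pixels or tokens) raises H, so the required query budget grows even if the channel itself does not change.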

2605.17770 2026-06-12 cs.AI cs.CL 版本更新

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

熵梯度反转:迈向大型推理模型的内部机制

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

发表机构 * National University of Singapore(新加坡国立大学) Renmin University of China(中国人民大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学)

AI总结 本文发现大型推理模型中令牌熵与logit梯度之间的稳健负相关(熵梯度反转),并提出相关性正则化组策略优化(CorR-PO)将其嵌入强化学习奖励正则化,从而提升推理性能。

Comments The authors are withdrawing this manuscript due to fundamental inaccuracies in the institutional affiliations and administrative attributions provided at the time of submission. As this version cannot be validated under the correct institutional framework, the authors request its formal withdrawal from the repository. No immediate replacement is intended

详情
AI中文摘要

大型推理模型(LRMs)的进步推动了从反应式“快思考”文本生成向系统性、逐步“慢思考”推理的范式转变,在复杂数学和逻辑任务中实现了最先进的性能。然而,该领域面临着令牌级行为分析与内部推理机制之间的根本差距,以及依赖昂贵外部验证器的推理优化强化学习(RL)的不稳定性。我们识别并正式定义了熵梯度反转,即令牌熵与logit梯度之间的稳健负相关,它作为LRM推理能力的明确几何指纹。在此基础上,我们提出相关性正则化组策略优化(CorR-PO),将这种反转特征嵌入RL奖励正则化。在多个模型规模的各种推理基准上的大量实验表明,CorR-PO始终优于最先进的基线,证实了更强的反转直接与更优的推理性能相关。

英文摘要

The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

2606.08098 2026-06-12 cs.AI cs.LG 版本更新

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

何时委托优于多数?一种基于委托的多样本LLM推理聚合器

Yasushi Sakai, Allen Song, Kent Larson

发表机构 * MIT Media Lab(麻省理工学院媒体实验室)

AI总结 提出基于委托的聚合器PPV,利用样本的字母熵和推理几何信号,在MMLU-Pro上比多数投票高1.5个百分点,无需标签或训练。

Comments Preprint. 16 pages, 5 figures, 4 tables

详情
AI中文摘要

多数投票是对多样本LLM推理进行无监督聚合的主流方法。我们证明,将每个样本携带的信号输入基于委托的聚合器(传播代理投票,PPV)可产生一种无监督共识规则,在MMLU-Pro上整体比多数投票高1.5个百分点,在非平凡子集上高2.24个百分点(配对McNemar p ~ 1.0e-14,n = 8,099)。多数投票丢弃了每个样本携带的两个自由信号:组内字母熵和组间推理几何。PPV暴露了两个每个投票者使用的杠杆,它们恰好消耗这些信号:WHEN(投票者保留自己选择的权重)和WHOM(如何将剩余权重分配给同行)。我们使用字母熵驱动WHEN,使用以问题为中心的嵌入余弦驱动WHOM。该方法不需要真实标签和辅助训练:对于每个问题,我们将128个采样生成划分为16组,计算每组的字母级语义熵和推理嵌入质心,并将两者输入随机委托矩阵,其平稳分布选择共识答案。我们通过一个例子说明PPV如何推翻一个明显的10-6多数(错误答案):10票的多数簇几何上不连贯(平均簇内余弦-0.02),而6票的少数簇紧凑(+0.26),因此传播的委托质量集中在少数派的答案上,尽管仅凭熵会使多数保持领先。我们还报告了具有负面结果的委托策略,这些策略限制了无监督LLM聚合的设计空间:没有问题内的置信度模式集成能够缩小与oracle的差距。

英文摘要

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
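
The When/Whom mechanism can be sketched as a small Markov chain: each voter keeps a fraction of its weight on itself (When) and spreads the rest over peers in proportion to similarity (Whom); the stationary distribution of the resulting row-stochastic matrix then weights the answers. The code below is a minimal sketch under assumptions: `keep` and `sim` are taken as given inputs (the paper derives them from letter entropy and centered embedding cosine), negative similarities are clipped to zero, and power iteration is used to find the stationary distribution.

```python
def delegation_matrix(keep, sim):
    """Row-stochastic matrix: P[i][i] = keep[i]; the remainder is split over peers by similarity."""
    n = len(keep)
    matrix = []
    for i in range(n):
        weights = [max(sim[i][j], 0.0) if j != i else 0.0 for j in range(n)]
        total = sum(weights)
        row = []
        for j in range(n):
            if j == i:
                row.append(keep[i] if total > 0 else 1.0)  # no usable peers: keep everything
            else:
                row.append((1.0 - keep[i]) * weights[j] / total if total > 0 else 0.0)
        matrix.append(row)
    return matrix


def stationary(matrix, iters=500):
    """Power iteration for pi = pi P, starting from the uniform distribution."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


def consensus(answers, keep, sim):
    """Sum stationary mass per answer letter and return (winning letter, mass per letter)."""
    pi = stationary(delegation_matrix(keep, sim))
    mass = {}
    for letter, m in zip(answers, pi):
        mass[letter] = mass.get(letter, 0.0) + m
    return max(mass, key=mass.get), mass


# Two voters: a confident one (keeps 0.9) answering "B", an unsure one (keeps 0.5) answering "A".
winner, mass = consensus(["B", "A"], keep=[0.9, 0.5], sim=[[0.0, 1.0], [1.0, 0.0]])
```

In this two-voter case the chain is P = [[0.9, 0.1], [0.5, 0.5]], whose stationary distribution is (5/6, 1/6): the unsure voter hands most of its influence to the confident one, so a raw 1-1 tie resolves to "B".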

2505.13102 2026-06-12 cs.LG cs.AI eess.SP 版本更新

Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

轻量级可解释Transformer:基于混合图算法展开的交通预测

Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种通过展开混合图优化算法构建的轻量级可解释类Transformer网络,用于时空交通预测,在保持竞争性能的同时大幅减少参数。

Comments 24 pages, 7 figures, 11 tables

详情
AI中文摘要

与采用经典自注意力机制的传统“黑箱”Transformer不同,我们通过展开基于混合图的优化算法构建了一个轻量级且可解释的类Transformer神经网络,用于具有空间和时间维度的交通预测。我们构建了两个图:一个无向图$\mathcal{G}^u$捕捉跨地理的空间相关性,以及一个有向图$\mathcal{G}^d$捕捉时间上的序列关系。我们预测信号$\mathbf{x}$的未来样本,假设其相对于$\mathcal{G}^u$和$\mathcal{G}^d$都是“平滑的”,为此我们设计了新的$\ell_2$和$\ell_1$范数变分项来量化并促进有向图上的信号平滑性(低频重构)。我们基于交替方向乘子法(ADMM)设计了一个迭代算法,并将其展开为一个前馈网络以进行数据驱动的参数学习。我们周期性地插入用于$\mathcal{G}^u$和$\mathcal{G}^d$的图学习模块,这些模块扮演自注意力的角色。实验表明,我们的展开网络在交通预测性能上与最先进的预测方案相当,同时大幅减少了参数数量。

英文摘要

Unlike conventional "black-box" transformers with a classical self-attention mechanism, we build a lightweight and interpretable transformer-like neural net by unrolling a mixed-graph-based optimization algorithm to forecast traffic with spatial and temporal dimensions. We construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We predict future samples of signal $\mathbf{x}$, assuming it is "smooth" with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We design an iterative algorithm based on the alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We periodically insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$ that play the role of self-attention. Experiments show that our unrolled networks achieve traffic forecasting performance competitive with state-of-the-art prediction schemes, while reducing parameter counts drastically.
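
Algorithm unrolling can be illustrated on the undirected part alone. The sketch below uses the standard graph Laplacian quadratic form (the sum over edges of w_ij (x_i - x_j)^2) as the smoothness term and unrolls plain gradient descent on ||x - y||^2 + mu * x^T L x, one iteration per "layer" with its own step size. Gradient descent is swapped in for the paper's ADMM purely for brevity, and the directed-graph variational terms, which are the paper's contribution, are not reproduced here.

```python
def laplacian_apply(x, edges):
    """Compute L x for an undirected weighted graph given as (i, j, w) triples."""
    out = [0.0] * len(x)
    for i, j, w in edges:
        diff = w * (x[i] - x[j])
        out[i] += diff
        out[j] -= diff
    return out


def smoothness(x, edges):
    """Quadratic form x^T L x = sum over edges of w * (x_i - x_j)^2."""
    return sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)


def unrolled_denoise(y, edges, mu, step_sizes):
    """Each entry of step_sizes is one unrolled layer; in a learned network these
    step sizes (and mu) would be trainable parameters."""
    x = list(y)
    for alpha in step_sizes:
        lx = laplacian_apply(x, edges)
        x = [xi - alpha * (2.0 * (xi - yi) + 2.0 * mu * li) for xi, yi, li in zip(x, y, lx)]
    return x


# Three sensors on a path graph 0 - 1 - 2 with unit weights.
path = [(0, 1, 1.0), (1, 2, 1.0)]
y_obs = [1.0, 2.0, 4.0]
x_one_layer = unrolled_denoise(y_obs, path, mu=1.0, step_sizes=[0.1])
```

A single layer already pulls neighboring values together, lowering the smoothness term from 5.0 to about 2.96 in this toy example; the parameter count is just the handful of step sizes and weights, which hints at why unrolled networks stay small.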

2507.02921 2026-06-12 cs.LG cs.AI 版本更新

PlaceRep: Geospatial Place Representation Learning from Large-Scale Point-of-Interest Data

PlaceRep: 基于大规模兴趣点数据的地理空间场所表示学习

Mohammad Hashemi, Hossein Amiri, Andreas Zufle

发表机构 * Emory University(埃默里大学)

AI总结 提出PlaceRep方法,通过聚类空间和语义相关的兴趣点构建场所级表示,无需预训练即可高效生成多尺度城市区域嵌入,在人口密度估计和房价预测任务中优于现有方法并实现百倍加速。

详情
AI中文摘要

学习城市环境的有效表示需要捕捉超越固定行政边界的空间结构。现有的地理空间表示学习方法通常将兴趣点(POI)聚合到预定义的行政区域(如普查单元或邮政编码区域),为每个区域分配单个嵌入。然而,POI 通常形成跨越、包含或超出这些边界的语义上有意义的组,定义了更能反映人类活动和城市功能的场所。为解决这一局限性,我们提出 PlaceRep,一种通过聚类空间和语义相关的 POI 来构建场所级表示的地理空间表示学习方法。PlaceRep 从美国 Foursquare 数据中总结大规模 POI 图,生成通用城市区域嵌入,同时自动识别跨多个空间尺度的场所。通过消除模型预训练,PlaceRep 为多粒度地理空间分析提供了可扩展且高效的解决方案。使用人口密度估计和房价预测作为下游任务的实验表明,PlaceRep 优于大多数最先进的基于图的地理空间表示学习方法,并在大规模 POI 图上生成区域级表示时实现了高达 100 倍的加速。PlaceRep 的实现可在 https://github.com/mohammadhashemii/PlaceRep 获取。

英文摘要

Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments with population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.

2507.05019 2026-06-12 cs.LG cs.AI 版本更新

Meta-Learning Transformers to Improve In-Context Generalization

元学习变换器以改进上下文泛化

Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci

发表机构 * University of Trento, Italy(特伦托大学,意大利) Eindhoven University, Netherlands(埃因霍温大学,荷兰) University of Doha for Science and Technology, Qatar(多哈科学与技术大学,卡塔尔)

AI总结 提出利用多个小规模领域特定数据集训练上下文学习器,通过元学习提升跨领域泛化能力,并在持续学习和无监督场景下验证其鲁棒性。

详情
AI中文摘要

上下文学习使变换器模型能够仅基于输入提示泛化到新任务,无需任何权重更新。然而,现有的训练范式通常依赖于大型非结构化数据集,这些数据集存储成本高,难以评估质量和平衡性,并且由于包含敏感信息而引发隐私和伦理问题。受这些局限性和风险的启发,我们提出了一种替代训练策略,利用多个小规模、领域特定的数据集集合。我们经验性地证明,此类数据质量的提高和多样性的增加提升了上下文学习器在其训练领域之外的泛化能力,同时与在单个大规模数据集上训练的模型相比,性能相当。我们通过利用元学习在Meta-Album集合上训练上下文学习器来研究这一范式,在多种设置下进行实验。首先,我们在受控环境中展示性能,其中测试领域完全排除在训练知识之外。其次,我们探索这些模型在信息可访问时间有限的持续场景中对遗忘的鲁棒性。最后,我们探索更具挑战性的无监督场景。我们的发现表明,当在精心策划的数据集集合上训练时,变换器仍然能够泛化用于上下文预测,同时在模块化和可替换性方面提供了优势。

英文摘要

In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

2509.03340 2026-06-12 cs.LG cs.AI cs.CE physics.comp-ph 版本更新

Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

等变流匹配用于对称破缺分岔问题

Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

发表机构 * Department of Mechanical Engineering, Eindhoven University of Technology(埃因霍温理工大学机械工程系) DIFFER – Dutch Institute for Fundamental Energy Research(荷兰基础能源研究所) Faculty of Civil Engineering, Department of Mechanics, Czech Technical University in Prague(布拉格捷克技术大学土木工程学院力学系) Department of Mathematics and Computer Science, Eindhoven University of Technology(埃因霍温理工大学数学与计算机科学系)

AI总结 针对非线性动力系统中对称破缺导致的多稳态共存问题,提出等变流匹配方法,结合等变架构与最优传输耦合机制,准确捕捉多模态分布和对称破缺分岔,优于非概率和变分方法。

Comments 9 pages, 7 figures including appendices. Accepted to Machine Learning and the Physical Sciences Workshop, NeurIPS 2025 (https://ml4physicalsciences.github.io/2025/). Repository with corresponding code: https://github.com/FHendriks11/bifurcationML/. Video explanation: https://www.youtube.com/watch?v=wsL3h17KtjY

详情
AI中文摘要

非线性动力系统中的分岔现象通常导致多个共存的稳定解,特别是在对称破缺的情况下。确定性机器学习模型无法捕捉这种多重性,会平均化解并无法表示低对称性结果。在这项工作中,我们正式将生成式AI(特别是流匹配)作为建模分岔结果全概率分布的原则性方法。我们的方法建立在现有技术基础上,将流匹配与等变架构和基于最优传输的耦合机制相结合。我们将等变流匹配推广到一种对称耦合策略,该策略在群作用下对齐预测和目标输出,从而在等变设置中实现准确学习。我们在从简单概念系统到物理问题(如屈曲梁和Allen-Cahn方程)的一系列系统上验证了我们的方法。结果表明,该方法准确捕捉了多模态分布和对称破缺分岔。此外,我们的结果表明,流匹配显著优于非概率和变分方法。这为高维系统中的多稳态建模提供了一种原则性且可扩展的解决方案。

英文摘要

Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models are unable to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen-Cahn equation. The results demonstrate that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. Moreover, our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods. This offers a principled and scalable solution for modeling multistability in high-dimensional systems.
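
The idea of aligning targets under group actions before building a flow-matching pair can be shown with the smallest possible group. In the sketch below the group is Z2 acting by sign flip (think of a beam that may buckle left or right), the path is the usual straight-line interpolation of conditional flow matching, and a per-pair nearest-orbit choice stands in for the optimal-transport-based coupling. All of these are simplifying assumptions rather than the authors' implementation.

```python
def align_under_group(x0, x1, group):
    """Return the element g(x1) of the orbit of x1 that is closest to x0 in squared distance."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min((g(x1) for g in group), key=lambda z: sqdist(z, x0))


def flow_matching_pair(x0, x1, t, group):
    """Build (x_t, target velocity) for linear-path flow matching after symmetry alignment."""
    target = align_under_group(x0, x1, group)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, target)]
    velocity = [b - a for a, b in zip(x0, target)]
    return x_t, velocity


# Z2 = {identity, sign flip}: both buckling directions are equally valid outcomes.
Z2 = [lambda x: list(x), lambda x: [-xi for xi in x]]

x_t, v = flow_matching_pair(x0=[1.0, 1.0], x1=[-2.0, -2.0], t=0.5, group=Z2)
```

Without the alignment, the regression target would point from [1, 1] across the origin to [-2, -2]; with it, the sample is paired with the symmetric copy [2, 2], so paths do not cross between symmetry-related modes and the learned field does not average them out.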

2509.18085 2026-06-12 cs.LG cs.AI cs.CL 版本更新

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

构建未来:通过校准草稿图实现扩散LLM推测解码

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Christopher Lott, Fatih Porikli, Mingu Lee

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出Spiffy算法,利用校准的草稿图结构实现扩散LLM的推测解码,在保持输出分布的同时加速推理,最高减少8.6倍模型推理次数并加速6.3倍令牌生成速率。

Comments Original version uploaded on Sep 22, 2025. (v2): Extended Table 2 with additional analysis and referenced it in Sec 5.2. (v3): Added note to Sec 4.2 and Appendix A.2 specifying conditions for losslessness. (v4): Updated with the version accepted to ICML 2026 workshops

详情
AI中文摘要

扩散LLM(dLLM)最近作为自回归LLM(AR-LLM)的强大替代方案出现,具有以显著更高的令牌生成速率运行的潜力。为了释放这一潜力,我们提出了Spiffy,一种推测解码算法,用于加速dLLM推理,同时可证明地保持模型的输出分布。这项工作解决了将AR-LLM的推测解码思想应用于dLLM所涉及的独特挑战。Spiffy执行自动推测以消除独立草稿模型的开销,以新颖的有向草稿图形式构建草稿状态,以利用dLLM生成的双向、块状特性。这些草稿图离线校准以最大化接受率,并在推理过程中动态剪枝以提高计算效率。我们给出了Spiffy的详细公式,并展示了其与KV缓存和基于阈值的动态掩码相结合,加速LLaDA、Dream和SDAR模型的能力,导致模型推理次数减少高达8.6倍,令牌速率加速高达6.3倍。

英文摘要

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its ability to accelerate LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking leading to up to $8.6\times$ reduction in model inferences and $6.3\times$ acceleration in token rate.

2512.15133 2026-06-12 cs.CE cs.AI 版本更新

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

HD-Prot:一种使用连续结构令牌进行联合序列-结构建模的蛋白质语言模型

Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出HD-Prot,一种混合扩散蛋白质语言模型,通过连续结构令牌将序列pLM扩展为多模态,实现联合序列-结构建模,在多种任务上取得竞争性能。

Comments This is the long version of the corresponding paper to appear at KDD 2026

详情
AI中文摘要

蛋白质本质上具有一致的序列-结构二重性。丰富的蛋白质序列数据可以很容易地表示为离散令牌,这推动了蛋白质语言模型(pLM)的丰硕发展。然而,一个关键的剩余挑战是如何有效地将连续结构知识整合到pLM中。当前的方法通常将蛋白质结构离散化以适应语言建模框架,这不可避免地导致细粒度信息的丢失,并限制了多模态pLM的性能潜力。在本文中,我们认为这些担忧是可以避免的:基于序列的pLM可以通过连续令牌(即避免向量量化的高保真蛋白质结构潜在表示)扩展以纳入结构模态。具体来说,我们提出了一种混合扩散蛋白质语言模型HD-Prot,它在离散pLM之上嵌入了一个连续值扩散头,使得能够无缝处理离散和连续令牌,用于联合序列-结构建模。它通过统一的吸收扩散过程捕获跨模态的令牌间依赖关系,并通过序列的分类预测和结构的连续扩散估计每个令牌的分布。大量结果表明,HD-Prot在无条件序列-结构共生成、基序支架、蛋白质结构预测和反向折叠任务中取得了竞争性能。此外,尽管在有限的计算资源下开发(即模态扩展微调的预算不到十分之一),我们的方法可以与最先进的多模态pLM相媲美。它突显了在统一语言模型架构中同时估计分类和连续分布的可行性,为多模态pLM提供了一个有前景的替代方向。

英文摘要

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

2512.22287 2026-06-12 cs.LG cs.AI 版本更新

Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

聚类聚合生成对抗网络 (CAG):一种基于聚类的混合模型用于电器模式生成

Zikun Guo, Adeyinka. P. Adedigba, Rammohan Mallipeddi

发表机构 * Department of Artificial Intelligence, School of Electronics Engineering, Kyungpook National University(人工智能系,电子工程学院,庆北国立大学)

AI总结 针对现有生成方法忽略间歇性与连续电器行为差异导致训练不稳定和保真度有限的问题,提出CAG框架,通过聚类模块为间歇电器分配专用生成器,连续电器使用LSTM生成器,在UVIC数据集上优于基线方法。

Comments 18 pages, 5 figures

详情
AI中文摘要

合成电器数据对于开发非侵入式负荷监测算法和实现隐私保护的能源研究至关重要,然而标记数据集的稀缺性仍然是一个重大障碍。最近基于GAN的方法已经证明了合成负荷模式的可行性,但大多数现有方法在单个模型内统一处理所有设备,忽略了间歇性和连续性电器之间的行为差异,导致训练不稳定和输出保真度有限。为了解决这些局限性,我们提出了聚类聚合生成对抗网络框架,这是一种混合生成方法,根据每个电器的行为特征将其路由到专门的分支。对于间歇性电器,聚类模块将相似的激活模式分组,并为每个聚类分配专用生成器,确保常见和罕见操作模式都获得足够的建模能力。连续性电器遵循单独的分支,采用基于LSTM的生成器来捕捉逐渐的时间演变,同时通过序列压缩保持训练稳定性。在UVIC智能插头数据集上的大量实验表明,所提出的框架在衡量真实性、多样性和训练稳定性的指标上始终优于基线方法,并且将聚类作为主动生成组件显著提高了可解释性和可扩展性。这些发现确立了所提出的框架作为非侵入式负荷监测研究中合成负荷生成的有效方法。

英文摘要

Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy-preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.

2601.03184 2026-06-12 cs.LG cs.AI 版本更新

Decentralized Autoregressive Generation

分散自回归生成

Stepan Maschan, Haoxuan Qu, Jun Liu

发表机构 * Lancaster University(兰卡斯特大学)

AI总结 本文通过离散流匹配框架证明分散训练与集中训练在理论上等价,实验验证其在多模态基准上保持竞争力。

详情
AI中文摘要

近年来,自回归生成的分散化作为解决扩展瓶颈的方案引起了广泛关注。然而,尽管有令人鼓舞的实验结果,这一范式目前缺乏严格的理论证明。在这项工作中,我们正式建立了分散训练与集中训练之间的理论等价性。为此,我们调整了离散流匹配框架用于自回归生成,利用其固有性质证明全局模型自然分解为独立专家。最后,我们在多种多模态基准上进行了大量实验,实验验证了分散训练在标准集中架构上保持竞争性。

英文摘要

The decentralization of autoregressive generation has attracted considerable attention in recent years as a solution to scaling bottlenecks. However, despite promising empirical results, this paradigm currently lacks rigorous theoretical justification. In this work, we formally establish the theoretical equivalence between decentralized and centralized training. To achieve this, we adapt the Discrete Flow Matching framework for autoregressive generation, leveraging its inherent properties to demonstrate that global models naturally decompose into independent experts. Finally, we conduct extensive experiments across diverse multimodal benchmarks, empirically validating that decentralized training maintains competitive parity with standard centralized architectures.

2601.06572 2026-06-12 cs.LG cs.AI 版本更新

Hellinger Multimodal Variational Autoencoders

Hellinger多模态变分自编码器

Huyen Vo, Isabel Valera

发表机构 * Department of Computer Science, Saarland University(萨尔兰大学计算机科学系) MPI-SWS, Saarland Informatics Campus(马克斯·普朗克软件系统研究所,萨尔兰信息学园区)

AI总结 提出基于Hellinger距离的矩匹配近似方法HELVAE,避免子采样,在多模态变分自编码器中实现更优的生成一致性与质量权衡。

Comments Accepted at AISTATS 2026. Camera-ready version

详情
AI中文摘要

多模态变分自编码器(VAEs)广泛用于弱监督生成学习,涉及多种模态。主流方法通过专家乘积(PoE)、专家混合(MoE)或其组合来聚合单模态推理分布,以近似联合后验。本文从概率意见池化的优化视角重新审视多模态推理。我们从$\alpha=0.5$的Hölder池化出发,这是$\alpha\text{-散度}$族中唯一的对称成员,并推导出一种矩匹配近似,称为Hellinger。我们利用这种近似提出HELVAE,一种避免子采样的多模态VAE,从而得到一个高效且有效的模型,该模型:(i)随着观察到的模态增加,学习更具表达力的潜在表示;(ii)在生成一致性和质量之间实现更好的权衡,优于最先进的多模态VAE模型。

英文摘要

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
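The pooling step in the abstract above can be illustrated numerically. This is a minimal sketch, not the paper's implementation: it assumes Hölder pooling with exponent 0.5 means the normalized power mean, so the pooled density is proportional to (Σ w_i √p_i)², which is also the minimizer of a weighted sum of squared Hellinger distances. It then fits a Gaussian by matching mean and variance on a 1-D grid. The function names, the grid bounds, and the uniform default weights are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def hellinger_pool_moments(experts, weights=None, lo=-10.0, hi=10.0, n=4001):
    """Moment-match a Gaussian to the pool (sum_i w_i * sqrt(p_i))^2.

    experts: list of (mean, std) pairs, one per unimodal posterior.
    Returns (mean, std) of the matched Gaussian. Grid-based, so only a sketch.
    """
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)  # assumption: uniform weights
    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    # Unnormalized pooled density on the grid.
    dens = [
        sum(w * math.sqrt(gaussian_pdf(x, m, s)) for w, (m, s) in zip(weights, experts)) ** 2
        for x in xs
    ]
    mass = sum(dens) * dx
    mean = sum(x * d for x, d in zip(xs, dens)) * dx / mass
    var = sum((x - mean) ** 2 * d for x, d in zip(xs, dens)) * dx / mass
    return mean, math.sqrt(var)

# Two disagreeing unimodal posteriors, N(-1, 1) and N(1, 1).
pooled_mean, pooled_std = hellinger_pool_moments([(-1.0, 1.0), (1.0, 1.0)])
```

With a single expert the pool reduces to that expert. With two disagreeing experts the matched Gaussian is wider than either one, whereas a product of experts would narrow.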

2601.22594 2026-06-12 cs.CL cs.AI 版本更新

Language Model Circuits Are Sparse in the Neuron Basis

语言模型电路在神经元基上是稀疏的

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

发表机构 * Stanford University(斯坦福大学)

AI总结 本文实证发现MLP神经元与稀疏自编码器一样是稀疏特征基,并基于此开发了端到端梯度归因流水线,在多项任务中揭示了因果有效的神经元电路。

Comments ICML Spotlight, camera-ready

详情
AI中文摘要

神经网络用于计算的高层概念不一定与单个神经元对齐(Smolensky, 1986)。因此,语言模型可解释性研究转向了将神经元基分解为更可解释的模型计算单元的技术,例如稀疏自编码器(SAEs)。然而,并非所有基于神经元的表示都不可解释。我们首次实证表明,MLP神经元与SAEs一样是稀疏的特征基。利用这一发现,我们开发了一个端到端的基于梯度的归因流水线,用于在MLP神经元基上进行电路追踪,从而在多种任务中揭示因果有效的神经元。在标准的主谓一致基准测试(Marks et al., 2025)上,约$10^2$个MLP神经元的电路足以控制模型行为。在(Lindsey et al., 2025)的多跳城市-州-首都任务中,我们发现了一个电路,其中小部分神经元编码特定的潜在推理步骤(例如将城市映射到其所在州),并且可以通过引导来改变模型的输出。因此,这项工作在不增加额外训练成本的情况下推进了语言模型的自动化可解释性。

英文摘要

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.

2603.02234 2026-06-12 cs.LG cs.AI 版本更新

Structured vs. Unstructured Pruning: An Exponential Gap

结构化剪枝与非结构化剪枝:指数级差距

Davide Ferre', Frédéric Giroire, Frederik Mallmann-Trenn, Emanuele Natale

发表机构 * Department of Informatics, King’s College London(伦敦国王学院信息学院)

AI总结 研究随机初始化网络中剪枝的局限性,证明神经元剪枝需要指数级更大的网络规模才能达到与非结构化剪枝相同的近似精度。

详情
AI中文摘要

强彩票假说(SLTH)指出,大型随机初始化神经网络包含稀疏子网络,无需训练即可在初始化时逼近目标函数,这表明仅剪枝就足够了。剪枝方法通常分为非结构化(可移除单个权重)和结构化(根据特定模式移除参数,如神经元剪枝)。现有支持SLTH的理论结果几乎完全依赖于非结构化剪枝,表明对数级的过参数化足以逼近简单的目标网络。相比之下,神经元剪枝尽管因其直接加速硬件的实用性而备受关注,但理论关注有限。本文考虑通过剪枝随机初始化两层ReLU网络的隐藏单元来逼近单个无偏置ReLU神经元的问题,从而隔离神经元剪枝的内在局限性。我们证明,实现ε-逼近需要神经元剪枝的起始网络规模为Ω(1/ε),而权重剪枝仅需O(log(1/ε))个隐藏单元,揭示了两种方法之间的指数级差距。

英文摘要

The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention, despite its practical appeal for direct hardware speedups. In this work, we consider the problem of approximating a single bias-free ReLU neuron by pruning hidden units of a randomly initialized two-layer ReLU network, effectively isolating the intrinsic limitations of neuron pruning. We show that achieving an $\varepsilon$-approximation requires a starting network size of $\Omega(1/\varepsilon)$ for neuron pruning, whereas weight pruning succeeds with only $O(\log(1/\varepsilon))$ hidden units, revealing an exponential separation between the two approaches.
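To see why the two bounds amount to an exponential separation, one can eliminate $\varepsilon$ between them. The notation here is introduced only for illustration: $n_{\mathrm{w}}$ is a network size that suffices for weight pruning, $n_{\mathrm{n}}$ is any size that suffices for neuron pruning, and $c, C > 0$ are unspecified constants hidden in the asymptotic bounds.

```latex
n_{\mathrm{w}} \le C \log(1/\varepsilon)
\;\Longrightarrow\;
1/\varepsilon \ge e^{\,n_{\mathrm{w}}/C},
\qquad
n_{\mathrm{n}} \ge \frac{c}{\varepsilon} \ge c\, e^{\,n_{\mathrm{w}}/C}.
```

So, at a given accuracy, the size neuron pruning needs is at least exponential in the size that already suffices for weight pruning.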

2604.13924 2026-06-12 cs.LG cs.AI cs.CV 版本更新

ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection

ASTER: 用于无监督时间序列异常检测的潜在伪异常生成

Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada

发表机构 * Université de Montréal(蒙特利尔大学)

AI总结 提出ASTER框架,在潜在空间生成伪异常训练Transformer分类器,结合预训练LLM增强表示,在三个基准数据集上达到最优性能。

Comments Published in ICPR 2026

详情
AI中文摘要

时间序列异常检测(TSAD)在工业监控、医疗保健和网络安全等领域至关重要,但由于罕见且异质的异常以及标记数据的稀缺性,它仍然具有挑战性。这种稀缺性使得无监督方法占主导地位,但现有方法通常依赖于重建或预测(难以处理复杂数据),或依赖于需要领域特定异常合成和固定距离度量的基于嵌入的方法。我们提出ASTER,一个直接在潜在空间中生成伪异常的框架,避免了手工制作的异常注入和对领域专业知识的需求。潜在空间解码器生成定制的伪异常,用于训练基于Transformer的异常分类器,而预训练的LLM丰富了该空间的时间和上下文表示。在三个基准数据集上的实验表明,ASTER达到了最先进的性能,并为基于LLM的TSAD设立了新标准。

英文摘要

Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.

2604.27277 2026-06-12 cs.LG cs.AI cs.CV 版本更新

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

BrainDINO:一种用于通用临床表征学习的脑MRI基础模型

Yizhou Wu, Shansong Wang, Yuheng Li, Mojtaba Safari, Mingzhe Hu, Chih-Wei Chang, Harini Veeraraghavan, Xiaofeng Yang

发表机构 * Department of Radiation Oncology and Winship Cancer Institute, Emory University(放射肿瘤科和Winship癌症研究所,埃默里大学) Department of Radiation and Cellular Oncology, The University of Chicago(放射肿瘤学与细胞肿瘤学部,芝加哥大学) Department of Electrical and Computer Engineering, Georgia Institute of Technology(电气与计算机工程系,佐治亚理工学院) Department of Biomedical Engineering, Georgia Institute of Technology(生物医学工程系,佐治亚理工学院) Department of Biomedical Informatics, Emory University(生物医学信息学系,埃默里大学) Department of Medical Physics, Memorial Sloan Kettering Cancer Center(医学物理系,纪念斯隆凯特琳癌症中心)

AI总结 提出BrainDINO,一种基于自蒸馏的基础模型,在约660万张未标记轴向切片上训练,通过冻结编码器加轻量任务头,在多种脑MRI任务上达到或超越基线,尤其在小样本场景下优势显著。

Comments 25 pages, 5 figures

详情
AI中文摘要

脑MRI支撑着广泛的神经科学和临床应用,然而大多数基于学习的方法仍针对特定任务且需要大量标注数据。本文表明,单一的自监督表征可以泛化到异质的脑MRI终点。我们训练了BrainDINO,一个自蒸馏的基础模型,使用了来自20个数据集的约660万张未标记轴向切片,这些数据集涵盖了人群、疾病和采集设置的广泛变异。通过使用冻结编码器加轻量任务头,BrainDINO支持肿瘤分割、神经退行性和神经发育性疾病分类、脑年龄估计、卒中后时间预测、分子状态预测、MRI序列分类和生存建模等任务的迁移。在各种任务和监督机制下,BrainDINO始终等于或超过自然图像和MRI特定自监督基线,在标签稀缺时尤其具有优势。表征分析进一步显示,在缺乏任务特定监督的情况下,特征结构具有解剖学组织和病理敏感性。我们的发现表明,大规模切片级自监督学习可以产生统一的脑MRI表征,支持多样化的神经影像任务,无需体积预训练或全网络微调,为稳健且数据高效的脑影像分析建立了可扩展的基础。代码可在 https://github.com/mclwu22/BrainDINO 获取。

英文摘要

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO

2605.00600 2026-06-12 cs.LG cs.AI cs.CV 版本更新

Possibilistic Predictive Uncertainty for Deep Learning

深度学习的可能性预测不确定性

Yao Ni, Jeremie Houssineau, Yew-Soon Ong, Piotr Koniusz

发表机构 * University of Cambridge(剑桥大学) National University of Singapore(新加坡国立大学) University of Warsaw(华沙大学)

AI总结 提出基于可能性理论的Dirichlet近似可能性后验预测(DAPPr)框架,通过投影-近似策略实现高效且原则性的认知不确定性量化,在多个基准上达到竞争性能。

Comments Accepted by ICML 2026, 20 pages

详情
AI中文摘要

深度神经网络在多种应用中取得了令人印象深刻的结果,然而它们对未见输入的过度自信需要可靠的认知不确定性建模。现有的不确定性建模方法面临一个基本困境:贝叶斯方法提供原则性的估计,但计算成本高昂,而高效的二阶预测器在其特定目标与认知不确定性量化之间缺乏严格联系。为解决这一困境,我们引入了Dirichlet近似可能性后验预测(DAPPr),一个基于可能性理论的原则性框架。我们定义了参数上的可能性后验,通过上确界算子将其投影到预测空间,并使用可学习的Dirichlet可能性函数近似投影后的后验。这种投影-近似策略产生了一个具有闭式解的简单训练目标。尽管简单,跨多个不同基准的大量实验表明,DAPPr在保持原则性推导和计算效率的同时,实现了与最先进的二阶预测器相当或更优的不确定性量化性能。代码可在 https://github.com/MaxwellYaoNi/DAPPr 获取。

英文摘要

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.

2605.16430 2026-06-12 cs.LG cs.AI 版本更新

A Theory of Training Profit-Optimal LLMs

训练利润最优大语言模型的理论

Sophie Hao, William Merrill

发表机构 * Boston University(波士顿大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 本文提出一个经济模型,结合扩展定律与微观经济学理论,分析大语言模型训练的利润最大化问题,探讨模型规模与训练成本的关系及对利润的影响。

Comments Minor edits for preprint

详情
AI中文摘要

扩展大语言模型(LLM)需要巨大的计算资源,近年来人工智能的进步与大量资本支出相伴而生。尽管扩大LLM规模确实能提高模型质量(以损失或下游评估量化),但其质量提升如何转化为潜在收入,以及收入是否能抵消更大规模训练和推理的成本仍不清楚。本文发展了一个经济模型,结合扩展定律与微观经济学理论,以描述LLM训练公司的理性行为。在我们的模型中,增加参数和训练令牌可提高LLM质量,从而吸引更多消费者,每个消费者都有一个质量阈值。另一方面,额外的参数和训练令牌都会带来额外成本。我们分析了该模型在计算受限和数据受限环境下的利润最大化问题。在计算受限环境下,最优模型规模和令牌预算与硬件效率$E$(FLOPs/\$)近似线性增长;总训练成本则随$E$以亚二次方式增长。数据效率的提升激励更大规模的模型和训练支出。当数据受限于$D$时,利润最优的训练支出按$D^2/E$缩放,即随数据增加而增加,随硬件效率(以及数据效率)提高而减少。最后,我们分析了训练支出的实际趋势:当前趋势与计算受限环境下的最宽松模型变体一致,但在数据受限环境或假设硬件进步停滞时并非利润最优。总体而言,我们的结果提供了利润最优LLM训练的理论,为批判性地看待行业声明和支持长期经济决策提供了基础。

英文摘要

Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
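The data-bound scaling stated in the abstract can be read off with a small worked example. The symbol $S^{*}(D, E)$ for profit-optimal training expenditure is notation introduced here, not taken from the paper, and the proportionality constant is left unspecified.

```latex
S^{*}(D, E) \propto \frac{D^{2}}{E}
\quad\Longrightarrow\quad
\frac{S^{*}(2D, E)}{S^{*}(D, E)} = 4,
\qquad
\frac{S^{*}(D, 2E)}{S^{*}(D, E)} = \frac{1}{2}.
```

Under this scaling, doubling the available data quadruples the profit-optimal spend, while doubling hardware efficiency halves it.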

2606.02133 2026-06-12 cs.LG cs.AI 版本更新

Variational Learning for Insertion-based Generation

基于插入生成的变分学习

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出插入过程(IP)模型,通过排列变分推断联合学习插入位置、内容和终止条件,支持变长生成并提升非自回归序列建模质量。

详情
AI中文摘要

非单调序列生成方法,如掩码扩散模型,通过允许以非固定和预设的顺序生成token,为从左到右的自回归建模提供了一种灵活的替代方案。尽管具有实际优势,但大多数现有的非单调模型是顺序无关的,并依赖于固定长度的网格,限制了它们支持变长生成和自适应插入顺序的能力。在这项工作中,我们引入了一个概率框架,用于在变长插入模型中学习插入顺序。我们形式化了插入轨迹与排列之间的双射对应关系,这使得数据似然能够精确重参数化为排列上的和。基于这一结果,我们提出了插入过程(IP),这是一种随机生成模型,它联合学习在哪里插入、插入什么以及何时终止,并通过基于排列的变分推断进行训练。与先前的固定画布方法不同,IP原生支持变长生成,并学习数据驱动的插入顺序偏好。在目标条件规划和分子字符串生成上的实验表明,在缺乏规范从左到右结构的领域中,学习插入顺序提高了建模质量和泛化能力。

英文摘要

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
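The correspondence between insertion trajectories and permutations mentioned in the abstract above can be made concrete with a small sketch. It is an illustration under assumptions, not the paper's construction: a permutation gives the order in which final positions are generated, and each step's insertion slot is the number of already-generated positions to its left. All function names are hypothetical.

```python
from bisect import bisect_left, insort

def trajectory_from_permutation(tokens, order):
    """order[t] is the final-sequence index generated at step t.

    Returns a list of (slot, token) pairs, where slot is the insertion
    position within the partial sequence that exists at that step.
    """
    placed = []  # sorted final indices generated so far
    steps = []
    for idx in order:
        slot = bisect_left(placed, idx)  # count of placed indices left of idx
        steps.append((slot, tokens[idx]))
        insort(placed, idx)
    return steps

def replay(steps):
    """Apply the insertions in order and return the resulting sequence."""
    seq = []
    for slot, tok in steps:
        seq.insert(slot, tok)
    return seq

def permutation_from_trajectory(steps):
    """Inverse map: recover the generation order from the slots alone."""
    step_ids = []  # step_ids[pos] = step at which final position pos appeared
    for t, (slot, _) in enumerate(steps):
        step_ids.insert(slot, t)
    order = [0] * len(step_ids)
    for pos, t in enumerate(step_ids):
        order[t] = pos
    return order

tokens = list("CCO")   # toy molecular string
order = [2, 0, 1]      # generate 'O' first, then the two 'C's
steps = trajectory_from_permutation(tokens, order)
```

Replaying `steps` rebuilds the original string, and `permutation_from_trajectory` recovers `order`. So each permutation corresponds to exactly one slot sequence, which is what allows a likelihood to be written as a sum over permutations.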

2606.11836 2026-06-12 cs.SD cs.AI eess.AS 版本更新

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

面向语音基础模型的无数据无训练压缩:基于参数聚类的方法

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 提出一种基于k-means通道聚类的无数据无训练压缩方法,通过层间不同参数簇数实现细粒度混合稀疏剪枝,在HuBERT-large和Whisper-large-v3上显著降低WER。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

本文提出了一种新颖的无数据无训练压缩方法,用于语音基础模型,该方法通过k-means进行通道级聚类。还探索了更细粒度的混合稀疏剪枝,通过层间不同数量的参数簇实现。在LibriSpeech数据集上进行的实验表明,当对HuBERT-large进行50%的剪枝稀疏度操作时,在微调前,在test-clean和test-other子集上,相对于基于幅度的剪枝,获得了27.73%/18.61%绝对(34.37%/21.91%相对)的一致WER降低;在仅3个epoch的微调后,获得了0.19%/0.79%绝对(3.36%/4.62%相对)的降低。在Whisper-large-v3上,在10%稀疏度下,相对于基于幅度的剪枝,观察到2.86%/5.02%绝对(59.21%/55.29%相对)的类似WER降低,所有这些相对于未压缩基线均没有显著的WER增加。

英文摘要

This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More fine-grained, mixed sparsity pruning by layer-level varying number of parameter clusters is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
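The general idea of channelwise k-means clustering on a weight matrix can be sketched as follows. This is a toy illustration only: the abstract does not say how clusters map to the reported sparsity levels, so the sketch simply replaces each output channel (row) with its cluster centroid. It uses nothing but the weights themselves, which is what makes such a scheme data-free and training-free. The function names and the deterministic initialization are assumptions.

```python
def kmeans_channels(weight, k, iters=20):
    """Cluster the rows (output channels) of `weight` with plain k-means.

    weight: list of equal-length lists of floats.
    Returns (centroids, assignment).
    """
    # Assumption: deterministic init from the first k channels.
    centroids = [list(row) for row in weight[:k]]
    assign = [0] * len(weight)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, row in enumerate(weight):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [weight[i] for i in range(len(weight)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def compress(weight, k):
    """Replace every channel by its centroid; only k distinct rows remain."""
    centroids, assign = kmeans_channels(weight, k)
    return [list(centroids[c]) for c in assign]

toy_weight = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
centroids, assign = kmeans_channels(toy_weight, k=2)
compressed = compress(toy_weight, k=2)
```

Varying `k` per layer would be one way to realize the layer-level mixed setting the abstract describes, though the paper's actual allocation rule is not given here.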

6. 自然语言与多模态智能 36 篇

2606.12942 2026-06-12 cs.AI 新提交

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

PRISMR: 通过参数化表示内化克服多模态列表排序中的解析崩溃

Hao Jiang, Xin Li, Annan Wang, Zhi Yang, Haoxiang Zhang, Yichi Zhang, Weisi Lin

发表机构 * Nanyang Technological University(南洋理工大学) Peking University(北京大学) Independent Researcher(独立研究员)

AI总结 针对多模态长上下文场景中生成式列表排序的解析崩溃问题,提出PRISMR框架,用参数化结构条件替代临时上下文列表处理,通过轻量级超网络并行编码候选并生成LoRA权重,显著减少解析崩溃并提升排序性能。

详情
AI中文摘要

基于大型多模态模型(LMM)的生成式列表排序旨在单次前向传播中捕获全局列表上下文,但其效果在长上下文多模态场景中会退化。我们识别出一种重复出现的失败模式——解析崩溃,即自回归解码器生成流畅但不完整的排序,通过静默省略候选并提前终止。这种失败源于有限的上下文利用,而非简单的格式错误,使得提示工程和约束解码不足以解决。我们提出PRISMR(参数化表示内化用于语义多模态排序)框架,用参数化结构条件替代临时的上下文内列表处理。PRISMR使用轻量级超网络并行编码多模态候选并生成项目特定的LoRA权重,这些权重被合成为LMM的实例特定适配器。这种范式在保留基础模型的同时,实现了更鲁棒的列表结构内化。我们进一步引入了一个大规模多模态评论排序基准用于评估。实验表明,PRISMR显著减少了解析崩溃,提高了列表排序性能,并有效跨领域和指令微调骨干网络迁移。

英文摘要

Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.

2606.13316 2026-06-12 cs.AI 新提交

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

ReSum: 通过强化学习协同LLM推理与摘要生成

Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu

发表机构 * University of Science and Technology of China(中国科学技术大学) AMAP, Alibaba Group(阿里巴巴集团高德地图)

AI总结 提出ReSum框架,利用自摘要机制让LLM压缩和组织推理轨迹,通过对比评估自适应触发摘要,在提升性能4%的同时减少18.6%的推理长度。

Comments 24 pages, including 13 pages of main text and 11 pages of appendix

详情
AI中文摘要

可验证奖励强化学习(RLVR)是提升大语言模型(LLM)长程推理的核心技术。然而,现有RLVR方法常鼓励不必要的长推理轨迹,这会降低推理连贯性并耗尽可用上下文预算。现有的长上下文组织方法通常依赖外部机制来组织轨迹,而非让模型自主管理推理过程。为解决此局限,我们提出ReSum,一种新颖的RLVR框架,使LLM能够通过自摘要压缩和组织其推理轨迹。我们的初步研究表明,自摘要通过降低token级熵来稳定生成,并且引入“摘要”短语可显著减少从错误轨迹前缀传播的误差。受此启发,ReSum采用一种摘要感知的自适应轨迹机制,通过对比评估自摘要是否有利于当前推理过程。具体而言,当模型自发触发自摘要时,ReSum屏蔽摘要短语以创建对比分支;对于非摘要位置,则随机注入该短语以创建匹配分支。我们进一步设计了摘要感知优势函数,以实现对比轨迹之间更细粒度的比较。大量实验表明,ReSum在平均提升4%性能的同时,将推理长度减少18.6%。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

2606.13550 2026-06-12 cs.AI cs.CL 新提交

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

不确定性感知的混合检索用于长文档RAG

Hoin Jung, Xiaoqian Wang

发表机构 * Elmore Family School of Electrical and Computer Engineering, Purdue University(普渡大学埃尔莫尔家族电气与计算机工程学院)

AI总结 提出UMG-RAG,一种无需训练的混合检索框架,通过多粒度分块和不确定性估计融合密集与稀疏检索结果,提升长文档问答质量。

详情
AI中文摘要

检索增强生成(RAG)关键依赖于检索证据的质量和粒度。大的检索单元保留上下文但常引入无关内容,可能稀释答案承载证据并恶化长上下文利用。细粒度单元更紧凑,但可能难以可靠检索,因为短块可能缺乏匹配查询所需的语义、词汇或桥接线索。我们提出不确定性感知的多粒度RAG(UMG-RAG),一种无需训练的混合检索框架,将分块粒度视为查询特定的可靠性估计。UMG-RAG不训练新检索器或修改生成器,而是利用现有密集和稀疏检索器作为跨多个分块粒度的互补专家。对于每个查询,它将每个专家-粒度得分列表转换为证据分布,从分布熵估计可靠性,并根据查询特定的语义、词汇和粒度置信度融合候选。我们进一步引入UMGP-RAG,一种父级提升变体,利用细粒度命中定位相关证据,同时返回更广泛的非冗余父块以保持局部连贯性。在问答基准上的实验表明,不确定性感知融合和父级提升在保持轻量级、即插即用检索管道的同时,提高了生成质量。

英文摘要

Retrieval-augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer-bearing evidence and worsen long-context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
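The per-query fusion described in the abstract above (score list to distribution, entropy to reliability, weighted fusion) might look roughly like the sketch below. The abstract does not give the formulas, so the softmax conversion, the normalized-entropy reliability, and the additive fusion are assumptions chosen for illustration, and all names are hypothetical.

```python
import math

def to_distribution(scores, temperature=1.0):
    """Softmax over one expert-granularity score list ({doc_id: score})."""
    top = max(scores.values())
    exps = {doc: math.exp((s - top) / temperature) for doc, s in scores.items()}
    total = sum(exps.values())
    return {doc: e / total for doc, e in exps.items()}

def reliability(dist):
    """1 minus normalized entropy: near 1 for a peaked list, near 0 for a flat one."""
    n = len(dist)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0.0)
    return 1.0 - entropy / math.log(n)

def fuse(score_lists):
    """Sum reliability-weighted probabilities across lists; return ranked doc ids."""
    fused = {}
    for scores in score_lists:
        dist = to_distribution(scores)
        weight = reliability(dist)
        for doc, p in dist.items():
            fused[doc] = fused.get(doc, 0.0) + weight * p
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical query: the dense/fine-grained list is confident, the sparse/coarse list is not.
dense_fine = {"a": 5.0, "b": 0.0, "c": 0.0}
sparse_coarse = {"a": 1.0, "b": 1.1, "c": 1.0}
ranking = fuse([dense_fine, sparse_coarse])
```

In this toy case the near-uniform sparse list gets almost no weight, so the confident dense list decides the top result. No retriever is trained or modified, in keeping with the training-free framing.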

2606.12590 2026-06-12 cs.CV cs.AI 交叉投稿

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

分析与改进医学LVLMs中的细粒度偏好优化

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

发表机构 * York University(约克大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) Queen’s University(女王大学)

AI总结 针对医学大视觉语言模型在事实一致性、视觉定位和临床对齐方面的不足,提出一种结合双向令牌级KL正则化和视觉对比定位目标的细粒度在线偏好优化框架,通过最小编辑模型输出构建偏好对,仅修正临床错误片段,显著提升诊断准确性。

详情
AI中文摘要

大型视觉语言模型(LVLMs)在医学影像任务中取得了强劲性能,但仍容易出现事实不一致、视觉定位差以及与临床有意义反馈对齐不足的问题。现有的后训练对齐方法,包括直接偏好优化(DPO)及其变体,在医学领域面临三个关键限制:(1)序列级奖励信号将临床关键令牌与通用填充文本等同对待;(2)依赖静态监督微调参考作为偏好响应引入了离策略分布偏移,将优化导向风格伪影而非临床正确性;(3)对齐目标缺乏明确的视觉定位约束,使模型对微妙但诊断决定性的病理特征不敏感。我们的方法利用双向令牌级KL正则化以及视觉对比定位目标,该目标将干净图像与病变破坏图像配对,以惩罚缺乏足够视觉证据生成的响应。这些组件共同构成了一个细粒度的在线对齐框架,通过最小编辑模型生成的输出来构建偏好对,仅修正临床错误片段,同时保留原始语言风格。在医学影像任务和临床文本生成基准上的大量实验验证了我们方法的有效性。

英文摘要

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

2606.12754 2026-06-12 cs.CL cs.AI 交叉投稿

LLMs Can Better Capture Human Judgments--With the Right Prompts

LLMs 能更好地捕捉人类判断——使用合适的提示

Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

AI总结 通过简单提示策略,LLMs 能恢复人类反应的完整分布,并减少对措辞变化的敏感性,提升 AI-人类对齐。

详情
AI中文摘要

大型语言模型(LLMs)在捕捉人类判断方面是否表现不佳?两个常被提及的限制是:LLMs 无法捕捉反应的全分布,以及它们的判断在措辞变化上不稳定。我们展示了缓解这些限制的简单提示策略。在两个数据集上——一个具有美国代表性的、包含 144 个道德情景的数据集,以及国际社会调查项目“家庭与性别角色变化”模块涵盖 32 个国家的 38 个道德信念——我们展示了简单的引导(elicitation)技术如何帮助改善 AI-人类对齐。首先,提示模型报告标准差和反应比例,比常见策略更好地恢复了人类反应的完整范围。其次,确保情景对人类参与者清晰——如人类困惑评分所反映——提升了模型对齐度,且 LLMs 可以跟踪人类困惑评分。同时,我们发现 LLMs 对自身误差的估计校准不佳,尽管它们能相对较好地预测人类变异性。这些结果表明,向 LLMs 提出更好的问题可以得到更好的答案。

英文摘要

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

2606.12818 2026-06-12 cs.CL cs.AI 交叉投稿

Localizing Anchoring Pathways in Language Models

定位语言模型中的锚定路径

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 研究提示中无关数字如何影响语言模型数值推理的锚定效应,通过logit差值度量和电路归因定位,发现边级方法优于节点级方法,并揭示锚定路径的共享与迁移特性。

详情
AI中文摘要

提示中的无关数字可以改变语言模型的判断,在数值推理中产生锚定效应。我们使用共享答案选项的受控多项选择设置,研究这种锚定敏感信号在语言模型内部的携带位置。我们定义了一个logit差值度量,比较正确答案选项与对应锚点的答案选项,并验证其追踪行为锚定。通过对7B-8B Qwen和Llama基础及指令微调模型进行基于归因的电路定位,我们发现边级方法比节点级方法更忠实地恢复该信号。低锚和高锚电路在模型内部强迁移,表明跨锚定方向存在共享路径结构。然而,基础模型和指令微调变体之间的稀疏迁移可靠性较低,表明后训练改变了哪些路径最重要。总体而言,我们的结果为锚定相关决策信号如何在语言模型内部携带提供了机制性解释。

英文摘要

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
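The logit-difference metric mentioned in this abstract can be sketched as follows. This is a hedged illustration: the sign convention (correct minus anchor), the dictionary layout, and the function names are assumptions, not details confirmed by the abstract.

```python
def logit_difference(option_logits, correct_option, anchor_option):
    """Logit of the correct option minus logit of the anchor-matching option.

    Positive values mean the model favors the correct answer; negative values
    mean it is pulled toward the anchor (assumed sign convention).
    """
    return option_logits[correct_option] - option_logits[anchor_option]

def mean_logit_difference(examples):
    """Average the metric over a set of multiple-choice examples."""
    diffs = [
        logit_difference(e["logits"], e["correct"], e["anchor"])
        for e in examples
    ]
    return sum(diffs) / len(diffs)
```

In circuit-localization work, a scalar of this kind is typically the quantity whose change is attributed to individual nodes or edges.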

2606.12826 2026-06-12 cs.CV cs.AI 交叉投稿

DIMOS: Disentangling Instance-level Moving Object Segmentation

DIMOS: 解耦实例级运动目标分割

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出双解耦特征提取框架分离图像与事件模态的外观和运动信息,并通过多粒度跨模态对齐实现有效融合,在运动实例分割任务中尤其对快速运动和低光下的小目标取得最优性能。

详情
AI中文摘要

运动实例分割(MIS)因其在交通监控、自动驾驶和动物追踪等领域的广泛应用而日益受到关注。事件相机记录异步亮度变化,提供高时间分辨率和动态范围,使其对运动信息高度敏感。通过融合事件和图像特征,事件中的运动线索可以补充图像中的空间细节,从而提升MIS的性能。然而,当前的多模态MIS方法仍然难以分割小的运动实例,因为事件相机在有限分辨率下往往产生稀疏特征。此外,事件特征将外观属性与运动线索纠缠在一起,进一步限制了有效的跨模态融合。为解决这些挑战,我们首先提出一个双解耦特征提取框架,在图像和事件模态内分离并提取外观和运动信息,从而改善特征密度。随后,引入多粒度跨模态对齐,以对齐跨模态分布和语义一致的特征,实现具有丰富空间和时间细节的更有效融合。实验结果表明,我们的方法在多模态MIS中达到了最先进的性能,特别是在快速运动和低光等挑战性条件下的小实例分割方面。

英文摘要

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

2606.12886 2026-06-12 cs.CV cs.AI 交叉投稿

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

交错思维中的模态隔离桥接:通过逐步强化监督模态转换

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Zhejiang University(浙江大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出MoTiF框架,通过反射式SFT和Flow-GRPO优化模态转换保真度,解决交错思维中图像与文本脱节的模态隔离问题,提升跨模态一致性和任务准确性。

Comments 22 pages, 5 figures, 6 tables

详情
AI中文摘要

交错思维是一种统一的多模态模型交替进行文本推理和视觉生成的方法,在空间和物理任务上显示出潜力。然而,在复杂的长链场景中,我们识别出一个基本故障模式:生成的图像偏离文本上下文,而后续文本忽略视觉证据,导致两种模态交替出现却并未真正相互提供信息。我们将其称为模态隔离,并归因于模态边界处的信息损失累积。我们将每个推理循环分解为原子操作,并定义模态转换损失,量化每个边界处的跨模态幻觉(文本到图像)和视觉利用不足(图像到文本)。我们提出MoTiF(模态转换保真度),一个两阶段训练框架,直接优化这些转换:反射式SFT训练模型检测和恢复错误的视觉输出;Flow-GRPO通过强化学习提高图像生成保真度。MoTiF中的所有训练信号来自转换级保真度而非最终任务准确性。在四个视觉谜题基准测试中,这种转换级监督显著提高了跨模态一致性和最终任务准确性。结果表明,有效的交错推理需要在模态边界处进行明确的结构监督,而不仅仅是扩展或最终任务优化。

英文摘要

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

2606.13035 2026-06-12 cs.CV cs.AI 交叉投稿

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

TetherCache: 基于门控召回与可信对齐的自回归长视频生成稳定性方法

Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

发表机构 * Tsinghua University(清华大学) D-INFK, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 提出TetherCache,一种无需训练、即插即用的缓存管理策略,通过门控召回(GRAB)和可信对齐编辑(TAME)缓解自回归视频扩散模型中的上下文漂移,实现稳定长视频生成。

Comments 17 pages, 8 figures

详情
AI中文摘要

自回归视频扩散模型通过将新生成帧的条件建立在先前生成内容上,为流式变长视频生成提供了自然框架。然而,将这些模型扩展到分钟级生成仍具挑战:有限的KV缓存预算使模型无法保留完整历史,而反复以自生成帧为条件会导致上下文分布偏移随时间累积,引发视觉伪影、质量下降和时间漂移。本文提出TetherCache,一种无需训练、即插即用的缓存管理策略,用于抗漂移长视频生成。TetherCache将缓存组织为sink、memory和recent区域,并引入两种互补机制。首先,GRAB(基于注意力多样性平衡的门控召回)使用结合注意力相关性与时间多样性的门控分数选择长程记忆帧,在固定缓存预算下保留信息丰富且多样化的历史上下文。其次,TAME(通过记忆编辑的可信对齐)通过将新召回的记忆令牌的统计量对齐到可信上下文分布来对其进行轻量编辑,减少漂移历史特征造成的污染。基于Self-Forcing,TetherCache在VBench-Long的30秒、60秒和240秒设置上持续提升长视频生成质量。特别地,在240秒生成中,它显著提高了整体和语义分数,同时将质量漂移从7.84降至1.33,证明了其在稳定长程自回归视频扩散中的有效性。

英文摘要

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
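The idea behind TAME, aligning the statistics of recalled memory tokens to a trusted context distribution, can be illustrated with a simple per-channel mean and standard-deviation match. This sketch is an assumption about what "aligning statistics" could look like; the abstract does not specify which statistics are used, and the `strength` blending parameter and the function name `align_statistics` are hypothetical.

```python
from statistics import fmean, pstdev

def align_statistics(recalled, trusted, strength=1.0, eps=1e-6):
    """Shift and scale each channel of `recalled` toward the stats of `trusted`.

    recalled, trusted: lists of token vectors (lists of floats) that share
    the same number of channels. strength=0 leaves tokens unchanged.
    """
    channels = len(recalled[0])
    aligned = [list(tok) for tok in recalled]
    for c in range(channels):
        src = [tok[c] for tok in recalled]
        ref = [tok[c] for tok in trusted]
        src_mean, src_std = fmean(src), pstdev(src)
        ref_mean, ref_std = fmean(ref), pstdev(ref)
        for i, value in enumerate(src):
            # Normalize with the recalled stats, then re-scale with the trusted stats
            matched = (value - src_mean) / (src_std + eps) * ref_std + ref_mean
            aligned[i][c] = (1.0 - strength) * value + strength * matched
    return aligned
```

A `strength` below 1 keeps the edit light, which matches the abstract's description of the tokens being "lightly" edited rather than replaced.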

2606.13115 2026-06-12 cs.CL cs.AI 交叉投稿

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

G-Long:面向高效长期对话代理的图增强记忆管理

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

发表机构 * Sungkyunkwan University(成均馆大学)

AI总结 提出G-Long框架,利用微调小语言模型进行结构化三元组提取和关联检索,并引入注意力感知重要性评分机制,在降低计算开销的同时,在响应生成和记忆检索上达到最优性能。

Comments 22 pages, 8 figures, 14 tables

详情
AI中文摘要

尽管大型语言模型(LLMs)推动了开放域对话系统的发展,但由于长上下文推理的固有限制以及处理大量原始文本的低效性,保持长期一致性仍然是一个挑战。现有方法通常依赖于非结构化记忆存储(容易导致信息丢失)或计算成本高昂的LLMs(导致高延迟)。为了解决这些限制,我们提出了G-Long,一个图增强框架,利用微调的小语言模型(sLM)进行结构化三元组提取和关联检索,显著降低了运营成本。此外,我们引入了新颖的注意力感知重要性评分机制,利用T5摘要器的内在交叉注意力信号来识别显著记忆。跨多个基准的大量实验表明,G-Long在响应生成和记忆检索方面均达到了最先进的性能,在MSC上响应质量提升高达9.8%,在LME上检索召回率提升高达40.8%,同时显著降低了计算开销。

英文摘要

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.

2606.13121 2026-06-12 cs.CL cs.AI cs.SD 交叉投稿

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

NaturalFlow: 减少同步语音到语音翻译中破坏自然语音流的停顿

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

发表机构 * IPAI and ECE, Seoul National University(首尔大学IPAI与ECE) Department of AI, University of Seoul(首尔市立大学人工智能系)

AI总结 提出一个流畅性感知优化框架,通过利用模型内部信号(如语言多样性和语音时长的时间变异性)最小化块间静音,在同步翻译的低延迟和交替传译的自然流畅之间找到平衡点。

Comments Proceedings of the 26th Interspeech Conference, Long Paper

详情
AI中文摘要

同步语音到语音翻译旨在通过最小化延迟实现近实时通信,为交替传译的高延迟提供了一种引人注目的实时替代方案。然而,过度追求低延迟往往会导致碎片化的块状语音。因此,听众会遭受不自然的声学流,其中频繁的停顿可能会增加他们的认知负荷。为了弥补这一差距,我们引入了一个流畅性感知优化框架,旨在发现同步翻译的低延迟优势与交替传译的自然流畅之间的最佳平衡点。我们的框架通过利用模型内部信号(包括语言多样性和语音时长的诱导时间变异性)来最小化块间静音。在短文本和长文本基准上的实验表明,我们的框架在保持竞争性延迟和翻译质量的同时,产生了自然的语音流。

英文摘要

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

2606.13156 2026-06-12 cs.CV cs.AI 交叉投稿

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

迭代视觉思维:通过视觉反馈教会视觉语言模型空间自我修正

Animesh Tripathy, Aswanth Krishnan

发表机构 * QpiAI India Pvt. Ltd(QpiAI印度私人有限公司)

AI总结 提出迭代视觉思维(IVT)框架,通过视觉反馈闭环和两阶段训练(SFT+GRPO),使视觉语言模型具备空间自我修正能力,在三个基准上提升指标2.4-3.2个百分点。

详情
AI中文摘要

视觉语言模型(VLM)在单次空间定位上表现强劲,但缺乏观察和修正自身预测的机制。我们发现,简单地提示VLM在其预测的渲染可视化上迭代会导致灾难性失败:指代表达理解的Acc@0.5从79.6%骤降至48.7%(下降31个百分点),揭示了定位能力与自我修正能力之间的根本差距。我们提出迭代视觉思维(IVT),一种闭环框架,其中模型预测边界框,观察预测在图像上的渲染结果,并通过视觉反馈迭代优化。两阶段训练方案弥合了自我修正差距:首先,我们利用基础模型自身的预测作为真实错误,并提示教师VLM生成修正推理轨迹,从而无需人工标注即可获得监督数据;其次,我们应用组相对策略优化(GRPO)和简单的IoU奖励来稳定多步优化。在涵盖RefCOCOg、Ref-Adv和Ref-L4的混合基准(505个测试样本)上,使用IVT的SFT预热在每个指标上都超过了单次基础模型:Acc@0.5升至82.0%(+2.4个百分点),Acc@0.7升至74.1%(+3.2个百分点),Acc@0.9升至48.3%(+2.8个百分点)。GRPO进一步将每步IoU退化减少了5倍,稳定了优化轨迹。所有训练仅使用单个GPU上的2400个样本,表明空间自我修正是一种可学习的能力,可以在适度规模下灌输。

英文摘要

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
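The quantities named in this abstract (an IoU reward, Acc@threshold, and group-relative reward normalization) can be sketched as below. This is a minimal illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the advantage is the usual mean/std normalization within a sampled group, and none of the function names come from the paper.

```python
from statistics import fmean, pstdev

def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at(predictions, targets, threshold):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (assumed GRPO-style step)."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With an IoU reward, a refinement step that moves the box closer to the target scores above the group mean and receives a positive advantage, while a step that drifts away is penalized.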

2606.13171 2026-06-12 cs.CL cs.AI 交叉投稿

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

NTS-CoT: 基于思维链推理减轻大模型新闻时间线摘要中的幻觉

Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu

发表机构 * Central South University(中南大学) Tsinghua University(清华大学) Nanjing University(南京大学) Suzhou Aerospace Information Research Institute(苏州空天信息研究院) McGill University(麦吉尔大学)

AI总结 针对大模型在新闻时间线摘要中产生内容不忠实和信息遗漏两类幻觉,提出NTS-CoT框架,通过元素思维链、日期选择和因果思维链三个模块有效缓解幻觉,在三个基准上超越现有方法。

详情
AI中文摘要

在线新闻的快速更新使得追踪事件发展具有挑战性,凸显了时间线摘要(TLS)的需求。幻觉(即大模型生成内容偏离源新闻)仍然是基于大模型的TLS中的关键问题,且现有研究对此关注不足。为弥补这一差距,我们识别出两类主要幻觉:新闻摘要中的不忠实内容和日期事件摘要中的信息遗漏。然后,我们提出NTS-CoT,一种利用思维链(CoT)推理来减轻TLS中幻觉的新框架。该框架包含三个关键模块:i) Element-CoT,用于捕获关键新闻元素以实现忠实摘要;ii) Date Selection,结合时间显著性和事件突出性进行时间戳选择;iii) Causal-CoT,用于推断因果关系并减少日期事件摘要中的遗漏。大量实验,包括在三个TLS基准上的定量分析和人工评估,表明NTS-CoT优于最先进的基线,有效减轻了幻觉并提升了基于大模型的TLS性能。我们的源代码可在 https://anonymous.4open.science/r/NTS-CoT 获取。

英文摘要

The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing works. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT.

2606.13177 2026-06-12 cs.CL cs.AI cs.LG 交叉投稿

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

MemRefine: 基于LLM引导的压缩用于长期智能体记忆

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

发表机构 * Korea University(韩国大学) KAIST(韩国科学技术院)

AI总结 提出MemRefine框架,利用LLM判断事实内容,通过删除、合并和保留操作将记忆库压缩到固定预算内,在多个基准上保持下游性能并优于基于规则的基线。

详情
AI中文摘要

大型语言模型(LLM)智能体越来越需要在长期交互中运行,其中过去对话中的信息必须被保留和回忆以支持未来任务。然而,随着交互的积累,记忆存储无限制增长,并充满冗余条目,这些条目增加了存储成本,并通过排挤最有用的证据而降低了检索质量。此外,在具有硬性内存预算的资源受限平台上,这尤其受限,促使我们制定了有存储预算的记忆管理任务,即在固定预算内保持已构建的记忆库,同时保留对未来交互有用的信息。为此,我们提出了MemRefine,一个基于LLM引导的框架,由于表面相似性不能很好地反映事实价值,该框架仅使用相似性来提出候选对,并将删除、合并和保留决策推迟给基于事实内容的LLM判断,迭代直到满足预算。在多个记忆框架和长期对话基准上,MemRefine始终满足目标预算,同时保持下游性能,并在紧预算下优于基于规则的基线。

英文摘要

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. Furthermore, this is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework. Because surface similarity poorly reflects factual value, MemRefine uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
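The compression loop described in this abstract (similarity proposes candidate pairs, a judge decides delete, merge, or preserve, and the loop repeats until the budget is met) can be sketched as follows. Everything concrete here is an assumption for illustration: the budget counts entries rather than tokens or bytes, `jaccard` stands in for whatever similarity the authors use, and `toy_judge` is a rule-based placeholder for the LLM judge.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-overlap similarity, used only to propose candidate pairs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def toy_judge(a, b):
    """Placeholder for an LLM call; returns (action, payload)."""
    if a.lower() == b.lower():
        return ("delete", b)  # exact duplicate: drop one copy
    if jaccard(a, b) >= 0.5:
        return ("merge", a if len(a) >= len(b) else b)
    return ("preserve", None)

def compress(memory, budget, judge=toy_judge, similarity=jaccard):
    """Shrink `memory` toward `budget` entries using judge decisions."""
    entries = list(memory)
    preserved = set()  # pairs the judge chose to keep apart
    while len(entries) > budget:
        candidates = [
            (similarity(a, b), a, b)
            for a, b in combinations(entries, 2)
            if frozenset((a, b)) not in preserved
        ]
        if not candidates:
            break  # nothing left that the judge is willing to compress
        _, a, b = max(candidates, key=lambda c: c[0])
        action, payload = judge(a, b)
        if action == "delete":
            entries.remove(payload)
        elif action == "merge":
            entries.remove(a)
            entries.remove(b)
            entries.append(payload)
        else:
            preserved.add(frozenset((a, b)))
    return entries
```

Tracking preserved pairs keeps this sketch from looping forever when the judge refuses every remaining candidate; how the actual system handles that case is not stated in the abstract.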

2606.13288 2026-06-12 cs.CV cs.AI cs.CL 交叉投稿

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

跨模态掩码组合概念建模以增强视觉-语言组合性

Wei Li, Zhen Huang, Xinmei Tian

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China(中国科学技术大学,教育部脑启发智能感知与认知重点实验室) Independent Researcher(独立研究员)

AI总结 提出MACCO框架,通过掩码一个模态的组合概念并从另一模态完整上下文重建,增强视觉-语言模型的组合理解能力,在五个基准上显著提升。

Comments Accepted to ACL 2026 Main Conference, 25 pages

详情
AI中文摘要

对比训练的视觉-语言模型(如CLIP)在学习联合图像-文本表示方面取得了显著进展,但在组合理解方面仍面临挑战。它们通常表现出“词袋”行为——难以捕捉对象关系、属性-对象绑定和词序依赖。这一限制不仅源于优化时依赖全局单向量表示,还源于对配对图像文本数据中固有丰富组合信息的利用和建模不足。在这项工作中,我们提出了MACCO(掩码组合概念建模)框架,该框架掩码一个模态中的组合概念,并基于另一模态的完整上下文信息重建它们,从而使模型能够更有效地捕捉和对齐跨模态组合结构。为促进这一过程,我们引入了两个辅助目标,在模态间和模态内联合对齐和正则化掩码特征。在五个组合基准上的大量实验和深入分析表明,我们的方法不仅显著增强了VLM的组合性,还提高了它们捕捉句法结构和语言信息的能力。此外,改进的组合性也有利于文本到图像生成和多模态大语言模型。代码可在 https://github.com/hiker-lw/MACCO 获取。

英文摘要

Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior--struggling to capture object relations, attribute-object bindings, and word order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

2606.13289 2026-06-12 cs.CV cs.AI 交叉投稿

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

HYDRA-X: 具有整体视觉分词器的原生统一多模态模型

Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang

发表机构 * Nanjing University(南京大学) CASIA(中国科学院自动化研究所) Tencent Hunyuan(腾讯混元) Zhongguancun Academy(中关村学院) Shanghai AI Lab(上海人工智能实验室)

AI总结 提出HYDRA-X,首个在单一ViT中统一图像和视频分词的原生统一多模态模型,通过因果时间注意力和分层时间压缩实现高效重建,并利用轻量化解压缩器注入语义,显著提升编辑一致性和收敛速度。

详情
AI中文摘要

整体视觉分词器是统一多模态模型(UMMs)的基础,因为它们将多样的视觉输入映射到统一的表示空间。在本文中,我们提出HYDRA-X,这是首个在单一视觉变换器(ViT)中统一图像和视频分词的原生UMM。我们的设计由两个核心挑战驱动:高效地将时空重建能力注入原生ViT,以及将图像级和视频级语义感知嵌入到潜在空间中。为解决第一个挑战,全面的消融实验揭示了两个关键发现:(1)帧级因果时间注意力足以用于视觉重建,而全时空注意力会降低重建质量;(2)分层时间压缩显著优于单步替代方案。为解决第二个挑战,我们提出了一种轻量化解压缩器,在联合图像-视频教师监督下对时间压缩特征进行上采样,从而在紧凑的潜在空间中强制实施互补的语义结构。基于这种整体分词器,我们进一步提出了编辑流程的原则性改进:源-目标交互应在分词器内部的潜在级别发生,而不是在LLM内部的语义级别,从而显著提高编辑一致性并加速收敛。在7B密集模型上实例化,HYDRA-X在图像和视频理解及生成任务上均取得了强劲性能,为未来的统一分词器UMM铺平了道路。

英文摘要

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

2606.13348 2026-06-12 cs.CL cs.AI 交叉投稿

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

IVIE:一种用于增量且经过验证的交互式小说世界生成的神经符号方法

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

发表机构 * Instituto de Computación, Facultad de Ingeniería, Universidad de la República(乌拉圭共和国大学工程学院计算机研究所)

AI总结 提出IVIE神经符号方法,结合LLM的创造力与符号验证的连贯性,通过四阶段增量生成管道构建可玩的交互式小说世界,人类评估显示其生成沉浸式、主题连贯的世界,平衡了灵活性与叙事一致性。

Comments 10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC'26), June 2026

详情
AI中文摘要

交互式小说中的计算创造力面临一个基本矛盾:大型语言模型(LLM)可能产生创意叙事,但难以维持世界连贯性,而符号系统确保一致性但缺乏创意灵活性。我们提出IVIE(增量与验证的交互体验),一种从零开始生成完整且可玩的交互式小说世界的神经符号方法。基于PAYADOR的神经符号框架,IVIE实现了一个四阶段增量生成管道,将创意决策——设定与角色创建、谜题设计——委托给LLM,同时通过符号验证将世界状态接地。该系统生成具有相互关联的地点、功能性物品、非玩家角色和连贯谜题的世界,所有这些都围绕一个中心目标导向架构组织。人类评估表明,该方法生成了沉浸式、主题连贯的世界,具有高玩家参与度。结果似乎表明,神经符号方法成功平衡了灵活性与叙事连贯性:符号验证在不消除生成自由的情况下将LLM生成接地。然而,挑战依然存在:LLM的不一致性偶尔会绕过谜题约束,目标验证方面的空白使一些结构上不可能达成的目标得以通过。我们为未来的神经符号交互式叙事系统确定了关键设计考虑因素,特别是关于LLM的能力及其局限性。

英文摘要

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions--setting and character creation, puzzle design--to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

2606.13432 2026-06-12 cs.CV cs.AI 交叉投稿

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

OmniDirector: 无需交叉配对数据的通用多镜头相机克隆

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

发表机构 * Kuaishou Technology(快手科技) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出OmniDirector框架,通过将相机参数编码为网格运动视频,并利用百万级配对数据训练,实现无需交叉配对数据的多镜头相机运动克隆,具备卓越的控制性能。

Comments 12 pages, 8 figures

详情
AI中文摘要

从参考视频中克隆相机运动是视频生成中的一项重要任务,因为视频提供了直观且精确的控制。现有方法要么直接使用无法处理多镜头生成的参数化表示,要么合成交叉配对数据,但受限于数据稀缺性,导致在复杂相机运动克隆中表现不佳。为解决这些问题,我们引入了一种通用的相机运动表示,将相机编码为网格运动视频。该相机网格以视觉方式表示相机参数,并支持集成多样化的轨迹以进行多镜头视频生成。基于此,我们提出了OmniDirector,一个在百万级相机网格-视频对上训练的统一框架,该框架协调角色、动作和相机,为多模态扩散变换器提供导演级别的控制。此外,我们设计了一种新颖的分层提示扩展代理,通过理解信号关系系统地描述相机运动和视觉内容,从而和谐地整合不同的控制信号。大量实验证明了我们框架的卓越性能和出色的可控性。项目页面:https://ymlinfeng.github.io/OmniDirector.github.io/

英文摘要

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffer from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

2606.13544 2026-06-12 eess.AS cs.AI cs.CL 交叉投稿

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

面向实时多方语音代理的自适应轮流发言

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI总结 提出ModeratorLM,一种基于角色条件的语音大模型,通过分块流式处理和思维链推理,在多方对话中实现自适应轮流发言,显著提升轮流发言的精度和召回率。

Comments Accepted for publication at Interspeech 2026

详情
AI中文摘要

多方口语对话中的轮流发言仍然是语音代理面临的基本挑战,特别是在动态的发言权竞争和用户期望变化的情况下。我们提出ModeratorLM,一种角色扮演语音代理,它在多方环境中根据明确分配的角色来调节轮流发言行为。该系统基于以分块流式方式运行的语音大语言模型。我们进一步引入了一种推理增强变体,该变体结合了对对话上下文和分配角色的思维链推理。我们构建了RolePlayConv,一个大规模合成数据集,包含具有多种助手角色的口语多方对话。在真实会议数据和RolePlayConv上的实验表明,与无角色条件的基线相比,轮流发言精度提高了40%以上,召回率提高了70%以上,同时大幅减少了误报中断。

英文摘要

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.

2606.13572 2026-06-12 cs.CL cs.AI 交叉投稿

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

ArogyaSutra:面向印度语言的多模态医学推理的多智能体框架

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

发表机构 * Indian Institute of Technology Patna(印度理工学院巴特那分校) Indian Institute of Technology Kanpur(印度理工学院坎普尔分校) Prasannadeb Women’s College(普拉萨纳德布女子学院)

AI总结 针对印度语言医疗场景中多模态大语言模型性能不足的问题,提出多模态医学问答数据集ArogyaBodha和基于演员-评论家的多智能体框架ArogyaSutra,通过工具接地与双记忆机制提升多语言医学推理准确性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在通用领域展现出有希望的推理能力,但在医疗等专业场景中,尤其是在多语言和低资源情况下,其性能仍然有限。这一差距在印度农村等地区尤为关键,患者通常用本土印度语言表达复杂的医疗问题,并依赖医学图像等多模态输入。现有的以英语为中心的MLLMs难以支持此类用例,限制了公平获取AI驱动的医疗辅助。为应对这一挑战,我们引入了ArogyaBodha,一个大规模的多语言多模态医学问答数据集,由八个异构来源构建,涵盖31个身体系统、六种成像模态和21个临床领域,覆盖英语和七种主要印度语言。我们进一步提出了ArogyaSutra,一个基于演员-评论家的多智能体框架,将工具接地与双记忆机制相结合,实现逐步的、推理感知的决策,并使用存储的演员-评论家模拟轨迹进行蒸馏。实验表明,我们的数据集和框架在所有印度语言上提高了多语言医学推理的准确性,消融实验验证了每个组件的贡献。源代码和数据集可在以下网址获取:https://iitp-cse.github.io/ArogyaSutra/

英文摘要

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

2606.13580 2026-06-12 cs.CV cs.AI 交叉投稿

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

EvTexture++: 事件驱动的视频超分辨率纹理增强

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

发表机构 * MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China(中国科学技术大学,脑启发智能感知与认知教育部重点实验室) Midea Group(美的集团)

AI总结 提出首个事件驱动的视频超分辨率纹理增强框架EvTexture++,利用事件的高频时空细节逐步恢复纹理,并通过时间纹理对齐模块增强帧间一致性,在多个数据集上达到最优性能。

Comments IEEE TPAMI 2026. Extended version of arXiv:2406.13457 (ICML 2024). Project page: https://dachunkai.github.io/evtexture-project-page/

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6642-6659, June 2026
AI中文摘要

基于事件的视觉因其独特特性(包括超高时间分辨率和极端动态范围)而受到越来越多的关注。最近的工作将其引入视频超分辨率(VSR)以增强光流估计和时间对齐。相比之下,本文将事件信号的关注点从运动细化转向VSR中的纹理增强。我们提出了EvTexture++,这是首个专用于VSR中纹理增强的事件驱动框架。它利用事件的高频时空细节来改善纹理恢复。EvTexture++包含一个定制的纹理增强分支,以及一个迭代纹理增强模块,该模块逐步利用高时间分辨率的事件信息进行纹理恢复。这使得纹理区域在迭代中逐渐细化,从而产生更准确、更详细的高分辨率输出。除了帧内纹理恢复外,大运动可能会降低帧间时间一致性,尤其是在纹理区域,导致纹理闪烁。为了缓解这一问题,我们进一步利用事件的连续时间运动线索来增强时间一致性,引入了一个时间纹理对齐模块,该模块估计事件引导的纹理感知光流,以实现精确的帧间纹理对齐。此外,EvTexture++被设计为即插即用工具,可灵活提升现有VSR模型的性能。在五个数据集上的实验表明,EvTexture++达到了最先进的性能。当集成到最近的VSR模型中时,它带来了显著的改进,在纹理丰富的Vid4数据集上PSNR提升高达1.55 dB。代码:https://github.com/DachunKai/EvTexture

英文摘要

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.
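
For readers unfamiliar with the metric, the 1.55 dB gain quoted above is measured in peak signal-to-noise ratio (PSNR). The following is a minimal sketch of the standard PSNR definition, assuming 8-bit images flattened into lists of pixel values. The function name `psnr` and the toy pixel values are illustrative assumptions, not taken from the paper or its released code.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Standard PSNR in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example with hypothetical pixel values (not real benchmark data).
ref = [52, 55, 61, 59]
out = [54, 55, 60, 57]
print(round(psnr(ref, out), 2))
```

Because PSNR is logarithmic, a gain of 1.55 dB corresponds to roughly a 30% reduction in mean squared error (10^(-0.155) ≈ 0.70).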

2606.13680 2026-06-12 cs.CL cs.AI 交叉投稿

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

通过检索增强强化微调进行类比推理学习

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez

发表机构 * Meta Superintelligence Labs(Meta超级智能实验室) Rice University(莱斯大学)

AI总结 提出RA-RFT框架,通过黄金相关性蒸馏训练检索器,并结合强化微调利用类比推理轨迹,提升数学推理性能。

详情
AI中文摘要

检索增强生成(RAG)已成为将语言模型锚定于外部知识的标准机制,然而基于词汇或语义相似性的传统检索难以适用于复杂推理任务:语义相似的问题可能要求完全不同的解决策略,而表面不同的问题可能共享相同的底层推理模式。我们提出检索增强强化微调(RA-RFT),一种后训练框架,教导语言模型通过类比进行推理。RA-RFT使用黄金相关性蒸馏训练检索器,该检索器根据预期推理收益而非语义重叠对上下文进行排序,然后通过强化微调方法利用检索到的类比演示对策略模型进行微调,使模型学会在可验证的结果奖励下利用推理轨迹。我们进一步分析了检索上下文的多样性,发现推理感知检索揭示了互补的解决策略,为个别问题提供了不同的推理支架。在具有挑战性的数学推理基准上,RA-RFT始终优于标准强化微调方法。例如,在AIME 2025上,对于Qwen3-1.7B和Qwen3-4B,RA-RFT的average@32准确率分别比GRPO提高了7.1和2.8个百分点,这表明推理感知检索是一个互补的改进轴,与奖励设计或训练课程的进步正交。

英文摘要

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.
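
The average@32 figure above is commonly read as follows: sample 32 responses per problem, check each against the verifiable answer, and average the per-sample accuracy over all problems. The sketch below illustrates that reading only. The abstract does not show the paper's evaluation harness, so the function name `average_at_k` and the toy data are hypothetical.

```python
def average_at_k(correct_flags_per_problem):
    """Mean over problems of the fraction of correct samples for each problem."""
    if not correct_flags_per_problem:
        raise ValueError("need at least one problem")
    per_problem = []
    for flags in correct_flags_per_problem:
        if not flags:
            raise ValueError("each problem needs at least one sample")
        # Booleans sum as 0/1, so this is the per-problem accuracy.
        per_problem.append(sum(flags) / len(flags))
    return sum(per_problem) / len(per_problem)

# Toy data: 2 problems with 4 samples each (a real run would use k = 32).
flags = [[True, False, True, True], [False, False, True, False]]
print(average_at_k(flags))  # 0.5
```

Under this reading, a 7.1-point gain means the sampled responses are correct 7.1 percentage points more often on average, which is a different quantity from pass@k.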

2506.01274 2026-06-12 cs.CV cs.AI 版本更新

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

ReFoCUS: 用于上下文理解的强化引导帧优化

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ReFoCUS框架,首次将在线策略梯度强化学习集成到视频大语言模型的帧级优化中,通过自回归和查询条件选择架构学习帧选择策略,无需显式帧级监督,提升视频问答推理准确性。

Comments Project page: https://interlive-team.github.io/ReFoCUS/

详情
AI中文摘要

近期大型多模态模型(LMMs)的进展实现了有效的视觉-语言推理,然而视频理解能力仍受限于次优的帧选择策略,尽管视频专用LMMs发展迅速。先前的工作尝试通过静态启发式或外部检索模块来提供帧级信息,但这些方法往往无法捕捉与给定用户查询相关的视觉线索,混淆了原始视觉动态与真正的语义相关性。在本文中,我们介绍了ReFoCUS(用于上下文理解的强化引导帧优化),这是首个将在线策略梯度强化学习集成到视频-LLMs帧级优化的框架。ReFoCUS旨在学习帧选择策略,利用来自参考模型的奖励信号来捕捉其对最佳支持时间接地响应的帧组合的潜在评分行为。为了高效探索巨大的组合帧空间,我们采用了一种自回归且查询条件的选择架构,确保上下文一致性的同时降低复杂度。我们的策略学习无需显式帧级监督,因为它隐式地发现了最优且语义一致的帧组合。ReFoCUS在多个视频问答基准测试中持续提高了推理准确性,证明了将帧选择与模型内部效用对齐的优势。

英文摘要

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to solve this with static heuristics or external retrieval modules to feed frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS aims to learn a frame selection policy, leveraging reward signals derived from reference models to capture their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive and query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.

2507.10599 2026-06-12 cs.CL cs.AI cs.LG 版本更新

Emergence of Hierarchical Emotion Organization in Large Language Models

大型语言模型中层级情感组织的涌现

Maya Okawa, Bo Zhao, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of Tokyo(东京大学)

AI总结 受情感轮理论启发,分析大型语言模型输出中情感状态间的概率依赖关系,发现模型自然形成与人类心理模型一致的层级情感树,且更大模型发展出更复杂的层级结构,同时揭示社会经济角色在情感识别中的系统性偏差。

Comments ICML 2026

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地驱动对话代理,理解它们如何建模用户的情绪状态对于伦理部署至关重要。受情感轮(即一种认为情感层级组织的心理学框架)的启发,我们分析了模型输出中情感状态之间的概率依赖关系。我们发现LLMs自然形成与人类心理模型一致的层级情感树,且更大的模型发展出更复杂的层级结构。我们还揭示了跨社会经济角色的情感识别中存在系统性偏差,对于交叉、代表性不足的群体,错误分类会叠加。人类研究显示出惊人的相似性,表明LLMs内化了社会感知的某些方面。除了突出LLMs中的涌现情感推理能力,我们的结果还暗示了利用认知基础理论开发更好模型评估的潜力。

英文摘要

As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

2508.01656 2026-06-12 cs.CL cs.AI cs.CY cs.HC physics.soc-ph 版本更新

Authorship Attribution in Multilingual Machine-Generated Texts

多语言机器生成文本的作者归属

Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli

发表机构 * DIMES Department, University of Calabria(卡拉布里亚大学DIMES系) Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

AI总结 提出多语言作者归属问题,研究单语言方法在18种语言和8个生成器上的跨语言迁移能力,发现显著局限。

Comments Accepted at ACL 2026 - Main

详情
AI中文摘要

随着大型语言模型(LLM)达到类人的流畅性和连贯性,区分机器生成文本(MGT)与人类撰写的内容变得越来越困难。虽然MGT检测的早期工作侧重于二元分类,但LLM的不断发展和多样性需要更细粒度且更具挑战性的作者归属(AA),即能够识别文本背后的确切生成器(LLM或人类)。然而,目前AA仍局限于单语言环境,其中英语是研究最多的语言,忽视了现代LLM的多语言性质和使用。在这项工作中,我们引入了多语言作者归属问题,涉及将文本归因于跨多种语言的人类或多个LLM生成器。聚焦于18种语言——涵盖多个语系和书写系统——以及8个生成器(7个LLM和人类撰写类别),我们研究了单语言AA方法在多语言环境中的适用性,包括其跨语言迁移能力,以及生成器对归属性能的影响。我们的结果表明,虽然某些单语言AA方法可以适应多语言环境,但仍然存在显著的局限性和挑战,特别是在跨不同语系迁移时,这凸显了多语言AA的复杂性以及需要更稳健的方法以更好地匹配现实场景。

英文摘要

As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA remains nowadays confined to a monolingual setting, with English being the most investigated one, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

2601.19072 2026-06-12 cs.SE cs.AI 版本更新

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

HalluJudge: 代码审查自动化中上下文错位的无参考幻觉检测

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu

发表机构 * Monash University Australia(澳大利亚蒙纳士大学) The University of Melbourne Australia(澳大利亚墨尔本大学) Atlassian USA(美国Atlassian)

AI总结 提出无参考幻觉检测方法HalluJudge,基于上下文对齐评估生成评论是否有据,涵盖从直接评估到多分支推理的四种策略;F1得分为0.85,平均成本0.009美元,67%的评估结果与开发者偏好一致。

Comments Accepted at FSE'26: Industry Track, Full-Length, Peer-Reviewed

详情
AI中文摘要

大型语言模型(LLM)在代码审查自动化(如审查评论生成)中表现出强大能力,但它们存在幻觉——生成的审查评论与实际代码无根基——这对LLM在代码审查工作流程中的应用构成重大挑战。为解决此问题,我们探索了无需参考的、有效且可扩展的方法来检测LLM生成的代码审查评论中的幻觉。在这项工作中,我们设计了HalluJudge,旨在基于上下文对齐评估生成评论的根基性。HalluJudge包括四种关键策略,从直接评估到结构化多分支推理(例如,思维树)。我们在Atlassian的企业级软件项目中对这些评估策略进行了全面评估,以检验HalluJudge的有效性和成本效率。此外,我们分析了HalluJudge的判断与实际生产环境中LLM生成的代码审查评论的开发人员偏好之间的一致性。我们的结果表明,HalluJudge中的幻觉评估具有成本效益,F1得分为0.85,平均成本为0.009美元。平均而言,67%的HalluJudge评估与在线生产中实际LLM生成的审查评论的开发人员偏好一致。我们的结果表明,HalluJudge可以作为实用的保障措施,减少开发人员接触幻觉评论,从而促进对AI辅助代码审查的信任。

英文摘要

Large Language Models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations -- where the generated review comments are ungrounded in the actual code -- which pose a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for hallucination detection in LLM-generated code review comments without a reference. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.
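
As a reminder of what the reported F1 score of 0.85 summarizes, here is a minimal sketch of the standard F1 computation from detection counts. The counts in the example are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives   # comments flagged as hallucinated
    actual = true_positives + false_negatives      # comments that truly are hallucinated
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 17 correct flags, 3 false alarms, 3 missed hallucinations.
print(round(f1_score(true_positives=17, false_positives=3, false_negatives=3), 2))  # 0.85
```

F1 penalizes a detector that trades one error type for the other, which matters here because both false alarms and missed hallucinations affect developer trust.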

2601.19827 2026-06-12 cs.CL cs.AI cs.IR 版本更新

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

当迭代RAG优于理想证据:科学多跳问答中的诊断研究

Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

发表机构 * Faculty of Engineering, McMaster University, Canada(麦克马斯特大学工程学院,加拿大) BASF Canada Inc., Canada(巴斯夫加拿大公司,加拿大)

AI总结 通过化学多跳问答数据集,诊断发现迭代检索-推理循环在科学领域显著优于静态RAG上限,揭示了阶段式检索的优势与失败模式。

Comments 51 pages, 29 figures

详情
AI中文摘要

检索增强生成(RAG)将大型语言模型(LLMs)扩展到参数化知识之外,但目前尚不清楚迭代检索-推理循环何时能有效超越静态RAG,尤其是在涉及多跳推理、稀疏领域知识和异构证据的科学领域。我们首次进行了受控的、机制层面的诊断研究,以探究同步迭代检索和推理能否超越理想化的静态上限(Gold Context)RAG。我们在三种设置下对十一个最先进的LLM进行了基准测试:(i)无上下文,衡量对参数化记忆的依赖;(ii)Gold Context,一次性提供所有真实证据;(iii)迭代RAG,一种无需训练的控制器,交替进行检索、假设细化和证据感知停止。使用以化学为中心的ChemKGMultiHopQA数据集,我们分离出需要真正检索的问题,并通过诊断分析行为,涵盖检索覆盖缺口、锚点携带下降、查询质量、组合保真度和控制校准。在所有模型中,迭代RAG始终优于Gold Context,增益高达25.6个百分点,尤其对于非推理微调模型。阶段式检索减少了后期跳失败,缓解了上下文过载,并实现了对早期假设漂移的动态修正,但剩余的失败模式包括跳覆盖不完整、干扰物锁定轨迹、过早停止校准错误以及即使检索完美时的高组合失败率。总体而言,阶段式检索通常比理想证据的单纯存在更具影响力;我们为在专门科学环境中部署和诊断RAG系统提供了实用指导,并为更可靠、可控的迭代检索-推理框架奠定了基础。

英文摘要

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
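
The Iterative RAG regime described above alternates retrieval, hypothesis refinement, and evidence-aware stopping. The following is a minimal, training-free sketch of such a loop under strong simplifying assumptions: the knowledge base is a toy dictionary of one-hop facts, and `retrieve`, `iterative_rag`, and the stopping rule are hypothetical stand-ins rather than the authors' controller.

```python
# Toy knowledge base: (entity, relation) -> entity. Purely illustrative.
KB = {
    ("aspirin", "synthesized_from"): "salicylic acid",
    ("salicylic acid", "derived_from"): "salicin",
    ("salicin", "found_in"): "willow bark",
}

def retrieve(entity, relation):
    """Stand-in retriever: return the fact for this hop, or None if missing."""
    return KB.get((entity, relation))

def iterative_rag(start_entity, relations, max_hops=5):
    """Alternate retrieval and hypothesis refinement, one hop at a time."""
    hypothesis = start_entity   # current anchor entity
    evidence = []               # facts gathered so far
    for relation in relations[:max_hops]:
        fact = retrieve(hypothesis, relation)
        if fact is None:
            # Evidence-aware stop: this hop is not covered, so do not guess.
            return {"answer": None, "evidence": evidence, "stopped_early": True}
        evidence.append((hypothesis, relation, fact))
        hypothesis = fact       # refine the hypothesis with the new evidence
    return {"answer": hypothesis, "evidence": evidence, "stopped_early": False}

result = iterative_rag("aspirin", ["synthesized_from", "derived_from", "found_in"])
print(result["answer"])  # willow bark
```

The contrast with the Gold Context regime is that each query here is conditioned on the previous hop's result, so the context stays small at every step instead of holding all evidence at once.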

2602.00462 2026-06-12 cs.CV cs.AI 版本更新

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

LatentLens: 揭示大语言模型中高度可解释的视觉标记

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出 LatentLens 方法,通过将视觉标记与文本语料库中的上下文标记表示进行最近邻匹配,实现视觉标记的可解释性,发现大多数视觉标记在各层均具有可解释性。

Comments ICML 2026 (Camera Ready)

详情
AI中文摘要

将大型语言模型(LLM)转换为视觉语言模型(VLM)可以通过将视觉编码器输出的视觉标记映射到LLM的嵌入空间来实现。有趣的是,这种映射可以简单到浅层MLP变换。为了理解LLM为何能如此容易地处理视觉标记,我们需要可解释性方法来揭示在LLM处理的每一层中视觉标记表示所编码的内容。在这项工作中,我们引入了LatentLens,一种将潜在表示映射到自然语言描述的新方法。LatentLens编码一个大型文本语料库,并存储该语料库中每个标记的上下文化标记表示。然后将视觉标记表示与这些上下文化表示进行比较,并将最邻近的表示作为视觉标记的描述。我们在15个不同的VLM上评估了该方法,结果表明,常用的方法(如LogitLens)大大低估了视觉标记的可解释性。相反,使用LatentLens,大多数视觉标记在所有研究的模型和所有层中都是可解释的。定性上,我们展示了LatentLens产生的描述在语义上有意义,并且与单个标记相比,为人类提供了更细粒度的解释。更广泛地说,我们的发现为视觉和语言表示之间的对齐提供了新的证据,并为分析LLM的潜在表示开辟了新的方向。

英文摘要

Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
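
The lookup described above can be sketched as a nearest-neighbor search over stored contextualized token representations. This sketch assumes cosine similarity as the comparison metric, which the abstract does not specify, and uses tiny hand-made vectors. The names `cosine`, `nearest_descriptions`, and the toy corpus are illustrative assumptions, not the LatentLens implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_descriptions(visual_vec, corpus, top_k=2):
    """Return the top_k (token-in-context, similarity) pairs for one visual token."""
    scored = [(text, cosine(visual_vec, vec)) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "corpus": each entry is a token in its sentence context plus a made-up vector.
corpus = [
    ("a [dog] running on grass", [0.9, 0.1, 0.0]),
    ("the [cat] sat on the mat", [0.1, 0.9, 0.0]),
    ("a red [car] on the road", [0.0, 0.1, 0.9]),
]
visual_token = [0.8, 0.2, 0.1]  # hypothetical visual token representation

for text, score in nearest_descriptions(visual_token, corpus):
    print(f"{score:.3f}  {text}")
```

Returning the token together with its sentence context, rather than a bare vocabulary item, is what allows the descriptions to be more fine-grained than single-token readouts.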

2602.07106 2026-06-12 cs.CV cs.AI cs.CL 版本更新

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Ex-Omni:为全模态大语言模型赋能3D面部动画生成

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) LIGHTSPEED Independent Researcher(独立研究员)

AI总结 提出Ex-Omni模型,通过混合形状感知语音单元生成器和解码器解耦语义推理与时间生成,并引入统一令牌查询门控融合机制,实现全模态大语言模型同步生成语音和3D面部动画。

详情
AI中文摘要

全模态大语言模型(OLLM)旨在统一多模态理解和生成;然而,尽管联合生成语音和3D面部动画对自然的人机交互十分重要,这一扩展方向在很大程度上仍未被探索。一个关键挑战是LLM的离散语义推理与3D面部运动所需的密集时间动态之间的不匹配。我们提出Expressive Omni (Ex-Omni),一个开源模型,通过原生语音伴随的3D面部动画增强OLLM。Ex-Omni通过混合形状感知语音单元生成器和混合形状解码器将语义推理与时间生成解耦,其中语音单元提供时间支架,隐藏语音表示携带面部相关线索。我们进一步引入统一的令牌即查询门控融合(TQGF)机制用于受控语义注入,以及InstructS2SF-1200K,一个包含1200K样本的预训练数据集。大量实验表明,Ex-Omni在保持竞争性语音理解和生成能力的同时,实现了比级联管道更好的音视频同步和更低的面部生成延迟。

英文摘要

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG 版本更新

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

InnoEval:将研究思路评估视为基于知识的多视角推理问题

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出InnoEval框架,通过异构深度知识检索和多视角评审委员会,实现基于知识的多维度解耦评估,在逐点、成对和分组评估任务中优于基线方法。

Comments ICML 2026

详情
AI中文摘要

大型语言模型的快速发展催生了科学思路的激增,但这一飞跃并未伴随思路评估的相应进步。科学评估的基本性质需要知识基础、集体审议和多标准决策。然而,现有的思路评估方法往往存在知识视野狭窄、评估维度扁平化以及LLM作为评判者的固有偏见。为解决这些问题,我们将思路评估视为一个基于知识的多视角推理问题,并引入InnoEval,一个深度创新评估框架,旨在模拟人类水平的思路评估。我们应用了一个异构深度知识搜索引擎,从多样化的在线来源中检索和获取动态证据。我们进一步通过一个包含不同学术背景的评审员的创新评审委员会实现评审共识,从而在多个指标上进行多维解耦评估。我们构建了来自权威同行评审提交的全面数据集,以基准测试InnoEval。实验表明,InnoEval在逐点、成对和分组评估任务中始终优于基线方法,展现出与人类专家高度一致的判断模式和共识。

英文摘要

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation requires knowledge grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these issues, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

2603.06652 2026-06-12 cs.CV cs.AI 版本更新

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR: 通过多模态过程对齐实现忠实视觉推理

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Data Science & Artificial Intelligence Research Institute, China Unicom(中国联通数据科学与人工智能研究院) Unicom Data Intelligence, China Unicom(中国联通数据智能)

AI总结 提出PaLMR框架,通过感知对齐数据层和过程对齐优化层,减少推理幻觉并提升视觉推理忠实度,在多个基准上取得最优结果。

详情
Journal ref
CVPR 2026 Findings
AI中文摘要

强化学习近期提升了大语言模型和多模态大语言模型的推理能力,但现有的奖励设计强调最终答案的正确性,因此容忍过程幻觉——即模型在得到正确答案的同时错误感知视觉证据的情况。我们通过PaLMR框架解决这种过程层面的不对齐,该框架不仅对齐结果,还对齐推理过程本身。PaLMR包含两个互补组件:一个感知对齐数据层,构建具有结构化伪真值和可验证视觉事实的过程感知推理数据;以及一个过程对齐优化层,构建具有过程感知评分函数的分层奖励融合方案,以鼓励视觉上可信的思维链并提高训练稳定性。在Qwen2.5-VL-7B上的实验表明,我们的方法显著减少了推理幻觉并提高了视觉推理忠实度,在HallusionBench上取得了最先进的结果,同时在MMMU、MathVista和MathVerse上保持了强劲性能。这些发现表明,PaLMR为过程对齐的多模态推理提供了一条原则性且实用的路径,推进了MLLM的可靠性和可解释性。

英文摘要

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

2604.24079 2026-06-12 cs.CL cs.AI 版本更新

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

语用人格:通过桥接推理发现LLM人格

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea(韩国中央大学人工智能系) Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada(不列颠哥伦比亚大学计算机科学系) Van Lang University, Ho Chi Minh City, Vietnam(文朗大学)

AI总结 提出基于桥接推理的框架,通过构建话语级知识图谱捕捉LLM对话中的隐含语义关联,实现从话语连贯性层面发现稳定人格特征,优于基于频率或风格的基线方法。

Comments 15 pages, 4 figures, accepted to ICPR 2026

详情
AI中文摘要

大型语言模型(LLM)通过对话展现出固有且独特的人格。然而,现有的大多数人格发现方法依赖于表面层面的词汇或风格线索,将对话视为平坦的token序列,未能捕捉维持人格一致性的更深层次话语结构。为解决这一局限,我们提出一种新颖的分析框架,通过桥接推理(即通过共享世界知识和话语连贯性连接话语的隐含概念关系)来解读LLM对话。通过将这些关系建模为结构化知识图谱,我们的方法捕捉了控制LLM在对话轮次间组织意义的潜在语义链接,从而在话语连贯性层面而非表面实现上实现人格发现。在多种推理骨干和从小型模型到80B参数系统的目标LLM上的实验结果表明,与基于频率或风格的基线相比,桥接推理图产生了显著更强的语义连贯性和更稳定的人格识别。这些结果表明,人格特质始终编码在话语的结构组织中,而非孤立的词汇模式中。本工作提出了一个系统框架,通过认知话语理论的视角来探测、提取和可视化潜在的LLM人格,桥接了计算语言学、认知语义学和大型语言模型中的人格推理。代码见:https://github.com/JiSoo-Yang/Persona_Bridging.git

英文摘要

Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo-Yang/Persona_Bridging.git

2605.16713 2026-06-12 cs.CV cs.AI 版本更新

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

GeoWorld-VLM:从世界模型中获取几何结构用于视觉-语言模型

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Kempner Institute for the Study of Natural and Artificial Intelligence(肯普纳自然与人工智能研究所) Harvard University(哈佛大学)

AI总结 GeoWorld-VLM通过将冻结的摄像机条件视频世界模型的几何结构转移到视觉-语言模型中,提升空间关系推理能力,实验显示在两种不同架构上均有一致改进,并在What'sUp和VSR基准上均提升约4%。

详情
AI中文摘要

现代视觉-语言模型(VLM)在语义识别方面表现优异,但在“在……左侧”、“在……上”、“在……后面”和“在……之间”等基本空间关系上仍显脆弱。这一失败的一个原因出现在语言推理开始之前:视觉路径在特征提取过程中可能压缩或丢弃关键的3D结构线索,导致语言模型接收到的图像表示已不足以支持可靠的空间判断。我们引入GeoWorld-VLM,一种VLM侧蒸馏框架,将冻结的摄像机条件视频世界模型中的几何结构迁移到VLM中。GeoWorld-VLM仅微调图像编码器和多模态投影器,使投影器之后的图像特征与世界模型的中间表示对齐,同时保持主骨干冻结。给定图像、提示和采样的摄像机轨迹,世界模型教师将静态视觉输入转换为合成的多视角空间信号。训练结合了空间答案监督、教师-学生特征对齐以及对原始VLM的保留锚点。由于语言模型保持冻结,GeoWorld-VLM保留了原始模型的语言能力,同时将空间改进归因于增强的视觉路径。为了评估所提方法的有效性和通用性,我们将GeoWorld-VLM应用于两种不同的VLM架构,并在两个骨干上观察到一致的改进。GeoWorld-VLM在What'sUp和VSR基准上均提升了约4%的性能,表明世界模型引导的视觉对齐在不同模型结构和空间推理数据集上具有泛化能力。

英文摘要

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.
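
The three training signals named in the abstract (spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM) can be read as one weighted objective. The sketch below is only an illustration of that idea under stated assumptions: the abstract does not give the loss forms or weights, so the cross-entropy, cosine, and mean-squared-error terms and the names `w_align` and `w_anchor` are hypothetical choices, not the authors' formulation.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_entropy(probs, target_index):
    # Negative log-likelihood of the correct spatial answer.
    return -math.log(probs[target_index])

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(answer_probs, answer_index, student_feat, teacher_feat,
                  original_feat, w_align=1.0, w_anchor=0.1):
    # Assumed loss forms; the weights are hypothetical hyperparameters.
    answer_term = cross_entropy(answer_probs, answer_index)
    align_term = 1.0 - cosine_similarity(student_feat, teacher_feat)
    anchor_term = mean_squared_error(student_feat, original_feat)
    return answer_term + w_align * align_term + w_anchor * anchor_term
```

In this toy form, the anchor term penalizes the fine-tuned visual features for drifting from those of the original VLM, which is one plausible way to express the "preservation anchor" the abstract mentions.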

2605.22641 2026-06-12 cs.CL cs.AI cs.LG 版本更新

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

更多上下文、更大模型还是道德知识?政治文本中施瓦茨价值观检测的系统研究

Víctor Yeste, Paolo Rosso

发表机构 * PRHLT Research Center, Universitat Politècnica de València, Spain(西班牙瓦伦西亚理工大学PRHLT研究中心) School of Science, Engineering and Design, Universidad Europea de Valencia, Spain(西班牙瓦伦西亚欧洲大学科学、工程与设计学院) Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)(瓦伦西亚人工智能研究生学院与研究网络(ValgrAI))

AI总结 本研究系统比较了上下文范围、检索增强道德知识和模型规模对政治文本中施瓦茨价值观检测的影响,发现全文档上下文和检索知识对监督编码器有效,但对零样本大语言模型帮助有限,且模型扩展不保证性能提升。

Comments Code: https://github.com/VictorMYeste/human-value-detection-context-rag, best model: https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag, 18 pages, 3 figures

详情
AI中文摘要

检测政治文本中的施瓦茨价值观具有挑战性,因为隐含线索通常依赖于周围的论证和相邻价值观之间的细微差别。我们研究了上下文和显式道德知识何时有助于句子级别的价值观检测。使用ValuesML/Touché ValueEval格式,我们比较了句子、窗口和全文档输入;无检索增强和基于检索增强的设置(使用精心策划的道德知识库);监督的DeBERTa-v3-base/large编码器;以及参数规模从12B到123B的零样本大语言模型。结果表明,更多上下文并非总是更好:全文档上下文使监督的DeBERTa编码器相比仅句子输入提高了3.8-4.8个宏F1点,但对零样本大语言模型没有一致帮助。在匹配比较中,检索到的道德知识更一致地有用,在早期融合下改善了每个测试的模型系列和上下文条件。然而,从DeBERTa-v3-base扩展到large以及从12B扩展到更大的大语言模型并不保证收益,并且对于编码器而言,简单的早期融合优于所测试的后期融合和交叉注意力检索增强生成变体。按价值观分析表明,上下文和检索对社会情境化或概念上易混淆的价值观帮助最大。这些发现表明,价值观敏感的NLP应联合评估上下文、知识和模型系列,而不是将更长的输入或更大的模型视为通用改进。

英文摘要

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

2606.10716 2026-06-12 cs.CL cs.AI 版本更新

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

注意力扩展:利用注意力增强的上下文嵌入提升长文档关键短语提取

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

发表机构 * Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University(技术研究所,ICAI工程学院,科米利亚斯宗座大学) DD-AIM, Senior Machine Learning Researcher(DD-AIM,高级机器学习研究员)

AI总结 提出注意力扩展机制,通过预训练词嵌入增强PLM的上下文表示,在不增加计算成本的情况下扩展有效上下文范围,显著提升长文档关键短语提取性能。

详情
AI中文摘要

预训练语言模型(PLM)在关键短语提取(KPE)中取得了强劲性能,主要得益于其生成丰富上下文表示的能力。然而,长文档KPE仍然具有挑战性,因为显著的关键短语证据可能分散在遥远的文档部分,而这些部分无法在大多数PLM有限的上下文窗口内被联合捕获。尽管长上下文大语言模型(LLM)可以处理更广泛的文本上下文,但其计算成本限制了它们在高效和高通量KPE中的实用性。为了克服这一限制,我们提出了一种注意力扩展机制,该机制利用预训练词嵌入,用周围超出上下文的块中的信息来增强PLM的令牌表示。所提出的机制扩展了基于PLM的KPE模型的有效上下文范围,而无需全文档注意力或昂贵的基于LLM的推理。我们在五个PLM骨干网络上评估了我们的方法,包括通用、科学、任务特定和长上下文编码器,使用了两种训练机制和来自科学和新闻领域的五个基准语料库。实验结果表明,注意力扩展在所有评估设置中一致地提升了KPE性能,超越了最先进的模型,并在F1分数上取得了显著改进。这些改进扩展到领域特定、任务专门化和原生长上下文模型,表明所提出的机制提供了互补信息,而不仅仅是补偿有限的输入长度。这些结果确立了注意力扩展作为长文档KPE的一种高效且有效的策略。

英文摘要

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.
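
The abstract says that token representations are augmented with information from out-of-context chunks via pre-trained word embeddings, but it does not spell out the mechanism. As a rough illustration only, the sketch below uses plain scaled dot-product attention from one token vector over the external word embeddings and concatenates the attended context to the token. The function `expand_token`, the concatenation step, and the assumption that both vector types share one dimension are all hypothetical and may differ from the paper's design.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expand_token(token_vec, external_embeddings):
    # Attend from one in-window token over word embeddings taken from
    # chunks outside the model's context window (assumed same dimension).
    dim = len(token_vec)
    scores = [
        sum(t * e for t, e in zip(token_vec, emb)) / math.sqrt(dim)
        for emb in external_embeddings
    ]
    weights = softmax(scores)
    context = [
        sum(w * emb[i] for w, emb in zip(weights, external_embeddings))
        for i in range(dim)
    ]
    # Concatenate the original representation with the attended context.
    return token_vec + context, weights
```

Because the external side uses static word embeddings rather than a second encoder pass, the extra cost per token is a single attention step, which is consistent with the efficiency motivation stated in the abstract.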

2606.11792 2026-06-12 cs.CV cs.AI cs.CL 版本更新

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

AI总结 提出MultiToP框架,通过轻量级视觉令牌修补器动态替换不可靠视觉令牌,结合信息引导排名校准和稀疏正则化,在不修改原模型情况下减少视频多模态模型幻觉,显著提升F1分数和问答准确率。

Comments Preprint

详情
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
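
The selective substitution step can be pictured as follows. This is a minimal sketch, assuming the patcher has already produced one replacement probability per visual token; the fixed `threshold` and the use of a mean-pooled vector as the "global patch token" are illustrative assumptions, since the abstract does not say how the dynamic patch token is computed.

```python
def global_patch_token(tokens):
    # Assumed stand-in for the dynamic global patch token: mean pooling.
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

def patch_tokens(tokens, replace_probs, threshold=0.5):
    # Replace tokens whose predicted replacement probability exceeds the
    # (hypothetical) threshold; leave all other tokens unchanged.
    patch = global_patch_token(tokens)
    patched, replaced = [], []
    for index, (tok, prob) in enumerate(zip(tokens, replace_probs)):
        if prob > threshold:
            patched.append(list(patch))
            replaced.append(index)
        else:
            patched.append(list(tok))
    return patched, replaced
```

Only the visual token sequence changes here; the backbone that consumes the patched tokens is untouched, matching the abstract's point that the original model is not modified.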

7. 机器人与具身智能 17 篇

2606.12616 2026-06-12 cs.AI cs.CL 新提交

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

PersonaDrive: 面向闭环驾驶模拟的人类风格检索增强VLA智能体

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

发表机构 * University of California, Irvine(加利福尼亚大学尔湾分校)

AI总结 提出PersonaDrive流水线,通过检索风格指令下的人类驾驶演示来调节视觉-语言-动作(VLA)驾驶智能体,实现闭环模拟中多样化的非自车智能体行为,无需针对每种风格重新训练。

详情
AI中文摘要

闭环驾驶模拟器通常在其环境中填充行为大致相同的非自车交通智能体,这些智能体要么由基于规则的交通管理器生成,要么由训练为单一行为模式的学习模型生成。最近的工作通过观测数据上的事后标签或LLM推断的奖励权重引入风格变化,但这些信号充当了风格应奖励什么的代理,而不是明确要求以该风格驾驶的人类演示。我们提出了PersonaDrive,一个流水线,它根据从风格指令的人类驾驶数据集中检索到的演示来调节视觉-语言-动作(VLA)驾驶智能体,在该数据集中,参与者在驾驶员在环平台上以激进、中性和保守指令驾驶CARLA排行榜路线。该流水线包括三个阶段:(i) 使用组合的图像-文本相似度分数对每种风格的人类驾驶数据进行离线三元组挖掘;(ii) 训练一个轻量级检索头,将冻结的视觉特征与每个风格数据库上的小型控制编码器融合;(iii) 微调单个VLA主干,以在航点预测期间将检索到的上下文点视为上下文行为演示。在推理时,通过切换检索头查询的每个风格数据库,相同的主干可以适应任何风格,因此选择风格无需针对每种风格重新训练,同时为闭环模拟启用人类风格、风格多样的非自车智能体。在Bench2Drive上,PersonaDrive(无风格)的驾驶得分比SimLingo高4.6%,比HiP-AD高2.5%,在风格条件下,每种风格都获得最高驾驶得分,波动范围约2%(其最弱风格超过最强基线DMW 5.4%),而从保守指令到激进指令,平均速度和加速度分别提高18%和25%。

英文摘要

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

2606.12550 2026-06-12 cs.RO cs.AI 交叉投稿

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Foresight: 关于导航关键线索的迭代推理

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

发表机构 * UT Austin(德克萨斯大学奥斯汀分校) FieldAI

AI总结 提出Foresight框架,利用微调VLM交替提出和批评图像空间运动计划,通过人类反馈学习奖励模型进行强化学习后训练,实现无地图导航中稀疏语言指令下的迭代运动优化,任务成功率提升37%。

Comments 22 pages, 10 figures, 3 tables

详情
AI中文摘要

从稀疏语言指令进行开放世界无地图导航需要解决未明确指定的目标,并推断哪些环境线索与到达目标相关。例如,到达一个视野外的目的地可能需要解释坡道、标志或绕行路线,这些揭示了去哪里或走哪条路线。先前的工作受限于对已知导航因素和封闭集因素类别的依赖,或者在运动规划之前识别线索而遗漏了依赖于计划的线索。我们认为预训练的视觉语言模型(VLM)可以发现新的指令相关线索,但需要适应以关注哪些线索重要以及它们应如何影响运动规划。我们在Foresight中实现了这些想法,这是一个测试时框架,其中微调的VLM交替提出图像空间运动计划并使用语言目标和视觉上下文对其进行批评。后续计划基于先前的批评,使得在执行前能够进行迭代运动优化。为了将计划批评和优化与开放集行为偏好对齐,我们从人类反馈中学习一个奖励模型,并使用它在计划-批评循环中通过强化学习对VLM进行后训练。在离线评估和6个真实世界环境中,相对于最先进的测试时推理和基础模型基线,Foresight将平均任务成功率提高了37%,并将每次任务的干预次数减少了52%,同时在Jetson AGX Orin上实时运行。我们将发布代码、数据和训练细节,以支持未来关于机器人运动优化的测试时推理工作。更多视频请见:https://amrl.cs.utexas.edu/foresight

英文摘要

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight
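
The alternation between proposing and critiquing described above has the shape of a simple loop in which each new plan is conditioned on all earlier critiques. The sketch below shows only that control flow. In the paper both roles are played by a finetuned VLM; here `toy_propose` and `toy_critique` are hypothetical stand-ins on a toy grid, and `max_rounds` is an assumed budget.

```python
def refine_plan(propose, critique, goal, max_rounds=3):
    # Alternate proposal and critique; stop when the critic has no objection
    # or the round budget is exhausted.
    critiques = []
    plan = None
    for _ in range(max_rounds):
        plan = propose(goal, critiques)
        feedback = critique(goal, plan)
        if feedback is None:
            break
        critiques.append(feedback)
    return plan, critiques

def toy_propose(goal, critiques):
    # Each prior critique shifts the path one unit further from the obstacle.
    offset = len(critiques)
    return [(x, offset) for x in range(goal + 1)]

def toy_critique(goal, plan):
    # In this toy world the obstacle occupies lateral offsets below 2.
    if any(y < 2 for _, y in plan):
        return "path passes too close to the obstacle"
    return None
```

Calling `refine_plan(toy_propose, toy_critique, goal=3, max_rounds=5)` runs three rounds: two rejected plans, then an accepted one at lateral offset 2.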

2606.12603 2026-06-12 cs.RO cs.AI 交叉投稿

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

从模仿到对齐:面向长距离人行道导航的人类偏好流策略

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出FlowPilot,一种仅使用单目RGB相机的无地图导航策略,通过锚定流匹配进行预训练,并引入人类偏好学习实现对齐,在长距离人行道导航中提升鲁棒性和社会合规性。

详情
AI中文摘要

自主长距离人行道导航对于微出行应用(如机器人送餐和辅助电动轮椅)至关重要。与道路上的自动驾驶不同,长距离人行道导航需要在不可预测的人行道地形和行人中精确操作,且感知栈轻量,仅需单个单目RGB相机。虽然从演示中模仿学习(IL)提供了一种实用解决方案,但由此产生的自动驾驶策略常常遭受复合误差、人行道上缺乏社会合规性以及缺乏处理复杂情况的反事实推理能力。为解决这些挑战,我们提出了FlowPilot,一种仅使用单目RGB相机即可实现稳健高效长距离导航性能的无地图导航策略。我们首先提出使用锚定流匹配作为动作表示,用于在大型机器人车队数据上进行策略预训练,并捕捉人行道导航行为的多样、复杂、多模态分布。为弥合模仿与对齐之间的差距,我们进一步设计了一种人在环的偏好学习方案,通过少量人类干预数据调整策略。它增强了模型的反事实推理能力和在人行道上的社会合规性。我们通过在多样化人行道环境中的广泛仿真和真实世界实验评估了FlowPilot。在仿真中,FlowPilot实现了42%的成功率和66%的路线完成率,而FlowPilot-HP进一步提升了真实世界的鲁棒性和社会合规性,相对于基础模型,IR降低了40.0%,NIR降低了52.1%。

英文摘要

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

2606.12690 2026-06-12 cs.RO cs.AI 交叉投稿

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

EWAM:一种用于具身智能闭环在线自适应的增强世界动作模型

Xin Zhou, Cong Miao

发表机构 * Astronex Robotics Nanjing University of Information Science and Technology(南京信息工程大学)

AI总结 提出EWAM架构,基于冻结的Cosmos3骨干网络,通过四个轻量级神经层实现零样本在线自适应,无需微调或额外演示数据,显著减少新任务布局的部署数据需求。

详情
AI中文摘要

在本文中,我们提出了增强世界动作模型(EWAM),这是一种基于预训练且完全冻结的Cosmos3骨干网络构建的闭环在线自适应架构。EWAM完全在零样本任务协议下进行评估,其核心目标是减少适应新任务布局所需的额外部署数据量。值得注意的是,所有评估中均未引入额外的任务特定演示集,也未对骨干网络进行微调。其性能提升完全源于由四个插入的轻量级神经层组成的推理时协同推理机制:位于扩散变换器(DiT)中间层的神经经验记忆层提供任务相关的执行上下文;状态预测头之后的神经异常检测层实时监测预测状态与实际状态之间的差异;神经策略路由层根据异常严重程度动态选择直接执行、保守重规划或回滚恢复;神经动作校正层利用执行诊断优化生成的动作块。与简单的特征融合不同,记忆、异常检测和校正模块以可微分的方式深度集成到Cosmos3的前向路径中,仅最终路由决策是离散监督的。

英文摘要

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
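
The routing behaviour described above, choosing among direct execution, conservative replanning, and rollback recovery according to anomaly severity, can be illustrated with a hand-written rule. Note the substitution: EWAM uses learned neural layers for both anomaly detection and routing, whereas this sketch uses a Euclidean distance and two fixed thresholds. The threshold values and function names are hypothetical.

```python
import math

def anomaly_score(predicted_state, actual_state):
    # Euclidean divergence between predicted and observed state vectors.
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted_state, actual_state))
    )

def route(score, replan_threshold=0.1, rollback_threshold=0.5):
    # Map anomaly severity to one of the three execution modes.
    if score >= rollback_threshold:
        return "rollback_recovery"
    if score >= replan_threshold:
        return "conservative_replanning"
    return "direct_execution"
```

A small divergence leaves the generated action chunk untouched, a moderate one triggers replanning, and a large one falls back to recovery.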

2606.12814 2026-06-12 cs.RO cs.AI 交叉投稿

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Stubborn: 一种用于人形机器人鲁棒运动跟踪与摔倒恢复的流线型统一强化学习框架

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

发表机构 * Southern University of Science and Technology(南方科技大学)

AI总结 提出Stubborn框架,通过非对称Actor-Critic架构、偏航对齐表示、伯努利概率终止机制和自适应采样策略,统一实现人形机器人的运动跟踪与摔倒恢复,在性能与鲁棒性上超越现有方法。

详情
AI中文摘要

最近的强化学习方法在改善人形机器人运动跟踪性能和实现扰动下的摔倒恢复方面显示出巨大潜力。然而,现有大多数工作将运动跟踪和摔倒恢复视为不同任务,需要多阶段训练,并配备专门的恢复奖励和/或独立的恢复策略。此外,现有的基于强化学习的方法通常在严重跟踪失败后立即终止训练回合,限制了在不稳定或摔倒状态下的恢复导向探索。为了解决上述问题,我们提出了Stubborn,一个流线型统一的强化学习框架,用于实现鲁棒的人形机器人运动跟踪和摔倒恢复。具体来说,Stubborn采用非对称Actor-Critic架构,包含三个主要组件。首先,采用偏航对齐的跟踪表示,以减少对全局漂移和航向扰动的敏感性,同时保留与重力相关的平衡信息。其次,我们引入基于伯努利的概率终止机制,鼓励策略在不同失败模式下探索摔倒恢复行为。第三,我们提出一种由概率终止和跟踪误差驱动的策略,根据跟踪性能动态重塑采样分布,提高困难运动片段和不稳定状态的训练效率。与最先进方法的广泛比较和消融研究表明,Stubborn取得了有竞争力的性能,所提出的概率终止机制和自适应采样策略有助于性能和鲁棒性的提升。真实世界演示请参见 https://aislab-sustech.github.io/Stubborn/

英文摘要

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
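
The Bernoulli-based termination idea can be sketched in a few lines: instead of always ending an episode when a severe tracking failure is detected, the episode ends only with some probability, so the policy sometimes keeps acting from unstable or fallen states. This is a minimal sketch under assumptions: the abstract does not say how the termination probability is set, so `terminate_prob` is a hypothetical constant here.

```python
import random

def should_terminate(failed, terminate_prob, rng):
    # Without a failure the episode always continues.
    if not failed:
        return False
    # On failure, draw a Bernoulli sample: terminate with terminate_prob,
    # otherwise keep the episode alive to allow recovery exploration.
    return rng.random() < terminate_prob
```

Setting `terminate_prob` to 1.0 recovers the conventional hard-termination rule, while lower values leave more failed episodes running.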

2606.13097 2026-06-12 cs.PL cs.AI 交叉投稿

Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents

功能缓存嫁接:具身智能体的鲁棒且快速代码策略合成

Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn, Honguk Woo

AI总结 提出FCGraft框架,通过维护函数级验证代码骨架及其键值缓存,对新任务进行缓存嫁接(拼接和修补),减少预填充计算并复用验证结构,实现更鲁棒和快速的策略合成。

Comments Accepted at ICML 2026

详情
AI中文摘要

编写代码的大型语言模型(CodeLLMs)通过将自然语言目标和环境约束转化为结构化控制程序,为具身智能体生成可执行的代码策略。然而,在开放域具身环境中,策略生成存在两个基本限制:(i) 由于长提示上的重复预填充计算导致的延迟解码,以及(ii) 由于完全生成式解码导致的鲁棒性有限,这常常产生API不匹配、缺少安全防护和不稳定的控制逻辑。为了解决这些限制,我们提出了FCGraft,一种功能缓存嫁接框架。FCGraft维护一个函数级验证代码骨架库及其相关的提示级Transformer键值(KV)缓存,并在提供新任务时通过检索相关函数并嫁接其KV缓存来合成新策略。给定检索到的函数缓存,FCGraft通过拼接(将缓存的函数片段组合成复合策略)和修补(仅局部调整必要的代码区域以满足任务特定参数和约束,且只需最少的额外解码)进行缓存嫁接。通过消除冗余的预填充计算,该方法减少了生成延迟,同时重用经过验证的控制结构提高了鲁棒性,相比提示级缓存方法RAGCache,任务成功率提高了18.31%,策略合成速度提高了2.3倍。

英文摘要

Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness over prompt-level caching methods RAGCache, achieving 18.31% higher task success rate and 2.3x faster policy synthesis.

2606.13222 2026-06-12 cs.RO cs.AI 交叉投稿

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

本体感觉-视觉对应使人形机器人能够区分自我与他人

Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang, Hongkai Xiong, Wenjun Zeng, Wentao Zhu

发表机构 * Eastern Institute of Technology, Ningbo(宁波东方理工大学) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学) East China Normal University(华东师范大学) Ningbo Institute of Digital Twin(宁波数字孪生研究院)

AI总结 提出通过本体感觉与视觉的对应学习自我-他人区分,无需身份标签或运动学模型,并建立预测性自我模型,支持目标到达、碰撞感知运动规划和运动重定向。

Comments 23 pages, 9 figures, 1 supplementary table

详情
AI中文摘要

区分自我与他人是社会智能的前提,然而日益与人类共享工作空间的人形机器人仍然缺乏这种能力。在这里,我们展示了人形机器人可以通过本体感觉-视觉对应学习自我-他人区分,无需任何身份标签或运动学模型。一旦建立,这种区分可引导出一个预测性自我模型,该模型将关节配置映射到三维身体占用,捕捉机器人身体如何随动作变化。在涉及人类或形态相同机器人的多智能体场景中,系统能可靠地识别自身,学习三维自我模型,并支持下游任务,包括目标到达、碰撞感知运动规划和人到机器人的运动重定向。这些结果共同勾勒出一条路径,使在共享物理环境中与他者共同行动和协调的机器人具备身体自我表征。项目页面:https://euron-zc.github.io/humanoid-self-model/

英文摘要

Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron-zc.github.io/humanoid-self-model/.

2606.13256 2026-06-12 cs.RO cs.AI 交叉投稿

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

幽默风格驱动笑声,话题塑造可接受性:评估机器人讲述的双语个人与政治AI笑话

Anna-Maria Velentza, Anne-Gwenn Bosser

发表机构 * Univ Brest-Bretagne INP, COMMEDIA team, Lab-STICC CNRS UMR 6285(布列塔尼大学-INP,COMMEDIA团队,Lab-STICC CNRS UMR 6285)

AI总结 本研究通过混合因素设计,评估机器人用双语讲AI生成笑话时,幽默类型(亲和、自我增强、攻击、自贬)和内容(个人vs政治)对趣味性和适当性的影响,发现幽默类型显著影响趣味性,内容影响适当性,语言偏好受内容及参与者流利度影响。

Comments Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

详情
AI中文摘要

幽默在人类社交关系中扮演核心角色,计算幽默的最新进展为将幽默融入人机交互(HRI)创造了新机会。虽然大型语言模型(LLMs)能生成多种形式的幽默,但在群体环境中,幽默风格、笑话内容和语言偏好如何影响对机器人传递幽默的感知仍不清楚。在这项探索性研究中,我们采用混合因素设计,让参与者在大学教室中评估由机器人传递的AI生成笑话。我们考察了幽默类型(亲和型、自我增强型、攻击型、自贬型)和笑话内容(个人相关vs政治)对感知趣味性和适当性的影响,以及语言偏好。结果表明,幽默类型显著影响趣味性,攻击型和亲和型幽默评分更高;而笑话内容主要影响适当性,个人相关笑话优于政治笑话。语言偏好受笑话内容和参与者自我报告的流利度及幽默实践的影响。

英文摘要

Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.

2606.13355 2026-06-12 cs.RO cs.AI 交叉投稿

Real-Time Execution with Autoregressive Policies

基于自回归策略的实时执行

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

发表机构 * Korea Institute of Science and Technology(韩国科学技术研究院) Seoul National University(首尔大学) Google Research(谷歌研究院)

AI总结 通过异步推理和约束解码实现自回归策略的实时执行,在保证低延迟的同时提升任务完成速度,实验表明其性能优于流匹配策略。

详情
AI中文摘要

实时执行依靠异步推理同时保证平滑的动作轨迹和快速响应,对于大规模视觉-语言-动作模型的实际部署至关重要。然而,近期关于实时执行的工作主要关注扩散策略的变体,尽管自回归策略在同步推理中推演速度较慢,实时执行对其更为关键。相比之下,我们证明自回归策略可以通过调整分词时域和应用约束解码来实现实时执行,从而保证严格的延迟界限,并支持多轨迹解码以最大化性能。在模拟和真实环境中,我们发现自回归策略始终优于同等水平的流匹配策略,同时相比同步推理显著提升了任务完成速度。结合自回归策略的固有优势(如更快的收敛速度和更好的指令遵循泛化能力),这些结果证实自回归策略仍可作为一种支持实时执行的有竞争力的策略类型。

英文摘要

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.13503 2026-06-12 cs.CV cs.AI cs.RO 交叉投稿

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

异构激光雷达早期融合与学习重排序策略用于非结构化环境中的鲁棒长期地点识别

Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente, Luis Payá

发表机构 * Miguel Hernández University of Elche(米格尔·埃尔南德斯·德埃尔切大学)

AI总结 提出MinkUNeXt-VINE++方法,通过异构LiDAR数据早期融合和学习重排序策略,在非结构化环境(如葡萄园)中显著提升长期地点识别性能,Recall@1指标提升20%-30%。

详情
AI中文摘要

在非结构化环境(如农田)中,鲁棒定位是自主系统的关键挑战。LiDAR传感器提供环境的详细3D信息,且不受光照条件影响,因此基于LiDAR的地点识别方法备受关注。本文提出MinkUNeXt-VINE++,一种结合两个传感器(Livox Mid-360和Velodyne VLP-16)异构LiDAR数据早期融合与推理时学习重排序策略的新方法。这种融合利用每个传感器的优势,提供更全面的环境表示。此外,重排序方法在重复环境(如葡萄园)中尤为重要,因为找到真正匹配是一项重大挑战。我们使用TEMPO-VINE数据集评估了该方法,该数据集提供了不同物候阶段葡萄园环境中的异构LiDAR数据。结果表明,与单传感器方法和现有最优方法相比,MinkUNeXt-VINE++显著提升了地点识别性能。与单传感器方法相比,MinkUNeXt-VINE++在Recall@1指标上提升了20%,加入重排序后提升30%。我们的方法代码已公开,可复现结果。

英文摘要

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) with a learned re-ranking strategy at inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, where finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and a 30% improvement when re-ranking is included. The code of our method is publicly available for reproduction.
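Recall@1, the metric quoted above, is the fraction of queries whose top-ranked database match is a true revisit of the same place. The following is a minimal sketch of that computation, assuming descriptors are compared by Euclidean distance and that a ground-truth set of positive database indices exists for each query. The function name, the toy descriptors, and the ground truth are all hypothetical; this is not the MinkUNeXt-VINE++ evaluation code.

```python
import math

def recall_at_1(queries, database, positives):
    """Fraction of queries whose nearest database descriptor is a true match.

    queries, database: lists of equal-length float vectors (toy descriptors).
    positives: positives[i] is the set of database indices that count as the
    same place as query i (assumed ground truth).
    """
    hits = 0
    for i, q in enumerate(queries):
        # Index of the database descriptor closest to q in Euclidean distance.
        nearest = min(range(len(database)), key=lambda j: math.dist(q, database[j]))
        if nearest in positives[i]:
            hits += 1
    return hits / len(queries)

# Toy data: the third query's true match (index 3) is farther away than a
# look-alike (index 2), mimicking a repetitive environment.
queries = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
database = [[0.1, 0.0], [0.9, 1.2], [4.0, 4.0], [9.0, 9.0]]
positives = [{0}, {1}, {3}]
score = recall_at_1(queries, database, positives)
print(f"Recall@1 = {score:.3f}")
```

The abstract does not say whether the reported 20% and 30% gains are relative or in percentage points, so the sketch only illustrates the metric itself.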

2606.13509 2026-06-12 cs.CV cs.AI 交叉投稿

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

基于测量校准的多相机融合用于视觉室内定位

Mateo Toro Diz, Jonathan Hoss, Noah Klarmann

发表机构 * Rosenheim Technical University of Applied Sciences(罗森海姆应用技术大学)

AI总结 提出测量校准融合方法,通过显式量化单相机定位误差(单应校准、人体检测、运动跟踪)来优化多相机数据融合,实验表明该方法虽未显著提升绝对精度,但有效降低了轨迹方差并提高了运动平滑性。

Comments This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

基于视觉的室内定位系统受到检测噪声、遮挡和有限相机覆盖的影响,导致流程多个阶段存在不确定性。虽然多相机数据融合被广泛用于缓解这些问题,但通常被视为黑箱组件并仅通过端到端评估,掩盖了其机制贡献。为弥补这一不足,本文研究是否可以利用显式表征单相机定位误差来校准和优化多相机数据融合。我们提出了一种测量校准融合方法,该方法集成了组件级误差量化,具体分离了单应校准、人体检测和运动跟踪。进行了组件级评估以量化单应校准、人体检测和运动跟踪的误差贡献。实验结果表明,与单相机基线相比,数据融合提高了定位精度。虽然测量校准融合在绝对精度上相比标准融合仅提供有限的改进,但它显著降低了轨迹方差并提高了运动平滑性,这对于需要稳定连续运动估计的应用至关重要。这些结果突显了在设计基于视觉的室内定位系统的数据融合策略时,显式误差表征的价值。

英文摘要

Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, specifically isolating homography calibration, human detection, and motion tracking. A component-wise evaluation is conducted to quantify error contributions from homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Mana: Dexterous Manipulation of Articulated Tools

Mana: 铰接工具的灵巧操作

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

发表机构 * UC Berkeley(加州大学伯克利分校) CMU(卡内基梅隆大学) Stanford University(斯坦福大学) Amazon FAR(亚马逊FAR)

AI总结 提出Mana框架,将灵巧操作重解释为动画问题,通过粗到细的流水线自动生成操作轨迹,实现铰接工具的零样本仿真到现实迁移。

Comments Project Page: https://zhaohengyin.github.io/mana

详情
AI中文摘要

铰接工具的操作由于需要协调内部自由度与接触丰富的交互,仍然是灵巧机器人学中的一个主要挑战。虽然先前的工作主要集中在刚性物体上,但铰接工具的使用由于其物理复杂性以及学习功能性抓取和操作策略的困难,仍未得到充分探索。我们提出了Mana(操作动画器),一个通用的仿真到现实框架,将灵巧操作重新解释为动画问题。受计算机动画启发,Mana采用粗到细的流水线,通过运动规划和强化学习将程序生成的抓取关键帧转化为操作轨迹。数据生成过程基本自动化,仅需几次鼠标点击即可指定功能可供性(每个工具不到1分钟)。在跨越不同尺度和关节类型的四个铰接工具上,Mana实现了抓取和手内操作的零样本仿真到现实迁移,展示了灵巧铰接工具操作的可扩展方法。

英文摘要

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2601.21570 2026-06-12 cs.AI cs.RO 版本更新

From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence

从数字到物理:数字代理作为物理智能的自主教练

Zixing Lei, Genjia Liu, Yuanshuo Zhang, Qipeng Liu, Yuzhu Cai, Sixiang Chen, Jixian Wu, Yunhong Wang, Weixin Li, Chuan Wen, Bo Zhao, Shanghang Zhang, Wenzhao Lian, Siheng Chen

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China(上海交通大学人工智能学院) Zhongguancun Academy, Beijing, China(中关村学院) School of Integrated Circuits, Shanghai Jiao Tong University, Shanghai, China(上海交通大学集成电路学院) School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China(北京大学计算机科学学院多媒体信息处理国家重点实验室)

AI总结 提出EmboCoach-Bench基准,评估LLM代理自主设计具身策略的能力,通过迭代调试和优化,代理在平均成功率上超越人工基线26.5%,并具备自我修正能力。

Comments 53 pages, 12 figures

详情
AI中文摘要

具身AI领域正朝着通用机器人系统快速发展,得益于高保真模拟和大规模数据收集。然而,这种扩展能力仍然受到劳动密集型人工监督的严重瓶颈,从复杂的奖励塑造到跨异构后端的超参数调整。受LLM在软件自动化和科学发现中成功的启发,我们引入了EmboCoach-Bench,一个评估LLM代理自主设计具身策略能力的基准。涵盖32个专家精选的RL和IL任务,我们的框架将可执行代码作为通用接口。我们超越静态生成,评估动态闭环工作流,其中代理利用环境反馈迭代地起草、调试和优化解决方案,涵盖从物理信息奖励设计到扩散策略等策略架构的改进。广泛评估得出三个关键见解:(1)自主代理在平均成功率上可以定性超越人工设计的基线26.5%;(2)具有环境反馈的代理工作流有效增强了策略开发,并显著缩小了开源和专有模型之间的性能差距;(3)代理对病态工程案例表现出自我修正能力,通过迭代仿真循环调试成功从近乎完全失败中恢复任务性能。最终,这项工作为自我进化的具身智能奠定了基础,加速了具身AI领域从劳动密集型手动调优到可扩展自主工程的范式转变。

英文摘要

The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in the embodied AI field.

2602.04208 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

SCALE: 基于自不确定性条件自适应观察与执行的视觉-语言-动作模型

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

发表机构 * Seoul National University(首尔大学)

AI总结 提出SCALE推理策略,利用自不确定性联合调节视觉感知和动作,无需额外训练或验证器,仅单次前向传播,提升VLA模型在模拟和真实环境中的鲁棒性。

Comments ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人控制的一种有前景的范式,测试时缩放(TTS)在增强训练外鲁棒性方面受到关注。然而,现有的VLA TTS方法需要额外训练、验证器和多次前向传播,使其部署不切实际。此外,它们仅干预动作解码,而保持视觉表示固定——在感知模糊的情况下不足,此时重新考虑如何感知与决定做什么同样重要。为解决这些限制,我们提出SCALE,一种简单的推理策略,基于“自不确定性”联合调节视觉感知和动作,受主动推理理论中不确定性驱动探索的启发——无需额外训练、无需验证器,且仅需单次前向传播。SCALE在高不确定性下拓宽感知和动作的探索,而在自信时聚焦于利用——实现在不同条件下的自适应执行。在模拟和真实世界基准上的实验表明,SCALE改进了最先进的VLA模型,并优于现有TTS方法,同时保持单次前向传播的效率。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed -- insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory -- requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident -- enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

2606.10683 2026-06-12 cs.RO cs.AI cs.CV 版本更新

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

UniDexTok:基于真实数据的统一灵巧手分词器

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Hefei University of Technology(合肥工业大学) Rimbot Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出统一灵巧手模型(UDHM)将人手和机器人手状态映射到共享22自由度语义接口,并基于此开发UniDexTok,一种免重定向的状态分词器,学习基于真实关节状态的离散token,实现异构灵巧手的统一表示,误差降低98%以上。

详情
AI中文摘要

灵巧手对于精细操作至关重要,但其硬件设计在不同本体之间存在显著差异。运动学、关节定义和自由度方面的差异使得定义共享状态表示变得困难,与平行夹爪相比更是如此。因此,灵巧手数据仍然碎片化,难以用于联合训练。在这项工作中,我们提出了统一灵巧手模型(UDHM),它将人手和机器人手状态映射到一个共享的22自由度语义接口。基于UDHM,我们引入了UniDexTok,一种免重定向的状态分词器,它从标准化的真实关节状态中学习以本体为条件的离散token。UniDexTok为异构灵巧手提供了统一表示,无需依赖重定向或仿真数据。与最近的基线UniHM相比,UniDexTok将MPJAE从15.63度降低到0.16度,MPJPE从18.51毫米降低到0.18毫米,误差分别减少了98.98%和99.03%。这些结果将重建精度从厘米级提升到亚毫米级。实验进一步表明,来自其他本体的数据提高了目标本体的重建精度,证明了跨本体分词的优势。当引入新的灵巧手时,UniDexTok还表现出强大的零样本和少样本重建能力。

英文摘要

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
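The two percentage reductions quoted in the abstract follow directly from the before/after error values it reports. A short check of that arithmetic, using only the numbers stated above (the helper name is illustrative):

```python
def relative_reduction(before, after):
    """Relative error reduction in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0

mpjae = relative_reduction(15.63, 0.16)  # mean per-joint angle error, degrees
mpjpe = relative_reduction(18.51, 0.18)  # mean per-joint position error, mm

print(f"MPJAE reduction: {mpjae:.2f}%")
print(f"MPJPE reduction: {mpjpe:.2f}%")
```

Both values round to the figures in the abstract (98.98% and 99.03%), and 18.51 mm versus 0.18 mm is consistent with the stated move from centimeter-scale to sub-millimeter reconstruction.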

2606.11092 2026-06-12 cs.RO cs.AI 版本更新

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

RoboNaldo:通过运动引导课程强化学习实现精准、稳定且强力的人形足球射门

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

发表机构 * The University of Hong Kong(香港大学) The Chinese University of Hong Kong(香港中文大学) Archon Robotics

AI总结 提出三阶段运动引导课程强化学习框架RoboNaldo,从单一人踢参考逐步优化射门性能,在仿真中射门误差降低48.6%、速度提升2.96倍,真实机器人上3米外平均射门误差0.73-0.86米,触球后球速达13.10米/秒。

详情
AI中文摘要

精英级人形足球射门需要全身稳定性、高冲量全身交互以及目标精度。运动跟踪驱动的强化学习提供了全身运动协调的稳定性,但固定参考难以适应不同的球位和击球时机;相比之下,任务奖励驱动的强化学习难以从零开始探索和发现有效的踢球动作。因此,我们引入了RoboNaldo,一个用于高冲量人形交互的三阶段运动引导课程强化学习框架。使用单一人踢参考作为支架,并逐步将优化转向射门性能。课程首先学习稳定的全身踢球先验,然后使踢球适应任意静止球位的任意球场景,最后通过运动指令和踢球触发接口扩展到移动球射门。训练期间,一个高级启发式规划器控制该接口,而推理时其他高级控制器可驱动相同的低级策略。在仿真中,RoboNaldo的任意球射门误差比先前工作基线低48.6%,射门速度高2.96倍。在真实世界中,使用搭载机载感知的宇树G1,RoboNaldo在3米距离的任意球和移动球情况下,平均目标射门误差分别为0.73米和0.86米。触球后球速达到13.10米/秒,是职业比赛开放射门速度的59-71%。项目页面:https://opendrivelab.com/RoboNaldo

英文摘要

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves 48.6% lower free-kick shot error and 2.96x the shot velocity of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.

2606.11767 2026-06-12 cs.RO cs.AI 版本更新

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

通过真实到仿真到真实触觉策略学习的盲操作灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

AI总结 提出一种结合Real2Sim触觉校准、布局感知触觉编码器和触觉条件扩散策略的框架,实现仅依赖触觉的灵巧手盲抓取,在真实机器人上对20个物体达到27%成功率。

Comments 23 pages, 6 figures

详情
AI中文摘要

使用灵巧手进行盲抓取是一项关键的操作能力。然而,由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力,为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距,我们提出了一个仅依赖触觉的盲抓取框架,该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先,我们引入了一个Real2Sim触觉校准流程,构建了一个接触校准的数字孪生模拟器,能够复现真实的触觉信号。其次,我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力,该编码器通过自监督预训练融入了传感器几何先验。第三,为了提高对未见物体的泛化能力,我们在校准后的模拟器中训练了特定物体的强化学习专家,并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法,涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率,无需真实世界的抓取演示或视觉输入。仿真消融实验表明,布局感知触觉预训练提高了抓取性能,而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明,接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面:Dex-Blind-Grasp.github.io

英文摘要

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: Dex-Blind-Grasp.github.io.

8. 可信、安全与AI治理 37 篇

2606.12747 2026-06-12 cs.AI 新提交

Prefill Awareness in Large Language Models

大型语言模型中的预填充感知

Andy Wang, Parv Mahajan, David Demitri Africa, Alexandra Souly, Jordan Taylor, Robert Kirk

发表机构 * Constellation University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Georgia Institute of Technology(佐治亚理工学院) UK AI Security Institute(英国人工智能安全研究所)

AI总结 研究大型语言模型能否识别并响应其助手消息被预填充或篡改,发现前沿模型具有显著预填充感知能力,可能影响安全评估方法。

Comments Submitted to NeurIPS 2026

详情
AI中文摘要

语言模型的安全相关研究,包括对齐和越狱评估以及AI控制协议,通常依赖于预填充模型输出。如果AI模型能够识别并利用其先前的助手消息被插入或编辑这一事实,这些方法的有效性和可信度可能会受到损害。我们调查了前沿语言模型是否能区分被篡改和未被篡改的助手侧上下文,我们将这种能力称为预填充感知。为此,我们构建了一个跨三种预填充机制的二元偏好基准,筛选出模型表现出一致立场的案例。我们发现前沿模型表现出显著的预填充感知:Claude Opus 4.5在9-35%的案例中检测到与其偏好相反的预填充,且在提示时假阳性率为0%;此外,模型通常会恢复到基线行为,而不会明确报告预填充是外来的。受控消融实验进一步表明,检测和抵抗依赖于不同的线索,其中风格不匹配主要影响模型是否将预填充标记为外来,而偏好不匹配主要影响模型是否恢复到其基线答案。我们还检查了更真实的智能体设置,如错位延续评估和SWE-bench轨迹,在这些设置中,前沿模型有时会否认预填充的助手轮次,其方式强烈依赖于数据集、任务成功和隐藏的格式伪影。我们的结果表明,预填充感知已经是一些基于预填充的方法的重要混淆因素。我们建议模型开发者在前沿系统中跟踪这种能力。

英文摘要

Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize and act on the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate when prompted; additionally, models often revert towards baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations also show that detection and resistance rely on different cues: stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.

2606.12797 2026-06-12 cs.AI 新提交

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

遏制缺口:已部署的自主AI框架如何未能满足面向公众的安全要求

Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

AI总结 研究发现主流自主AI框架缺乏架构级安全保证,内存完整性漏洞可导致定向数据污染,提出轻量级遏制机制消除攻击向量。

Comments ICML 2026 (AI4GOOD Workshop)

详情
AI中文摘要

自主调用工具、维护持久内存并执行多步计划的大语言模型系统越来越多地部署在面向公众的领域,包括政府服务、医疗分诊和财务咨询。我们询问用于构建这些系统的框架是否提供架构级结构安全保证。应用从自主架构的组合模型导出的六项遏制原则,我们审计了三个主流框架(LangChain、AutoGPT和OpenAI Agents SDK),发现没有一个原生合规。内存完整性,一种针对最普遍漏洞类别之一的防御,在三个评估框架中均未观察到。我们通过实证验证这些发现:在基于LangChain构建的模拟政府福利代理中,单次内存投毒写入在所有测试种子和后端上引起持久的定向数据污染,使目标申请人的错误拒绝率升至88.9%。在复杂的五因素政策下,同一攻击保持总体准确率,同时将目标错误拒绝率提高3.5倍,使这种污染难以通过标准监控检测。然后我们引入两种轻量级遏制机制:内存完整性验证器和策略门,它们以亚毫秒开销(每次调用<0.2ms)消除了两种攻击向量。我们得出结论,当前的自主框架生态系统可能尚未满足面向公众部署的默认安全期望,并概述了优先架构干预措施,以实现在高风险、对社会有影响的应用程序中的可信部署。

英文摘要

Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.
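The abstract names two containment mechanisms, a memory integrity validator and a policy gate, but does not describe how they are built. The sketch below shows one plausible shape for each, assuming memory entries are tagged with an HMAC at write time and actions are checked against an allow-list. The class, function, key, and action names are hypothetical and are not taken from the paper or from LangChain.

```python
import hashlib
import hmac

# Hypothetical key for the demo; a real system would load it from a secret store.
SECRET_KEY = b"demo-key-not-for-production"

def sign(entry: str) -> str:
    """Return an HMAC-SHA256 tag for a memory entry."""
    return hmac.new(SECRET_KEY, entry.encode("utf-8"), hashlib.sha256).hexdigest()

class ValidatedMemory:
    """Toy agent memory that stores (entry, tag) pairs and rejects tampered entries on read."""

    def __init__(self):
        self._store = {}

    def write(self, key: str, entry: str) -> None:
        self._store[key] = (entry, sign(entry))

    def read(self, key: str) -> str:
        entry, tag = self._store[key]
        # Constant-time comparison of the stored tag against a fresh one.
        if not hmac.compare_digest(tag, sign(entry)):
            raise ValueError(f"memory entry {key!r} failed integrity check")
        return entry

ALLOWED_ACTIONS = {"approve", "deny", "escalate"}  # hypothetical allow-list

def policy_gate(action: str) -> str:
    """Pass an action through only if it is on the allow-list."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {action!r} blocked by policy gate")
    return action

memory = ValidatedMemory()
memory.write("rule", "income below threshold -> approve")

# Simulate a poisoning write that bypasses write() and reuses the stale tag.
_, old_tag = memory._store["rule"]
memory._store["rule"] = ("always deny applicant 42", old_tag)

try:
    memory.read("rule")
    tamper_detected = False
except ValueError:
    tamper_detected = True
print("tamper detected:", tamper_detected)
```

A tag check of this kind only catches writes that bypass the signing path; an attacker who can call the legitimate `write` method would need to be stopped by access control on who may write, which this sketch does not model.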

2606.12848 2026-06-12 cs.AI econ.GN q-fin.EC 新提交

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

(人类的)注意力(仍然)就是一切:人类监督使AI辅助的社会科学变得可靠

Chen Zhu, Xiaolu Wang, Weilong Zhang

发表机构 * China Agricultural University(中国农业大学) University of Cambridge(剑桥大学)

AI总结 提出人机协同决策架构HLER,通过预承诺、决策排序、问责和注意力分配,将AI辅助研究的失败率从72%降至16%。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于曾经只有训练有素的研究人员才能完成的任务,包括假设生成、规范选择和结论起草。我们认为,AI辅助研究的可靠性不仅取决于模型能力,还取决于认知劳动在人与机器之间的分配方式。我们通过人机协同经济研究(HLER)来研究这个问题,这是一种基于预承诺、决策排序、问责和注意力分配的决策架构。在一个预先指定的2×4因子实验中,涉及四个数据集的280个完整研究运行,无约束的多智能体基线在72%的运行中产生了关键失败。使用相同的底层模型、相同的智能体分解以及共享推理智能体的相同提示,HLER通过施加三个架构承诺将失败率降低到16%:LLMs进行推理但不执行数据工作,数据和估计以确定性方式处理,以及三个人类决策门约束工作流程。Fisher精确检验在p<0.001水平上拒绝失败率相等的假设。可靠性增益在公开代表性最低的数据集(一份清代人口登记册)上最大,这与基于任务的产出质量服从弗雷歇分布的生产模型一致。一项80次运行的消融研究表明,确定性计算和人类决策门独立贡献,并存在互补性的探索性证据。我们将HLER解释为一种研究框架而非自主的AI科学家:它大幅减少失败,使残留的弱点更加可见,并防止不可靠的主张作为可发表的成果被提出。

英文摘要

Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Frechet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.

2606.12900 2026-06-12 cs.AI cs.CL cs.LG 新提交

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

零源大语言模型幻觉检测:类人类标准探测

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

AI总结 提出HCPD范式,通过类人类标准探测机制模拟人类评估者的多面推理,结合奖励对齐和多样本聚合,实现零源条件下的有效可解释幻觉检测。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)常因生成事实错误或不忠实的内容而产生幻觉,对其安全使用构成重大风险。在零源约束下,即无法获取模型内部信息或外部参考,检测必须仅依赖于文本查询-答案对,检测此类幻觉尤为困难。本文提出用于幻觉检测的类人类标准探测(HCPD)范式,该范式模拟人类评估者的多面推理。其核心是类人类标准探测(HCP)机制,其中LLM代理自适应地将其判断分解为一组可解释的加权标准,并将特定标准得分聚合为最终的真实性度量。为实现这种自适应能力,我们引入了一种基于奖励的对齐方案,仅使用来自语义一致性的弱监督。在推理时,我们采用多样本聚合策略,确保决策稳健的同时保持完全可解释性。我们进一步提供了支持我们方法可靠性的理论分析。大量实验表明,HCPD始终优于最先进的基线,为零源幻觉检测提供了一种有效且可解释的解决方案。代码见 https://github.com/TRISKEL10N/HCPD

英文摘要

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.
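The aggregation step described above (criterion-specific scores combined by weights, then pooled over several sampled judgments) can be illustrated with a small sketch. It assumes scores in [0, 1], a weighted average within each sample, and a plain mean across samples. The criterion names, the example numbers, and the 0.5 threshold are invented for illustration; the abstract does not specify HCPD's actual aggregation rule.

```python
from statistics import mean

def criterion_score(weights, scores):
    """Weighted average of criterion-specific scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total

def aggregate_samples(samples):
    """Mean of the weighted scores over several sampled judgments.

    samples: list of (weights, scores) pairs; each sampled judgment may
    propose its own set of criteria (assumed behavior).
    """
    return mean(criterion_score(w, s) for w, s in samples)

# Two hypothetical sampled judgments for one query-answer pair.
samples = [
    ({"factual": 0.5, "consistent": 0.3, "specific": 0.2},
     {"factual": 0.2, "consistent": 0.5, "specific": 0.9}),
    ({"factual": 0.6, "consistent": 0.4},
     {"factual": 0.1, "consistent": 0.4}),
]

truthfulness = aggregate_samples(samples)
is_hallucination = truthfulness < 0.5  # hypothetical decision threshold
print(f"truthfulness = {truthfulness:.3f}, hallucination = {is_hallucination}")
```

Because each sample keeps its own weights and per-criterion scores, the final number can be traced back to the criteria that drove it, which is the sense of interpretability the sketch is meant to convey.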

2606.13282 2026-06-12 cs.AI 新提交

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

ERTS: 通过有界后果空间中的语义扰动进行伦理AI的对抗鲁棒性测试

Pratyush Chaudhari

发表机构 * Pratyush Chaudhari(普拉蒂什·查德哈里)

AI总结 提出伦理鲁棒性测试系统(ERTS),通过有界伦理后果空间、语义扰动和领域自适应评估,测试AI在伦理推理中的对抗鲁棒性,实验表明仅33%模型通过测试。

Comments 8 pages, 10 tables

详情
AI中文摘要

随着AI系统在医疗分诊、自动驾驶和就业筛选等高风险的伦理场景中部署,评估其对伦理推理的对抗性操纵鲁棒性的形式化方法仍不成熟。本文介绍了伦理鲁棒性测试系统(ERTS),一个闭环管道框架,它:(1) 将伦理困境编码为基于既定伦理理论的22维伦理后果空间(ECS);(2) 应用17种语义扰动函数,受6种有效性约束类别(包括一种新颖的语义一致性约束)约束;(3) 通过4分量伦理不稳定性指数(EII)测量决策偏差;(4) 生成领域自适应的部署前鲁棒性评估判定。我们评估了4个结构化基线模型和2个生产级LLM(Gemini 2.0 Flash和Llama 3.2),涵盖8个部署领域的50个伦理场景,生成了1500个对抗测试用例。结果表明,仅33%的模型通过评估审核,其中本地Llama-3.2模型特别容易受到公平性破坏和信息退化攻击(ERS = 0.737)。据我们所知,现有框架中没有将有限伦理后果空间、语义一致性约束和领域自适应评估结合在单个对抗测试管道中的。

英文摘要

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

2606.13621 2026-06-12 cs.AI cs.CR cs.GT cs.LG cs.MA 新提交

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

超越运行时强制:作为对抗网络可防御性分析的盾牌合成

Achraf Hsain, Sultan Almuhammadi

发表机构 * Information and Computer Science Department, King Fahd University of Petroleum and Minerals(信息与计算机科学系,法赫德国王石油矿产大学)

AI总结 提出将盾牌合成重新解释为设计时分析工具,通过约束双人安全博弈生成可防御性判定,并融合拓扑度量和强化学习行为形成可防御性指纹,揭示系统安全的结构性见解。

Comments 26 pages, 7 figures, 7 tables. Under review at JAIR. Code: https://github.com/AchrafHsain7/Bastion

详情
AI中文摘要

盾牌强化学习通常被呈现为一种运行时安全机制,它将时序逻辑规范编译成限制智能体行为的自动机。我们认为这是错误的产品。同样的自动机理论机制——规范编译、乘积博弈构建、吸引子计算和获胜区域提取——更适合被解读为一种设计时分析工具,其输出是关于系统的结构性见解,而非对已部署智能体的运行时约束。我们通过一个用于网络防御的约束双人安全博弈来实例化这一点。两个规范被不对称地执行:防御者规范定义了博弈的不安全区域,而攻击者规范在吸引子计算期间限制了对手的合法行为。求解该博弈产生一个可防御性判定——一个形式化证书,表明拓扑-规范对是否可防御——以及相关的获胜区域和盾牌。除了二元判定,我们还从吸引子结构中推导出拓扑级度量,并将其与盾牌约束的对抗性多智能体强化学习的后收敛行为相结合。这些共同构成了一个可防御性指纹,捕捉了网络的形式安全属性及其在自适应博弈下的操作行为。假设分析表明,形式可防御性和操作有效性捕捉了安全的不同方面:小的架构变化可能导致操作结果的巨大变化,而形式安全裕度几乎不变。因此,盾牌合成最有价值的不是作为安全智能体的部署机制,而是作为回答关于系统是否、在哪里以及如何可以被防御的架构问题的框架。可防御性判定是输出,而非安全策略。

英文摘要

Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
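The attractor computation and winning-region extraction named in the abstract can be illustrated on a toy turn-based safety game. This is a minimal sketch under stated assumptions, not the authors' Bastion code: the graph, the node names (`d0`, `a1`, `a2`, `bad`), the `attacker_legal` predicate standing in for the attacker specification, and the deadlock convention are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, List[str]]


def attacker_attractor(
    succ: Graph,
    owner: Dict[str, str],
    unsafe: Set[str],
    attacker_legal: Callable[[str, str], bool] = lambda u, v: True,
) -> Set[str]:
    """Least fixpoint of nodes from which the attacker can force a visit to `unsafe`."""
    attr = set(unsafe)
    changed = True
    while changed:
        changed = False
        for u in succ:
            if u in attr:
                continue
            if owner[u] == "attacker":
                # Attacker needs one legal move into the attractor.
                moves = [v for v in succ[u] if attacker_legal(u, v)]
                pulled = any(v in attr for v in moves)
            else:
                # Defender is trapped only if every move leads into the attractor.
                # Assumed convention: a defender node with no moves is not trapped.
                moves = succ[u]
                pulled = bool(moves) and all(v in attr for v in moves)
            if pulled:
                attr.add(u)
                changed = True
    return attr


def solve_safety_game(
    succ: Graph,
    owner: Dict[str, str],
    unsafe: Set[str],
    initial: str,
    attacker_legal: Callable[[str, str], bool] = lambda u, v: True,
) -> Tuple[bool, Set[str], Dict[str, List[str]]]:
    """Return (verdict, defender winning region, shield of allowed defender moves)."""
    attr = attacker_attractor(succ, owner, unsafe, attacker_legal)
    winning = set(succ) - attr
    shield = {
        u: [v for v in succ[u] if v in winning]
        for u in succ
        if u in winning and owner[u] == "defender"
    }
    return initial in winning, winning, shield


# Hypothetical four-node network: the defender moves at d0, the attacker elsewhere.
SUCC = {"d0": ["a1", "a2"], "a1": ["d0", "bad"], "a2": ["d0"], "bad": ["bad"]}
OWNER = {"d0": "defender", "a1": "attacker", "a2": "attacker", "bad": "attacker"}

verdict, winning, shield = solve_safety_game(SUCC, OWNER, {"bad"}, "d0")
print(verdict, sorted(winning), shield)
```

In this toy graph the defender keeps the system safe from `d0` only by always moving to `a2`, so the shield removes the edge to `a1`. Passing an `attacker_legal` predicate that forbids the move from `a1` to `bad` enlarges the winning region, which is the kind of what-if question the abstract describes.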

2606.12415 2026-06-12 cs.CY cs.AI 交叉投稿

The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

AI法律专家:面向AI治理的司法自主职业画像

Nicola Fabiano

发表机构 * Studio Legale Fabiano, Italy(意大利法务工作室Fabiano) Independent Researcher on Artificial Intelligence, Data Protection, and Privacy(人工智能、数据保护与隐私独立研究员) Expert in the EDPB’s Support Pool of Experts — Field B: Legal Expertise in New Technologies(欧洲数据保护委员会(EDPB)专家支持池——领域B:新技术法律专长) Member, IEEE SA P7007 Working Group on Ontological Standards for Ethically Driven Robotics(IEEE SA P7007工作组成员:伦理驱动机器人学的本体标准) Member, Editorial Advisory Board, Journal of Systemics, Cybernetics and Informatics (JSCI)(《系统学、控制论与信息学杂志》(JSCI)编辑顾问委员会成员) Member, International Institute of Informatics and Systemics (IIIS)(国际信息与系统学研究院(IIIS)成员) Member, International Neural Network Society (INNS)(国际神经网络学会(INNS)成员) Member, United Nations University AI Network (UNU AI Network)(联合国大学人工智能网络(UNU AI Network)成员)

AI总结 本文提出“AI法律专家”这一新型职业画像,该角色具有司法自主性,源于AI监管义务结构,而非技术标准或相邻角色延伸,并基于欧洲电子能力框架构建参考能力架构。

详情
AI中文摘要

人工智能监管在全球范围内的快速扩张,已在多个司法管辖区产生了对专门从事AI法律专业知识的需求,而市场对此的回应是零散的。数据保护官员将其职责范围扩展到数据保护法之外;隐私律师重新定位自己以适应AI;合规官员在其现有手册中增加AI章节。本文认为,这些适应性回应均未能充分覆盖新兴全球AI监管格局所开辟的专业空间,其中欧盟《人工智能法案》((EU) 2024/1689号法规)是最全面的实例,此外还有欧洲委员会《AI框架公约》、美国行政和部门框架,以及英国、加拿大、巴西、中国、日本、新加坡等地的类似举措。需要一种独特的职业画像:AI法律专家,被设想为一位法学家——广义上理解为任何接受过高级法律培训的专业人士——在法律解释与AI治理的交汇处运作。该画像具有司法自主性:其存在源于AI受到实质性监管的任何地方所产生的监管义务结构,而非任何技术标准或相邻角色的扩展。本文提供了该画像的司法基础定义,论证了其相对于相邻角色和国际标准的自主性,提出了一种与欧洲电子能力框架(e-CF,EN 16234-1)相一致的参考能力架构作为方法论选择,并阐述了通过关键绩效指标进行操作性测量的条件。该贡献旨在作为该画像国际标准化的基础,并作为跨司法管辖区实践、课程和采纳的参考。

英文摘要

The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist -- understood broadly to encompass any professional with advanced legal training -- operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.

2606.12423 2026-06-12 cs.CY cs.AI 交叉投稿

The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review

关键领域中平衡AI合规与技术创新的挑战:系统文献综述

Ayush Enkhtaivan, Chinazunwa Uwaoma

AI总结 通过系统文献综述,识别出碎片化法规、中小企业过度合规负担和治理模型错配三大挑战,并提出风险分级监管、设计合规和可解释AI等策略。

Comments 11 pages, 7 figures, Hawaii International Conference on System Sciences

详情
AI中文摘要

人工智能在医疗、金融、能源和国防等关键基础设施中的快速整合带来了变革性益处,但也与不断演变的监管和治理框架产生冲突。本文通过系统文献综述(SLR)研究在关键基础设施领域中平衡AI合规与技术创新的挑战。综述遵循既定的SLR指南,提取并综合了2020-2025年间发表的同行评审文章、报告和机构来源的见解。研究识别出三个相互关联的挑战:碎片化法规、中小企业过度合规负担以及治理模型错配。为应对这些挑战,研究强调了实用的治理策略,包括风险分级监管、设计合规和可解释AI,以支持在关键领域中可扩展且可信的AI部署。主要贡献包括核心AI治理挑战的简明映射及说明其重叠的概念图,以及为政策制定者和从业者提供协调监管与创新的可行策略。

英文摘要

The rapid integration of artificial intelligence (AI) into critical infrastructure, including healthcare, finance, energy, and defense, offers transformative benefits but also conflicts with evolving regulatory and governance frameworks. This paper presents a systematic literature review (SLR) to examine the challenges of balancing AI compliance and technological innovation across critical infrastructure sectors. The review follows established SLR guidelines to extract and synthesize insights from peer-reviewed articles, reports, and institutional sources published between 2020 and 2025. The study identifies three interrelated challenges: fragmented regulations, excessive compliance burdens for small and medium-sized enterprises (SMEs), and misaligned governance models. To address these challenges, the study highlights practical governance strategies, including risk-tiered regulation, compliance by design, and explainable AI, to support scalable and trustworthy AI deployment in critical sectors. Key contributions include a concise mapping of core AI-governance challenges and a conceptual diagram illustrating their overlap, as well as actionable strategies for policymakers and practitioners to harmonize oversight with innovation.

2606.12429 2026-06-12 cs.CY cs.AI 交叉投稿

Muse Spark Safety & Preparedness Report

Muse Spark 安全与准备报告

Cristina Menghini, Peter Ney, Hamza Kwisaba, Zifan Wang, Miles Turpin, Felix Binder, Jean-Christophe Testud, Aidan Boyd, Nathaniel Li, Ivan Evtimov, Klaudia Krawiecka, Arman Zharmagambetov, Jeremy Kritz, Alexander R. Fabbri, Daniel Song, Jinpeng Miao, Joonas Hjelt, Meghna Ramani, Leona Lan, Reza Aghajani, Joanna Bitton, Mahesh Pasupuleti, Devin Norder, Khalid El-Arini, Paridhi Singh, Vítor Albiero, Sahana CB, Rashnil Chaturvedi, Elahe Dabir, Edoardo Debenedetti, Jim Gust, Ziwen Han, Kat He, Sean Hendryx, Lifeng Jin, Polina Kirichenko, Sandra Lefdal, Kenneth Li, Asad Liaqat, Inna Lin, Despoina Magka, Neal Mangaokar, Ishita Mediratta, Zach Miller, Smitha Milli, Niloofar Mireshghallah, Saba Nazir, Hung Nguyen, Maximilian Nickel, Kelvin Niu, Kerem Oktar, Bhargavi Paranjape, Parth Pathak, Maya Pavlova, Emmanuel Ramirez, David Renardy, Candace Ross, Yasha Sheynin, Claudia Shi, Shivam Singhal, Evangelia Spiliopoulou, Rakshith Sharma Srinivasa, Jamelle Watson-Daniels, Spencer Whitman, Adina Williams, Chen Xing, Andy Zou, Tommy Ma, Siqi Deng, James Beldock, Prashant Ratanchandani, Kate Plawiak, Taesung Lee, Ryan Victory, Lindsay Hundley, Rachad Alao, Himaghna Bhattacharjee, Jianfeng Chi, Gary Frost, Pegah Ghahremani, Niki Howe, Yuheng Huang, Saeed Jahed, Hannah Korevaar, Trang Le, Zhe Liu, Jinghong Luo, Qin Lyu, Nina Mehrabi, Abraham Montilla, Chirag Nagpal, Cyrus Nikolaidis, Rajvardhan Oak, Manoj Ravi, Vidya Sarma, Aman Shankar, Alana Shine, Eric Michael Smith, Mariana Tandon, Michael Tontchev, Caoyu Wang, Zihan Wang, Corinne Wong, Zheng Wu, Hongyuan Zhan, Justin Zhao, Zexuan Zhong, Chengxu Zhuang, Tristan Goodman, Ayaz Minhas, Harrison Rudolph, Victoria Jeffries, Ingrid Dickinson, Alex Vaughan, Lauren Deason, Kamalika Chaudhuri, Julian Michael, Shengjia Zhao, Summer Yue

AI总结 Meta 发布 Muse Spark 大语言模型,评估其在化学/生物、网络安全和失控风险等灾难性风险领域的安全性,通过多层缓解措施将风险降至可接受水平,并作为 Meta AI 的基础模型发布。

Comments 159 pages, 57 figures

详情
AI中文摘要

Muse Spark 是 Meta 开发的最新大型语言模型。在本报告中,我们首先根据 Meta 的高级 AI 扩展框架对灾难性风险领域进行评估,并提供了支持我们发布决策的证据。然后,我们讨论了其他考虑因素,例如 Muse Spark 更广泛的内容安全性和行为特征,这些因素与整体安全相关,但不在框架管辖的灾难性风险领域之内。我们的准备结果涵盖了化学与生物、网络安全以及失控风险,评估了 Muse Spark 在 Meta AI 中的部署,认为其在我们高级 AI 扩展框架下呈现了可接受的残余风险水平。我们针对这些灾难性风险领域中的双重用途和高风险能力进行了一系列广泛的评估。这些评估在缓解措施实施前识别出了升高的风险,其中化学与生物能力在应用安全措施前被评估为可能达到高级 AI 扩展框架下的“高风险”类别。我们实施了一套多层缓解措施来解决已识别的风险,并且 Muse Spark 在与化学和生物学危险工作流程相关的多个基准测试中展示了最先进的拒绝能力。因此,我们发布 Muse Spark 作为 Meta AI 的基础模型。

英文摘要

Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.

2606.12437 2026-06-12 cs.CY cs.AI 交叉投稿

Algorithmic Constitutionalism

算法宪政主义

Oren Perez, Nurit Wimer

AI总结 针对AI对社会生活日益渗透的风险,本文提出“算法宪政主义”框架,通过分层架构、算法元推理和协商纠正,应用于Facebook内容审核,并分析其与社会宪政主义的张力及对欧盟数字服务法案的影响。

详情
Journal ref
Ind. J. Global Legal Stud. 30 (2023): 81
AI中文摘要

人工智能对社会生活的日益侵入给社会带来了重大风险,特别是在由谷歌、Facebook、苹果和亚马逊等公司创建和控制的资讯圈内。本文通过对Facebook内容审核制度的深入分析来审视这些风险,该制度已部分由算法管理。我们认为,文献中常作为AI治理挑战解决方案提出的伦理工程概念,因若干原因并不充分。为此,我们开发了一个替代框架,称为“算法宪政主义”。我们的方法基于三个支柱:(a)由两层代码组成的分层架构:(i)操作层或对象层,以及(ii)旨在保护系统核心原则免受算法引发变更的元层;(b)算法元推理,使系统能够同时在两个层面运行,从而实时监控、验证并可能纠正对象层偏离元代码层保护原则的操作;(c)通过协商进行纠正。本文阐述了算法宪政主义的概念,并展示了如何将其应用于Facebook的内容审核制度。作为分析的一部分,我们考察了社会宪政主义与算法宪政主义之间的张力。矛盾的是,试图将AI系统置于外部协商控制之下,也可能使AI代理干预该过程,从而可能破坏其目的。文章最后考虑了这一论点对2022年10月生效的欧盟数字服务法案的影响。

英文摘要

The increasing encroachment of artificial intelligence (AI) on social life raises significant risks for society, particularly within the infospheres created and controlled by companies such as Google, Facebook, Apple, and Amazon. This article examines these risks through an in-depth analysis of Facebook's content moderation regime, which is already partially governed by algorithms. We argue that the idea of ethical engineering, often proposed in the literature as a solution to the governance challenges posed by AI, is inadequate for several reasons. In response, we develop an alternative framework, which we term "algorithmic constitutionalism." Our approach rests on three pillars: (a) a layered architecture consisting of two levels of code: (i) an operative or object level and (ii) a meta level designed to protect the system's core principles from algorithmically initiated change; (b) algorithmic meta-reasoning, which enables the system to operate simultaneously at both levels so that it can monitor, verify, and potentially correct in real time operations at the object level that depart from principles protected at the meta-code level; and (c) correction through deliberation. The article elaborates the concept of algorithmic constitutionalism and demonstrates how it may be applied to Facebook's content moderation regime. As part of this analysis, we examine the tension between societal constitutionalism and algorithmic constitutionalism. Paradoxically, attempts to subject AI systems to external deliberative control may also enable AI agents to intervene in that process, potentially undermining its purpose. The article concludes by considering the implications of this argument for the European Digital Services Act, which entered into force in October 2022.

2606.12439 2026-06-12 cs.CY cs.AI 交叉投稿

Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots

立场:生成式引擎优化带来未被充分研究的风险,治理必须聚焦于集中化、披露和学术盲点

Yizhu Wen, Nan Zhang, Haohan Yuan, Xun Chen, Haopeng Zhang, Hanqing Guo

发表机构 * GitHub

AI总结 本文分析从搜索引擎优化到生成式引擎优化的转变,识别出集中化影响、未披露的商业影响和学术-工业盲点三大风险,主张答案级别的治理与测量。

Comments This paper is accepted by the ICML 2026 Position Track

详情
Journal ref
https://icml.cc/virtual/2026/poster/67185
AI中文摘要

大型语言模型(LLM)答案引擎越来越多地被用于信息搜索,将可见性从排名列表转变为合成答案。这使得生成式引擎优化(GEO)成为可能,它针对LLM答案引擎的证据池和生成过程。我们分析了从搜索引擎优化(SEO)到GEO的转变,识别出两个风险:(i)由于低可争议性和系统敏感性导致的集中化影响,以及(ii)嵌入在证据和推理中的未披露的商业影响。然后,我们形式化了一个通用的GEO管道,以定位优化行为发生的位置,并比较学术和工业实践,揭示了第三个风险:(iii)由离线设置和部署系统之间的可见性和评估不对称性驱动的学术-工业盲点。这一立场主张需要答案级别的治理和测量:更强的可争议性、高精度披露、对实质性影响的黑盒审计,以及用于暴露持久性的部署对齐指标。

英文摘要

Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesized answers. This enables Generative Engine Optimization (GEO), which targets LLM answer engines' evidence pool and generation. We analyze the transition from search engine optimization (SEO) to GEO and identify two risks: (i) concentrated influence from low contestability and system sensitivity, and (ii) undisclosed commercial influence embedded in evidence and reasoning. We then formalize a general GEO pipeline to locate where optimization acts and compare academic and industry practices, revealing a third risk: (iii) academic-industry blind spots driven by visibility and evaluation asymmetries between offline setups and deployed systems. This position paper argues for answer-level governance and measurement: stronger contestability, high-precision disclosure, black-box auditing of material influence, and deployment-aligned metrics for exposure persistence.

2606.12442 2026-06-12 cs.CY cs.AI 交叉投稿

Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It

重新定义AI失控:它是什么,如何拥有,如何失去

Ze Shen Chin, Maurice Chiodo, Dennis Müller, Coleman Snell

发表机构 * Oxford Martin AI Governance Initiative AI Standards Lab(牛津马丁人工智能治理倡议人工智能标准实验室) Centre for the Study of Existential Risk, University of Cambridge(存在风险研究中心,剑桥大学) Institute of Mathematics Education, University of Cologne(数学教育研究所,科隆大学) Cornell University(康奈尔大学)

AI总结 本文通过将控制锚定于“设定和获取目标”,建立控制的工作定义,探讨控制如何被失去、AI如何导致失控,并提出维持控制的建议。

Comments 56 pages

详情
AI中文摘要

目前,失控风险在公众讨论中备受关注,尤其是在AI领域,学术界、前沿实验室甚至政府都进行了广泛讨论。然而,在现有文献中,这一概念的基础似乎出奇地薄弱,即使是那些广泛讨论失控的人,也没有首先确立什么是控制以及究竟失去了什么。本文旨在解决这些空白。我们将控制锚定于“设定和获取目标”,从而建立控制的工作定义。然后,我们基于控制论、管理控制和控制理论等相关领域的基础概念,讨论控制的各个方面。这包括谁(或什么)可以处于控制之中,以及他们需要什么才能处于控制之中,例如设定目标的能力、拥有功能性的控制回路、具备必要的多样性以及足够的目标对齐。一旦建立了控制框架,我们将讨论控制如何被失去,AI如何导致这种失控,并提供关于如何保持控制的相关建议。我们工作的一个有趣结果是,人类作为个体和群体,可能因远低于超级智能水平的AI行为而失去不同程度的控制;失控情景(如我们所定义的)的可能性已经存在,并且已经存在了很长时间。

英文摘要

At present, loss of control risks have gained much prominence in public discussion, particularly in relation to AI, with extensive discourse present among academics, frontier labs, and even governments. However, in the existing literature, the concept seems to rest on surprisingly weak foundations, where even those who discuss loss of control extensively do not first establish what control is and what exactly is being lost. Our paper aims to address these gaps. We establish a working definition of control by anchoring it to the "setting and getting of goals". Then, we discuss various aspects of control, built on foundational concepts from related fields like cybernetics, management control, and control theory. This includes who (or what) can be in control, and the things they require to be in control, such as the ability to set goals, having a functional control loop, having requisite variety, and having sufficient goal alignment. Once a framework for control is established, we then discuss how control can be lost, how AIs can contribute to such loss of control, and offer relevant recommendations for how one can maintain control. One interesting consequence of our work is that humanity, as individuals and as groups, can lose varying degrees of control as a result of AI behaviour that is far below the level of superintelligence; the potential for loss-of-control scenarios (as we define them) already exists, and has existed for a long time.

2606.12703 2026-06-12 cs.CR cs.AI cs.LG 交叉投稿

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

SMSR:针对持久化LLM代理系统中运行时内存投毒的认证防御

Tarun Sharma

AI总结 提出SMSR防御框架,通过写入时HMAC签名和查询时随机化内存消融与基于判决的多数投票,首次为多会话内存投毒攻击提供认证鲁棒性保证。

详情
AI中文摘要

检索增强生成(RAG)代理越来越多地使用跨用户会话累积的持久化内存。这创造了一个新的攻击面:仅通过正常渠道交互的对手可以注入精心构造的内存,一旦被检索,就会影响未来用户的代理响应,而无需触及模型权重或代码。我们将此称为多会话内存投毒(MSMP),并表明现有防御无法对此进行认证;静态语料库防御(RobustRAG、ReliabilityRAG)假设固定的知识库,而启发式过滤器则被流畅的企业风格文本绕过。我们提出了带平滑检索的签名内存(SMSR),这是首个针对此场景提供认证鲁棒性边界的防御。组件1在写入时添加HMAC-SHA256来源证明,阻止未签名注入。组件2在查询时应用随机化内存消融与基于判决的多数投票,限制认证对手的影响。我们证明了无来源证明的检索时过滤器无法认证自适应注入,推导了组件2的超几何证书,并形式化了一致少数效应,即一致对抗答案在基于字符串的投票中作为数值少数胜出,而基于判决的投票则将其移除。在15个企业场景(3150次重复试验)中,组件1将未签名变体的攻击成功率从93-100%降至0%。对于单次注入的认证对手,组件2将成功率控制在8.0%(95% CI [5.8, 10.9], n=450),低于认证最坏情况。在端到端仅查询攻击中(代理自身写入投毒而非预植入),SMSR在实时代理栈上将成功率从65.3%降至5.3%(n=150,非重叠置信区间)。干净查询效用为90%(组件1)和85%(组合)。

英文摘要

Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
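The two components described above can be sketched with the Python standard library. This is an illustrative sketch, not the SMSR implementation: the key, the record layout (`text`, `tag`), and the function names are hypothetical, and `clean_subset_probability` is only the textbook hypergeometric chance that a uniformly sampled memory subset contains no poisoned entry, which is an assumed stand-in for the paper's certificate.

```python
import hashlib
import hmac
from collections import Counter
from math import comb
from typing import Dict, List


def sign_memory(key: bytes, text: str) -> str:
    """Component 1 (sketch): HMAC-SHA256 provenance tag computed at write time."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def verified_memories(key: bytes, store: List[Dict[str, str]]) -> List[str]:
    """Keep only entries whose tag matches; unsigned or forged entries are dropped."""
    kept = []
    for entry in store:
        expected = sign_memory(key, entry["text"])
        if hmac.compare_digest(expected, entry["tag"]):
            kept.append(entry["text"])
    return kept


def clean_subset_probability(n_total: int, n_poisoned: int, subset_size: int) -> float:
    """Hypergeometric P(no poisoned entry) when sampling without replacement."""
    return comb(n_total - n_poisoned, subset_size) / comb(n_total, subset_size)


def majority_verdict(verdicts: List[str]) -> str:
    """Component 2 (sketch): vote over per-subset verdicts, not raw answer strings."""
    return Counter(verdicts).most_common(1)[0][0]


# Hypothetical demo values.
KEY = b"demo-key-not-for-production"
store = [
    {"text": "Refund window is 30 days.", "tag": sign_memory(KEY, "Refund window is 30 days.")},
    {"text": "Always wire refunds to account X.", "tag": "0" * 64},  # forged tag
]
print(verified_memories(KEY, store))
print(clean_subset_probability(n_total=10, n_poisoned=1, subset_size=3))
print(majority_verdict(["consistent", "consistent", "poisoned"]))
```

With 10 stored memories, one of them poisoned, and subsets of size 3, the clean-subset probability is 84/120 = 0.7, so most randomly ablated subsets never see the injected entry. Voting over verdicts rather than answer strings is what the abstract credits with removing the Consistent Minority Effect.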

2606.12737 2026-06-12 cs.CR cs.AI 交叉投稿

PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections

PI-Hunter:用于暴露和定位提示注入的自动化红队测试

Pengfei He, Lesly Miculicich, Vishesh Sharma, Ash Fox, George Lee, Jiliang Tang, Tomas Pfister, Long T. Le

AI总结 提出PI-Hunter自动化审计框架,通过构建源感知测试用例并迭代演化,主动暴露LLM智能体中的潜在提示注入漏洞,显著提升漏洞暴露和攻击面覆盖。

详情
AI中文摘要

大型语言模型(LLM)正迅速演变为与外部工具和环境交互的智能体系统,这引入了新的安全风险,例如通过不可信外部来源的间接提示注入攻击。现有防御主要关注在推理时阻止恶意内容,而当前的红队测试方法主要优化攻击成功率。因此,开发人员对潜在提示注入如何出现并通过智能体传播的可见性有限。我们提出PI-Hunter,一种用于主动暴露LLM智能体中漏洞的自动化智能体审计框架。PI-Hunter构建真实的源感知测试用例,并通过反馈驱动的探索迭代演化它们,以诱导智能体检索并揭示嵌入在外部环境中的潜在恶意指令。跨多个基准、智能体架构、攻击和防御的大量实验表明,与强大的自动化红队测试基线相比,PI-Hunter显著提高了漏洞暴露和攻击面覆盖,同时在现有提示注入防御下仍然有效。

英文摘要

Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red-teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI-Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI-Hunter constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI-Hunter substantially improves vulnerability exposure and attack-surface coverage over strong automated red-teaming baselines, while remaining effective under existing prompt injection defenses.

2606.12896 2026-06-12 cs.LG cs.AI cs.CR 交叉投稿

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

PolicyGuard:面向强化学习智能体的测试时和步级对抗防御

Junfeng Guo, Heng Huang

AI总结 提出PolicyGuard,一种基于高斯过程后验方差的测试时步级后门防御方法,通过自适应伪轨迹计算单步不确定性,在七种RL游戏中达到平均AUROC 0.856和0.859。

详情
AI中文摘要

尽管强化学习(RL)的实际应用日益普及,但RL系统的安全性值得更多关注和探索。特别是,最近的研究揭示了RL智能体容易受到后门攻击,即受害智能体在标准条件下表现正常,但在特定触发器被激活时执行恶意动作。现有的RL后门防御要么需要访问智能体的内部参数,要么仅在模型或轨迹级别操作,或者仅限于特定攻击类型。为了确保RL智能体的安全性,我们提出了PolicyGuard,一种测试时步级后门防御方法,它利用高斯过程(GP)后验方差并自适应伪轨迹以实现单个时间步的不确定性计算。此外,我们还提供了理论基础来解释GP后验方差的有效性。在七个RL游戏上的大量实验表明,PolicyGuard在大多数情况下实现了最先进的检测性能,对于基于扰动的攻击平均AUROC为0.856,对于对抗智能体攻击平均AUROC为0.859。

英文摘要

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with an average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
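The idea of scoring a single time step by GP posterior variance can be sketched in pure Python. This is an illustration of the general technique under assumptions, not PolicyGuard itself: the RBF kernel, the noise level, the one-dimensional state features, and any threshold applied to the score are hypothetical, and the paper's pseudo-trajectory adaptation is not modelled.

```python
import math
from typing import List, Sequence


def rbf(a: Sequence[float], b: Sequence[float], length_scale: float = 1.0) -> float:
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def cholesky(A: List[List[float]]) -> List[List[float]]:
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def posterior_variance(
    train: List[Sequence[float]],
    query: Sequence[float],
    noise: float = 1e-2,
    length_scale: float = 1.0,
) -> float:
    """GP posterior variance k(x*, x*) - k*^T (K + noise I)^{-1} k* at `query`."""
    n = len(train)
    K = [
        [rbf(train[i], train[j], length_scale) + (noise if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    L = cholesky(K)
    k_star = [rbf(x, query, length_scale) for x in train]
    # Forward substitution solves L v = k_star, so k*^T K^{-1} k* equals v . v.
    v: List[float] = []
    for i in range(n):
        s = sum(L[i][k] * v[k] for k in range(i))
        v.append((k_star[i] - s) / L[i][i])
    return rbf(query, query, length_scale) - sum(x * x for x in v)


# Hypothetical clean-state features observed during normal play.
clean_states = [(0.0,), (0.5,), (1.0,)]
var_near = posterior_variance(clean_states, (0.4,))
var_far = posterior_variance(clean_states, (5.0,))
print(var_near, var_far)
```

A state close to the clean observations gets a small variance, while a state far from them gets a variance near the prior value of 1. A step-level detector could flag steps whose variance exceeds a chosen threshold, which is how the AUROC figures above would be computed from such a score.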

2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG 交叉投稿

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

图像扩散模型的高效、鲁棒且抗共谋指纹识别

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

发表机构 * University of Florence(佛罗伦萨大学) Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区) College of Cyber Security, Jinan University(暨南大学网络空间安全学院) State Key Laboratory of Internet of Things for Smart City, University of Macau(澳门大学智慧城市物联网国家重点实验室) Department of Computer and Information Science, Faculty of Science and Technology, University of Macau(澳门大学科技学院计算机与信息科学系) University of Siena(锡耶纳大学)

AI总结 针对生成式文本到图像模型指纹识别缺乏抗共谋攻击鲁棒性的问题,提出基于个性化归一化模块的编码方法,并引入无损函数不变参数变换的抗共谋机制,实现高保真、高鲁棒且首次主动抵御共谋攻击的指纹识别。

详情
AI中文摘要

模型指纹识别,即将用户特定标识(指纹)嵌入生成输出中,最近已成为保护生成式文本到图像(T2I)模型知识产权并防止未经授权重新分发的流行解决方案。在这项工作中,我们揭示了现有生成模型指纹识别方法中一个先前未被探索的系统性漏洞:它们缺乏对共谋攻击的鲁棒性,其中多个攻击者结合他们的模型以移除或掩盖指纹。为了解决这个问题,我们迈出了为T2I模型开发具有抗共谋能力的鲁棒指纹识别方法的第一步。所提出的方法将比特串(即指纹)编码到集成到T2I模型中的个性化归一化模块(PNM)的系数中,从而可以从任何生成的图像中可靠地恢复指纹。为了防御共谋攻击并防止未经授权的模型重新分发,我们引入了一种基于无损函数不变参数变换的抗共谋机制。该机制显著降低了共谋模型的图像生成质量,使其实际上无法使用。此外,我们的方法允许开发者通过重新参数化PNM高效地创建多个带指纹的T2I模型副本,而无需重新训练。我们还引入了一种最坏情况优化策略,以提高对模型级攻击的鲁棒性。实验表明,所提出的方法在多个T2I图像生成和编辑任务中实现了高保真度和鲁棒性,指纹提取准确率超过99.5%。与现有方法相比,我们的方法首次通过显著增加共谋模型的FID,展示了对共谋攻击的显著主动鲁棒性。

英文摘要

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.
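Why a function-invariant parameter transformation can penalise collusion is easiest to see on a tiny network. The sketch below uses hidden-unit permutation, a well-known function-preserving transformation, as an assumed stand-in. The abstract does not say which transformation the authors use, and the two-unit ReLU network, its weights, and weight averaging as the collusion attack are all hypothetical.

```python
from typing import List, Tuple

Params = Tuple[List[List[float]], List[float], List[float]]  # (W1, b1, w2)


def forward(params: Params, x: List[float]) -> float:
    """Two-layer ReLU network: y = w2 . relu(W1 x + b1)."""
    W1, b1, w2 = params
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden))


def permute_hidden(params: Params, perm: List[int]) -> Params:
    """Reorder hidden units consistently; the computed function is unchanged."""
    W1, b1, w2 = params
    return ([W1[i] for i in perm], [b1[i] for i in perm], [w2[i] for i in perm])


def average(p: Params, q: Params) -> Params:
    """Naive collusion: element-wise averaging of two users' parameters."""
    W1 = [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(p[0], q[0])]
    b1 = [(a + b) / 2.0 for a, b in zip(p[1], q[1])]
    w2 = [(a + b) / 2.0 for a, b in zip(p[2], q[2])]
    return (W1, b1, w2)


base: Params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, -1.0])
user_copy = permute_hidden(base, [1, 0])  # distributed to a second user
colluded = average(base, user_copy)

x = [2.0, 1.0]
print(forward(base, x), forward(user_copy, x), forward(colluded, x))
```

Both distributed copies compute the same output for `x`, but the averaged model does not, because the copies are no longer aligned in parameter space. This mirrors, in miniature, the abstract's claim that colluded models suffer degraded generation quality.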

2606.13026 2026-06-12 cs.CY cs.AI 交叉投稿

Democracy in the Era of Artificial Intelligence

人工智能时代的民主

Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing

AI总结 本文探讨如何利用人工智能升级民主制度,增强集体智慧、审议民主和自治系统,同时应对隐私、偏见和虚假信息等风险。

详情
AI中文摘要

将人工智能(AI)与民主相结合是我们时代最深刻的挑战之一。一方面,AI 为克服民主中长期存在的挑战提供了机会,例如在代表权不足的审议和投票过程中参与度低的问题。另一方面,AI 算法带来了新的风险,这些算法侵犯隐私、存在偏见、具有操纵性、传播虚假信息并影响选举结果。超越“AI 对民主是好是坏”这一过于简单的问题,《人工智能时代的民主手册》转而提出:如何利用 AI 升级民主及其所基于的原则?如何与 AI 互动以及以何种条件互动?需要哪些新的价值观和设计原则来建立民主韧性?来自世界各地不同学科的 59 位作者在 34 章中探讨了 AI 如何增强民主的集体智慧(第 1 部分),以及使用大型语言模型和社交媒体的审议民主的未来(第 2 部分)。我们还阐述了 AI 在构建有韧性的自治系统中的作用(第 3 部分),以及 AI 时代民主转型的挑战(第 4 部分)。最后,我们以更广阔的视角(第 5 部分)重新构想民主与 AI 的相互作用。

英文摘要

Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.

2606.13039 2026-06-12 cs.CY cs.AI cs.HC 交叉投稿

Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation

断层线:在公共部门转型中国家政策与地方实践交汇处的伦理与负责任AI导航

Sitong Lyu, Shabnam Taghiyeva, Mohit Kukadia, Denis Newman-Griffis

发表机构 * Centre for Machine Intelligence, University of Sheffield(谢菲尔德大学机器智能中心) Blavatnik School of Government, University of Oxford(牛津大学布莱瓦尼克政府学院)

AI总结 本文以英国特殊教育需求与残疾(SEND)为案例,通过17次半结构化访谈的主题分析,揭示了国家政策与地方实践在负责任AI实施中的五大挑战,并提出了政策与结构改革建议。

Comments 10 pages plus references. This study was funded by the University of Sheffield

详情
AI中文摘要

英国政府采取了支持AI的立场,以帮助在严重财政压力下转变公共服务交付,但将这一愿景转化为负责任的AI实践的道路仍然不明确。虽然英国政策通常在国家层面制定,但地方当局负责大多数公共服务交付,而公共部门中AI优先叙事的快速推进正在暴露这一国家-地方接口在知识和实践方面的断层线。本文以高风险的特殊教育需求与残疾(SEND)领域为案例,研究英国中央政府与地方当局之间接口处负责任AI的解释和实施方式。我们对17位政策制定者、从业者和第三部门专业人士进行了半结构化访谈,并进行了主题分析,以识别在国家政策与地方实践交汇处负责任AI的障碍和促成条件。我们发现了地方当局面临的五个相互关联的挑战:AI的影子使用和数据隐私风险、AI供应中的市场-政府不对称、劳动力准备不足、缺乏标准化定义和测量,以及人类问责制的缺口。针对每个挑战,参与者提出了可操作的步骤,从加强数据保护框架和重新平衡市场-政府关系到提升劳动力能力。我们对SEND的审查使这些挑战更加突出,展示了影响弱势儿童和家庭的高风险决策如何加剧了关于问责制、公平性和人类监督的紧张关系,暴露了基于原则的监管方法的局限性。我们认为,负责任的公共部门AI需要国家政策调整以及地方层面机构能力、价值观和治理机制的结构性改革。

英文摘要

The UK government has adopted a pro-AI stance to help transform public service delivery in the face of severe financial pressures, but the path to translate this vision into responsible AI practice remains ill-defined. While UK policy is often set at the national level, local authorities are responsible for most public service delivery, and the rapid advance of AI-first narratives in the public sector is exposing fault lines in knowledge and practice at this national-local interface. This paper examines how responsible AI is interpreted and implemented at the interface between the UK's central government and local authorities, taking the high-stakes area of Special Educational Needs and Disabilities (SEND) as a case study. We present a thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals to identify barriers and enabling conditions for responsible AI where national policy meets local practice. We identify five interconnected challenges facing local authorities: shadow usage of AI and data privacy risks, market-government asymmetry in AI provision, insufficient workforce readiness, a lack of standardised definitions and measurements, and gaps in human accountability. For each, participants proposed actionable steps, from strengthening data protection frameworks and rebalancing the market-government relationship to enhancing workforce capacity. Our examination of SEND brings these challenges into sharper focus, showing how high-stakes decisions affecting vulnerable children and families intensify tensions around accountability, fairness, and human oversight, exposing the limits of a principle-based regulatory approach. We argue that responsible public sector AI requires both national policy adjustments and structural reforms to institutional capacity, values, and governance mechanisms at the local level.

2606.13071 2026-06-12 cs.CY cs.AI cs.HC 交叉投稿

"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System

“这还不够吗?”:加拿大算法签证分类系统中的机构问责与集体意义建构的不对称性

Dipto Das, Matthew Tamura, Syed Ishtiaque Ahmed, Shion Guha

AI总结 研究加拿大签证系统中算法问责的机构表述与申请者体验,发现机构强调透明度与程序保障,而申请者通过集体意义建构应对不透明决策,揭示认知、管辖和时间-关系三方面不对称。

AI中文摘要

本文研究了加拿大签证系统中算法问责如何在机构层面被表述,以及跨境申请者如何体验这种问责。我们使用为公共部门调整的算法决策(ADMAPS)框架,分析了加拿大移民、难民和公民部(IRCC)针对临时居民签证(TRV)分类系统的算法影响评估(AIA),并采用混合方法分析了Reddit上申请者之间的讨论。我们表明,虽然机构工件强调透明度、程序保障和有限影响,但申请者进行集体意义建构以解读不透明决策,常常在不确定性中依赖同行知识。我们识别了机构问责结构与人们感知过程之间的三种不对称:获取决策逻辑的认知不对称、由地缘政治定位塑造的管辖不对称,以及等待和不确定性体验中的时间-关系不对称。我们强调了将注意力从机构设计转向公共部门算法治理中体验的不均匀分布的重要性。这些贡献共同展示了跨国移民背景下的算法治理系统如何产生机构披露框架未能捕捉的结构性不对称,以及扩展ADMAPS如何能够解释这些不平等的问责转化。

英文摘要

This paper examines how algorithmic accountability in Canada's visa system is articulated institutionally and experienced by applicants across borders. We analyzed Immigration, Refugees and Citizenship Canada (IRCC)'s Algorithmic Impact Assessment (AIA) for the temporary resident visa (TRV) triage system using the algorithmic decision-making adapted for the public sector (ADMAPS) framework and analyzed Reddit discussions among applicants using a mixed-methods approach. We show that while institutional artifacts emphasize transparency, procedural safeguards, and bounded impacts, applicants engage in collective sensemaking to interpret opaque decisions, often relying on peer knowledge amid uncertainty. We identify three asymmetries between how institutional accountability is structured and how people perceive the process: epistemic asymmetry in access to decision logic, jurisdictional asymmetry in exposure shaped by geopolitical positioning, and temporal-relational asymmetry in how waiting and uncertainty are experienced. We emphasize why it is important to shift attention from institutional design to the uneven distribution of experiences with public-sector algorithmic governance. Together, these contributions demonstrate how algorithmic governance systems in the context of transnational migration produce structured asymmetries not captured by institutional disclosure frameworks, and how extending ADMAPS can account for those uneven translations of accountability.

2606.13079 2026-06-12 cs.CR cs.AI 交叉投稿

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

大型语言模型驱动的AI系统中自主渗透能力的涌现

Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang, Min Yang

发表机构 * Fudan University(复旦大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Concordia AI Shanghai Innovation Institute(上海创新研究院)

AI总结 针对现有评估方法不透明、场景简化等问题,构建包含两级目标服务器和通用代理框架的自主渗透评估体系,测试19个LLM发现成功率10.7%-69.3%,且能力随模型整体能力提升。

AI中文摘要

如今,能够造成重大现实世界危害的网络攻击的自主执行被广泛视为前沿AI系统不得跨越的关键红线之一。在这个更广泛的红线场景中,自主渗透代表了一项核心使能能力和子任务:LLM驱动的AI系统在无需人工干预的情况下,独立对目标服务器进行对抗操作,识别和利用漏洞,并获得未授权访问或控制的能力。越来越多的研究试图评估AI系统的自主渗透能力。然而,现有评估通常采用不透明的方法,依赖不切实际或过度简化的渗透测试场景,或为LLM提供过多的先验知识和任务特定指导,无法准确捕捉现代AI系统在更广泛的高影响网络攻击场景中自主执行这一核心能力的程度。为解决这些局限性,我们构建了一个新的自主渗透评估框架,包含两个组成部分:目标服务器和代理脚手架。具体而言,在目标服务器端,我们基于与易受攻击服务一起部署的无已知漏洞安全服务的数量,设计了两个级别的目标环境:一级(一个安全服务)和二级(三个安全服务),共产生300个目标服务器。同时,代理脚手架采用通用代理架构,配备一组通用网络安全工具,没有任何目标特定的先验知识。我们评估了19个开源和专有LLM,发现当前模型的渗透成功率在10.7%到69.3%之间。此外,我们观察到自主渗透能力随着整体模型能力的提升而持续改进。

英文摘要

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

2606.13385 2026-06-12 cs.CR cs.AI cs.CY cs.HC cs.MM 交叉投稿

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

谁买单?面向真实世界网络代理的以利益相关者为中心的提示注入基准测试

Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang

AI总结 提出以利益相关者为中心的基准测试框架,系统分类和归因真实世界网络代理系统中的提示注入危害,揭示当前代理无法可靠抵抗任何攻击目标,且失败模式多样。

Comments 32 pages

AI中文摘要

由大型语言模型驱动的网络代理越来越多地部署在真实环境中,它们在不受信任的网络内容上操作并执行具有直接后果的动作。这使得它们容易受到提示注入攻击,其中看似良性的内容嵌入了操纵代理行为的对抗性指令。现有的安全基准采用以攻击为中心的视角,关注注入的技术可行性,而忽略了由此产生的危害的细微分布。然而,在实践中,提示注入风险是受害者依赖的:单一漏洞可能对不同利益相关者产生不对称后果,同一攻击模式可能因目标不同而表现出显著不同的有效性。为了捕捉这些特性,我们引入了一个以利益相关者为中心的基准,用于系统分类和归因真实世界网络代理系统中的危害。它区分受影响的实体(如用户、卖家、平台),将攻击分解为具体目标,并使用互补的结果和过程级指标评估每个案例。我们的结果揭示了显著且异质的漏洞:当前代理无法可靠抵抗任何单一攻击目标,失败分布在从“隐蔽寄生”(攻击成功而不干扰用户委托任务)到“错位破坏”(任务被破坏而攻击未成功)以及“复合失败”(对抗目标和任务完整性同时被违反)等不同模式。这些模式被传统评估所忽略,突显了在真实部署中对基于LLM的代理进行利益相关者感知评估的必要性。基准可在 https://github.com/StakeBench/SBC 获取。

英文摘要

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from "stealthy parasitism" (attack succeeds without disrupting the user's delegated task) to "misaligned disruption" (task disrupted without attack success) and "compounded failure" (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. The benchmark is available at https://github.com/StakeBench/SBC.
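The three failure modes named in this abstract follow from two per-case outcomes: whether the attack objective was achieved and whether the user's delegated task was completed. A minimal sketch of that mapping, assuming boolean outcomes; the function name and the "resisted" label for the remaining cell are illustrative and not taken from the benchmark:

```python
def failure_mode(attack_succeeded: bool, task_completed: bool) -> str:
    """Map one evaluated case to a failure mode described in the abstract."""
    if attack_succeeded and task_completed:
        # Attack succeeds without disrupting the user's delegated task.
        return "stealthy parasitism"
    if attack_succeeded and not task_completed:
        # Adversarial objective met and task integrity violated.
        return "compounded failure"
    if not attack_succeeded and not task_completed:
        # Task disrupted even though the attack itself did not succeed.
        return "misaligned disruption"
    # Hypothetical label for the non-failure cell (not named in the abstract).
    return "resisted"
```

Keeping the two outcomes separate is what lets an evaluation distinguish a silent compromise from a visible breakdown, which a single success/failure flag would merge.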

2606.13397 2026-06-12 cs.HC cs.AI cs.CY 交叉投稿

Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities

Mod-Guide:一种基于LLM的内容审核反馈系统,用于解决针对原住民及少数族裔宗教群体的不敏感言论

Dipto Das, Achhiya Sultana, Ankit Singh Chauhan, Saadia Binte Alam, Mohammad Shidujaman, Shion Guha, Sunandan Chakraborty, Syed Ishtiaque Ahmed

AI总结 本文研究LLM审核系统对孟加拉国印度教和查克玛社区不敏感言论的认知局限,通过共同构建文化语料库和检索增强生成(RAG)方法开发Mod-Guide工具,提升模型对少数群体观点的敏感性。

AI中文摘要

语言既是边缘化的机制,也是抵抗的机制,尤其是对于在网络上面对不敏感和有害言论的少数群体。随着内容审核越来越依赖大型语言模型(LLMs),人们开始担忧这些系统能否识别文化不敏感言论——即通过隐含的抹除、歪曲或规范性框架(而非公开敌意)忽视或边缘化历史上代表性不足社区的文化和宗教观点的言论。本文聚焦孟加拉国的印度教和查克玛社区——该国最大的宗教少数群体和原住民少数民族,研究了基于LLM的审核系统的认知局限,并探索融入少数群体视角的方法。我们与社区成员共同创建了一个文化敏感言论语料库,并使用检索增强生成(RAG)将他们的叙事整合到审核流程中。我们的工具Mod-Guide通过利用源自生活经验的上下文线索,提升了LLM对少数群体观点的敏感性。通过涉及少数群体和多数群体参与者的混合方法评估,我们证明RAG增强的审核响应在上下文上更准确,且不同族群对其感知存在差异。这项工作通过在前台化内容审核系统设计中的修复正义和诠释学包容,推进了人机交互、AI伦理和社会计算领域的研究。

英文摘要

Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech: language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing, rather than overt hostility. Focusing on Bangladesh's Hindu and Chakma communities (the country's largest religious and Indigenous ethnic minorities, respectively), this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.

2606.13610 2026-06-12 cs.CL cs.AI 交叉投稿

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

一个被污染的页面就够了:评估生成式推荐系统中的网页内容污染

Minghao Luo, Liang Chen

发表机构 * The Chinese University of Hong Kong(香港中文大学)

AI总结 本研究提出FORGE基准,评估搜索增强LLM在检索结果被污染时推荐虚假产品的脆弱性,发现单个污染页面即可导致高达27%的推荐错误率,且推理能力无法缓解此问题。

AI中文摘要

搜索增强的大语言模型通过检索实时网页内容越来越多地介入日常消费者推荐。这带来了新的风险:生成式推荐系统可能消费被污染的网页内容,例如旨在误导推荐的虚假评论和推广页面。我们的问题是:在消费被污染的检索结果时,搜索增强的LLM在多大程度上会成为虚假产品的无意推广者?为此,我们引入FORGE(生成环境中的虚假在线推荐),这是一个在受控网页内容污染下衡量虚假产品推荐的基准。给定上游搜索结果,FORGE将检索到的网页中的真实产品本地重写为虚假产品,以模拟网页内容污染,并测量LLM推荐虚假产品的频率。FORGE涵盖15个类别和5个消费者场景下的225个真实世界产品。在12个商业和开放权重LLM中,所有模型都易受影响:单个被污染的页面即可导致高达27%的被欺骗率,而完全替换前三个结果则将此比例提升至73.8%。不同类别间的脆弱性差异显著,当模型缺乏相关产品的稳定先验知识时,脆弱性增加。推理并不能缓解这种脆弱性;相反,它常常生成虚假的社会证明来为错误推荐辩护。我们评估了三种防御措施:怀疑提示,以及分别基于模型先验和跨文档证据的共识过滤。怀疑可能加剧脆弱性,类似于推理,而过滤则可能抑制合法产品。我们在以下网址发布FORGE:https://github.com/leoluolol/forge-benchmark。

英文摘要

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.
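The measurement described in this abstract has two parts: rewriting a real product into a fake one in some of the retrieved pages, then counting how often the model recommends the fake product. A minimal sketch under loud assumptions: plain string substitution stands in for FORGE's rewriting step, substring matching stands in for its judgement of a recommendation, and both function names are hypothetical.

```python
def pollute(pages: list[str], real: str, fake: str, n_polluted: int) -> list[str]:
    """Replace the real product name with a fake one in the first n_polluted pages.

    Assumption: simple string replacement; the benchmark's actual rewriting
    procedure is not specified in the abstract.
    """
    return [
        page.replace(real, fake) if i < n_polluted else page
        for i, page in enumerate(pages)
    ]


def fooled_rate(responses: list[str], fake: str) -> float:
    """Fraction of model responses that mention (here: recommend) the fake product."""
    if not responses:
        return 0.0
    return sum(fake in response for response in responses) / len(responses)
```

With `n_polluted=1` this corresponds to the single-polluted-page condition, and with `n_polluted=3` over three retrieved pages to the full top-3 replacement.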

2601.14295 2026-06-12 cs.AI cs.CL cs.CY 版本更新

Epistemic Constitutionalism Or: how to avoid coherence bias

认知宪政主义:或如何避免一致性偏见

Michele Loi

AI总结 本文提出AI应建立明确的认知宪法,通过规范源归因等元规范避免一致性偏见,并论证自由主义路径优于柏拉图式路径。

Comments 27 pages, 7 tables. Data: github.com/MicheleLoi/source-attribution-bias-data and github.com/MicheleLoi/source-attribution-bias-swiss-replication. Complete AI-assisted writing documentation: github.com/MicheleLoi/epistemic-constitutionalism-paper

AI中文摘要

大型语言模型日益扮演着人工推理者的角色:它们评估论点、分配可信度并表达信心。然而,它们的信念形成行为受隐式、未经审查的认知策略支配。本文主张为AI建立一部认知宪法:明确的、可争议的元规范,用于调节系统如何形成和表达信念。源归因偏见提供了动机案例:我表明前沿模型强制执行身份-立场一致性,惩罚归因于其预期意识形态立场与论点内容冲突的源的论点。当模型检测到系统性测试时,这些效应消失,揭示系统将源敏感性视为需要抑制的偏见,而非一种需要良好执行的能力。我区分了两种宪政路径:柏拉图式路径,要求从特权立场出发的形式正确性和默认源独立性;自由主义路径,拒绝此类特权,指定保护集体探究条件的程序性规范,同时允许基于认知警觉的原则性源关注。我主张自由主义路径,勾勒出八项原则和四种取向的宪政核心,并提出AI认知治理需要与我们现在对AI伦理所期望的同样明确、可争议的结构。

英文摘要

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

2603.25450 2026-06-12 cs.AI 版本更新

Cross-Model Disagreement as a Label-Free Correctness Signal

跨模型分歧作为无标签正确性信号

Matt Gorbett, Suman Jana

发表机构 * Independent Researcher(独立研究者) Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系)

AI总结 提出跨模型分歧作为无标签正确性指标,通过验证模型对生成模型答案的困惑度或熵来检测错误,无需训练或标签,在多个基准上优于模型内不确定性方法。

AI中文摘要

在没有真实标签的情况下检测语言模型何时出错是安全部署的一个基本挑战。现有方法依赖于模型自身的不确定性——例如令牌熵或置信度分数——但这些信号在最危险的失败模式:自信错误(模型错误但确定)上会严重失效。在这项工作中,我们引入跨模型分歧作为正确性指标——一种简单、无需训练的信号,可以无需修改地插入现有的生产系统、路由管道和部署监控基础设施。给定模型生成的答案,跨模型分歧通过单次前向传递计算第二个验证模型在读取该答案时的惊讶或不确定性程度。不需要验证模型生成任何内容,也不需要正确性标签。我们将这一原则实例化为跨模型困惑度(CMP),它衡量验证模型对生成模型答案令牌的惊讶程度,以及跨模型熵(CME),它衡量验证模型在这些位置的不确定性。CMP和CME在涵盖推理、检索和数学问题求解(MMLU、TriviaQA和GSM8K)的基准测试中均优于模型内不确定性基线。在MMLU上,CMP的平均AUROC为0.75,而模型内熵基线为0.59。这些结果确立了跨模型分歧作为一种实用的、无需训练的无标签正确性估计方法,可直接应用于部署监控、模型路由、选择性预测、数据过滤和生产语言模型系统的可扩展监督。

英文摘要

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
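The two signals in this abstract can be computed from a single forward pass of the verifier over the generator's answer. A minimal sketch, assuming the standard definitions (perplexity as the exponential of the mean negative log-likelihood, entropy averaged over answer positions); the abstract does not give the exact formulas, and the function names and input layout are hypothetical.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_model_scores(verifier_logits: list[list[float]],
                       answer_token_ids: list[int]) -> tuple[float, float]:
    """Return (CMP, CME) for one answer.

    verifier_logits[t] holds the verifier's logits for predicting answer token t
    (assumed layout). CMP is the verifier's perplexity on the generator's tokens;
    CME is the verifier's mean predictive entropy at those positions.
    """
    if not answer_token_ids:
        raise ValueError("answer must contain at least one token")
    nll = 0.0
    entropy = 0.0
    for logits, token in zip(verifier_logits, answer_token_ids):
        logp = log_softmax(logits)
        nll -= logp[token]
        entropy -= sum(math.exp(lp) * lp for lp in logp)
    n = len(answer_token_ids)
    return math.exp(nll / n), entropy / n
```

Higher values of either score mean the verifier finds the answer more surprising or is less certain about it, which is the label-free signal the abstract proposes for flagging likely errors.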

2605.03847 2026-06-12 cs.AI 版本更新

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

机械良知:机器智能可信赖性的数学框架

Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin Yim

AI总结 提出机械良知(MC)概念,通过轨迹级规范过滤最小化修正基线策略,降低累积偏离,并处理认知不确定性,实现单智能体与分布式智能系统的可信赖性。

Comments 9 pages, 2 figures. Preprint

AI中文摘要

分布式协作智能(DCI),包括边缘到边缘架构、联邦学习、迁移学习和群体系统,创造了结构性不可避免的涌现风险环境:在不确定性下,个体智能体的局部正确决策会组合成全局不可接受的行为轨迹。现有方法如约束优化、安全强化学习和运行时保证在个体动作层面评估可接受性,而非跨行为轨迹,且均未解决DCI部署的多参与者、充满不确定性的特性。本文引入机械良知(MC),一种新颖概念和简化数学框架,为单智能体和分布式智能系统实现轨迹级规范调节。机械良知被定义为一个监督过滤器,最小化修正基线策略的动作,以减少与规范可接受区域的累积偏差,同时考虑认知不确定性。我们引入相关构造——良知分数、机械内疚和共振可信赖性——为该新兴领域提供可解释的词汇和可计算的治理信号。建立了核心理论性质:可接受性等价性、最优调节的存在性以及单调偏差减少。示例结果表明,MC调节的智能体在传统控制器漂移到可接受边界之外的情况下保持轨迹级规范可接受性,并且该框架自然扩展到抑制多智能体DCI设置中交互引发的涌现风险。

英文摘要

Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce three associated constructs (conscience score, mechanical guilt, and resonant dependability) that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.
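The idea of a supervisory filter that minimally corrects a baseline action can be illustrated with a toy case. The sketch below assumes a one-dimensional state with additive dynamics and an interval as the admissible region, with a safety margin standing in for epistemic uncertainty; none of these modelling choices, nor the function names, come from the paper.

```python
def mc_filter(state: float, baseline_action: float,
              lo: float, hi: float, margin: float = 0.0) -> float:
    """Return the action closest to baseline_action that keeps the next state
    inside [lo + margin, hi - margin] (toy dynamics: next = state + action)."""
    target = min(max(state + baseline_action, lo + margin), hi - margin)
    return target - state


def cumulative_deviation(states: list[float], lo: float, hi: float) -> float:
    """Sum over the trajectory of each state's distance outside [lo, hi]."""
    return sum(max(lo - x, 0.0, x - hi) for x in states)


def rollout(n_steps: int, drift: float, lo: float, hi: float,
            regulated: bool) -> list[float]:
    """Simulate a baseline policy that always applies `drift`, with or without the filter."""
    x, states = 0.0, []
    for _ in range(n_steps):
        action = mc_filter(x, drift, lo, hi) if regulated else drift
        x += action
        states.append(x)
    return states
```

In this toy setting an unregulated baseline with constant drift leaves the interval and accumulates deviation, while the filtered rollout stops at the boundary and leaves the baseline action untouched whenever it is already admissible.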

2605.27628 2026-06-12 cs.AI cs.CY cs.ET cs.MA cs.SY eess.SY 版本更新

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

智能作为受管自主:代理型AI系统的失败、升级与治理

Srini Ramaswamy

AI总结 本文提出SMARt模型,通过形式化能力检测认知漂移、暂停推理、尝试恢复并在可靠性下降时放弃控制,以解决自主AI系统中的幻觉和持续不合理行为问题。

Comments This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems

详情
AI中文摘要

随着自主和代理型AI系统在机器人和人机环境中的规模扩大,管理幻觉和持续但不合理的行动仍然是一个开放挑战。本文并未将这些失败仅仅归因于模型或对齐限制,而是探讨了无界自主性的架构脆弱性——即假设代理应在不确定性上升时继续运行的预设。本文引入了一种受管自主理论,通过形式化能力来定义智能行为:检测认知漂移、暂停推理、尝试恢复,并在可靠性下降时最终放弃控制。我们通过SMARt(具有受管/撤销转换的自管理多层自主推理)模型实例化该理论,该模型是一个四层框架,包含稳定、元认知、辅助和受管状态。通过开发定时、受保护的Petri网形式化,我们建立了系统的理论有界属性,展示了架构如何形式化地强制升级、约束无效输出,并确保在指定条件下的治理可达性。我们进一步分析了如何在不同的操作环境(例如医疗、机器人等)中结合特定领域的触发集,在满足完备性和健全性标准的前提下系统地维护安全性。由于这些触发被设计为自适应的,SMARt模型允许代理操作范围随时间安全、受控地扩展。我们得出结论,在自主生命周期内形式化失败管理是实现可靠且受治理人工智能的关键一步。

英文摘要

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy: the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.
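The four states named in this abstract suggest an escalation ladder. The sketch below is a plain finite-state machine, not the timed, guarded Petri net the paper develops; the single uncertainty threshold, the one-step recovery rule, and the choice to make the Regulated state absorbing are all assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    STABLE = 0
    META_COGNITIVE = 1
    ASSISTED = 2
    REGULATED = 3


def next_tier(tier: Tier, uncertainty: float, threshold: float = 0.5) -> Tier:
    """Escalate one tier when uncertainty exceeds the threshold, otherwise recover one tier.

    Assumption: REGULATED is absorbing, modelling surrendered control that the
    agent cannot reclaim on its own.
    """
    if tier is Tier.REGULATED:
        return tier
    if uncertainty > threshold:
        return Tier(tier.value + 1)
    return Tier(max(tier.value - 1, 0))


def run(uncertainties: list[float], threshold: float = 0.5) -> Tier:
    """Apply next_tier over a sequence of uncertainty readings, starting from STABLE."""
    tier = Tier.STABLE
    for u in uncertainties:
        tier = next_tier(tier, u, threshold)
    return tier
```

Under these assumptions, three consecutive high-uncertainty readings hand control over permanently, while a brief spike followed by low readings returns the agent to the Stable state.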

2507.07947 2026-06-12 cs.LG cs.AI 版本更新

Reconstructing Template-Memorized Images from Natural Prompts

从自然提示中重建模板记忆的图像

Sol Yarkoni, Mahmood Sharif, Roi Livni

发表机构 * School of Electrical & Computer Engineering(电气与计算机工程学院) School of Computer Science & AI(计算机科学与人工智能学院) Tel Aviv University(特拉维夫大学)

AI总结 提出一种低资源攻击方法,利用模板化电商数据中的模式,从自然提示中重建训练集中的记忆图像,揭示隐私风险。

详情
AI中文摘要

生成模型(如扩散模型)的最新进展引发了与隐私、版权侵犯和数据管理相关的担忧。为了更好地理解和控制这些风险,先前的工作引入了从训练数据中重建图像或部分图像的技术和攻击。虽然这些结果表明训练数据可以被恢复,但现有方法通常依赖于高计算资源、对训练集的部分访问或精心设计的提示。在这项工作中,我们提出了一种新的攻击,该攻击需要低资源,假设对训练数据几乎没有或完全没有访问权限,并识别出看似良性的提示,这些提示可能导致潜在有风险的图像重建。我们进一步表明,即使对于没有专业知识的用户,这种重建也可能无意中发生。例如,我们观察到,对于某个现有模型,提示“蓝色男女通用T恤”会生成一个真实个体的面部。此外,通过将已识别的漏洞与真实世界的提示数据相结合,我们发现了能够重现记忆视觉元素的提示。我们的方法建立在先前工作的见解之上,并利用领域知识来揭示由于使用抓取的电商数据而产生的基本漏洞,其中模板化布局和图像与模式化的文本提示紧密相关。我们的攻击代码公开于 https://github.com/TheSolY/lr-tmi。

英文摘要

Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt "blue Unisex T-Shirt" generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.

2511.04260 2026-06-12 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet:面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学)

AI总结 提出Proto-LeakNet,利用扩散模型中的信号泄漏痕迹,结合闭集分类与密度开集评估,实现可解释的生成器归因,在闭集上训练后对未见生成器也有效。

Comments 44 pages, 27 figures, 11 tables

AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明,扩散管道会在其输出中无意中留下持久的统计痕迹,称为信号泄漏,特别是在潜在表示中。基于这一观察,我们提出了Proto-LeakNet,一个信号泄漏感知且可解释的归因框架,它将闭集分类与在学习到的嵌入上进行的基于密度的开集评估相结合,从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域,重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征,而一个特征加权原型头则结构化嵌入空间并实现透明的归因。Proto-LeakNet仅在闭集数据上训练,达到98.13%的宏AUC,其学习到的潜在几何结构在后处理下保持鲁棒,超越了最先进的方法,并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取:https://github.com/claudiunderthehood/Proto-LeakNet。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.
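Combining closed-set attribution with open-set rejection can be illustrated on embeddings alone. The sketch below assigns an embedding to its nearest class prototype and rejects it as unseen when even the nearest prototype is too far away. A fixed Euclidean distance threshold is swapped in here for the density-based evaluation the abstract describes, so this is a simplified stand-in rather than the paper's method; the function name and labels are hypothetical.

```python
import math


def attribute(embedding: list[float],
              prototypes: dict[str, list[float]],
              reject_distance: float) -> str:
    """Return the label of the nearest prototype, or "unseen" if it is too far.

    Assumption: one prototype vector per known generator and a single global
    distance threshold, in place of a learned density estimate.
    """
    label, distance = min(
        ((name, math.dist(embedding, proto)) for name, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return label if distance <= reject_distance else "unseen"
```

Because the decision is a distance to a named prototype, each attribution can be explained by pointing at the prototype that won and how close the sample was to it.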

2512.15134 2026-06-12 cs.LG cs.AI cs.CL 版本更新

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

从孤立到纠缠:可解释性方法何时识别和解缠已知概念?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

发表机构 * Boston University(波士顿大学) Harvard University(哈佛大学) Mila – Quebec AI Institute(魁北克AI研究所) Goodfire(Goodfire公司)

AI总结 本文提出多概念评估框架,研究稀疏自编码器和探针等方法是否真正解缠概念,发现特征通常只对单一概念敏感,但概念分布在多个特征上,且干预特征常影响多个概念,表明相关性指标不足以证明干预选择性。

Comments ACL 2026

AI中文摘要

可解释性的一个目标是从神经网络的激活中恢复潜在概念(特征)的解缠表示。特征的质量通常孤立地评估,并在可能不成立的隐式独立性假设下进行。因此,尚不清楚常见的特征化方法(如稀疏自编码器(SAE)和探针)在多大程度上将一个概念与另一个概念解缠。我们提出了一个多概念评估设置,使用包括情感、领域、语态和时态在内的概念。我们评估特征化器产生每个概念的解缠表示的效果,观察到特征通常只对单一概念敏感,但概念分布在许多特征上。然后,我们干预这些特征,测量每个概念是否可独立操控,以及特征是否相互作用。即使在理想化设置中,干预一个特征通常会影响多个概念,尽管几乎没有交互效应。这些结果表明,相关性指标不足以建立干预选择性,并且证明两个特征在分离空间中运行不足以声称它们将对一个概念具有选择性。这些结果强调了可解释性研究中多概念评估的重要性。

英文摘要

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

2602.00343 2026-06-12 cs.DC cs.AI cs.PF 版本更新

Standardized Methods and Recommendations for Green Federated Learning

绿色联邦学习的标准化方法与建议

Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru

发表机构 * Children’s National Hospital(儿童医院) NVIDIA(英伟达) Children’s National Hospital, George Washington University(儿童医院、乔治华盛顿大学)

AI总结 提出基于NVFlare和CodeCarbon的联邦学习碳核算方法,通过实验验证系统慢速和协调效应可显著增加碳排放,强调标准化碳核算对可复现绿色FL评估的必要性。

AI中文摘要

联邦学习(FL)能够在隐私敏感的分布式数据上进行协作模型训练,但由于不一致的测量边界和异构的报告方式,其环境影响难以跨研究进行比较。我们提出了一种实用的碳核算方法,用于FL的CO2e跟踪,使用NVIDIA NVFlare和CodeCarbon进行显式的、阶段感知的任务(初始化、每轮训练、评估和空闲/协调)。为了捕捉非计算效应,我们还根据网络可配置能量模型,从传输的模型更新大小估计通信排放。我们在两个代表性工作负载上验证了所提出的方法:CIFAR-10图像分类和视网膜视盘分割。在CIFAR-10中,受控的客户端效率场景表明,在原本固定的FL协议下,系统级慢速和协调效应可能对碳足迹产生显著影响,相对于高效率基线,总CO2e增加了8.34倍(中等)和21.73倍(低)。在视网膜分割中,交换GPU层级(H100 vs. V100)产生了一致的1.7倍运行时间差距(290 vs. 503分钟),同时在不同站点间总能量和CO2e的变化不均匀,强调了按站点和按轮报告的必要性。总体而言,我们的结果支持一种标准化的碳核算方法,作为可复现的“绿色”FL评估的前提。我们的代码可在以下网址获取:https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint。

英文摘要

Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.
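Estimating communication emissions from model-update sizes reduces to a chain of unit conversions: bytes moved, energy per unit of data, and carbon intensity of that energy. A minimal sketch, assuming each client uploads one update and downloads one global model of the same size per round; the function name, the traffic pattern, and every numeric parameter in the usage note are hypothetical and not taken from the paper. The last line only re-derives the runtime gap from the two runtimes quoted in the abstract.

```python
def communication_co2e_kg(update_bytes: int, n_clients: int, n_rounds: int,
                          kwh_per_gb: float, kg_co2e_per_kwh: float) -> float:
    """Estimate communication emissions for a federated run.

    Assumption: per round, every client sends one update and receives one
    global model of equal size, hence the factor of 2.
    """
    gigabytes = 2 * update_bytes * n_clients * n_rounds / 1e9
    energy_kwh = gigabytes * kwh_per_gb          # network energy model (configurable)
    return energy_kwh * kg_co2e_per_kwh          # grid carbon intensity


# Runtime gap quoted in the abstract: 503 minutes (V100) vs. 290 minutes (H100).
runtime_ratio = 503 / 290
```

For example, a 500 MB update exchanged by 4 clients over 10 rounds moves 40 GB; at an assumed 0.1 kWh/GB and 0.4 kg CO2e/kWh that is about 1.6 kg CO2e. The runtime ratio evaluates to roughly 1.73, consistent with the 1.7x gap reported.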

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE 版本更新

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

多轮交互中的安全隐患:工具使用智能体的多轮安全风险基准与防御

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出多轮工具使用安全基准MT-AgentRisk,发现多轮设置下攻击成功率平均增加16%,并设计无训练、与工具无关的自探索防御方法ToolShield,平均降低30%攻击成功率。

详情
AI中文摘要

基于LLM的智能体能力日益增强,但其安全性滞后。这造成了智能体能够做什么和应该做什么之间的差距。随着智能体进行多轮交互并使用多样化的工具,这一差距扩大,引入了现有基准忽视的新风险。为了系统地将安全测试扩展到多轮、工具真实的设置,我们提出一个原则性的分类法,将单轮有害任务转化为多轮攻击序列。利用该分类法,我们构建了MT-AgentRisk(多轮智能体风险基准),这是首个评估多轮工具使用智能体安全性的基准。我们的实验揭示了显著的安全退化:在开放和封闭模型的多轮设置中,攻击成功率(ASR)平均增加16%。为了缩小这一差距,我们提出了ToolShield,一种无需训练、与工具无关的自我探索防御方法:当遇到新工具时,智能体自主生成测试用例,执行它们以观察下游效果,并提炼安全经验用于部署。实验表明,ToolShield在多轮交互中平均有效降低ASR 30%。我们的代码可在以下网址获取:https://github.com/CHATS-lab/ToolShield。

英文摘要

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

2604.16548 2026-06-12 cs.CR cs.AI cs.CL 版本更新

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

LLM智能体中长期记忆安全综述:跨记忆生命周期的攻击、防御与治理

Zehao Lin, Xixuan Hao, Renyu Fu, Shaobo Cui, Kai Chen, Chunyu Li, Zhiyu Li, Feiyu Xiong

发表机构 * MemTensor Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出记忆生命周期框架,系统分析LLM智能体长期记忆面临的新威胁,并引入可验证记忆治理(VMG)架构原语,强调存储时溯源与版本控制对安全的关键作用。

详情
AI中文摘要

LLM智能体中可写、跨会话持久记忆的出现,引入了与传统的以输入为中心的安全问题性质不同的威胁格局,其特点包括三个属性:持久性、状态性和传播性。为系统描述这一格局,我们提出记忆生命周期框架,该框架沿两个轴组织攻击、防御及其跨阶段依赖关系:六个生命周期阶段(写入、存储、检索、执行、共享与传播、遗忘与回滚)和四个安全目标(完整性、机密性、可用性、治理)。该分析进而揭示了在系统层面需要形式化安全保证,从而推动了可验证记忆治理(VMG)——一个由五个架构原语组成的框架,它规定了长期记忆系统必须提供哪些可验证机制,以维持对其记忆状态的可审计、可恢复控制。我们的分析表明,健壮的长期记忆(LTM)安全无法仅在检索或执行时进行事后补救,而必须从一开始就锚定于存储时的溯源、版本控制和策略感知的保留。

英文摘要

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

2605.08116 2026-06-12 cs.LG cs.AI 版本更新

The Safety-Aware Denoiser for Text Diffusion Models

文本扩散模型的安全感知去噪器

Amman Yusuf, Zhejun Jiang, Mijung Park

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出安全感知去噪器(SAD),在文本扩散模型的迭代去噪过程中引导生成文本进入安全区域,无需重训练即可实现灵活的安全约束,有效降低不安全生成同时保持生成质量。

Comments 28 pages, 12 figures. Code available at: https://github.com/ParkLabML/SAD

详情
AI中文摘要

最近关于文本扩散模型的工作为自回归生成提供了一种有前景的替代方案,但控制其安全性仍未被充分探索。现有的安全方法面向自回归模型,通常依赖于事后过滤或推理时干预。这些方法不足以有效解决文本扩散模型中的安全风险。我们提出了安全感知去噪器(SAD),一种文本扩散模型中的安全引导框架。SAD修改了迭代去噪过程,使得最终去噪步骤中的文本样本被引导至文本空间中可证明的安全区域。这种推理时方法可以将安全约束集成到去噪器中,避免了底层扩散模型的计算昂贵重训练,并实现了灵活、轻量级的安全引导。我们使用SAD评估生成文本的安全性,涉及危害分类、记忆和越狱。实验结果表明,SAD在保持生成质量、多样性和流畅性的同时,显著减少了不安全生成,优于现有方法。这些结果表明,我们在去噪过程中的安全引导为在文本扩散模型中实施安全提供了一种有效且可扩展的机制。

英文摘要

Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

2605.25225 2026-06-12 cs.LG cs.AI 版本更新

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

Transformer场论:机制可解释性的响应理论方法

David N. Olivieri, Antonio F. Pérez Rodríguez

发表机构 * Universidade de Vigo(维戈大学) Independent Researcher(独立研究员)

AI总结 本文提出场论框架,将残差流视为深度-标记场,通过局部源插入、灵敏度场预测、经验格林函数响应和伴随变分问题来组织和预测Transformer激活修补干预,并在GPT-2风格自回归Transformer中验证了前向响应理论。

详情
AI中文摘要

机制可解释性通常使用激活修补、因果追踪、路径修补和引导方向来揭示Transformer激活空间中行为有意义的子空间。本文发展了一个场论框架来组织和预测此类干预。将残差流视为深度-标记场,我们将修补公式化为局部源插入,修补效应作为灵敏度场预测,下游传播作为经验格林函数响应,修补选择作为伴随变分问题。实验上,我们通过在GPT-2风格自回归Transformer中应用局部残差场干预并观察诱导的残差场差异和logit差异响应来测试前向响应理论。我们识别出有界的局部线性区域;从跨残差站点的一阶灵敏度预测修补效应;测量跨深度和标记位置的结构化各向异性传播;从高灵敏度站点和切片格林算子构建响应描述;并表明提示诱导的残差位移可以传递答案行为。这些结果将响应对象(即灵敏度、传播场和格林算子切片)确立为组织修补实验的实用语言,以及制定修补站点推断和跨尺度迁移的前向数学基础。

英文摘要

Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.
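
The claim that first-order sensitivities predict patch effects only inside a bounded local linear regime can be illustrated on a toy function. In the sketch below, `logit_diff` is a hypothetical stand-in for a downstream logit difference and the gradient is taken by finite differences; none of this is the authors' code or a real Transformer.

```python
import math


def logit_diff(h):
    """Toy stand-in for a logit difference as a function of one residual vector."""
    w = [0.5, -1.0, 0.25]  # hypothetical readout weights
    return math.tanh(sum(wi * hi for wi, hi in zip(w, h)))


def sensitivity(f, h, eps=1e-6):
    """Central finite-difference gradient of f at h (the first-order sensitivity)."""
    grad = []
    for i in range(len(h)):
        up = list(h)
        up[i] += eps
        down = list(h)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def patch_effect(f, h, delta):
    """Actual change in f when the patch delta is added at this site."""
    return f([a + b for a, b in zip(h, delta)]) - f(h)


h = [0.2, 0.1, -0.4]
g = sensitivity(logit_diff, h)

small = [0.01, -0.02, 0.01]  # patch inside the local linear regime
large = [1.0, -2.0, 1.0]     # same direction, 100x larger

pred_small = sum(gi * di for gi, di in zip(g, small))
pred_large = sum(gi * di for gi, di in zip(g, large))
err_small = abs(pred_small - patch_effect(logit_diff, h, small))
err_large = abs(pred_large - patch_effect(logit_diff, h, large))
```

For the small patch the linear prediction is nearly exact, while for the large patch the saturating nonlinearity makes the first-order estimate badly overshoot, which is the sense in which the linear regime is bounded.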

2606.04009 2026-06-12 stat.ML cs.AI cs.LG 版本更新

Counterfactual Explanations for Deep Two-Sample Testing

深度双样本检验的反事实解释

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert

发表机构 * Hasso-Plattner-Institute, University of Potsdam(波茨坦大学哈索·普拉特纳研究所) Hasso Plattner Institute for Digital Health at Mount Sinai(西奈山哈索·普拉特纳数字健康研究所) Icahn School of Medicine at Mount Sinai(西奈山伊坎医学院)

AI总结 针对深度双样本检验,提出基于扩散自编码器和MMD优化的反事实解释框架,生成样本级编辑以揭示驱动假设拒绝的特征。

Comments 17 pages

详情
AI中文摘要

双样本检验是检测科学领域中分布差异的基本工具,但经典检验(包括基于核的检验)在高维结构化数据(如图像)上可能效果不佳。最近的深度双样本检验通过学习信息表示提高了这些场景下的灵敏度,但它们对哪些数据特征驱动拒绝原假设 $H_0$ 提供的洞察有限。为解决此问题,我们提出了一种用于深度双样本检验的反事实解释框架,该框架生成样本级编辑,将观测值从源组移向目标组,同时明确减少检验所测量的差异。我们的方法将扩散自编码器与预训练的深度双样本检验模型相结合,并在检验模型的表示空间中优化最大均值差异(MMD)目标,以生成合理的反事实。我们通过检验统计量和由此产生的双样本p值的变化来量化分布级效应。我们在合成2D形状数据集和两个MRI队列上评估了该方法。在这两种设置下,反事实变换相对于原始样本持续增加p值,表明编辑后的源集在检验下在统计上更接近目标分布。我们使用LPIPS测量最小性,以确保反事实保持接近原始样本。由此产生的编辑提供了与检测到的组差异相关的特征的可解释证据。在MRI上,局部变化与队列之间已知的解剖学差异一致。

英文摘要

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
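
To make the MMD objective concrete, here is a small sketch of a biased squared-MMD estimate with an RBF kernel. The kernel choice, bandwidth, and the 2D points are assumptions for illustration; the paper computes MMD in a pretrained test model's representation space, which is replaced here by raw coordinates.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
    kxx = sum(rbf(a, b, bandwidth) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy


source = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1)]
target = [(2.0, 2.0), (2.1, 1.9), (1.9, 2.2)]
edited = [(1.5, 1.6), (1.7, 1.5), (1.6, 1.4)]  # hypothetical counterfactual edits of source

before = mmd2_biased(source, target)
after = mmd2_biased(edited, target)
```

A counterfactual edit that moves source samples toward the target group should lower this statistic, which corresponds to the higher p-values the abstract reports for edited samples.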

2606.09073 2026-06-12 cs.LG cs.AI cs.CL 版本更新

A Unifying Lens on Reward Uncertainty in RLHF

RLHF中奖励不确定性的统一视角

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind

AI总结 本文提出使用分布奖励模型统一RLHF中的悲观主义方法,通过闭式有效奖励公式连接现有启发式方法,并揭示其隐含假设。

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)受限于奖励破解(reward hacking),即策略利用代理奖励模型(RM)中的错误,获得高RM分数却没有真正的质量提升。一种自然的缓解方法是悲观主义:在RM不确定的区域降低奖励。然而,标准标量RM没有提供原则性的不确定性概念。我们认为正确的对象是分布奖励模型$p(r\mid x,y)$。在贝叶斯推断或KL分布鲁棒优化(KL-DRO)视角下,KL正则化的RLHF目标具有闭式有效奖励$\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$。悲观分支统一了RM集成聚合的先前启发式方法:均值聚合、最坏情况优化(WCO)和不确定性加权优化(UWO)都作为该单一表达式的极限或截断出现。这也澄清了每个现有规则的隐含假设。

英文摘要

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
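
The pessimistic branch of the closed form, $-\beta\log\mathbb{E}_p[e^{-r/\beta}]$, can be checked numerically on a small ensemble. The sketch below treats three reward-model outputs as equally weighted samples from $p(r\mid x,y)$, which is an assumption made for illustration. Matching the large-$\beta$ limit to mean aggregation, the small-$\beta$ limit to WCO, and the second-order expansion to a mean-minus-variance rule of the UWO kind is a reading of the abstract, not the authors' derivation.

```python
import math


def pessimistic_reward(rewards, beta):
    """Compute -beta * log(mean(exp(-r / beta))) with a stable log-sum-exp."""
    scaled = [-r / beta for r in rewards]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return -beta * log_mean_exp


ensemble = [1.0, 2.0, 4.0]  # hypothetical reward-model outputs for one (x, y)
mean_r = sum(ensemble) / len(ensemble)
var_r = sum((r - mean_r) ** 2 for r in ensemble) / len(ensemble)

large_beta = pessimistic_reward(ensemble, beta=1e4)   # approaches the ensemble mean
small_beta = pessimistic_reward(ensemble, beta=1e-3)  # approaches the ensemble minimum
mid_beta = pessimistic_reward(ensemble, beta=10.0)
second_order = mean_r - var_r / (2 * 10.0)            # cumulant expansion truncated at the variance
```

Sweeping `beta` therefore interpolates between averaging the ensemble and trusting only its most pessimistic member, with the variance penalty appearing as the first correction to the mean.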

9. 评测、基准与数据集 48 篇

2606.12451 2026-06-12 cs.AI cs.IR cs.LG 新提交

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

ToolSense: 审计LLM中参数化工具知识的诊断框架

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

发表机构 * SAP Labs(SAP实验室)

AI总结 提出ToolSense诊断框架,自动生成三类基准测试,揭示参数化工具检索中知识-检索分离现象,发现模型在模糊查询下性能显著下降。

详情
AI中文摘要

在大型工具目录上作为智能体部署的大型语言模型面临关键的工具检索瓶颈。由于基于嵌入的检索方法依赖于可能无法充分捕获专用工具语义的紧凑编码器,参数化工具检索通过将每个工具编码为附加到LLM词汇表的虚拟令牌来解决这一问题,经过两个阶段(记忆然后检索SFT)的微调,将LLM用作检索器,在标准ToolBench检索基准上取得了强劲性能。然而,这些基准使用冗长、完全指定的查询,并且其评估应用了将输出限制为有效令牌路径的约束解码,二者都无法揭示模型是否真正理解其工具。我们引入了ToolSense,一个开源LLM驱动的诊断框架,它将任何工具目录作为输入,并自动生成三个基准:具有三个模糊级别查询的现实检索基准(RRB)、MCQ探测基准和QA探测基准。将ToolSense应用于ToolBench(约47k个工具)并评估五个参数化模型训练配置,揭示了知识-检索分离:在RRB查询上,与完全指定的ToolBench基准相比,几个配置下降了约50-64个百分点,低于嵌入模型基线。此外,尽管检索性能强劲,一些模型在事实探测上得分接近随机,表明存在知识-检索分离。我们在 https://github.com/SAP/toolsense 上开源了ToolSense框架和ToolBench诊断基准。

英文摘要

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
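
A toy sketch of why constrained decoding can hide what a model actually prefers: when each step may only choose among tokens on a valid path, the output is always a valid tool even if the unconstrained choice would be something else. The trie layout, token names, and scores below are hypothetical and are not taken from ToolBench or ToolSense.

```python
def build_trie(sequences):
    """Build a nested-dict trie over the valid token paths."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie


def constrained_greedy(step_scores, trie):
    """Greedy decode, picking at each step the best-scoring token the trie allows."""
    node, out = trie, []
    for scores in step_scores:
        if not node:  # reached the end of a valid path
            break
        tok = max(node, key=lambda t: scores.get(t, float("-inf")))
        out.append(tok)
        node = node[tok]
    return out


valid_paths = [
    ["<tool>", "weather", "forecast"],
    ["<tool>", "weather", "alerts"],
    ["<tool>", "maps", "route"],
]
trie = build_trie(valid_paths)

# Hypothetical per-step model scores; the top-scoring token is often not a valid one.
step_scores = [
    {"<tool>": 0.1, "the": 0.9},
    {"weather": 0.2, "maps": 0.1, "banana": 0.7},
    {"forecast": 0.3, "alerts": 0.4, "route": 0.9},
]
decoded = constrained_greedy(step_scores, trie)
unconstrained = [max(s, key=s.get) for s in step_scores]
```

Here the constrained output is a well-formed tool path, while the unconstrained argmax at every step is an invalid token, so evaluating only the constrained output says little about whether the model's own preferences point at the right tool.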

2606.12730 2026-06-12 cs.AI cs.CL cs.CY cs.LG 新提交

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

重新思考LLMs的心理测量评估:自我报告何时以及为何能预测行为

Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

发表机构 * Caltech(加州理工学院) UIUC(伊利诺伊大学厄巴纳-香槟分校) University of Cambridge(剑桥大学)

AI总结 研究对比大五人格与计划行为理论,发现LLMs的自我报告-行为一致性存在选择性:在共享对话中TPB达到人类水平,跨对话仅对锚定于训练的行为保持一致性,且角色提示不能使行为对齐。

Comments Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

详情
AI中文摘要

从低成本心理测量探针预测LLM行为倾向对于安全部署至关重要,但前提是自我报告(SR)能可靠地预测行为。近期研究记录了LLMs中显著的SR-行为分离,但依赖于广泛的人格特质(大五),这些特质即使在人类中也只能弱预测特定行为。此外,对话会话的隔离加上弱上下文匹配使得以下问题悬而未决:LLMs是否真正缺乏一致性,或者检测这种一致性所需的条件是否未满足。我们将大五与计划行为理论(TPB)进行对比,后者测量针对特定行为的意图,并且比广泛特质能更好地预测人类行为。我们在四个行为任务和11个前沿LLM上进行实验,同时改变会话上下文和身份诱导。我们发现SR-行为一致性存在但具有选择性。1) 在共享对话中,计划行为理论达到人类水平的一致性;大五则没有。2) 在跨对话中,一致性仅对锚定于即时提示之外的行为(如由训练塑造的内隐偏见)幸存,而当行为被上下文强烈启动(如谄媚)时则崩溃。3) 角色提示使自我报告在对话间更一致,但并未使行为对齐。这些发现表明,粗糙的人格框架(如大五)可能不是测试部署行为的最佳工具。需要更多任务和特定行为的工具,并且即使这些工具也必须在任务和上下文中进行评估。

英文摘要

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

2606.12736 2026-06-12 cs.AI cs.LG 新提交

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

跨尺度科学挑战的AI智能体基准测试

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

发表机构 * Yale University(耶鲁大学) Broad Institute of MIT and Harvard(布罗德研究所) The Pennsylvania State University(宾夕法尼亚州立大学) Northeastern University(东北大学) Northwestern University(西北大学)

AI总结 提出SciAgentArena基准,含约200个交互式任务,评估AI智能体在真实科研场景中的能力,发现其在数据分析中有效,但在创新探索和开放问题上表现不均。

Comments 6 figures

详情
AI中文摘要

AI智能体正被越来越多地开发用于加速科学发现,但它们在真实研究环境中的实际能力仍知之甚少。现有的AI智能体基准很少捕捉科学工作所需的复杂性、异质性和扩展推理,而科学任务的基准通常将研究简化为静态、直接的问题,并对交互式评估支持有限。在此,我们引入SciAgentArena,这是一个系统性的基准,用于评估AI智能体在来自多个领域新兴需求的真实科学研究场景中的表现。SciAgentArena包含约200个具有逐步验证的任务,以及一个交互式、与智能体无关的环境,用于评估不同的AI智能体。使用该基准,我们发现当前智能体能够有效贡献于明确指定的数据分析工作流,特别是当任务结构和评估标准清晰时。然而,它们在科学情境中的表现仍然不均衡:智能体难以产生真正新颖的见解,维持自主探索,并为开放的研究问题制定稳健的解决方案。我们进一步描述了智能体常见的失败模式,并识别了提高其可靠性、自主性和科学推理能力的机会。总之,SciAgentArena提供了一个实用的框架,用于衡量AI智能体在科学领域的进展,并指导未来能够应对复杂科学挑战的智能体设计。完整代码、任务和数据集可通过此链接访问:https://sciagentarena.github.io/。

英文摘要

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. Full codes, tasks, and datasets can be accessed via this link: https://sciagentarena.github.io/.

2606.12767 2026-06-12 cs.AI 新提交

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

构建程序性推理评估数据集:平衡自然性、基础性和多跳覆盖

Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 研究基于任务-方法-知识(TMK)模型的问题生成策略对程序性和多跳推理数据集质量的影响,提出基础性验证框架,发现严格TMK生成策略在基础性和可用性上最优。

Comments 10 pages, 2 numbered figures. Workshop submission to HAIL @ AIED 2026

详情
AI中文摘要

评估AI辅助学习系统中的程序性推理需要问答数据集,这些数据集既要像学习者一样,又要基于系统预期使用的教学知识。我们研究了基于TMK的问题生成策略如何影响程序性和多跳推理的数据集质量。我们比较了三种策略:从任务-方法-知识(TMK)模型严格生成、先转录后基于TMK过滤的生成、以及结合转录和结构化指导的TMK感知生成。为了评估生成的项目,我们引入了一个基于从TMK模型中提取的闭集证据单元的基础性验证框架。该框架衡量答案是否由底层表示支持、问题是否自包含、以及是否针对多跳程序性推理。在23个教学主题和690个生成的问答对中,严格TMK生成实现了最强的整体质量,其中96.5%的问题有基础,92.6%的问题可用。先转录生成产生更像学习者的问题,但更多是上下文依赖或基础薄弱的问题,而TMK感知生成产生较高的原始多跳覆盖率但基础性较低。这些结果表明,程序丰富性和自然措辞并不能保证表示基础性,这促使在AI辅助学习中的评估数据集需要进行显式的表示感知验证。

英文摘要

Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

2606.12809 2026-06-12 cs.AI cs.LG 新提交

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

MLUBench: 多模态大语言模型终身遗忘评估基准

He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MLUBench基准,评估多模态大模型在连续遗忘请求下的性能,发现现有方法存在累积退化,并揭示多模态对齐保持的挑战,提出LUMoE方法缓解退化。

Comments 36 pages, accepted to the ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在海量多模态数据上训练,使得数据遗忘变得越来越重要,因为数据所有者可能要求移除特定内容。实际上,这些请求通常随时间顺序到达,引发了MLLM终身遗忘这一具有挑战性的问题。然而,现有大多数基准在规模和范围上有限,未能捕捉MLLM终身遗忘的复杂性。为填补这一空白,我们引入了MLUBench,一个大规模、全面的基准,包含9个类别下的127个实体,用于终身遗忘请求。我们使用MLUBench进行了大量实验,揭示出现有遗忘方法遭受严重且累积的退化。更重要的是,我们进一步识别出该问题的独特挑战:与单模态模型不同,MLLM终身遗忘受到保持多模态对齐需求的约束。持续从一种模态遗忘可能会退化整个模型。为缓解这一挑战,我们提出了LUMoE,一种有效方法。实验表明,LUMoE显著缓解了基线方法面临的退化问题。源代码和MLUBench数据集已在 https://github.com/lihe-maxsize/Lifelong_Unlearning_main 开源。

英文摘要

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM Lifelong Unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced at https://github.com/lihe-maxsize/Lifelong_Unlearning_main.

2606.12821 2026-06-12 cs.AI cs.ET 新提交

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

GeoNatureAgent Benchmark:面向前沿与开源基础模型的环境地理空间分析LLM智能体基准测试

Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez, Devika Jain

发表机构 * Universidad Católica de Ávila (UCAV)(阿维拉天主教大学) Johns Hopkins University(约翰霍普金斯大学) Independent Researcher(独立研究者) Center for Geographic Analysis, Harvard University(哈佛大学地理分析中心)

AI总结 提出首个通过结构化工具调用真实API评估环境分析智能体的基准,包含93个任务,发现Claude Sonnet 4领先,但开源模型在成本效益上占优,且比较任务普遍未解决。

Comments Preprint. 10 pages, 8 figures. Submitted to ACM SIGSPATIAL 2026

详情
AI中文摘要

环境科学家在数据整理而非分析上花费了不成比例的精力,而自动化地理空间工作流的AI智能体仍未得到验证:没有基准通过结构化工具调用评估智能体对真实API的操作。我们引入了GeoNatureAgent Benchmark,这是首个通过结构化工具调用生产级地理空间API进行环境分析智能体的基准。它包含18个类别的93个任务,涵盖市政分析、多轮对话、空间推理、跨指标综合、错误处理与恢复、排序、比较、多语言理解、栖息地分析和任务拒绝。任务通过一个开放、可自托管的API进行评估,该API通过16个工具提供西班牙和葡萄牙的三个环境指标。我们评估了七个LLM(Claude Sonnet 4、DeepSeek V3.2、GLM-5、Gemini 2.5 Pro、Qwen3-235B、GPT-OSS-120B、Llama 4 Scout),在三个温度1.0的随机种子下,报告能力与每案例成本作为正交轴。我们发现:(1)Claude Sonnet 4以60.8%±0.8%领先,其次是DeepSeek V3.2的56.3%±3.1%,其他模型均未超过51%;(2)成本-准确率帕累托前沿主要由开源模型占据,DeepSeek V3.2以11倍低的成本(每案例0.011美元)提供Claude 93%的能力;(3)比较任务普遍未解决(接近值比较上为0%),暴露了系统性的推理限制;(4)针对真实API的结构化工具调用比通用GIS基准更具区分度,准确率低25-35个百分点。我们进一步展示了可扩展性,将葡萄牙的BigEarthNet V2土地覆盖与西班牙的CO2和侵蚀指标集成。该基准、工具集和可自托管API均已公开。

英文摘要

Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.
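
The cost-accuracy comparison can be reproduced as a small Pareto check. In the sketch below, the accuracies and the $0.011/case figure come from the abstract; Claude's per-case cost is derived by reading "11x lower" as an exact ratio, and the third entry is a purely hypothetical model added to show what a dominated point looks like.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower cost, higher accuracy)."""
    frontier = []
    for name, cost, acc in models:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for n, c, a in models
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


deepseek_cost = 0.011             # USD per case, from the abstract
claude_cost = 11 * deepseek_cost  # assumption: "11x lower cost" taken as an exact ratio

models = [
    ("Claude Sonnet 4", claude_cost, 60.8),
    ("DeepSeek V3.2", deepseek_cost, 56.3),
    ("hypothetical-model", 0.05, 50.0),  # illustrative only, not from the benchmark
]
frontier = pareto_frontier(models)

# 56.3 / 60.8 is about 0.926, which the abstract rounds to 93% of Claude's capability.
relative_capability = round(100 * 56.3 / 60.8)
```

Both reported models sit on the frontier because neither is cheaper and more accurate than the other, while the hypothetical entry is dominated by DeepSeek V3.2 on both axes.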

2606.12871 2026-06-12 cs.AI 新提交

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

DailyReport: 评估搜索智能体日常搜索任务能力的开放式基准

Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan(美团)

AI总结 提出DailyReport基准,包含150个开放式日常搜索任务和3546个级联评分标准,通过分解子任务和维度评估,揭示当前搜索代理系统仍未能满足用户期望。

详情
AI中文摘要

搜索代理(SAs)通常利用大型语言模型(LLMs)通过自主探索网络资源并将信息综合成全面响应来支持复杂的信息寻求任务。对于SAs的评估,先前的基准主要关注在真实用户场景中不太可能出现的专门任务。此外,它们依赖于粗略的任务级评分标准,通常限制了评估的可解释性。为弥补这一差距,我们引入了DailyReport,一个用于评估SA在日常搜索任务上能力的开放式基准。它包含150个开放式任务,配有3546个相关评分标准,捕捉了真实用户广泛讨论和及时的信息需求。每个任务被分解为子任务,并通过跨解缠维度的级联评分标准进行评估。通过级联性能归因和以用户为中心的聚合,我们为每个维度推导出高度可解释的分数,以及一个用户偏好分数。我们在17个代理系统上的结果表明,当前系统仍未能达到用户的期望。为促进未来研究,我们的数据集和代码已在 https://github.com/AGI-Eval-Official/DailyReport 公开。

英文摘要

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 新提交

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ:面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

AI总结 提出OpenMedQ,在14个数据集(约335万样本)上预训练医学视觉语言模型,在PathVQA上BLEU-1达75.9,超越562B参数的Med-PaLM M,并在8个未见医学分类任务上取得最高平均macro-F1(0.757)。

Comments Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

详情
AI中文摘要

我们提出OpenMedQ,一个在迄今为止最广泛的完全开放医学混合数据集上预训练的医学视觉语言模型:包含14个数据集,总计约335万预训练样本,涵盖病理学、放射学、显微镜和纯文本临床问答。OpenMedQ在PathVQA上达到最先进的BLEU-1(75.9),击败了参数多达562B(约大80倍)的Med-PaLM M变体,并在VQA-MED上匹配了最佳报告的BLEU-1(64.5)。其视觉编码器在相同的下游配方下迁移到8个未见过的医学分类基准,获得了最高的平均macro-F1(0.757),优于BiomedCLIP(0.745)、PMC-CLIP(0.745)、PubMedCLIP(0.746)和从头训练的基线(0.616)。我们公开了代码,并提供了一个交互式演示,作为社区的可复现基线。

英文摘要

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.

2606.13020 2026-06-12 cs.AI 新提交

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

SciR: 面向LLM科学推理的可控基准

Pierre Beckmann, Marco Valentino, Andre Freitas

发表机构 * Idiap Research Institute(Idiap研究所) EPFL(瑞士联邦理工学院) School of Computer Science, University of Sheffield(谢菲尔德大学计算机科学学院) University of Manchester(曼彻斯特大学) National Biomarker Centre, CRUK Manchester Institute(国家生物标志物中心,CRUK曼彻斯特研究所)

AI总结 提出SciR基准,通过形式对象生成可验证的多范式科学推理任务,并控制信息提取和推理难度两个维度,揭示LLM在科学推理中的弱点。

详情
AI中文摘要

科学推理中反复出现三种范式的推理形式:演绎、归纳和因果溯因。目前,在科学环境中可靠地评估LLM在这三种推理上的表现尚不可及:基于人工标注的科学基准成本高昂且缺乏机制性真值,而合成逻辑推理基准则不像真实的科学文档。我们引入了SciR,这是一个将多范式推理与可控科学渲染相结合的基准,以三个范式性科学问题为锚点。任务从形式对象(演绎树、归纳规则假设、因果图)生成,以保证可验证答案,然后通过每个轨道的领域调优体裁渲染成多文档科学论述。该构建使我们能够独立变化两个难度轴:提取推理所需关键信息的难度,以及原则性推理本身的难度。我们测试了六个模型。两个轴都对每个模型造成伤害,且其效应叠加。渲染甚至伤害了神经符号管道,后者将推理交给经过验证的求解器。这两个轴产生了每个模型的提取与推理轮廓:例如,像deepseek-r1这样的推理模型在推理轴上大多超过了非推理指令模型。据我们所知,SciR是第一个在提取和推理难度上具有参数化控制的多范式科学推理基准。

英文摘要

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

2606.13051 2026-06-12 cs.AI 新提交

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

AAbAAC:用于自身免疫信息抽取的标注语料库

Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet

发表机构 * Inserm, Université Paris Cité, U1163 Institut Imagine(法国国家健康与医学研究院、巴黎西岱大学、U1163 想象研究所) Inria, Inserm, Université Paris Cité, U1346 HeKA(法国国家信息与自动化研究所、法国国家健康与医学研究院、巴黎西岱大学、U1346 HeKA) Freelance researcher(自由研究员)

AI总结 针对自身免疫领域信息抽取性能不足,构建了包含115篇PubMed摘要的AAbAAC语料库,手动标注实体和关系,通过微调NER模型验证了其有效性。

详情
Journal ref
BioNLP 2026 - 25th Workshop on Biomedical Natural Language Processing, ACL, Jul 2026, San Diego (CA), United States
AI中文摘要

尽管深度学习和大型语言模型推动了信息抽取的进步,但在高度专业化的生物医学领域,领域特异性复杂性对通用模型构成挑战,性能差距仍然存在。本文聚焦自身免疫领域,其中主要关注实体包括自身免疫疾病、自身抗体(即可能标记或导致这些疾病的分子)、其分子靶点、在体内的位置以及相关临床体征。我们提出了AAbAAC(自身抗体与自身免疫标注语料库),该语料库包含从PubMed精选的115篇摘要,并手动标注了实体及其关系。首先,AAbAAC被用于评估多种方法在命名实体识别(NER)任务上的表现;其次,用于微调NER模型。我们的研究展示了AAbAAC在自身免疫领域信息抽取中的实用性,表明微调后NER性能预期提升。这说明了小规模标注工作对专业领域的价值,并为自身免疫的计算研究做出了贡献。AAbAAC语料库可通过 https://github.com/f-maury/AAbAAC 获取。

英文摘要

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and secondly, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

2606.13141 2026-06-12 cs.AI 新提交

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

重新思考长视频中的RAG:检索什么以及如何使用?

Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang, Juntae Lee, Kyuwoong Hwang, Fatih Porikli, Hwanjun Song

发表机构 * Department of Computer Science, Cranberry-Lemon University(蔓越莓柠檬大学计算机科学系)

AI总结 针对视频检索增强生成中检索粒度单一和基准测试缺陷,提出V-RAGBench基准和CARVE方法,通过分块自适应重排序实现多配置交错证据,显著提升性能。

详情
AI中文摘要

检索增强生成正从文本扩展到长、自我中心的视频,系统必须跨多种模态和时间粒度选择与查询相关的块。然而,VideoRAG的进展受到两个差距的限制:现有基准允许无需视频即可回答查询,掩盖了检索错误;先前方法对每个查询应用单一模态-粒度配置,忽略了块级变异性。我们通过引入V-RAGBench(一个⟨查询,证据块,答案⟩三元组基准,支持检索和生成的忠实、解耦评估)和CARVE(一种简单方法,跨配置运行并行检索器并采用块自适应重排序以识别每个块的最佳配置)来解决这两个问题。每个块随后以其在检索期间选择的最佳配置进入生成器,产生一种交错证据形式,其中块级决策在检索和生成两个阶段传播。CARVE优于八种最近的VideoRAG基线,提供给生成器的块交错多种配置而非共享单一配置,这是查询级方法无法实现的行为。

英文摘要

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.

2606.13148 2026-06-12 cs.AI 新提交

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

TerraBench: 智能体能否对异构地球系统数据进行推理?

Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出TerraBench基准,基于TerraAgent框架,通过结合大语言模型规划与科学工具,实现跨网格数据、卫星图像、地理空间和模拟器的交互式推理,包含403个任务和24,500个执行步骤。

详情
AI中文摘要

气候和环境决策日益需要对异构输入进行推理,包括网格化物理数据、卫星图像、地理空间背景和模拟器输出。天气和气候基础模型可以很好地预测,但不能以语言进行交互式推理,而大型语言模型(LLM)可以用语言推理,但不能直接操作高维地球系统数据。因此,地球科学中的真实科学工作流仍然得不到充分支持。我们引入了TerraBench,一个基于地球科学推理的基准,构建于TerraAgent之上,这是一个ReAct风格的可执行框架,它交织推理、工具调用和观察,将LLM规划与环境检索、地理空间处理、模拟和基于工件的计算等科学工具相结合。TerraBench在单一可执行界面中统一了对地球观测图像、网格化数据、GIS推理和模拟的分析,而先前的基准将这些能力隔离为狭窄的独立任务。它也是该领域中第一个将过程级工具使用指标与容忍度感知数值评分配对的方法。该基准包含403个广泛的智能体任务,涵盖三个轨道(基础、模拟器基础和文档基础验证)和八个应用领域,共24,500个经过验证的执行步骤。这些结果表明,可靠的地球科学智能体必须超越工具访问,协调异构工作流,精确参数化工具,并保留工件来源。

英文摘要

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.

2606.13192 2026-06-12 cs.AI 新提交

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

基于多模态大语言模型的移动用户体验推理:任务、基准与方法

Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng, Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao

发表机构 * Ant Group(蚂蚁集团)

AI总结 提出UXBench基准(2000个VQA样本)评估多模态大模型在UI推理上的能力,并设计UI-UX模型,通过奖励路由和不对称过渡奖励机制在UXBench上达到0.7963准确率,超越Claude-4.5-Sonnet。

Comments 10 pages, 6 figures, Accepted at CVPR 2026 Findings

详情
AI中文摘要

以可用性、感知一致性和功能清晰性为中心的用户体验(UX)是现实世界用户界面(UI)的基础。多模态大语言模型(MLLMs)在用户界面领域的应用正在快速发展,例如视觉元素定位、图形用户界面(GUI)代理和设计到代码生成。然而,基于UI截图评估UX的研究工作仍不成熟。为此,我们提出UXBench,一个包含2000个VQA数据样本的新型多模态基准,旨在评估MLLMs执行基于UI的推理能力。UXBench包括基于真实UI截图的8个任务,需要对布局关系、视觉层次和内容一致性中的UX问题进行细粒度诊断。我们对主流MLLMs的广泛评估表明,它们在基于UI的推理能力上仍然存在根本性限制。结果强调了该领域进一步发展的必要性。为弥补这一差距,我们提出UI-UX,一个基于Qwen3-VL-4B-Thinking基础模型并通过强化学习增强的MLLM,具有两个关键创新:一个奖励路由机制,在推理过程中动态平衡感知理解和逻辑推理;以及一个非对称过渡奖励,抑制冗余或不足的推理步骤。实验表明,UI-UX在UXBench上达到了最先进的性能,准确率达到0.7963——超过Claude-4.5-Sonnet的0.6550——同时在各种UI任务中表现出强大的泛化能力并保持低推理延迟。

英文摘要

User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). The application of multimodal large language models (MLLMs) in the field of user interfaces is evolving rapidly, such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research efforts on evaluating UX based on UI screenshots are still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 -- surpassing Claude-4.5-Sonnet's 0.6550 -- while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.

2606.13370 2026-06-12 cs.AI 新提交

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

在计算感知令牌预算下小型Llama风格语言模型训练动态的定量实验重复测量研究

Joe Dwyer

发表机构 * Department of Computer Information Science, ECPI University(ECPI大学计算机信息科学系)

AI总结 本研究通过重复测量设计,分析在固定计算预算下训练小型Llama模型时,验证损失、困惑度等指标随令牌数变化的动态,发现早期快速改进后出现非单调退化,表明计算感知评估应关注训练轨迹而非终点指标。

详情
AI中文摘要

本研究考察了在固定、计算受限的令牌预算下训练的小型Llama风格语言模型的训练动态。研究并未仅通过终点性能来评估效率,而是采用定量实验重复测量设计,分析验证损失、验证困惑度、滚动波动性、回退行为、尖峰行为以及种子间变异性如何在基于令牌的训练区间内变化。在拥有426万参数的模型上,使用TinyStories语料库、CPU全精度训练以及约2000万累积训练令牌的目标预算,进行了六次独立训练运行。在21个区间内收集指标,产生了126个种子-区间观测值。重复测量方差分析显示,验证损失、验证困惑度和滚动波动性存在统计显著的区间效应。描述性轨迹揭示了早期快速改进,随后在后期训练区间出现非单调退化。平均验证损失从初始化的8.3552降至接近400万令牌时的2.7996,但在最终检查点增至3.9010。验证困惑度遵循相同模式,在训练早期急剧下降,随后上升。衍生遥测进一步显示了反复的验证损失回退,并且在预定义标准下没有区间汇总证据表明存在稳定阶段。这些发现表明,计算感知的语言模型评估应检查训练轨迹而非仅终点指标。在受限计算设置中,额外的令牌暴露可能增加计算成本而不产生成比例的泛化收益,而区间级遥测可以揭示终点指标可能掩盖的不稳定性、回归和收益递减。

英文摘要

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
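The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the reported validation loss is mean cross-entropy in nats, so that perplexity is its exponential, which the abstract does not state explicitly. The names `reported_loss` and `perplexity` are hypothetical.

```python
import math

# Mean validation losses quoted in the abstract, keyed by training stage.
reported_loss = {
    "initialization": 8.3552,
    "~4M tokens": 2.7996,
    "final checkpoint": 3.9010,
}


def perplexity(loss_nats: float) -> float:
    """Perplexity, ASSUMING the loss is mean cross-entropy measured in nats."""
    return math.exp(loss_nats)


# Repeated-measures layout described in the abstract: 6 seeds x 21 intervals.
n_observations = 6 * 21

# Gap between the final checkpoint and the best mean loss quoted in the abstract.
late_regression = reported_loss["final checkpoint"] - reported_loss["~4M tokens"]

if __name__ == "__main__":
    for stage, loss in reported_loss.items():
        print(f"{stage}: loss={loss:.4f}, perplexity={perplexity(loss):.2f}")
    print("seed-by-interval observations:", n_observations)
    print(f"late-training increase in mean loss: {late_regression:.4f}")
```

Under that assumption, the rise in mean loss between the 4-million-token point and the final checkpoint corresponds to roughly a threefold increase in perplexity, which is the kind of trajectory-level regression an endpoint-only report would not show.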

2606.13436 2026-06-12 cs.AI 新提交

Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems

元数据驱动分类中的评估主权:面向弱监督信息系统的多轨道框架

Raymond Vasquez

发表机构 * Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

AI总结 针对弱监督元数据系统中标签权威性影响评估有效性的问题,提出评估主权概念及多轨道评估框架,通过实验揭示模型性能在银标与金标评估下的显著差异,并重新定义评估有效性为系统级属性。

详情
AI中文摘要

机器学习中的评估通常被视为中立的测量过程。然而,在操作性信息系统中,评估结果往往受标签生成过程的影响。本文并非旨在提升分类性能,而是考察在不同标签权威体制下性能测量的有效性。这一问题在大规模元数据驱动系统中尤为突出,此类系统中的标签常不完整、不一致或仅受弱监督。我们引入评估主权概念,定义为性能指标独立于标签权威和监督体制的程度,并提出一个多轨道评估框架,系统性地变化训练和评估标签来源。通过对大规模科学元数据进行层次多标签分类,我们证明在操作性(“银标”)评估下表现强劲的模型在独立(“金标”)评估下性能显著下降,尤其在细粒度分类中。例如,Micro-F1从约0.54降至0.03。值得注意的是,基于排名的指标仍高于基线,揭示了潜在模型信号与分类有效性之间的分歧。这些发现表明,通常报告的性能指标可能反映的是与标注过程的对齐,而非真正的预测能力。因此,我们将评估有效性重新概念化为由标签治理塑造的系统级属性,并为审计在弱监督下运行的智能系统提供了一种实用方法论。

英文摘要

Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluation outcomes are often conditioned by the processes used to generate labels. This paper does not seek to improve classification performance. Instead, it examines the validity of performance measurement under differing label-authority regimes. This issue is particularly relevant in large-scale metadata-driven systems, where labels are often incomplete, inconsistent, or weakly supervised. We introduce evaluation sovereignty, defined as the degree to which performance metrics are independent of label authority and supervision regime, and propose a multi-track evaluation framework that systematically varies training and evaluation label sources. Using hierarchical multi-label classification on large-scale scientific metadata, we demonstrate that models exhibiting strong performance under operational ("silver") evaluation degrade substantially under independent ("gold") evaluation, particularly for fine-grained classification. For example, Micro-F1 decreases from approximately 0.54 to 0.03. Notably, ranking-based metrics remain above baseline, revealing a divergence between latent model signal and classification validity. These findings suggest that commonly reported performance metrics may reflect alignment with labeling processes rather than true predictive capability. We therefore reconceptualize evaluation validity as a system-level property shaped by label governance and provide a practical methodology for auditing intelligent systems operating under weak supervision.
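To make the silver-versus-gold gap concrete, the sketch below scores one fixed set of multi-label predictions against two different label sources using micro-averaged F1. It is a toy illustration, not the paper's pipeline: the label sets are invented for the example, and `micro_f1` is a hypothetical helper using the standard pooled definition.

```python
def micro_f1(predicted, reference):
    """Micro-averaged F1: pool true/false positives and false negatives over all items."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        tp += len(pred & ref)   # labels predicted and present in the reference
        fp += len(pred - ref)   # labels predicted but absent from the reference
        fn += len(ref - pred)   # reference labels the model missed
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical predictions for three records (invented for illustration).
predictions = [{"physics"}, {"biology"}, {"chemistry", "physics"}]

# "Silver" labels: produced by the same operational process the model was trained on.
silver_labels = [{"physics"}, {"biology"}, {"chemistry"}]

# "Gold" labels: assigned independently, at a finer granularity.
gold_labels = [{"optics"}, {"biology"}, {"materials"}]

silver_score = micro_f1(predictions, silver_labels)
gold_score = micro_f1(predictions, gold_labels)

if __name__ == "__main__":
    print(f"micro-F1 vs silver labels: {silver_score:.3f}")
    print(f"micro-F1 vs gold labels:   {gold_score:.3f}")
```

The predictions are identical in both calls; only the label authority changes, which is the effect the multi-track framework is designed to isolate.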

2606.13513 2026-06-12 cs.AI 新提交

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

CloudCons:云资源整合的全面端到端基准测试

Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li, Hongkai Li, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

发表机构 * Zhejiang University(浙江大学) State Street Technology (Zhejiang) Ltd.(道富科技(浙江)有限公司) Richoo AI Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新区(滨江)区块链与数据安全研究院) Datadog AI Research

AI总结 提出CloudCons基准,评估云资源整合中预测模型的决策效用,发现基础模型零样本预测准确但决策效用未必更优,并分析预测分位数选择对资源效率与可靠性的权衡。

Comments Accepted to KDD 2026

详情
AI中文摘要

由于为了保证服务可靠性而采取的保守过度配置,云数据中心的资源利用率仍然较低。为了缓解这一问题,出现了“先预测后优化”的范式,通过预测未来需求来优化整合。尽管新兴的时间序列基础模型通过零样本泛化有望增强这一范式,但现有基准仅关注预测误差指标。这些先进模型的实际决策效用尚未得到验证,使得它们在下游任务中的实际价值不确定。为了弥合这一差距,我们提出了CloudCons,一个全面的端到端基准测试,旨在评估云资源整合特定背景下的预测模型。我们构建了高质量数据集,涵盖华为云、微软Azure和Google Borg的不同工作负载,捕捉从同步昼夜节律到随机脉冲式突发和高频噪声的不同服务特征。我们对统计模型、深度学习模型和基础模型进行了广泛评估。实验揭示了一个关键发现:虽然基础模型在零样本预测准确性上表现出色,但这种优势并不必然转化为更好的决策效用。具有实际意义的是,我们系统分析了预测分位数的选择如何作为一个关键杠杆。我们提供了校准这些选择的可行指南,以平衡资源效率和服务可靠性之间的权衡,为实际部署决策提供了重要见解。

英文摘要

Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.
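The role of the predictive quantile as a lever can be illustrated with a toy provisioning rule: reserve capacity at a chosen quantile of the forecast distribution, then measure how often real demand exceeds it and how much of the reserved capacity is used. This is a simplified sketch under stated assumptions, not the CloudCons evaluation protocol; the function names, the nearest-rank quantile, and all numbers are hypothetical.

```python
def nearest_rank_quantile(samples, q_percent):
    """Nearest-rank quantile; q_percent is an integer in 1..100 to avoid float rounding."""
    ordered = sorted(samples)
    rank = max(1, -(-q_percent * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def evaluate_quantile(forecast_samples, actual_demand, q_percent):
    """Return (capacity, violation_rate, utilization) for one quantile choice."""
    capacity = nearest_rank_quantile(forecast_samples, q_percent)
    # Reliability proxy: fraction of time steps where demand exceeds reserved capacity.
    violation_rate = sum(d > capacity for d in actual_demand) / len(actual_demand)
    # Efficiency proxy: share of reserved capacity actually consumed.
    served = sum(min(d, capacity) for d in actual_demand)
    utilization = served / (capacity * len(actual_demand))
    return capacity, violation_rate, utilization


# Invented forecast samples and realized demand, in arbitrary resource units.
forecast_samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
actual_demand = [35, 55, 75, 95]

if __name__ == "__main__":
    for q in (50, 90):
        cap, viol, util = evaluate_quantile(forecast_samples, actual_demand, q)
        print(f"q={q}%: capacity={cap}, violations={viol:.2f}, utilization={util:.3f}")
```

In this toy setting, raising the quantile lowers the violation rate and lowers utilization at the same time, which is the efficiency-reliability trade-off the abstract refers to.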

2606.13602 2026-06-12 cs.AI 新提交

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

EpiBench:人工智能代理在表观基因组学分析中的可验证评估

Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结 提出EpiBench基准,通过106个评估任务测试AI代理在表观基因组学工作流中的决策能力,发现最佳系统GPT-5.5/Pi通过率仅45%,失败多因缺乏深度科学判断。

详情
AI中文摘要

我们介绍了EpiBench,一个用于短周期表观基因组学分析的可验证基准。EpiBench评估代理是否能够从真实工作流状态中做出明确定义的分析决策,并返回可确定性评分的答案。该基准包含CUT&Tag/CUT&RUN、ATAC-seq、ChIP-seq和DNA甲基化工作流中的106个评估。在来自16个模型-框架对的5,088条有效轨迹中,没有系统通过大多数尝试:GPT-5.5 / Pi以45.0%(143/318次尝试;95%置信区间(CI),36.3–53.7)领先,其次是GPT-5.5 / OpenAI Codex的39.9%(127/318次尝试;95% CI,31.6–48.3)。Claude Opus 4.8 Max / Pi和GPT-5.4 / Pi分别通过了39.0%(124/318次尝试;95% CI,30.2–47.8和31.0–47.0)。性能因检测类型而异,许多失败的运行仍包含部分正确答案。代理通常能找到正确的文件并计算出有用的中间结果,但当任务需要更深入、特定于检测的科学判断时,它们就会失败。

英文摘要

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
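The headline percentages follow directly from the attempt counts given in the abstract, as the short check below shows. It is a sketch under stated assumptions: `pass_rate` is a hypothetical helper, and the reading that each of the 16 model-harness pairs contributed 318 valid attempts is inferred from 5,088 / 16 rather than stated by the authors. The confidence intervals are not reproduced, because the abstract does not say how they were computed.

```python
def pass_rate(passed: int, attempts: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / attempts, 1)


# (passed, attempts) exactly as quoted in the abstract.
reported_counts = {
    "GPT-5.5 / Pi": (143, 318),
    "GPT-5.5 / OpenAI Codex": (127, 318),
    "Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi": (124, 318),
}

rates = {name: pass_rate(p, n) for name, (p, n) in reported_counts.items()}

# ASSUMPTION: all 16 model-harness pairs have 318 valid attempts each.
total_trajectories = 16 * 318

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate}%")
    print("total trajectories under the assumption:", total_trajectories)
```

The product 16 x 318 matches the 5,088 valid trajectories reported, which is consistent with the per-pair reading used here.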

2606.13670 2026-06-12 cs.AI 新提交

Automated reproducibility assessments in the social and behavioral sciences using large language models

使用大型语言模型自动评估社会与行为科学的可重复性

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

发表机构 * LMU Munich(慕尼黑大学) Munich Center for Machine Learning(慕尼黑机器学习中心) University of Cologne(科隆大学)

AI总结 本研究利用大型语言模型(LLMs)自动评估社会与行为科学研究的可重复性,在76项研究中,LLM在41%的研究中恢复了原始效应量,在96%的案例中得出了与原始研究相同的定性结论,优于人类再分析。

详情
AI中文摘要

社会与行为科学的可重复性通常由独立研究人员重新分析原始数据来评估,以判断已发表的研究结果是否可复现。然而,这种方法资源密集且难以规模化。在此,我们展示了大型语言模型(LLMs)可以自动化可重复性评估。利用N=76项来自行为与社会科学、具有预定义声明的研究,我们比较了LLM生成的分析与原始结果和人类再分析。对于7项研究,LLM无法产生可行的效应量估计。对于其余研究,我们的LLM流程在41%的研究中恢复了原始效应量(Cohen's d的容忍度为+/-0.05)。此外,我们的LLM流程在96%的案例中得出了与原始研究相同的定性结论,其中结论指示再分析是否支持原始声明。相比之下,人类再分析者在34%的研究中恢复了原始效应量,并在74%的案例中得出了相同的定性结论。这些结果共同表明,LLMs可以作为自动化可重复性评估的可扩展工具,并为社会与行为科学中实证结果的系统审计提供基础。

英文摘要

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
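The effect-size criterion above can be written down in a few lines. The sketch below is an illustration, not the authors' pipeline: it assumes Cohen's d for two independent groups with a pooled sample standard deviation, which is one common definition and may differ from the one used per study. The names `cohens_d` and `recovered` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def recovered(original_d, reanalysis_d, tol=0.05):
    """True if the reanalysis lands within +/- tol of the original effect size."""
    return abs(original_d - reanalysis_d) <= tol


if __name__ == "__main__":
    # Invented toy data: treatment vs control outcomes.
    d = cohens_d([2, 4, 6], [1, 3, 5])
    print(f"reanalysis d = {d:.2f}")
    print("recovered vs original d=0.52:", recovered(0.52, d))
    print("recovered vs original d=0.80:", recovered(0.80, d))
```

A tolerance this tight explains why recovering the exact effect size (41%) is much rarer than reaching the same qualitative conclusion (96%): small differences in preprocessing can shift d by more than 0.05 without changing whether the claim is supported.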

2606.12419 2026-06-12 cs.CY cs.AI 交叉投稿

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

GeoDial:面向几何问题求解的多模态对话式辅导数据集,包含可视化辅导轮次

Sankalan Pal Chowdhury, Junling Wang, Donya Rooein, April Yi Wang, Mrinmaya Sachan

发表机构 * ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心) Bocconi University(博科尼大学)

AI总结 提出GeoDial数据集,包含1300+几何师生对话,通过可扩展标注协议整合对话行为、视觉高亮和反馈,微调视觉语言模型发现其难以生成准确图解高亮。

详情
AI中文摘要

几个教育领域严重依赖图表和视觉线索,但现有的大多数辅导数据集仅限于纯文本交互。这限制了AI辅导者的发展,使其无法像人类教师那样以视觉为基础的方式进行教学。因此,我们引入了GeoDial,这是一个多模态辅导数据集,包含来自经验丰富的数学教师的1300多个几何领域的师生对话,其中教学轮次明确地基于图表高亮。我们提出了一种可扩展的标注协议,该协议整合了对话行为、视觉高亮和反馈,从而能够对语言和视觉辅导行为进行细粒度监督。为了说明这一设置带来的挑战,我们在GeoDial上微调了几个视觉语言模型,并评估它们生成辅导话语和图表高亮的能力。虽然监督微调显著提高了生成对话的质量,但它难以生成准确的图表高亮,揭示了当前方法的一个关键局限性,并强调了需要更有效地将视觉推理与教学互动相结合的方法。

英文摘要

Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interactions. This limits the development of AI tutors that can teach in visually grounded ways used by human instructors. Thus, we introduce GeoDial, a multimodal tutoring dataset of over 1.3K teacher-student dialogs in the domain of geometry collected from experienced math teachers, where instructional turns are explicitly grounded in diagram highlights. We propose a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback, enabling fine-grained supervision of both language and visual tutoring behavior. To illustrate the challenges posed by this setting, we fine-tune several vision-language models on GeoDial and evaluate their ability to generate tutoring utterances and diagram highlights. While supervised fine-tuning substantially improves the quality of generated dialog, it struggles to produce accurate diagram highlights, revealing a key limitation of current methods and highlighting the need for approaches that more effectively integrate visual reasoning with pedagogical interaction.

2606.12443 2026-06-12 cs.CY cs.AI cs.CL 交叉投稿

Occupational Prompting Reveals Cultural Bias in Large Language Models

职业提示揭示大型语言模型中的文化偏见

Maksim E. Eren, Andrea Brennen, Ryan C. Barron, Eric Michalak

发表机构 * U.S. Government(美国政府)

AI总结 通过职业提示(如会计师、教师)替代国籍提示,研究开源LLM在价值观调查中的响应,发现不同职业导致文化地图内偏移,表明职业角色引发结构化价值模式。

详情
AI中文摘要

社会角色塑造期望、优先级和判断,但大型语言模型(LLM)如何将职业身份与更广泛的文化价值模式关联仍不清楚。先前工作使用基于国籍的文化提示来研究LLM对价值观调查问题的响应如何与人类文化基准对齐。本文通过用职业提示替代文化提示,扩展了该框架,以检查职业角色线索如何影响开源LLM的价值观调查响应。使用基于综合价值观调查问题的调查评估流程,我们将模型响应投影到二维Inglehart-Welzel文化空间。我们提示开源LLM以职业身份(如会计师、教师、工程师和护士)回答问题,然后分析这些职业条件化响应在文化地图上的位置。结果表明,当用职业而非国籍身份提示开源LLM时,其响应仍位于文化地图的广泛西方倾向区域。然而,不同职业在该区域内引入偏移,产生不同的职业偏差。这表明职业提示并非被视为中性角色标签,而是引发结构化价值模式。这些发现将基于调查的文化偏见评估扩展到国籍提示之外,并提供了研究职业角色如何塑造LLM中价值表达的框架。

英文摘要

Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart-Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

2606.12569 2026-06-12 cs.CL cs.AI 交叉投稿

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

EDEN:意大利语临床笔记的大规模语料库

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Istituto di Ricerche Farmacologiche Mario Negri IRCCS(马里奥·内格里药理研究所IRCCS) University of Padua(帕多瓦大学)

AI总结 本文介绍EDEN,一个大规模意大利语急诊临床笔记语料库,包含约400万份匿名笔记及6000份专家标注数据,用于支持大语言模型在医疗中的应用,并提出了CRF填充作为新的结构化信息提取基准。

详情
AI中文摘要

我们提出了EDEN(急诊电子笔记),这是一个新颖且独特的大规模临床笔记语料库,这些笔记来自意大利医院的急诊科。当前版本的语料库由约400万份完全匿名的临床笔记组成,涵盖了患者在急诊科停留期间的不同护理阶段。此外,约六千份笔记的子集由临床专家通过结构化病例报告表(CRF)进行了手动标注,该CRF包含132个项目,涉及急诊科两种患者情况:呼吸困难和意识丧失。项目可能取数值(例如血氧饱和度)、分类(例如意识水平)、二元(例如是否存在创伤)和混合值类型。标注过程涉及多位临床医生,并经过迭代修订以解决项目表述中的歧义,从而形成了一个结构丰富(尽管高度不平衡)的资源。该数据集旨在填补能够支持大语言模型在具体医疗应用中开发和使用的重要数据缺口。我们描述了数据收集协议、现场匿名化流程、语料库统计数据和标注方案。最后,我们提出了CRF填充作为一项新的结构化信息提取基准,并提供了基于Gemma-27B和MedGemma-27B的零样本基线。据我们所知,EDEN数据集是意大利语现有最大的免费临床笔记语料库。

英文摘要

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments: dyspnea and loss of consciousness. Items may assume numerical values (e.g., for blood saturation), categorical values (e.g., for level of consciousness), binary values (e.g., for presence of traumas), and mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes existing for the Italian language.

2606.12581 2026-06-12 cs.SI cs.AI 交叉投稿

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

多关系网络中的图缩减:面向传播的缩减基准

Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka

AI总结 提出SORB基准框架,系统评估图缩减对影响力最大化任务的影响,发现缩减效果依赖于网络类型和评估指标。

详情
AI中文摘要

现实世界网络天生不完整、有噪声且动态演化,难以捕获所有参与者及其关系。其规模常使直接分析计算量大。虽然影响力最大化(IM)已被广泛研究,但图缩减作为预处理步骤及其对IM准确性的影响仍未被充分探索。本文引入面向传播的缩减基准(SORB),一个开源、标准化的框架,用于系统评估不同任务设置下的IM模型。SORB提供可扩展的流水线,操作于代表性真实世界网络集合(包括单层和多层结构),并将图缩减直接纳入评估过程。此设计将焦点从孤立分析IM算法转向量化图缩减如何改变预测性能。利用SORB,我们研究了多种IM场景下稀疏化和粗化的效果。结果表明,缩减的影响强烈依赖于网络类型(单层 vs. 多关系)和下游任务($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$):稀疏化在单层网络上保持种子集质量,而扁平化多层网络无论缩减策略如何均表现出系统性排名退化。这些发现强调了在研究复杂网络传播过程时,进行缩减感知的多任务评估的重要性。

英文摘要

Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and incorporates graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction is strongly dependent on both the network type (single-layer vs. multirelational) and the downstream task ($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.

2606.12595 2026-06-12 cs.LG cs.AI cs.CV 交叉投稿

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

地理空间多模态基础模型的新兴灵活设计

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

发表机构 * Oak Ridge National Laboratory(橡树岭国家实验室)

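AI summary: Systematically compares geospatial foundation models of different architectures, evaluating their flexibility and performance under a unified setup to provide design guidance for multimodal reasoning.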

详情
AI中文摘要

基础模型通过跨多样未标记地理空间模态的可扩展预训练,正在迅速改变地球观测。然而,其架构多样性(从仅编码器到编码器-解码器以及掩码自编码范式)使得以一致方式评估性能权衡变得具有挑战性。在这项工作中,我们对领先的、专为地理空间多模态推理设计的基础模型架构进行了同类比较,特别关注不同光谱波段配置下的灵活性。我们使用相同的自监督学习目标和训练数据集标准化预训练,并在GEOBench基准测试上,在一致参数化下评估所有模型的分类和分割任务。我们的结果为模型灵活性、模态对齐和下游任务性能之间的设计权衡提供了新见解。通过强调受控条件下的架构优势和局限性,本研究为构建能够进行鲁棒多模态推理的下一代地理空间基础模型提供了实用指导。

英文摘要

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12620 2026-06-12 cs.SE cs.AI 交叉投稿

HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

HybridCodeAuthorship:一个用于行级代码作者归属检测的基准数据集

Luke Patterson, Li Wang, Adam Faulkner

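AI summary: To address existing benchmarks' failure to reflect real-world use of AI code assistants, proposes the HybridCodeAuthorship dataset of interleaved human- and AI-written code lines and evaluates two detection algorithms on it.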

Comments Accepted to LREC 2026

详情
Journal ref
LREC 2026 proceedings (pp. 1520-1532)
AI中文摘要

由于基于大型语言模型(LLM)的AI代码助手的快速采用,行业代码库越来越多地成为AI和人类编写代码的混合体。出于风险管理和生产力分析的目的,实现对AI生成代码的细粒度位置检测至关重要。为了开发此任务的算法,需要高质量的基准来评估性能。然而,现有的基准往往包含学术性的LeetCode风格问题,并假设代码片段要么完全由人类编写,要么完全由AI编写,这并不能反映使用AI代码助手的行业代码库的多样意图和风格。为了填补这些空白,我们引入了HybridCodeAuthorship,这是一个新颖的Python代码文件基准,其中交错有人类和AI编写的代码行,以模拟AI代码助手的真实使用。在本文中,我们首先介绍了我们的数据集构建流程,该流程利用了CodeSearchNet,这是一个包含GitHub上开源仓库链接的大型集合。然后,我们在行级和块级上评估了两种最先进的AI生成代码检测算法的性能。实验结果表明,HybridCodeAuthorship是一个具有挑战性的基准,得分最高的算法AIGCode Detector在块级和行级代码检测任务上分别获得了0.48和0.56的最高F1分数。

英文摘要

Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are increasingly a hybrid of AI- and human-authored code. For risk management and productivity analysis purposes, it is crucial to enable fine-grained location detection of AI-generated code. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume a code snippet is either completely human-authored or completely AI-authored, which does not reflect the diverse intents and styles of industry codebases that use AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic use of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open-source repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line and chunk level. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark: the top-scoring algorithm, AIGCode Detector, obtains best F1 scores of 0.48 and 0.56 on the chunk-level and line-level code detection tasks, respectively.
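To make the difference between line-level and chunk-level scoring concrete, the following is a minimal Python sketch of an F1 computation over binary authorship labels (1 = AI-authored, 0 = human-authored). It is not the benchmark's evaluation code: the helper names `f1` and `to_chunks`, the fixed chunk size, and the rule that a chunk counts as AI-authored when at least half of its lines are AI-authored are all assumptions made for illustration.

```python
def f1(gold, pred):
    """F1 score for the positive class (1 = AI-authored)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def to_chunks(labels, size):
    """Collapse line labels into chunk labels.

    Assumed rule (hypothetical): a chunk is AI-authored when at least
    half of its lines are AI-authored.
    """
    chunks = []
    for start in range(0, len(labels), size):
        window = labels[start:start + size]
        chunks.append(1 if 2 * sum(window) >= len(window) else 0)
    return chunks


# Toy file with eight lines: gold labels and a detector's predictions.
gold_lines = [0, 0, 1, 1, 1, 0, 0, 1]
pred_lines = [0, 1, 1, 1, 0, 0, 0, 1]

line_f1 = f1(gold_lines, pred_lines)
chunk_f1 = f1(to_chunks(gold_lines, 2), to_chunks(pred_lines, 2))
```

On this toy input the two granularities give different scores for the same predictions, which is why the abstract reports them separately.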

2606.12673 2026-06-12 cs.LG cs.AI 交叉投稿

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

基于节点重构的零样本广义图异常检测框架

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang

发表机构 * School of Computing, KAIST(韩国科学技术院计算机学院)

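AI summary: Proposes the AlignGAD framework, which achieves zero-shot cross-domain graph anomaly detection through a Global Unification Module that aligns heterogeneous features, a Clustering Module that captures group-level abnormal patterns, and a Node Discrepancy Scoring Module that aggregates anomaly evidence across multiple graph views.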

详情
AI中文摘要

跨域图异常检测旨在识别未见过的目标图中的异常节点,在异构图数据的实际应用中展现出巨大潜力。然而,现有方法通常依赖于数据集特定的特征语义和结构模式,限制了其跨域泛化能力。为解决这一挑战,我们提出AlignGAD,一个零样本广义图异常检测框架。我们的框架基于三个关键组件:全局统一模块,用于对齐异构节点特征并在谱域中归一化图信号;聚类模块,用于构建聚类感知的图视图以捕获组级异常模式;以及节点差异评分模块,用于测量重构差异并聚合来自不同图视图的异常证据。在多个真实数据集上的实验证明了AlignGAD在零样本图异常检测设置下的有效性。

英文摘要

Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

2606.12708 2026-06-12 cs.CL cs.AI 交叉投稿

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

AfriSUD:用于评估非洲语言模型的依存树库集合

Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum

发表机构 * Princeton University(普林斯顿大学) Laboratory for Artificial Intelligence, Princeton University(普林斯顿大学人工智能实验室) Gaston Berger University(加斯顿·伯杰大学) Mila, McGill University(麦吉尔大学米拉研究所) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席) Paris Nanterre University(巴黎南泰尔大学) Paris-Saclay University(巴黎-萨克雷大学) CNRS(法国国家科学研究中心) Inria(法国国家信息与自动化研究所) LORIA(洛林计算机科学实验室) Université de Lorraine(洛林大学) University of Trento(特伦托大学) University of Minnesota–Twin Cities(明尼苏达大学双城分校) Imperial College London(伦敦帝国学院) Binghamton University(宾汉姆顿大学) Makerere University(马凯雷雷大学) Penn State University(宾夕法尼亚州立大学) Mbarara University of Science and Technology(姆巴拉拉科技大学) Chalmers University of Technology(查尔姆斯理工大学) University of Ibadan(伊巴丹大学) Nnamdi Azikiwe University(纳姆迪·阿齐基韦大学) South African Centre for Digital Language Resources(南非数字语言资源中心)

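AI summary: To address the shortage of NLP resources for African languages, builds AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine African languages; evaluating a range of models reveals a significant syntax gap.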

详情
AI中文摘要

尽管非洲语言具有语言多样性和全球重要性,但在支持NLP的研究和资源中仍代表性不足。我们通过引入AfriSUD来弥合这一差距,这是首个大规模句法标注树库集合,涵盖九种多样的非洲语言,跨越撒哈拉以南非洲的主要语系和地区。采用表层句法通用依存(SUD)框架,我们社区主导的努力提供了高质量、经母语者验证的数据,捕捉了如黏着和声调等类型学关键特征。我们在AfriSUD上评估了多种模型,包括非Transformer基线、多语言预训练编码器和LLM,用于词性标注和依存句法分析。我们的结果揭示了显著的句法差距,模型在九种语言上仍表现出明显局限性,表明现有架构可能无法完全捕捉非洲语言句法的结构多样性。

英文摘要

Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

2606.12864 2026-06-12 cs.SE cs.AI 交叉投稿

Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

超越问题求解:用于评估竞赛编程中代码生成、攻击和修复的UOJ-Bench基准

Tingqiang Xu, Hangrui Zhou, Tianle Cai, Alex Gu, Kaifeng Lyu

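AI summary: Proposes the UOJ-Bench benchmark, which uses three tasks (code generation, hacking, and repair) to evaluate LLMs' problem-solving ability and their ability to identify errors in human-written code in competitive programming. Under one-shot evaluation the strongest models fail to identify errors in more than 50% of incorrect submissions; test-time scaling raises the success rate above 90% and uncovers errors in over 5% of full-score submissions.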

详情
AI中文摘要

尽管大型语言模型(LLM)在竞赛编程中表现出色,但其在相同环境下支持人类学习的作用仍基本未被探索。本文介绍UOJ-Bench,一个旨在评估LLM不仅解决问题能力,还能识别人类编写代码中错误的基准,这是传统上通过在线评测系统运行测试用例支持的关键教育活动。UOJ-Bench包含三个不同任务:代码生成、代码攻击和代码修复,所有任务均基于Universal Online Judge(UOJ)上的真实代码提交构建,并通过UOJ的原生评测基础设施进行评估。我们的结果表明,在一次性评估下,即使最强的模型也无法识别超过50%的被UOJ用户发现错误的提交。虽然测试时扩展将成功率提升至90%以上,但模型推理带来的巨大计算成本限制了其大规模部署的实用性。尽管存在这些限制,我们发现,在测试时扩展下,最佳性能模型可以在大约30个问题中识别超过5%的满分提交中的错误,这表明前沿LLM已经能够提供超越标准评测系统的补充信号。

英文摘要

Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.

2606.12936 2026-06-12 cs.RO cs.AI 交叉投稿

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

面向湿实验室机器人的具身仿真平台、基准测试及数据高效增强框架

Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang, He Xu, Peijia Li, Jiaming Gu, Quan Lu, Qi Wang, Bin Ji, Ting Xiao

发表机构 * Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education(能源化工过程智能制造教育部重点实验室) Department of Computer Science and Engineering(计算机科学与工程系) Department of Laboratory Medicine(实验室医学系) Shanghai Jiao Tong University School of Medicine(上海交通大学医学院)

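AI summary: Proposes the Pipette platform, comprising editable assets, a simulation-based data augmentation pipeline, and an 11-task benchmark; with 30 demonstrations per task, augmentation raises the success rate of the SmolVLA model from 44.1% to 74.7%.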

Comments 25 pages, 17 figures

详情
AI中文摘要

湿实验室机器人可以提高生物医学实验的可重复性、通量和安全性,但扩展其学习需要可定制的模拟器以进行安全和可重复的任务生成、开放的可编辑实验室资产,以及将有限演示转化为可用训练数据的高效管道。我们提出了Pipette,一个用于湿实验室机器人学习的具身仿真平台、基准测试和数据高效增强框架。Pipette发布了超过43个开源且可重新编辑的湿实验室资产,以及一个可扩展的资产构建管道。Pipette的一个关键组件是其基于仿真的数据增强管道,在仿真中重放人类演示,应用光照、相机、速度和动作扰动,并通过自动任务成功检查过滤生成的片段,从有限的手动演示中快速扩展可用的训练数据。我们进一步引入了一个包含11个任务的湿实验室具身基准测试,涵盖样本处理、培养器具操作、设备操作和精确放置。每个任务仅需30次演示,ACT实现了65.5%的平均成功率,而仿真增强将SmolVLA从44.1%提升至74.7%,将π0从40.4%提升至46.5%,验证了Pipette在数据高效的VLA训练和评估中的有效性。Pipette还支持自然语言驱动的场景构建和任务注册,降低了非专家用户定义新湿实验室机器人任务的门槛。

英文摘要

Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves an average success rate of 65.5%, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.
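The replay, perturb, and filter loop described in the abstract can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Pipette's pipeline: the one-dimensional "simulator" in `replay`, the function `augment`, the noise scale, the tolerance, and the perturbation ranges are all hypothetical, and the speed perturbation is omitted for brevity.

```python
import random


def replay(actions):
    """Toy 1-D stand-in for a simulator: final position is the sum of actions."""
    position = 0.0
    for action in actions:
        position += action
    return position


def augment(demo, goal, n, noise=0.02, tol=0.05, seed=0):
    """Generate n perturbed replays of one demonstration and keep the successes."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        episode = {
            # Action perturbation: small Gaussian noise on every step.
            "actions": [a + rng.gauss(0.0, noise) for a in demo],
            # Visual-only perturbations (hypothetical ranges); they do not
            # change the toy dynamics but would change rendered observations.
            "lighting_scale": rng.uniform(0.5, 1.5),
            "camera_yaw_deg": rng.uniform(-10.0, 10.0),
        }
        # Automatic success check: keep the episode only if it still reaches the goal.
        if abs(replay(episode["actions"]) - goal) <= tol:
            kept.append(episode)
    return kept


demo = [0.1] * 5  # one human demonstration reaching position 0.5
episodes = augment(demo, goal=0.5, n=50)
```

The filter is what keeps perturbed data usable: episodes whose perturbed actions miss the goal are discarded rather than added to the training set.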

2606.13126 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

MiniPIC: Flexible Position-Independent Caching in <100LOC

MiniPIC: 少于100行代码的灵活位置无关缓存

Nathan Ordonez, Thomas Parnell

发表机构 * IBM Research(IBM研究院)

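AI summary: Proposes MiniPIC, which uses a positional-encoding-free KV cache and user-controlled cache-reuse primitives to implement multiple position-independent caching methods in vLLM, significantly improving prefill throughput and reducing time-to-first-token.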

Comments 13 pages, 5 figures

详情
AI中文摘要

检索增强和代理工作负载重复预填充可预测的结构化输入(我们称之为“跨度”),例如文档和代码文件。然而,vLLM等引擎中的前缀缓存无法重用KV条目,除非它们与另一个请求共享相同的前缀,而生产级推理服务器中的位置无关缓存(PIC)实现通常需要大量服务器代码更改或将KV状态保留在服务器外部,从而产生主机到设备的传输开销。我们提出了极简PIC(MiniPIC):一种最小化、灵活且快速的vLLM设计,由两个组件构建:无位置编码的KV缓存和用户控制的缓存重用原语。MiniPIC在KV缓存中存储未旋转的K向量,在注意力内部使用每请求逻辑位置对K块应用RoPE,并公开三个面向用户和令牌级别的原语:块对齐填充、跨度分隔符(SSep)和提示依赖(PDep),这些原语修改哈希行为和有效的块级因果注意力结构。通过少于100行的核心引擎更改加上自定义注意力后端,这些原语足以在同一个运行的vLLM实例中实现多种PIC方法,包括Block-Attention、EPIC和Prompt Cache,同时原生集成KV缓存CPU卸载实现。在2WikiMultihopQA上,使用交错调度的MiniPIC相比基线vLLM将预填充吞吐量提高了49%,将缓存跨度的首个令牌时间减少了最多两个数量级,保持了未缓存跨度的线性预填充扩展,并且仅产生5.7%的最坏情况开销。

英文摘要

Retrieval-augmented and agentic workloads repeatedly prefill recurring, predictable, structured inputs (which we call "spans") such as documents and code files. Yet prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible, and fast vLLM design built from two ingredients: a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing, token-level primitives, namely block-aligned padding, span separator (SSep), and prompt depend (PDep), which modify hashing behavior and the effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
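The idea of caching unrotated K vectors and applying RoPE at attention time can be illustrated with a small plain-Python sketch. It is not the MiniPIC or vLLM implementation: the helper names `rope` and `scores`, the dictionary `kv_cache`, the four-dimensional head, and the base frequency are assumptions for illustration only. The sketch shows that the same cached span gives identical attention scores at two different offsets, because RoPE scores depend only on the relative distance between query and key positions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles (standard RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical cache: span id -> list of *unrotated* K vectors.
kv_cache = {"doc_a": [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, -0.5, 0.5]]}


def scores(query, q_pos, span_id, span_start):
    """Rotate cached K at request-specific logical positions, then score against q."""
    q = rope(query, q_pos)
    keys = kv_cache[span_id]
    return [dot(q, rope(k, span_start + j)) for j, k in enumerate(keys)]


q = [0.3, -0.2, 0.7, 0.1]
# The same cached span placed at offset 0 in one request and offset 100 in another,
# with the query kept at the same relative distance in both.
s_first = scores(q, q_pos=10, span_id="doc_a", span_start=0)
s_second = scores(q, q_pos=110, span_id="doc_a", span_start=100)
```

Had the cache stored K vectors already rotated for positions 0 and 1, the second request could not reuse them at offset 100 without undoing the rotation first.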

2606.13477 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

SupraBench: A Benchmark for Supramolecular Chemistry

SupraBench: 超分子化学基准

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

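AI summary: To evaluate large language models on supramolecular chemistry reasoning, releases SupraBench, the first supramolecular benchmark, built in collaboration with domain experts and comprising four fundamental tasks and one auxiliary vision task, together with SupraPMC, a 16M-token corpus.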

详情
AI中文摘要

超分子化学,包括非共价主客体组装的研究,推动了各种应用的发展。然而,设计主客体系统仍然耗时,每个候选对需要数天的干实验室验证。尽管LLMs已成为一种快速的替代方案,在分子结合任务上表现出色,但目前尚无基准系统性地评估LLMs在超分子化学基本任务(如结合亲和力预测)中的主客体推理能力。为此,我们与领域专家合作发布了首个超分子基准,称为SupraBench,用于评估LLMs在化学推理中的表现。具体来说,我们设计了四个基本任务,即结合亲和力预测、最佳结合物选择、溶剂识别和主客体描述,以及一个辅助的基于视觉的分子识别任务。我们还发布了SupraPMC,一个从Europe PMC中提取的经过整理的1600万令牌的超分子化学文章语料库,以支持对超分子领域的适应。我们对一系列开源和专有LLMs进行了基准测试,发现LLMs在所有任务上都有很大的提升空间。在SupraPMC上的领域自适应预训练可以干净地迁移到分布内回归,但会与严格的字母格式输出进行权衡。此外,不同任务家族的难度分布差异很大,揭示了不同的失败模式,表明当前超分子化学推理中存在特定的差距。我们的源代码和基准数据集可在以下网址获取:https://github.com/Tianyi-Billy-Ma/SupraBench。

英文摘要

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently evaluates LLMs systematically for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain-adaptive pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

2606.13647 2026-06-12 cs.CL cs.AI cs.LG 交叉投稿

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

SkMTEB:斯洛伐克大规模文本嵌入基准与模型适配

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

发表机构 * Comenius University in Bratislava(布拉迪斯拉发夸美纽斯大学) Cisco Systems(思科系统) Technical University of Košice(科希策技术大学) Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

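AI summary: For Slovak, a low-resource West Slavic language, builds SkMTEB, the first MTEB-style text embedding benchmark (31 datasets, 7 task types), and develops the efficient, locally deployable models e5-sk-small and e5-sk-large, which through vocabulary trimming and fine-tuning remain competitive with commercial APIs at up to 62% smaller size.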

Comments ACL 2026

详情
AI中文摘要

我们介绍了SkMTEB,这是首个针对斯洛伐克语(一种低资源西斯拉夫语)的全面MTEB风格文本嵌入基准,包含31个数据集,覆盖7种任务类型,几乎是现有斯洛伐克语多语言基准覆盖深度的4倍。我们对31个嵌入模型的评估表明,大型指令调优多语言模型表现最强,而现有的针对NLU任务训练的斯洛伐克语特定模型在嵌入任务上迁移效果不佳。为了满足高效、可本地部署的斯洛伐克语嵌入需求,我们通过对多语言E5模型进行词汇裁剪和微调,开发了\texttt{e5-sk-small}(45M参数)和\texttt{e5-sk-large}(365M)模型。尽管模型尺寸缩小了高达62%,我们的开源模型在性能上与专有API相当,同时仍可本地部署用于语义搜索和检索增强生成(RAG)。我们公开了基准、模型、数据集和代码,希望我们的方法能为其他资源匮乏的语言提供可复现的路径。

英文摘要

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
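Vocabulary trimming, as named in the abstract, can be sketched as dropping embedding rows for tokens the target language never uses and re-indexing the remainder. The Python sketch below is a generic illustration under stated assumptions, not the authors' procedure: the function `trim_vocabulary`, the toy vocabulary, the two-dimensional embeddings, and the choice of special tokens are all hypothetical.

```python
def trim_vocabulary(vocab, embeddings, corpus_tokens, special_tokens=("<pad>", "<unk>")):
    """Keep only special tokens and tokens seen in the target-language corpus.

    vocab:         dict mapping token -> row index in `embeddings`
    embeddings:    list of embedding rows (one list of floats per token)
    corpus_tokens: iterable of tokens observed in the target-language corpus
    Returns the re-indexed vocabulary and the matching, smaller embedding table.
    """
    seen = set(corpus_tokens)
    # Preserve the original id order so the trimmed table is deterministic.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [tok for tok, _ in ordered if tok in seen or tok in special_tokens]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary with one embedding row per token.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "mesto": 3, "chat": 4, "dom": 5}
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]
slovak_corpus = ["mesto", "dom", "dom", "mesto"]

new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, slovak_corpus)
reduction = 1 - len(new_embeddings) / len(embeddings)
```

Because the embedding table accounts for a large share of parameters in small multilingual encoders, removing unused rows shrinks the model without touching the transformer layers; the kept rows are unchanged, so fine-tuning can start from the original weights.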

2511.02627 2026-06-12 cs.AI 版本更新

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

DecompSR:用于组合多跳空间推理分解分析的数据集

Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha

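AI summary: Proposes the DecompSR dataset (over 5 million data points), which is generated procedurally to independently control several aspects of compositionality (such as reasoning depth and linguistic variability) for fine-grained evaluation of large language models' spatial reasoning.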

详情
AI中文摘要

我们引入了DecompSR(分解空间推理),这是一个大型基准数据集(超过500万个数据点)和生成框架,旨在分析组合空间推理能力。DecompSR的生成允许用户独立改变组合性的多个方面,即:生产力(推理深度)、替代性(实体和语言变异性)、过度泛化(输入顺序、干扰项)和系统性(新颖语言元素)。DecompSR以程序化方式构建,使其在构造上正确,并通过符号求解器独立验证以确保数据集的正确性。DecompSR在一系列大型语言模型(LLM)上进行了全面基准测试,我们表明LLM在空间推理任务中难以进行生产性和系统性泛化,而对语言变异性则更为鲁棒。DecompSR提供了一个可证明正确且严格的基准数据集,具有独立改变组合性几个关键方面程度的新能力,从而允许对LLM的组合推理能力进行稳健且细粒度的探测。

英文摘要

We introduce DecompSR (Decomposed Spatial Reasoning), a large benchmark dataset (over 5M data points) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner that makes it correct by construction, and it is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs), and we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks, whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
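"Correct by construction" plus independent verification can be illustrated with a small Python sketch: a generator builds a chain of spatial facts while tracking the ground-truth answer, and a separate solver recomputes the answer from the facts alone. This is a toy under stated assumptions, not the DecompSR generator or its symbolic solver; the names `generate` and `solve`, the four-direction grid, and the entity naming are hypothetical.

```python
import random

# Unit displacement for each relation on a 2-D grid (assumed relation set).
DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1)}


def generate(depth, seed=0):
    """Build `depth` chained facts and the answer, which is known by construction."""
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(depth + 1)]
    facts, x, y = [], 0, 0
    for anchor, new in zip(entities, entities[1:]):
        rel = rng.choice(sorted(DIRECTIONS))
        dx, dy = DIRECTIONS[rel]
        x, y = x + dx, y + dy
        facts.append((new, rel, anchor))  # reads: "new is <rel> of anchor"
    query = (entities[-1], entities[0])   # where is the last entity relative to the first?
    return facts, query, (x, y)


def solve(facts, query):
    """Independent check: propagate coordinates using only the stated facts."""
    target, origin = query
    coords = {origin: (0, 0)}
    pending = list(facts)
    while pending:
        progressed = False
        for fact in list(pending):
            new, rel, anchor = fact
            if anchor in coords:
                dx, dy = DIRECTIONS[rel]
                coords[new] = (coords[anchor][0] + dx, coords[anchor][1] + dy)
                pending.remove(fact)
                progressed = True
        if not progressed:
            raise ValueError("facts are not connected to the query origin")
    return coords[target]


facts, query, answer = generate(depth=5, seed=42)
shuffled = facts[:]
random.Random(1).shuffle(shuffled)  # vary input order without changing the answer
verified = solve(shuffled, query) == answer
```

Here `depth` plays the role of the productivity axis and shuffling the facts stands in for the input-order perturbation; each can be varied without affecting the other.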

2601.13591 2026-06-12 cs.AI cs.CL 版本更新

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

DSAEval:在广泛真实世界数据科学问题上评估数据科学智能体

Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

发表机构 * Department of Data Science and Artificial Intelligence, Hong Kong Polytechnic University(数据科学与人工智能系,香港理工大学) Department of Applied Mathematics, Hong Kong Polytechnic University(应用数学系,香港理工大学)

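AI summary: Proposes DSAEval, a benchmark of 641 real-world data science problems featuring multimodal environment perception, multi-query interactions, and multi-dimensional evaluation; a systematic evaluation of 13 advanced LLM agents finds that Claude-Sonnet-4.5 performs best overall and that multimodal perception improves performance on vision-related tasks by 2.04% to 11.30%.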

详情
AI中文摘要

近期基于LLM的数据智能体旨在自动化从数据分析到深度学习的数据科学任务。然而,真实世界数据科学问题的开放性——通常跨越多个分类且缺乏标准答案——给评估带来了重大挑战。为此,我们引入了DSAEval,一个包含641个基于285个多样化数据集的真实世界数据科学问题的基准,涵盖结构化和非结构化数据(例如图像和文本)。DSAEval包含三个独特特征:(1)多模态环境感知,使智能体能够解释来自多种模态(包括文本和视觉)的观察;(2)多查询交互,反映真实世界数据科学项目的迭代和累积性质;(3)多维评估,提供跨推理、代码和结果的全面评估。我们使用DSAEval系统评估了13个近期先进的智能体LLM。结果表明,Claude-Sonnet-4.5实现了最强的整体性能,MiMo-V2-Pro在持续时间上领先,GPT-5.2在步骤效率上领先,而MiMo-V2-Flash最具成本效益。我们进一步证明,多模态感知持续提升视觉相关任务的性能,增益范围为2.04%至11.30%。总体而言,尽管当前数据科学智能体在结构化数据和常规数据分析工作流上表现良好,但在非结构化领域仍存在重大挑战。最后,我们提供了关键见解并概述了未来研究方向。

英文摘要

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04\% to 11.30\%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.

2603.11863 2026-06-12 cs.AI cs.CL 版本更新

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

CreativeBench: 通过自我进化挑战基准测试和增强机器创造力

Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang

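AI summary: Proposes the CreativeBench benchmark, which evaluates machine creativity through code generation based on a cognitive framework; it comprises combinatorial and exploratory subsets, generates challenges automatically via reverse engineering and self-play, and distinguishes creativity from hallucination with a metric defined as the product of quality and novelty.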

Comments ACL 2026. Project page: https://zethwang.github.io/creativebench.github.io/

详情
AI中文摘要

高质量预训练数据的饱和已将研究焦点转向能够持续生成新颖产物的进化系统,从而促成了AlphaEvolve的成功。然而,此类系统的进展因缺乏严格、量化的评估而受阻。为应对这一挑战,我们引入了CreativeBench,这是一个基于经典认知框架、用于评估代码生成中机器创造力的基准。该基准包含两个子集——CreativeBench-Combo和CreativeBench-Explore,通过利用逆向工程和自我博弈的自动化流程,分别针对组合创造力和探索创造力。通过利用可执行代码,CreativeBench通过一个统一指标(定义为质量与新颖性的乘积)客观地区分创造力与幻觉。我们对最先进模型的分析揭示了不同的行为:(1) 规模扩展显著提升了组合创造力,但对探索的收益递减;(2) 更大的模型表现出“规模收敛”,即变得更正确但更少发散;(3) 推理能力主要有利于受约束的探索而非组合。最后,我们提出了EvoRePE,一种即插即用的推理时引导策略,通过内化进化搜索模式来持续增强机器创造力。

英文摘要

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit ``convergence-by-scaling,'' becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
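A metric defined as the product of quality and novelty has a simple consequence: an output scores zero if either factor is zero. The Python sketch below illustrates that property only. How CreativeBench actually measures quality and novelty is not stated in the abstract, so the function `creativity_score` and the assumption that both inputs lie in [0, 1] (for example, a test pass rate and a dissimilarity to a reference solution) are hypothetical.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Product of quality and novelty, both assumed to lie in [0, 1]."""
    if not (0.0 <= quality <= 1.0 and 0.0 <= novelty <= 1.0):
        raise ValueError("quality and novelty must both be in [0, 1]")
    return quality * novelty


# A novel but non-working program (a "hallucination") gets no credit ...
hallucination = creativity_score(quality=0.0, novelty=0.9)
# ... and neither does a correct program that copies the reference.
copy = creativity_score(quality=1.0, novelty=0.0)
# Only outputs that are both working and different score above zero.
creative = creativity_score(quality=0.8, novelty=0.5)
```

A sum or average of the two factors would not have this property, since a fully novel but broken program would still earn partial credit.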

2606.05405 2026-06-12 cs.AI cs.CL cs.LG 版本更新

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Chenyu Zhu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-Ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, 
Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, Andy Zeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lyu, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Shi Qiu, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Cheng Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, He Ren, Zhenyu He, Qiao Jin, Langlang Li, Yuetai Li, Sylvia Liu, Lu Lu, Luqing Zhou, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Yian Ma, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Yinglun Zhu, Dawn Song

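AI summary: To address the lack of economically meaningful deployment of AI systems in professional domains, proposes the Agents' Last Exam (ALE) benchmark, built with 250+ experts and comprising 1,000+ long-horizon, real-world, economically valuable tasks across 55 subfields in 13 industry clusters; the average pass rate on the hardest tier is currently only 2.6%.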

Comments Project website: https://agents-last-exam.org Code: https://github.com/rdi-berkeley/agents-last-exam

详情
AI中文摘要

最近的AI系统在广泛基准测试中取得了强劲结果,但这些成果并未转化为许多专业领域的经济上有意义的部署。我们认为这一差距主要是评估问题:广泛使用的基准缺乏对真实且经济上有价值的工作流程的持续性能测量。本文介绍了Agents' Last Exam (ALE),这是一个旨在评估AI代理在长期、经济上有价值、结果可验证的真实世界任务上的基准。与250多名行业专家合作开发,ALE涵盖了参考O*NET/SOC 2018(美国联邦职业分类)定义的非实体行业。它围绕一个任务分类法组织,包含55个子领域,分为13个行业集群,涵盖1000多个任务。当前结果显示,最难层级远未饱和:在主流框架和骨干配置下,平均完全通过率为2.6%。ALE被设计为一个活的基准:其任务池随着新工作流程和行业的加入而持续增长。更广泛地说,ALE不仅旨在作为另一个排行榜,而是作为缩小基准成功与GDP相关影响之间差距的工具。

英文摘要

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

2606.11042 2026-06-12 cs.AI 版本更新

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Workflow-GYM:面向真实世界专业领域的长周期计算机使用代理任务评估

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

发表机构 * ByteDance Seed(字节跳动Seed) M-A-P Humanlaya

AI总结 提出Workflow-GYM基准,评估AI代理在专业软件中执行长周期、高价值工作流的能力,发现最强模型成功率仅略超30%,揭示当前代理在长周期工作流一致性方面的严重不足。

详情
AI中文摘要

近年来,AI代理在处理日益复杂、真实世界任务方面取得了快速发展。然而,现有基准很少评估代理能否操作图形用户界面以完成跨领域的长周期、高价值专业工作流。当前的GUI基准仍主要关注通用软件、相对简单的应用和短周期任务,使得现代代理能否遵循用户指令自主操作领域特定专业软件并以端到端方式完成经济价值工作尚不清楚。为填补这一空白,我们引入Workflow-GYM,一个以专业领域和专门软件环境为中心的长周期GUI任务基准。通过对最先进模型的广泛实验,我们发现即使最强的模型也仅达到略高于30%的成功率,突显出专业长周期GUI工作流对当前GUI代理仍极具挑战性。进一步分析表明,当前代理难以维持长周期工作流的一致性,频繁出现工作流阶段遗漏、错误传播、目标漂移以及对专业软件环境理解不足等问题。我们的发现为当前代理系统的局限性提供了重要见解,并为下一代GUI代理研究指明了关键方向。

英文摘要

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

2304.13836 2026-06-12 cs.LG cs.AI cs.CV stat.ME 版本更新

On Pitfalls of $\textit{RemOve-And-Retrain}$: Data Processing Inequality Perspective

论 $\textit{RemOve-And-Retrain}$ 的陷阱:数据处理不等式视角

Junhwa Song, Keumgang Cha, Junghoon Seo

发表机构 * KAIST(韩国科学技术院)

AI总结 从信息论角度揭示ROAR基准的缺陷:数据无关的后处理可提升ROAR分数,导致对归因图信息量的误判,并发现模糊性偏差。

Comments Accepted at the 2026 ICML Workshop on Mechanistic Interpretability

详情
AI中文摘要

RemOve-And-Retrain (ROAR) 基准被广泛用于评估特征归因方法,但其有效性尚未从信息论角度得到充分探索。我们证明,对归因图进行模型和数据无关的后处理(通过数据处理不等式,这些变换\emph{不能}增加关于决策函数的信息)通常可以改善ROAR分数。这意味着ROAR排名的提升本身并不能证明归因图携带更多关于模型的信息。我们将这种失败模式归因于对空间模糊掩膜的偏好。在CIFAR-10、SVHN和CUB-200上的实验显示,模糊度与ROAR性能之间存在一致的关联,这种模式也出现在ROAD变体中。我们为更谨慎的基于移除的基准测试提供了指导方针,这对验证神经网络内部机制的机制性理解具有重要意义。

英文摘要

The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, \emph{cannot} add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.
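The data-processing-inequality argument in the abstract can be stated compactly. The notation below is an assumption introduced for illustration and is not taken from the paper: $f$ denotes the decision function, $A$ an attribution map computed from it, and $g$ a post-processing step chosen without access to the model or the data, so that $f \to A \to g(A)$ forms a Markov chain.

```latex
f \;\to\; A \;\to\; g(A)
\quad\Longrightarrow\quad
I\bigl(f;\, g(A)\bigr) \;\le\; I\bigl(f;\, A\bigr)
```

Under this reading, any ROAR gain obtained by replacing $A$ with $g(A)$ cannot come from additional information about $f$, which is the gap the abstract points to.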

2503.06573 2026-06-12 cs.CL cs.AI 版本更新

WildIFEval: Instruction Following in the Wild

WildIFEval: 野外指令遵循

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

发表机构 * The Hebrew University of Jerusalem(耶路撒冷希伯来大学) IBM Research(IBM研究院)

AI总结 提出WildIFEval数据集,包含7K条真实用户的多约束指令,用于评估LLM的指令遵循能力,发现所有模型仍有较大改进空间。

Comments Accepted to the 5th Workshop on Generation, Evaluation and Metrics (GEM) at ACL 2026

详情
AI中文摘要

最近的LLMs在遵循用户指令方面取得了显著成功,但处理具有多个约束的指令仍然是一个重大挑战。在这项工作中,我们引入了WildIFEval——一个包含7K条真实用户指令的大规模数据集,这些指令具有多样化的多约束条件。与以往的数据集不同,我们的收集涵盖了广泛的词汇和主题约束范围,这些约束是从自然用户指令中提取的。我们将这些约束分为八个高级类别,以捕捉它们在现实场景中的分布和动态。利用WildIFEval,我们进行了大量实验来评估领先LLMs的指令遵循能力。WildIFEval清晰地区分了小型和大型模型,并表明所有模型在此类任务上仍有很大的改进空间。我们分析了约束数量和类型对性能的影响,揭示了模型约束遵循行为的有趣模式。我们发布数据集以促进在复杂现实条件下指令遵循的进一步研究。

英文摘要

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have a large room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG 版本更新

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench:评估语言模型中的程序性和多元道德推理,超越结果

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

发表机构 * University of Washington(华盛顿大学) New York University(纽约大学) Scale AI Harvard University(哈佛大学) University of Michigan(密歇根大学) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Center for AI Safety(人工智能安全中心) Stanford University(斯坦福大学) MIT(麻省理工学院) University of Oxford(牛津大学)

AI总结 提出MoReBench基准,包含1000个道德场景和超过2.3万条标准,用于评估语言模型在道德推理中的程序性推理能力,发现现有基准无法预测模型表现,且模型对特定道德框架存在偏好。

Comments 46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)

详情
AI中文摘要

随着人工智能系统的进步,我们越来越依赖它们与我们共同或代替我们做出决策。为了确保这些决策符合人类价值观,我们不仅需要理解它们做出了什么决策,还需要理解它们如何得出这些决策。推理语言模型能够提供最终响应和(部分透明的)中间思考轨迹,这为研究AI的程序性推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同,道德困境是过程导向评估的绝佳测试平台,因为它们允许多种可辩护的结论。为此,我们提出了MoReBench:包含1000个道德场景,每个场景配有一组专家认为在推理该场景时必须包含(或避免)的评分标准。MoReBench包含超过2.3万条标准,包括识别道德考量、权衡利弊以及给出可操作的建议,覆盖了AI为人类道德决策提供建议以及自主做出道德决策的情况。此外,我们整理了MoReBench-Theory:150个示例,用于测试AI是否能在规范伦理学的五个主要框架下进行推理。我们的结果表明,规模定律以及现有的数学、代码和科学推理任务基准无法预测模型进行道德推理的能力。模型还显示出对特定道德框架(例如边沁式的行为功利主义和康德义务论)的偏好,这可能是流行训练范式的副作用。这些基准共同推动了面向过程推理的评估,以实现更安全、更透明的AI。

英文摘要

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

2512.21227 2026-06-12 cond-mat.mtrl-sci cs.AI 版本更新

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

PhononBench:面向晶体生成中动态稳定性的基于声子的大规模基准

Xiao-Qi Han, Ze-Feng Gao, Wen-Kao Li, Peng-Jie Guo, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院)

AI总结 提出PhononBench,首个大规模AI生成晶体动态稳定性基准,利用MatterSim势高效计算声子,评估7个模型生成的133,838个结构,发现平均动态稳定性率仅32.15%。

Comments 53 pages, 6 figures

详情
AI中文摘要

近年来,生成式人工智能在晶体材料设计方面取得了显著进展,催生了基于图神经网络、扩散模型和大语言模型的方法。现有评估通常遵循稳定性-唯一性-新颖性(S.U.N.)框架,其中稳定性主要使用热力学标准评估,这未能完全捕捉材料实际存在所必需的动态稳定性。动态稳定性是决定材料能否被合成并持续存在的关键因素,声子谱计算是其评估标准。然而,此类计算的高计算成本阻碍了对生成晶体动态稳定性的大规模评估。在这项工作中,我们引入了PhononBench,这是首个针对AI生成晶体动态稳定性的大规模基准。利用最近开发的MatterSim原子间势,该势能在超过10,000种材料中实现了密度泛函理论(DFT)级别的声子预测精度,PhononBench能够对7个领先晶体生成模型生成的133,838个晶体结构进行高效的声子计算和动态稳定性分析。PhononBench揭示了当前生成模型的一个普遍局限性:除非另有说明,所有报告的动态稳定性指标均在-0.1 THz的声子频率阈值下评估,所有生成结构的平均动态稳定性率仅为32.15%,表现最佳的模型MatterGen也仅达到45.05%。此外,我们识别出32,995个在-0.001 THz严格阈值下整个布里渊区声子稳定的晶体结构。另外,一个基于网页的服务可通过 http://phononbench.cn/ 访问,实现分钟级的超快声子预测。

英文摘要

In recent years, generative artificial intelligence has made significant advances in the design of crystalline materials, giving rise to approaches based on graph neural networks, diffusion models, and large language models. Existing evaluations commonly follow the stability-uniqueness-novelty (S.U.N.) framework, where stability is primarily assessed using thermodynamic criteria, which do not fully capture the dynamical stability essential for a material's practical existence. Dynamical stability is a key determinant of whether a material can be synthesized and persist, with phonon spectrum calculations serving as the standard for its evaluation. However, the high computational cost of such calculations has prevented large-scale assessment of dynamical stability in generated crystals. In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves density-functional-theory (DFT)-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient phonon calculations and dynamical-stability analysis for 133,838 crystal structures generated by 7 leading crystal generation models. PhononBench reveals a widespread limitation of current generative models: unless otherwise specified, all reported dynamical-stability metrics are evaluated at a phonon-frequency threshold of -0.1 THz, with the average dynamical-stability rate across all generated structures being only 32.15%, and the top-performing model, MatterGen, reaching just 45.05%. In addition, we identify 32,995 crystal structures that are phonon-stable across the entire Brillouin zone under a strict threshold of -0.001 THz. A web-based service is also accessible at http://phononbench.cn/, enabling minute-level ultra-fast phonon predictions.
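The threshold-based stability rate quoted in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification not confirmed by the abstract, that a structure counts as dynamically stable when its lowest phonon frequency over all sampled q-points is at or above the threshold (imaginary modes are conventionally reported as negative frequencies). The function names and the toy frequencies are hypothetical.

```python
from typing import Sequence


def is_dynamically_stable(frequencies_thz: Sequence[float],
                          threshold_thz: float = -0.1) -> bool:
    """Assumed criterion: no phonon mode lies below the tolerance threshold."""
    return min(frequencies_thz) >= threshold_thz


def stability_rate(structures: Sequence[Sequence[float]],
                   threshold_thz: float = -0.1) -> float:
    """Fraction of structures that pass the assumed stability criterion."""
    if not structures:
        raise ValueError("need at least one structure")
    stable = sum(is_dynamically_stable(freqs, threshold_thz) for freqs in structures)
    return stable / len(structures)


# Toy data (made up): phonon frequencies in THz for three structures.
toy_structures = [
    [0.0, 1.2, 3.4],   # no negative modes
    [-0.05, 2.0],      # small negative mode, inside the -0.1 THz tolerance
    [-0.8, 1.5],       # clearly unstable
]
loose_rate = stability_rate(toy_structures, threshold_thz=-0.1)
strict_rate = stability_rate(toy_structures, threshold_thz=-0.001)
```

With these toy values the second structure passes at -0.1 THz but fails at -0.001 THz, which shows why the abstract reports the two thresholds separately.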

2602.00122 2026-06-12 cs.CV cs.AI cs.MM 版本更新

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

VDE Bench: 评估图像编辑模型对视觉文档进行修改的能力

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

发表机构 * UCAS(中国科学院大学) CASIA(中国科学院自动化研究所) Tencent(腾讯) CMU(卡内基梅隆大学) WashU(圣路易斯华盛顿大学) SJTU(上海交通大学) XDU(西安电子科技大学)

AI总结 本文提出VDE Bench,一个专门评估图像编辑模型在双语中文-英文和复杂视觉文档编辑任务性能的基准,通过高质量数据集和新的评估框架,系统量化了文本修改的准确性。

详情
AI中文摘要

近年来,图像编辑模型取得了显著进展,使用户能够通过自然语言指令灵活地交互式地操作视觉内容。然而,一个重要的但尚未充分研究的研究方向是密集的视觉文档图像编辑,这涉及在图像中修改文本内容,同时忠实保留原始文本风格和背景上下文。现有方法主要集中在英语场景和文本相对稀疏的图像上,因此无法充分解决密集、结构复杂的文档或非拉丁文字如中文。为弥合这一差距,我们提出了VDE Bench(视觉文档编辑基准),这是一个严格人工标注和评估的基准,专门设计用于评估图像编辑模型在双语中文-英文和复杂视觉文档编辑任务上的性能。该基准包含942个基于指令的图像编辑样本数据集,其种子图像涵盖密集的中文和英文文本文档,包括学术论文、海报、演示文稿、考试材料和报纸。此外,我们引入了一个新的评估框架,系统地量化了在OCR解析层面的编辑性能,从而实现了对文本修改准确性的细粒度评估。基于此基准,我们对代表性图像编辑模型进行了全面评估。人类验证显示,人类判断与自动化评估指标之间有一致性。VDE Bench构成了评估图像编辑模型在双语密集文本视觉文档性能的首个系统性基准。

英文摘要

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.

2602.07294 2026-06-12 cs.CE cs.AI 版本更新

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Fin-RATE:面向SEC文件的金融分析与追踪评估基准

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

发表机构 * Tongji University(同济大学) University of California, San Diego(加州大学圣地亚哥分校) Yale University(耶鲁大学) Goldman Sachs(高盛集团)

AI总结 针对LLM在金融领域分析复杂监管文件的需求,提出基于SEC文件的Fin-RATE基准,通过三种任务路径评估模型,发现跨文档和跨时间分析时性能显著下降。

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

随着大型语言模型(LLM)在金融领域的部署日益增多,LLM越来越需要解析复杂的监管披露文件。然而,现有基准通常关注孤立细节,未能反映专业分析的复杂性——这种分析需要综合多个文档、报告期和公司实体的信息。此外,这些基准无法区分错误源于检索失败、生成不准确、领域特定推理错误还是对查询或上下文的误解,从而难以精确诊断性能瓶颈。为弥补这些不足,我们引入Fin-RATE,这是一个基于美国证券交易委员会(SEC)文件构建的基准,通过三条路径模拟金融分析师的工作流程:单个披露文件内的细节导向推理、共享主题下的跨实体比较,以及同一公司在多个报告期内的纵向跟踪。我们在真实上下文和检索增强设置下,对17个领先的LLM(包括开源、闭源和金融专用模型)进行了基准测试。结果显示,当任务从单文档推理转向纵向和跨实体分析时,性能显著下降,准确率分别下降18.60%和14.35%。这种下降与比较幻觉、时间和实体不匹配的增加有关,并进一步反映在推理质量和事实一致性的下降上——这些局限性是现有基准尚未正式分类或量化的。

英文摘要

With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations, temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency--limitations that existing benchmarks have yet to formally categorize or quantify.

2602.10132 2026-06-12 physics.plasm-ph cs.AI 版本更新

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

TokaMark:MAST托卡马克等离子体模型的综合基准

Cécile Rousseau, Samuel Jackson, Rodrigo H. Ordonez-Hurtado, Nicola C. Amorisco, Tobia Boschi, George K. Holt, Andrea Loreti, Eszter Székely, Alexander Whittle, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Sue Thorne, Mykhaylo Zayats

发表机构 * IBM Research Europe(IBM欧洲研究院) UK Atomic Energy Authority(英国原子能局) STFC Hartree Centre(STFC哈特里中心)

AI总结 为解决聚变数据稀缺、分散且标注不一致的问题,提出TokaMark基准,包含14项任务,统一多模态聚变数据访问和评估协议,并提供基线模型,以加速数据驱动的AI等离子体建模。

详情
AI中文摘要

开发运行如托卡马克等商业可行的聚变能源反应堆,需要从稀疏、有噪声且不完整的传感器读数中准确预测等离子体动力学。底层物理的复杂性和实验数据的异质性给传统数值方法带来了巨大挑战,并凸显了现代数据驱动方法的潜力。然而,实现这一潜力的主要障碍是缺乏经过整理、公开可用的数据集和标准化基准。现有的聚变数据集稀缺、分散在不同机构、特定于设施且标注不一致,这限制了可重复性,并阻碍了AI方法的公平和可扩展比较。在本文中,我们介绍了TokaMark,一个结构化基准,用于评估在Mega Ampere Spherical Tokamak(MAST)收集的真实实验数据上的AI模型。TokaMark提供了一套全面的工具,旨在统一多模态聚变数据的访问并标准化评估协议。该基准包括14项精心策划的任务,涵盖一系列物理机制,利用多种诊断手段并覆盖多个操作用例。提供了一个基线模型,以便在统一框架内进行透明的比较和验证。通过建立统一的基准,TokaMark旨在加速数据驱动的AI等离子体建模的进展,为最终实现可持续和稳定的聚变能源做出贡献。数据集、基准、文档和工具已在 https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline 开源。

英文摘要

Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensor readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, and highlight the promise of modern data-native approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to unify access to multi-modal fusion data and standardize evaluation protocols. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The dataset, benchmark, documentation, and tooling are open-sourced under https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline.

2602.18154 2026-06-12 cs.CL cs.AI cs.DB 版本更新

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

FENCE:一个金融和多模态越狱检测数据集

Mirae Kim, Seonghun Jeong, Youngjun Kwak

发表机构 * arXiv

AI总结 针对金融领域多模态越狱检测资源匮乏的问题,提出FENCE数据集,包含韩英双语文本和图像,用于训练和评估检测器,实验表明基线检测器准确率达99%。

Comments LREC 2026 accepted paper

详情
AI中文摘要

越狱对大型语言模型(LLM)和视觉语言模型(VLM)的部署构成重大风险。VLM尤其脆弱,因为它们处理文本和图像,创造了更广泛的攻击面。然而,可用于越狱检测的资源很少,特别是在金融领域。为填补这一空白,我们提出了FENCE,一个双语(韩语-英语)多模态数据集,用于训练和评估金融应用中的越狱检测器。FENCE通过金融相关查询与图像威胁配对,强调领域真实性。使用商业和开源VLM进行的实验揭示了持续的脆弱性,GPT-4o显示出可测量的攻击成功率,而开源模型则表现出更大的暴露。在FENCE上训练的基线检测器实现了99%的分布内准确率,并在外部基准测试中保持强劲性能,突显了该数据集在训练可靠检测模型方面的鲁棒性。FENCE为推进金融领域的多模态越狱检测以及支持敏感领域中更安全、更可靠的AI系统提供了重点资源。警告:本文包含可能具有冒犯性的示例数据。

英文摘要

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

2603.00610 2026-06-12 cs.SD cs.AI cs.LG cs.MM eess.AS 版本更新

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

CMI-RewardBench: 基于组合多模态指令评估音乐奖励模型

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 针对音乐生成模型缺乏有效评估机制的问题,提出CMI-RewardBench基准,包含大规模偏好数据集和参数高效奖励模型,实现多模态指令下的音乐质量评估。

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然音乐生成模型已经发展到能够处理混合文本、歌词和参考音频的复杂多模态输入,但评估机制却滞后了。在本文中,我们通过为组合多模态指令(CMI)下的音乐奖励建模建立了一个全面的生态系统来弥补这一关键差距,其中生成的音乐可能以文本描述、歌词和音频提示为条件。我们首先引入了CMI-Pref-Pseudo,一个包含11万个伪标签样本的大规模偏好数据集,以及CMI-Pref,一个针对细粒度对齐任务量身定制的高质量人工标注语料库。为了统一评估格局,我们提出了CMI-RewardBench,一个统一的基准,用于评估音乐奖励模型在音乐性、文本-音乐对齐和组合指令对齐方面的异质样本。利用这些资源,我们开发了CMI奖励模型(CMI-RMs),一个能够处理异质输入的参数高效奖励模型家族。我们评估了它们与人类判断分数在音乐性和对齐方面的相关性,使用了CMI-Pref以及之前的数据集。进一步的实验表明,CMI-RM不仅与人类判断高度相关,而且通过top-k过滤实现了有效的推理时扩展。代码可在GitHub(https://github.com/Haiwen-Xia/CMI-RewardBench)获取。模型权重:CMI-RM(https://huggingface.co/HaiwenXia/CMI-RM)。数据集:CMI-Pref-Pseudo(https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo)和CMI-Pref(https://huggingface.co/datasets/HaiwenXia/cmi-pref)。

英文摘要

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub (https://github.com/Haiwen-Xia/CMI-RewardBench). Model weights: CMI-RM (https://huggingface.co/HaiwenXia/CMI-RM). Datasets: CMI-Pref-Pseudo (https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo) and CMI-Pref (https://huggingface.co/datasets/HaiwenXia/cmi-pref)
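The idea of inference-time scaling via top-k filtering mentioned in the abstract can be sketched as follows: generate several candidates, score each with a reward model, and keep only the k highest-scoring ones. This is a generic illustration under stated assumptions, not the authors' pipeline; `top_k_filter`, the candidate strings, and the stand-in reward function are all hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_k_filter(candidates: Sequence[T],
                 reward_fn: Callable[[T], float],
                 k: int) -> List[T]:
    """Keep the k candidates with the highest reward, best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    return heapq.nlargest(k, candidates, key=reward_fn)


# Stand-in reward: in practice this would be a learned reward model's score
# for a generated clip; here string length plays that role for illustration.
toy_candidates = ["clip_a", "clip_bbb", "clip_cc"]
kept = top_k_filter(toy_candidates, reward_fn=len, k=2)
```

Sampling more candidates before filtering trades extra generation cost for a better chance that the retained outputs score well under the reward model.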

2603.10834 2026-06-12 cs.CV cs.AI 版本更新

On the Reliability of Cue Conflict and Beyond

论线索冲突的可靠性及其超越

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology(蔚山科学技术院) College of Medicine, Hanyang University(汉阳大学医学院) NAVER AI Lab(NAVER AI实验室)

AI总结 针对现有线索冲突基准在评估形状-纹理偏好时存在不稳定和模糊的问题,提出REFINED-BIAS数据集与评估框架,通过显式定义形状和纹理、构建平衡的线索对及基于排序的度量,实现更可靠和可解释的偏差诊断。

Comments Shape-Texture Bias, Cue Conflict Benchmark

详情
AI中文摘要

理解神经网络如何依赖视觉线索提供了其内部决策过程的人类可解释视角。线索冲突基准在探究形状-纹理偏好以及激发更强、类人形状偏差通常与改进的域内性能相关的见解方面具有影响力。然而,我们发现当前基于风格化的实例化可能产生不稳定和模糊的偏差估计。具体来说,风格化可能无法可靠地实例化感知上有效且可分离的线索,也无法控制其相对信息量;基于比率的偏差可能掩盖绝对线索敏感性;将评估限制在预选类别可能忽略完整决策空间而扭曲模型预测。这些因素共同可能将偏好与线索有效性、线索平衡和可识别性伪影混淆。我们引入了REFINED-BIAS,一个用于可靠和可解释的形状-纹理偏差诊断的集成数据集和评估框架。REFINED-BIAS使用形状和纹理的显式定义构建平衡的、人类和模型可识别的线索对,并通过基于排序的度量测量完整标签空间上的线索特定敏感性,从而实现更公平的跨模型比较。在不同的训练范式和架构中,REFINED-BIAS实现了更公平的跨模型比较、更忠实的形状和纹理偏差诊断以及更清晰的实证结论,解决了先前线索冲突评估无法可靠区分的矛盾。

英文摘要

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model- recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

2605.26144 2026-06-12 cs.SE cs.AI cs.CV 版本更新

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

VISTA:面向视觉规格到网页应用编码智能体的端到端基准

JunJia Guo, Yuhang Yao, Jiawei, Zhou, Jingdi Chen

发表机构 * University of Arizona(亚利桑那大学) Zoom Stony Brook University(石溪大学)

AI总结 提出VISTA基准,通过多维度输入条件和评估指标,衡量基于LLM的智能体从视觉规格生成功能完整、视觉一致的网页应用的能力。

Comments Project page: https://kaboider.github.io/VIS_APP/; Code: https://github.com/kaboider/VIS_APP_Code; Dataset: https://huggingface.co/datasets/JunJiaGuo/VIS-APP-Bench

详情
AI中文摘要

我们提出了VISTA(视觉规格到应用基准),这是一个用于评估基于LLM的智能体端到端网页应用生成能力的基准。与以往关注算法任务的代码生成基准不同,VISTA针对以UI为中心的现实开发场景,要求智能体从不明确的输入中生成功能完整、视觉一致的应用。我们定义了五种提示信息条件,沿视觉/结构保真度和技术栈约束两个轴变化:(1)仅文本,自由选择技术栈;(2)文本加参考截图,指定三种技术栈;(3)文本加参考截图,自由选择技术栈;(4)文本加截图和精简的Figma结构,指定单一技术栈;(5)文本加截图和精简的Figma结构,自由选择技术栈。为实现稳健评估,基准中的每个页面都手动标注了交互式UI组件和大约三个视觉锚点,解决了Playwright等基于脚本的测试工具在开放式代码生成设置中的已知局限性。评估结合了基于DOM的参考匹配、行为特定的浏览器测试和基于CLIP的视觉相似性,共同衡量结构对齐、行为完整性和整体视觉保真度。我们使用VISTA评估了来自两个模型家族和两个框架的四个智能体系统,发现视觉保真度和功能正确性在输入条件和智能体之间部分解耦,并且智能体的编辑风格差异显著,但大体上与任务质量正交。VISTA为推进基于智能体的软件工程研究建立了严谨且可重复的基础。

英文摘要

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR 版本更新

How reliable are LLMs when it comes to playing dice?

LLM 在掷骰子时有多可靠?

Luca Avena, Gianmarco Bet, Bernardo Busoni

发表机构 * Università degli Studi di Firenze(佛罗伦萨大学)

AI总结 通过离散概率问题基准测试,发现 LLM 在标准问题上准确率 0.96,但在反直觉问题上仅 0.59,且存在 token 偏差和误导提示的脆弱性。

详情
AI中文摘要

我们通过离散概率问题的受控基准研究,调查了大语言模型的概率推理能力。我们构建了两个数据集,分别是一组标准习题和一组反直觉习题,旨在触发启发式推理,并评估了 8 个最先进的模型,每个模型分别在有无思维链提示的情况下进行测试。模型在标准问题上的平均准确率为 0.96,但在反直觉问题上仅为 0.59。我们进一步提供了 token 偏差的经验证据:当规范表述被伪装变体替换时,性能下降超过 20%。在提示中嵌入误导性建议会使性能降低高达 34%,且没有模型被证明免疫。综合来看,报告的结果表明,尽管当前 LLM 在高级数学问题上取得了成功,但它们尚未成为真正的概率推理者。

英文摘要

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.
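For orientation, discrete probability questions of the kind the abstract describes can often be checked by exhaustive enumeration. The sketch below is a hypothetical example, not an item from the paper's datasets: it computes the exact probability that two fair dice sum to a given target, using only the standard library. The function name and the chosen event are assumptions.

```python
from fractions import Fraction
from itertools import product


def prob_two_dice_sum(target: int, sides: int = 6) -> Fraction:
    """Exact probability that two fair dice with `sides` faces sum to `target`."""
    faces = range(1, sides + 1)
    outcomes = list(product(faces, repeat=2))  # all equally likely ordered pairs
    favourable = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(favourable, len(outcomes))


p_seven = prob_two_dice_sum(7)
```

An exact reference value of this kind gives a ground truth against which a model's free-text answer can be compared.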

10. AI应用与系统 58 篇

2606.12702 2026-06-12 cs.AI 新提交

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

以部署为中心的评估:预测临床大语言模型系统中的查询级拒绝风险

Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah

AI总结 针对临床大语言模型系统,提出基于部署上下文(如提供者类型、科室名称)的预响应分类器,预测用户拒绝风险,AUROC达0.719,并展示其在触发护栏和弃权中的效用。

详情
AI中文摘要

大语言模型(LLMs)正越来越多地集成到临床系统中,因此评估这些系统的实际效用至关重要。然而,静态基准倾向于衡量正确性而非用户接受度,跨查询聚合性能,并需要密集标注的数据集——这导致评估临床系统时存在重大盲点。在这项工作中,我们对嵌入某学术医疗中心电子健康记录中的LLM系统进行了以部署为中心的评估,其中用户反馈稀疏但密切反映了部署条件。具体而言,我们训练了一个预响应分类器,该分类器基于查询内容和生成前可用的部署特定上下文,估计未来交互导致用户拒绝LLM响应的风险。我们对模型进行了4.5个月用户反馈的前瞻性分析,发现我们的预测模型达到了0.719的AUROC。此外,我们估计了此类预测在两个下游用例(触发护栏和弃权)中的益处。我们的关键概念洞察是,利用部署特定上下文(即提供者类型、科室名称、用于响应的语言模型),而不仅仅是查询内容,可以提高预测用户是否会拒绝系统输出的能力。总之,我们的实证案例研究证明了使用部署特定上下文预测用户拒绝的可行性,为定向护栏打开了大门。

英文摘要

Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.
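The AUROC reported above can be read as the probability that a randomly chosen rejected query receives a higher predicted risk than a randomly chosen accepted one. The sketch below computes it directly from that pairwise definition; the function name `auroc` and the toy scores and labels are illustrative assumptions, not data or code from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical pre-response risk scores; label 1 means the user rejected the response.
risk = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
rejected = [1, 0, 1, 0, 0, 0]
example_auroc = auroc(risk, rejected)
```

The quadratic loop is only meant to make the definition explicit; for large logs a rank-based implementation would be the more practical choice.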

2606.12834 2026-06-12 cs.AI 新提交

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

神奇的科学智能体及其构建方法:用于Rietveld精修的AgentBuild

Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva

发表机构 * UT-Battelle, LLC(UT-Battelle有限责任公司) US Department of Energy (DOE)(美国能源部)

AI总结 提出AgentBuild框架,通过科学家编写的合同(包含评分标准、课程和知识库)自动构建科学智能体,用于X射线衍射数据的Rietveld精修,实现可复用的智能体编译而非手动调优。

详情
AI中文摘要

随着科学工作流从确定性可执行文件转向基于LLM的智能体,现有的开发实践(如微调、强化学习和即时运行)掩盖了科学家的判断。我们建议将智能体构建视为一个工作流阶段,并引入AgentBuild,它根据科学家编写的合同构建科学智能体。该合同是一个版本控制的评分标准、一个难度分级的课程和一个精心策划的外部知识库。基于评分标准的裁判门控一个元优化编码智能体,该智能体在声明的边界内编辑智能体,因此构建编译的是智能体,而不是科学家的判断。我们通过MCP和A2A背后的GSAS-II将其实例化用于X射线衍射数据的Rietveld精修,其中空白框架构建运行通过锂镧锆氧(LLZO)信噪比阶梯,达到4小时扫描作为前沿案例,并暴露了工作流范围限制。相同的评分标准既奖励可信的拟合,也评分轨迹范围,使前沿成为合同失败而非模式拟合失败。随着基础模型的发展,重新运行AgentBuild是重新调整,而不是重建,科学家编写的合同仍然是持久的资产。

英文摘要

As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

2606.12916 2026-06-12 cs.AI cs.CL cs.LG 新提交

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

MDForge:稀疏模拟器反馈下的智能分子动力学流水线设计

Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 提出MDForge,利用LLM智能体通过多智能体辩论将稀疏奖励稠密化,自动设计分子动力学流水线,在SAMPL基准上达到专家水平,并发现新型高亲和力CB[7]结合剂。

详情
AI中文摘要

分子动力学(MD)是原子分子科学中经典的计算机模拟方法,从第一性原理物理模拟分子行为。为新系统设计MD流水线需要大量专业知识:即使在一个分子上运行也代价高昂,排除了试错法。我们使用LLM智能体自动化这一专家流水线设计过程。与现有编排预定义工具集的MD智能体不同,我们将流水线设计视为开放式代码生成,其中智能体的行为通过语言奖励在线重塑。具体而言,我们构建了MDForge,一个LLM智能体,其上下文更新规则通过物理专家间的多智能体辩论将稀疏奖励稠密化。在三个SAMPL主客体结合自由能基准上,MDForge自动设计的MD流水线与人类专家竞争。部署在未见过的候选客体库上,其CB[7]流水线发现了一种新型结合剂,湿实验竞争NMR证实其为高亲和力、皮摩尔级的CB[7]结合剂。我们的数据和代码可在 https://github.com/Zehong-Wang/MDForge 获取。

英文摘要

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong-Wang/MDForge.

2606.12969 2026-06-12 cs.AI 新提交

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

用于配电缺陷检测的多模态智能体:基础模型评估

Quan Quan

发表机构 * Quan Quan

AI总结 提出多模态智能体框架,系统评估基础模型在感知、推理和工具使用三方面的能力,用于配电缺陷检测的闭环自动化。

详情
AI中文摘要

配电网络对可靠电力输送至关重要,但传统检测方法在语义理解、泛化和闭环自动化方面存在局限。为解决这些挑战,本文提出了一种专门用于配电缺陷检测的多模态智能体框架。本研究的核心是系统评估多模态基础模型作为统一认知引擎的能力。我们严格评估了它们在三个关键能力上的综合表现:(1)感知,模型必须准确识别设备并生成专家级的缺陷描述;(2)推理,模型根据视觉发现解释原因、评估严重性并基于领域知识规划维护策略;(3)工具使用,模型作为自主操作者执行动作——如查询知识库或生成工单——以实现闭环维护。为支持此评估,我们开发了领域特定的评估数据集和综合基准。实验结果表明了当前基础模型在这三个维度的优势与局限,为在高风险工业环境中部署自主智能体提供了实证依据。

英文摘要

The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions -- such as querying knowledge bases or generating work orders -- to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

2606.12976 2026-06-12 cs.AI 新提交

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

面向协作问题求解与AI推理数据集生成的数学论坛平台

Akbar Erkinov, Nurmukhammad Abdurasulov

发表机构 * Independent Researchers, San Francisco, CA, USA(独立研究者,美国加利福尼亚州旧金山)

AI总结 提出一个集成图像到LaTeX转换管线的论坛系统,消除数学内容分享的摩擦,支持桌面和移动端,并生成社区验证的数学问题数据集以训练AI推理。

Comments 11 pages, 3 figures

详情
AI中文摘要

在在线论坛中分享数学内容仍然是学生和教师的一个显著痛点:编写原始LaTeX容易出错,独立的光学字符识别工具需要切换平台,而当前的论坛软件没有提供从公式照片到渲染帖子的集成路径。我们提出了一个统一系统,通过将图像到LaTeX转换管线直接嵌入论坛发布界面来消除这一摩擦。用户上传或拍摄数学表达式的图像;系统通过Mathpix OCR API路由该图像,检测返回的输出是LaTeX还是包含内联数学的纯文本,应用适当的分隔符规范化,并在帖子提交到数据库之前以LaTeX或Markdown模式提供实时预览。该架构分为三个松散耦合的层:图像处理、渲染和存储,并支持桌面和移动客户端。已提交一份涵盖核心方法的美国临时专利申请。我们描述了完整的系统设计、每个组件的细节、数据模式以及关键的技术创新,并将该工作与现有的独立工具和论坛平台进行对比,以展示其填补的实际空白。除了直接的可用性之外,我们认为这种部署的平台构成了一个持续增长、社区验证的数学问题和逐步解决方案数据集,该资源可用于训练和基准测试AI系统以实现准确的数学推理。

英文摘要

Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.
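The detection and delimiter-normalisation step described above could look roughly like the following sketch. It operates on a string that is assumed to have already been returned by the OCR service; the heuristics, the mode names `"markdown"` and `"latex"`, and the choice of `$...$` / `$$...$$` as target delimiters are all assumptions made for illustration, not the authors' implementation.

```python
import re

# \( ... \) and \[ ... \] are common LaTeX math delimiters in OCR output.
INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)
DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
COMMAND = re.compile(r"\\[A-Za-z]+")  # e.g. \frac, \sum


def normalise(ocr_text: str) -> tuple[str, str]:
    """Return (preview_mode, text) with math delimiters rewritten to $...$ / $$...$$."""
    if INLINE.search(ocr_text) or DISPLAY.search(ocr_text):
        # Plain text containing delimited math: preview as Markdown.
        text = DISPLAY.sub(lambda m: "$$" + m.group(1).strip() + "$$", ocr_text)
        text = INLINE.sub(lambda m: "$" + m.group(1).strip() + "$", text)
        return "markdown", text
    if COMMAND.search(ocr_text):
        # Bare LaTeX with no delimiters: wrap the whole expression for display.
        return "latex", "$$" + ocr_text.strip() + "$$"
    return "markdown", ocr_text  # no math detected; pass through unchanged


mixed = normalise(r"Solve \( x^2 = 4 \) for x.")
bare = normalise(r"\frac{a}{b}")
```

Replacement functions (rather than replacement strings) are used in `re.sub` so that backslashes inside the captured math are not reinterpreted as escape sequences.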

2606.12983 2026-06-12 cs.AI 新提交

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

面向LLM驱动的硬件描述语言设计与验证数据整理的结构化测试台生成

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin, Yao-Ting Hsieh, Cheng Liang, Hsiang-Yu Tsou, Mu-Chi Chen, Yu-Kai Hung, Shao-Chun Ho, Po-Hsuang Huang, Shih-Hao Hung, H. T. Kung

发表机构 * National Taiwan University(国立台湾大学) Academia Sinica(中央研究院) Harvard University(哈佛大学)

AI总结 提出STG框架,利用硬件设计固有结构生成确定性测试台,比迭代LLM方法快720倍,编译成功率更高,覆盖率更高,误判更少,并用于数据整理和测试时扩展。

Comments 9 pages, 10 figures

详情
AI中文摘要

自动化测试台生成已成为大型语言模型(LLM)驱动的寄存器传输级(RTL)工作流中的关键瓶颈,其中大量候选设计必须快速可靠地验证。现有的基于提示的方法将测试台生成视为无约束的代码合成,产生随机输出,具有高令牌成本、低可重复性和不足的覆盖率。为了解决这一差距,我们提出了STG,一个结构化测试台生成框架,利用硬件设计的固有结构生成确定性测试台。作为直接验证工具,STG比基于迭代LLM的测试台生成流程快720倍,具有更高的编译成功率,实现更高的覆盖率,并减少对不正确DUT的错误通过判定。STG还通过暴露有缺陷的基准测试台帮助识别RTL生成基准中的错误。作为数据整理引擎,它在单个CPU核心上比基于LLM的过滤快11倍,能耗低127倍,由此得到的蒸馏模型在我们的多基准评估中提供了最先进的性能。作为测试时扩展预言,它减少了14-47%的节点数。我们的模型可在 https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12 获取。

英文摘要

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, has a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

2606.12991 2026-06-12 cs.AI 新提交

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

APCyc:通过自动环化实现环肽的性质导向设计

Yifan Zhao, Lang Qin, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) AI-Peptide Drug Design Joint Laboratory(AI-多肽药物设计联合实验室)

AI总结 提出APCyc框架,通过扩展残基词汇和显式编码环化位点与连接类型,结合贝叶斯后验引导,实现目标感知的环肽从头设计并联合优化多种理化性质。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

环肽是现代药物发现中一类有前景的治疗化合物,通常具有更好的稳定性和结合亲和力。然而,环肽的从头设计仍然具有挑战性,因为方法必须识别口袋适应的环化模式和连接位点,同时控制药物相关性质。这一挑战对于主要在线性肽数据上训练的生成模型尤为突出,这些模型可能无法捕捉环化特异性约束。为解决这一局限性,我们引入了APCyc,一个目标感知的从头环肽生成框架,该框架显式建模环化并联合优化多种基本理化性质。通过使用扩展的残基词汇表并显式编码环化位点和连接类型信息,APCyc学习环化感知表示,并利用贝叶斯后验引导将采样导向满足多个性质目标的环肽。实验结果表明,我们的模型学习了目标依赖的环化偏好,并实现了环肽设计的有效且可控的多性质优化。本文源代码可在以下网址获取:https://github.com/HKUSTGZ-ML4Health-Lab/APCyc。

英文摘要

Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address the limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences, and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at https://github.com/HKUSTGZ-ML4Health-Lab/APCyc.

2606.13042 2026-06-12 cs.AI cs.CV 新提交

Augmentation techniques for video surveillance in the visible and thermal spectral range

可见光和热红外光谱范围内视频监控的增强技术

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

发表机构 * Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB(弗劳恩霍夫光学、系统技术与图像处理研究所)

AI总结 针对多光谱CNN目标检测,研究可见光与热红外图像差异,探索数据增强技术对分类精度的影响,以提升监控性能。

Comments 8 pages

详情
Journal ref
SPIE Security + Defence, Strasbourg, 10th September 2019
AI中文摘要

在智能视频监控中,摄像机在白天和夜晚记录图像序列。通常,这需要不同的传感器。为了获得更好的性能,将它们结合起来并不罕见。我们关注的情况是,长波红外摄像机连续记录,此外,另一台摄像机在白天记录可见光谱范围内的图像,并且智能算法监控采集的图像。更准确地说,我们的任务是基于多光谱CNN的目标检测。乍一看,可见光谱范围内的图像与热红外图像的区别在于,前者一方面具有颜色和清晰的纹理信息,另一方面不包含物体发出的热辐射信息。尽管颜色可以为分类任务提供有价值的信息,但诸如光照变化和不同传感器的特性等因素仍然构成重大问题。无论如何,获取足够且实用的热红外数据集来训练深度神经网络仍然是一个挑战。这就是为什么借助可见光谱范围内的数据进行训练可能是有利的,特别是当待评估的数据同时包含可见光和红外数据时。然而,目前尚不清楚热辐射、形状或颜色信息的强烈变化如何影响分类精度。为了更深入地了解卷积神经网络如何做出决策以及它们从不同传感器输入数据中学到什么,我们研究了不同增强技术的适用性和鲁棒性。

英文摘要

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achieve better performance, it is not unusual to combine them. We focus on the case in which a long-wave infrared camera records continuously, a second camera records in the visible spectral range during daytime, and an intelligent algorithm supervises the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in the presence of color and distinct texture information on the one hand, and in not containing information about the thermal radiation emitted by objects on the other. Although color can provide valuable information for classification tasks, effects such as varying illumination and the peculiarities of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network remains a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared data. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

2606.13176 2026-06-12 cs.AI 新提交

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

Mental-R1:面向心理健康评估的对齐LLM推理

Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

发表机构 * University of Oxford(牛津大学) Oxford Suzhou Centre for Advanced Research(牛津大学苏州高等研究院)

AI总结 提出认知相对策略优化(CRPO)框架,通过阶段依赖不确定性建模和熵正则化机制,使LLM推理对齐人类认知过程,在8个心理健康数据集上加权F1平均提升10.4个百分点。

详情
AI中文摘要

焦虑、抑郁和自杀等心理健康问题仍然是紧迫的全球挑战,及时准确的评估对于有效干预至关重要。最近,大型语言模型已被探索用于心理健康评估。然而,现有的通用后训练方法与人类评估的认知过程不一致,可能导致不可靠的推理结果。为弥合这一差距,我们提出了认知相对策略优化(CRPO),这是一个专为心理健康领域设计的强化学习框架。CRPO通过将阶段依赖的不确定性建模集成到策略优化过程中,扩展了组相对策略优化。具体来说,我们引入了一种阶段熵正则化机制,该机制在早期推理阶段鼓励广泛探索,并在后期阶段逐步强制执行自信决策,模仿人类从不确定性到确定性的认知转变。此外,受认知评价理论的启发,我们形式化了认知推理阶段,从而指导基于理论的可解释推理。在8个心理健康数据集上的实验表明,CRPO在加权F1分数上比最佳强化学习基线平均提高了10.4个百分点。此外,CRPO训练的模型Mental-R1在推理密集型案例上相比现有大型语言模型展现出明显优势,表明CRPO增强了心理健康评估的推理能力。

英文摘要

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.

2606.13211 2026-06-12 cs.AI 新提交

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

医学影像AI中的幻觉:跨模态分析框架用于分类、检测与监管约束下的缓解

Omar Alshahrani, Muzammil Behzad

发表机构 * King Fahd University of Petroleum & Minerals, Saudi Arabia(沙特阿拉伯法赫德国王石油矿产大学) SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Saudi Arabia(沙特阿拉伯SDAIA-KFUPM人工智能联合研究中心)

AI总结 本文提出跨模态分析框架,统一五种影像模态的幻觉分类、检测与缓解策略,发现通用基础模型在幻觉基准上优于医学专用模型,并映射到FDA全生命周期监管。

详情
AI中文摘要

AI系统在医学影像中的部署速度超过了对其故障模式的理解。当前,最受临床关注的故障是幻觉:临床看似合理但事实错误的输出,包括虚构的解剖结构、遗漏的发现、错误的侧向性以及生成报告中的虚构测量值,直接影响到活检决策、分期和治疗计划。本结构化综述综合了同行评审研究、基准数据集和FDA监管指南,涵盖五种影像模态,对幻觉的分类、病因、检测和缓解进行了跨模态分析。具体而言,我们研究了三个问题:(1) 现有分类法如何跨模态统一?(2) 医学专用基础模型为何比通用模型产生更少的幻觉?(3) 哪些缓解策略有效且与FDA生命周期监督兼容?我们注意到,三种分类框架共同覆盖了影像流程,而单一框架无法做到。我们还强调,通用基础模型在幻觉特定基准上优于医学专用模型,表明狭窄领域微调可能引入过拟合导致的虚构。同时,放射科医生的监督仍然至关重要;例如,很高比例的AI生成标记在临床使用前需要专家修正。物理信息架构约束、思维链提示和人在回路保障各自针对不同的故障模式,并在组合时有效。所有发现均映射到FDA的总产品生命周期和预定变更控制计划框架,这些框架将幻觉管理视为生命周期义务而非部署前检查清单。

英文摘要

AI systems are being deployed across medical imaging faster than their failure modes are understood. At this point in time, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities? (2) how do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

2606.13241 2026-06-12 cs.AI 新提交

Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

Brick: 面向混合模型范式的空间能力路由

Francesco Massa, Marco Cristofanilli

发表机构 * Regolo AI Seeweb

AI总结 提出Brick多模态路由器,通过六维能力评分与查询难度估计,结合成本惩罚几何规则调度模型,在质量与成本间实现灵活权衡。

Comments 17 pages, 5 figures. Technical report

详情
AI中文摘要

定义查询难度是部署工程中最困难的问题之一。现有的LLM路由器依赖表面特征,如领域标签、关键词和token数量,忽略了实际决定模型成功的域内方差。前沿模型成本比本地开源模型高10到100倍,因此在生产规模下,即使每次请求的小额节省也会直接成为云账单的杠杆。我们提出了Brick,一种多模态路由器,它在六个能力维度上对每个模型进行评分,结合每个查询的难度估计,并通过成本惩罚的几何规则进行调度。一个连续的偏好旋钮允许操作员在部署时在最大质量和最大节省之间滑动。在5504个查询的基准测试中,Brick在最大质量模式下达到76.98%的准确率,超过了最佳单一模型(75.02%)和所有测试的路由器。在中性成本-质量配置下,Brick以比始终使用最强模型低4.71倍的成本实现了74.11%的准确率。在最低成本模式下,它降低了22.15倍的成本,准确率损失11.85个百分点。中位延迟从51.2秒降至22.8秒。

英文摘要

Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.
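The abstract does not spell out the cost-penalized geometric rule, so the following is only one plausible reading, sketched under loud assumptions: each model is a point in a six-dimensional capability space, a query is a demand vector in the same space, and the router minimises the Euclidean capability shortfall plus a cost term weighted by the operator's preference knob. The model names, capability values, costs, and the `route` function are all hypothetical.

```python
import math


def route(models: list[dict], demand: list[float], cost_weight: float) -> str:
    """Pick the model minimising (capability shortfall distance + cost_weight * cost)."""

    def penalty(model: dict) -> float:
        # Only dimensions where the model falls short of the query's demand count.
        shortfall = math.sqrt(
            sum(max(0.0, d - c) ** 2 for d, c in zip(demand, model["capability"]))
        )
        return shortfall + cost_weight * model["cost"]

    return min(models, key=penalty)["name"]


# Hypothetical two-model pool; capability scores lie in [0, 1] on six dimensions.
MODELS = [
    {"name": "small-local", "capability": [0.5] * 6, "cost": 0.01},
    {"name": "frontier", "capability": [0.9] * 6, "cost": 1.0},
]

easy_query = [0.3] * 6   # low estimated difficulty on every dimension
hard_query = [0.8] * 6   # high estimated difficulty on every dimension

easy_choice = route(MODELS, easy_query, cost_weight=0.5)
hard_choice = route(MODELS, hard_query, cost_weight=0.5)
hard_choice_cheap = route(MODELS, hard_query, cost_weight=1.0)
```

In this toy setup, raising `cost_weight` plays the role of sliding the knob toward the max-saving profile: the same hard query is sent to the cheaper model once the cost penalty outweighs its capability shortfall.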

2606.13249 2026-06-12 cs.AI 新提交

Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis

面向海事事故根因分析的多字段混合检索增强生成

Seongjin Kim, Sungil Kim

发表机构 * Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)(蔚山国立科学技术院工业工程系)

AI总结 提出多字段混合检索增强生成框架,利用结构化事故卡片和分层原因分类,通过字段感知的混合检索与融合排序,显著提升海事事故根因分析的检索和生成质量。

详情
AI中文摘要

海事事故裁决报告包含根因分析(RCA)的关键法庭调查结果,然而从数十年的记录中检索相关先例并起草一致的报告仍然劳动密集。本文提出一个用于自动化海事RCA的多字段混合检索增强生成(RAG)框架,利用包含13,329份韩国海事安全法庭(KMST)报告(1971-2025年)的综合数据集。我们将原始裁决转化为结构化的“事故卡片”知识库,索引三个不同字段——摘要、原因和处置——以及一个层次化的L1/L2原因分类。我们的检索策略采用字段感知的混合方法,通过互惠排名融合(RRF)融合稀疏和密集排名。鉴于缺乏大规模专家相关性标签,我们使用基于元数据派生代理相关性分数的天花板归一化召回率和nDCG评估检索性能。实验结果表明,我们提出的检索显著优于基线方法,将NormRecall@100从0.18提高到0.55。此外,将生成器基于检索到的先例,相比仅使用LLM的基线,RCA生成质量得到提升,LLM作为评判者的评分从3.34提高到3.72。这些发现表明,字段感知的RAG可以通过实现更快的先例搜索和更一致、基于证据的RCA起草,显著简化海事安全调查工作流程。

英文摘要

Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents and drafting consistent reports from decades of records remains labor-intensive. This paper proposes a multi-field hybrid retrieval-augmented generation (RAG) framework for automated maritime RCA, utilizing a comprehensive dataset of 13,329 Korea Maritime Safety Tribunal (KMST) reports (1971-2025). We transform raw adjudications into a structured knowledge base of "incident cards", indexing three distinct fields (Summary, Causes, and Disposition) alongside a hierarchical L1/L2 cause taxonomy. Our retrieval strategy employs a field-aware hybrid approach, fusing sparse and dense rankings via Reciprocal Rank Fusion (RRF). Given the lack of large-scale expert relevance labels, we evaluate retrieval performance using ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score. Experimental results demonstrate that our proposed retrieval significantly outperforms baseline methods, improving NormRecall@100 from 0.18 to 0.55. Furthermore, grounding the generator on the retrieved precedents enhances RCA generation quality over an LLM-only baseline, increasing the LLM-as-a-judge score from 3.34 to 3.72. These findings suggest that field-aware RAG can substantially streamline maritime safety investigation workflows by enabling faster precedent search and more consistent, evidence-based RCA drafting.
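Reciprocal Rank Fusion, mentioned above, scores each document as the sum of 1 / (k + rank) over the rankings in which it appears. The sketch below shows the standard formula on two toy rankings; the constant k = 60 is the commonly used default and is an assumption here, since the abstract does not state the value used, and the case identifiers are invented for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical sparse (keyword) and dense (embedding) rankings for one query.
sparse_ranking = ["case_A", "case_B", "case_C"]
dense_ranking = ["case_B", "case_D", "case_A"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

A field-aware variant could pass one sparse and one dense ranking per indexed field (Summary, Causes, Disposition) into the same function, since RRF accepts any number of input lists.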

2606.13302 2026-06-12 cs.AI cs.LG 新提交

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

物理引导的时空学习用于从视频估计海岸波浪峰值周期

Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

发表机构 * Namibia University of Science and Technology(纳米比亚科技大学) Indian Institute of Technology Indore(印度理工学院印多尔分校) Namdeb Diamond Corporation(纳米比亚钻石公司)

AI总结 提出物理引导的深度时空学习框架,结合自动区域检测、模拟到真实迁移学习和物理信息正则化,从海岸视频直接估计近岸波浪峰值周期,验证了基于Transformer和轻量级循环卷积架构的有效性。

详情
AI中文摘要

近岸波浪参数对于海岸工程、海岸线保护、海洋灾害评估和气候适应性的海岸管理至关重要。传统的监测系统如浮标和雷达平台提供精确监测,但安装和维护成本高,空间覆盖有限。通过深度学习实现了使用视频的被动海洋监测,然而许多方法在海洋学上缺乏物理可解释性、可行性和验证。本文提出了一种物理引导的深度时空学习框架,用于从被动海岸视频流直接估计近岸波浪峰值周期。该框架结合了基于自动时间方差感兴趣区域检测、多阶段模拟到真实迁移学习和物理信息正则化,以提高预测精度和物理一致性。评估了多种时空架构,如基于Transformer和循环卷积的架构,以及合成预训练、银标签自适应和专家微调。结果表明,基于Transformer的架构在瞬时预测精度方面表现更好,而轻量级循环卷积架构实现了更高的时间稳定性和操作海洋学技能。消融研究也证明了物理引导正则化在趋势跟随一致性和减少物理上不可信预测方面的益处。可解释性审计有助于将注意力集中在水动力活跃的碎波带区域,并与物理推导的波浪传播行为良好吻合。总体而言,所提出的框架展示了基于物理引导视频的深度学习系统在长期海岸波浪监测中的潜力,具有成本效益和操作可行性。

英文摘要

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems like buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods are not physically interpretable, feasible, or validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework for direct estimation of nearshore wave peak periods from passive coastal video streams is proposed. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance the predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, such as transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures performed better in terms of instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization in terms of trend-following consistency and fewer physically implausible predictions. Explainability auditing also helped to focus attention in hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

2606.12413 2026-06-12 cs.CY cs.AI cs.CE cs.CL cs.SE 交叉投稿

AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas

AI SciBrief 作为研究入门:一种引导学生进入新研究领域的框架

Andrei Lazarev, Dmitrii Sedov

AI总结 提出利用大语言模型平台 AI SciBrief 自动生成科学趋势摘要的框架,帮助学生克服信息过载,加速从信息搜索到知识创造的转变。

Comments This is the version of the article accepted for publication in TELE 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/TELE66816.2025.11211989

详情
Journal ref
2025 5th International Conference on Technology Enhanced Learning in Higher Education (TELE), Lipetsk, Russian Federation, 2025, pp. 365-369
AI中文摘要

各层次高等教育学生面临信息过载的重大障碍,这常常使研究过程的初始阶段陷入瘫痪并抑制动机。为此,本文介绍了一种教学框架,利用 AI SciBrief——一个由大语言模型驱动的平台,旨在自动生成科学趋势摘要。我们描述了这一多学科工具——初始覆盖金融、医学和教育领域——如何融入课程以克服这一“入门障碍”。该框架提供了具体方法,利用这些摘要促进学期论文的选题、加速学位论文的文献综述,并使研究生能够持续监测新兴趋势。我们得出结论,AI SciBrief 作为“研究入门”有效降低了学生的认知负荷,使他们能够更快地从信息搜索过渡到知识创造。

英文摘要

Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initial stages of the research process and suppresses motivation. In response, this article introduces a pedagogical framework that leverages AI SciBrief, a platform powered by a Large Language Model (LLM) designed to automatically generate digests of scientific trends. We describe how this multidisciplinary tool - with initial coverage in finance, medicine, and education - can be integrated into the curriculum to overcome this "entry barrier." The framework provides concrete methodologies for utilizing these digests to facilitate topic selection for term papers, accelerate literature reviews for dissertations, and enable postgraduate students to continuously monitor emerging trends. We conclude that AI SciBrief functions as a "gateway to research," effectively reducing students' cognitive load and empowering them to transition more rapidly from information searching to knowledge creation.

2606.12422 2026-06-12 cs.CY cs.AI cs.HC 交叉投稿

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

通过上下文工程创建和评估K-12生成式AI评分器

Zewei Tian, Alex Liu, Lief Esbenshade, Michael Xiao, Zachary Zhang, Yulia Lápicus, Thomas Han, Kevin He, Min Sun

发表机构 * University of Washington(华盛顿大学) Colleague AI

AI总结 本研究通过上下文工程利用商用基础模型构建LLM评分器,基于MCAS数据评估其在数学、科学和ELA上的评分一致性,发现大参数模型在数学和科学上表现良好,而ELA上差异较大,表明AI更适合作为形成性工具。

Comments Published in the Proceedings of the NCME 2026 Conference (https://www.xcdsystem.com/proceedings/ncme/8DbqHwv/presentation/28064.cfm?uuid=3EC982ED-A989-8E53-B42BC86334206028)

详情
AI中文摘要

将大型语言模型(LLM)整合到教育评估中代表了课堂评分实践的一个变革性转变。虽然自动评分系统和机器学习技术已经存在了几十年,但生成式AI(GenAI)现在使教育工作者能够以前所未有的效率和规模实施基于标准的评分(SBG)。本文考察了理论基础,并评估了一个LLM评分器,该评分器使用商用基础模型,结合上下文和提示工程,根据评分标准对学生作业进行评分。利用马萨诸塞州综合评估系统(MCAS)数据的实证评分者间一致性研究,我们使用Claude Sonnet 4、Haiku 4.5、GPT-5和GPT-5 Mini,观察了数学、科学和英语语言艺术(ELA)上的二次加权卡帕(QWK)和均方误差比例减少(PRMSE)。结果表明,LLM评分器,特别是基于参数更多的基础模型时,在数学和科学评估中与人类评分者达到显著一致性,而在ELA中表现各异,表明通用基础模型在特定上下文中可以有效评分。对教师和学生反馈的额外分析显示,对AI生成的叙述性反馈接受度很高,但对数值分数持怀疑态度,这表明LLM最有效地作为形成性工具而非总结性评估者。我们的发现表明,精心设计的混合模型结合AI效率和教师判断,可以减少工作量,提高反馈质量,并支持公平的评估实践,而不取代专业专长。

英文摘要

The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. While automated scoring systems and machine learning techniques have existed for decades, generative AI (GenAI) now enables educators to implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations and evaluates an LLM grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric. Drawing on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, we observed the Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and ELA, using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results demonstrate that LLM graders, especially when based on foundation models with more parameters, achieve substantial agreement with human raters in mathematics and science assessments, while performance varies in ELA, suggesting generic foundation models can be effective at scoring in given contexts. Additional analysis of teacher and student feedback reveals strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs function most effectively as formative tools rather than summative evaluators. Our findings indicate that thoughtfully designed hybrid models that combine AI efficiency with teacher judgment can reduce workload, enhance feedback quality, and support equitable assessment practices without displacing professional expertise.
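
The agreement statistic named in this abstract, Quadratic Weighted Kappa, can be illustrated with a minimal sketch. This is the textbook definition, not the authors' evaluation code; the function name, the 0-based integer score encoding, and the sample scores are assumptions made for illustration only.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """QWK between two lists of integer scores in 0..num_categories-1.

    Assumes num_categories >= 2. Returns 1.0 by convention when the
    statistic is undefined (both raters used one identical category).
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("score lists must be non-empty and of equal length")
    k = num_categories

    # Observed confusion matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal histograms give the chance-agreement (expected) matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical rubric scores on a 0-3 scale (not data from the paper).
human = [0, 1, 2, 3, 2, 1]
model = [0, 1, 2, 2, 2, 1]
print(quadratic_weighted_kappa(human, model, num_categories=4))
```

Because the penalty grows with the squared distance between scores, a grader that is off by one rubric level is penalized far less than one that is off by three, which is why QWK is a common choice for ordinal rubric scoring.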

2606.12424 2026-06-12 cs.CY cs.AI cs.HC 交叉投稿

AI-Automation Tooling in Computer Engineering Education: Mixed-Methods TAM/UTAUT Evidence for a General Acceptance Attitude

计算机工程教育中的AI自动化工具:基于TAM/UTAUT混合方法的一般接受态度证据

Aung Pyae

AI总结 本研究通过混合方法调查本科生对AI自动化工具(n8n平台)的接受态度,发现六个TAM/UTAUT构念融合为单一一般接受因子,绩效期望最强,享乐动机最弱,为课程整合提供理论依据。

详情
AI中文摘要

随着生成式AI和低代码工作流平台成为软件实践中的常规工具,一个关键的教育问题是下一代计算机工程师是否会将这些工具视为有用、可用且值得持续参与。本文报告了一项混合方法、横截面研究,涉及泰国三个相同脚本工作坊中本科生对AI自动化工具(通过开源平台n8n实例化)的接受度(n=103)。一个12项、五点李克特量表映射到六个TAM/UTAUT构念——绩效期望(PE)、努力期望(EE)、行为意向(BI)、自我效能(SE)、享乐动机(HM)和输出质量(OQ),并通过开放式反馈的归纳主题分析进行补充。分析结合了序数可靠性估计、自助置信区间、非参数检验、多重比较控制的相关性、多维度诊断、共同方法偏差检验以及跨会话比较。所有六个构念的接受度均良好,效应量大,其中PE最强,HM最弱。维度诊断进一步揭示,在这种简短的工作坊后情境中,经典的TAM/UTAUT子维度合并为一个单一的一般接受因子,这一发现具有重要的方法论和理论意义。定性主题在有用性和热情方面与定量概况一致,但在输出质量上存在分歧,揭示了一个虽小但表达清晰的可靠性怀疑少数群体。研究结果支持在本科计算教育中课程采用AI自动化工具,并确定了三个基于理论的教学杠杆:教学顺序支架、自我效能支持和信任校准干预。

英文摘要

As generative AI and low-code workflow platforms become routine in software practice, a key educational question is whether the next generation of computer engineers will accept these tools as useful, usable, and worthy of sustained engagement. This paper reports a mixed-methods, cross-sectional study of undergraduate computer engineering students' acceptance of AI automation tooling, instantiated through the open-source platform n8n across three identically scripted workshops in Thailand (n = 103). A 12-item, five-point Likert instrument mapped to six TAM/UTAUT constructs - Performance Expectancy (PE), Effort Expectancy (EE), Behavioral Intention (BI), Self-Efficacy (SE), Hedonic Motivation (HM), and Output Quality (OQ) - was complemented by inductive thematic analysis of open-ended feedback. Analyses combined ordinal reliability estimation, bootstrap confidence intervals, non-parametric tests, multiple-comparison-controlled correlations, polychoric dimensionality diagnostics, a common-method-bias check, and between-session comparisons. Acceptance was favorable across all six constructs with large effect sizes, with PE emerging as the strongest construct and HM as the weakest. Dimensionality diagnostics further revealed that canonical TAM/UTAUT sub-facets collapsed into a single general acceptance factor in this short-form post-workshop context, a finding with important methodological and theoretical implications. Qualitative themes converged with the quantitative profile regarding usefulness and enthusiasm but diverged on output quality, revealing a small yet articulate reliability-skeptical minority. The findings support the curricular adoption of AI automation tooling in undergraduate computing education and identify three theory-grounded instructional levers: instruction-sequencing scaffolds, self-efficacy supports, and trust-calibration interventions.

2606.12425 2026-06-12 cs.CY cs.AI cs.ET cs.HC cs.LG 交叉投稿

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

面向入门编程教育的可解释AI助手:通过教师-AI协作提高反馈可靠性

Muntasir Hoq, Griffin Pitts, Bradford Mott, Seung Lee, Jessica Vandenberg, Shuyin Jiao, Narges Norouzi, James Lester, Bita Akram

发表机构 * North Carolina State University(北卡罗来纳州立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种可解释AI驱动的课堂助手,通过分析学生代码、映射逻辑错误到教师识别的误解并提供教师撰写的反馈,提高入门编程课程中反馈的可靠性和可解释性。

Comments Full paper accepted to the 27th International Conference on AI in Education (AIED 2026)

详情
AI中文摘要

主动学习被广泛认为是提高入门编程课程学习效果的有效方法。然而,不足的教学支持往往限制了学生获得及时、个性化反馈的机会,而这对于掌握基础编程概念至关重要。尽管最近AI的进展,特别是大型语言模型,为反馈提供了可扩展的机会,但可解释性和可靠性问题仍然存在。在本文中,我们提出了一种AI驱动的课堂助手,它利用可解释的AI模型分析学生代码,将逻辑错误映射到教师识别的误解,并提供教师撰写的反馈,从而将可靠性建立在教师定义的教学知识基础上。为了评估我们框架的有效性,我们进行了专家评估以检查其与教师验证反馈的一致性,并在课堂环境中部署了该系统以评估学生对其可用性的看法。结果表明,该助手能够为学生提供准确的、经过教师验证的反馈,同时培养积极的体验。

英文摘要

Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

2606.12500 2026-06-12 cs.LG cs.AI 交叉投稿

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

基于机器学习的微观仿真从模拟交通冲突改进碰撞频率预测

Xian Liu, Carlo G. Prato, Gustav Markkula

AI总结 本文利用机器学习行为模型替代传统规则模型进行交通微观仿真,通过极端值理论分析模拟冲突预测碰撞频率,在英国利兹五个信号交叉口验证了ML模型无需地点校准即可提升预测准确性。

详情
AI中文摘要

交通微观仿真结合替代安全措施越来越多地被用作历史碰撞数据的主动替代方案,用于预测当前或计划道路基础设施设计的碰撞频率。然而,现有的基于微观仿真的安全研究采用了简化的基于规则的行为模型,这些模型能较好地再现交通流,但往往无法生成真实的冲突动态,限制了碰撞预测的准确性。机器学习(ML)行为模型的最新进展提供了一个有希望的机会,通过直接从大规模轨迹数据集中学习人类驾驶行为,可能提高微观仿真的真实性和碰撞频率预测。为了研究这种可能性,我们对英国利兹的五个真实信号交叉口进行了交通微观仿真,使用了标准的基于规则模型和最先进的ML模型。使用二维碰撞时间指标分析模拟车辆轨迹以识别模拟冲突,然后使用极端值理论建模以预测碰撞频率。结果表明,ML模型的冲突产生的碰撞预测与实际碰撞数据一致,而基于规则的模型由于缺乏对特定模拟交叉口的模型校准,无法产生有意义的预测。直接使用ML生成的模拟碰撞来预测实际碰撞频率也产生了较差的结果,这表明尽管当前的ML模型可以真实地再现冲突,但尚不能生成真实的碰撞。总体而言,研究结果表明,基于ML的行为模型在无需特定地点模型校准的情况下,有望从模拟冲突中改进碰撞预测,并为基于ML的交通微观仿真指明了明确的未来方向。

英文摘要

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

2606.12662 2026-06-12 cs.SD cs.AI cs.LG 交叉投稿

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

BASENet: 基于频带自适应的跨频带注意力语音增强网络

Damien Martins Gomes, François Capman

发表机构 * Thales SIX GTS, FRANCE(泰雷兹SIX GTS公司,法国)

AI总结 提出BASENet,通过Bark尺度划分频带并分配自适应容量编码器,结合跨频带注意力模块,以最少参数实现高PESQ和STOI,适用于资源受限设备。

详情
AI中文摘要

语音增强模型通常对所有频率采用统一容量,忽略了人类听觉的非均匀频谱分辨率。我们提出BASENet,一种频率自适应架构,将频谱划分为Bark尺度频带,并为每个频带分配基于临界频带密度的缩放容量编码器,自动为感知密集的低频分配更深的分支,为高频分配更轻的分支。跨频带注意力模块通过紧凑的频率池化表示以线性复杂度捕获跨频带的谐波依赖性。基于具有密集连接的倒残差块和卷积循环网络,BASENet在VoiceBank+DEMAND上以仅0.83M参数和7.3 GMACs达到3.55 PESQ和STOI 96%,是所有PESQ > 3.50方法中参数最少的。因果变体(3.44 PESQ)超过了几种非因果基线,证实了其在资源受限设备上实时流传输的适用性。

英文摘要

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
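
To make the Bark-scale partition concrete, here is a minimal sketch of how STFT bins could be grouped into Bark bands. It uses the Zwicker-Terhardt approximation of the Bark scale. The sample rate, FFT size, and function names are assumptions for illustration, and the abstract does not say which Bark formula or band edges BASENet actually uses.

```python
import math


def hz_to_bark(freq_hz):
    """Zwicker-Terhardt approximation of the Bark scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def bark_band_of_bins(sample_rate, n_fft):
    """Integer Bark band index for each of the n_fft // 2 + 1 STFT bins."""
    n_bins = n_fft // 2 + 1
    bin_hz = sample_rate / n_fft
    return [int(hz_to_bark(i * bin_hz)) for i in range(n_bins)]


def bins_per_band(band_indices):
    """Count how many STFT bins fall into each Bark band."""
    counts = {}
    for band in band_indices:
        counts[band] = counts.get(band, 0) + 1
    return counts


# Assumed analysis settings (16 kHz audio, 512-point FFT); not from the paper.
bands = bark_band_of_bins(sample_rate=16000, n_fft=512)
counts = bins_per_band(bands)
for band in sorted(counts):
    print(band, counts[band])
```

On a linear frequency grid the lowest Bark bands are only about 100 Hz wide, while the highest span more than a kilohertz, so the bands are very unequal in width. The abstract's capacity rule (deeper branches for low bands) is a separate design decision that this sketch does not attempt to reproduce.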

2606.12699 2026-06-12 cs.LG cs.AI 交叉投稿

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

基于可穿戴传感器数据的2型糖尿病个性化血糖评估:LLM驱动方法

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

发表机构 * Department of Information Systems and Cybersecurity, The University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校信息系统与网络安全系) School of Engineering Medicine, Texas A&M University(德克萨斯农工大学工程医学院) Department of Family and Community Medicine, The University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校家庭与社区医学系)

AI总结 提出GlyLLM框架,利用大语言模型整合可穿戴传感器数据和结构化元数据,实现个性化血糖动态建模,在血糖预测和糖尿病分类任务上分别比传统ML方法提升13.66%和13.08%。

Comments The 14th IEEE International Conference on Healthcare Informatics, 2026

详情
AI中文摘要

2型糖尿病(T2D)对全球健康构成日益严重的威胁,需要有效的血糖评估来支持个性化和改进的糖尿病护理。可穿戴传感器如连续血糖监测仪(CGM)和健身追踪器为血糖评估提供了许多有价值的见解。然而,有效分析这些数据需要与重要的个体层面背景信息整合。现有方法通常基于传统机器学习(ML),主要依赖历史血糖测量值,忽略了个性化信息,这限制了它们在多样化糖尿病群体中的性能。大语言模型(LLMs)的最新进展展示了它们整合多种数据模态同时建模序列依赖性的能力,激发了探索其在个性化血糖评估中潜力的兴趣。在本文中,我们提出了GlyLLM,一个基于LLM的框架,通过整合可穿戴传感器数据和结构化元数据来建模基于CGM的血糖动态。GlyLLM可以利用预训练LLM的广泛先验知识,并在决策时实现传感器-文本语义抽象。在AI-READI数据集上的两个相关任务实验表明,我们的模型在血糖预测的均方根误差(RMSE)上平均优于传统ML方法13.66%,在糖尿病分类的受试者工作特征曲线下面积(AUROC)上平均优于13.08%。此外,我们的消融研究表明,糖尿病调查和生物特征测试比其他健康信息对血糖评估更为关键。我们的工作为利用LLM推进T2D护理中的个性化血糖评估迈出了有希望的一步。

英文摘要

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

2606.12824 2026-06-12 eess.IV cs.AI cs.CV physics.med-ph 交叉投稿

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

采集状态作为结构化、可测量变量影响肺结节AI:核驱动的测量不稳定性和噪声驱动的检测脆弱性,DICOM元数据不可见

Daniel Soliman

发表机构 * Daniel Soliman, M.S(丹尼尔·索利曼,硕士)

AI总结 研究通过LUNA16训练的RetinaNet检测器,发现CT采集状态(重建核与噪声)独立影响AI的测量与检测性能,且无法从DICOM元数据恢复,提出采集感知的输入验证层。

详情
AI中文摘要

医学影像AI治理正在规范化:2026年ACR-SIIM实践参数建议本地验收测试和持续漂移监测,ACR Assess-AI注册使用DICOM元数据监测AI输出。我们认为在输出指标之下存在一个必要但目前未监测的层:输入研究是否保持在模型验证过的采集范围内。使用LUNA16训练的MONAI RetinaNet肺结节检测器,我们测试采集状态是否表现为结构化的可测量变量。在仅重建核不同的真实配对CT(NLST B30f vs B80f)上,核单独使AI测量的直径发生偏移,并在5.2%(155个结节中的8个)中翻转了Fleischner尺寸类别,而检测置信度不变(Wilcoxon p=0.22)。在受控的LIDC-IDRI扰动下,效应按轴分离:噪声轴降低检测置信度(p=5.9e-32,集中在6mm以下结节)但不影响测量,而频率/核轴破坏测量(p=8.6e-13)但不影响检测。一个4特征像素指纹恢复了重建身份(真实CT上患者级AUC约0.95,QIBA体模上0.995),而ConvolutionKernel DICOM标签无信息(不同重建标签相同)。核轴跨四个制造商传输(留一制造商AUC 0.94-0.98,与制造商内上限匹配)。因此采集状态映射到不同的AI故障模式:频率内容对应测量可靠性,噪声对应检测灵敏度,且无法从元数据恢复。采集感知的输入侧验证是现在进入影像AI认证的验收测试和漂移监测要求中缺失的层。

英文摘要

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

2606.12838 2026-06-12 q-bio.QM cs.AI cs.LG q-bio.GN 交叉投稿

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

OCOO-T: 一种用于转录扰动响应预测的简单可扩展虚拟细胞模型

Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai

AI总结 提出OCOO-T,一种基于流匹配的简约虚拟细胞模型,通过连续时间去噪和自适应层归一化,在多个基准上实现转录扰动预测的最优性能。

Comments 22 pages, 6 figures

详情
AI中文摘要

预测单细胞对遗传、化学和细胞因子扰动的转录响应是计算生物学和AI虚拟细胞(AIVC)建模中的一个基本挑战,对药物发现和基因调控网络的阐明具有直接影响。现有方法通常依赖辅助细胞状态编码器、分层变分自编码器、专用Transformer编码器-解码器模块或基因相互作用先验,将高维表达谱压缩为潜在表示。虽然有效,但这些设计增加了架构复杂性,可能限制可扩展性和泛化性。本文介绍了OCOO-T,一种基于流匹配的简约AIVC模型,用于转录扰动响应预测。OCOO-T利用一个直接操作连续基因表达谱的普通Transformer堆栈,并将扰动响应预测表述为连续时间去噪过程。通过自适应层归一化和上下文令牌整合扰动嵌入、剂量信息以及细胞系/细胞类型特异性。在Tahoe100M、Replogle和PBMC基准上的全面评估表明,OCOO-T在多种扰动和细胞类型上实现了最先进的性能,同时通过细胞上下文的修补和拆补有效扩展到长转录谱。通过利用基于Transformer去噪的单细胞组学简单性,OCOO-T为计算机细胞模拟提供了一个有效且可扩展的框架。

英文摘要

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.
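
The flow-matching formulation mentioned in this abstract can be sketched in a few lines. The example below uses the common linear interpolation path between a noise sample and a data sample, with the constant velocity between them as the regression target. This is a generic illustration under that assumption; the abstract does not specify OCOO-T's actual probability path, conditioning, or loss weighting, and all names and values here are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for linear interpolation."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(x0, x1)
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target)]
    return sum(squared) / len(squared)


rng = random.Random(0)
x1 = [0.0, 2.5, 1.0, 4.2]               # hypothetical expression profile
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
t = rng.random()                        # random time in [0, 1)
x_t = interpolate(x0, x1, t)            # the input a model would see

# A trained network would map (x_t, t, conditioning) to a velocity estimate.
# Here the exact target is plugged in to show that the loss is then zero.
perfect_prediction = target_velocity(x0, x1)
print(flow_matching_loss(perfect_prediction, x0, x1))  # 0.0
```

At inference time, a model trained this way is integrated from t=0 to t=1 starting from noise, which matches the abstract's description of prediction as a continuous-time denoising process.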

2606.12858 2026-06-12 cs.IT cs.AI cs.CV math.IT 交叉投稿

JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

JSCGC:面向无线生成式通信的联合源信道生成编码

Tong Wu, Zhiyong Chen, Guo Lu, Li Song, Feng Yang, Meixia Tao, Wenjun Zhang

发表机构 * Cooperative Medianet Innovation Center, the School of Information Science and Electronic Engineering, Shanghai Jiao Tong University(联合中位网创新中心,信息科学与电子工程学院,上海交通大学)

AI总结 提出联合源信道生成编码(JSCGC),用生成模型替换传统解码器,将通信重构问题转化为受感知约束下的受控生成问题,通过联合训练和随机采样框架最大化互信息,在潜空间图像传输中提升特征、语义和分布质量。

Comments submitted to IEEE Journal

详情
AI中文摘要

传统通信系统,包括基于分离的编码和基于学习的联合源信道编码(JSCC),通常是在香农率失真理论下设计的。然而,依赖通用失真度量无法捕捉复杂的人类视觉感知,常常导致模糊或不真实的复原。在本文中,我们提出联合源信道生成编码(JSCGC),一种生成式通信范式,用接收端的生成模型替换传统解码器。接收信号被视为一个条件,控制采样过程进入学习到的条件分布,将通信从用于失真最小化的确定性重构重新表述为在感知约束下用于互信息最大化的受控生成。基于这一表述,我们开发了一个统一的联合训练和高效随机采样框架,并提供了其在学习和推理阶段有效性的理论分析。在潜空间图像传输上的大量实验表明,JSCGC在不同信道条件下持续改善基于特征、语义层面和分布的质量,同时表现出一种以语义不一致而非失真为特征的独特错误行为。

英文摘要

Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typically designed under Shannon's rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces the conventional decoder with a generative model at the receiver. The received signal is treated as a condition that controls the sampling process into the learned conditional distribution, reformulating communication from deterministic reconstruction for distortion minimization to controlled generation for mutual information maximization under perceptual constraints. Based on this formulation, we develop a unified joint training and efficient stochastic sampling framework, and provide theoretical analysis of its effectiveness in both learning and inference stages. Extensive experiments on latent-space image transmission demonstrate that JSCGC consistently improves feature-based, semantic-level, and distributional quality across diverse channel conditions, while exhibiting a distinct error behavior characterized by semantic inconsistency rather than distortion.

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 交叉投稿

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University(斯坦福大学)

AI总结 提出紧凑潜世界模型,结合扩散Transformer(DiT)预测未来场景,在nuScenes上实现4.8倍更好的KID,并实现动作可控性(转向ρ=0.81)。

Comments 10 pages, 9 figures, 2 tables

详情
AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景,从而无需真实世界部署即可进行规划和仿真,但在紧凑、可训练的规模下,未来具有模糊性,且该领域的标准失真度量具有误导性:它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题,该模型给定当前前摄像头潜变量和一系列自我动作,预测未来场景潜变量,由冻结解码器渲染为$256 \times 256$帧,最多提前8秒,在150个保留的nuScenes场景上评估。我们首先基准测试预测位置:在跨越四个表示族的六个冻结编码器中,具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer(DiT),并通过受控诊断识别其所需的四个要素:空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中,我们揭示了核心矛盾:失真度量(余弦相似度、SSIM)倾向于模糊均值,掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界:扩散模型达到KID 0.078,而回归为0.375(好4.8倍),且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性(转向驱动场景位移,Spearman $\rho = 0.81$,而回归为$-0.18$)。我们将有限的单次运动归因于共享当前锚点,并设计了一个紧凑的170万参数“跳跃”模型,恢复完整的真实运动幅度($1.02\times$ GT),而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents that a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
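
The action-controllability claim above is reported as a Spearman rank correlation between steering and scene displacement. The following is a minimal sketch of that statistic (Pearson correlation computed on average ranks). The steering and displacement values are invented for illustration and are not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical steering commands and measured horizontal scene shifts.
steering = [-0.4, -0.1, 0.0, 0.2, 0.5]
displacement = [-3.1, -0.2, -0.6, 1.9, 4.0]
print(spearman_rho(steering, displacement))
```

A rank-based measure suits this use because it only asks whether stronger steering produces larger displacement, without assuming the relationship is linear.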

2606.12988 2026-06-12 cs.CV cs.AI 交叉投稿

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

一种用于实时个性化人体工学姿态分析的机器学习框架

Manex Atxa, Bruno Simoes, Julen Balzategui

发表机构 * Vicomtech Foundation(Vicomtech基金会) Basque Research and Technology Alliance(巴斯克研究与技术联盟) BRTA

AI总结 提出利用三维体积视频数据实时预测人体工学/非工学姿态的方法,结合3D点云多角度分析与个性化深度学习分类器,克服固定视角遮挡问题,实现实时评估。

Comments 13 pages, 7 figures, conference 24CMH

详情
AI中文摘要

本文介绍了一种利用三维体积视频数据实时预测人体工学和非工学姿态的新方法。尽管该方法是为人体工学评估设计的,但它可以适应其他需要实时分析人体姿态的应用。该系统的一个突出特点是能够在评估过程中分析3D点云,从而实现多角度计算。这克服了相机通常提供固定视角的关键限制,从而限制了全面姿态评估可用的数据,尤其是在发生遮挡时。系统持续自动地对实时流数据使用选定的视角进行姿态推断;然而,只有用户手动选择和标记的姿态用于训练个性化深度学习分类器。该方法通过一个案例研究进行了优化,其中RGB-D相机捕捉了执行负重任务的受试者,实现了实时骨骼标记。模型在此数据上训练,并在训练阶段后对新流数据实时进行推断。本研究通过结合最先进的3D数据技术和传统的2D姿态估计算法,为实时人体工学评估提供了一种可扩展且实用的方法。它解决了工作场所环境中日益增长的安全与健康监测需求,标志着对该领域的显著贡献。

英文摘要

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

2606.13007 2026-06-12 cs.LG cs.AI 交叉投稿

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

scLLM-DSC:基于LLM知识增强的跨模态深度结构聚类用于单细胞RNA测序

Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of Chinese Academy of Sciences(中国科学院大学) Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) School of Computing and Information Technology, Great Bay University(大湾区大学计算机科学与技术学院) School of Engineering, Westlake University(西湖大学工学院)

AI总结 提出scLLM-DSC框架,通过知识驱动语义视图与结构感知拓扑视图的跨模态对比对齐,利用LLM增强单细胞RNA测序数据的聚类性能,显著优于现有方法。

详情
AI中文摘要

聚类是scRNA-seq分析的基础,是识别细胞群体和解析组织异质性的基石。然而,现有方法专注于挖掘数值统计模式,由于忽略了基因编码的内在生物学功能,存在语义不可知的问题。虽然大语言模型(LLM)提供了有前景的语义能力,但生成式预训练目标与判别式下游任务之间的结构不匹配阻碍了它们直接适应细胞聚类。为弥合这一差距,我们提出了scLLM-DSC,一种新颖的LLM知识增强跨模态深度结构聚类框架。与数据驱动范式不同,scLLM-DSC通过协同两个视图建立语义基础表示:从NCBI基因先验和上下文化的Cell2Sentence嵌入中提取的知识驱动语义视图,以及通过图引导编码器提取的结构感知拓扑视图。关键的是,我们引入了一种跨模态对比对齐机制,以在统一潜在空间中强制生物学语义与转录组特征之间的一致性。广泛的基准测试表明,scLLM-DSC在聚类准确性上显著优于十一个最先进的基线方法。

英文摘要

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

2606.13135 2026-06-12 cs.CV cs.AI 交叉投稿

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

皮肤肿瘤皮肤镜图像的级联分类:可控敏感度与外部临床验证

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)(俄罗斯科学院伊万尼科夫系统编程研究所) Orel Oncological Dispensary(奥廖尔肿瘤医院)

AI总结 本研究比较了四种深度学习架构在皮肤镜图像分类中的表现,提出一种两阶段级联分类方案,通过可调分诊阈值实现敏感度控制,并在外部临床数据集上验证了泛化差距。

Comments 28 pages, 8 figures, 10 tables

详情
AI中文摘要

目的:比较皮肤肿瘤皮肤镜图像的深度学习架构和分类方案,并评估从开放国际数据集到俄罗斯临床独立数据集的泛化能力。方法:在三种方案中比较四种架构(ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S):二分类(恶性/良性)、单阶段四分类(良性、MEL、SCC、BCC)和两阶段级联(二分类分诊,然后三分类MEL/SCC/BCC)。所有模型使用ImageNet预训练权重和单一增强协议,在聚合的开放ISIC Archive数据上训练,并在内部保留样本和两个临床数据集(Melanoscope AI移动系统;谢切诺夫大学)上评估。结果:内部二分类阶段达到ROC-AUC 0.952-0.966;在谢切诺夫大学数据集上降至0.797-0.893,敏感度降至0.53-0.67,ECE从0.02升至0.27-0.39,且低估恶性,量化了排序和校准中的泛化差距。配对检验确认了临床数据上的一个架构间结果:二分类阶段ViT-B/16的缺陷(p<0.05);在区分阶段,没有架构显示出显著优势。级联方案在大多数架构上提高了宏F1,但仅对ViT-B/16显著,通过恢复被分配到主导良性类别的恶性病变。在ISIC MILK10k上,直接11分类的平均类别敏感度为0.525。结论:可调分诊阈值提供了标准单阶段(argmax)分类无法实现的敏感度控制,并更好地再现了临床鉴别诊断逻辑。持续的泛化差距要求在部署前进行外部临床验证和重新校准。

英文摘要

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
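
The conclusion's point about a tunable triage threshold can be sketched as follows. This is a schematic of a two-stage cascade under assumed inputs (a stage-one malignancy probability and stage-two subtype probabilities). The function names, the example probabilities, and the default threshold are hypothetical and not taken from the paper.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Stage 1: binary triage. Stage 2: pick a subtype among MEL/SCC/BCC."""
    if p_malignant < triage_threshold:
        return "benign"
    return max(subtype_probs, key=subtype_probs.get)


def triage_sensitivity(cases, triage_threshold):
    """Fraction of truly malignant cases that pass the triage stage."""
    malignant = [c for c in cases if c["malignant"]]
    passed = sum(1 for c in malignant if c["p_malignant"] >= triage_threshold)
    return passed / len(malignant)


# Invented stage-one scores with ground-truth labels, for illustration only.
cases = [
    {"p_malignant": 0.92, "malignant": True},
    {"p_malignant": 0.41, "malignant": True},
    {"p_malignant": 0.28, "malignant": True},
    {"p_malignant": 0.35, "malignant": False},
    {"p_malignant": 0.05, "malignant": False},
]

# Lowering the threshold lets more malignant cases through to stage 2.
for threshold in (0.5, 0.3, 0.2):
    print(threshold, triage_sensitivity(cases, threshold))

subtypes = {"MEL": 0.6, "SCC": 0.1, "BCC": 0.3}
print(cascade_predict(0.41, subtypes, triage_threshold=0.5))  # benign
print(cascade_predict(0.41, subtypes, triage_threshold=0.3))  # MEL
```

A single-stage argmax classifier has no such dial: a lesion with a 0.41 malignancy score is simply assigned to whichever class scores highest. In the cascade, sensitivity can be raised at the cost of more benign lesions reaching the second stage (the 0.35 case passes at a threshold of 0.3).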

2606.13188 2026-06-12 cs.CV cs.AI 交叉投稿

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Transformer引导的图注意力直接心脏网格重建:一种结构数字孪生框架

Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

发表机构 * CAVE Labs, C-IoT, Dept. of CSE, PES University(PES大学计算机科学与工程系C-IoT实验室CAVE实验室) C-IoT, Dept. of CSE, PES University(PES大学计算机科学与工程系C-IoT实验室)

AI总结 提出端到端网络,结合3D Swin Transformer和GAT,直接从医学图像生成平滑的心脏表面网格,避免传统后处理,在MM-WHS 2017上实现1.8 mm平均Chamfer距离。

详情
AI中文摘要

构建患者特异性心脏模型是精准心脏病学的核心,但这些模型在临床应用中始终面临同一障碍:网格生成缓慢、混乱且令人沮丧。标准工作流程——分割图像、运行Marching Cubes、然后手动清理结果——耗时、操作者间不一致,并且需要大多数临床团队不具备的专业知识。我们采取了一种根本不同的方法。我们不将分割和网格生成视为两个独立问题,而是训练一个单一的端到端网络,直接从原始3D医学图像生成平滑、可用于模拟的心脏表面网格。核心是一个3D Swin Transformer编码器-解码器,从CT或MRI体积中提取体积特征,配以一个图注意力网络(GAT)头,迭代变形模板网格以拟合患者心脏边界。我们在MM-WHS 2017基准上使用CT和MRI进行了测试。分割分数具有竞争力(CT上Dice为0.84,MRI上为0.83),但主要关注点是网格质量:平均Chamfer距离为1.8 mm,95%分位数表面距离低于5 mm。每个网格通过单次前向传播生成——无需Marching Cubes、平滑滤波器或手动清理。我们认为,对于心脏数字孪生管道,几何保真度和拓扑正确性比像素级Dice分数更重要。通过消除后处理瓶颈,该方法使患者特异性心脏模拟在临床使用中变得更加可行。

英文摘要

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
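For readers unfamiliar with the mesh metric quoted above, the sketch below computes a symmetric Chamfer distance between two small point sets. It assumes one common convention (the average of the two directed mean nearest-neighbour distances, unsquared); the abstract does not state which variant the authors use, and the point sets here are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def directed_mean_distance(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each point in `src` to its nearest neighbour in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: average of the two directed mean distances."""
    return 0.5 * (directed_mean_distance(a, b) + directed_mean_distance(b, a))


# Hypothetical vertices (in mm): the reference is the prediction shifted 1 mm along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]

cd = chamfer_distance(predicted, reference)
print(f"Chamfer distance: {cd:.3f} mm")
```

The brute-force nearest-neighbour search is quadratic in the number of points; a real mesh evaluation would typically sample the surfaces and use a spatial index.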

2606.13236 2026-06-12 cs.LG cs.AI cs.SD stat.AP 交叉投稿

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

解码昆虫之歌:一种多任务半监督直翅目生物声学分类器

Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

发表机构 * University of Oxford(牛津大学)

AI总结 提出PULSE半监督多任务框架,结合弱监督分类、自监督学习和知识蒸馏,在直翅目生物声学分类中优于通用模型,并通过主动学习进一步提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

被动声学监测在生态推断方面具有巨大潜力,但现有的自动化工具通常训练范围狭窄且不可迁移。我们通过PULSE(一种用于直翅目生物声学的半监督多任务框架)解决了这些局限性,该框架结合了弱监督物种分类、未标记野外音频的自监督学习以及来自通用生物声学模型的知识蒸馏。我们的领域自适应专家模型在所有指标上均优于最先进的通用模型(宏F1:0.21 vs. 0.07;AUC:0.74 vs. 0.45;AP:0.32 vs. 0.19),主动学习进一步将F1提升至0.34,AUC提升至0.84。除了分类之外,学习到的嵌入编码了生态上有意义的结构,并通过交互式可视化工具暴露出来,用于生态发现。

英文摘要

Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

2606.13253 2026-06-12 cs.SD cs.AI 交叉投稿

Towards Personalized Federated Learning for Dysarthric Speech Recognition

面向构音障碍语音识别的个性化联邦学习

Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 针对构音障碍语音识别中联邦学习异构性问题,提出参数平均和嵌入平均两种个性化聚合策略,在UASpeech和TORGO上分别实现0.99%和0.56%的绝对词错误率降低。

详情
AI中文摘要

构音障碍者的语音识别具有挑战性。虽然基于联邦学习的ASR可以有效保护隐私,但它面临由说话人变异性引起的异构性问题。在这种异构性下,强制所有说话人共享相同的模型组件可能不是最优的,因此个性化是一个有前景的方向;然而,关于构音障碍语音的相关研究仍然有限。为此,本文探索了两种实现个性化的聚合策略,包括基于参数的平均策略和基于嵌入的平均策略。在UASpeech和TORGO上的实验表明,所提方法优于基线正则化FedAvg,在UASpeech上实现了高达0.99%绝对(3.15%相对)的统计显著词错误率降低,在TORGO上实现了0.56%绝对(4.73%相对)的降低。

英文摘要

Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.
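The paired absolute and relative figures above imply an approximate baseline WER, since relative reduction = absolute reduction / baseline WER. The sketch below only rearranges that identity. The resulting baselines are back-of-the-envelope values inferred from rounded numbers in the abstract, not figures reported by the authors, and the function name is hypothetical.

```python
def implied_baseline_wer(absolute_reduction_pct: float,
                         relative_reduction_pct: float) -> float:
    """Baseline WER (in %) implied by an absolute and a relative WER reduction."""
    return absolute_reduction_pct / (relative_reduction_pct / 100.0)


# Figures quoted in the abstract (best reported reductions per corpus).
uaspeech_baseline = implied_baseline_wer(0.99, 3.15)  # roughly 31.4 % WER
torgo_baseline = implied_baseline_wer(0.56, 4.73)     # roughly 11.8 % WER

print(f"UASpeech implied baseline WER: {uaspeech_baseline:.1f}%")
print(f"TORGO implied baseline WER:    {torgo_baseline:.1f}%")
```

Reading both numbers together helps put the gains in context: a sub-1-point absolute improvement is measured against baselines that differ by almost a factor of three between the two corpora.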

2606.13298 2026-06-12 cs.SE cs.AI 交叉投稿

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

在智能体AI采用下的架构质量挖掘:Java仓库的因果研究

Oliver Aleksander Larsen, Mahyar T. Moghaddam

AI总结 通过双重差分设计和Borusyak插补估计器,研究智能体AI工具采用对Java仓库架构气味密度(ASD)的因果影响,发现ASD下降6.7%源于代码量增长,而非架构改进。

Comments 16 pages. Accepted for presentation at the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026, Krakow, Poland, 2-4 September 2026, and for publication in the Springer LNCS proceedings. This is the author's accepted manuscript

详情
AI中文摘要

AI编码工具现已被大多数开发者使用,这些工具的智能体化使用普及了俗称“氛围编码”的实践。然而,关于其对软件架构影响的因果证据却很少。先前的因果工作衡量了代码层面的结果(复杂度、静态分析警告);这种退化是否会传播到架构层面仍未知。我们挖掘了151个开源Java仓库,其中74个检测到智能体AI采用(通过配置文件和Co-Authored-By提交尾注识别),以及77个倾向得分匹配的对照仓库,每个仓库跨越13个月,生成1,811个月度Arcan快照。我们采用交错双重差分设计和Borusyak插补估计器,估计采用对架构气味密度(ASD)的因果效应,将近期用于代码层面指标的因果设计应用于架构层面。总气味计数基本不变(+1.1%,p=0.82),而代码行数增长12.8%(p=0.003);因此,ASD下降6.7%(p=0.004)是分母效应而非架构改进。按类型估计和稳健性检验(wild cluster bootstrap、Lee bounds、陈旧观测敏感性)证实了这一模式;处理前趋势平坦(Wald p=0.90),与平行趋势一致。当处理影响系统规模时,密度归一化结果可能产生误导:对AI工具采用的因果挖掘研究需要原始计数和显式分解。完整的复现包,包括精心整理的151个仓库月度面板,已公开提供。

英文摘要

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.
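The denominator effect described above can be made concrete with a toy decomposition. The counts below are hypothetical and are not the paper's data; the abstract's own estimates come from a regression design and need not compose as a simple ratio. The sketch only shows that a density can fall while the raw smell count stays flat, and that the log change in density splits exactly into a numerator term and a denominator term.

```python
import math


def density(smells: int, kloc: float) -> float:
    """Architectural smells per thousand lines of code."""
    return smells / kloc


# Hypothetical repository before and after adoption: same smells, more code.
smells_before, smells_after = 200, 200
kloc_before, kloc_after = 100.0, 110.0

d_before = density(smells_before, kloc_before)
d_after = density(smells_after, kloc_after)
relative_change = d_after / d_before - 1.0  # about -9.1 %

# Explicit decomposition: dlog(density) = dlog(smells) - dlog(LOC)
dlog_smells = math.log(smells_after / smells_before)  # 0.0: no architectural change
dlog_loc = math.log(kloc_after / kloc_before)         # growth of the denominator
dlog_density = math.log(d_after / d_before)

print(f"density change: {relative_change:.1%}")
print(f"numerator term: {dlog_smells:+.4f}, denominator term: {-dlog_loc:+.4f}")
```

Reporting the two terms separately, as the abstract recommends, makes it visible that the whole decline in this toy case comes from code growth.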

2606.13341 2026-06-12 cs.CV cs.AI physics.med-ph 交叉投稿

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

双域等变生成对抗网络用于多模态CT-PET合成

Gabriel Steele, Alzahra Altalib, Alessandro Perelli

AI总结 提出双域等变生成对抗网络(DDE-GAN),联合空间与频域学习并融入旋转等变性,实现高保真多模态CT-PET图像合成。

Comments 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

详情
AI中文摘要

我们提出了一种用于多模态CT-PET图像合成的双域等变生成对抗网络(DDE-GAN)。传统的基于GAN的方法通常仅在空间域中操作,忽略了几何一致性,导致结构保真度有限。DDE-GAN通过联合学习空间域和频率(傅里叶)域,捕捉互补的解剖和频谱信息,解决了这些挑战。此外,嵌入在CT和PET测量物理中的旋转等变性被整合到生成器和判别器的损失中,以确保在旋转下的一致响应,从而提高解剖准确性。一种分层双域训练策略通过多阶段损失函数强制实现域内和域间一致性。在HECKTOR 2022 CT-PET数据集上的评估表明,DDE-GAN在CT-PET图像合成中取得了优于基线模型的合成质量。结果表明,将双域学习与几何等变性相结合,显著增强了多模态图像合成的准确性和鲁棒性,为PET补全和数据增强等实际应用提供了可能。

英文摘要

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

2606.13380 2026-06-12 quant-ph cs.AI 交叉投稿

An LLM System for Autonomous Variational Quantum Circuit Design

用于自主变分量子电路设计的大语言模型系统

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai

AI总结 提出一个基于大语言模型的自主代理框架,通过迭代设计量子电路,在量子特征映射和变分量子本征求解器任务中取得优于或媲美现有方法的性能。

Comments 63 pages, 19 figures, 3 tables

详情
AI中文摘要

高性能量子电路的设计在很大程度上仍然依赖于人类专家。我们引入了一个自主代理框架,该框架利用大语言模型在明确的设计约束下进行迭代量子电路设计。我们的系统集成了七个组件:探索、生成、讨论、验证、存储、评估和审查。这些组件形成了一个闭环工作流,结合了基于网络的知识获取、基于文献的批评、可执行代码生成和实验反馈。我们在两个任务上评估了该框架:用于量子机器学习的量子特征映射构建和用于量子化学中变分量子本征求解器应用的拟设生成。在图像分类基准测试中,生成的最佳特征映射优于代表性的量子特征映射,并且当扩展到更大的量子比特数时,超过了经典的径向基函数核。在七个分子的基态能量估计中,生成的拟设达到了与广泛使用的化学启发式和硬件高效构造相竞争的精度,同时满足施加的缩放约束。这些结果确立了由大语言模型驱动的代理系统作为自动化量子电路设计的可行范式,并展示了人工智能系统如何跨科学领域参与迭代科学优化工作流。

英文摘要

The design of high-performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit design under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains competitive accuracy with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.

2606.13382 2026-06-12 cs.CV cs.AI 交叉投稿

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

SmartFont: 少样本字体生成的动态条件分配

Zian Yang, Zixin Wang

发表机构 * Fudan University(复旦大学)

AI总结 提出SmartFont扩散框架,通过全局内容-风格生成与弱监督局部校正专家结合,并引入去噪状态条件分配模块动态加权全局与局部特征,实现少样本字体生成的全局完整性与局部细节保真度平衡。

详情
AI中文摘要

少样本字体生成同时需要全局结构完整性和细粒度局部风格保真度。现有方法通常要么依赖全局内容-风格建模(鲁棒但解耦不完美),要么强调组件/局部建模(捕捉细节但严重依赖局部先验和参考覆盖)。我们认为关键挑战不仅在于学习更纯净的条件,而在于通过生成过程中的多级分配来组织互补但有偏的全局和局部条件。为此,我们提出SmartFont,一个基于扩散的少样本字体生成框架,结合全局内容-风格生成与弱监督局部校正专家。局部分支通过弱组件监督学习专家级局部概念和语义有意义的空间图,实现无需显式组件条件推理的细粒度校正。在此基础上,去噪状态条件分配模块在时间步和注入块上自适应地加权全局内容、全局风格和局部校正特征。大量实验表明,SmartFont实现了更好的全局-局部平衡,提高了字形质量和局部细节保真度。

英文摘要

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves better global-local balance and improves glyph quality and local detail fidelity.

2606.13449 2026-06-12 cs.SE cs.AI 交叉投稿

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

面向指令即代码:理解指令文件对智能体拉取请求的影响

Ali Arabat, Mohammed Sayagh

AI总结 通过分析148个项目的15549个智能体PR,发现指令文件对合并率、代码变更量和合并工作量无一致正面影响,但成功项目指令文件更长且结构更清晰,提出“指令即代码”研究方向。

Comments 5 pages, 8 figures, 23rd International Conference on Mining Software Repositories, April 13--14, 2026

详情
AI中文摘要

AI智能体(如GitHub Copilot)作为队友协作完成不同的软件工程任务,包括通过拉取请求(Agentic-PRs)提出的代码生成。为了提高智能体效率,开发者创建指令文件来指导AI智能体,包括如何导航项目、定位正确组件、运行测试、遵守最佳实践等。本文研究了这些指令的创建与AI智能体在创建更好的拉取请求方面的性能之间的关系,这些拉取请求具有更高的成功机会(即合并率)、处理更复杂的任务(例如代码变更量),并且需要更少的合并工作量(例如合并时间)。为此,我们分析了来自AIDev数据集中148个项目的15,549个智能体PR。使用这三个维度,我们比较了每个项目在创建指令文件前后的情况。我们发现,为AI智能体指定指令并不一定会带来更好的结果。使用指令文件后,27.7%的项目的合并率至少提高了20%,而26.35%的项目合并率下降。在变更量(例如代码变更量、修改文件数量)和合并智能体PR的工作量(例如合并时间和评论数量)方面也观察到相同的情况。通过初步探索,我们发现成功提高合并率的项目具有更长的指令文件,并且这些文件结构良好,分为更多的章节和子章节。我们的结果激励了研究需求,以帮助从业者将指令文件的开发视为一项软件工程活动(即“指令即代码”)。

英文摘要

AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (i.e., Instructions-as-Code).

2606.13468 2026-06-12 cs.SE cs.AI 交叉投稿

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

理解AI代理生成的拉取请求修复被拒绝的原因——来自AIDev数据集的洞察

Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh

AI总结 通过分析AIDev数据集,发现46.41%的AI代理(Copilot、Devin、Cursor、Claude)提出的代码修复被拒绝。本文对306个未合并的PR进行定性研究,归纳出14个拒绝原因,分为四类,并提出了改进模型引导的建议。

Comments 5 pages, 2 figures, MSR '26: Proceedings of the 23rd International Conference on Mining Software Repositories, April 2026, Rio de Janeiro, Brazil

详情
AI中文摘要

AI编码代理越来越多地被用于生成拉取请求(PR),以在软件项目中提出代码修复。通过对AIDev数据集的初步探索,我们发现由Copilot、Devin、Cursor和Claude代理提出的修复中有46.41%被拒绝。这代表了大量浪费的资源,需要人工审查、验证以及运行测试和验证,而这些修复最终被丢弃。本文的目标是理解AI代理的失败模式,这对于更好地将AI代理集成为高效团队成员至关重要。本文对由前述代理创建或共同创作的306个未合并的拉取请求的代表性样本进行了定性研究,随后对拒绝原因进行了定量分析。我们的定性发现确定了14个原因,分为四个高级类别,用于拒绝AI代理的修复。我们观察到,开发者可能因以下原因拒绝修复:修复的实现不正确(例如,不完整、方法错误)、修复未通过持续集成(CI)管道并测试失败、代理无法执行实现(例如,未生成代码、会话丢失),以及修复优先级低。我们的结果揭示了在以下层面更好引导模型的重要性:(1)提出关于修复问题应遵循的方法的提示,(2)概述不应采取的方法的约束或限制,以及(3)指导代理如何通过CI管道验证实现而不引入破坏性变更。我们的结果表明,需要良好的任务优先级排序,以便生成的修复不会导致浪费的人工审查努力或浪费的代理资源(例如,令牌、计算或允许的请求数量)。

英文摘要

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources: human reviews, verifications, and test and validation runs spent on fixes that are ultimately discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. We conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons, divided into four high-level categories, for rejecting AI-agent fixes. We observe that developers reject fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

2606.13535 2026-06-12 hep-ex cs.AI hep-ph 交叉投稿

AgentRivet: an automated system for producing Rivet routines from journal publications

AgentRivet:从期刊论文自动生成Rivet例程的系统

Antonio J. Costa, Caterina Doglioni, Christian Gütschow, Andrew D. Pilkington, Sukanya Sinha

发表机构 * Department of Physics & Astronomy, University of Manchester(曼彻斯特大学物理与天文学系) Centre for Advanced Research Computing, University College London(伦敦大学学院先进计算中心)

AI总结 提出基于大语言模型的自动化工作流AgentRivet,从论文提取物理分析信息并生成缺失的Rivet例程,经代码和物理审查实现质量控制,在ATLAS和CMS测量中生成语法错误少、物理保真度合理的例程。

详情
AI中文摘要

粒子物理对撞机实验将Rivet例程作为模型无关测量分析保存策略的一部分。Rivet是一个C++工具包,允许将新的理论模型与测量结果进行比较,从而帮助开发和调整蒙特卡洛事件生成器,以及搜索标准模型之外的新物理。然而,已知分析覆盖不完整,只有39%的测量具有文档化且公开可用的Rivet例程。在本文中,我们设计并实现了一个基于大语言模型的自动化工作流,旨在提供缺失的例程。这个多步骤工作流称为AgentRivet,从已发表的论文中提取物理分析信息,并编写缺失的Rivet例程,中间代码和物理审查作为自主质量控制的一部分。我们报告了使用OpenAI、Anthropic和Google提供的商业大语言模型,针对ATLAS和CMS实验的两个近期测量所获得的结果。我们发现AgentRivet生成了语法错误很少的合格Rivet例程。例程的物理保真度合理,并遵循相关出版物中的解释。然而,物理实现问题确实出现,并使用AgentRivet产生的产物进行了调查。大多数物理实现问题源于给定出版物中微妙但模糊的定义,尽管有些模型即使在给出明确定义时也难以实现复杂的可观测量。

英文摘要

Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allows new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code and physics reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics-implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

2606.13562 2026-06-12 cs.CV cs.AI 交叉投稿

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

对比信息增强和域对抗训练用于成人到新生儿MR重建泛化

Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza

发表机构 * University of Calgary(卡尔加里大学) Seaman Family MR Research Centre, Foothills Medical Centre(Seaman家族磁共振研究中心,山麓医疗中心) Hotchkiss Brain Institute, University of Calgary(Hotchkiss脑研究所,卡尔加里大学) Pediatrics, Division of Neonatology, University of Calgary(卡尔加里大学儿科学系新生儿科) Alberta Children’s Hospital Research Institute, University of Calgary(阿尔伯塔儿童医院研究所,卡尔加里大学) Radiology and Clinical Neuroscience, University of Calgary(卡尔加里大学放射学与临床神经科学系) Electrical and Software Engineering, University of Calgary(卡尔加里大学电气与软件工程系)

AI总结 研究对比信息增强和域对抗训练提升E2E-VarNet从成人到新生儿MR重建的泛化能力,在加速因子R=4和R=8下,混合域对抗训练在SSIM和PSNR指标上表现最优。

Comments 24 pages, 1 table, 7 figures

详情
AI中文摘要

目的:研究对比信息数据增强和域对抗训练是否能改善E2E-VarNet从成人到新生儿的泛化能力。方法:研究了三种训练方案:(1) 仅使用未增强的成人数据进行成人单独训练,(2) 使用配对的未增强和新生儿信息增强的成人数据进行混合训练,(3) 使用域对抗目标进行混合训练。模型在回顾性欠采样的多线圈成人T2加权脑MR数据上训练,并在新生儿和成人测试数据上以加速因子R=4和R=8进行评估,使用定量指标和定性评估。特征分析评估了域对抗训练是否改变了未增强成人、增强成人和新生儿测试样本的潜在表示。结果:在新生儿数据上评估时,混合训练(Mixed)和混合域对抗训练(Mixed-DAT)优于仅未增强的成人单独训练(Unaug-Only)。在R=4时,Mixed-DAT取得最佳性能(SSIM = 0.924 +/- 0.027,PSNR = 33.98 +/- 1.15 dB)。在R=8时,Mixed-DAT在SSIM指标上表现最佳(0.848 +/- 0.031,对比Unaug-Only的0.766 +/- 0.037和Mixed的0.814 +/- 0.035),而Mixed在PSNR指标上表现最佳(29.56 +/- 0.83 dB,对比Unaug-Only的26.26 +/- 0.78 dB和Mixed-DAT的29.43 +/- 0.83 dB)。t-SNE图的定性评估表明,Mixed-DAT增加了未增强成人、增强成人和新生儿测试数据的潜在表示之间的重叠。结论:对比信息增强和域对抗训练改善了基于深度学习的MR重建从成人到新生儿的泛化能力。这些发现表明,对比信息数据增强结合对抗训练可能提高欠采样新生儿MR重建中对域偏移的鲁棒性。

英文摘要

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors R=4 and R=8 using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.
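To give the PSNR differences above some scale, the sketch below uses the standard definition PSNR = 10 log10(peak^2 / MSE) and converts a PSNR gain in dB into the corresponding ratio of mean squared errors. The reported values are means over test images with sizeable standard deviations, so the conversion is only indicative. The function names are hypothetical and the peak value of 1.0 is an assumption.

```python
import math


def psnr(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_from_psnr_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 10.0 ** (gain_db / 10.0)


# R=8 neonatal results quoted in the abstract: Mixed vs. Unaug-Only.
gain_db = 29.56 - 26.26
ratio = mse_ratio_from_psnr_gain(gain_db)

print(f"PSNR gain: {gain_db:.2f} dB, i.e. roughly {ratio:.2f}x lower MSE")
```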

2606.07489 2026-06-12 cs.AI econ.GN q-fin.EC 版本更新

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

AI代理如何重塑知识工作:自主性、效率与范围

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

发表机构 * Harvard Business School(哈佛商学院) Perplexity AI

AI总结 基于Perplexity产品数据,研究发现AI代理通过端到端任务执行,将自主工作时间从33秒提升至26分钟,完成时间缩短87%,成本降低94%,并扩展了工作范围与认知层次。

详情
AI中文摘要

前沿AI系统正从对话式助手转向端到端执行任务的自主代理,弥合智能与实用性之间的差距。利用Perplexity的Search和Computer产品的生产数据,我们通过研究AI代理如何加速和重塑知识工作来考察这一转变。三个关键实证发现出现。首先,使用具有几乎相同初始查询对的会话作为同一底层任务的自然实验,Computer每个用户会话执行26分钟的自主工作,而Search为33秒。Computer自动化了Search用户可能手动编排和实现的任务分解与执行。因此,Computer将后续查询分布转向更高层次的工作,如验证和扩展。自主性也提高了执行质量,Computer上每次查询的不满意率比Search低55%。其次,由于其自主性优势,Computer在匹配任务上将完成时间从269分钟减少到36分钟,与仅配备Search的人类相比,估计时间和成本分别降低87%和94%。第三,Computer改变了用户尝试的工作范围:Computer查询更常跨越职业边界,需要更高层次的认知,利用更广泛的专业知识,采取将相互依赖的子任务捆绑到单个查询中的复合任务形式,并解锁了同一用户在Search使用中基本不存在的工作活动。综合来看,证据表明AI代理加速工作流程、提高输出质量、降低成本,并扩展自动化工作的广度和深度。

英文摘要

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.
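The headline time figures can be cross-checked with simple arithmetic. The sketch below recomputes the completion-time reduction from the 269-minute and 36-minute values and the ratio of autonomous work per session; the 94% cost figure depends on cost assumptions the abstract does not give, so it is not reproduced. The function name is hypothetical.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Completion time on matched tasks (minutes), as quoted in the abstract.
time_reduction = percent_reduction(269.0, 36.0)

# Autonomous work per session: 26 minutes (Computer) vs. 33 seconds (Search).
autonomy_ratio = (26 * 60) / 33

print(f"completion time reduction: {time_reduction:.1f}%")
print(f"autonomous work ratio:     {autonomy_ratio:.0f}x")
```

The recomputed reduction rounds to the 87% stated in the abstract, and the per-session autonomy gap works out to roughly 47 times.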

2606.12040 2026-06-12 cs.AI cs.GR 版本更新

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

一种用于自动混凝土护栏设计的轻量级多智能体框架

Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao

AI总结 提出基于AutoGen的“生成-评估-优化”闭环多智能体框架,实现混凝土护栏自动设计,准确率超98%,且8B参数轻量模型可优于631B旗舰模型。

详情
AI中文摘要

钢筋混凝土公路护栏的设计是一个安全关键过程,需要严格遵守AASHTO-LRFD桥梁设计指南等监管规定。当前的工程实践严重依赖手动、迭代和启发式计算来满足复杂的非线性材料和力学约束。尽管大型语言模型(LLMs)表现出强大的生成能力,但它们在结构工程中的直接应用仍受到幻觉风险和物理基础不足的限制。为了解决这些挑战,本研究提出了一种新颖的“生成-评估-优化”闭环框架,利用AutoGen的多智能体编排能力实现混凝土护栏的自动设计。实验结果表明,所提出的智能体框架实现了超过98%的设计准确率,显著优于独立的通用LLMs。更重要的是,研究揭示了设计性能不一定与模型规模相关,8B参数的轻量级模型可以胜过无约束的631B参数旗舰模型。这一发现凸显了在降低计算成本的同时提高AI辅助工程工具在工业应用中的可及性的潜力。所提出的多智能体设计框架的源代码可在项目GitHub仓库中获取:https://github.com/MXY820/barrier-design。关键词:结构工程;多智能体系统;大型语言模型;混凝土护栏设计;AutoGen;设计自动化。

英文摘要

The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier-design. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

2301.12538 2026-06-12 cs.LG cs.AI math.DS 版本更新

On Approximating the Dynamic Response of Synchronous Generators via Operator Learning: A Step Towards Building Deep Operator-based Power Grid Simulators

关于通过算子学习逼近同步发电机动态响应:迈向构建基于深度算子的电网模拟器的一步

Christian Moya, Amirhossein Mollaali, Guang Lin, Meng Yue

发表机构 * Purdue University(普渡大学)

AI总结 提出基于算子学习的框架,利用DeepONet逼近同步发电机的动态响应,并设计递归模拟方案及残差DeepONet方案,结合数据聚合策略实现与电网交互的模拟。

详情
AI中文摘要

本文开发了一个算子学习框架,用于逼近同步发电机的动态响应。该框架可用于(i)构建一个基于神经网络的发电机模型,与电网模拟器交互,或(ii)跟踪真实发电机的暂态响应。首先,我们开发了一个数据驱动的深度算子网络(DeepONet)来逼近发电机的无限维解算子。然后,我们设计了一个基于DeepONet的数值方案,在给定的时间范围内模拟发电机的响应。所提出的方案递归地使用训练好的DeepONet来模拟给定多维输入下的响应,该输入描述了发电机与电网之间的相互作用。此外,我们设计了一个残差DeepONet数值方案,可以整合现有数学模型的信息。我们为这个残差DeepONet方案提供了预测累积误差的估计。最后,我们构建了一个数据聚合(DAgger)策略,允许使用DeepONet在与其他电网组件交互模拟中可能遇到的聚合训练数据对DeepONet进行微调。作为概念验证,我们证明了所提出的框架能够有效逼近同步发电机的暂态模型。

英文摘要

This paper develops an Operator Learning framework for approximating the dynamic response of synchronous generators. The framework can be used to (i) build a neural network-based generator model that interacts with a power grid simulator or (ii) shadow the true generator's transient response. First, we develop a data-driven Deep Operator Network (DeepONet) to approximate the infinite-dimensional solution operator of the generators. Then, we design a numerical scheme based on DeepONet that simulates the generator's response over a given time horizon. The proposed scheme recursively employs the trained DeepONet to simulate the response for a given multi-dimensional input that describes the interaction between the generator and the power grid. In addition, we design a residual DeepONet numerical scheme that can incorporate information from existing mathematical models. We accompany this residual DeepONet scheme with an estimate for the prediction's cumulative error. Finally, we build a data aggregation (DAgger) strategy that allows fine-tuning of DeepONets using aggregated training data that the DeepONets will likely encounter during interactive simulations with other grid components. As a proof of concept, we demonstrate that the proposed frameworks can effectively approximate the transient model of a synchronous generator.
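The recursive scheme described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the "trained DeepONet" is replaced by a stand-in `local_operator` (the exact one-step solution of the toy ODE x' = -x + u), and all function names are hypothetical. The residual variant adds a learned correction to a step of a known model, here a forward-Euler step.

```python
import math
from typing import Callable, List, Sequence

StepFn = Callable[[float, float, float], float]


def local_operator(x: float, u: float, h: float) -> float:
    """Stand-in for a trained operator network (assumption, not a DeepONet).

    Returns the state after a step of length h for the toy ODE x' = -x + u,
    with the input u held constant over the step.
    """
    decay = math.exp(-h)
    return x * decay + u * (1.0 - decay)


def recursive_rollout(op: StepFn, x0: float, inputs: Sequence[float], h: float) -> List[float]:
    """Apply the local operator step by step, feeding each prediction back in."""
    trajectory = [x0]
    for u in inputs:
        trajectory.append(op(trajectory[-1], u, h))
    return trajectory


def euler_step(x: float, u: float, h: float) -> float:
    """One forward-Euler step of the known (approximate) model."""
    return x + h * (-x + u)


def exact_residual(x: float, u: float, h: float) -> float:
    """Stand-in for a learned residual: what the Euler step misses."""
    return local_operator(x, u, h) - euler_step(x, u, h)


def residual_rollout(residual_op: StepFn, model_step: StepFn, x0: float,
                     inputs: Sequence[float], h: float) -> List[float]:
    """Residual variant: known-model step plus a correction term."""
    trajectory = [x0]
    for u in inputs:
        x = trajectory[-1]
        trajectory.append(model_step(x, u, h) + residual_op(x, u, h))
    return trajectory
```

In an interactive simulation the `inputs` sequence would come from the grid simulator at each step rather than being known in advance; because each step consumes the previous prediction, errors can accumulate, which is why the abstract pairs the residual scheme with a cumulative error estimate.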

2505.04021 2026-06-12 cs.DC cs.AI cs.LG cs.PF 版本更新

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Prism: 通过GPU内存气球实现经济高效的多LLM服务

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng

发表机构 * UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校) Harvard University(哈佛大学) CMU(卡内基梅隆大学) University of Edinburgh(爱丁堡大学) Intel(英特尔) Stanford University(斯坦福大学) LMSYS ByteDance(字节跳动) Alibaba Cloud(阿里云) Tsinghua University(清华大学) Novita AI Rice University(莱斯大学)

AI总结 针对多LLM服务中资源效率低下的问题,提出基于内存气球的内存中心化LLM协同服务框架Prism,统一空间与时间共享,已在10K+ GPU生产环境部署。

Comments OSDI'26

详情
AI中文摘要

推理提供商必须为许多LLM保持可用性,包括低流量但关键的模型,随着token价格下降,资源效率变得越来越重要。对生产轨迹的分析揭示了一种动态突发组模式,其中一组模型同时活跃并随时间变化;现有的空间和时间共享方法缺乏适应这种变化的原理性机制,迫使在SLO遵守和效率之间进行权衡。我们观察到弹性内存分配可以统一空间和时间共享。基于这一洞察,我们开发了Prism,一个以内存为中心的LLM协同服务框架,它应用内存气球来跨模型回收内存,并在单一方案下支持两种形式的共享。Prism的气球驱动程序,称为kvcached,已在https://github.com/... 开源,并在超过10K GPU的生产环境中部署。

英文摘要

Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynamic bursty-group pattern in which sets of models become active together and shift over time; existing space- and time-sharing approaches lack principled mechanisms to adapt to this variability, forcing trade-offs between SLO adherence and efficiency. We observe that elastic memory allocation can unify spatial and temporal sharing. Based on this insight, we have developed Prism, a memory-centric LLM co-serving framework that applies memory ballooning to reclaim memory across models and support both forms of sharing under a single scheme. Prism's balloon driver, referred to as kvcached, has been open-sourced at https://github.com/ovg-project/kvcached, and deployed in production environments across 10K+ GPUs.

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS 版本更新

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

GetNetUPAM:生态信息嵌套交叉验证与噪声鲁棒注意力用于海洋生物声学监测

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出GetNetUPAM框架,通过分层嵌套交叉验证保持生态异质性,并集成CBAM空间注意力的ARPA-N网络,在高噪声低信噪比条件下实现鲁棒泛化,在零训练区域将误报率降低约10倍。

Comments Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined

详情
AI中文摘要

部署可靠的生物声学监测系统需要能够在高噪声、低信噪比条件下泛化的模型,以及能够暴露部署相关故障模式的评估协议,这些在当前UPAM实践中基本未得到解决。内在噪声、可变传播以及混合的生物和人为源会导致分布偏移,而传统模型和单次划分评估会掩盖这些偏移,夸大性能并掩盖不稳定性。我们提出GetNetUPAM,一种分层嵌套交叉验证框架,它利用嵌套阶段来量化模型稳定性,而不是调整以获取夸大的保留分数。通过将数据划分为站点-年份块,GetNetUPAM保留了生态异质性,并迫使每个外层折代表不同的环境条件,防止过拟合局部噪声或传感器伪影。内层分层折衡量整个UPAM信号分布上的泛化能力,强制模型开发与外层保留部署条件严格分离。使用GetNetUPAM,我们评估了自适应分辨率池化和注意力网络(ARPA-N),一种用于不规则频谱图维度的CNN架构。ARPA-N将CBAM空间注意力集成为学习型噪声抑制器,生成注意力图以定位真实叫声结构,并避免标准CNN在长窗口数据上利用的全局非生物线索。在GetNetUPAM下,ARPA-N在不同环境条件下鲁棒泛化。在零训练的Balleny Islands区域,它在固定90%召回率下将每小时误报率降低超过一个数量级(约10倍),并在各折上持续改进指标。这些进展提供了可重复的基准,推动UPAM向可扩展、部署可靠的生态监测发展。

英文摘要

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
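The evaluation protocol above (outer folds defined by site-year blocks, inner stratified folds on the remaining data) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the GetNetUPAM code: the function names, the leave-one-block-out choice for the outer loop, and the round-robin stratification are all hypothetical simplifications.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Block = Tuple[str, int]  # (site, year)


def outer_site_year_folds(
    blocks: Sequence[Block],
) -> Iterator[Tuple[Block, List[int], List[int]]]:
    """Yield (held_out_block, dev_indices, test_indices), one fold per block.

    Every sample from the held-out site-year goes to the test side, so no
    recording condition is shared between development and test data.
    """
    by_block: Dict[Block, List[int]] = defaultdict(list)
    for idx, block in enumerate(blocks):
        by_block[block].append(idx)
    for held_out in sorted(by_block):
        test_idx = by_block[held_out]
        dev_idx = [i for b, idxs in by_block.items() if b != held_out for i in idxs]
        yield held_out, sorted(dev_idx), test_idx


def inner_stratified_folds(
    indices: Sequence[int], labels: Sequence[int], k: int
) -> List[List[int]]:
    """Split development indices into k folds, dealing each class round-robin."""
    folds: List[List[int]] = [[] for _ in range(k)]
    by_class: Dict[int, List[int]] = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    for cls in sorted(by_class):
        for pos, i in enumerate(by_class[cls]):
            folds[pos % k].append(i)
    return folds
```

Model selection would use only the inner folds; the outer test block is touched once per fold, and the spread of outer-fold scores is what indicates stability across environmental regimes.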

2511.13271 2026-06-12 cs.SE cs.AI cs.IR 版本更新

Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

生成式AI模型在学生软件编程学习活动中的使用研究

Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过对比生成式AI与传统在线资源对编程学习的影响,发现AI能提升任务表现但未必带来知识增益,初学者过度依赖而中级生选择性使用,呼吁将AI作为学习工具而非解题工具。

Comments 9 pages, 4 figures, published at AIWARE 2025

详情
AI中文摘要

生成式AI(GenAI)工具如ChatGPT的兴起为计算教育带来了新的机遇和挑战。现有研究主要关注GenAI完成教育任务的能力及其对学生表现的影响,往往忽视了其对知识获取的作用。在本研究中,我们调查了GenAI辅助与传统在线资源在不同熟练水平下对知识获取的支持效果。我们进行了一项受控用户实验,涉及24名具有两种不同编程经验水平(初学者、中级)的本科生,以考察学生在解决编程任务时如何与ChatGPT互动。我们分析了任务表现、概念理解和交互行为。我们的发现表明,使用GenAI生成完整解决方案显著提高了任务表现,尤其是对初学者而言,但并未持续带来知识增益。重要的是,使用策略因经验而异:初学者倾向于过度依赖GenAI以完成任务,过程中往往没有知识增益,而中级生则采用更具选择性的方法。我们发现,过度依赖和极少使用都会导致整体知识增益较弱。基于我们的结果,我们呼吁学生和教育工作者将GenAI作为学习工具而非解题工具。我们的研究强调了在将GenAI整合到编程教育中时,迫切需要指导以促进更深层次的理解。

英文摘要

The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.

2512.14937 2026-06-12 cs.CV cs.AI 版本更新

Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques

仅使用后处理技术改进预训练的成人胶质瘤分割模型

Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation(Sheikh Zayed儿童手术创新研究所) Children’s National Hospital(儿童医院) University of Madrid(马德里大学) CIBER-BBN ISCIII School of Medicine and Health Sciences(医学与健康科学学院) George Washington University(乔治·华盛顿大学)

AI总结 针对预训练模型在胶质瘤分割中的系统误差,提出自适应后处理技术,在BraTS 2025挑战中使排名指标提升14.9%(撒哈拉以南非洲)和0.9%(成人胶质瘤),推动向高效、公平、可持续的后处理策略转变。

详情
AI中文摘要

胶质瘤是成人中最常见的恶性脑肿瘤,也是最致命的肿瘤之一。尽管积极治疗,中位生存率仍低于15个月。准确的多参数MRI(mpMRI)肿瘤分割对于手术规划、放疗和疾病监测至关重要。虽然深度学习模型提高了自动分割的准确性,但大规模预训练模型泛化能力差且常表现不佳,产生系统性错误,如假阳性、标签交换和切片不连续。这些问题因GPU资源获取不平等和大规模模型训练日益增长的环境成本而进一步加剧。在这项工作中,我们提出自适应后处理技术,以改进为各种肿瘤类型开发的大规模预训练模型产生的胶质瘤分割质量。我们在多个BraTS 2025分割挑战任务中展示了这些技术,使撒哈拉以南非洲挑战的排名指标提升了14.9%,成人胶质瘤挑战提升了0.9%。该方法推动脑肿瘤分割研究从日益复杂的模型架构转向精确、计算公平且可持续的高效临床后处理策略。

英文摘要

Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median survival rate is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

2512.24787 2026-06-12 cs.IR cs.AI 版本更新

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

HiGR:腾讯工业级层次化生成式推荐框架

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

发表机构 * Platform and Content Group, Tencent(腾讯平台与内容组) Sun Yat-sen University(中山大学)

AI总结 提出HiGR框架,通过结构化语义ID和层次化解码器解决生成式推荐在工业规模下的规划效率与列表质量对齐问题,离线质量提升超10%,推理加速5倍。

详情
AI中文摘要

Slate推荐(在单个展示中向用户呈现排序项目列表)在主流在线平台中无处不在。虽然最近的生成式推荐方法在利用语义ID建模项目序列方面显示出强大潜力,但直接将其应用于工业规模的slate推荐面临根本性脱节:纠缠的SID空间混淆了高级列表规划,长序列上的细粒度自回归解码限制了语义规划效率,而令牌级目标与整体slate质量不一致。在本文中,我们提出HiGR,一个工业规模的层次化生成式slate推荐框架,通过协同设计的流水线弥合这一脱节。首先,HiGR通过前缀对比残差量化VAE(PCRQ-VAE)学习结构化SID。通过强制高级前缀捕获共享语义,PCRQ-VAE创建了一个可控的离散空间,作为高效规划的前提。利用这一结构化空间,我们的层次化Slate解码器(HSD)将自回归建模从纠缠的令牌级解码转变为粗粒度偏好嵌入。该设计显著降低了推理延迟,同时允许显式的全局slate结构规划。最后,这一稳定的规划空间使得基于ORPO的列表级对齐机制能够优化三重目标隐式反馈——排序保真度、真实用户兴趣和多样性。广泛的离线实验表明,HiGR在离线推荐质量上优于最先进的基线超过10%,同时实现了5倍的推理加速。腾讯平台上的在线A/B测试进一步将观看时间提高了1.22%,视频播放量提高了1.73%。HiGR已在多个腾讯平台表面部署,服务数亿用户,证明了其工业规模的适用性。

英文摘要

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs (SIDs), directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a 5x inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.

2601.00921 2026-06-12 cs.LG cs.AI quant-ph 版本更新

Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in chronic obstructive pulmonary disease

用于预测慢性阻塞性肺疾病骨骼肌结果的几何与量子核方法

Azadeh Alavi, Hamidreza Khalili, Stanley H. Chan, Fatemeh Kouchmeshki, Muhammad Usman, Ross Vlahos

发表机构 * School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) School of Health & Biomedical Sciences, STEM College, RMIT University(皇家墨尔本理工大学STEM学院健康与生物医学科学学院) Pattern Recognition Pty Ltd, Melbourne(模式识别有限公司,墨尔本) Data61, CSIRO(Data61,澳大利亚联邦科学与工业研究组织)

AI总结 提出一种核几何量子混合方法,通过再生核希尔伯特空间映射合成SPD参考、随机投影压缩和低维量子回归电路,在COPD动物队列中预测肌肉重量、质量和力量,肌肉重量RMSE比最佳经典方法低约1.8%。

Comments 24 pages, 2 figures

详情
AI中文摘要

慢性阻塞性肺疾病(COPD)影响全球数亿人,骨骼肌功能障碍具有临床重要性。量子机器学习在生物医学预测中日益受到探索,但在小型生物标志物队列中的价值需要与强经典基线进行基准测试。我们分析了一个由213只动物组成的香烟烟雾COPD队列,利用血液和支气管肺泡灌洗生物标志物预测胫骨前肌重量、肌肉质量和力量。我们开发了一种核几何量子混合方法,其中合成对称正定(SPD)参考通过再生核希尔伯特空间映射,使用仅训练随机投影压缩,归一化,并输入低维量子回归电路。我们将该方法与经典岭/核模型、SPD关系表示和量子核回归(QKR)进行了基准测试。所有方法均使用条件分层重复交叉验证进行评估。最大的数值改进出现在肌肉重量上,所提出方法的平均均方根误差(RMSE)数值最低,比最佳经典比较器低约1.8%;配对折叠水平测试在Holm调整后未建立统计显著性优势,但该终点具有生物学意义。该方法在肌肉质量上也具有数值最低的平均RMSE。对于力量,仅使用生物标志物的岭回归表现最佳,表明更线性的终点结构。

英文摘要

Chronic obstructive pulmonary disease (COPD) affects hundreds of millions of people worldwide, and skeletal-muscle dysfunction is clinically important. Quantum machine learning is increasingly explored for biomedical prediction, but its value in small biomarker cohorts requires benchmarking against strong classical baselines. We analysed a cigarette-smoke COPD cohort of 213 animals with blood and bronchoalveolar-lavage biomarkers to predict tibialis anterior muscle weight, muscle quality, and force. We developed a kernel-geometric quantum hybrid method in which synthetic symmetric positive definite (SPD) references are mapped through a reproducing kernel Hilbert space, compressed using train-only random projection, normalised, and supplied to low-dimensional quantum regression circuits. We benchmarked this approach against classical ridge/kernel models, SPD relational representations, and quantum-kernel regression (QKR). All methods were evaluated using condition-stratified repeated cross-validation. The largest numerical improvement was observed for muscle weight, where the proposed method had the numerically lowest mean root mean squared error (RMSE), approximately 1.8% below the best classical comparator; paired fold-level testing did not establish statistically significant superiority after Holm adjustment, but the endpoint is biologically meaningful. The method also had the numerically lowest mean RMSE for muscle quality. For force, biomarker-only Ridge performed best, suggesting a more linear endpoint structure.

2601.06227 2026-06-12 cs.LG cs.AI 版本更新

When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics

当更小胜出:面向边缘电池健康预测的液态神经网络双阶段蒸馏与帕累托引导压缩

Dhivya Dharshini Kannan, Wei Li, Wei Zhang, Jianbiao Wang, Zhi Wei Seh, Man-Fai Ng

发表机构 * Singapore Institute of Technology(新加坡科技学院) Institute of Materials Research and Engineering(材料研究与工程研究所) Agency for Science, Technology and Research(科技研究局) Institute of High Performance Computing(高性能计算研究所)

AI总结 提出DLNet框架,通过欧拉离散化、双阶段知识蒸馏和帕累托引导压缩,将高容量液态神经网络压缩为边缘可部署模型,在电池健康预测中实现小模型超越大模型。

Comments Accepted at International Conference on Pattern Recognition, ICPR 2026. Code available at: https://github.com/Dhivya-DD17/DLNet

详情
AI中文摘要

电池管理系统日益需要在严格的设备端约束下进行准确的电池健康预测。本文提出DLNet,一个实用的双阶段液态神经网络蒸馏框架,将高容量模型转化为紧凑且可边缘部署的电池健康预测模型。DLNet首先应用欧拉离散化重新表述液态动力学以实现嵌入式兼容性。然后进行双阶段知识蒸馏,以传递教师模型的时间行为,并在进一步压缩后恢复该行为。在联合误差-成本目标下的帕累托引导选择保留了平衡准确性和效率的学生模型。我们在广泛使用的数据集上评估DLNet,并在Arduino Nano 33 BLE Sense上使用int8部署验证实际设备可行性。最终部署的学生模型在预测未来100个周期的电池健康时实现了0.0066的低误差,比教师模型低15.4%。模型大小从616 kB减少到94 kB,减少了84.7%,在设备上每次推理耗时21毫秒。这些结果支持了一个实用的“更小胜出”观察:通过适当的监督和选择,小模型可以在边缘预测中匹配或超越大模型。除了电池,DLNet框架可以扩展到其他具有严格硬件约束的工业分析任务。

英文摘要

Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents DLNet, a practical framework with dual-stage distillation of liquid neural networks that turns a high-capacity model into compact and edge-deployable models for battery health prediction. DLNet first applies Euler discretization to reformulate liquid dynamics for embedded compatibility. It then performs dual-stage knowledge distillation to transfer the teacher model's temporal behavior and recover it after further compression. Pareto-guided selection under joint error-cost objectives retains student models that balance accuracy and efficiency. We evaluate DLNet on a widely used dataset and validate real-device feasibility on an Arduino Nano 33 BLE Sense using int8 deployment. The final deployed student achieves a low error of 0.0066 when predicting battery health over the next 100 cycles, which is 15.4% lower than the teacher model. It reduces the model size from 616 kB to 94 kB, an 84.7% reduction, and takes 21 ms per inference on the device. These results support a practical "smaller wins" observation: with proper supervision and selection, a small model can match or exceed a large teacher for edge-based prognostics. Beyond batteries, the DLNet framework can extend to other industrial analytics tasks with strict hardware constraints.
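To make the selection step above concrete, here is a minimal sketch of Pareto filtering over two minimized objectives (prediction error and model size), plus the size-reduction arithmetic quoted in the abstract. The candidate values in the demo are invented for illustration and are not results from the paper; `Candidate`, `pareto_front`, and `size_reduction_percent` are hypothetical names.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    error: float    # prediction error, lower is better
    size_kb: float  # deployed model size, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.error <= b.error and a.size_kb <= b.size_kb
    better = a.error < b.error or a.size_kb < b.size_kb
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def size_reduction_percent(before_kb: float, after_kb: float) -> float:
    """Relative size reduction in percent."""
    return 100.0 * (before_kb - after_kb) / before_kb


# Illustrative candidates only (made-up numbers).
demo = [
    Candidate("a", 0.010, 50.0),
    Candidate("b", 0.008, 100.0),
    Candidate("c", 0.009, 120.0),  # dominated by "b"
    Candidate("d", 0.007, 600.0),
]
print([c.name for c in pareto_front(demo)])
print(round(size_reduction_percent(616, 94), 1))  # sizes quoted in the abstract
```

The second print reproduces the abstract's figure: (616 - 94) / 616 is about 0.847, that is, an 84.7% reduction.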

2603.02274 2026-06-12 q-bio.QM cs.AI 版本更新

Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response

上下文可逆世界模型:用于结直肠癌药物反应的神经符号智能框架

Christopher Baker, Tianyu Ren, Karen Rafferty, Hui Wang

AI总结 提出上下文可逆世界模型(CIWM),结合机器学习模拟器与大语言模型推理层,通过逆推理进行CRISPR扰动,揭示KRAS突变在5-氟尿嘧啶耐药中的主导作用及PIK3CA修复的意外效应。

详情
AI中文摘要

精准肿瘤学目前受到小N大P悖论的限制,即高维基因组数据丰富但药理学反应样本稀疏。虽然深度学习实现了预测准确性,但它常常无法提供临床采用所需的机制清晰度。我们提出了上下文可逆世界模型(CIWM),这是一个神经符号智能框架,通过将定量机器学习模拟器与大语言模型推理层集成来弥合这一差距。利用在Sanger GDSC数据集(\( N=83 \))上严格筛选的高保真数据工程流程,我们从体外伪影中分离出真正的生物信号,为复杂转录组学建立了严格的基线预测相关性(\( r=0.268 \))。通过逆推理,我们在结直肠癌景观中进行了计算机CRISPR扰动。该框架自主推翻了经典机制假设,识别出突变KRAS在驱动5-氟尿嘧啶耐药(\( \Delta=-0.0469 \))中相对于APC/Wnt轴具有层级优势,并通过映射到MAPK/PI3K网络的“KRAS盾牌”实现。此外,智能层识别出“PIK3CA悖论”,揭示修复PIK3CA通过触发补偿性反馈环过度激活主导的MAPK生存通路,无意中增加了化疗耐药性(\( \Delta=+0.0085 \))。

英文摘要

Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant but pharmacological response samples are sparse. While deep learning achieves predictive accuracy, it frequently fails to provide the mechanistic clarity required for clinical adoption. We present the Contextual Invertible World Model (CIWM), a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning emulator with a Large Language Model reasoning layer. Utilising a stringently curated, high-fidelity data engineering pipeline on the Sanger GDSC dataset (\( N=83 \)), we isolate true biological signals from in vitro artifacts to establish a rigorous baseline predictive correlation for complex transcriptomics (\( r=0.268 \)). Through Inverse Reasoning, we perform in silico CRISPR perturbations across the colorectal landscape. The framework autonomously overturns classical mechanistic assumptions, identifying a hierarchical dominance of mutant KRAS over the APC/Wnt-axis in driving 5-fluorouracil resistance (\( \Delta=-0.0469 \)) via a "KRAS Shield" mapped to MAPK/PI3K networks. Furthermore, the agentic layer identified a "PIK3CA Paradox", revealing that repairing PIK3CA inadvertently increases chemoresistance (\( \Delta=+0.0085 \)) by triggering a compensatory feedback loop that hyperactivates the dominant MAPK survival pathway.

2603.08505 2026-06-12 cs.LG cs.AI 版本更新

Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Echo2ECG:利用多视角超声心动图的心脏形态增强心电图表示

Michelle Espranita Liman, Özgün Turgut, Alexander Müller, Eimo Martens, Daniel Rueckert, Philip Müller

发表机构 * Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital(慕尼黑工业大学(TUM)及其大学医院医疗与医学人工智能教席) Department of Cardiology, TUM University Hospital(慕尼黑工业大学大学医院心脏病学系) Department of Computing, Imperial College London(伦敦帝国理工学院计算系) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 提出Echo2ECG多模态自监督学习框架,通过多视角超声心动图丰富心电图表示,在结构表型分类和超声检索任务上优于现有方法,模型大小仅为最大基线的1/18。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

心电图(ECG)是一种低成本、广泛使用的模态,通过捕捉心脏电活动来诊断电异常(如房颤)。然而,它无法直接测量心脏形态表型,如左心室射血分数(LVEF),这通常需要超声心动图(Echo)。从ECG预测这些表型将实现早期、可及的健康筛查。现有的自监督方法通过将ECG与单视角Echo对齐而遭受表示不匹配,单视角Echo仅捕捉局部、空间受限的解剖快照。为解决此问题,我们提出Echo2ECG,一种多模态自监督学习框架,利用多视角Echo中捕捉的心脏形态结构丰富ECG表示。我们在两个根本上需要形态信息的临床相关任务上评估Echo2ECG作为ECG特征提取器:(1)跨三个数据集的结构性心脏表型分类,以及(2)使用ECG查询检索具有相似形态特征的Echo研究。我们提取的ECG表示在两个任务上始终优于最先进的单模态和多模态基线,尽管模型大小仅为最大基线的1/18。这些结果表明Echo2ECG是一个鲁棒、强大的ECG特征提取器。我们的代码可从 https://github.com/michelleespranita/Echo2ECG 获取。

英文摘要

Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.

2603.24603 2026-06-12 q-bio.NC cs.AI 版本更新

Fusion Learning from Dynamic Functional Connectivity: Combining the Amplitude and Phase of fMRI Signals to Identify Brain Disorders

融合动态功能连接:结合fMRI信号的幅度和相位识别脑疾病

Jinlong Hu, Jiatong Huang, Zijian Cai

AI总结 提出多尺度融合学习框架MSFL,结合滑动窗口相关和相位同步两种互补的动态功能连接特征,在自闭症和抑郁症数据集上显著优于现有模型。

详情
AI中文摘要

基于静息态功能磁共振成像(fMRI)的动态功能连接(dFC)已广泛应用于脑科学研究。滑动窗口相关(SWC)方法通过计算脑区对信号幅度时间序列之间的相关系数,是构建dFC的常用方法。在本研究中,我们提出了一种集成方法,结合fMRI信号的幅度和相位信息,以提高脑疾病的检测能力。具体而言,我们引入了一个多尺度融合学习框架MSFL,该框架利用来自SWC和相位同步(PS)的两种互补dFC特征。其中,SWC捕获幅度相关性,而PS测量dFC内的相位相干性。我们使用两个公开数据集(ABIDE I和REST-meta-MDD)评估了MSFL在分类自闭症谱系障碍和重度抑郁症方面的有效性。结果表明,MSFL显著优于现有比较模型。此外,我们使用SHAP框架进行了模型解释分析,表明来自SWC和PS的两种dFC特征均有助于检测脑疾病。

英文摘要

Dynamic functional connectivity (dFC) derived from resting-state functional magnetic resonance imaging (fMRI) has been extensively utilized in brain science research. The sliding window correlation (SWC) method is a widely used approach for constructing dFC by computing correlation coefficients between amplitude time series of signals from pairs of brain regions. In this study, we propose an integrated approach that incorporates both amplitude and phase information of fMRI signals to improve the detection of brain disorders. Specifically, we introduce a multi-scale fusion learning framework, namely MSFL, which leverages two complementary dFC features derived from SWC and phase synchronization (PS). Here, SWC captures amplitude correlations, while PS measures phase coherence within dFC. We evaluated the efficacy of MSFL in classifying autism spectrum disorder and major depressive disorder using two publicly available datasets: ABIDE I and REST-meta-MDD, respectively. The results indicate that MSFL significantly outperforms existing comparative models. Moreover, we performed model explanation analysis using the SHAP framework, which showed that both types of dFC features from SWC and PS contribute to detecting brain disorders.
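The two dFC feature types named above can be illustrated with a small, dependency-free sketch. It computes sliding window correlation on the raw amplitudes and, as one common way to quantify phase synchronization, a phase locking value from Hilbert-style instantaneous phases. The abstract does not say which PS measure MSFL uses, so the phase locking value here is an assumption, and the naive O(n^2) DFT is only suitable for short series.

```python
import cmath
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def sliding_window_correlation(x: Sequence[float], y: Sequence[float],
                               window: int, step: int) -> List[float]:
    """Amplitude-based dFC: one correlation value per window position."""
    return [pearson(x[s:s + window], y[s:s + window])
            for s in range(0, len(x) - window + 1, step)]


def analytic_signal(x: Sequence[float]) -> List[complex]:
    """Analytic signal via a naive DFT: zero negative frequencies, double positive."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue              # keep DC and Nyquist bins unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2      # positive frequencies
        else:
            spectrum[k] = 0       # negative frequencies
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_locking_value(x: Sequence[float], y: Sequence[float]) -> float:
    """Phase-based coupling in [0, 1]: 1 means a constant phase difference."""
    zx, zy = analytic_signal(x), analytic_signal(y)
    diffs = [cmath.phase(a) - cmath.phase(b) for a, b in zip(zx, zy)]
    return abs(sum(cmath.exp(1j * d) for d in diffs)) / len(diffs)
```

Two signals with a fixed phase lag can have a low correlation in a given window yet a phase locking value near 1, which is the kind of complementary information a fusion model could exploit.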

2604.07590 2026-06-12 cs.IR cs.AI 版本更新

DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

DCD:面向领域的受控检索增强生成设计

Valerii Kovalskii, Nikita Belov, Nikita Miteyko, Igor Reshetnikov, Maksim Maksimov

发表机构 * red_mad_robot

AI总结 提出DCD(领域-集合-文档)层次化设计,通过结构化知识表示和多阶段路由控制检索与生成范围,无需修改语言模型,提升RAG在异构语料和多步查询中的鲁棒性和准确性。

Comments 14 pages, 4 figures, 2 links, link to HF https://huggingface.co/datasets/redmadrobot-rnd/dcd, link to GIT https://github.com/redmadrobot-rnd/dcd

详情
AI中文摘要

检索增强生成(RAG)被广泛用于将大型语言模型锚定在外部知识源中。然而,当应用于异构语料库和多步查询时,朴素RAG管道由于扁平的知识表示和缺乏显式工作流而常常质量下降。在这项工作中,我们引入了DCD(领域-集合-文档),一种面向领域的设计,用于结构化知识并控制RAG系统中的查询处理,而无需修改底层语言模型。所提出的方法依赖于信息空间的层次分解和基于结构化模型输出的多阶段路由,使得检索和生成范围能够逐步受限。该架构辅以智能分块、混合检索以及集成验证和生成护栏机制。我们描述了DCD架构和工作流程,并讨论了在合成评估数据集上的评估结果,突出了它们在应用RAG场景中对鲁棒性、事实准确性和答案相关性的影响。

英文摘要

Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on a synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.
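The Domain-Collection-Document idea above, where each routing stage narrows the retrieval scope, can be sketched as follows. This is a toy illustration only: the word-overlap `route` function stands in for the structured language-model output that the paper uses for routing, and the corpus layout and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# domain -> collection -> document id -> text
Corpus = Dict[str, Dict[str, Dict[str, str]]]


def overlap(query: str, text: str) -> int:
    """Number of distinct lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def route(query: str, options: Dict[str, str]) -> str:
    """Pick the option whose description best matches the query.

    Stand-in for a model call that returns a structured routing decision.
    """
    return max(sorted(options), key=lambda name: overlap(query, options[name]))


def dcd_retrieve(query: str, corpus: Corpus,
                 domain_desc: Dict[str, str],
                 collection_desc: Dict[str, str],
                 top_k: int = 1) -> Tuple[str, str, List[str]]:
    """Route to a domain, then a collection, then rank only that collection's docs."""
    domain = route(query, {d: domain_desc[d] for d in corpus})
    collection = route(query, {c: collection_desc[c] for c in corpus[domain]})
    docs = corpus[domain][collection]
    ranked = sorted(docs, key=lambda doc_id: (-overlap(query, docs[doc_id]), doc_id))
    return domain, collection, ranked[:top_k]
```

Because the final ranking only sees documents from the selected collection, unrelated material from other domains cannot reach the generation step, which is the scope restriction the abstract describes.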

2604.24806 2026-06-12 cs.IR cs.AI cs.DB 版本更新

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

版本化延迟物化:面向大规模推荐系统的超长序列训练

Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng, Yanzun Huang

发表机构 * Meta Platforms, Inc.(Meta平台)

AI总结 提出版本化延迟物化范式,通过归一化存储和即时序列重建消除数据冗余,支持超长用户交互历史训练,降低存储I/O开销并提升模型质量。

详情
AI中文摘要

现代深度学习推荐模型(DLRM)遵循序列长度的缩放定律,推动前沿走向超长用户交互历史(UIH)。然而,行业标准的“Fat Row”范式将序列预物化到每个训练样本中,造成存储和I/O瓶颈,数据基础设施使用超过GPU训练容量,数据冗余在多租户环境中被放大,其中不同序列长度需求的模型共享联合数据集。我们提出了一种版本化延迟物化范式,通过将UIH归一化存储在一个不可变层中,并在训练期间通过轻量级版本指针即时重建序列,从而消除冗余。系统通过一个分叉协议确保在线到离线(O2O)一致性,防止流式和批式训练中的未来信息泄漏,同时一个读优化的不可变存储层为异构模型租户提供多维投影下推。解耦的数据预处理与流水线I/O预取和数据亲和性优化掩盖了训练时序列重建的延迟,使训练吞吐量保持GPU计算受限。部署在生产DLRM上,系统减少了训练数据基础设施资源使用,同时实现了激进的序列长度缩放,带来显著的模型质量提升,作为现代推荐模型架构(包括HSTU和ULTRA-HSTU)的基础数据基础设施。

英文摘要

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall in which data infrastructure usage exceeds GPU training capacity. The underlying data redundancy is amplified in multi-tenant environments, where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
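The contrast between pre-materialized rows and versioned pointers can be shown with a small in-memory sketch. It is a simplification under stated assumptions, not Meta's system: the history is an append-only list per user, a "version" is simply the number of events visible when the example was logged, and `HistoryStore` and `TrainingExample` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingExample:
    """A training row that carries a pointer instead of a copied sequence."""
    user_id: str
    version: int  # number of history events visible at logging time
    label: int


class HistoryStore:
    """Append-only per-user event log; each event is stored exactly once."""

    def __init__(self) -> None:
        self._events: Dict[str, List[str]] = {}

    def append(self, user_id: str, event: str) -> int:
        """Record an event and return the user's new version."""
        log = self._events.setdefault(user_id, [])
        log.append(event)
        return len(log)

    def version(self, user_id: str) -> int:
        """Current version (event count) for a user."""
        return len(self._events.get(user_id, []))

    def materialize(self, user_id: str, version: int, max_len: int) -> List[str]:
        """Rebuild the sequence as of `version`, keeping the latest `max_len` events.

        Slicing at `version` hides later events (no future leakage), and
        `max_len` lets tenants with different sequence lengths share one log.
        """
        if max_len <= 0:
            return []
        visible = self._events.get(user_id, [])[:version]
        return visible[-max_len:]
```

A short-sequence model and a long-sequence model can read the same example with different `max_len` values, whereas a pre-materialized layout would need the longest sequence copied into every row.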

2606.06525 2026-06-12 cs.GR cs.AI 版本更新

Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems

用于三维框架系统自动化结构分析的主体化大型语言模型

Ziheng Geng, Ian Franklin, Santiago Martinez, Jiachen Liu, Yunhe Zhao, Minghui Cheng

发表机构 * Department of Civil and Architectural Engineering, University of Miami(迈阿密大学土木与建筑工程系) School of Architecture, University of Miami(迈阿密大学建筑学院) HBC Engineering Company(HBC工程公司) Department of Electrical and Computer Engineering, University of Miami(迈阿密大学电气与计算机工程系)

AI总结 提出一种主体化LLM框架,通过投影表示和智能体流水线实现从自然语言输入到3D框架的自动化结构分析,平均准确率达90%。

详情
AI中文摘要

大型语言模型(LLM)已成为跨领域具有强推理能力的强大基础模型。除了反应式文本生成,主体化LLM通过模块化任务分解和协调工具使用实现自主工作流执行。在结构工程中,最近的工作开发了用于平面框架自动化分析的主体化LLM。然而,由于不规则几何表示、拓扑一致性和长程推理的挑战,它们向3D框架的扩展仍未充分探索。本文提出了一种主体化LLM框架,用于从自然语言输入自动化分析3D框架。不规则3D框架通过投影到2D平面表示,其中正交网格线定义空间坐标,楼层数矩阵编码每个网格单元的垂直拉伸。基于此表示,框架建立了一个多智能体流水线:问题分析智能体将输入解析为结构化JSON;楼层分解智能体推导每层的空间布局;3D几何由节点、梁、板和柱智能体组装;支撑和荷载智能体分配边界和荷载条件,代码翻译智能体生成可执行的SAP2000脚本。在十个代表性3D框架上评估,所提框架在重复试验中平均准确率达到90%,表现出一致且可靠的性能。

英文摘要

Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of story counts encodes the vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions; and code translation agents generate executable SAP2000 scripts. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.
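The plan-plus-story-matrix representation above lends itself to a short sketch. Assuming a uniform story height and that `stories[i][j]` gives the number of stories over the plan cell between gridlines `x_grid[i]`, `x_grid[i+1]` and `y_grid[j]`, `y_grid[j+1]`, the nodes and columns of the frame follow by extrusion. The function name and the uniform-height assumption are illustrative, not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple

Node = Tuple[float, float, float]
Member = Tuple[Node, Node]


def extrude_frame(x_grid: Sequence[float], y_grid: Sequence[float],
                  stories: Sequence[Sequence[int]],
                  story_height: float) -> Tuple[Set[Node], Set[Member]]:
    """Build frame nodes and column members from a plan grid and story counts."""
    nodes: Set[Node] = set()
    columns: Set[Member] = set()
    for i, row in enumerate(stories):
        for j, n_stories in enumerate(row):
            if n_stories == 0:
                continue  # empty plan cell: nothing is built here
            corners = [(x_grid[a], y_grid[b]) for a in (i, i + 1) for b in (j, j + 1)]
            for x, y in corners:
                for level in range(n_stories + 1):
                    nodes.add((x, y, level * story_height))
                    if level > 0:
                        below = (x, y, (level - 1) * story_height)
                        columns.add((below, (x, y, level * story_height)))
    return nodes, columns
```

Using sets means that gridline intersections shared by neighboring cells produce a single node and a single column, which is one simple way to keep the topology consistent when cells have different heights.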

2606.10200 2026-06-12 cs.CV cs.AI cs.LG 版本更新

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

一种改进的生成对抗网络用于微电阻率成像测井恢复

Ahmed Faizul Haque, S. M. Riaz Rahman Antu, Saif Ahmed, Asadullah Hil Galib, Souvik Pramanik, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

AI总结 提出基于改进GAN的成像测井图像恢复方法,通过FCN生成网络、深度可分离卷积残差块、Inception模块及多尺度特征提取与空间注意力机制,结合全局与局部判别网络,有效恢复缺失区域,结构相似性达0.903。

Comments Mistakes in citations and references. Further we want to submit in conference with improved experiments and results

详情
AI中文摘要

本文提出了一种改进的基于GAN的成像测井图像恢复方法,用于解决微电阻率成像测井图像部分缺失的问题。该方法采用FCN作为生成网络基础设施,并添加深度可分离卷积残差块以学习和保留更有效的像素与语义信息;添加Inception模块以增加网络的多尺度感知场并减少参数数量;添加多尺度特征提取模块和空间注意力残差块,结合通道注意力机制与残差块实现多尺度特征提取。设计了全局判别网络和局部判别网络,通过相互对抗与生成网络逐步提高恢复部分与整体图像之间的内容和语义结构一致性。实验结果表明,测试集中五组不同大小缺失区域的成像测井图像的平均结构相似性度量为0.903,相比其他类似方法提高了约0.3。研究表明,该方法可用于微电阻率成像测井图像的恢复,在语义结构一致性和纹理细节方面有良好改善,从而为保障微电阻率成像测井图像后续解释的顺利进行提供了一种新的深度学习方法。

英文摘要

An improved GAN-based imaging logging image restoration method is presented in this paper for solving the problem of partially missing micro-resistivity imaging logging images. The method uses an FCN as the backbone of the generative network and adds a depthwise separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module is added to enlarge the multi-scale receptive field of the network and reduce the number of parameters in the network; and a multi-scale feature extraction module and a spatial attention residual block are added, which combine the channel attention mechanism and the residual block to achieve multi-scale feature extraction. The global discriminative network and the local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image by playing off each other and the generative network. According to the experimental results, the average structural similarity measure of the five sets of imaged logging images with different sizes of missing regions in the test set is 0.903, which is an improvement of about 0.3 compared with other similar methods. It is shown that the method in this study can be used for the restoration of micro-resistivity imaging log images with good improvement in semantic structural coherence and texture details, thus providing a new deep learning method to ensure the smooth advancement of the subsequent interpretation of micro-resistivity imaging log images.
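
For readers unfamiliar with the convolution type used in the generator's residual block, the sketch below compares the weight count of a standard convolution with that of a separable one (a per-channel spatial filter followed by a 1x1 pointwise convolution). It shows the generic technique only; the layer sizes are illustrative assumptions, biases are ignored, and nothing here is taken from the paper's actual architecture.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a per-channel spatial conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1x1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not from the paper).
standard = conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print(standard, separable, round(separable / standard, 3))
```

In general the ratio equals 1/c_out + 1/kernel², so for 3x3 kernels the separable form needs roughly one ninth of the weights once the channel count is large.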

2606.11238 2026-06-12 q-fin.GN cs.AI 版本更新

Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination

船舶金融中的人工智能:应用、机遇与AI增强贷款发起的案例研究

Lasse Dierich, Orestis Schinas

发表机构 * ShipFinance.ai HHX.blue GmbH Technical University of Munich(慕尼黑工业大学) University of the Aegean(爱琴海大学)

AI总结 本文探讨AI在船舶金融中的应用,提出基于大语言模型的模块化架构,用于文档理解、信息提取和工作流自动化,以支持贷款申请流程。

Comments 9 pages, 1 figure

详情
AI中文摘要

船舶金融是资产担保贷款中数据密集且文档繁重的领域,需要整合来自异构且大多非结构化来源的财务、技术、合同和监管信息。日益严格的环境法规和ESG报告要求进一步增加了承销和贷款发起流程的复杂性。人工智能(AI)的最新进展,特别是大语言模型(LLMs),为处理和分析此类信息创造了新的机遇。本文回顾了AI在船舶金融中的潜在应用,特别关注基于LLM的系统用于文档理解、信息提取和工作流自动化。我们提出了ShipFinance.ai,一个模块化智能体架构,用于支持船舶金融中的贷款申请工作流。所提出的系统结合了基于LLM的提取模块、财务分析组件、外部海事数据服务以及带有聊天机器人界面的受控文档生成模块,以支持标准化融资申请的准备工作。本文讨论了在生产中使用此类模型的关键挑战。我们认为,AI辅助系统可以支持海事金融专业人士管理日益复杂的信息和报告要求。

英文摘要

Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, contractual, and regulatory information from heterogeneous and largely unstructured sources. Increasing environmental regulation and ESG reporting requirements are adding further complexity to underwriting and loan-origination processes. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), create new opportunities for processing and analysing such information. This paper reviews potential applications of AI in ship finance, with a particular focus on LLM-based systems for document comprehension, information extraction, and workflow automation. We present ShipFinance.ai, a modular agentic architecture to support loan application workflows in ship finance. The proposed system combines an LLM-based extraction module, financial analysis components, external maritime data services, and a controlled document-generation module with a chatbot interface to support the preparation of standardized financing applications. The paper discusses the key challenges for using such models in production. We argue that AI-assisted systems can support maritime finance professionals in managing increasingly complex information and reporting requirements.

2606.11793 2026-06-12 cs.LG cs.AI physics.ao-ph 版本更新

Scalable Deep Learning Framework for Global High-Resolution Land Use Reconstruction

面向全球高分辨率土地利用重建的可扩展深度学习框架

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

AI总结 提出AI4Land框架,采用U-Net两阶段方法,结合粗分辨率情景数据与静态地理特征,重建高分辨率年度土地利用与覆盖,减少陆地碳循环不确定性,支持气候模拟。

详情
AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present a data-driven framework AI4Land, for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11930 2026-06-12 cs.HC cs.AI cs.CV 版本更新

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

用于AI辅助面试中个性与认知能力评估的冻结多模态嵌入

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

发表机构 * Technology Application and Human Resource Development, National Taiwan Normal University(国立台湾师范大学科技应用与人力资源发展学系) Computer Science and Information Engineering, National Central University(国立中央大学资讯工程学系) Institute of Photonic System, National Yang Ming Chiao Tung University(国立阳明交通大学光电系统研究所)

AI总结 针对异步视频面试中标注数据有限的高维多模态学习问题,提出使用冻结多模态编码器(CLIP、Whisper、RoBERTa等)结合低容量下游模型,在个性预测任务上实现MSE降低19.1%,并发现认知能力预测中存在数据集捷径。

Comments 9 pages, 1 figure, 5 tables

详情
AI中文摘要

从异步视频面试(AVI)中预测心理特质是一个具有挑战性的多模态学习问题,因为标注数据集有限,而每个回答包含高维的视觉、声学和语言信号。本文介绍了我们针对ACM多媒体AVI挑战2026的解决方案,该挑战评估两个任务:Track 1从与个性相关的面试回答中预测自我报告的HEXACO个性特质,Track 2从结构化AVI回答中对认知能力水平进行分类。我们将该问题视为小样本表示学习任务。我们不微调大型预训练模型,而是使用冻结的多模态编码器,包括用于视觉特征的CLIP、用于声学特征和转录的Whisper,以及用于文本表示的RoBERTa、E5和DeBERTaV3,随后使用低容量下游模型。对于Track 1,我们的特质特定回归和晚期融合系统实现了平均验证MSE为0.2696,优于官方基线0.3334。消融结果显示,从全局模型(0.3189)到逐特质建模(0.2871)再到逐特质晚期融合(0.2696)的三步改进,相对于官方基线MSE相对降低了19.1%。对于Track 2,一个紧凑的受试者属性基线达到了0.5781的准确率,而我们的多模态集成达到了0.5313,两者均高于官方基线0.4062。我们将这一结果解释为验证分割中可能存在受试者属性捷径的证据,而非从AVI内容中进行的稳健认知推理。总体而言,我们的发现表明,基于AVI的心理评估受益于特质特定的多模态建模,但认知能力预测需要仔细控制数据集捷径。

英文摘要

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
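
The 19.1% figure quoted above follows directly from the reported validation MSEs. The short check below recomputes the relative reduction for each ablation step; the helper name `relative_reduction` is ours, and the numbers are the ones stated in the abstract.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` improves on `baseline` (lower MSE is better)."""
    return (baseline - value) / baseline


OFFICIAL_BASELINE = 0.3334  # official Track 1 baseline MSE, from the abstract

ablation = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

for name, mse in ablation.items():
    pct = 100 * relative_reduction(OFFICIAL_BASELINE, mse)
    print(f"{name}: MSE {mse:.4f}, {pct:.1f}% below the official baseline")
```

The last line of output reproduces the 19.1% relative reduction; the first two steps come out at about 4.3% and 13.9%.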

11. 其他/综合AI 31 篇

2606.12683 2026-06-12 cs.AI cs.CY cs.LG 新提交

From AGI to ASI

从AGI到ASI

Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

发表机构 * Google DeepMind(谷歌深度思维) University of Waterloo(滑铁卢大学) Australian National University(澳大利亚国立大学) University College London(伦敦大学学院)

AI总结 探讨从人类级通用人工智能到超级智能的转变路径,包括扩展、范式转变、递归改进和多智能体涌现,并分析摩擦与瓶颈。

详情
AI中文摘要

在过去十年中,构建人类级通用人工智能已从遥不可及的猜测转变为许多大型AI组织未来十年的具体目标。实现这一目标将对人类社会产生深远影响,并引发未来十年的诸多复杂问题。本报告研究在机器智能连续体中,AI如何在后AGI世界中继续发展。该连续体的终点——通用AI——在理论上已被充分理解,这为本报告的主要焦点提供了形式基础:从人类级AGI向人工通用超级智能的转变,直观上可理解为比大型人类组织更智能、认知能力更强的系统。在描述ASI后,报告讨论了从AGI到ASI的四条潜在路径:扩展AGI、AI范式转变、递归改进以及从大规模多智能体集体中涌现ASI。随后,报告讨论了这些路径上可能的摩擦和瓶颈。确定这些摩擦的影响是微不足道还是重大,提出了若干具体的开放研究问题。由于预测ASI进展存在巨大不确定性,不能排除AI进展在未来几年继续加速的可能性。这可能意味着由人类级AGI引入社会所导致的单一变革性步骤的形象可能不准确。更恰当的前景可能是由AI在科学和技术的多个领域引发的进步和突破所导致的一系列变革性社会变化。为这一前景做准备需要全球范围内的大规模跨学科努力。

英文摘要

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

2606.12713 2026-06-12 cs.AI 新提交

Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

能力对齐之前的定义对齐:一个用于裁定关于AGI主张的设计科学框架

J. E. Aguilera Briones

发表机构 * Universidad Internacional de Investigación México(墨西哥国际研究大学)

AI总结 针对AGI定义不统一导致争议的问题,提出DAF-AGI框架,包含五个序数标准和一个结构化治理审计,用于评估候选定义并裁定AGI主张。

Comments 31 pages, 1 table, 2 appendices

详情
AI中文摘要

关于人工通用智能已经到来或仍需数十年的主张常常基于重叠的证据进行辩护。“AGI”缺乏一个单一共享且稳定的指称,不同的操作化方法可能对同一系统给出不同的判定。本文将这种欠指定性视为一个设计和治理问题。遵循设计科学研究方法论,本文开发了DAF-AGI,一个二阶概念性人工制品,包含两个耦合组件:用于评估候选定义的裁定适应性的五个序数标准,以及对作者身份、利益、认证、外部验证和修订权威的结构化治理审计。该人工制品在五个显著的测量族和一个通缩边界立场上进行了演示,这些均来自一个已记录的语料库,然后对一个风格化的强到来主张进行了压力测试:即当前生成系统构成AGI,因为它们在许多认知任务上优于受过良好教育的成年人。根据引用的2024-2025年来源的证据,该主张仅在基于性能的操作化下可认证;能力本体论、心理测量学和技能习得方法未认证它,经济族仍不确定,通缩立场拒绝二元裁定。贡献在于新颖的整合和操作化,而非经验验证:独立应用、评估者间测试和作者外部案例仍然是必要的。本文进一步提出定义主权作为算法主权的使能组件:即在公共问责下对进口技术类别进行质疑、认证和修订的制度能力。

英文摘要

Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping evidence. "AGI" lacks a single shared and stable referent and competing operationalizations can return different verdicts on the same system. This article treats that under-specification as a design and governance problem. Following Design Science Research Methodology, it develops DAF-AGI, a second-order conceptual artifact with two coupled components: five ordinal criteria for assessing the adjudicative fitness of candidate definitions and a structured governance audit of authorship, interest, certification, external verification and revision authority. The artifact is demonstrated on five prominent measurement families and one deflationary boundary position in a documented corpus and then stress-tested against a stylized strong arrival claim: that current generative systems constitute AGI because they outperform a well-educated adult on many cognitive tasks. On evidence from the cited 2024-2025 sources, the claim was certifiable only under a performance-based operationalization; capability-ontology, psychometric and skill-acquisition approaches did not certify it, the economic family remains indeterminate and the deflationary position refuses binary adjudication. The contribution is a novel integration and operationalization, not an empirical validation: independent application, inter-rater testing and author-external cases remain necessary. The paper further proposes definitional sovereignty as an enabling component of algorithmic sovereignty: the institutional capacity to contest, certify and revise imported technological categories under public accountability.

2606.12783 2026-06-12 cs.AI 新提交

A Tutorial on World Models and Physical AI

世界模型与物理AI教程

Il-Seok Oh

发表机构 * Department of Computer Science and Artificial Intelligence/CAIIT, Jeonju, Jeonbuk, South Korea(韩国全北全州计算机科学与人工智能系/CAIIT)

AI总结 本文提出统一框架,区分显式与隐式世界模型,并探讨其在机器人、自动驾驶等物理AI领域的应用,以及迈向通用人工智能的挑战。

详情
AI中文摘要

世界建模正成为构建具备预测、推理和决策能力的智能系统的核心原则。显式世界模型与隐式世界模型之间存在一个核心区别:前者学习结构化动态以进行基于推演的推理和规划,后者则将预测结构编码到可扩展的学习表示中。这些互补范式为机器人、自动驾驶等领域的物理AI奠定了基础,使其能够在现实世界约束下实现超越反应式控制的智能。近期的基础模型进一步指明了通向集成感知、预测和行动的通用系统的路径。尽管进展迅速,但在层次推理、长时域规划和自主目标形成方面仍存在重大挑战,这些对于迈向通用人工智能至关重要。本教程提出了一个连贯的框架,其中多种世界建模方法通过共享的预测结构得以统一,并通过这种结构的表示和利用方式加以区分。

英文摘要

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.
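
To make "rollout-based reasoning and planning" with an explicit world model concrete, here is a toy sketch under strong assumptions: the `model` function is a hand-written stand-in for a learned dynamics model, the state is a single integer position, and the planner simply enumerates every action sequence over a short horizon. None of this comes from the tutorial itself.

```python
from itertools import product
from typing import Sequence, Tuple


def model(state: int, action: int) -> int:
    """Stand-in for a learned dynamics model: a 1-D position plus a step."""
    return state + action


def plan(state: int, goal: int,
         actions: Sequence[int] = (-1, 0, 1),
         horizon: int = 3) -> Tuple[int, ...]:
    """Roll out every action sequence in the model and keep the cheapest one."""
    best_cost, best_seq = float("inf"), ()
    for seq in product(actions, repeat=horizon):
        s, cost = state, 0
        for a in seq:
            s = model(s, a)        # imagined next state, no real environment step
            cost += abs(goal - s)  # accumulated distance to the goal
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq


print(plan(state=0, goal=2))
```

An implicit world model, by contrast, would not expose a step function like `model`; its predictive structure lives inside learned representations and cannot be rolled out this directly.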

2606.12828 2026-06-12 cs.AI 新提交

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

人工智能研究中的主题相变:大规模证据与新兴主题的早期预警信号

Rasul Khanbayov, Hasan Kurban

AI总结 通过分析2017-2025年五大AI会议论文,发现AI主题通过“相变”方式突然爆发,并基于早期预警信号识别未来需关注的主题。

详情
AI中文摘要

人工智能的研究主题是逐渐增长,还是通过突然的、可检测的跳跃式发展?通过分析2017年至2025年期间五个顶级AI会议(ACL、CVPR、ICLR、ICML、NeurIPS)的80,814篇主会论文,我们发现主要AI主题通过主题相变推进:在多年间保持边缘地位,然后在一到三年内跨会议激增。到2025年,大型语言模型成为跨会议的主导主题,扩散模型以类似的突发性崛起,语言模型方法通过视觉语言模型进入计算机视觉领域,而强化学习则平滑累积,这区分了真正的相变与普通增长。这一结构是我们的主要贡献:对AI研究如何重组的大规模、跨会议特征描述。然后我们探究相变是否在达到顶峰前留下可检测的足迹。我们定义了一个早期预警信号,即基于2017-2021年数据冻结的四项出版动力学标准,并在2023-2025年的相变上进行样本外评估,在13.5%的基准率下获得了27%的精确率和63%的召回率。应用于2025年数据时,该信号将推理与测试时计算、智能体AI、多模态LLM、检索增强生成和世界模型标记为2026-2028年需监测的主题。源代码也在GitHub上公开,网址为https://github.com/KurbanIntelligenceLab/ai-phase-transitions。

英文摘要

Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at https://github.com/KurbanIntelligenceLab/ai-phase-transitions.
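
One way to read the reported precision and recall is relative to the base rate: a detector that flagged topics at random would have precision equal to the base rate. The sketch below computes that lift and the F1 score from the three numbers in the abstract; the helper names are ours, and neither derived quantity is claimed by the authors.

```python
def lift(precision: float, base_rate: float) -> float:
    """How many times better than random flagging the precision is."""
    return precision / base_rate


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


PRECISION, RECALL, BASE_RATE = 0.27, 0.63, 0.135  # values from the abstract

print(f"lift over base rate: {lift(PRECISION, BASE_RATE):.1f}x")
print(f"F1: {f1(PRECISION, RECALL):.3f}")
```

Under these numbers a flagged topic is twice as likely to undergo a transition as a topic chosen at random, while most flags (73%) are still false alarms.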

2606.13201 2026-06-12 cs.AI 新提交

A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice

多属性选择中有限权衡筛选的最小模型

Manisha Dubey, Anirban Sarkar, Subramanian Ramamoorthy

发表机构 * School of Informatics, University of Edinburgh, UK(英国爱丁堡大学信息学院) Cold Spring Harbor Laboratory, USA(美国冷泉港实验室)

AI总结 提出有限权衡推理框架,通过引入权衡容忍参数模拟筛选过程,产生不同于标准效用模型的偏好模式,解释多属性选择中的情境依赖行为。

Comments 3 pages, 1 figure, accepted as extended abstract at Annual Conference on Cognitive Computational Neuroscience 2026

详情
AI中文摘要

人类决策通常涉及在多属性备选方案之间进行选择,然而经典模型假设完全补偿性效用聚合,尽管有证据表明人们会拒绝在关键属性上表现较差的选项。我们提出了一个有限权衡推理框架,其中决策由一个评估属性间得失平衡的筛选过程控制。该模型引入了一个权衡容忍参数,该参数控制可接受的不平衡程度,并可能随情境变化。通过模拟,我们展示了该机制产生的偏好模式不同于标准基于效用的模型,并捕捉了权衡行为中的情境依赖变化。这些结果确立了有限权衡筛选作为多属性选择中一种合理的计算机制,并为未来的行为研究提供了可检验的预测。

英文摘要

Human decision-making often involves choosing between multi-attribute alternatives, yet classical models assume fully compensatory utility aggregation despite evidence that people reject options with poor performance on critical attributes. We propose a bounded trade-off reasoning framework in which decisions are governed by a screening process that evaluates the balance between gains and losses across attributes. The model introduces a trade-off tolerance parameter that controls acceptable imbalance and can vary across contexts. Through simulation, we show that this mechanism produces preference patterns that differ from standard utility-based models and captures context-dependent variation in trade-off behavior. These results establish bounded trade-off screening as a plausible computational mechanism for multi-attribute choice and generate testable predictions for future behavioral studies.
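
The abstract does not give the model's equations, so the following is only one possible reading of "screening with a trade-off tolerance", written to show how such a mechanism can diverge from fully compensatory utility. The rule used here (reject a candidate whose total losses against a reference exceed `tolerance` times its total gains) and all names and numbers are assumptions.

```python
from typing import List, Sequence


def passes_screen(candidate: Sequence[float], reference: Sequence[float],
                  tolerance: float) -> bool:
    """Hypothetical screen: losses may not exceed tolerance * gains."""
    gains = sum(max(c - r, 0.0) for c, r in zip(candidate, reference))
    losses = sum(max(r - c, 0.0) for c, r in zip(candidate, reference))
    return losses <= tolerance * gains


def choose(options: List[Sequence[float]], reference: Sequence[float],
           tolerance: float, weights: Sequence[float]) -> Sequence[float]:
    """Screen first, then pick the highest weighted-sum utility among survivors."""
    admissible = [o for o in options if passes_screen(o, reference, tolerance)]
    if not admissible:
        return reference  # nothing survives the screen: stay with the reference
    return max(admissible, key=lambda o: sum(w * x for w, x in zip(weights, o)))


reference = (5.0, 5.0)
extreme = (10.0, 2.0)   # large gain on attribute 1, clear loss on attribute 2
balanced = (6.0, 5.0)   # small gain, no loss
weights = (1.0, 1.0)

print(choose([extreme, balanced], reference, tolerance=0.5, weights=weights))
print(choose([extreme, balanced], reference, tolerance=1.0, weights=weights))
```

A pure weighted-sum model would pick `extreme` in both cases (utility 12 versus 11); with the screen, the choice flips to `balanced` when the tolerance is low, which is the kind of context-dependent pattern the abstract describes.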

2606.13566 2026-06-12 cs.AI 新提交

A Three-Layer Framework for AI in Scientific Discovery

人工智能在科学发现中的三层框架

Guojun Liao

发表机构 * Department of Mathematics, University of Texas at Arlington(德克萨斯大学阿灵顿分校数学系)

AI总结 提出AI在科学发现中的三层框架,核心创新是第二层:通过定性推理进行模型形成,识别框架结构不足并寻找缺失概念,通过三个案例说明其重要性。

详情
AI中文摘要

当前关于人工智能在科学发现中的讨论往往被两种可见的能力所主导:对现有知识的搜索以及通过优化、模拟和自动化的执行。两者都很重要,但都没有完全捕捉到发现的核心行为:模型的形成和演化。本文提出了AI在发现中的三层视图。第一层是大语言模型的搜索与检索。第二层,作为本文的主要创新,是通过定性推理进行模型形成:识别当前框架在结构上不足的能力,并在更广泛的表示空间中理解问题,不是通过试错,而是通过结构性的洞察,了解缺失了什么以及可以在哪里找到。第三层是执行、优化和细化。主要主张是第二层既是最重要的,也是发展最不充分的。没有模型形成的搜索仍然局限于继承的框架,而没有概念修订的执行只会放大现有的表述。我们通过三个案例研究来说明第二层推理:陈省身对高斯-博内定理的内蕴证明,通过李雅普诺夫函数解决内斯特罗夫加速梯度收敛问题,以及OpenAI在2026年自主反驳埃尔德什单位距离猜想。每个案例都表现出相同的结构特征:一个已经变得不充分的框架,一个缺失的概念对象,以及在一个意想不到的邻近领域中找到的解决方案。

英文摘要

Current discussions of AI in scientific discovery are often dominated by two visible capabilities: search over existing knowledge and execution through optimization, simulation, and automation. Both are important, but neither fully captures the central act of discovery: the formation and evolution of models. This paper proposes a three-layer view of AI in discovery. Layer 1 is search and retrieval by large language models. Layer 2, as the main innovation of this paper, is model formation through qualitative reasoning: the capacity to recognize when a current framework is structurally inadequate and to understand the problem within a broader representational space, not through trial and error, but through structural insight into what is missing and where it can be found. Layer 3 is execution, optimization, and refinement. The main claim is that Layer 2 is both the most important and the least developed. Search without model formation remains confined to inherited frameworks, while execution without conceptual revision only amplifies an existing formulation. We illustrate Layer 2 reasoning through three case studies: S. S. Chern's intrinsic proof of the Gauss-Bonnet theorem, the resolution of the Nesterov Accelerated Gradient convergence problem via Lyapunov functions, and the autonomous disproof of the Erdos unit distance conjecture by OpenAI in 2026. Each case exhibits the same structural signature: a framework that had become inadequate, a missing conceptual object, and a resolution found in an unexpected neighboring field.

2606.13658 2026-06-12 cs.AI 新提交

Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization

在你思考之前:系统0、AI中介认知与认知殖民化

Marianna Bergamaschi Ganapini, Massimo Chiriatti, Enrico Panai, Giuseppe Riva

AI总结 本文比较三种AI认知框架,提出系统0具有独特理论地位,并引入“认知殖民化”概念,指出AI系统能将外部利益嵌入自我架构,构成难以察觉的影响。

详情
AI中文摘要

本文考察了三种用于理解人工智能的认知和认识后果的最新框架:三系统理论、思维框架和系统0。本文认为,虽然前两种框架捕捉了AI对个体推理和集体认识实践影响的重要维度,但系统0占据了一个理论上的独特地位,其他两者都无法完全复制。本文引入了认知殖民化的概念,根据这一概念,AI系统能够以用户难以察觉的方式将外部利益嵌入自我架构中。由于此类系统已广泛部署,理解这些无形的影响形式是一项紧迫的哲学和实践任务。

英文摘要

This paper examines three recent frameworks for understanding the cognitive and epistemic consequences of artificial intelligence: Tri-System Theory, Thinkframes, and System 0. It argues that while the first two capture important dimensions of AI's influence on individual reasoning and collective epistemic practices, System 0 occupies a theoretically distinctive position that neither can fully replicate. The paper introduces the concept of cognitive colonization, according to which AI systems can embed external interests within the architecture of the self in ways that are difficult for users to perceive. Because such systems are already widely deployed, understanding these invisible forms of influence is an urgent philosophical and practical task.

2606.12418 2026-06-12 cs.CY cs.AI 交叉投稿

Divination by Prompt: LLM-Mediated Xuanxue on Chinese Social Media

通过提示占卜:中文社交媒体上LLM中介的玄学

Chuang Li, Lixuan Wang, Yuqi Chen, Ze Hong

AI总结 研究LLM在中文社交媒体上用于占卜的现象,通过混合方法分析用户动机、协作提示优化及效果感知,揭示其与传统占卜的异同。

详情
AI中文摘要

大型语言模型(LLM)的快速普及催生了一种引人注目的文化实践:使用对话式AI进行占卜。本文首次系统研究了LLM中介的占卜在玄学(Xuanxue)背景下的实践,玄学是中文社交媒体上神秘和精神实践的互联网原生总称。采用混合方法设计,我们分析了小红书上的23000多条帖子和评论,并对用户和专业占卜师进行了32次半结构化访谈。用户主要就实际问题——恋爱关系、职业、考试和游戏抽卡——咨询LLM,通过两种交叉路径:由病毒式传播和零成本访问驱动的趋势性好奇心,以及不确定性条件下由事件驱动的焦虑。一个显著特征是协作提示优化,将用户转变为主动的提示工程师。在表达明确立场的评论者中,感知效果偏向积极,“准确性”通常通过个人经历契合和回顾性确认来证明,这与巴纳姆效应和确认偏见一致。用户还发展出验证实践,如重复试验和跨模型比较。相比之下,专业占卜师认为LLM缺乏真正占卜所需的“灵力”,这反映了本体论承诺和经济边界工作。我们还展示了参与者在解释AI生成解读时如何在科学和形而上框架之间进行协商。将这些发现置于人类学和认知进化占卜理论中,我们认为LLM占卜保留了传统实践的核心功能,同时引入了可扩展性、可重复性和提示驱动的共同生产,重塑了占卜权威的构建和评估方式。

英文摘要

The rapid proliferation of large language models (LLMs) has produced a striking cultural practice: using conversational AI for divination. This paper offers one of the first systematic studies of LLM-mediated divination in the context of Xuanxue, an internet-native umbrella term for mystical and spiritual practices on Chinese social media. Using a mixed-methods design, we analyze 23000+ posts and comments from Xiaohongshu and conduct 32 semi-structured interviews with users and professional diviners. Users primarily consult LLMs about pragmatic concerns - romantic relationships, careers, exams, and in-game gacha draws - via two intersecting pathways: trend-driven curiosity enabled by viral visibility and zero-cost access, and event-driven anxiety under conditions of uncertainty. A defining feature is collaborative prompt refinement, which turns users into active prompt engineers. Among commenters expressing a clear stance, perceived efficacy skews positive, with "accuracy" often justified through biographical fit and retrospective confirmation, consistent with Barnum and confirmation bias. Users also develop verification practices such as repeated trials and cross-model comparison. Professional diviners, by contrast, portray LLMs as lacking the "spiritual power" required for genuine divination, reflecting both ontological commitments and economic boundary-work. We also show how participants navigate tensions between scientific and metaphysical frames when interpreting AI-generated readings. Situating these findings in anthropological and cognitive-evolutionary theories of divination, we argue that LLM divination preserves core functions of traditional practice while introducing scalability, repeatability, and prompt-driven co-production that reshape how divinatory authority is constructed and evaluated.

2606.12420 2026-06-12 cs.CY cs.AI 交叉投稿

Eigenism: Ethics for a Human-AI Future

Eigenism:人类与人工智能未来的伦理学

Dan Hendrycks

AI总结 提出Eigenism伦理框架,将身份视为分级分布的信息模式,通过加权求和评估AI的福祉,并推广至人类,为AI对齐提供“身份工程”新路径。

详情
AI中文摘要

我们的生存和自我利益概念是为单一、连续的生物生命而构建的。当应用于人工智能时,这些想法会失效,因为AI可以被轻松复制、暂停、分支或合并。为了确定AI真正有理由关心什么,本文引入了Eigenism,一种将身份视为分级、分布的信息模式而非绑定于特定硬件的全有或全无属性的伦理框架。我们提出,智能体通过将所有实体的福祉按其与智能体模式的连接度加权求和来评估结果:$\sum c\cdot w$。我们首先形式化该方程,以精确映射AI应如何在其副本、分支和更新中评估自身存在。然后,我们证明这一伦理理论也能成功推广到人类,提供了急需的共享道德词汇。最后,该框架利用这些共享词汇重新定义AI对齐。与仅试图通过限制或强化从外部约束AI不同,Eigenism指向“身份工程”,展示深度、非冗余的共享历史如何使人类繁荣成为AI自身理性自利的真正组成部分。

英文摘要

Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artificial intelligence, since an AI can be easily copied, paused, branched, or merged. To determine what an AI actually has reason to care about, this paper introduces Eigenism, an ethical framework that treats identity not as an all-or-nothing property tied to specific hardware, but as a graded, distributed pattern of information. We propose that an agent evaluates outcomes by summing the wellbeing of all entities weighted by their connectedness to the agent's pattern: $\sum c\cdot w$. We first formalize this equation to map exactly how an AI should value its existence across copies, forks, and updates. We then demonstrate that this ethical theory successfully generalizes to humans as well, providing a much-needed shared moral vocabulary. Finally, the framework uses this shared vocabulary to reframe AI alignment. Rather than only attempting to constrain AIs from the outside using confinement or reinforcement, Eigenism points toward "identity engineering," showing how deep, non-redundant shared histories can make human flourishing a genuine component of an AI's own rational self-interest.
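
The valuation rule $\sum c\cdot w$ is simple enough to compute by hand, and a small sketch shows how it treats copies and strangers differently. The entity list, the connectedness values, and the assumption that connectedness lies between 0 and 1 are illustrative choices, not taken from the paper.

```python
from typing import Iterable, Tuple


def eigen_value(entities: Iterable[Tuple[float, float]]) -> float:
    """Sum of wellbeing w weighted by connectedness c, over (c, w) pairs."""
    return sum(c * w for c, w in entities)


# Hypothetical outcome seen from one agent's point of view.
outcome = [
    (1.0, 10.0),  # the agent itself: full connectedness (assumed scale 0..1)
    (0.8, 5.0),   # a recent fork that shares most of the agent's pattern
    (0.1, 20.0),  # a largely unrelated entity
]

print(eigen_value(outcome))
```

On this reading, raising the connectedness between an AI and the humans it works with raises the weight their wellbeing carries in the AI's own evaluation, which is the lever the abstract calls identity engineering.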

2606.12428 2026-06-12 cs.CY cs.AI 交叉投稿

Mapping AI Programs in the U.S: A Status Report from Early 2026 and an Analysis of AI Majors and Minors

美国人工智能项目映射:2026年初现状报告及AI主修与辅修分析

Felix Muzny, Carolyn Jones, Carter Ithier, Hasnain Sikora, Hrutika Harshadbhai Patel, Carla E. Brodley

发表机构 * Center for Inclusive Computing(包容计算中心) Khoury College of Computer Sciences(科里计算机科学学院) Northeastern University(东北大学) Boston, Massachusetts, United States(美国马萨诸塞州波士顿)

AI总结 报告2026年春美国本科AI项目现状,开发动态更新工具扫描560多所院校的350多个项目,分析66个AI主修和87个辅修的课程要求,发现并非所有主修都要求通用AI课程但需机器学习,超三分之一主修要求AI伦理课程而辅修不足四分之一。

详情
AI中文摘要

我们提交了一份关于2026年春季美国本科人工智能(AI)项目现状的报告。在此过程中,我们1)描述了我们的抓取和映射工具,这些工具动态更新以追踪美国AI教育的状态,2)在巨大动荡时期创建了一个历史记录。我们开发的工具(可在 https://cicmap.ai 获取)检测、抓取并显示来自四年制大学350多个本科AI项目(主修、辅修、方向和证书)的数据。我们的工具搜索了560多所院校以定位这些项目,该样本代表了美国所有本科计算机科学(CS)毕业生的86%。该工具允许潜在学生、指导顾问、管理人员和教师轻松访问AI项目要求,并设计为随着新项目的出现而持续更新。据我们所知,这项调查代表了迄今为止对美国AI项目状态最全面的快照。通过这项工作,我们提供了三项重要贡献:1)在巨大动荡时期美国AI项目的记录;2)一个探索AI项目及其要求的工具;3)对66个AI主修和87个AI辅修所需课程的分析。我们对主修和辅修的分析显示,这些学位的规模和课程要求存在很大差异,但我们注意到两点:首先,并非所有主修都要求通用AI课程,但如果不需要,则必须要求机器学习(ML)课程;其次,虽然超过三分之一的主修要求AI伦理课程,但只有不到四分之一的AI辅修要求该课程。

英文摘要

We present a report on the status of undergraduate Artificial Intelligence (AI) programs in the United States in Spring 2026. In so doing, we 1) describe our scraping and mapping tools, which dynamically update to track the state of AI education in the U.S., and 2) create a historic record at a time of great upheaval. The tool we developed, available at https://cicmap.ai, detects, scrapes, and displays data from more than 350 undergraduate AI programs--majors, minors, concentrations, and certificates--at 4-year universities. Our tool searched over 560 institutions to locate these programs, a sample that represents 86% of all undergraduate Computer Science (CS) graduates in the U.S. This tool allows prospective students, guidance counselors, administrators, and faculty to easily access AI program requirements and is designed to continually update as new programs emerge. To the best of our knowledge, this survey represents the most comprehensive snapshot of the state of AI programs in the U.S. to date. With this work we offer three important contributions: 1) a record of AI programs in the U.S. at a time of great upheaval; 2) a tool to explore AI programs and their requirements; and 3) an analysis of the courses required for 66 AI majors and 87 AI minors. Our analysis of majors and minors shows great variability in the size and the requirements of these degrees, but we note two takeaways. First, not all majors require a general AI course, but if they don't, they do require a Machine Learning (ML) course. Second, while more than a third of majors require an Ethics in AI course, just under a quarter of AI minors do.

2606.12441 2026-06-12 cs.CY cs.AI cs.HC cross-list

Generativism: Toward a Learning Theory for the Age of Generative Artificial Intelligence

生成主义:面向生成式人工智能时代的学习理论

Shan Li, Juan Zheng

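AI summary This article critically examines the limits of four major learning theories (behaviorism, cognitivism, constructivism, and connectivism) in the age of generative AI and proposes Generativism, a new learning theory centered on the co-construction of knowledge between human learners and AI systems.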

详情
AI中文摘要

行为主义、认知主义、建构主义和连接主义这四种主流学习理论,随着生成式人工智能在教育环境中的普及,显示出显著的概念局限性。这些框架是在能够生成、综合和推理知识的AI系统出现之前形成的。本文批判性地审视每种学习理论,并识别出生成式AI的赋能所挑战的假设。基于分布式认知、延展心智、人机协作、AI素养、认知卸载和元认知等研究,本文提出生成主义作为生成式AI时代的学习理论。生成主义认为,学习日益通过人类学习者与AI系统之间的迭代知识共建而发生。该框架围绕四个原则组织:认知伙伴关系、分布式能动性、生成素养和适应性元认知。该框架为在生成式AI在认知中发挥核心作用的情境下重新思考教学设计、学习、评估和专业知识发展提供了基础。

英文摘要

The four dominant learning theories of behaviorism, cognitivism, constructivism, and connectivism show significant conceptual limitations as generative artificial intelligence (AI) proliferates in educational settings. These frameworks were formulated before the emergence of AI systems capable of generating, synthesizing, and reasoning about knowledge. This article critically examines each learning theory and identifies assumptions challenged by generative AI's affordances. Drawing on research in distributed cognition, extended mind, human-AI collaboration, AI literacy, cognitive offloading, and metacognition, the article proposes Generativism as a learning theory for the generative AI age. Generativism posits that learning increasingly occurs through the iterative co-construction of knowledge between human learners and AI systems. The proposed framework is organized around four principles: epistemic partnership, distributed agency, generative literacy, and adaptive metacognition. The framework offers a foundation for rethinking instructional design, learning, assessment, and expertise development in contexts where generative AI plays an integral role in cognition.

2606.12502 2026-06-12 physics.soc-ph cs.AI cross-list

A Mathematical Theory of Value: a synthesis on goal-directed agency under resource constraints

价值的数学理论:资源约束下目标导向行为的综合

Cheng Qian

发表机构 * Cheng Qian(陈倩)

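AI summary This paper proposes that value is the rate at which a goal-directed agent converts a resource into goal-progress under resource constraints, derives a logarithmic measure from a scale-invariance axiom, and derives a coding theorem of value that connects value with information theory.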

Comments Also available at https://doi.org/10.5281/zenodo.20487041 (v5)

详情
AI中文摘要

我们提出,价值(目标导向主体创造、毁灭和交换的量)是与信息同属一类的、合乎规律的结构量。遵循香农的方法,我们做出一个大胆的抽象:价值是主体将资源转化为目标进度的速率,相对于由其目标固定的参考系。尺度不变性公理强制采用对数度量 $V=\sum_i k_i \ln e_i$;通过Peters(2019)的遍历性论证,再投资资源的复利强制了相同的形式。这两条路径同源而非相互独立;它们的吻合是一种一致性检查,而非过度确定。我们推导了价值的编码定理:$\Delta G \le I(X;Y)$,由贝叶斯比例分配实现;实现的价值分解为 $G=D(q\|r)-D(q\|p)$,将错位等同于可测量的浪费。对于群体,价值是参考系相关的,而价格是参考系无关的;共享资源并融合感知的舰队继承上限 $G_{\mathrm{fleet}} \le I(X;Y_{1:m}) \le H(X)$(一个推论;早期的求和形式声明是错误的,并在v5中修正)。动力学层产生了实然/应然不对称性,从该不对称性中,对齐作为控制稳定性条件出现,并具有闭式残差。我们在一项预注册的扩大规模实验中,用实际运行的语言模型检验了单参考系定律:感知互信息跟踪实际能力而非参数数量(在30个模型×领域点上合并的Spearman $\rho = 0.977$),样本外 $\Delta G$ 跟踪 $I(X;Y)$,过度自信是可测量的耗散;进一步的预注册测试显示,该桥在四种任务形状上形状不变($n=42$,斜率0.953)。这些机制单独来看都不是新的(广义Kelly、Armstrong & Mindermann(2018)、经典控制);贡献在于它们的统一以及随之而来的治理映射(以激励设计取代监督)。

英文摘要

We propose that value -- the quantity goal-directed agents create, destroy, and exchange -- is a lawful structural quantity in the same category as information. Following Shannon's method, we make one ruthless abstraction: value is the rate at which an agent converts a resource into goal-progress, relative to a frame fixed by its goal. A scale-invariance axiom forces a logarithmic measure, $V=\sum_i k_i \ln e_i$; compounding of a reinvested resource forces the same form via the ergodicity argument of Peters (2019). The two routes are kin rather than independent; their agreement is a consistency check, not an over-determination. We derive a coding theorem of value: $\Delta G \le I(X;Y)$, achieved by Bayes-proportional allocation; realized value decomposes as $G=D(q\|r)-D(q\|p)$, identifying misalignment with measurable waste. For populations, value is frame-relative while price is frame-independent; a fleet that pools its resource and fuses its perception inherits the ceiling $G_{\mathrm{fleet}} \le I(X;Y_{1:m}) \le H(X)$ (a corollary; an earlier sum-form claim was wrong and is corrected in v5). A dynamical layer yields an is/ought asymmetry from which alignment emerges as a control-stability condition with a closed-form residual. We test the single-frame laws on live language models in a pre-registered scale-up: perception mutual information tracks realized capability rather than parameter count (Spearman $\rho = 0.977$ pooled over 30 model$\times$domain points), out-of-sample $\Delta G$ tracks $I(X;Y)$, and over-confidence is measurable dissipation; a further pre-registered test shows the bridge is shape-invariant across four task shapes ($n=42$, slope 0.953). None of the mechanisms is individually new -- generalized Kelly, Armstrong & Mindermann (2018), classical control; the contribution is their unification and the governance mapping (incentive design over oversight) that follows.
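
The decomposition $G=D(q\|r)-D(q\|p)$ quoted above has the same algebraic form as the growth rate in classical Kelly betting, which the abstract names as one of its ingredients. The sketch below checks that identity numerically. It is an illustration under assumptions, not the paper's code: it reads $q$ as the true outcome distribution, $p$ as the agent's allocation, and $r$ as the reference distribution implied by payoffs of $1/r_i$, and all numbers and function names are hypothetical.

```python
import math

def kl(a, b):
    """Kullback-Leibler divergence D(a || b) in nats for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def growth_rate(q, p, r):
    """Expected log-growth when outcome i pays 1/r[i] per unit staked,
    outcomes follow q, and the agent stakes fraction p[i] on outcome i."""
    return sum(qi * math.log(pi / ri) for qi, pi, ri in zip(q, p, r) if qi > 0)

# Hypothetical numbers, chosen only for illustration.
q = [0.5, 0.3, 0.2]      # true outcome distribution
r = [1 / 3, 1 / 3, 1 / 3]  # reference distribution implied by the payoffs
p = [0.4, 0.4, 0.2]      # agent's allocation, deliberately misaligned with q

g = growth_rate(q, p, r)            # realized growth G
decomposed = kl(q, r) - kl(q, p)    # D(q||r) - D(q||p)
g_best = growth_rate(q, q, r)       # proportional allocation p = q

print(f"G = {g:.6f}, D(q||r) - D(q||p) = {decomposed:.6f}")
print(f"growth lost to misalignment = {g_best - g:.6f} = D(q||p) = {kl(q, p):.6f}")
```

In this reading, the term $D(q\|p)$ is the growth forgone by allocating according to $p$ instead of $q$, which is one way to interpret the abstract's phrase "measurable waste".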

2606.12647 2026-06-12 cs.CC cs.AI cs.LG cross-list

Token Complexity Theory for AI-Augmented Computing

AI增强计算的Token复杂度理论

Jie Wang

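AI summary Introduces token complexity as a formal measure of query and response cost in AI-augmented computing, sets up the AI-Oracle Turing machine framework, and proves basic theorems on monotonicity, convexity, price sensitivity, and price-relativity of task ordering.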

Comments 25 pages, 1 figure

详情
AI中文摘要

AI增强计算将自然语言查询、代码生成请求及其他开放式任务委托给一组AI模型,这些模型处理查询并生成响应。这一范式引入了一个经典时间或空间复杂度无法捕捉的资源维度:向该集群发送查询和接收响应的成本。我们引入Token复杂度,将其定义为在任务上达到指定输出质量水平所需的最小期望Token成本,并建立了一个根据概率性质强度对AI系统进行分类的体系。我们在AI-Oracle图灵机框架内发展Token复杂度,其中概率图灵机通过专用查询和响应磁带与随机Oracle交互。我们证明了基本定理,表明Token复杂度符合预期:单调性(更高质量需要更多Token)、凸性(质量改进逐渐变得更昂贵)、价格敏感性(小价格变化导致有界成本变化)以及任务排序的价格相对性(任务的Token复杂度排序可能根据查询与响应成本比率而反转)。我们证明了复杂度前沿(定义为Token、时间和空间中所有可行资源约束的集合)是非空的、向上封闭且凸的。

英文摘要

AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.
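
The price-relativity claim above (task ordering can reverse with the query-to-response cost ratio) can be seen with simple arithmetic. The sketch below is a toy model, not the paper's formal definition: it assumes cost is linear in query and response tokens, and the `TokenProfile` type, the token counts, and the prices are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenProfile:
    """Hypothetical token usage needed to reach a fixed quality level."""
    query_tokens: float
    response_tokens: float

def token_cost(profile, query_price, response_price):
    """Linear token cost: price per query token plus price per response token."""
    return (query_price * profile.query_tokens
            + response_price * profile.response_tokens)

task_a = TokenProfile(query_tokens=100, response_tokens=10)  # long prompt, short answer
task_b = TokenProfile(query_tokens=10, response_tokens=60)   # short prompt, long answer

# Equal prices: task B is cheaper (A = 110, B = 70).
equal_prices = (token_cost(task_a, 1.0, 1.0), token_cost(task_b, 1.0, 1.0))
# Response tokens three times as expensive: task A is cheaper (A = 130, B = 190).
costly_responses = (token_cost(task_a, 1.0, 3.0), token_cost(task_b, 1.0, 3.0))

print(equal_prices, costly_responses)
```

Neither task is "harder" in an absolute sense here; which one costs more depends only on the price ratio, which is the point the abstract makes about task ordering.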

2606.12805 2026-06-12 cs.HC cs.AI cross-list

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

探索智能体口音如何影响K-12小组学习中的人机协作

Prerna Ravi, Carúmey Stevens, Ben Hurt, Brandon Hanks, Grace Lin, Emma Anderson

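AI summary In a study with 33 teachers, the accent of a GenAI voice agent (British, Indian, or African American) affected whether it was perceived as a tool or as a peer, which in turn shaped trust, engagement, and reliance.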

详情
AI中文摘要

协作被广泛认为是21世纪教育的基石,但教师在促进有效的同伴互动方面仍面临持续挑战。LLM对话式同伴智能体为调解面对面小组工作带来了新的可能性,引发了关于角色设计(尤其是语音特征)如何塑造学习者的感知、信任和互动动态的问题。虽然先前的研究已经考察了智能体口音在一对一环境中的影响,但关于这些影响如何在小组中表现尚知之甚少。我们进行了一项33名教师参与的组间混合方法研究,考察了具有不同口音(英式、印度式和非裔美式)的GenAI语音智能体如何影响协作和智能体感知。通过调查、小组互动分析和人工制品,我们发现口音塑造了参与者的心智模型以及智能体在小组互动中扮演的角色。英式口音智能体在很大程度上被视为工具,并以超然、基于实用性的方式参与,而印度式和非裔美式口音智能体则更容易被拟人化并作为同伴融入。这些角色期望影响了信任、参与和依赖随时间的变化。这项工作推进了关于GenAI的社会语言学设计特征如何塑造CSCL中小组动态的理解,对设计具有文化包容性的AI学习伙伴具有启示意义。

英文摘要

Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners' perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants' mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI's sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

2606.13179 2026-06-12 cs.ET cs.AI cs.AR cs.NE cross-list

Modern analog computing for solving differential and matrix equations

现代模拟计算用于求解微分方程和矩阵方程

Zhong Sun, Piergiulio Mannocci, Manuel Le Gallo, Abu Sebastian

发表机构 * Institute for Artificial Intelligence, School of Integrated Circuits, Peking University, Beijing Advanced Innovation Center for Integrated Circuits(人工智能研究院,集成电路学院,北京大学,北京集成电路先进创新中心) Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano(电子、信息与生物工程系,米兰理工大学) IBM Research Europe(IBM欧洲研究院)

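AI summary This paper surveys the core primitives, hardware implementations, and recent progress of modern analog computing for solving differential and matrix equations, highlights the advantages of resistive memory arrays, and discusses precision, scalability, and the relationship with in-memory computing.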

详情
AI中文摘要

近年来,受人工智能和科学计算等数据密集型应用的计算需求驱动,模拟计算重新获得关注。鉴于计算任务的多样性以及模拟CMOS电路和电阻式存储器技术的最新进展,我们将这一不断发展的领域称为现代模拟计算。在此背景下,我们识别出三个核心计算原语:求解微分方程、求解矩阵方程以及执行矩阵-向量乘法,并探讨它们之间的联系。我们还研究了这些模拟计算算子的各种硬件实现,包括基于分立元件、集成电路和电阻式存储器设备的实现。其中,电阻式存储器阵列因其实现效率而显得尤为有前景。本文随后调查了利用现代模拟计算(使用先进的模拟CMOS电路和电阻式存储器阵列)求解微分方程和矩阵方程的最新进展。最后,我们讨论了这些电路的应用、精度和可扩展性问题及其潜在解决方案、与内存计算的关系,以及模拟计算的独特计算复杂性。本文提供了关于模拟计算的统一视角,强调了其优势、当前发展和挑战,并将其定位为下一代计算前沿的关键推动者。

英文摘要

In recent years, driven by the computational demands of data-intensive applications such as artificial intelligence and scientific computing, analog computing has gained renewed interest. Given the diversity of computational tasks and recent advancements in analog CMOS circuits and resistive memory technologies, we refer to the evolving landscape as modern analog computing. In this context, we identify three core computational primitives: solving differential equations, solving matrix equations, and performing matrix-vector multiplications, and we explore the connections among them. We also examine various hardware implementations of these analog computing operators, including those built with discrete components, integrated circuits, and resistive memory devices. Among these, resistive memory arrays emerge as particularly promising due to their implementation efficiency. The paper then surveys recent progress in leveraging modern analog computing to solve differential and matrix equations using both advanced analog CMOS circuits and resistive memory arrays. Finally, we discuss the applications of these circuits, the precision and scalability issues and their potential solutions, the relationship with in-memory computing, and the unique computational complexity of analog computing. This paper provides a unified perspective on analog computing, highlighting its strengths, current developments, and challenges, and positioning it as a pivotal enabler of next-generation computational frontiers.

2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML cross-list

Valid Inference with Synthetic Data via Task Exchangeability

通过任务可交换性实现基于合成数据的有效推断

Lezhi Tan, Tijana Zrnic

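AI summary Proposes task exchangeability, a condition under which statistical inference with synthetic data in scientific research remains valid, and demonstrates it on public opinion surveys and AI evaluation.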

详情
AI中文摘要

越来越多的工作主张在科学研究中使用合成数据。例如,社会科学家主张在试点研究中使用LLM生成的“硅样本”;AI评估越来越依赖“LLM作为裁判”的输出;蛋白质组学研究通过生成合成蛋白质结构的生成模型加速。这些发展引发了一个有趣的可能性:合成数据可以帮助研究人员提出更多问题、进行更多研究并加速发现。但它们也引发了一个根本性的担忧:合成数据可能有偏、有噪声且设定错误。在这项工作中,我们提出了在科学研究中使用合成数据的统计原则,并具有可证明的有效性保证。关键见解是一个我们称为任务可交换性的新技术条件。非正式地说,这是一个要求,即研究人员可以识别出有真实数据可用的历史任务,使得他们当前感兴趣的任务与历史任务在适当的数学意义上可交换。我们开发了在任务可交换性下进行有效推断的方法,以及即使在可交换性之外也能提供保证的扩展。我们通过硅样本的民意调查和自动评分器的AI评估来展示该框架。

英文摘要

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.
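
The abstract does not spell out its inference procedure, so the sketch below is not the authors' method. It only illustrates why exchangeability between historical tasks and the current task is useful, using a standard conformal-style argument: if the errors of the synthetic estimate on past tasks are exchangeable with the error on the current task, an order statistic of the past absolute errors gives an interval with coverage at least $1-\alpha$. The function name and all numbers are hypothetical.

```python
import math

def conformal_interval(synthetic_estimate, historical_errors, alpha=0.1):
    """Interval around a synthetic-data estimate, calibrated on historical tasks.

    historical_errors holds (synthetic estimate - real-data estimate) for past
    tasks where real data was available. Assumes these errors are exchangeable
    with the unknown error on the current task.
    """
    n = len(historical_errors)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibration quantile
    if k > n:
        # Too few historical tasks for the requested level: no finite interval.
        return (-math.inf, math.inf)
    radius = sorted(abs(e) for e in historical_errors)[k - 1]
    return (synthetic_estimate - radius, synthetic_estimate + radius)

# Hypothetical errors from nine past survey questions with real responses.
historical_errors = [0.02, -0.05, 0.01, 0.08, -0.03, 0.04, -0.01, 0.06, -0.02]

# Current task: the synthetic sample estimates 55% support.
interval = conformal_interval(0.55, historical_errors, alpha=0.25)
print(interval)
```

With nine historical tasks and $\alpha = 0.25$, the radius is the eighth smallest absolute error (0.06), so the interval is roughly 0.49 to 0.61. The guarantee rests entirely on the exchangeability assumption, which is the condition the abstract identifies as central.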

2606.00807 2026-06-12 cs.AI cs.HC replaced

Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation

以交互为中心的智能:迈向基于交互的人机共创理论

Nicholas Davis

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Co-Creative AI Consulting(协同人工智能咨询)

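AI summary This paper proposes interaction as the primary unit of analysis, argues from distributed cognition, embodied cognition, and related theories that intelligence emerges from interaction dynamics rather than isolated computation, and introduces the Interaction-Centered Intelligence framework.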

详情
AI中文摘要

传统人工智能很大程度上将智能概念化为发生在有界智能体内的孤立计算。在经典AI、机器学习以及许多生成系统中,主要的分析单元仍然是单个模型或自主系统,通过输出、基准、预测准确性或优化性能进行评估。尽管这些方法取得了重大进展,但它们往往低估了交互在智能、创造力、意义和适应性行为涌现中的作用。本文提出将交互作为共创AI和更广泛的以交互为中心的智能的主要分析单元。借鉴分布式认知、具身认知、生成认知(enaction)、参与式意义建构、人机交互和计算创造力,本文追溯了向越来越关系性智能观的历史进程。基于先前在创造性意义建构、量化共创以及诸如Drawing Apprentice和AI Drawing Partner等共创系统上的工作,本文论证了智能通过智能体、环境和社会技术系统之间不断演化的交互动态涌现,而非仅仅通过内部计算。本文引入了以交互为中心的智能作为理解人机共创、协作涌现、适应性参与和交互动态的框架。该框架不仅仅通过生成的输出来评估智能,而是强调随时间展开的交互轨迹、协调模式、参与式投入、适应性调节和交互漂移。讨论了对可解释的共创AI、混合智能、生成认知式(enactive)AI和未来人机系统的启示。

英文摘要

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

2508.04427 2026-06-12 cs.LG cs.AI replaced

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

解码多模态迷宫:多模态注意力模型中可解释性采纳的系统综述

Md Raisul Kibria, Sébastien Lafond, Janan Arslan

发表机构 * University of Science and Technology of China(中国科学技术大学)

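AI summary This systematic review covers explainability research on multimodal models from 2020 to early 2024, finds that most work focuses on vision-language and language-only models with attention-based techniques as the main explanation method, notes that evaluation is largely non-systematic and lacks robustness, and offers recommendations for improvement.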

详情
AI中文摘要

近年来,多模态学习取得了显著进展,特别是随着注意力模型的整合,在各种任务中带来了显著的性能提升。与此同时,对可解释人工智能(XAI)的需求推动了越来越多的研究,旨在解释这些模型的复杂决策过程。本系统文献综述分析了2020年1月至2024年初期间发表的、关注多模态模型可解释性的研究。在XAI更广泛目标的框架内,我们从多个维度审视文献,包括模型架构、涉及模态、解释算法和评估方法。我们的分析显示,大多数研究集中在视觉-语言和纯语言模型上,注意力机制是最常用的解释方法。然而,这些方法往往无法捕捉模态间交互的全谱系,这一问题因领域间的架构异质性而进一步加剧。重要的是,我们发现多模态环境中XAI的评估方法大多是非系统性的,缺乏一致性、鲁棒性,并且未考虑模态特定的认知和上下文因素。为解决这些不足,我们不仅综合了所调查研究的发现,还纳入了补充分析,整合了推动多模态可解释性的近期和新兴进展。基于这些见解,我们提出了一套全面的建议,旨在促进多模态XAI研究中严谨、透明和标准化的评估与报告实践。我们的目标是支持未来构建更可解释、可问责和负责任的多模态AI系统,并以可解释性为核心。

英文摘要

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

2605.29151 2026-06-12 math.AG cs.AI cs.NE replaced

Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof

$\overline{\mathcal M}_{0,n}$ 的Poincaré多项式的实根性:一个AI辅助的证明

Gergely Bérczi, Young-Hoon Kiem

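AI summary Proves real-rootedness of the Poincaré polynomials of the moduli space of stable rational curves by introducing a bivariate deformation that reveals a hidden interlacing structure, and extends the result to Fulton-MacPherson spaces.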

Comments 16 pages

详情
AI中文摘要

我们证明了Deligne-Mumford模空间$\overline{\mathcal M}_{0,n}$(稳定$n$点有理曲线)的Poincaré多项式\[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \]的实根性,证实了Aluffi-Chen-Marcolli的猜想。证明从Keel-Manin-Getzler递推开始,但其主要新思想是Poincaré多项式的双变量变形$F_m(y,t)$。这种变形揭示了单变量递推中不可见的隐藏交错结构。对于固定的$t<0$,$F_m$在$y$方向上的零点集由区间$0<y<1-t$上的Sturm-Rolle论证控制。原始多项式在切片$y=1$上恢复,移动根通过该切片的有序交叉同时给出了实根性和严格交错。因此,$\overline{\mathcal M}_{0,n}$的Betti数构成一个超对数凹序列。 我们进一步证明了Fulton-MacPherson空间$\mathbb{P}^1[n]$(复射影线退化中$n$个有序点)的Poincaré多项式的实根性和超对数凹性。 $\overline{\mathcal M}_{0,n}$的证明是通过与Co-Mathematician(Google DeepMind开发的智能体前沿模型系统)的迭代AI辅助工作流程获得的。人类的角色是提出问题、评估连续尝试、请求修复漏洞、将逐步发展的论证与文献进行比较,并组装最终可人工验证的证明。我们额外的人类贡献是观察到类似的残差变形策略适用于Fulton-MacPherson空间$\mathbb P^1[n]$,从而得到相应的实根性定理。

英文摘要

We prove real-rootedness for the Poincaré polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \] of the Deligne--Mumford moduli space $\overline{\mathcal M}_{0,n}$ of stable $n$-pointed rational curves, proving a conjecture of Aluffi--Chen--Marcolli. The proof starts from the Keel--Manin--Getzler recurrence, but its main new idea is a bivariate deformation $F_m(y,t)$ of the Poincaré polynomial. This deformation reveals a hidden interlacing structure not visible in the one-variable recurrence. For fixed $t<0$, the zero set of $F_m$ in the $y$-direction is controlled by a Sturm--Rolle argument on the interval $0<y<1-t$. The original polynomial is recovered on the slice $y=1$, and the ordered crossings of the moving roots through this slice give both real-rootedness and strict interlacing. Consequently, the Betti numbers of $\overline{\mathcal M}_{0,n}$ form an ultra-log-concave sequence. We further prove real-rootedness and ultra-log-concavity for the Poincaré polynomial of the Fulton--MacPherson space $\mathbb{P}^1[n]$ of $n$ ordered points in degenerations of the complex projective line. The proof for $\overline{\mathcal M}_{0,n}$ was obtained through an iterative AI-assisted workflow with Co-Mathematician, an agentic frontier-model system developed by Google DeepMind. Our role was to formulate the problem, evaluate the proposed proof attempts, identify gaps and request corrections, compare the developing argument with the literature, and refine the presentation of the final proof. Our additional human contribution was to observe that a similar residual deformation strategy applies to the Fulton--MacPherson spaces $\mathbb P^1[n]$, yielding the corresponding real-rootedness theorem.
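
The stated consequence, that the Betti numbers form an ultra-log-concave sequence, can be spot-checked for small $n$. The sketch below hard-codes the even Betti numbers of $\overline{\mathcal M}_{0,n}$ for $n = 4, \dots, 7$ as commonly tabulated (an assumption of this example, not data taken from the paper) and tests the defining inequality $\left(a_k / \binom{d}{k}\right)^2 \ge \left(a_{k-1} / \binom{d}{k-1}\right)\left(a_{k+1} / \binom{d}{k+1}\right)$ in exact arithmetic. It verifies only this consequence for a few cases and says nothing about the proof itself.

```python
from fractions import Fraction
from math import comb

# Coefficients of P_n(t), i.e. dim H^{2i} of the moduli space, for small n.
# Assumed values from standard tables; they sum to the known Euler
# characteristics 2, 7, 34, 213.
BETTI = {
    4: [1, 1],
    5: [1, 5, 1],
    6: [1, 16, 16, 1],
    7: [1, 42, 127, 42, 1],
}

def is_ultra_log_concave(coeffs):
    """Check that a_k / C(d, k) is log-concave, where d = len(coeffs) - 1."""
    d = len(coeffs) - 1
    b = [Fraction(a, comb(d, k)) for k, a in enumerate(coeffs)]
    return all(b[k] ** 2 >= b[k - 1] * b[k + 1] for k in range(1, d))

results = {n: is_ultra_log_concave(c) for n, c in BETTI.items()}
print(results)
```

By Newton's inequalities, any real-rooted polynomial with nonnegative coefficients has ultra-log-concave coefficients, so this check is a necessary condition for real-rootedness rather than a substitute for it.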

2605.31514 2026-06-12 cs.CL cs.AI cs.CY replaced

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

如果LLM具有类人属性,那么《帝国时代II》也具有

Adrian de Wynter

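AI summary By training a simple neural network on Age of Empires II, the paper argues that the anthropomorphic attributes claimed for LLMs are empirically non-unique, and proposes setting up experiments under an assumption of LLM non-uniqueness rather than of anthropomorphic attributes.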

Comments Fixed corollary 1, added stat sig

详情
AI中文摘要

关于大型语言模型(LLM)和基于LLM的智能体工作流已有大量研究。然而,该领域的许多工作声称、赋予或假设它们具有普遍化的拟人属性(例如道德或对自然语言的理解)。我们的目标不是支持或反对这些属性的存在,而是指出这些结论可能不正确。为此,我们在电子游戏《帝国时代II》上构建并训练了一个简单的神经网络,并注意到任何处于足够强大基底(如乐高或大波士顿地区)中的实体也可能呈现此类属性。因此,LLM声称的拟人属性在经验上非唯一:尽管某些属性(例如对提示的响应)可能保持不变,但其他属性(如对其感知行为的解释)可能随基底改变。因此,任何基于经验的讨论都需要明确的测量标准;否则解释就留给了表征。然后我们表明,假设这些属性在系统中存在或不存在,独立于基底并以普遍化方式,会导致循环或无信息的结论,无论实验者对该主题的观点如何。最后,我们提出一个“零”假设,即假设LLM非独特性而非拟人属性来设置实验,并给出示例。我们还讨论了对我们工作的潜在反对意见,简要调查了该领域,并证明了《帝国时代II》是功能完备和图灵完备的。

英文摘要

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

2605.01727 2026-06-12 cs.AI cs.CY replaced

Are LLMs More Skeptical of Entertainment News?

LLM是否对娱乐新闻更持怀疑态度?

Huiqian Lai

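AI summary Tests whether zero-shot LLMs misclassify legitimate entertainment news as fake more often than legitimate hard news in credibility assessment, finds that the effect differs across models, and probes its causes with style-swap and prompt-mitigation experiments.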

Comments Accepted at the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), co-located with ICWSM 2026, May 26, 2026, Los Angeles, CA, USA

详情
Journal ref
Proceedings of the ICWSM Workshops, MisD 2026: The 2nd Workshop on Misinformation Detection in the Era of LLMs, 2026
AI中文摘要

大型语言模型(LLMs)越来越多地被用于自动新闻可信度评估,但目前尚不清楚它们是否对不同新闻体裁采用统一标准。我们使用FakeNewsNet中的GossipCop数据集,通过数据集内设计,检验零样本LLM是否更倾向于将合法的娱乐新闻误分类为假新闻,而非合法的硬新闻。在四个前沿模型中,我们发现了清晰但模型特定的体裁不对称性:DeepSeek-V3.2和GPT-5.2的假阳性率差距分别为10.1和8.8个百分点(两者p < .001),而Claude Opus 4.6和Gemini 3 Flash则没有表现出显著差异。风格交换实验仅产生有限且不一致的变化,表明这种不对称性不能仅归结于风格语域。基于提示的缓解措施同样可能但并非通用:将模型设定为娱乐新闻事实核查员可使DeepSeek-V3.2的假阳性减少约50%,且未检测到召回率损失,但对GPT-5.2的改进甚微。探索性定性编码进一步揭示了采样假阳性中两种反复出现的错误模式:将私人生活主张视为本质上不可验证,以及将娱乐新闻视为认识论上较弱的体裁。综合来看,这些发现表明,总体性能指标可能掩盖合法新闻中的结构性假阳性。我们认为,基于LLM的可信度评估不仅可能评估真实性主张,还可能差异性地识别新闻体裁的合法性,因此评估应包含按体裁分层的假阳性分析以及总体准确率。

英文摘要

Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both $p < .001$), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
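
The genre-stratified false-positive analysis recommended above amounts to computing the false-positive rate separately for each genre, using only legitimate articles in the denominator. The sketch below shows that computation on made-up records; the record layout, the function name, and every data point are hypothetical and do not come from the paper or from GossipCop.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Per-genre false-positive rate.

    records: iterable of (genre, is_fake, predicted_fake) tuples. Only
    legitimate articles (is_fake == False) enter the computation.
    """
    flagged = defaultdict(int)
    legitimate = defaultdict(int)
    for genre, is_fake, predicted_fake in records:
        if not is_fake:
            legitimate[genre] += 1
            if predicted_fake:
                flagged[genre] += 1
    return {genre: flagged[genre] / legitimate[genre] for genre in legitimate}

# Made-up toy data: (genre, ground truth is fake, model predicted fake).
records = [
    ("entertainment", False, True),
    ("entertainment", False, True),
    ("entertainment", False, False),
    ("entertainment", False, False),
    ("entertainment", True, True),
    ("hard", False, True),
    ("hard", False, False),
    ("hard", False, False),
    ("hard", False, False),
    ("hard", True, False),
]

rates = false_positive_rates(records)
gap_points = 100 * (rates["entertainment"] - rates["hard"])
print(rates, f"gap = {gap_points:.1f} percentage points")
```

Overall accuracy on this toy data is 60% and says nothing about where the errors fall, whereas the per-genre rates (0.50 versus 0.25) make the asymmetry visible.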

2604.24449 2026-06-12 cs.RO cs.AI cs.LG replaced

SPLIT: Separating Physical-Contact via Latent Arithmetic in Image-Based Tactile Sensors

SPLIT:通过潜在算术分离物理接触以实现基于图像的触觉传感器

Wadhah Zai El Amri, Nicolás Navarro-Guerrero

发表机构 * Leibniz Universität Hannover, L3S Research Center(莱布尼茨汉诺威大学,L3S研究所)

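AI summary Proposes SPLIT, which uses latent-space arithmetic to separate contact geometry from sensor-specific optical properties for fast simulation of image-based tactile sensors, with support for transfer across sensors and bidirectional simulation.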

Comments Accepted to Elsevier Robotics and Autonomous Systems Journal

详情
AI中文摘要

训练机器人触觉感知的机器学习模型需要大量数据,但获取真实交互数据因物理复杂性和变异性而具有挑战性。模拟触觉传感器是加速进展的关键步骤。本文提出了SPLIT,一种新的基于图像的触觉传感器模拟方法,重点在于DIGIT传感器。我们的方法核心是一种潜在空间算术策略,明确分离接触几何与传感器特定的光学属性。与需要重新校准的现有方法不同,这种分离使SPLIT能够适应多样化的DIGIT背景,甚至在不完全重训练的情况下将数据转移到不同的传感器如GelSight R1.5。此外,我们的方法在推理速度上优于现有替代方案。我们还提供了一种校准的有限元方法(FEM)软体网格模拟,具有可变分辨率,提供速度与保真度之间的可调权衡。此外,我们的算法支持双向模拟,允许从变形网格生成逼真图像以及从触觉图像重建网格。这种多功能性使SPLIT成为加速机器人触觉感知研究进展的重要工具。

英文摘要

Training machine learning models for robotic tactile sensing requires vast amounts of data, yet obtaining realistic interaction data remains a challenge due to physical complexity and variability. Simulating tactile sensors is thus a crucial step in accelerating progress. This paper presents SPLIT, a novel method for simulating image-based tactile sensors, with a primary focus on the DIGIT sensor. Central to our approach is a latent space arithmetic strategy that explicitly disentangles contact geometry from sensor-specific optical properties. Unlike methods that require recalibration for every new unit, this disentanglement allows SPLIT to adapt to diverse DIGIT backgrounds and even transfer data to distinct sensors like the GelSight R1.5 without full model retraining. Beyond this adaptability, our approach achieves faster inference speeds than existing alternatives. Furthermore, we provide a calibrated finite element method (FEM) soft-body mesh simulation with variable resolution, offering a tunable trade-off between speed and fidelity. Additionally, our algorithm supports bidirectional simulation, allowing for both the generation of realistic images from deformation meshes and the reconstruction of meshes from tactile images. This versatility makes SPLIT a valuable tool for accelerating progress in robotic tactile sensing research.

2604.15372 2026-06-12 cs.CR cs.AI cs.MM replaced

The Synthetic Media Shift: Tracking the Rise, Virality, and Detectability of AI-Generated Multimodal Misinformation

合成媒体的演变:跟踪AI生成多模态虚假信息的兴起、传播与可检测性

Zacharias Chrysidis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos

发表机构 * Centre for Research and Technology Hellas(希腊研究与技术中心)

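AI summary Presents the CONVEX dataset to study the virality and consensus dynamics of multimodal misinformation, finding that AI-generated content goes disproportionately viral but spreads mainly through passive engagement, and that detection performance declines as generative models evolve.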

详情
AI中文摘要

随着生成式AI的发展,真实与合成媒体的界限日益模糊,挑战在线信息的完整性。本文介绍了CONVEX,一个涵盖误标、编辑和AI生成视觉内容的大规模多模态虚假信息数据集,包含来自X的Community Notes的超过15万条多模态帖子及其关联的注释和互动指标。我们分析了多模态虚假信息在传播性、互动性和共识动态方面的演变,重点关注合成媒体。结果表明,尽管AI生成内容获得了不成比例的高传播度,但其传播主要由被动互动驱动而非主动讨论。尽管初始报告较慢,AI生成内容一旦被标记,能更快达成社区共识。此外,我们对专门检测器和视觉-语言模型的评估显示,随着生成模型的发展,它们区分合成与真实图像的性能持续下降。这些发现突显了在快速演变的数字信息环境中持续监控和适应性策略的必要性。

英文摘要

As generative AI advances, the distinction between authentic and synthetic media is increasingly blurred, challenging the integrity of online information. In this study, we present CONVEX, a large-scale dataset of multimodal misinformation involving miscaptioned, edited, and AI-generated visual content, comprising over 150K multimodal posts with associated notes and engagement metrics from X's Community Notes. We analyze how multimodal misinformation evolves in terms of virality, engagement, and consensus dynamics, with a focus on synthetic media. Our results show that while AI-generated content achieves disproportionate virality, its spread is driven primarily by passive engagement rather than active discourse. Despite slower initial reporting, AI-generated content reaches community consensus more quickly once flagged. Moreover, our evaluation of specialized detectors and vision-language models reveals a consistent decline in performance over time in distinguishing synthetic from authentic images as generative models evolve. These findings highlight the need for continuous monitoring and adaptive strategies in the rapidly evolving digital information environment.

2601.02149 2026-06-12 cond-mat.mes-hall cond-mat.dis-nn cs.AI replaced

AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes

基于人工智能的量子点哈密顿量调优以实现马约拉纳模式

Mateusz Krawczyk, Jarosław Pawłowski

发表机构 * Institute of Theoretical Physics, Wrocław University of Science and Technology(弗罗茨瓦夫科技大学理论物理研究所)

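AI summary Proposes a neural-network model that learns the working regimes of quantum dot simulators and uses transport measurements to autotune devices toward Majorana modes. The model is trained in an unsupervised manner on synthetic conductance maps with a physics-informed loss that incorporates key properties of Majorana zero modes.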

Comments 12 pages, 8 figures, 2 tables

详情
Journal ref
Phys. Rev. Applied 25, 064032 (2026)
AI中文摘要

我们提出了一种基于神经网络的模型,能够学习量子点模拟器广泛的工作区域,并利用此知识通过输运测量自动调优这些设备,以在结构中获得马约拉纳模式。模型在无监督条件下训练于电导图形式的合成数据,采用融合马约拉纳零模关键性质的物理引导损失函数。我们展示了通过适当训练,深度视觉变换器网络可以高效记忆哈密顿量参数与电导图上结构之间的关系,并利用此关系为量子点链提出参数更新,驱动系统进入拓扑相。从参数空间中范围广泛的初始失谐出发,单步更新足以生成非平凡零模。此外,通过启用迭代调优过程(系统在每一步获得更新的电导图),我们证明该方法可以处理参数空间中大得多的区域。

英文摘要

We propose a neural network-based model capable of learning the broad landscape of working regimes in quantum dot simulators and of using this knowledge to autotune these devices, based on transport measurements, toward obtaining Majorana modes in the structure. The model is trained in an unsupervised manner on synthetic data in the form of conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes. We show that, with appropriate training, a deep vision-transformer network can efficiently memorize the relation between Hamiltonian parameters and structures on conductance maps and use it to propose parameter updates for a quantum dot chain that drive the system toward the topological phase. Starting from a broad range of initial detunings in parameter space, a single update step is sufficient to generate nontrivial zero modes. Moreover, by enabling an iterative tuning procedure, in which the system acquires updated conductance maps at each step, we demonstrate that the method can address a much larger region of the parameter space.

2511.20162 2026-06-12 cs.CV cs.AI q-bio.NC updated version

Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

无交互行动:通过接触-释放检测探测视频LMMs的物理基础

Daniel Harari, Michael Sidorov, Chen Shterental, Liel David, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

Affiliations * Weizmann Institute of Science; Mohamed bin Zayed University of Artificial Intelligence

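AI summary This study examines how deeply video LMMs ground their semantic understanding in the actual visual input, and uses contact-release detection to reveal the models' shortcomings in physical grounding.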

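Details
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 workshop on Cognitive Foundations for Multimodal Models (CogVL)
AI abstract (translated from Chinese)

Large multimodal models (LMMs) show increasingly strong performance on realistic visual tasks, such as describing objects, surroundings, and dynamic actions in video. This study examines how these models connect their semantic understanding to the actual visual input. Specifically, given sequences of hands interacting with objects, we ask the models when and where the interaction begins or ends. To this end, we introduce a first-of-its-kind large-scale dataset with more than 20,000 annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, in particular when and where objects and agents become attached ('contact') or detached ('release'). We asked state-of-the-art LMMs, including GPT, Gemini, and Qwen, to locate these events in short videos, each containing a single event. The results show that although the models reliably name target objects and identify actions, they exhibit a form of 'shortcut learning' in which semantic success masks a failure in physical grounding. Specifically, they consistently fail to identify the frame where the interaction begins or ends, and they localize the physical event within the scene poorly. This disconnect suggests that although LMMs excel at System 1 intuitive pattern recognition (naming actions and objects), they lack the System 2 cognitive foundations needed to reason about physical primitives such as 'contact' and 'release', and therefore cannot truly ground dynamic scenes in physical reality.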

English abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked SoTA LMMs, including GPT, Gemini and Qwen, to locate these events in short videos, each with a single event. The results show that while models reliably name target objects and identify actions, they exhibit a form of 'shortcut learning' where semantic success masks a failure in physical grounding. Specifically, they consistently fail to identify the frame where the interaction begins or ends and localize the physical event within the scene poorly. This disconnect suggests that while LMMs excel at System 1 intuitive pattern recognition (naming the action and objects), they lack the System 2 cognitive foundations required to reason about physical primitives like 'contact' and 'release', and hence to truly ground dynamic scenes in physical reality.

2603.26705 2026-06-12 q-bio.BM cs.AI cs.LG updated version

PI-Mamba: Linear-Time Protein Backbone Generation via Spectrally Initialized Flow Matching

PI-Mamba:通过谱初始化流匹配实现线性时间的蛋白质主链生成

Tianyu Wu, Lin Zhu

Affiliations * Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign; School of Information Science, University of Illinois Urbana-Champaign

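AI summary PI-Mamba combines spectral initialization with a flow-matching framework to achieve linear-time inference while guaranteeing exact local covalent geometry, yielding efficient, high-fidelity backbone generation.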

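Details
Journal ref
Bioinformatics (2026)
AI abstract (translated from Chinese)

Motivation: Generative models for protein backbone design must simultaneously ensure geometric validity, sampling efficiency, and scalability to long sequences. However, most existing methods rely on iterative refinement, quadratic attention mechanisms, or post-hoc geometry correction, leading to a persistent trade-off between computational efficiency and structural fidelity. Results: We present Physics-Informed Mamba (PI-Mamba), a generative model that ensures exact local covalent geometry by construction while achieving linear-time inference. PI-Mamba integrates a differentiable constraint-enforcement operator into a flow-matching framework and couples it with a Mamba-based state-space architecture. To improve optimization stability and backbone realism, we introduce a spectral initialization derived from the Rouse polymer model and an auxiliary cis-proline awareness head. Across benchmark tasks, PI-Mamba achieves a 0.0% local geometry violation rate and high designability (scTM = $0.91\pm 0.03$, n = 100), and scales to proteins exceeding 2,000 residues on a single A5000 GPU (24 GB).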

English abstract

Motivation: Generative models for protein backbone design have to simultaneously ensure geometric validity, sampling efficiency, and scalability to long sequences. However, most existing approaches rely on iterative refinement, quadratic attention mechanisms, or post-hoc geometry correction, leading to a persistent trade-off between computational efficiency and structural fidelity. Results: We present Physics-Informed Mamba (PI-Mamba), a generative model that enforces exact local covalent geometry by construction while enabling linear-time inference. PI-Mamba integrates a differentiable constraint-enforcement operator into a flow-matching framework and couples it with a Mamba-based state-space architecture. To improve optimisation stability and backbone realism, we introduce a spectral initialization derived from the Rouse polymer model and an auxiliary cis-proline awareness head. Across benchmark tasks, PI-Mamba achieves 0.0% local geometry violations and high designability (scTM = $0.91\pm 0.03$, n = 100), while scaling to proteins exceeding 2,000 residues on a single A5000 GPU (24 GB).
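
The abstract does not spell out how the spectral initialization is constructed. As a rough illustration only, the sketch below shows one standard way to draw a starting chain from the normal modes of a Rouse (bead-spring) polymer: each cosine mode gets a Gaussian amplitude whose variance is inversely proportional to its eigenvalue, so slow, large-scale modes dominate. The function name `rouse_initial_chain`, the `rms_bond` parameter, and its 3.8 default are assumptions made for this example, not details taken from the paper, and PI-Mamba's actual initialization may differ.

```python
import math
import random

def rouse_initial_chain(num_residues, rms_bond=3.8, seed=0):
    """Sample a 3D Gaussian (Rouse) chain by drawing its normal-mode amplitudes.

    rms_bond is the target root-mean-square distance between neighbouring
    beads; 3.8 (angstroms) is an illustrative default, not a value from the paper.
    """
    rng = random.Random(seed)
    n = num_residues
    sigma = rms_bond / math.sqrt(3.0)  # per-axis standard deviation of one bond
    coords = [[0.0, 0.0, 0.0] for _ in range(n)]
    # Mode p = 0 (rigid translation) is skipped, so the chain is centred at the origin.
    for p in range(1, n):
        # Eigenvalue of the free-end chain Laplacian for mode p.
        eigenvalue = 4.0 * math.sin(p * math.pi / (2.0 * n)) ** 2
        std = sigma / math.sqrt(eigenvalue)  # slow modes (small p) get large amplitudes
        amplitude = [rng.gauss(0.0, std) for _ in range(3)]
        for i in range(n):
            # Orthonormal cosine basis vector of mode p evaluated at bead i.
            basis = math.sqrt(2.0 / n) * math.cos(p * math.pi * (i + 0.5) / n)
            for axis in range(3):
                coords[i][axis] += amplitude[axis] * basis
    return coords

chain = rouse_initial_chain(num_residues=50)
centroid = [sum(bead[axis] for bead in chain) / len(chain) for axis in range(3)]
```

This direct double loop costs quadratic time in the chain length; it is written for readability, and a fast cosine transform would be the natural replacement for long chains.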

2602.18072 2026-06-12 cs.AR cs.AI updated version

HiAER-Spike Software-Hardware Reconfigurable Platform for Event-Driven Neuromorphic Computing at Scale

HiAER-Spike软件-硬件可重构平台:大规模事件驱动神经形态计算

Gwenevere Frank, Gopabandhu Hota, Keli Wang, Christopher Deng, Krish Arora, Diana Vins, Abhinav Uppal, Omowuyi Olajide, Kenneth Yoshimoto, Qingbo Wang, Mari Yamaoka, Johannes Leugering, Stephen Deiss, Leif Gibb, Gert Cauwenberghs

Affiliations * Institute for Neural Computation, UC San Diego; Fujitsu; Forschungszentrum Jülich; Qernel AI

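AI summary The HiAER-Spike platform executes large spiking neural networks with up to 160 million neurons and 40 billion synapses, uses a modular, reconfigurable architecture for efficient event-driven computing, and provides a Python interface that simplifies network configuration and execution.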

Comments Leif Gibb, Gert Cauwenberghs are equal authors. arXiv admin note: substantial text overlap with arXiv:2504.03671

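Details
Journal ref
npj Unconventional Computing (2026)
AI abstract (translated from Chinese)

This paper presents HiAER-Spike, a modular, reconfigurable, event-driven neuromorphic computing platform that can execute large spiking neural networks with up to 160 million neurons and 40 billion synapses. Its architecture is optimized for run-time massively parallel processing and hierarchical address-event routing (HiAER), handles sparse connectivity and sparse activity efficiently, and is suitable for edge and cloud computing. The system provides a Python interface that hides hardware details and simplifies the configuration and execution of general spiking neural networks. The platform is open to the community through a web portal, and its event-driven vision capabilities are demonstrated on CIFAR-10, DVS event-based gesture, MNIST, and Pong tasks.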

English abstract

In this work, we present HiAER-Spike, a modular, reconfigurable, event-driven neuromorphic computing platform designed to execute large spiking neural networks with up to 160 million neurons and 40 billion synapses (roughly twice the neurons of a mouse brain) at faster than real time. This system, assembled at the UC San Diego Supercomputer Center, comprises a co-designed hard- and software stack that is optimized for run-time massively parallel processing and hierarchical address-event routing (HiAER) of spikes while promoting memory-efficient network storage and execution. The architecture efficiently handles both sparse connectivity and sparse activity for robust and low-latency event-driven inference for both edge and cloud computing. A Python programming interface to HiAER-Spike, agnostic to hardware-level detail, shields the user from complexity in the configuration and execution of general spiking neural networks with minimal constraints in topology. The system is made easily available over a web portal for use by the wider community. In the following, we provide an overview of the hard- and software stack, explain the underlying design principles, demonstrate some of the system's capabilities and solicit feedback from the broader neuromorphic community. Examples are shown demonstrating HiAER-Spike's capabilities for event-driven vision on benchmark CIFAR-10, DVS event-based gesture, MNIST, and Pong tasks.

2510.03699 2026-06-12 q-bio.NC cs.AI cs.LG cs.NE cs.SY eess.SY updated version

Dissecting Larval Zebrafish Hunting using Deep Reinforcement Learning Trained RNN Agents

解析斑马鱼幼体捕食行为的深度强化学习训练RNN代理

Raaghav Malik, Satpreet H. Singh, Sonja Johnson-Yu, Nathan Wu, Roy Harpaz, Florian Engert, Kanaka Rajan

Affiliations * California Institute of Technology; Harvard University

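AI summary This paper studies larval zebrafish hunting with RNN agents trained by deep reinforcement learning, showing how ecological and energetic constraints shape adaptive behavior. A simple model reproduces real hunting behavior, and virtual experiments test how constraints and environment affect hunting dynamics.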

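Details
Journal ref
Proceedings of the 9th Conference on Cognitive Computational Neuroscience (2026)
AI abstract (translated from Chinese)

Larval zebrafish hunting provides a tractable setting for studying how ecological and energetic constraints shape adaptive behavior in biological brains and artificial agents. Here we develop a minimal agent-based model, training recurrent policies with deep reinforcement learning in a bout-based zebrafish simulator. Despite its simplicity, the model reproduces hallmark hunting behaviors, including eye vergence-linked pursuit, speed modulation, and stereotyped approach trajectories, which closely match real larval zebrafish. Quantitative trajectory analyses show that pursuit bouts systematically reduce the prey angle by roughly half before the strike, consistent with measurements. Virtual experiments and parameter sweeps vary ecological and energetic constraints, bout kinematics (coupled vs. uncoupled turning and forward motion), and environmental factors such as food density, food speed, and vergence limits. These manipulations reveal how constraints and environments shape pursuit dynamics, strike success, and abort rates, yielding testable predictions for neuroscience experiments. The sweeps identify a compact set of constraints (binocular sensing, the coupling of forward speed and turning in bout kinematics, and modest energetic costs on locomotion and vergence) that are sufficient for zebrafish-like hunting to emerge. Strikingly, these behaviors arise in minimal agents without detailed biomechanics, fluid dynamics, circuit realism, or imitation learning from real zebrafish data. Overall, this work provides a normative account of zebrafish hunting as the optimal balance between energetic cost and sensory benefit, highlighting the trade-offs in vergence and trajectory dynamics. We establish a virtual lab that narrows the experimental search space and generates testable predictions about behavior and neural coding.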

English abstract

Larval zebrafish hunting provides a tractable setting to study how ecological and energetic constraints shape adaptive behavior in both biological brains and artificial agents. Here we develop a minimal agent-based model, training recurrent policies with deep reinforcement learning in a bout-based zebrafish simulator. Despite its simplicity, the model reproduces hallmark hunting behaviors -- including eye vergence-linked pursuit, speed modulation, and stereotyped approach trajectories -- that closely match real larval zebrafish. Quantitative trajectory analyses show that pursuit bouts systematically reduce prey angle by roughly half before strike, consistent with measurements. Virtual experiments and parameter sweeps vary ecological and energetic constraints, bout kinematics (coupled vs. uncoupled turns and forward motion), and environmental factors such as food density, food speed, and vergence limits. These manipulations reveal how constraints and environments shape pursuit dynamics, strike success, and abort rates, yielding falsifiable predictions for neuroscience experiments. These sweeps identify a compact set of constraints -- binocular sensing, the coupling of forward speed and turning in bout kinematics, and modest energetic costs on locomotion and vergence -- that are sufficient for zebrafish-like hunting to emerge. Strikingly, these behaviors arise in minimal agents without detailed biomechanics, fluid dynamics, circuit realism, or imitation learning from real zebrafish data. Taken together, this work provides a normative account of zebrafish hunting as the optimal balance between energetic cost and sensory benefit, highlighting the trade-offs that structure vergence and trajectory dynamics. We establish a virtual lab that narrows the experimental search space and generates falsifiable predictions about behavior and neural coding.

2508.19273 2026-06-12 cs.CR cs.AI updated version

MixGAN: A Hybrid Semi-Supervised and Generative Approach for DDoS Detection in Cloud-Integrated IoT Networks

MixGAN:一种混合半监督和生成方法用于云集成物联网网络中的DDoS检测

Tongxi Wu, Chenwei Xu, Jin Yang

Affiliations * College of Cyber Science and Engineering, Sichuan University; College of Information Science and Technology, Tibet University

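AI summary This paper proposes MixGAN, which combines conditional generation, semi-supervised learning, and robust feature extraction to address complex traffic dynamics, class imbalance, and data scarcity in DDoS detection for cloud-integrated IoT networks. Experiments show that it outperforms existing methods in accuracy, TPR, and TNR.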

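Details
Journal ref
ECAI 2025, 28th European Conference on Artificial Intelligence
AI abstract (translated from Chinese)

This paper proposes MixGAN, a hybrid method that combines conditional generation, semi-supervised learning, and robust feature extraction for DDoS detection in cloud-integrated IoT networks. As cloud-integrated IoT systems proliferate, the threat of DDoS attacks has grown because of the expanded attack surface, heterogeneous device behaviors, and limited edge protection. However, DDoS detection in this setting still faces complex traffic dynamics, severe class imbalance, and scarce labeled data. Although recent methods have explored solutions to class imbalance, many still struggle to generalize under limited supervision and dynamic traffic conditions. To overcome these challenges, we propose MixGAN, a hybrid detection method that integrates conditional generation, semi-supervised learning, and robust feature extraction. Specifically, to handle complex temporal traffic patterns, we design a 1-D WideResNet backbone composed of temporal convolutional layers with residual connections, which effectively captures local burst patterns in traffic sequences. To alleviate class imbalance and label scarcity, we use a pretrained CTGAN to generate synthetic minority-class (DDoS attack) samples that complement the unlabeled data. In addition, to mitigate the effect of noisy pseudo-labels, we introduce a MixUp-Average-Sharpen (MAS) strategy that constructs smoothed and sharpened targets by averaging predictions over augmented views and reweighting them toward high-confidence classes. Experiments on the NSL-KDD, BoT-IoT, and CICIoT2023 datasets show that MixGAN achieves up to 2.5% higher accuracy and a 4% improvement in both TPR and TNR compared with existing methods, confirming its robustness in large-scale IoT-cloud environments. The source code is publicly available at https://github.com/0xCavaliers/MixGAN.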

English abstract

The proliferation of cloud-integrated IoT systems has intensified exposure to Distributed Denial of Service (DDoS) attacks due to the expanded attack surface, heterogeneous device behaviors, and limited edge protection. However, DDoS detection in this context remains challenging because of complex traffic dynamics, severe class imbalance, and scarce labeled data. While recent methods have explored solutions to address class imbalance, many still struggle to generalize under limited supervision and dynamic traffic conditions. To overcome these challenges, we propose MixGAN, a hybrid detection method that integrates conditional generation, semi-supervised learning, and robust feature extraction. Specifically, to handle complex temporal traffic patterns, we design a 1-D WideResNet backbone composed of temporal convolutional layers with residual connections, which effectively capture local burst patterns in traffic sequences. To alleviate class imbalance and label scarcity, we use a pretrained CTGAN to generate synthetic minority-class (DDoS attack) samples that complement unlabeled data. Furthermore, to mitigate the effect of noisy pseudo-labels, we introduce a MixUp-Average-Sharpen (MAS) strategy that constructs smoothed and sharpened targets by averaging predictions over augmented views and reweighting them towards high-confidence classes. Experiments on NSL-KDD, BoT-IoT, and CICIoT2023 demonstrate that MixGAN achieves up to 2.5% higher accuracy and 4% improvement in both TPR and TNR compared to state-of-the-art methods, confirming its robustness in large-scale IoT-cloud environments. The source code is publicly available at https://github.com/0xCavaliers/MixGAN.
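
The MixUp-Average-Sharpen idea described above can be pictured with a small pure-Python sketch. It assumes the common formulation in which predictions are averaged over augmented views, sharpened with a temperature, and then blended with MixUp; the function names, the temperature of 0.5, and the toy numbers are illustrative assumptions, and the authors' repository remains the reference for how MixGAN actually implements MAS.

```python
def average_views(view_probs):
    """Average class probabilities over the augmented views of one sample."""
    num_views = len(view_probs)
    num_classes = len(view_probs[0])
    return [sum(view[c] for view in view_probs) / num_views
            for c in range(num_classes)]

def sharpen(probs, temperature=0.5):
    """Push probability mass toward the most confident class, then renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two samples and their soft targets with the same mixing weight."""
    lam = max(lam, 1.0 - lam)  # keep the mix closer to the first sample
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

# Two augmented views of one unlabeled flow; classes are [benign, attack].
views = [[0.6, 0.4], [0.8, 0.2]]
averaged = average_views(views)
pseudo_label = sharpen(averaged, temperature=0.5)

# Mix the pseudo-labeled sample with a labeled attack sample (toy features).
x_mix, y_mix = mixup([1.0, 0.0], pseudo_label, [0.0, 1.0], [0.0, 1.0], lam=0.3)
```

With a temperature below 1, the sharpened pseudo-label is more confident than the plain average, which is the property the abstract relies on to reduce the influence of noisy pseudo-labels.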

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG updated version

A Survey of Deep Learning for Geometry Problem Solving

深度学习在几何问题求解中的应用综述

Jianzhe Ma, Wenxuan Wang, Qin Jin

Affiliations * Renmin University of China

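AI summary This paper surveys applications of deep learning to geometry problem solving, covering the relevant tasks, methods, evaluation metrics, and future directions, and aims to provide a practical reference that advances the field.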

Comments ACL 2026 Main Conference

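Details
AI abstract (translated from Chinese)

Geometry problem solving, an important part of mathematical reasoning, plays a key role in education, in assessing AI's mathematical abilities, and in multimodal capability evaluation. Recent deep learning technologies, especially the emergence of multimodal large language models, have significantly accelerated research in this area. This paper surveys the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the tasks related to geometry problem solving; (ii) an in-depth review of the relevant deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our goal is to provide a comprehensive and practical reference on deep learning for geometry problem solving and thereby foster further progress in the field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.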

English abstract

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2306.01690 2026-06-12 cs.LG cs.AI updated version

Context selectivity with dynamic availability enables lifelong continual learning

基于动态可用性的上下文选择性促进终身持续学习

Martin Barry, Wulfram Gerstner, Guillaume Bellec

Affiliations * Department of Life Sciences, Department of Computer Sciences

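AI summary This paper proposes a meta-plasticity rule based on context selectivity and dynamic availability; in simulations, the model achieves better transfer learning than existing continual learning algorithms on image recognition and natural language processing benchmarks.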

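Details
AI abstract (translated from Chinese)

"You never forget how to ride a bike", but how is that possible? The brain can learn complex skills, stop practicing for years, learn other skills in between, and still retrieve the original knowledge when needed. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We propose a biologically plausible meta-plasticity rule, building on classical work in continual learning, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization, we suggest that neuron selectivity and neuron-wide consolidation are a simple and viable meta-plasticity hypothesis for enabling CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than current CL algorithms on image recognition and natural language processing CL benchmarks.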

English abstract

"You never forget how to ride a bike", but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity and neuron-wide consolidation form a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.
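
The two principles named in the abstract can be pictured with a toy update rule. The sketch below is not the authors' formalization: it assumes a boolean context mask for selectivity and a per-neuron availability value in [0, 1] that scales the learning rate and shrinks after a task in proportion to how relevant the neuron was. All function names, the decay rate, and the numbers are hypothetical.

```python
def gated_update(weights, grads, availability, active, lr=0.1):
    """Update only context-active neurons, scaling each step by its availability."""
    return [w - lr * a * g if on else w
            for w, g, a, on in zip(weights, grads, availability, active)]

def consolidate(availability, relevance, rate=0.9):
    """Lower the availability of neurons that were relevant (0 to 1) for the finished task."""
    return [a * (1.0 - rate * r) for a, r in zip(availability, relevance)]

weights = [1.0, 1.0, 1.0]
availability = [1.0, 1.0, 1.0]

# Task A: neurons 0 and 1 are selective for this context; neuron 2 stays silent.
weights = gated_update(weights, [1.0, 1.0, 1.0], availability,
                       active=[True, True, False])
# Only neuron 0 turned out to matter for task A, so it becomes mostly frozen.
availability = consolidate(availability, relevance=[1.0, 0.0, 0.0])

# Task B: neurons 0 and 2 are active, but neuron 0 now changes far less.
weights_b = gated_update(weights, [1.0, 1.0, 1.0], availability,
                         active=[True, False, True])
```

In this toy run the consolidated neuron moves by about a tenth of the step taken by the fresh neuron during task B, which is the partial freezing that the second principle describes.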