arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.06627 2026-06-08 cs.RO cs.AI cs.CV cs.LG New submission

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

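Affiliations: Massachusetts Institute of Technology; University of Illinois Urbana-Champaign; Harvard University

AI summary: Studies how hand pose quality and the motion gap affect transfer when cotraining robot manipulation policies on everyday Internet video, and proposes a cotraining recipe that raises absolute success rate by 29.7% across six manipulation tasks in the low-robot-data regime.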

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

English abstract

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of 29.7% in the low-robot-data regime across six manipulation tasks.

2606.06618 2026-06-08 cs.RO cs.AI cs.LG New submission

ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition

Jungmin Seo, Jaesik Park

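Affiliations: Seoul National University

AI summary: For long-horizon route planning from only short-horizon offline trajectories, proposes ChronoForest, which combines local bridge search with global route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator, significantly improving success rate and efficiency on OGBench and Hamiltonian route-composition benchmarks.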

Comments 40 pages, 4 figures, 7 tables, 3 algorithms

English abstract

How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.
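
The waypoint-ordering problem described above can be illustrated with the exhaustive baseline that the abstract calls expensive. The sketch below is not ChronoForest's planner: it assumes the pairwise travel costs are already known, and the anchor names and cost values are hypothetical.

```python
from itertools import permutations

def best_route(cost, start, goal, waypoints):
    """Try every waypoint order and keep the one with the lowest total cost."""
    best_order, best_cost = None, float("inf")
    for order in permutations(waypoints):
        stops = [start, *order, goal]
        # Sum the travel cost of each consecutive pair of stops.
        total = sum(cost[a][b] for a, b in zip(stops, stops[1:]))
        if total < best_cost:
            best_order, best_cost = list(order), total
    return best_order, best_cost

# Hypothetical symmetric travel costs between start S, goal G, and waypoints A, B.
cost = {
    "S": {"A": 1, "B": 4, "G": 9},
    "A": {"S": 1, "B": 2, "G": 6},
    "B": {"S": 4, "A": 2, "G": 1},
    "G": {"S": 9, "A": 6, "B": 1},
}
order, total = best_route(cost, "S", "G", ["A", "B"])
```

The number of orderings grows factorially with the number of waypoints, which is why exhaustive planning becomes costly and why unreliable long-range cost estimates matter.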

2606.06615 2026-06-08 cs.SD cs.AI cs.LG eess.AS New submission

FIGMA: Towards FIne-Grained Music retrievAl

Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha, Ramani Duraiswami

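Affiliations: University of Maryland, College Park

AI summary: To address the inability of existing music retrieval models to handle queries about fine-grained attributes, proposes FIGMA, a multi-view contrastive architecture that jointly optimizes global audio-text alignment and frame-level, token-wise alignment to capture high-level semantics and fine-grained musical attributes in a unified representation space, with substantial gains on the newly constructed fine-grained music caption dataset.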

Comments Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/

English abstract

Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. Then, we propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.

2606.06614 2026-06-08 cs.CL cs.AI cs.HC New submission

Re-Centering Humans in LLM Personalization

Lechen Zhang, Jiarui Liu, Tal August

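Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University

AI summary: Studies the performance gap in LLM personalization between synthetic and human data, collects human conversations and judgments that reveal system limitations in attribute extraction, relevant-attribute pairing, and personalized response generation, and introduces lightweight training-based interventions to narrow the gap.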

English abstract

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

2606.06601 2026-06-08 cs.CV cs.AI cs.LG New submission

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

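Affiliations: Google; Black Forest Labs

AI summary: Proposes DIRECT, a framework that decomposes appearance, geometry, and context guidance to enable object insertion with controllable 3D pose, outperforming existing methods in geometric controllability and visual quality.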

Comments ICML 2026; Project Page: https://gong1130.github.io/DIRECT/

English abstract

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

2606.06586 2026-06-08 cs.CL New submission

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O'Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp

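Affiliations: University College London; Centre for Artificial Intelligence

AI summary: Introduces the PolyFact dataset and uses GRPO reinforcement learning to improve the cross-lingual factual recall consistency of large language models, outperforming supervised fine-tuning, and shows that it promotes shared cross-lingual representations by reducing language-specific specialization.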

Comments Under Review at EMNLP 2026

English abstract

Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
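
As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes each sampled answer's reward against the other answers sampled for the same question. The reward values are hypothetical (1.0 for a correct fact, 0.0 otherwise), the abstract does not specify the paper's reward design, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled answers."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every answer gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four answers sampled for one question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that score above the group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.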

2606.06576 2026-06-08 cs.LG astro-ph.EP astro-ph.IM stat.ML New submission

Gaussian Process Latent Factor Regression for Low-Data, High-Dimensional Output Problems

Edward T. Stevenson, Eric T. Wolf, Mei Ting Mak, N. J. Mayne, Miles Cranmer

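Affiliations: University of Cambridge; University of Colorado Boulder; University of Oxford; University of Exeter

AI summary: Proposes Gaussian process latent factor regression (GPLFR), which represents outputs as a linear-Gaussian decoding of a low-dimensional latent state and jointly optimizes compression and prediction for low-data, high-dimensional-output regression, and builds the first spatially resolved emulator of global climate models for rocky exoplanets.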

Comments 9 pages content + 22 pages appendix/references. Supporting code at https://github.com/edstevenson/GPLFR

English abstract

In the sciences, regression tasks often require predicting high-dimensional outputs from few training examples. Multi-output Gaussian processes excel in low-data regimes but typically struggle with high-dimensional outputs. Compress-then-predict pipelines such as PCA-GP (principal component analysis plus Gaussian process regression) handle high dimensionality, but rely on bases optimized for reconstruction rather than prediction. To address this gap, we propose a model that represents each output as a linear-Gaussian decoding of a low-dimensional latent state drawn from a Gaussian process prior. By analytically marginalizing the decoder weights, we couple compression and prediction in a single objective that scales to high-dimensional outputs. We refer to this model as Gaussian process latent factor regression (GPLFR). We demonstrate GPLFR by building the first spatially resolved emulator of global climate models for rocky exoplanets.

2606.06574 2026-06-08 cs.LG New submission

Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

Ziyue Li, Yang Li, Tianyi Zhou

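AI summary: Proposes PoLar, in which a lightweight prediction network learns to dynamically skip or repeat pretrained layers for each input, improving accuracy on mathematical reasoning tasks with fewer layers.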

Comments Accepted at ICML 2026. Substantially extends arXiv:2507.07996. Code: https://github.com/tianyi-lab/PoLar

English abstract

Large language models (LLMs) perform inference through a fixed-depth, fixed-order, non-recurrent execution of all layers. We reveal the widespread existence of training-free, flexible, dynamic program-of-layers (PoLar), where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
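
The idea of a per-input layer program can be sketched as a list of layer indices, where a missing index means the layer is skipped and a repeated index means it is looped. This is a toy illustration only: the three lambda "layers" are hypothetical stand-ins for pretrained layers, and the programs are written by hand rather than produced by a prediction network.

```python
def run_program(layers, program, x):
    """Apply layers in the order listed in program.

    An index left out of program skips that layer; an index that
    appears more than once loops it.
    """
    for idx in program:
        x = layers[idx](x)
    return x

# Toy stand-ins for three pretrained layers (hypothetical).
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

standard = run_program(layers, [0, 1, 2], 5)    # every layer once, in order
skipped = run_program(layers, [0, 2], 5)        # layer 1 skipped
looped = run_program(layers, [0, 1, 1, 2], 5)   # layer 1 executed twice
```

The standard forward pass is the special case where the program is simply `[0, 1, ..., n-1]`.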

2606.06569 2026-06-08 cs.RO New submission

PhyRoGen: Synthetic Generation of Physical Robot Manipulation Puzzles Using Procedural Content Generation

Lennart Julian Droß, Andreas Orthey, Marc Toussaint

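Affiliations: Technical University of Berlin; Robotics Institute Germany

AI summary: Proposes the PhyRoGen framework, which uses procedural content generation to automatically create synthetic datasets of robot manipulation puzzles; the 24 generated puzzles can each be solved in 1 to 300 seconds and are verified to be manipulatable in physical simulation.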

Comments 8 pages, accepted at CASE 2026

English abstract

Robot manipulation of physical puzzles is important for automatic assembly and disassembly tasks. However, to enable robots to solve physical puzzles, manipulation skills need to be learned, which requires large training datasets, the generation of which is often time consuming and tedious. To overcome this problem, we propose the Physical Robot Manipulation Puzzle Generation framework (PhyRoGen), which leverages procedural content generation (PCG) for automated generation of synthetic datasets of manipulation puzzles. PhyRoGen is a general-purpose puzzle generator, which can generate physical puzzles with interlocking object dependencies, where one articulated object must be manipulated before another can be moved. Based upon PhyRoGen, we define six concrete generators which we use to generate 24 physical puzzles. By using a benchmarking framework, we are able to solve all puzzles in 1 to 300 seconds using sampling-based planning algorithms. Finally, we demonstrate that every generated puzzle is manipulatable by using a KUKA LBR iiwa robot in a physical simulation. This shows that our framework is able to procedurally generate unique, solvable robot manipulation puzzles, which is a crucial ingredient to benchmark manipulation algorithms and to develop robust foundation models.

2606.06564 2026-06-08 cs.LG cs.AI New submission

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

Kehan Wang

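Affiliations: Chongqing University

AI summary: Proposes WAV v1, which augments residual routing by adding directional detail bases (a phase basis and a split basis) to every block, outperforming existing methods in deep Transformers and achieving lower validation loss on TinyStories and Text8 at 48 layers.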

Comments 6 pages, 4 figures, 3 tables

English abstract

Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed unit weights. Recent Attention Residuals replace this fixed accumulation with content-dependent depth-wise routing, and Block Attention Residuals make the mechanism efficient by routing over block-level residual summaries. However, a single block summary stores only the low-frequency total residual displacement inside a block, discarding directional structure such as attention-vs-MLP imbalance and early-vs-late block dynamics. We propose WAV v1, a lightweight multi-resolution residual routing method for decoder-only Transformers. Instead of representing each block only by its accumulated residual sum, WAV v1 augments every block with two directional detail bases: a phase basis that contrasts attention and MLP updates, and a split basis that contrasts early and late sublayer updates. These bases are routed together with standard block summaries through the same depth-wise softmax mixer, while negative detail-source initialization and detached RMS matching stabilize training. On character-level TinyStories and Text8 language modeling, WAV v1 shows a clear depth-dependent benefit. Although it is not consistently beneficial at 12 layers, it becomes competitive at 24 layers and outperforms all baselines at 48 layers. At 48 layers, WAV v1 reduces validation loss relative to Block AttnRes from 0.4960 to 0.4738 on TinyStories and from 0.9363 to 0.9305 on Text8, with negligible additional parameters. These results suggest that directional residual details, not only block-level sums, are important for scaling residual routing in deeper Transformers.
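
One plausible reading of the three per-block sources is sketched below. The abstract does not give exact formulas, so the definitions here are assumptions: the summary is the sum of all sublayer updates, the phase basis is attention updates minus MLP updates, and the split basis is first-half updates minus second-half updates. The mixer uses fixed logits for illustration, whereas the paper's routing is content-dependent.

```python
import math

def vec_sum(vectors, dim):
    """Element-wise sum of an iterable of vectors (zeros if it is empty)."""
    out = [0.0] * dim
    for v in vectors:
        out = [o + x for o, x in zip(out, v)]
    return out

def block_sources(updates):
    """updates: list of (kind, vector) in execution order, kind is "attn" or "mlp"."""
    dim = len(updates[0][1])
    half = len(updates) // 2
    summary = vec_sum((v for _, v in updates), dim)
    attn = vec_sum((v for k, v in updates if k == "attn"), dim)
    mlp = vec_sum((v for k, v in updates if k == "mlp"), dim)
    early = vec_sum((v for _, v in updates[:half]), dim)
    late = vec_sum((v for _, v in updates[half:]), dim)
    phase = [a - m for a, m in zip(attn, mlp)]    # attention vs. MLP contrast
    split = [e - l for e, l in zip(early, late)]  # early vs. late contrast
    return summary, phase, split

def softmax_mix(sources, logits):
    """Mix source vectors with softmax weights computed from the logits."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    weights = [e / sum(exps) for e in exps]
    dim = len(sources[0])
    return [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(dim)]

# Hypothetical sublayer updates for one block of two layers.
updates = [("attn", [1.0, 0.0]), ("mlp", [0.0, 2.0]),
           ("attn", [3.0, 0.0]), ("mlp", [0.0, 4.0])]
summary, phase, split = block_sources(updates)
mixed = softmax_mix([summary, phase, split], [0.0, 0.0, 0.0])
```

Two blocks with the same summary can still differ in their phase and split vectors, which is the directional structure the abstract says a single block summary discards.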

2606.06560 2026-06-08 cs.LG cs.AI cs.HC New submission

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

Victor Muryn, Maksym Shamrai, Sofiia Mazepa, Yehor Khodysko

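Affiliations: MacPaw

AI summary: Introduces the MacArena benchmark, with 421 tasks across 50 applications running on Apple Silicon, which reveals challenges specific to macOS for GUI agents; model rankings invert between ported and native tasks.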

Comments Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

English abstract

Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.

2606.06559 2026-06-08 cs.SD cs.AI eess.AS New submission

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu

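Affiliations: The Chinese University of Hong Kong; AudioLab Hong Kong, Huawei Leibniz Research Center; Nanyang Technological University

AI summary: Proposes the IRAF module, which predicts a frame-by-frame reliability gate to modulate the contribution of user audio to the LLM, improving response quality and interaction stability of full-duplex dialogue systems under interfering-speaker conditions.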

English abstract

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
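
A minimal sketch of the gating step for a single frame follows. It assumes a linear layer followed by a sigmoid as the gate predictor and element-wise addition as the fusion; neither is confirmed by the abstract, and all weights and embeddings are hypothetical.

```python
import math

def reliability_gate(target_spk, user_frame, weights, bias):
    """Scalar gate in (0, 1) from the concatenated embeddings (assumed linear + sigmoid)."""
    features = target_spk + user_frame  # list concatenation
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def fuse_frame(target_spk, user_frame, agent_frame, weights, bias):
    """Rescale the user frame by the gate, then add the agent frame (assumed fusion)."""
    gate = reliability_gate(target_spk, user_frame, weights, bias)
    gated_user = [gate * u for u in user_frame]
    fused = [u + a for u, a in zip(gated_user, agent_frame)]
    return fused, gate

# Hypothetical 2-dimensional embeddings and an untrained (all-zero) gate predictor.
target_spk = [0.3, -0.1]
user_frame = [2.0, 4.0]
agent_frame = [1.0, 1.0]
fused, gate = fuse_frame(target_spk, user_frame, agent_frame, [0.0] * 4, 0.0)
```

A gate near 0 suppresses a frame dominated by an interfering speaker, while a gate near 1 passes the user frame through almost unchanged.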

2606.06556 2026-06-08 cs.RO New submission

Robots Need More than VLA and World Models

Elis Karcini, Faisal Mehrban, Quang Nguyen, Mac Schwager, Arash Ajoudani, Cesar Cadena, Jan Peters, Marco Hutter, Haitham Bou-Ammar

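Affiliations: Motoniq.ai; Stanford University; Istituto Italiano di Tecnologia; ETH Zurich; Technical University of Darmstadt; UCL Centre for AI

AI summary: Argues that the key bottleneck for generalist robot intelligence is not only policy learning but also the lack of mechanisms that convert unstructured behavioural data into robot-usable supervision, and identifies four missing interface components.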

English abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.

2606.06550 2026-06-08 cs.SD cs.AI eess.AS New submission

Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition

Shuanglin Li, Ruxiao Qian, Siyang Song

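Affiliations: Xiangjiang Laboratory; University of Exeter

AI summary: To address first-order aggregation ignoring feature correlations and Riemannian geometry in self-supervised speech emotion recognition, proposes a second-order correlation layer that captures synergistic co-occurrence patterns with covariance descriptors and preserves geometric integrity through Log-Euclidean mapping, effectively recovering discriminative information on the ESD and RAVDESS datasets.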

English abstract

Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.
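
The covariance-then-logarithm pipeline can be sketched for the smallest non-trivial case. To stay dependency-free, the code below handles only 2-dimensional frame features, where the matrix logarithm of a symmetric positive definite matrix has a closed form through its eigendecomposition; real SSL features have far more dimensions and would need a general eigensolver. The frame values, the `eps` regularizer, and the function names are hypothetical.

```python
import math

def covariance_2d(frames, eps=1e-6):
    """Entries (a, b, c) of the covariance [[a, b], [b, c]] of 2-D frame features.

    eps is added to the diagonal so the matrix stays positive definite.
    """
    n = len(frames)
    mx = sum(f[0] for f in frames) / n
    my = sum(f[1] for f in frames) / n
    a = sum((f[0] - mx) ** 2 for f in frames) / n + eps
    c = sum((f[1] - my) ** 2 for f in frames) / n + eps
    b = sum((f[0] - mx) * (f[1] - my) for f in frames) / n
    return a, b, c

def log_spd_2x2(a, b, c):
    """Matrix logarithm of the SPD matrix [[a, b], [b, c]] via eigendecomposition."""
    center = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    l1 = math.log(center + radius)  # log of the larger eigenvalue
    l2 = math.log(center - radius)  # log of the smaller eigenvalue
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # angle of the leading eigenvector
    cs, sn = math.cos(theta), math.sin(theta)
    return (l1 * cs * cs + l2 * sn * sn,
            (l1 - l2) * cs * sn,
            l1 * sn * sn + l2 * cs * cs)

def log_euclidean_descriptor(frames):
    """Covariance descriptor mapped to the tangent space and flattened to a vector."""
    la, lb, lc = log_spd_2x2(*covariance_2d(frames))
    # sqrt(2) on the off-diagonal keeps the vector norm equal to the Frobenius norm.
    return [la, math.sqrt(2.0) * lb, lc]

# Hypothetical sequence of four 2-D frame features.
frames = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [3.0, 6.0]]
descriptor = log_euclidean_descriptor(frames)
```

The resulting vector lives in an ordinary Euclidean space, so it can be passed to a linear classifier, which is the property the abstract attributes to the Log-Euclidean mapping.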

2606.06547 2026-06-08 cs.LG cs.AI New submission

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

Haoyu Huang, Linlin Yang, Sheng Xu, Boyu Liu, Guodong Guo, Zhongqian Fu, Hang Zhou, Baochang Zhang

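Affiliations: FAIR

AI summary: To address frontier decisions that flip easily and are then permanently locked in and amplified during post-training quantization of diffusion large language models, proposes FAIR-Calib, a two-stage PTQ framework that estimates a position prior from frontier hits and masked-stage reliability and uses reweighted hidden-state MSE calibration to prioritize fragile frontier states; the objective is theoretically justified as a surrogate for output KL divergence, and experiments show it significantly outperforms baselines.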

Comments Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

English abstract

Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a "stability lag" where early decisions remain fragile even after being written. We reveal that Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, which are then permanently locked in and amplified. To address this, we propose Frontier-Aware Instability-Reweighted Calibration (FAIR-Calib), a two-stage PTQ framework for dLLMs. Stage I probes a full-precision teacher to estimate a position prior that combines frontier hits and masked-stage reliability. Stage II performs off-policy, layer-wise calibration by minimizing a reweighted hidden-state MSE, effectively prioritizing the protection of fragile frontier states without requiring expensive end-to-end diffusion rollouts. We further theoretically justify our weighted objective as a surrogate for output KL divergence. Empirically, FAIR-Calib consistently outperforms state-of-the-art baselines on LLaDA and Dream (W4A4), significantly reducing frontier decision flips and suppressing post-commit mismatches across diverse benchmarks.
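
The Stage II objective can be pictured as a position-weighted mean squared error between full-precision and quantized hidden states. The sketch below is an assumption-laden illustration: the weights stand in for the Stage I position prior, the normalization by the weight sum is a choice made here, and all numbers are hypothetical.

```python
def reweighted_hidden_mse(teacher_states, quant_states, position_weights):
    """Weighted average over positions of the per-position hidden-state MSE."""
    total, weight_sum = 0.0, 0.0
    for h_fp, h_q, w in zip(teacher_states, quant_states, position_weights):
        # Mean squared error over the hidden dimension at this position.
        sq_err = sum((a - b) ** 2 for a, b in zip(h_fp, h_q)) / len(h_fp)
        total += w * sq_err
        weight_sum += w
    return total / weight_sum

# Hypothetical hidden states for two positions with hidden size 2.
teacher = [[1.0, 1.0], [0.0, 0.0]]
quantized = [[0.0, 0.0], [0.0, 2.0]]

uniform_loss = reweighted_hidden_mse(teacher, quantized, [1.0, 1.0])
frontier_loss = reweighted_hidden_mse(teacher, quantized, [3.0, 1.0])
```

Raising the weight of a position makes the calibration loss more sensitive to quantization error there, which is how fragile frontier positions would be prioritized.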

2606.06546 2026-06-08 cs.LG 新提交

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

Elmes*:面向长尾教育场景的大语言模型细粒度评估量规自动构建

Tao Liu, Ye Lu, Ruohua Zhang, Siyu Song, Wentao Liu, Aimin Zhou, Hao Hao

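Affiliations * Shanghai Institute of AI for Education, East China Normal University; School of Computer Science and Technology, East China Normal University; Shanghai Innovation Institute

AI summary Proposes Elmes*, a framework that automatically constructs fine-grained, scenario-specific rubrics for evaluating the multidimensional teaching ability of large language models in educational scenarios, builds the Edu-330 benchmark, and reveals differences between models.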

详情
AI中文摘要

评估用于教育的大语言模型(LLMs)需要衡量模型如何教学,而不仅仅是它们知道什么。现有基准强调领域通用正确性或依赖手动设计的量规,这些量规难以扩展到长尾教学场景。我们引入Elmes*,一个用于构建、优化和应用细粒度场景特定量规的端到端框架。Elmes*结合了用于教师-学生-评判者交互的声明式多智能体引擎与SceneGen(一个自演化模块,从专家定义的教学维度共同优化评估标准和测试数据)。使用Elmes*,我们构建了Edu-330,涵盖11个学科、3个年级段和10种任务类型的330个场景,包含超过1000个二级指标。在Edu-330和四个专家撰写的黄金标准场景上的实验表明,教育能力是多维的:顶级LLM主要在创造力和价值观整合方面存在差异,知识强的模型可能在苏格拉底式支架教学中失败,而教育专用模型InnoSpark获得了最佳的人工评估平均分。LLM评判者保持了与人类可比的排名,但评分方差更低,但表现出评判者特定的偏见,如自我偏好。消融实验表明,专家评分的少样本锚定改善了人机对齐,而推理强制和贪婪解码依赖于模型。因此,Elmes*为基于教学法的LLM评估提供了可扩展的诊断基础设施。

英文摘要

Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes*, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher-student-judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human-LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.

2606.06539 2026-06-08 cs.CV cs.AI cs.LG cs.NE 新提交

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

合成基准高估了前向-前向扩展:真实数据对逐层训练的限制

Yucheng Chen

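Affiliations * Amplimit

AI summary Uses the DTG-FF method to evaluate how Forward-Forward learning scales on real data, finding that its gap to backpropagation widens as the class count grows, that synthetic tasks overstate its transferability, and that its memory advantage does not hold.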

Comments 23 pages, 6 figures

详情
AI中文摘要

前向-前向(FF)学习[Hinton, 2022]用严格的逐层良好性更新取代了反向传播。最近的FF-CNN工作在32x32基准上缩小了与BP的差距,引发了逐层训练是否在现实规模下成为可行替代方案的问题。为了严格探究这一点,我们开发了DTG-FF——动态温度良好性、解耦归一化和多层融合——作为在九个真实数据基准上设定FF系列最先进水平的工具(CIFAR-10上91.8%,以及ImageNet-100 224x224上的首个FF基线),并用它来审计逐层训练实际能扩展到何种程度。(1)真实数据扩展。在相同配方和主干下,架构匹配的BP-DeepSup基线在CIFAR-10/CIFAR-100上分别超过DTG-FF 2.40/5.93个百分点,且差距随类别数增加而扩大。在224x224分辨率下,同一工具仅达到49.4%——这是该尺度下的首个FF基线,而典型BP超过75%[Tian et al., 2020]——暴露了在32x32下不可见的真实数据上限。(2)合成与真实K冲突。在合成教师-学生任务中,随着类别数K增长,DTG-FF越来越优于BP;而在真实图像上,FF-BP差距符号反转并随K扩大。数据集内CIFAR-100粗粒度与细粒度探针将标签层次与图像分布分离:合成K扫描将输出维度与细粒度判别难度混淆,从而高估了FF的可迁移性。(3)系统审计。FF可以在不存储深度激活的情况下实现,但在普通8 GB硬件上,标准BP+梯度累积达到4.18 GB / 157 imgs/s,而DTG-FF为7.90 GB / 138 imgs/s,因此在公平基线支持下,基于内存的理由在此规模下不成立。

英文摘要

Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.

2606.06538 2026-06-08 cs.CV 新提交

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

WorldBench: 一个具有挑战性且视觉多样的多模态推理基准

Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

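Affiliations * Princeton University; NYU; University of Waterloo; Meta, FAIR

AI summary Presents WorldBench, which builds a multi-domain taxonomy of visual concepts, collects diverse images, and designs questions that frontier MLLMs fail to answer, in order to evaluate the visual understanding of multimodal large language models and expose their weaknesses.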

Comments Project page: https://worldbench-vl.github.io/

详情
AI中文摘要

在现实世界应用中,模型被期望在不同设置下可靠地执行。然而,许多现有的多模态基准扩展了任务类型,但没有捕捉到处理开放视觉输入所需的视觉多样性。我们提出了WorldBench,一个具有挑战性且视觉多样的推理基准,用于评估多模态大语言模型(MLLM)。我们构建了一个跨多个领域(例如,生物)的数千个视觉概念的分类法。在该分类法的指导下,我们从搜索引擎和现有数据集中策划了一个广泛的图像集合,以全面代表视觉世界。通过结构化的试错,我们手动设计了前沿MLLM无法回答的具有挑战性的问题。在定量和人工评估中,WorldBench比任何现有的多样化基准实现了更高的视觉多样性。在WorldBench上评估15个MLLM揭示了视觉理解中的弱点:即使是最强的模型也只达到64.0%的准确率,而一些模型的表现略高于随机水平。我们希望我们的工作强调视觉多样性在构建多模态基准中的重要性。

英文摘要

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

2606.06536 2026-06-08 cs.CV cs.AI cs.LG 新提交

Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

基于注意力引导自编码器融合的无人机输电线路绝缘子缺陷检测

Malak Allam, Khaled Shaban, Ali Hamdi

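Affiliations * University of Science and Technology of China

AI summary Proposes the AE-YOLO framework, which uses attention-guided autoencoder fusion and variance-maximizing regularization to address class imbalance and scale variation in insulator defect detection from UAV imagery, reaching 95.10% mAP@0.5, 5 percentage points above the strongest YOLO baseline.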

详情
AI中文摘要

高压输电线路绝缘子的自动缺陷检测仍然具有挑战性,原因在于无人机(UAV)图像中严重的类别不平衡、尺度变化大以及缺陷实例的空间范围小。为了解决这些问题,本文提出了AE-YOLO,一种注意力引导的自编码器增强型YOLO框架,用于鲁棒的绝缘子缺陷检测。该架构在特征金字塔网络-路径聚合网络(FPN-PAN)颈部集成了轻量级瓶颈自编码器,在多尺度特征融合过程中保留了异常敏感信息。整个骨干网络使用卷积块注意力模块(CBAM),增强了特征辨别能力并抑制了背景干扰。该框架还引入了一种方差最大化的自编码器正则化策略,鼓励生成多样化、缺陷判别性的潜在表示。网络使用统一目标进行训练,该目标结合了焦点损失、完全IoU(CIoU)损失和自编码器正则化,以解决前景-背景不平衡问题并提高定位精度。在推理过程中,加权框融合(WBF)结合了YOLOv8、YOLOv10和YOLO11的预测结果。自编码器引导的置信度提升机制提高了对罕见缺陷类别的敏感性。在绝缘子缺陷检测数据集上的实验表明,采用EfficientNetV2骨干网络的AE-YOLO在mAP@0.5上达到95.10%,精度为96.40%,召回率为93.80%。这一性能在mAP@0.5上比最强的YOLO系列基线高出5.0个百分点,在召回率上高出6.7个百分点。这些结果证实了该框架的有效性和适应性。该模型是基于UAV的输电线路巡检和缺陷监测的实用且可扩展的解决方案。

英文摘要

Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10 percent mAP at 0.5, 96.40 percent precision, and 93.80 percent recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP at 0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.

2606.06533 2026-06-08 cs.AI cs.CL 新提交

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

立场:不要仅仅“在后期修复它”:AI科学必须研究训练动态

Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

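Affiliations * Carnegie Mellon University; Boston University; Harvard University; University of Oxford, Martian; Max Planck Institute for Software Systems

AI summary Argues that a science of AI should go beyond post-hoc analysis and study training dynamics in order to predict, intervene on, and design model behavior, and surveys current progress and open problems in areas such as interpretability and fairness.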

Comments Accepted as an oral to the ICML: https://icml.cc/virtual/2026/poster/67142

详情
AI中文摘要

拥有对AI的科学理解意味着什么?模型不是静态对象:它们是由数据、目标、架构和优化动态塑造的时间演化过程的快照。然而,许多AI研究将模型视为固定的人工制品,分析训练后的行为,而不追问它们为何出现。这篇立场论文认为,AI科学必须超越事后修复,研究产生模型行为的训练动态。这样的科学应该支持逐渐增强的理解形式:从早期训练信号预测结果,在轨迹出错时进行干预,并最终设计出更可靠地产生期望属性的训练程序。缩放定律已使损失预测成为常规;挑战在于将这一成功扩展到能力、偏见、鲁棒性和安全相关行为。我们基于科学史和科学哲学阐述了此类理论的要求,考察了在机械可解释性、公平性、记忆化和简单性偏差方面的进展,并确定了具体的开放问题。

英文摘要

What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

2606.06532 2026-06-08 cs.CV 新提交

GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

GOPAgen: 基于结构记忆与层次推理的运动感知高效智能长视频理解

Haozhe Chi, Yang Jin, Yadong Mu

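Affiliations * Peking University

AI summary Proposes GOPAgen, which achieves efficient long-video understanding through a GOP-based motion agent built on video codecs, a GOP tree reasoning algorithm, and a structural memory mechanism, with leading performance on multiple VQA benchmarks.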

详情
AI中文摘要

尽管在智能长视频理解方面取得了显著进展,现有方法仍然缺乏详细的运动理解以及高效的内存架构。在本文中,我们提出GOPAgen,一种新颖的方法,该方法首先通过精心设计的运动代理将视频编解码器集成到视频理解框架中,该代理基于视频编解码器中的图像组(GOP)进行训练。我们进一步开发了GOP树推理算法,该算法与视频编解码器自然对齐,增强了模型理解视频中局部细节运动的能力。此外,我们精心设计了一种结构记忆机制,将局部运动信息与结构页面中的详细描述相结合,并提出了一种高效的从粗到精的缩放算法,以充分利用结构记忆。此外,我们将运动矢量数据库纳入框架,以实现不同粒度运动矢量的高效检索。总体而言,我们的方法在各种视频理解基准(包括MotionBench和Egoschema)上取得了优越的视频问答(VQA)性能,从而证明了我们提出框架的优越性。

英文摘要

Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model's ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.

2606.06529 2026-06-08 cs.AI cs.LG 新提交

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

智能体AI控制评估中的攻击选择显著降低安全性

Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad, Joachim Schaeffer, Ram Potham, Tyler Tracy

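AI summary Studies how an attacker's strategic choice of when to attack affects AI control safety. Decomposing attack decisions into start and stop policies, the experiments show that both substantially reduce measured safety and that existing evaluations may overestimate safety.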

详情
AI中文摘要

一个策略性选择攻击时机的攻击者比无差别攻击者更难被捕获。AI控制是一种安全框架,用于在较弱但可信的监控者和有限的人工审计预算下部署能力强但不可信的AI智能体。控制评估通过让红队攻击策略对抗蓝队监控者来压力测试这些协议,但当前评估通常假设攻击者不会策略性地选择攻击时机。我们通过将攻击决策分解为开始策略(决定攻击者何时应开始攻击)和停止策略(决定攻击者何时应中止正在进行的攻击),在智能体环境中研究了这种能力——攻击选择。在两个智能体环境BashArena和LinuxArena中,两种策略都在不改变底层攻击能力的情况下显著降低了测量的经验安全性。在1%的审计预算下,我们的开始策略在BashArena和LinuxArena上均将安全性降低了20个百分点,停止策略在BashArena上降低了20个百分点,在LinuxArena上降低了28个百分点。这些降低应被解释为攻击选择效果的上限。因此,现有的控制评估可能对选择性攻击者产生过于乐观的安全性估计。我们建议未来的评估、系统卡和安全案例应引出攻击选择,以产生更现实的安全性估计。

英文摘要

An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack. We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability. At a 1% audit budget, our start policy reduces safety by 20pp on both BashArena and LinuxArena, and our stop policy reduces safety by 20pp on BashArena and 28pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.

2606.06526 2026-06-08 cs.AI cs.LG 新提交

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

CrowdMath: 众包数学研究讨论数据集

Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky

Affiliations * University of Massachusetts Lowell; San Jose State University; Massachusetts Institute of Technology; Dartmouth College; Amazon AGI

AI summary Introduces the CrowdMath dataset of 164 expert-annotated progress chains for evaluating large language models on collaborative open-problem solving, finding that models do well on local prediction but fall short on role classification.

Comments 16 pages, 4 figures

详情
AI中文摘要

大型语言模型在数学推理方面取得了实质性进展,但现有基准通常评估具有最终答案、逐步解决方案或完整证明的明确问题。它们没有捕捉到协作开放问题求解:参与者提出部分论证、识别先前步骤中的空白或错误、修复有缺陷的推理,并逐步将增量贡献综合成证明。我们引入了CrowdMath,一个包含164条专家标注的进展链的数据集,来自MIT PRIMES--Art of Problem Solving (AoPS) CrowdMath项目(2016-2025),这是一个协作研究计划,其讨论已导致同行评审的出版物。每条链追踪一个从开放问题陈述到完成证明的多参与者论坛讨论。帖子根据其在不断演变的解决方案过程中的功能角色进行标注,包括部分进展、证明完成、错误推理和错误识别。我们定义了评估任务并对六个前沿模型进行了基准测试。模型在下一帖子预测上达到83-88%的准确率,表明它们能够跟随数学讨论的局部流程。然而,它们难以识别单个贡献的功能重要性,最佳模型在帖子角色分类上仅达到0.42的宏F1分数。CrowdMath揭示了解决明确数学问题与理解协作数学进展之间的差距。

英文摘要

Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES--Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.

2606.06520 2026-06-08 cs.CV cs.GR 新提交

Applying Deep Learning for cockpit segmentation in the context of mixed reality

应用深度学习进行混合现实场景中的驾驶舱分割

Alexandre Leles Sousa, Pedro de Oliveira Nielson, Erick Oliveira Rodrigues, Rafael Francisco dos Santos, Giovani Bernardes Vitor

Affiliations * Laboratório de Robótica, Sistemas Inteligentes e Complexos - RobSIC; Instituto de Ciências Tecnológicas, Universidade Federal de Itajubá, Campus Itabira, MG; Universidade Tecnológica Federal do Paraná - UTFPR, Campus Pato Branco/PR

AI summary Applies the U-net and DeepLabV3+ convolutional neural networks to segment cockpit images into foreground and background, to ease the blending of virtual and real imagery in mixed reality, reaching about 90% accuracy.

Comments XXV Congresso Brasileiro de Automática - CBA 2024

详情
AI中文摘要

计算机视觉是一个持续发展的领域。随着第一人称视角技术的进步,该领域内出现了新的发展机遇。混合现实通过实时显示物理世界中的物体来促进虚拟环境。为此,必须关注用户在此模拟环境中的沉浸感,不断寻求使其更接近可能的期望现实。本文提出开发图像处理,以执行图像分割,识别前景和背景,从而便于虚拟和真实图像的融合。因此,本研究通过摄像头获取用户使用CAT793F非公路卡车模拟器的真实图像,利用人工智能对这些图像进行分割。应用了卷积神经网络架构“U-net”和“DeepLabV3+”来执行图像分割。结果显示,准确率约为90%,并确定了最佳模型。

英文摘要

Computer vision is an area that has been growing continuously. With the advance of first-person-view technologies, new development opportunities have emerged in the area. Mixed reality promotes virtual environments with objects from the physical world shown in real time. For that, it is necessary to consider the immersion of the user in this simulated environment, increasingly seeking to bring it closer to a possible desired reality. This paper proposes image processing that segments images to identify foreground and background in order to facilitate the union of virtual and real images. Thus, the present work obtains real images of the user using the off-highway truck simulator CAT793F, through a camera, and segments these images with artificial intelligence techniques. The convolutional neural network architectures "U-net" and "DeepLabV3+" are applied to perform image segmentation. As a result, metrics with around 90% accuracy were presented and the best model was determined.

2606.06519 2026-06-08 cs.AI cs.LG 新提交

SafeGene: Reusable Adapters for Transferable Safety Alignment

SafeGene: 可重用的适配器实现可迁移的安全对齐

Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng

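Affiliations * Southeast University; The Hong Kong Polytechnic University

AI summary Proposes SafeGene, a reusable safety-adapter module that extracts a safety representation from aligned-degraded model discrepancies and uses data-aware layer selection and few-shot coefficient recalibration to recover safety across tasks, lowering harmful response rates while maintaining downstream performance.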

详情
AI中文摘要

开放权重的LLM越来越多地被微调成定制助手,但下游微调可能会削弱安全对齐,使模型更容易受到恶意提示的攻击,即使训练数据并非有意有害。这造成了反复的安全恢复问题,因为目标模型会随着新任务数据或用户交互而不断更新。我们提出SafeGene,一种可重用的安全适配器模块,设计用于每个架构兼容模型家族内的跨任务重用。SafeGene不将安全恢复视为特定于模型的修复步骤,而是将安全能力视为一种独立的、可重用的适配器表示,与任务特定更新解耦。这种表示从对齐-退化模型差异中获得,通过数据感知层选择精炼成任务可迁移的安全向量,并通过少样本逐层系数重校准在每个下游任务适应模型中表达。跨多个模型家族、下游任务和安全评估者的实验表明,SafeGene增强的模型在降低有害响应率的同时保持下游性能,在安全-效用权衡中优于代表性的安全适应方法。

英文摘要

Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety-utility trade-off.

2606.06518 2026-06-08 cs.AI cs.LG 新提交

DiBS: Diffusion-Informed Branch Selection

DiBS: 扩散模型引导的分支选择

Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han

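Affiliations * Nanyang Technological University; University of Science and Technology of China; Tsinghua University

AI summary Learning-based Sudoku solvers lack correctness guarantees while symbolic solvers suffer from long-tail search. DiBS, a diffusion-informed branch selection method, keeps the symbolic solver complete while using a diffusion model to rank candidate values, substantially reducing search cost.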

Comments 12 pages, 6 figures, 3 tables

详情
AI中文摘要

数独是一个典型的约束满足问题,需要在严格的离散约束下进行全局结构推理。现有的数独求解方法主要集中于两种主流方法,即传统启发式方法和深度学习求解器。然而,它们存在互补的局限性:基于学习的求解器缺乏硬正确性保证,而完全符号求解器仍容易陷入长尾搜索。为了解决这些不足,我们提出了一种新颖的扩散模型引导方法,称为DiBS,用于分支选择搜索过程。具体来说,DiBS保持符号求解器的完备性,并使用扩散模型作为分支排序指南。核心方法是在当前部分赋值和轻量级一致性信号下对候选值进行排序。此外,我们提供了深入的理论证明来揭示其工作原理和原因。在具有挑战性的Royle 17线索数独基准上的实验表明,与强启发式基线相比,我们的DiBS显著降低了搜索成本,特别是在节点、回溯和长尾百分位数方面。此外,这些结果证实,在分支顺序错误代价最高的困难实例上,学习到的全局指导是有效的。所有代码均可在此https URL获取。

英文摘要

Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku focuses mainly on two dominant approaches: traditional heuristic solvers and deep learning solvers. However, they suffer from complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion model-guided approach, termed DiBS, for the branch selection search process. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core method ranks candidate values under the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof to reveal how and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. These results also confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. All code is available at https://github.com/shanxierdan/DiBS.
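
The idea of keeping a complete symbolic search and using a learned model only to order branches can be sketched as follows. This is a minimal illustration, not the authors' code: `score_fn` is a hypothetical hook standing in for the diffusion model's ranking, and the uniform scorer and the example puzzle are assumptions chosen to keep the sketch self-contained.

```python
def candidates(grid, r, c):
    """Digits that do not conflict with the row, column, or 3x3 box of (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def solve(grid, score_fn, stats):
    """Complete backtracking search; score_fn only reorders the branches."""
    empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return True
    # Branch on the most constrained cell (fewest remaining candidates).
    r, c = min(empties, key=lambda rc: len(candidates(grid, *rc)))
    # Try higher-scored digits first; every candidate can still be explored,
    # so a poor score costs time but never correctness.
    ordered = sorted(candidates(grid, r, c), key=lambda d: -score_fn(grid, r, c, d))
    for d in ordered:
        stats["nodes"] += 1
        grid[r][c] = d
        if solve(grid, score_fn, stats):
            return True
        grid[r][c] = 0
        stats["backtracks"] += 1
    return False

def uniform_score(grid, r, c, d):
    """Placeholder scorer (assumption): a learned model would go here."""
    return 0.0

PUZZLE = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]

grid = [row[:] for row in PUZZLE]
stats = {"nodes": 0, "backtracks": 0}
solved = solve(grid, uniform_score, stats)
```

Swapping `uniform_score` for a better ranking changes only `stats["nodes"]` and `stats["backtracks"]`, which correspond to the search-cost measures named in the abstract.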

2606.06514 2026-06-08 cs.AI cs.LG 新提交

Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

通过将公平性视为对称操作来检测和缓解偏见

Nishit Singh

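Affiliations * Nishit Singh

AI summary Formalizes bias as a symmetry-breaking operation and restores symmetry through loss regularization, achieving upwards of 90% violation reduction on synthetic datasets at an accuracy cost of about 5%.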

Comments 8 pages, 7 figures

详情
AI中文摘要

部署在高风险社会经济环境中的机器学习系统通常表现出偏见。我们将偏见形式化为一种对称性破缺操作:如果分类器在切换敏感属性的反事实操作下(保持能力特征不变)输出不变,则该分类器是公平的。我们实现基于损失的正则化作为对称性恢复机制,并在四个具有不同噪声、相关性和偏见水平的合成数据集上评估该框架。该框架实现了90%以上的违规减少,准确率成本约为5%。该框架不需要因果图知识,计算轻量,且适用于任何可定义为比特翻转的敏感属性,使其适用于主流基准中缺乏局部歧视来源的上下文。

英文摘要

Machine learning systems deployed in high-stakes socioeconomic settings routinely display bias. We formalize bias as a symmetry-breaking operation: a classifier is fair if its outputs remain invariant under the counterfactual operation of switching a sensitive attribute, with merit features held fixed. We implement loss-based regularization as a symmetry-restoring mechanism and evaluate the framework on four synthetic datasets with varying levels of noise, correlation, and bias. The framework achieves upwards of 90% violation reduction, with accuracy costs around 5%. This framework does not require causal graph knowledge, is computationally lightweight, and generalizes to any sensitive attribute definable as a bit-flip, making it suitable for contexts where local sources of discrimination remain absent from mainstream benchmarks.
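
The symmetry view of fairness in the abstract above can be sketched as a penalty on how much a classifier's output changes when the sensitive bit is flipped. This is a minimal sketch under stated assumptions: the logistic model, the squared-difference penalty, and the names `symmetry_regularized_loss`, `sensitive_col`, and `lam` are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def symmetry_regularized_loss(w, X, y, sensitive_col, lam):
    """Cross-entropy plus a penalty on output change under an attribute flip.

    X: (n, d) float features with a binary sensitive attribute stored in
    column sensitive_col; all other columns are treated as merit features.
    Returns (total_loss, violation).
    """
    eps = 1e-12
    p = sigmoid(X @ w)
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    # Counterfactual copy: flip the sensitive bit, hold merit features fixed.
    X_flip = X.copy()
    X_flip[:, sensitive_col] = 1.0 - X_flip[:, sensitive_col]
    p_flip = sigmoid(X_flip @ w)
    # Symmetry violation: mean squared change in the predicted probability.
    violation = np.mean((p - p_flip) ** 2)
    return bce + lam * violation, violation
```

A model that puts zero weight on the sensitive column has zero violation, so increasing `lam` pushes training toward outputs that are invariant under the flip.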

2606.06464 2026-06-08 cs.CL cs.AI 交叉投稿

Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

人类成人与LLM作为科学家:谁从主动探索中受益?

Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

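Affiliations * Mila - Quebec AI Institute; McGill University; University of California Berkeley; New York University; Meta FAIR; MIT Media Lab; Montreal Neurological Institute

AI summary Through an active-exploration experiment, finds that active exploration substantially improves adults' reasoning about conjunctive causal rules, although conjunctive rules still require more tests. It also compares large language models, finding that some approach human accuracy on hypothesis inference but explore less efficiently and show a similar conjunctive-disjunctive performance gap.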

Comments Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

详情
AI中文摘要

因果学习文献中的一个长期发现是,成人难以识别合取因果规则(即一个效应需要多个原因同时存在),而在析取情境中表现更好。然而,这种“合取缺陷”的大多数演示依赖于被动观察范式,证据有限,学习者无法控制证据生成。本文探讨当成人通过主动探索获得能动性时,这种偏见是否仍然存在。使用修改后的“blicket检测器”任务,成人参与者在合取或析取规则结构下自由干预以识别因果对象。我们表明,主动探索显著改善了成人的合取因果推理,尽管合取规则仍比析取规则需要更多测试来推断。我们进一步将人类表现与同一设置下的多种大型语言模型进行比较。虽然一些最先进的模型在假设推断准确率上接近人类水平,但它们通常表现出效率较低的探索策略以及类似的合取-析取性能差距。

英文摘要

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

2606.05510 2026-06-08 cs.AI cs.CL 交叉投稿

Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

基于严重性感知的课程学习与多模型响应选择用于医疗文本生成

Ahmed Alansary, Molham Mohamed, Ali Hamdi

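Affiliations * Faculty of Computer Science, MSA University, Giza, Egypt

AI summary Proposes a multi-model framework combining a curriculum learning strategy with relevance-based response selection. With three-stage curriculum training applied independently to five large language models, it improves medical text generation on the MAQA dataset.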

Comments 6 pages, 3 figures, IMSA2026

详情
AI中文摘要

远程医疗系统在提供可获取且及时的医疗信息方面变得越来越重要。现有的大语言模型通常难以在不同病例严重程度下提供一致且上下文恰当的医疗响应。这一局限性凸显了需要能够有效适应医疗查询渐进复杂性的模型。为了解决这一挑战,我们引入了一个严重性感知的多模型框架,该框架将课程训练策略与基于相关性的响应选择相结合。所提出的框架采用三阶段课程学习策略,每个模型依次在轻度、中度和危重病例上进行训练,以逐步获取领域知识。该方法利用五个大语言模型,每个模型在相同的课程方案下独立训练。在推理过程中,所有模型生成候选响应,并选择最合适的响应作为最终输出。该框架在MAQA数据集上进行训练和评估,该数据集提供带注释的医疗问答对。使用BERTScore评估的实验结果表明,与基线和微调模型相比,所提出的方法取得了优越的性能,在基线设置下达到86.71%,微调后达到90.30%。这些结果凸显了将课程学习与多模型响应选择相结合在提高医疗文本生成中的响应质量和相关性方面的有效性。

英文摘要

Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity in medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach utilizes five large language models, each independently trained under the same curriculum scheme. During inference, all models generate candidate responses, and the most appropriate response is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.

2605.24649 2026-06-08 cs.LG cs.AI cs.LO cs.SY eess.SY 交叉投稿

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

关于循环多项式代理三元逻辑门网络的稳定性与可实现性

Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras, Calin Belta

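Affiliations * University of Maryland, College Park, USA

AI summary Proposes the R-DTLGN architecture, which trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit. Combining numerically monotone and information-monotone gates, it provides stable recurrence and principled abstention in STL monitoring, with the network size determined by the STL formula.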

Comments 9 pages, 3 figures. This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

循环神经网络(RNN)可以从部分轨迹在线学习预测信号时序逻辑(STL)判定,但在安全关键系统中部署为运行时监控器需要的不只是预测准确性。标准RNN架构无法提供结构保证,确保在传感器退化时输出能优雅降级;一个丢失的输入可能悄无声息地将判定从安全翻转为不安全。我们引入了循环可微三元逻辑门网络(R-DTLGN),这是一种在Kleene三值逻辑$\{-1, 0, +1\}$上运行的循环架构,其中$0$明确表示未知。R-DTLGN通过连续多项式代理进行训练,并在推理时硬化成离散的三元逻辑电路。我们通过从三元域上的两种序导出的两个门词汇表来分析硬化电路:数值单调门确保稳定的循环动态,而信息单调门(如果存在)保证原则性弃权(未知输入从不产生错误输出)和输入确定性上的单调性(更多信息只能改善判定)。我们证明,有界STL算子所需的循环连接仅使用AND和OR门,这两个门属于两个词汇表,从而将监控任务与架构的保证联系起来。由STL公式的时序算子导出的可实现性界限直接确定了网络隐藏状态的大小,用公式驱动的规范取代了超参数搜索。我们在D4RL PointMaze导航数据上的STL规范上进行了评估,测试了预测准确性、谓词丢失下的退化以及两个标签构建流程之间的准确性与安全性权衡。据我们所知,R-DTLGN是第一个将学习到的时序预测与基于三值逻辑的形式化退化保证相结合的循环架构。

英文摘要

Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene's three-valued logic $\{-1, 0, +1\}$, where $0$ explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture's guarantees. A realizability bound derived from the STL formula's temporal operators directly sizes the network's hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
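
The three-valued semantics that the abstract above relies on can be shown in a few lines. In Kleene's strong logic over $\{-1, 0, +1\}$, AND is the minimum and OR is the maximum, so bounded temporal operators become folds over a window of predicate values. The helper names `always` and `eventually` and the window encoding are assumptions for illustration, not the R-DTLGN implementation.

```python
from functools import reduce

FALSE, UNKNOWN, TRUE = -1, 0, 1

def k_and(a, b):
    """Kleene AND: the minimum of the two truth values."""
    return min(a, b)

def k_or(a, b):
    """Kleene OR: the maximum of the two truth values."""
    return max(a, b)

def k_not(a):
    """Kleene NOT: negation, which leaves UNKNOWN unchanged."""
    return -a

def always(window):
    """Bounded 'always': AND-fold over the predicate values in the window."""
    return reduce(k_and, window, TRUE)

def eventually(window):
    """Bounded 'eventually': OR-fold over the predicate values in the window."""
    return reduce(k_or, window, FALSE)
```

Replacing a true predicate with `UNKNOWN` turns `always([TRUE, TRUE, TRUE])` into `always([TRUE, UNKNOWN, TRUE])`, which yields `UNKNOWN` instead of `FALSE`: the monitor abstains and the verdict is not flipped.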