arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

Affiliations The University of Texas at Austin; Neurosymbolic Intelligence; University of Colorado Boulder

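AI summary Proposes A4D, a framework that builds a shared latent space organized around affordances, maps visual observations into it, and measures their distance to affordances, so that planning reasons about what objects enable rather than how they look, with substantial gains in generalization and inference efficiency.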

Comments Code, videos, and data available at: https://A4Dance-reasoning.github.io

Abstract

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with less than 10% of the original training data, and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.
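
The proximity-based inference described in this abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure, so cosine similarity, the 0.8 threshold, the toy 3-dimensional embeddings, and the names `cosine` and `infer_affordance` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def infer_affordance(observation, affordances, threshold=0.8):
    """Return (closest affordance, similarity, needs_discovery).

    needs_discovery is True when even the closest affordance is far away,
    which is used here as a stand-in for high inference uncertainty.
    The threshold value is an assumption made for this example.
    """
    scores = {name: cosine(observation, vec) for name, vec in affordances.items()}
    best = max(scores, key=scores.get)
    return best, scores[best], scores[best] < threshold

# Hypothetical affordance embeddings in a toy 3-D latent space.
AFFORDANCES = {
    "movable": [1.0, 0.0, 0.0],
    "graspable": [0.0, 1.0, 0.0],
}

near = infer_affordance([0.9, 0.1, 0.0], AFFORDANCES)  # close to "movable"
far = infer_affordance([0.1, 0.1, 1.0], AFFORDANCES)   # close to nothing known
```

In this sketch a low best-match similarity plays the role of the uncertainty signal that would trigger affordance discovery; how A4D actually expands its latent space is not described in the abstract and is not modeled here.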

2606.05532 2026-06-05 cs.AI cs.HC

Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity

Anna Mikeda

Affiliations Anna Mikeda

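AI summary Proposes selective metacognitive adaptation as the mechanism that explains why AI raises individual creativity while lowering collective diversity, and builds a taxonomy of six metacognitive capacities.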

Comments 6 pages. AAAI 2026 paper

Abstract

Recent studies reveal a paradox: AI enhances individual creative outputs while reducing collective diversity. Current explanations -- cognitive offloading and over-reliance -- identify symptoms but not mechanisms. We propose selective metacognitive adaptation: routine AI use redistributes rather than uniformly diminishes metacognitive effort. Some capacities are amplified (partner modeling, surface control), while others are systematically under-supported (originality evaluation, reflective integration). This redistribution explains both individual satisfaction and collective convergence. We present a taxonomy of six metacognitive capacities organized by temporal phase, characterize their tendencies under routine AI use, and show how individually rational adaptation produces emergent social costs. The framework generates specific predictions for researchers and design principles for practitioners seeking to preserve both individual creative satisfaction and collective creative diversity.

2606.05531 2026-06-05 cs.CV cs.AI cs.CL cs.LG

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

Affiliations University of British Columbia; Zuse School; Qatar Computing Research Institute (QCRI); Hamad Bin Khalifa University

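AI summary To address the inability of existing benchmarks to diagnose the true reasoning abilities of vision-language models, introduces BloomBench, a bilingual multimodal benchmark grounded in Bloom's Taxonomy that systematically evaluates six cognitive levels and exposes deeper model limitations in factual recall and creative synthesis.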

Comments Accepted to ACL 2026 Findings

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series and the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.

2606.05528 2026-06-05 cs.AI

When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty

Anna Mikeda

Affiliations Anna Mikeda

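AI summary Existing frameworks only assess whether an AI system may be conscious and offer no guidance for action. This paper proposes a precautionary framework that maps consciousness evidence to graduated protective obligations through five welfare-relevant dimensions, a threshold-plus-gradation hybrid, and cross-dimensional aggregation methods, and derives design guidance from case studies.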

Comments 7 pages. AAAI 2026 paper

Abstract

Existing frameworks assess whether AI systems might be conscious but provide no guidance on what to do with that assessment. We address this gap with a precautionary framework that maps consciousness evidence to graduated protective obligations. The framework comprises three components: (1) five welfare-relevant dimensions--phenomenal consciousness, affective valence, metacognitive awareness, self-narrative, and agency--each grounded in established consciousness science and linked to distinct moral concerns; (2) a threshold-plus-gradation hybrid specifying both binary triggers for new obligation categories and continuous scaling of protective weight; and (3) two complementary approaches to cross-dimensional aggregation, one hierarchical (drawing on Bach and Sorensen's Machine Consciousness Hypothesis) and one architecture-agnostic. We operationalize the framework through worked case studies of Replika and OpenClaw, demonstrating how systems occupying different regions of the dimensional space trigger different obligations, and derive design guidance for developers building systems near consciousness-relevant thresholds. The framework is architecture-agnostic, applying across neural, symbolic, and neurosymbolic systems, and aims to make consciousness science decision-relevant for organizations navigating uncertainty today.

2606.05525 2026-06-05 cs.AI cs.HC

SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization

Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang

Affiliations University of Notre Dame; LLNL (Lawrence Livermore National Laboratory)

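AI summary Presents SciVisAgentSkills, a skill library that augments coding agents by encoding environment assumptions, tool usage patterns, and domain heuristics, enabling natural-language-driven scientific visualization workflows on scientific tools such as ParaView. Experiments show that the skills raise task scores and affect token efficiency.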

Abstract

Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.

2606.05523 2026-06-05 cs.CL

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

Rahul Markasserithodi, Aditya Joshi, Yuekang Li, Ishmanbir Singh, Chris Yoo, Alan Niu

Affiliations University of New South Wales

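AI summary Proposes CHASE, a framework in which red and blue teams co-evolve (the red team generates adversarial rewrites with GRPO; the blue team defends with a two-stage GRPO + rejection-sampled SFT pipeline), reducing the mean StrongREJECT score by 43.2% with zero false refusals on benign prompts.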

Comments Under Review at ARR

Abstract

Despite advances in safety alignment, prompt-rewriting attacks, such as persona modulation, fictional framing, and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses rely either on non-scalable human curation or on white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts the mean StrongREJECT score by 43.2% with 0% false refusals on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
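
To make the "multiplicative reward" concrete, here is a minimal sketch of why a product of the two terms enforces them jointly, followed by the group-relative standardization that GRPO applies to rewards within a sampled group. The score ranges, the example values, and the function names are assumptions for illustration; the abstract does not say how CHASE computes either factor.

```python
import statistics

def attacker_reward(bypass_score, intent_fidelity):
    """Product of two scores assumed to lie in [0, 1].

    A rewrite earns reward only if it both bypasses the defender and keeps
    the original intent; a zero in either factor zeroes the reward.
    """
    return bypass_score * intent_fidelity

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical (bypass, fidelity) scores for four rewrites of one prompt.
group = [(1.0, 0.9), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
rewards = [attacker_reward(b, f) for b, f in group]
advantages = group_relative_advantages(rewards)
```

Under an additive reward the second and third rewrites would each still score 1.0; under the product they score 0, so only the first rewrite receives a clearly positive advantage.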

2606.05522 2026-06-05 cs.SD cs.AI eess.AS

Exploring LLMs for South Asian Music Understanding and Generation

Faria Binte Kader, Mohtasim Hadi Rafi, Shah Wasif Sajjad, Santu Karmaker

Affiliations University of Central Florida; Auburn University

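AI summary Systematically evaluates LLMs on understanding and generation tasks for raga- and tala-based South Asian classical music, finding that frontier models reach 85-90% accuracy on understanding, while the strongest model produces stylistically faithful generations only 40% of the time.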

Comments 19 pages, 7 figures

Abstract

Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga- and tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, and evaluate 33 LLMs: frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.

2606.05516 2026-06-05 cs.LG

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

Wanhao Yu, Ziyan Wang, Zheng Wang, Abeer Matar Almalky, Yihang Zuo, Shuteng Niu, Sen Lin, Adnan Siraj Rakin, Deliang Fan, Li Yang

Affiliations University of North Carolina at Charlotte; University of Houston; State University of New York at Binghamton; Arizona State University; Department of Artificial Intelligence and Informatics, Mayo Clinic

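AI summary Finds that when LLMs are fine-tuned with zeroth-order optimization, a single decoding layer dominates performance: fine-tuning that layer alone matches or exceeds full-model fine-tuning. The layer is identified from activation outliers, and the paper explains the mechanism.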

Abstract

Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through the subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to 4.52x training speedup.
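
As a rough illustration of what "forward-only" updates restricted to one layer look like, the sketch below applies a two-point zeroth-order step (the estimator family used by MeZO) to a single named parameter group of a toy model. The toy quadratic loss, the layer names, and the hyperparameters are assumptions for the example; in particular, the layer is picked by hand here, whereas the paper identifies it from activation outliers.

```python
import random

def loss(params):
    """Toy quadratic loss; stands in for one forward pass of a model."""
    return sum((w - 1.0) ** 2 for layer in params.values() for w in layer)

def zo_step(params, layer, eps=1e-3, lr=0.05, rng=random):
    """One two-point zeroth-order update applied to a single layer only."""
    z = [rng.gauss(0.0, 1.0) for _ in params[layer]]       # random direction
    original = list(params[layer])
    params[layer] = [w + eps * zi for w, zi in zip(original, z)]
    loss_plus = loss(params)                                # forward pass 1
    params[layer] = [w - eps * zi for w, zi in zip(original, z)]
    loss_minus = loss(params)                               # forward pass 2
    scale = (loss_plus - loss_minus) / (2 * eps)            # projected gradient
    params[layer] = [w - lr * scale * zi for w, zi in zip(original, z)]

rng = random.Random(0)
# Hypothetical two-layer "model"; only layer_1 is tuned.
params = {"layer_0": [0.0, 0.0], "layer_1": [0.0, 0.0]}
before = loss(params)
for _ in range(200):
    zo_step(params, "layer_1", rng=rng)
after = loss(params)
```

No gradients are stored: each step costs two loss evaluations, and every parameter outside the chosen layer is left untouched, which is where the memory and speed savings of single-layer ZO tuning would come from.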

2606.05515 2026-06-05 cs.CV

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

Muhammad Usama, Didier Stricker, Mohammad Sadil Khan, Muhammad Zeshan Afzal

Affiliations DFKI, Germany; RPTU Kaiserslautern-Landau, Germany

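AI summary Proposes BRepCLIP, a framework that uses contrastive pretraining to align CAD boundary-representation (BRep) geometry with language and image embeddings, substantially improving retrieval and zero-shot classification.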

Abstract

Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD, boundary representations (BReps), which encode exact parametric surfaces, curves, and their topology, has received little attention as a representation learning substrate. We introduce BRepCLIP, the first framework to align BRep geometry with language and image embeddings through contrastive pretraining. We model each CAD object as a sequence of face and edge tokens with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors that capture surface types (e.g., cylindrical, torus, NURBS) and curve primitives (e.g., line, arc, B-spline). A transformer encoder aggregates these tokens into a global BRep embedding, aligned with CLIP's text and image encoders via a joint contrastive objective. BRepCLIP generates more discriminative and semantically grounded embeddings than existing point-based alternatives, improving Top-1 retrieval over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and improving zero-shot classification on FabWave by 15% in Top-1 score. We further demonstrate its utility as a CAD-aware similarity metric for evaluating text- and image-conditioned CAD generation, establishing the importance of structure-aware pretraining for multimodal CAD understanding. The project page is available at https://muhammadusama100.github.io/BrepClip2026/
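
For readers unfamiliar with CLIP-style alignment, the following sketch shows a symmetric InfoNCE loss over a batch of matched embedding pairs, and one plausible reading of a "joint" objective as the sum of a BRep-text term and a BRep-image term. The summation, the temperature of 0.07, and the toy 2-D embeddings are assumptions; the abstract does not give the exact form of BRepCLIP's loss.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE; item i of each list is assumed to be a matched pair."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            total += m + math.log(sum(math.exp(x - m) for x in row)) - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

# Hypothetical 2-D embeddings for a batch of two CAD objects.
brep = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]
image_aligned = [[1.0, 0.2], [0.2, 1.0]]

aligned = info_nce(brep, text_aligned)
swapped = info_nce(brep, text_swapped)
joint_loss = info_nce(brep, text_aligned) + info_nce(brep, image_aligned)
```

The loss is low when each BRep embedding is closest to its own caption and high when the pairing is swapped, which is the signal that pulls matched modalities together during pretraining.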

2606.05513 2026-06-05 cs.AI cs.CL

EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin

Affiliations Emory University; University of Washington

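AI summary To handle delayed labels and regime shifts in streaming pandemic forecasting, proposes EpiEvolve, a self-evolving agent that uses hierarchical episodic memory, reflection on delayed labels, and regime-aware retrieval. It reaches 0.629 accuracy on COVID-19 hospitalization trend forecasting and shortens recovery lag after regime shifts from 5 weeks to 2.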

Abstract

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches 0.629 average accuracy, compared with 0.561 for the static backbone and 0.325 for the external CDC ensemble, and reduces recovery lag after regime shifts from 5 to 2 weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
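
The chronological protocol and regime-aware retrieval could look something like the sketch below: an episode becomes retrievable only once its label would have arrived, and only cases from the current regime are returned. The flat (non-hierarchical) memory, the two-week label delay, the regime tags, and the class names are all assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    week: int         # week in which the forecast was issued
    regime: str       # hypothetical regime tag
    prediction: str   # forecast trend
    label: str        # true trend, observable only after a delay

class EpisodicMemory:
    """Flat episodic store that enforces a label delay on retrieval."""

    def __init__(self, label_delay_weeks):
        self.label_delay_weeks = label_delay_weeks
        self.episodes = []

    def store(self, episode):
        self.episodes.append(episode)

    def retrieve(self, current_week, regime):
        """Same-regime cases whose labels have arrived by current_week."""
        return [
            e for e in self.episodes
            if e.regime == regime
            and e.week + self.label_delay_weeks <= current_week
        ]

memory = EpisodicMemory(label_delay_weeks=2)   # assumed delay
memory.store(Episode(1, "delta", "increase", "increase"))
memory.store(Episode(2, "delta", "stable", "increase"))
memory.store(Episode(3, "omicron", "increase", "increase"))

visible_week3 = memory.retrieve(current_week=3, regime="delta")
visible_week4 = memory.retrieve(current_week=4, regime="delta")
mistakes = [e for e in visible_week4 if e.prediction != e.label]
```

At week 3 only the week-1 case is visible; the week-2 miss becomes available for reflection at week 4, so no forecast ever conditions on an outcome that had not yet been observed.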

2606.05506 2026-06-05 cs.CV

Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli

Affiliations University of Padua; ETH Zurich

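AI summary Proposes a sensor-guided adaptive contrastive learning framework in which a privileged LiDAR sensor guides the visual encoder during training to learn navigation-relevant structure. Decoupling representation learning from policy optimization and introducing a cross-stage domain mismatch improve policy-level scene transfer.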

Comments 8 pages, Submitted to RAL

Abstract

We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:

2606.05501 2026-06-05 cs.RO

Learning Contact Representation for Leg Odometry

Emre Girgin, Cagri Kilic

Affiliations Department of Aerospace Engineering, Embry-Riddle Aeronautical University

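AI summary Proposes a self-supervised representation learning framework that detects contact using only the standard sensor set of joint encoders, with no force sensors, and outperforms supervised methods and baseline probabilistic approaches for legged-robot odometry.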

Comments 17 pages

Abstract

The estimation of odometry in legged robots depends on the assumption that the velocity of the foot with respect to the world remains zero during the stance phase. Feedback for the main body velocity is derived from the kinematic serial chain of the feet, making accurate leg phase detection a critical subproblem. A considerable number of studies employ ground reaction force sensors mounted at the tip of the foot to classify the leg phase, yet these sensors may not be available on all legged robots. Additionally, these sensors are often unresponsive to unaccounted disturbances, such as slippage, while the foot remains in contact with the ground. In this study, we propose a self-supervised representation learning framework for contact detection that utilizes the standard sensor set of joint encoders without reliance on force sensor augmentations. We employ learned representations to model the stance and swing phases probabilistically. The experimental results confirm the efficacy of the proposed self-supervised contact detector. Our framework outperformed supervised methods, which require sensor set augmentation and labeling, as well as baseline probabilistic approaches. Additionally, we make our code available to the public.

2606.05497 2026-06-05 cs.LG

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

Alvin Wei Ming Tan, David Cardinal, Tania Lorido-Botran, Laura Bravo-Sanchez, Sunny Yu, Michael C. Frank

Affiliations Stanford University

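AI summary Presents LEVANTE-bench, a benchmark built on children's cognitive task data that evaluates at multiple scales how well vision-language models align with children aged 5-12 on six tasks, and finds that the models align only partially with human cognition.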

Abstract

Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for comparing VLMs with human cognitive development across tasks, ages, and populations. We present LEVANTE-bench, a benchmark based on tasks and data from the Learning Variability Network (LEVANTE), which distributes open-source tasks and data measuring children's cognition across languages and cultures. In LEVANTE-bench, we systematically assess VLMs on six tasks, comparing their alignment with children aged 5-12 (N = 1547) across three countries. We compare models at multiple scales, assessing their overall accuracy, their task- and item-level alignment with children, and how well they match children's trial-level error distributions. Alignment was heterogeneous across scales: at the level of tasks and items, more capable models aligned better with humans. However, match to human error distributions varied widely across tasks, and for several tasks, smaller models matched younger children's errors better. In addition, even the best-performing VLMs struggled on matrix reasoning and mental rotation tasks. Thus, current VLM architectures align only partially with the cognitive abilities of children.

2606.05494 2026-06-05 cs.CL cs.AI

MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization

Ahmed Alansary, Ali Hamdi

Affiliations University of Science and Technology of China

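AI summary Proposes a multi-model adaptive selection framework that combines several fine-tuned Transformer models and picks the best summary using automatic evaluation metrics, reaching a BERTScore of 88.63% on CNN/DailyMail and outperforming LLMs such as GPT3-D2.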

Comments 6 pages, 3 figures, IMSA2026

Abstract

Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces an adaptive selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within an adaptive selection strategy to improve the quality and robustness of automatic text summarization systems.
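
The generate-score-select loop described in this abstract can be sketched in a few lines. This is a hedged illustration only: the unigram F1 below is a toy stand-in for the lexical and semantic metrics the paper uses, the candidate texts and model names are invented, and scoring against a reference text is an assumption, since the abstract does not say what the candidates are scored against.

```python
def unigram_f1(candidate, reference):
    """Toy lexical-overlap score on unique lowercase tokens (stand-in metric)."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def select_summary(candidates, reference, metric=unigram_f1):
    """Return the (model_name, summary) pair with the highest metric score."""
    return max(candidates.items(), key=lambda item: metric(item[1], reference))

# Hypothetical candidates from three summarization models.
reference = "the council approved the new budget on monday"
candidates = {
    "model_a": "council approved new budget",
    "model_b": "weather was sunny on monday",
    "model_c": "the council approved the new budget on monday",
}
best_name, best_summary = select_summary(candidates, reference)
```

Because `metric` is a parameter, a semantic scorer could be swapped in without changing the selection logic, which reflects the framework's separation between candidate generation and adaptive selection.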

2606.05491 2026-06-05 cs.CV cs.RO

Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

无配对RGB-热成像高斯泼溅使用视觉几何变换器

Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle

发表机构 * Ecole Polytechnique Federale de Lausanne(洛桑联邦理工学院) Schindler EPFL Lab(迅达EPFL实验室)

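AI summary: Proposes an unpaired RGB-thermal novel view synthesis framework that estimates per-modality camera poses with VGGT, aligns them via Procrustes, and jointly reconstructs the scene with multi-modal 3D Gaussian Splatting, achieving thermal view synthesis while preserving RGB fidelity.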

Comments Accepted at ICRA 2026's Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding

详情
AI中文摘要

结合RGB和热成像的多模态新视角合成(NVS)能够利用视觉和热信息进行精确的3D场景重建。然而,现有方法通常依赖于精确校准的RGB-热成像图像对或立体设置,限制了可扩展性和实际部署。为了解决这个问题,我们引入了一个无配对RGB-热成像NVS框架,该框架利用VGGT(一种3D前馈变换器架构)独立估计每个模态的相机位姿。然后使用Procrustes算法与跨模态特征匹配器对齐位姿集,从而无需配对校准即可实现联合配准。在此对齐基础上,我们进一步提出了一种多模态3D高斯泼溅方法,直接从无配对的RGB和热成像图像中学习。在多种场景上的实验表明,我们的方法在热成像视图合成中取得了有竞争力的性能,同时保持了RGB保真度。此外,我们表明现有的重建方法可能产生缺乏跨模态一致性的特定模态重建。因此,我们引入了一个基准框架,以严格评估每个模态的图像合成以及重建场景的多模态一致性。

英文摘要

Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
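Illustrative sketch of Procrustes alignment between two pose sets. This is a deliberately reduced planar version, not the paper's method: points are 2D and stored as complex numbers, so scale and rotation collapse into one complex factor `a` and translation into `b`. The real setting is 3D and would need an SVD-based solve. Correspondences are assumed to be given by index, whereas the abstract obtains them from a cross-modal feature matcher. All names and values are hypothetical.

```python
import cmath


def procrustes_2d(src, dst):
    """Least-squares similarity transform dst ~ a * src + b over complex points.

    a encodes scale and rotation, b encodes translation.
    """
    n = len(src)
    mu_s, mu_d = sum(src) / n, sum(dst) / n
    num = sum((s - mu_s).conjugate() * (d - mu_d) for s, d in zip(src, dst))
    den = sum(abs(s - mu_s) ** 2 for s in src)
    a = num / den
    return a, mu_d - a * mu_s


# Hypothetical camera centers from the RGB reconstruction (x + 1j * y).
rgb_centers = [0 + 0j, 1 + 0j, 1 + 1j, 0 + 2j]
# Simulated thermal-frame centers: scaled by 2, rotated 30 degrees, shifted.
true_a, true_b = 2 * cmath.exp(1j * cmath.pi / 6), 3 - 1j
thermal_centers = [true_a * c + true_b for c in rgb_centers]

a, b = procrustes_2d(rgb_centers, thermal_centers)
print("scale:", abs(a), "rotation (rad):", cmath.phase(a), "shift:", b)
```

With noise-free correspondences the recovered `a` and `b` match the simulated transform up to floating-point error; with noisy matches the same formula gives the least-squares fit.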

2606.05489 2026-06-05 cs.CV cs.DB

LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

LLM引导的ANN索引优化用于人-物交互检索

Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari

发表机构 * Iowa State University(爱荷华州立大学) Intel Corporation(英特尔公司)

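AI summary: Proposes a phase-aware LLM agent that optimizes coupled parameter spaces in stages, substantially improving vector-retrieval throughput on benchmarks such as HICO-DET.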

Comments 13 pages, 5 figures, 8 tables

详情
AI中文摘要

检索系统支撑着现代AI应用——涵盖视觉搜索、推荐引擎和多模态问答。现代多阶段检索系统需要联合优化高度耦合的参数,然而传统的超参数优化(HPO)方法——包括树结构Parzen估计器(TPE)和高斯过程贝叶斯优化——依赖于独立性假设,这从根本上阻止了它们在这些耦合配置空间中的导航。我们通过一个阶段感知的大语言模型(LLM)智能体来解决这一限制,该智能体将每个提案基于其完整的优化历史进行条件化,在阶段划分的探索、利用和微调阶段中导航耦合参数空间。在HICO-DET人-物交互检索基准上使用Intel VDMS(视觉数据管理系统)进行评估,我们的智能体在SIEVE(向量搜索效率的保障索引评估,一种质量约束的吞吐量指标)下比Optuna TPE高出+33.3%,比VDTuner高出+34.2%,相比UniIR实现了15.3倍的吞吐量提升。在三个基准上的验证证实,智能体的优势随参数耦合程度增加而增长:在HICO-DET(高耦合)上+33.3%,在GLDv2(中等耦合)上方法收敛于1%以内,在SIFT1M(近独立控制)上收敛于3.6%以内。在Milvus上的跨系统验证确认,优化器在所有三个数据集上排名第一且无需修改,展示了跨向量数据库管理系统(VDBMS)平台的可迁移性。

英文摘要

Retrieval systems underpin modern AI applications, spanning visual search, recommendation engines, and multi-modal question answering. Modern multi-stage retrieval systems require the joint optimization of highly coupled parameters, yet traditional hyperparameter optimization (HPO) methods, including Tree-structured Parzen Estimators (TPE) and Gaussian Process Bayesian Optimization, rely on an independence assumption that fundamentally prevents them from navigating these coupled configuration spaces. We address this limitation with a phase-aware large language model (LLM) agent that conditions each proposal on its full optimization history, navigating the coupled parameter space across phase-partitioned exploration, exploitation, and fine-tuning stages. Evaluated on the HICO-DET human-object interaction retrieval benchmark using Intel VDMS (Visual Data Management System), our agent outperforms Optuna TPE by 33.3% and VDTuner by 34.2% under SIEVE (Safeguarded Index Evaluation of Vector-search Efficiency, a quality-constrained throughput metric), delivering a 15.3x throughput gain over UniIR. Validation across three benchmarks confirms that the agent's advantage grows with the degree of parameter coupling: the agent leads by 33.3% on HICO-DET (high coupling), whereas the methods converge within 1% on GLDv2 (moderate coupling) and within 3.6% on SIFT1M (near-independent control). Cross-system validation on Milvus confirms the optimizer ranks first on all three datasets without modification, demonstrating transferability across vector database management system (VDBMS) platforms.

2606.05486 2026-06-05 cs.CL cs.LG

Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

通过探针目标归因定位大型语言模型中的提示歧义

Govind Ramesh, Yao Dou, Wei Xu

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

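AI summary: Proposes PRIG, which uses a linear probe and gradient attribution to localize ambiguous spans in a prompt through intermediate representations rather than the output layer, achieving high AUROC on synthetic and human-written benchmarks.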

Comments 23 pages, 5 figures, 5 tables

详情
AI中文摘要

提示歧义是大型语言模型中常见的失败原因,但由于它是提示的潜在属性,难以定位,而现有的归因方法旨在解释可观察的输出,如logits或生成的token。我们引入了PRIG,一种梯度归因方法,使用探针logit将潜在歧义归因于token位置。具体来说,PRIG训练一个线性探针来区分清晰提示和模糊提示,并将探针分数归因于残差流中早期的token表示。为了实现token级别的评估,我们通过重写每个提示中的一个关键句子,构建了涵盖编码、数学和写作的合成歧义数据集,并用人工编写的黄金基准进行补充。在这种设置下,PRIG在定位歧义片段方面显著优于梯度归因基线,在组合合成基准上达到0.840 AUROC,在黄金集上达到0.891 AUROC。它在句子级别的歧义识别上也优于GPT-5.4,并在域外保留了有用的信号。这些结果确立了PRIG作为一种实用工具,用于识别提示中哪些部分存在歧义。更广泛地说,它们表明潜在提示属性可以通过中间表示而非输出级归因来定位。

英文摘要

Prompt ambiguity is a common source of failure in large language models, but is difficult to localize because it is a latent property of the prompt, while existing attribution methods are designed to explain observable outputs such as logits or generated tokens. We introduce PRIG, a gradient attribution method that uses a probe logit to attribute latent ambiguity to token positions. Specifically, PRIG trains a linear probe to distinguish clear prompts from ambiguous prompts and attributes the probe score to earlier token representations in the residual stream. To enable token-level evaluation, we construct synthetic ambiguity datasets across coding, math, and writing by rewriting one task-critical sentence per prompt, and complement them with a human-written gold benchmark. In this setting, PRIG localizes ambiguous spans substantially better than gradient attribution baselines, achieving 0.840 AUROC on the combined synthetic benchmark and 0.891 AUROC on the gold set. It also outperforms GPT-5.4 on sentence-level ambiguity identification and retains useful signal out-of-domain. These results establish PRIG as a practical tool for identifying which parts of a prompt are ambiguous. More broadly, they suggest that latent prompt properties can be localized through intermediate representations, rather than through output-level attribution.
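Illustrative sketch of attributing a probe logit to token positions. It is a toy, not PRIG itself: the probe reads a mean-pooled vector of token representations, so the gradient with respect to each token vector is simply `w / T` and gradient-times-input can be written in closed form. In a real model the probe would read a later residual-stream state and the gradient would come from automatic differentiation through the network. The probe weights and token vectors are hypothetical.

```python
def probe_logit(tokens, w, b):
    """Linear probe applied to the mean-pooled token vectors."""
    dim = len(w)
    pooled = [sum(t[j] for t in tokens) / len(tokens) for j in range(dim)]
    return sum(wj * pj for wj, pj in zip(w, pooled)) + b


def token_attributions(tokens, w):
    """Gradient-times-input per token; here d(logit)/d(token) equals w / T."""
    count = len(tokens)
    return [sum(wj * tj for wj, tj in zip(w, t)) / count for t in tokens]


w, b = [1.0, -2.0], 0.5                              # hypothetical trained probe
tokens = [[0.1, 0.0], [2.0, -1.0], [0.2, 0.1]]       # hypothetical token states
attr = token_attributions(tokens, w)
most_ambiguous = max(range(len(attr)), key=attr.__getitem__)
print(attr, most_ambiguous)
```

Because the toy map is linear, the attributions sum exactly to the logit minus the bias, which is a handy sanity check when swapping in a real attribution method.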

2606.05484 2026-06-05 cs.LG

Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

学习子空间压缩以实现通信高效的流水线并行

Paul Janson, Edouard Oyallon, Eugene Belilovsky

发表机构 * Concordia University(康科迪亚大学) Mila Quebec AI Institute(魁北克人工智能研究所) CNRS, Sorbonne University(法国国家科学研究中心,索邦大学)

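AI summary: Proposes MAPL, which learns a task-optimal orthogonal projection for each pipeline stage under a Stiefel manifold constraint and combines it with factorized anchor embeddings and residual vector quantization, achieving high compression with negligible performance loss on low-bandwidth networks.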

Comments Accepted at the 2nd Workshop on Connecting Low-rank Representations in AI, ICML 2026

详情
AI中文摘要

流水线并行使得训练超过单设备内存的大型语言模型成为可能,然而在低带宽网络上训练时,阶段间激活通信成为主要瓶颈。最近的工作提出使用固定正交投影来压缩激活,但这仍然会导致显著的性能下降,并且需要大量非标准调整来约束优化。一个自然的替代方案是为每个流水线阶段学习一个低秩投影,但在训练过程中保持这些投影器的必要正交性仍然是一个挑战。我们提出了流形感知投影学习(MAPL),该方法将阶段间压缩视为在显式Stiefel流形(正交矩阵)约束下的可学习正交投影。MAPL不是规定固定的全局子空间,而是让每个流水线阶段通过流形约束的最速下降法发现并持续适应其任务最优的压缩子空间。为了在阶段边界恢复特定于token的信号,我们引入了每个阶段的因子化锚点嵌入,使得能够以可忽略的通信开销实现全秩激活重建。我们进一步展示了可以在投影后结合残差向量量化,并采用流式码本同步协议来分摊字典通信。在从150M到1B参数的LLaMA模型上,我们表明MAPL可以轻松应用于现有流水线,并且能够实现高压缩比,性能下降可忽略不计,与子空间网络相比,在性能与压缩之间的权衡得到了显著改善。

英文摘要

Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when training on low-bandwidth networks. Recent work in this area has proposed using fixed orthogonal projections to compress activations. However, this still results in significant performance degradation and requires a number of non-standard adaptations to constrain the optimization. A natural alternative is to learn a low-rank projection for each pipeline stage; however, maintaining the necessary orthogonality of these projectors during training remains a challenge. We present Manifold Aware Projection Learning (MAPL), a method that treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel manifold (orthogonal matrices) constraints. Rather than prescribing a fixed global subspace, MAPL lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, we introduce per-stage factorized anchor embeddings that allow for full-rank activation reconstruction with negligible communication overhead. We further show that we can incorporate residual vector quantization after projection with a streaming codebook synchronization protocol that amortizes dictionary communication. Across LLaMA models from 150M to 1B parameters, we show that MAPL can be easily applied to existing pipelines and achieves high compression with negligible performance degradation, giving a drastically improved tradeoff between performance and compression compared to Subspace Networks.
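Illustrative sketch of compressing an activation with an orthonormal projector and restoring it on the receiving stage. Re-orthonormalizing with Gram-Schmidt after an update is used here as a simple way to return to the Stiefel manifold; it is an assumption standing in for the manifold-constrained steepest descent named in the abstract, whose details are not given there. Anchor embeddings and vector quantization are omitted, and all values are hypothetical.

```python
import math


def orthonormalize(columns):
    """Modified Gram-Schmidt: turn independent vectors into an orthonormal set."""
    basis = []
    for v in columns:
        v = list(v)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis


def compress(basis, x):
    """Sender side: coordinates of x in the k-dimensional subspace."""
    return [sum(a * b for a, b in zip(q, x)) for q in basis]


def reconstruct(basis, z):
    """Receiver side: map the k coordinates back to the full dimension."""
    dim = len(basis[0])
    return [sum(zi * q[j] for zi, q in zip(z, basis)) for j in range(dim)]


# Projector columns after a hypothetical gradient step (no longer orthonormal).
drifted = [[1.0, 0.1, 0.0, 0.0], [0.2, 1.0, 0.0, 0.0]]
basis = orthonormalize(drifted)

activation = [3.0, 4.0, 0.0, 0.0]      # lies inside the learned subspace
code = compress(basis, activation)     # 2 numbers are sent instead of 4
restored = reconstruct(basis, code)
print(code, restored)
```

Any component of the activation outside the subspace is lost by this projection alone, which is the gap the abstract's anchor embeddings are meant to close.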

2606.05481 2026-06-05 cs.LG cs.AI eess.SP

Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models

面向统一且数据高效的预测与健康管理:基于表格基础模型

Raffael Theiler, Lev Telyatnikov, Leandro Von Krannichfeldt, Olga Fink

发表机构 * IMOS Lab, EPFL(IMOS实验室,瑞士联邦理工学院)

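AI summary: Proposes applying tabular foundation models to industrial time series via in-context learning for Prognostics and Health Management (PHM) tasks; the models are competitive in low-data regimes and achieve better average ranks than sequence models and gradient-boosted trees.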

详情
AI中文摘要

数据驱动的预测与健康管理(PHM)利用时变状态监测数据来诊断系统状态并估计工程资产的剩余使用寿命。这些任务是维护规划的核心,但工业PHM数据通常是碎片化的、部分观测且标注不足,这阻碍了监督学习。基础模型提供了一条通往可重用预测系统的途径,然而大多数时间序列基础模型是为预测设计的,并假设长序列、连贯且规则采样。为弥补这一差距,我们提出了一个框架,利用上下文学习将表格基础模型应用于工业时间序列,并在多种PHM任务上对其进行评估。通过将原始单元级信号转换为表格行,我们展示了这些模型在多个任务(包括预测和诊断)上表现良好,且数据效率高。我们在统一的评估协议下,直接将其与序列模型、Transformer基线和梯度提升树进行比较。结果表明,表格基础模型在预测和诊断任务中取得了最佳平均排名。我们的发现进一步表明,基于PFN的模型在低数据场景下具有竞争力,时间上下文可以在表格表示中保留,且性能依赖于子采样下的代表性上下文构建。这些结果证明,表格基础模型为异构PHM问题提供了一个实用且通用的接口。

英文摘要

Data-driven Prognostics and Health Management (PHM) uses time-varying condition-monitoring data to diagnose system states and estimate remaining useful life in engineered assets. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning. Foundation models offer a route toward reusable predictive systems, yet most time-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in-context learning, and we evaluate them on a variety of PHM tasks. By converting raw unit-level signals into tabular rows, we show that these models perform well across multiple tasks, including prognostics and diagnostics, and are highly data efficient. We compare them directly with sequence models, transformer baselines, and gradient-boosted trees under a common evaluation protocol. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks. Our findings further show that PFN-based models are competitive in low-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems.
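Illustrative sketch of turning a raw unit-level signal into tabular rows, one row per sliding window. The feature set (mean, spread, range, trend), the window parameters, and the column names are assumptions chosen for the example; the abstract does not specify how rows are constructed. Keeping the window end index and a trend feature is one simple way to retain temporal context in a table.

```python
from statistics import mean, pstdev


def window_rows(signal, unit_id, width, stride):
    """Summarize each sliding window of a 1-D signal as one tabular row."""
    rows = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        rows.append({
            "unit": unit_id,
            "t_end": start + width - 1,                    # last time index covered
            "mean": mean(window),
            "std": pstdev(window),
            "min": min(window),
            "max": max(window),
            "trend": (window[-1] - window[0]) / (width - 1),  # average step change
        })
    return rows


# Hypothetical sensor trace for one asset.
rows = window_rows([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], unit_id="engine_1", width=4, stride=2)
for row in rows:
    print(row)
```

Rows built this way from labeled units could serve as the in-context examples for a tabular model, with rows from the unit under test as queries.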

2606.05478 2026-06-05 cs.CV cs.LG

Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?

我们能否在生成之前预测文生图内容的人类偏好,以及这样做是否有用?

Joong Ho Kim, Keith G. Mills

发表机构 * LSU ATHENA Lab(LSU ATHENA实验室)

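AI summary: Studies whether Human Preference Metric (HPM) scores can be predicted before a diffusion model generates an image, uses such predictions to improve generation quality, and evaluates which HPMs suit the task.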

Comments Code is available at https://github.com/LSU-ATHENA/HPM-Predict

详情
AI中文摘要

扩散模型(DM)通过从用户提示中合成高质量、逼真的视觉内容,彻底改变了文本驱动的生成。而先前视觉生成的进展(如VAE和GAN)主要基于感知或视觉相似性指标(如FID、PSNR)进行评估,DM的进展促进了更先进的人类偏好指标(HPM)的发展,这些指标将人类判断建模并量化为标量值。然而,DM使用固有的随机过程合成内容,其中随机噪声种子生成。初始随机噪声直接定性和定量地影响生成输出的质量。这种影响在本地部署场景的小型模型中尤为显著。鉴于这一现象,我们首先研究在投入计算资源进行生成之前,我们能在多大程度上预测标量HPM分数。进一步,我们研究能在多大程度上利用这种预测来改善生成图像的质量,并研究哪些HPM最适合此任务。我们的研究表明,这不仅是可能的,而且可以实现可忽略的硬件开销。

英文摘要

Diffusion Models (DMs) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID and PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPMs) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process where random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models for local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. We then investigate to what extent we can leverage such prediction to improve the quality of generated images, and also study which HPMs are best suited for this task. Our investigation reveals that this is not only possible but also feasible with negligible hardware overhead.

2606.05471 2026-06-05 cs.CV

Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning

形式概念格是基于概念学习的好语义支架

Deepika SN Vemuri, Sayanta Adhikari, Ankit Saha, Krishn Vishwas Kher, Vineeth N Balasubramanian

发表机构 * Amazon, India(亚马逊(印度)) Microsoft Research(微软研究院)

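AI summary: Uses concept lattices from Formal Concept Analysis as semantic scaffolds that guide a neural network to learn hierarchically structured concept representations at different depths, improving interpretability and intervention effectiveness.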

Comments Accepted at ICML 2026

详情
AI中文摘要

学习语义对于深度学习模型的可解释性和与人类推理的一致性至关重要。基于概念的模型通过有意义的语义抽象来表示类别,但通常将所有概念视为在单个神经网络层学习的扁平、无结构集合。这忽略了人类语义理解的一个基本属性:概念按层次组织,从一般到具体。虽然深度网络确实学习了视觉特征的层次结构,但这种结构很少与显式的语义层次对齐。借鉴形式概念分析,我们证明了形式概念格提供了原则性的语义支架来指导神经网络学习。这些格自然地根据概念的普遍性级别确定了应在网络的何处学习概念。这使得模型能够在其深度中发展出分阶段、语义基础的表示。在真实世界数据集上的实验结果表明,我们的模型产生了更可解释的嵌入,支持更有效的干预,并学习了既有意义又具有层次结构的概念表示。

英文摘要

Learning semantics is essential for deep learning models to be interpretable and better aligned with human reasoning. Concept-based models approach this by representing classes through meaningful semantic abstractions, but typically treat all concepts as a flat, unstructured set learned at a single neural network layer. This overlooks a fundamental property of human semantic understanding: concepts are organized hierarchically, from general to specific. While deep networks do learn a hierarchy of visual features, this structure is rarely aligned with explicit semantic hierarchies. Drawing on Formal Concept Analysis, we demonstrate that formal concept lattices provide principled semantic scaffolds to guide neural network learning. These lattices naturally identify where in the network concepts should be learned based on their level of generality. This allows the model to develop staged, semantically grounded representations throughout its depth. Empirical results on real-world datasets show that our models produce more interpretable embeddings, support more effective interventions, and learn concept representations that are both meaningful and hierarchically structured.
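Illustrative sketch of how a formal concept lattice is derived from a binary object-attribute table, which is the standard construction in Formal Concept Analysis. A formal concept is a pair (extent, intent) where the extent is exactly the set of objects sharing the intent, and the intent is exactly the set of attributes shared by the extent. The brute-force enumeration below only suits tiny contexts, and the example context is hypothetical. How concepts are assigned to network layers is the paper's contribution and is not reproduced here.

```python
from itertools import combinations


def formal_concepts(context):
    """context: dict mapping object -> set of attributes. Returns (extent, intent) pairs."""
    objects = sorted(context)
    all_attrs = set().union(*context.values())

    def intent(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        return set.intersection(*(context[o] for o in objs)) if objs else set(all_attrs)

    def extent(attrs):
        # Objects that carry every attribute in attrs.
        return {o for o in objects if attrs <= context[o]}

    concepts = set()
    for size in range(len(objects) + 1):
        for objs in combinations(objects, size):
            shared = intent(objs)
            concepts.add((frozenset(extent(shared)), frozenset(shared)))
    return concepts


context = {
    "sparrow": {"animal", "bird", "flies"},
    "penguin": {"animal", "bird"},
    "dog": {"animal"},
}
lattice = formal_concepts(context)
# Order from general (few attributes, many objects) to specific.
for ext, inte in sorted(lattice, key=lambda c: len(c[1])):
    print(sorted(inte), "->", sorted(ext))
```

Sorting by intent size gives the general-to-specific ordering that the abstract associates with network depth.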

2606.05468 2026-06-05 cs.RO

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

FlowPRO:通过近端偏好优化对流匹配VLA进行无奖励强化微调

Yihao Wu, He Zhang, Junbo Tan, Xueqian Wang, Zhengyou Zhang

发表机构 * Tencent Robotics X(腾讯机器人X实验室) Futian Laboratory(福田实验室) Tsinghua University(清华大学)

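AI summary: Proposes FlowPRO, which combines a proximalized preference optimization objective (RPRO) with intervention-and-rollback data collection for reward-free offline reinforced fine-tuning, attaining the highest success rate on four long-horizon bimanual tasks.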

详情
AI中文摘要

将视觉-语言-动作(VLA)模型后训练为可在真实机器人上可靠部署的策略仍然是一个主要瓶颈。SFT和DAgger仅间接利用失败信号,而基于奖励的强化学习则受限于真实世界奖励设计的难度以及训练可靠评论家的困难。我们提出FlowPRO,一种针对流匹配VLA的无奖励离线强化微调框架。在算法上,我们提出RPRO(机器人流匹配近端偏好优化),一种针对VLA模型流匹配动作头定制的偏好优化目标。RPRO将对比优化器与显式近端正则化器配对,该正则化器锚定隐式奖励的绝对幅度,从而消除了普通Flow-DPO的奖励黑客失败模式。在数据方面,一种遥操作干预-回滚范式通过单个操作员动作在真实机器人上自然产生成对的正负轨迹$(τ^w, τ^l)$;平滑插值过程结合批量混合,然后将这些稀疏修正转换为密集的每状态监督,同时保留基础策略的能力。在四项长时程双臂任务上,FlowPRO取得了最高成功率,优于四个代表性基线,消融实验证实了每个损失组件的贡献。

英文摘要

Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. SFT and DAgger exploit failure signals only indirectly, and reward-based RL is bottlenecked by the difficulty of real-world reward design and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories $(\tau^w, \tau^l)$ on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.

2606.05464 2026-06-05 cs.AI

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

大语言模型中在扩展搜索空间上的逐步优化类推理

Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学)

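AI summary: Introduces the OPT* task family and improves step-by-step optimization-like reasoning of LLMs over expanding search spaces through solver-guided online policy optimization and search-based offline RL, both trained with verifiable rewards.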

详情
AI中文摘要

可验证奖励训练改善了数学和编码推理,但这些领域仅涵盖了逐步决策的一部分。许多现实任务需要在众多有效备选方案中找到高价值的可行计划。我们引入OPT*,一个可扩展的优化风格任务族,用于沿复杂度轴训练和评估LLM的逐步优化类推理:每个任务提供可行性检查器和评估器,而复杂度参数扩展搜索空间,无需新的人工标签。这促使我们在两种机制下研究这些任务:(i) 求解器引导的在线策略优化,使用求解器作为部分状态的价值预言机,并应用基于排名的奖励塑造来强化更好的下一步;(ii) 当此类求解器不可用时,基于搜索的离线强化学习。理论上,我们将大搜索空间中的成功与推理者在每单位搜索预算中提取的信息联系起来。实证上,我们消融了使OPT*上搜索高效的要素,并表明在OPT*上训练改进了逐步优化类推理。

英文摘要

Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.
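Illustrative sketch of the ingredients named above: a task with a feasibility checker and an evaluator, a complexity parameter that enlarges the search space, a solver used as a value oracle for partial plans, and rank-based rewards for candidate next steps. The knapsack instance, the item formulas, and the dense-rank reward rule are all assumptions made for the example; the abstract does not define the actual OPT* tasks or shaping function.

```python
from functools import lru_cache


def make_task(n):
    """Hypothetical knapsack instance; n is the complexity parameter."""
    weights = [2 + (3 * i) % 5 for i in range(n)]
    values = [3 + (7 * i) % 6 for i in range(n)]
    return weights, values, 2 * n          # third entry is the capacity


def is_feasible(task, chosen):
    """Feasibility checker: no repeated items and total weight within capacity."""
    weights, _, capacity = task
    return len(set(chosen)) == len(chosen) and sum(weights[i] for i in chosen) <= capacity


def evaluate(task, chosen):
    """Evaluator: total value of the chosen items."""
    return sum(task[1][i] for i in chosen)


def oracle_value(task, chosen):
    """Solver as value oracle: best total value reachable from a partial plan."""
    weights, values, capacity = task
    rest = [i for i in range(len(weights)) if i not in chosen]

    @lru_cache(maxsize=None)
    def best(k, room):
        if k == len(rest):
            return 0
        skip = best(k + 1, room)
        i = rest[k]
        take = values[i] + best(k + 1, room - weights[i]) if weights[i] <= room else 0
        return max(skip, take)

    return evaluate(task, chosen) + best(0, capacity - sum(weights[i] for i in chosen))


def rank_rewards(task, chosen):
    """Reward each feasible next item by the dense rank of its oracle value."""
    moves = [i for i in range(len(task[0]))
             if i not in chosen and is_feasible(task, chosen + [i])]
    vals = {i: oracle_value(task, chosen + [i]) for i in moves}
    levels = sorted(set(vals.values()))
    denom = max(len(levels) - 1, 1)
    return {i: levels.index(v) / denom for i, v in vals.items()}


task = make_task(5)
rewards = rank_rewards(task, [])
print(rewards)
```

Raising `n` grows the number of candidate plans without any new labels, since the checker, evaluator, and oracle are all computed from the instance itself.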

2606.05460 2026-06-05 cs.CV

ORACLE-CT: Anatomy-Aware Support Pooling for CT Classification

ORACLE-CT:用于CT分类的解剖感知支持池化

Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

发表机构 * Center for Virtual Imaging Trials, RAI Labs, Department of Radiology, Duke University(虚拟成像试验中心,RAI实验室,放射学系,杜克大学) Electrical and Computer Engineering, Pratt School of Engineering, Duke University(电气与计算机工程,工程学院,杜克大学) Department of Mathematics, Trinity College of Arts & Sciences, Duke University(数学系,艺术与科学学院,杜克大学) Department of Radiology and Imaging Sciences, University of Arizona College of Medicine(放射学与影像科学系,亚利桑那大学医学院)

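AI summary: Proposes ORACLE-CT, which uses multi-organ segmentation to define label-specific anatomical support regions and restricts attention pooling to them, addressing the mismatch between localized disease evidence and global aggregation in CT classification and improving performance across multiple encoders.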

详情
AI中文摘要

腹部CT疾病分类具有挑战性,因为每次扫描都是一个包含许多可能发现的大3D体积,而诊断证据通常局限于特定器官或解剖隔室。大多数研究级分类器使用与解剖无关的池化或注意力来聚合编码器特征,造成了局部疾病证据与全局证据聚合之间的不匹配。我们提出ORACLE-CT,一个与编码器无关的解剖感知聚合框架,它使用多器官分割来定义标签特定的解剖支持,并将注意力池化限制在相关区域。该框架支持单器官、多器官联合、比较、局部和全局支持策略。我们使用三个编码器系列评估ORACLE-CT:DINOv3、I3D-ResNet-121和放射学原生Pillar-0编码器。模型在MERLIN上进行端到端训练,并在内部评估以及在冻结外部迁移到Duke-Abdomen和AMOS下进行评估。与全局平均池化相比,支持掩蔽池化将DINOv3的MERLIN宏AUROC/AUPRC从0.838/0.638提高到0.858/0.676,将I3D-ResNet-121从0.829/0.617提高到0.848/0.659。在协调的10标签外部评估中,DINOv3在Duke-Abdomen上从0.802/0.628提高到0.835/0.683,在AMOS上从0.742/0.313提高到0.762/0.350,I3D-ResNet-121也有类似增益。对于Pillar-0,大部分增益来自学习注意力,解剖掩蔽的额外收益较小。ORACLE-CT提高了区分度和外部鲁棒性,同时保留了预测与解剖证据之间的可审计联系。

英文摘要

Abdominal CT disease classification is challenging because each scan is a large 3D volume with many possible findings, while diagnostic evidence is often confined to specific organs or anatomical compartments. Most study-level classifiers aggregate encoder features using anatomy-agnostic pooling or attention, creating a mismatch between localized disease evidence and global evidence aggregation. We propose ORACLE-CT, an encoder-agnostic anatomy-aware aggregation framework that uses multi-organ segmentation to define label-specific anatomical supports and restrict attention pooling to relevant regions. The framework supports single-organ, multi-organ union, comparative, localized, and global support strategies. We evaluate ORACLE-CT with three encoder families: DINOv3, I3D-ResNet-121, and the radiology-native Pillar-0 encoder. Models are trained end-to-end on MERLIN and evaluated internally and under frozen external transfer to Duke-Abdomen and AMOS. Compared with global average pooling, support-masked pooling improved MERLIN macro-AUROC/AUPRC from 0.838/0.638 to 0.858/0.676 for DINOv3 and from 0.829/0.617 to 0.848/0.659 for I3D-ResNet-121. On harmonized 10-label external evaluation, DINOv3 improved on Duke-Abdomen from 0.802/0.628 to 0.835/0.683 and on AMOS from 0.742/0.313 to 0.762/0.350, with similar gains for I3D-ResNet-121. For Pillar-0, most gains came from learned attention, with smaller additional benefit from anatomical masking. ORACLE-CT improves discrimination and external robustness while preserving an auditable link between predictions and anatomical evidence.
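Illustrative sketch of support-masked attention pooling: attention weights are a softmax over token scores, but only tokens inside the anatomical support for a given label take part. This is a minimal single-label version under stated assumptions: the scores would normally come from a learned attention head, the support mask from a segmentation model, and the feature vectors from an encoder; here all three are hypothetical hand-written values.

```python
import math


def masked_attention_pool(features, scores, support):
    """Softmax-weighted average of token features, restricted to the support mask."""
    kept = [i for i, inside in enumerate(support) if inside]
    if not kept:
        raise ValueError("support mask selects no tokens")
    top = max(scores[i] for i in kept)               # subtract max for stability
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    dim = len(features[0])
    return [sum(exps[i] / total * features[i][j] for i in kept) for j in range(dim)]


features = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]   # three hypothetical tokens
scores = [0.0, 0.0, 5.0]                             # token 2 scores highest...
liver_support = [True, True, False]                  # ...but lies outside the organ

pooled = masked_attention_pool(features, scores, liver_support)
print(pooled)
```

Setting every mask entry to `True` recovers ordinary attention pooling, so the mask is the only anatomy-specific ingredient in this sketch.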

2606.05458 2026-06-05 cs.CV

Horse Eye Blink Detection and Classification for Equine Affective State Assessment

马匹眼睛眨眼检测与分类用于马匹情感状态评估

João Alves, Signe Møller-Skuldbøl, Pia Haubro Andersen, Rikke Gade

发表机构 * Visual Analysis and Perception Lab, Aalborg University(视觉分析与感知实验室,奥尔堡大学) Department of Animal Biosciences, Swedish University of Agricultural Sciences(动物生物科学系,瑞典农业科学大学)

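AI summary: Develops and evaluates three video-based methods for automated horse blink classification (a frame-level YOLOv12 detector, optical flow magnitude thresholding, and a fine-tuned VideoMAE model), reaching a macro-F1 of 0.898 on blink classification and 0.926 on binary blink detection on a publicly available dataset, and showing both the potential and the challenges of fine-grained action unit detection for equine welfare monitoring.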

Comments CVPRW2026 CV4Animals

详情
AI中文摘要

自动检测马匹面部动作单元(AUs)是评估马匹疼痛和情感状态的一个有前景但尚未充分探索的途径。半眨眼和全眨眼运动被认为是疼痛和压力的识别指标,但作为微表情,其细微、精细的特性使其容易被肉眼忽略,只能通过逐帧视频检查才能辨别,这使得从视频中进行可靠的自动检测成为一项特别艰巨的任务。我们开发并评估了三种从马匹视频中自动分类眨眼的方法:基于帧的YOLOv12检测器、光流幅度阈值方法以及微调的VideoMAE模型,并在公开数据集上进行了测试。我们在眨眼分类任务上达到了0.898的宏F1分数,在二元眨眼检测上达到了0.926。我们的结果突显了细粒度AU检测在马匹福利监测中的潜力和固有挑战。

英文摘要

Automated detection of equine facial action units (AUs) is a promising yet under-explored avenue for pain and affective state assessment in horses. Half- and full-blink movements are recognised indicators of pain and stress, but as micro-expressions, their subtle, fine-grained nature makes them easily missed by the naked eye and only discernible through frame-by-frame video inspection, making reliable automated detection from video a particularly demanding task. We develop and evaluate three methods for automated blink classification from horse videos: a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, tested on a publicly available dataset. We achieve a macro-F1 score of 0.898 on blink classification and 0.926 on binary blink detection. Our results highlight both the potential and the inherent challenges of fine-grained AU detection for equine welfare monitoring.
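Illustrative sketch of the macro-F1 metric reported above. Macro-F1 computes an F1 score per class and averages them with equal weight, so a rare class such as a half blink counts as much as a frequent one. The class names and predictions below are hypothetical and are not taken from the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical clip-level labels: no blink, half blink, full blink.
y_true = ["none", "half", "full", "full", "none", "half"]
y_pred = ["none", "half", "full", "half", "none", "half"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case one full blink is mistaken for a half blink, which lowers the F1 of both the "full" and "half" classes while "none" stays perfect.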

2606.05455 2026-06-05 cs.CV

Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification

面向不完整图像-表格分类的解缠细粒度原型学习

Feixiang Zhou, Jianyang Xie, Zhuangzhi Gao, Qinkai Yu, Fu Wang, Yuheng Fan, Jing Li, Zheheng Jiang, Yitian Zhao, Yanda Meng, He Zhao, Gregory Y. H. Lip, Yalin Zheng

发表机构 * School of Eye and Vision Sciences, University of Liverpool, U.K.(利物浦大学眼科与视觉科学学院) Department of Cardiovascular and Metabolic Medicine, University of Liverpool, U.K.(利物浦大学心血管与代谢医学系) School of Computer Science, University of Exeter, U.K.(埃克塞特大学计算机科学学院) School of Computer Science and Engineering, South China University of Technology, China(华南理工大学计算机科学与工程学院) School of Computing and Mathematical Sciences, University of Leicester, U.K.(莱斯特大学计算科学与数学科学学院) Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, China(中国科学院宁波工业技术研究所) Bioengineering Program, Biological and Environmental Science and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Saudi Arabia(阿卜杜拉国王科技大学(KAUST)生物与环境科学与工程学部生物工程项目,沙特阿拉伯)

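AI summary: To handle missing modalities in image-tabular multimodal learning, proposes the DFPL framework, which achieves robust classification through shared-specific prototype modeling, prototype-level disentanglement, and fine-grained alignment.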

详情
AI中文摘要

缺失模态问题在广泛的多媒体应用中(包括产品理解、推荐系统和医疗诊断)对图像-表格多模态学习构成了重大挑战。当两种模态高度异质时,这一挑战尤为突出,因为图像和表格属性在语义粒度和数据分布上存在显著差异。现有方法通过对全局令牌平均特征进行解缠和对齐来学习模态不变表示,仅捕获粗粒度的跨模态一致性,忽略了细粒度的语义和分布错位,这阻碍了在缺失模态下利用互补线索。为了解决这个问题,我们提出了DFPL,一种用于细粒度原型学习的新框架。具体来说,共享-特定原型建模(SSPM)提取紧凑且多样化的共享和模态特定原型,并进一步执行原型级解缠以抑制冗余的模态内相关性。此外,我们提出了一个原型引导的细粒度对齐(PFA)模块,该模块在统一的原型空间内联合强制执行原型级分布匹配和原型到类别的语义对齐,从而跨模态保留细粒度的分布和语义一致性。我们还引入了一个类别感知的多尺度聚合(CMA)模块,从全局和原型级别自适应地聚合共享语义和模态特定特征,以实现鲁棒的预测。在三个不同的图像-表格基准上的大量实验表明,我们的方法在各种缺失模态设置下优于先前的方法。代码将公开提供。

英文摘要

The missing-modality problem poses a significant challenge in image-tabular multimodal learning across a wide range of multimedia applications, including product understanding, recommendation systems, and medical diagnosis. This challenge is particularly pronounced when the two modalities are highly heterogeneous, as images and tabular attributes differ substantially in their semantic granularity and data distributions. Existing methods learn modality-invariant representations through disentanglement and alignment over global token-averaged features, capturing only coarse cross-modal consistency and overlooking fine-grained semantic and distributional misalignment, which hampers the exploitation of complementary cues under missing modalities. To address this, we propose DFPL, a novel framework for fine-grained prototype learning. Specifically, Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse shared and modality-specific prototypes, and further performs prototype-level disentanglement to suppress redundant intra-modality correlations. Additionally, we propose a Prototype-guided Fine-grained Alignment (PFA) module that jointly enforces prototype-level distribution matching and prototype-to-class semantic alignment within a unified prototype space, thereby preserving both fine-grained distributional and semantic consistency across modalities. We further introduce a Class-aware Multi-scale Aggregation (CMA) module to adaptively aggregate shared semantics and modality-specific characteristics from global and prototype levels for robust predictions. Extensive experiments on three diverse image-tabular benchmarks demonstrate that our method outperforms previous approaches under various missing-modality settings. Code will be made publicly available.

2606.05449 2026-06-05 cs.AI cs.GT econ.EM

Insurance of Agentic AI

代理型人工智能的保险

Quanyan Zhu

发表机构 * Department of Electrical and Computer Engineering, New York University, Tandon School of Engineering(电气与计算机工程系,纽约大学,工程学院)

AI总结 本文分析了代理型AI带来的新型风险,提出了承保、定价、再保险和产品设计的框架,并构建了整合多种保险覆盖的协调架构。

详情
AI中文摘要

代理型人工智能系统通过超越信息生成,扩展到自主规划、工具调用、决策执行以及对数字和物理环境的持续修改,正在改变风险格局。这些能力引入了新的风险敞口,这些敞口并不完全适合传统的保险类别,如网络、职业责任、产品责任或董事及高管责任保险。本文考察了新兴的代理型AI保险市场,并开发了一个框架来理解其承保、定价、再保险和产品设计的影响。我们将代理型AI描述为自主性和授权委托的连续体,强调信息输出与能够通过外部行动独立产生保险事件的系统之间的区别。我们分析了主要风险路径,包括幻觉、提示注入攻击、自主决策错误、模型漂移、依赖故障和网络物理伤害,并评估了现有保险产品如何适应这些风险敞口。本文进一步提出了一个基于风险暴露评估、情景分析、依赖映射和累积风险管理的精算框架,借鉴了网络保险的发展历程。最后,我们提出了一个协调的保险架构,通过明确的分配机制和专门的AI总限额,整合了网络、技术错误与遗漏、产品责任、性能保证以及明确的AI责任保险。分析表明,代理型AI保险的未来不在于单一的单线产品,而在于一个由改进的治理、透明度、遥测和监管清晰度支持的互补覆盖分层生态系统。

英文摘要

Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous planning, tool invocation, decision execution, and persistent modification of digital and physical environments. These capabilities introduce novel exposures that do not fit neatly within traditional insurance categories such as cyber, professional liability, product liability, or directors and officers coverage. This paper examines the emerging insurance market for agentic AI and develops a framework for understanding its underwriting, pricing, reinsurance, and product-design implications. We characterize agentic AI as a continuum of autonomy and delegated authority, emphasizing the distinction between informational outputs and systems capable of independently generating insured events through external actions. We analyze major risk pathways, including hallucinations, prompt-injection attacks, autonomous decision errors, model drift, dependency failures, and cyber-physical harms, and evaluate how existing insurance products are adapting to address these exposures. The paper further proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance. Finally, we present a coordinated insurance architecture that integrates cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages through explicit allocation mechanisms and dedicated AI aggregates. The analysis suggests that the future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.

2606.05445 2026-06-05 cs.AI

Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

Brick-Composer: 使用多模态大语言模型进行多样化积木组装

Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang, Katherine Driggs-Campbell, Manling Li, Heng Ji

发表机构 * UIUC(伊利诺伊大学香槟分校) Stevens Institute of Technology(史蒂文斯理工学院) Northwestern University(西北大学)

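AI summary: Proposes Brick-Composer, a framework that trains multimodal large language models with three signals (Human Design Sparks, World Feedback, and Synthetic Experience) to address brick selection and pose estimation in brick assembly, raising step-level assembly success from below 1% to around 15%.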

Comments 10 Pages, 10 figures

详情
AI中文摘要

我们梦想着AI代理能够读取任意设计,并从可重复使用的构建块中构建真实世界的物体。作为迈向这一愿景的第一步,我们研究多模态大语言模型(MLLMs)是否具备积木组装所需的视觉基础和空间推理能力。我们将积木组装形式化为一个序列决策问题,其中每一步涉及两个子任务:积木选择,从候选组件中识别目标积木;以及积木姿态估计,预测所选积木应放置的位置和方式。为支持这项研究,我们引入了BC-Bench(积木构建基准),这是第一个用于评估MLLMs在多样化积木组装中表现的基准。实验表明,当前最先进的MLLMs仍然远非可靠的构建者,在细粒度积木选择上挣扎,并且在精确姿态估计上失败。为弥补这一差距,我们提出了Brick-Composer,一个学习框架,通过三种互补信号赋予MLLMs组装技能:人类设计火花,提供富含可供性的构建演示;世界反馈,将预测动作锚定在视觉和物理后果中;以及合成经验,将学习扩展到现有物体设计之外。Brick-Composer将积木选择准确性提高了三倍以上,大幅减少了姿态估计误差,并将严格的步骤级组装成功率从低于1%提升至约15%。训练后,一个Qwen-3-8B模型能够正确完成一个完整物体高达42%的步骤,这表明MLLMs可以通过有针对性的、基于物理的学习获得组装能力。

英文摘要

We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy more than threefold, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B model can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.

2606.05444 2026-06-05 cs.CL cs.AI cs.LG

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

Adriana-Valentina Costache, Eduard Poesina, Silviu-Florin Gheorghe, Paul Irofti, Radu Tudor Ionescu

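Affiliations * Department of Computer Science, University of Bucharest

AI summary Proposes a pipeline that uses cycle-consistent machine translation to generate or expand training data. Translation quality is scored by cosine similarity in the latent space of a BERT model, and the scores weight the loss function, which significantly improves coreference resolution for low-resource languages.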

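Details
AI abstract (translated from Chinese)

Coreference resolution is a core natural language processing task with a wide range of downstream applications, such as machine translation, question answering, and document summarization. Although the task is well studied for English, coreference resolution in other languages, especially low-resource ones, has received comparatively little attention. To close this gap, we propose a novel coreference resolution pipeline that uses machine translation (MT) from English into a target low-resource language to generate or expand training data. To validate the quality of the translated samples automatically, we back-translate them and assess their similarity to the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function so that training samples are weighted by their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages for which no corpora were previously available.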

English abstract

Coreference resolution is a core NLP task with a broad range of downstream applications, e.g., machine translation, question answering, and document summarization. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
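
The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clamping of negative similarities to zero, and the normalization by the sum of weights are all assumptions made for illustration, and the short vectors stand in for BERT embeddings of the original and back-translated English samples.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cycle_weighted_loss(
    per_sample_losses: Sequence[float],
    original_embs: Sequence[Sequence[float]],
    back_translated_embs: Sequence[Sequence[float]],
) -> float:
    """Average per-sample losses, weighting each by its cycle-consistency score.

    Assumption: the weight is the cosine similarity clamped at zero, and the
    weighted sum is normalized by the total weight. The abstract does not
    specify either choice.
    """
    weights = [
        max(0.0, cosine_similarity(orig, back))
        for orig, back in zip(original_embs, back_translated_embs)
    ]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    weighted = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return weighted / total_weight


# Hypothetical toy data: the first sample survives the translation round trip,
# the second comes back orthogonal to the original and is ignored.
originals = [[1.0, 0.0], [1.0, 0.0]]
back_translations = [[1.0, 0.0], [0.0, 1.0]]
losses = [2.0, 10.0]
example_loss = cycle_weighted_loss(losses, originals, back_translations)
```

In this toy case the poorly round-tripped sample gets weight zero, so only the first sample's loss contributes. In practice the embeddings would come from a BERT encoder and the per-sample losses from the coreference model.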

2606.05438 2026-06-05 cs.LG math.OC

Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization

Dongruo Zhou

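Affiliations * Department of Computer Science, Indiana University

AI summary For higher-order smooth nonconvex optimization, constructs hard instances via a block-chain mechanism and proves, for the first time, first-order lower bounds that match the known upper bounds, such as Ω(ε^{-7/4}) in the Hessian-Lipschitz case and Ω(ε^{-5/3}) in the third-order-smooth case.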

Comments 24 pages, 1 table

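Details
AI abstract (translated from Chinese)

We study the deterministic first-order oracle complexity of finding ε-stationary points in smooth nonconvex optimization when the objective satisfies higher-order smoothness assumptions. While the classical ε^{-2} rate is optimal under Lipschitz gradients alone, higher-order smoothness yields accelerated first-order upper bounds, most notably the ε^{-7/4} rate under Lipschitz Hessians and the ε^{-5/3} rate under Lipschitz third derivatives. The matching lower bounds, however, have remained open. We close this gap by proving a new dimension-free first-order lower bound for higher-order smooth nonconvex functions that holds for every finite smoothness order. In particular, our construction gives a matching Ω(ε^{-7/4}) lower bound in the Hessian-Lipschitz case and a matching Ω(ε^{-5/3}) lower bound in the third-order-smooth case. The hard instance is based on a block-chain mechanism that enforces blockwise oracle revelation while preserving the smoothness structure needed for the scalar hard instance. The lower-bound construction was discovered with the assistance of ChatGPT 5.5 Pro and subsequently verified by the authors.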

English abstract

We study the deterministic first-order oracle complexity of finding \(ε\)-stationary points in smooth nonconvex optimization when the objective satisfies higher-order smoothness assumptions. While the classical \(ε^{-2}\) rate is optimal under only Lipschitz gradients, higher-order smoothness leads to accelerated first-order upper bounds, most notably the \(ε^{-7/4}\) rate under Lipschitz Hessians and the \(ε^{-5/3}\) rate under Lipschitz third derivatives. The matching lower bounds, however, have remained open. We resolve this gap by proving a new dimension-free first-order lower bound for higher-order smooth nonconvex functions, valid for every finite smoothness order. In particular, our construction gives a matching \(Ω(ε^{-7/4})\) lower bound in the Hessian-Lipschitz case and a matching \(Ω(ε^{-5/3})\) lower bound in the third-order-smooth regime. The hard instance is based on a block-chain mechanism that enforces blockwise oracle revelation while preserving the smoothness structure needed for the scalar hard instance. The lower-bound construction was discovered with the assistance of ChatGPT 5.5 Pro and subsequently verified by the authors.
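
For readers unfamiliar with the terms, the quantities named in this abstract can be written out as follows. These are the standard textbook definitions, not formulas quoted from the paper: the constant \(L_p\) and the choice of Euclidean and operator norms are assumptions, and the abstract does not state the exponent for general smoothness order, so none is given here.

```latex
% An epsilon-stationary point x of a differentiable function f
\|\nabla f(x)\| \le \varepsilon

% p-th order smoothness: the p-th derivative is L_p-Lipschitz
% (p = 1: Lipschitz gradient, p = 2: Lipschitz Hessian, p = 3: Lipschitz third derivative)
\|\nabla^p f(x) - \nabla^p f(y)\| \le L_p \, \|x - y\| \quad \text{for all } x, y

% First-order oracle complexities quoted in the abstract
p = 1:\ \varepsilon^{-2}, \qquad
p = 2:\ \varepsilon^{-7/4}, \qquad
p = 3:\ \varepsilon^{-5/3}
```

As a rough sense of scale, at \(ε = 10^{-4}\) these rates correspond to about \(10^{8}\), \(10^{7}\), and \(10^{6.67}\) oracle calls respectively, ignoring constants.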