arXivDaily arXiv daily academic digest, updated Monday to Friday
2506.22604 2026-05-18 cs.AI cs.HC cs.RO

Bootstrapping Human-Like Planning via LLMs

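David Porfirio, Vincent Hsiao, Morgan Fine-Morris, Leslie Smith, Laura M. Hiatt

Affiliations Navy Center for Applied Research in AI, US Naval Research Laboratory

AI summary This paper studies how natural language interfaces can be combined with drag-and-drop interfaces, using large language models to generate human-like action sequences and comparing them with hand-specified action sequences.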

Comments Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

Abstract

Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.

2506.16129 2026-05-18 cs.CV

Neurosymbolic Object-Centric Learning with Distant Supervision

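Stefano Colamonaco, David Debot, Giuseppe Marra

Affiliations Department of Computer Science, KU Leuven

AI summary This paper proposes DeepObjectLog, a probabilistic neurosymbolic model for object-centric learning that needs no per-object labels or masks and improves generalization to compositional, object-count, and rule shifts.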

Abstract

Neurosymbolic learning can use symbolic rules to provide supervision for latent concepts from weak labels, but it commonly assumes that the entities referenced by these rules are already specified. Object-centric models decompose images into slot-like representations; however, such slots are not necessarily aligned with the predicates required for symbolic reasoning. We investigate object-centric neurosymbolic learning under distant supervision, where the object-level arguments of a logic program are learned directly from images using only global task labels. We introduce DeepObjectLog, a probabilistic neurosymbolic model that integrates a slot-based perceptual encoder with a probabilistic logic layer. The encoder predicts objectness and class probabilities for candidate object representations, while the logic layer marginalizes over latent objectness and class assignments to compute the likelihood of the observed label. This formulation provides a differentiable task-level learning signal for object-centric perception without requiring per-object labels, masks, bounding boxes, or heuristic set matching. Evaluations across diverse visual reasoning tasks demonstrate that DeepObjectLog achieves superior out-of-distribution generalization to compositional, object-count, and rule shifts compared to neural object-centric and standard neurosymbolic baselines.
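The marginalization step described above can be illustrated with a toy case. The sketch below is not the paper's probabilistic logic layer: it assumes each slot has an independent objectness probability and that the global task label is simply the number of objects in the image. The function name `count_likelihood` and the example probabilities are hypothetical.

```python
def count_likelihood(objectness, target_count):
    """P(exactly target_count slots hold an object).

    Marginalizes over all 2^n latent objectness assignments with a
    dynamic program instead of enumerating them explicitly.
    """
    # dist[k] = probability that k of the slots processed so far are objects
    dist = [1.0]
    for p in objectness:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this slot is empty
            nxt[k + 1] += mass * p      # this slot holds an object
        dist = nxt
    if 0 <= target_count < len(dist):
        return dist[target_count]
    return 0.0


# Three candidate slots; the image-level label says "two objects".
slots = [0.9, 0.8, 0.1]
likelihood = count_likelihood(slots, 2)
all_counts = [count_likelihood(slots, k) for k in range(len(slots) + 1)]
```

Because the likelihood is a sum of products of the slot probabilities, its negative logarithm can serve as a differentiable loss that needs only the image-level label, which is the property the abstract emphasizes.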

2506.12405 2026-05-18 cs.SD

Methods for pitch analysis in contemporary popular music: multiple pitches from harmonic tones in Vitalic's music

Emmanuel Deruty, David Meredith, Maarten Grachten, Pascal Arbez-Nicolas, Andreas Hasselholt Jørgensen, Oliver Søndermølle Hansen, Magnus Stensli, Christian Nørkær Petersen

Affiliations Sony Computer Science Laboratories, Paris, France; Department of Architecture, Design and Media Technology, Aalborg University, Aalborg, Denmark; Citizen Records, Dijon, France

AI summary The study examines how a single harmonic complex tone can give rise to multiple perceived pitches in contemporary popular music. Using examples from Vitalic and other electronic artists, it analyzes the relationship between signal characteristics and pitch perception and finds significant variation across listeners in the perception of multiple ambiguous pitches.

Comments Pending review, Journal of the Audio Engineering Society

Abstract

Aims. This study suggests that the use of multiple perceived pitches arising from a single harmonic complex tone is an active and intentional feature of contemporary popular music. The phenomenon is illustrated through examples drawn from the work of electronic artist Vitalic and others. Methods. Two listening tests were conducted: (1) evaluation of the number of simultaneous pitches perceived from single harmonic tones, and (2) manual pitch transcription of sequences of harmonic tones. Relationships between signal characteristics and pitch perception were then analyzed. Results. The synthetic harmonic tones found in the musical sequences under study were observed to transmit more perceived pitches than their acoustic counterparts, with significant variation across listeners. Multiple ambiguous pitches were associated with tone properties such as prominent upper partials and particular autocorrelation profiles. Conclusions. Harmonic tones in a context of contemporary popular music can, in general, convey several ambiguous pitches. The set of perceived pitches depends on both the listener and the listening conditions.

2506.07073 2026-05-18 cs.SD cs.HC eess.AS

Insights on Harmonic Tones from a Generative Music Experiment

Emmanuel Deruty, Maarten Grachten

Affiliations Sony Computer Science Laboratories - Paris; Department of Architecture, Design and Media Technology, Aalborg University; Contractor for Sony CSL Paris

AI summary Generative music AI aims to advance music production. The experiment shows that an AI model can generate structured harmonic tones, raising questions about how humans perceive harmonics and contributing to both musical creativity and theoretical understanding.

Comments 15th International Workshop on Machine Learning and Music, September 9, 2024, Vilnius, Lithuania

Abstract

The ultimate purpose of generative music AI is music production. The studio-lab, a social form within the art-science branch of cross-disciplinarity, is a way to advance music production with AI music models. During a studio-lab experiment involving researchers, music producers, and an AI model for music generating bass-like audio, it was observed that the producers used the model's output to convey two or more pitches with a single harmonic complex tone, which in turn revealed that the model had learned to generate structured and coherent simultaneous melodic lines using monophonic sequences of harmonic complex tones. These findings prompt a reconsideration of the long-standing debate on whether humans can perceive harmonics as distinct pitches and highlight how generative AI can not only enhance musical creativity but also contribute to a deeper understanding of music.

2505.23678 2026-05-18 cs.CV

Grounded Reinforcement Learning for Visual Reasoning

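Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

Affiliations Carnegie Mellon University

AI summary This paper proposes ViGoRL, which uses reinforcement learning for visual reasoning and anchors each reasoning step to spatial coordinates, improving visual grounding and visual search over conventional baselines.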

Comments Project website: https://visually-grounded-rl.github.io/

Abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks (SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding), ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

2505.18853 2026-05-18 cs.CL

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

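Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov

Affiliations Constructor University; HSE University

AI summary This paper proposes Smoothie, which progressively smooths token embeddings based on semantic similarity, combining the strengths of continuous latent spaces and the categorical simplex to improve text generation quality.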

Comments 18 pages, 4 figures, 13 tables

Abstract

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in the categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence and unconditional generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. The code is available at https://github.com/ashaba1in/smoothie.
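One way to picture "progressively smoothing token embeddings based on semantic similarity" is sketched below. This is an illustrative guess at the idea, not the authors' forward process: it assumes similarity is measured by squared Euclidean distance between embeddings and that a width parameter `sigma` (a hypothetical name) controls how much information is removed. The toy vocabulary is made up.

```python
import math


def smoothing_weights(embeddings, token_id, sigma):
    """Softmax over negative squared distances to the chosen token's embedding.

    Small sigma keeps nearly all mass on the original token; large sigma
    spreads it over the vocabulary, with nearby tokens blurred in first.
    """
    anchor = embeddings[token_id]
    logits = [
        -sum((a - b) ** 2 for a, b in zip(anchor, e)) / (2.0 * sigma ** 2)
        for e in embeddings
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [x / total for x in exps]


def smooth_embedding(embeddings, token_id, sigma):
    """Similarity-weighted average of all token embeddings."""
    w = smoothing_weights(embeddings, token_id, sigma)
    dim = len(embeddings[0])
    return [sum(w[j] * embeddings[j][d] for j in range(len(w))) for d in range(dim)]


# Tokens 0 and 1 are semantically close; token 2 is far away.
vocab = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
sharp = smoothing_weights(vocab, 0, sigma=0.01)
blurry = smoothing_weights(vocab, 0, sigma=100.0)
```

In this toy setting the weight vector doubles as a decoding distribution: taking its argmax recovers the original token until the smoothing becomes strong, which loosely mirrors the "natural decoding process" the abstract mentions.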

2505.15692 2026-05-18 cs.CL cs.LG

TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Haoran Luo, Ling Yang, Huazhe Xu, Jianhua Tao

Affiliations Tsinghua University; Nanyang Technological University; Princeton University; Shanghai AI Lab; Beijing National Research Center for Information Science and Technology

AI summary TemplateRL improves LLM reasoning through structured template-guided reinforcement learning. It builds a problem-solving template library with MCTS and integrates it into RL training, raising the hit rate of high-quality trajectories and reducing ineffective exploration; experiments show it outperforms GRPO on AIME and AMC.

Comments Accepted by ACL 2026

Abstract

Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO typically rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address this limitation, we propose TemplateRL, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates, enabling continual refinement during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.

2505.05583 2026-05-18 cs.CL

KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

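Qianbo Zang, Christophe Zgrzendek, Igor Tchappi, Afshin Khadangi, Johannes Sedlmeir

Affiliations Interdisciplinary Centre for Security, Reliability and Trust (SnT)

AI summary This paper proposes KG-HTC, which integrates knowledge graphs with large language models to address scarce annotated data, large label spaces, and long-tail distributions in hierarchical text classification; experiments show strong performance in the zero-shot setting.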

Abstract

Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. Our KG-HTC can enhance LLMs to understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.

2504.18361 2026-05-18 cs.CV cs.AI

COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

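Haozhen Yan, Yan Hong, Jiahui Zhan, Suning Lang, Yikun Ji, Huijia Zhu, Jun Lan, Jianfu Zhang

Affiliations Shanghai Jiao Tong University; Ant Group

AI summary This paper presents COCO-Inpaint, a benchmark for detecting and localizing inpainting-based image manipulations. Through high-quality samples, diverse generation scenarios, and large-scale coverage, it highlights intrinsic inconsistencies between inpainted and authentic regions.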

Comments 6 pages, 8 figures

Abstract

Recent advances in image manipulation have enabled highly photorealistic content generation, but also lowered the barrier to arbitrary editing, raising concerns about multimedia authenticity and security. Existing Image Manipulation Detection and Localization (IMDL) methods mainly target splicing or copy-move forgeries, while benchmarks for inpainting-based manipulations remain limited. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection and localization, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 238,302 inpainted images with rich semantic diversity. Our benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

2504.00663 2026-05-18 cs.LG

Searching on a Budget: HW-NAS with 10 Latency Probes

Francesco Capuano, Gabriele Tiboni, Niccolò Cavagnero, Giuseppe Averta

Affiliations University of Oxford; Julius-Maximilians-Universität Würzburg; Technische Universität Darmstadt; Politecnico di Torino

AI summary This paper proposes a two-stage HW-NAS framework that pre-trains a controller on synthetic devices and then deploys it directly on the target device, designing an architecture from a small number of high-fidelity latency measurements without any pre-collected information.

Abstract

Existing hardware-aware NAS (HW-NAS) methods typically assume access to precise information about the target device, either via analytical approximations of the post-compilation latency model, or through learned latency predictors. Such approximate approaches risk introducing estimation errors that may prove detrimental in risk-sensitive applications. In this work, we propose a two-stage HW-NAS framework, in which we first learn an architecture controller on a distribution of synthetic devices, and then directly deploy the controller on a target device. At test time, our network controller deploys directly to the target device without relying on any pre-collected information, and only exploits direct interactions. In particular, the pre-training phase on synthetic devices enables the controller to design an architecture for the target device by interacting with it through a small number of high-fidelity latency measurements. To guarantee accessibility of our method, we only train our controller with training-free accuracy proxies, allowing us to scale the meta-training phase without incurring the overhead of full network training. We benchmark on HW-NATS-Bench, demonstrating that our method generalizes to unseen devices and searches for latency-efficient architectures by in-context adaptation using only a few real-world latency evaluations at test time.

2504.00289 2026-05-18 cs.CL cs.AI cs.CY

Do Chinese models speak Chinese languages?

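Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno

Affiliations Cornell University

AI summary Comparing the multilingual capabilities of Chinese-developed and Western-developed open-weight LLMs, this paper finds that Chinese models perform similarly to Western models on most languages but are weaker at identifying some languages spoken by Chinese minorities, revealing the priorities and trade-offs in multilingual development.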

Comments First and second author contribute equally

Abstract

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they support the same languages as models developed in the United States or in Europe? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, Chinese model developers need to navigate the tension between serving a linguistically diverse population domestically, and optimizing for globally visible benchmarks that are predominantly English. We investigate Chinese model developers' priorities through a comparative study of Chinese-developed and Western-developed open-weight LLMs, on 21 language variants including Asian regional, Chinese, and European languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with their Western counterparts, with the sole exception being better Mandarin. Chinese-developed models are good at French and German, but they sometimes cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur. Overall, all open-weight LLMs we study have a similar multilingual performance profile, despite the diverse linguistic and cultural contexts the model developers operated within. We interpret the homogenization as consistent with the influence of global benchmarking practices and shared training resources. Rather than treating current language support as inevitable, our results highlight multilingual development as a space of prioritization and trade-offs, with implications for model developers, policymakers, and users.

2503.13113 2026-05-18 cs.LG math.OC

Exploring the Potential of Bilevel Optimization for Calibrating Neural Networks

Gabriele Sanguin, Arjun Pakrashi, Marco Viola, Francesco Rinaldi

Affiliations School of Computer Science, University College Dublin; School of Mathematical Sciences, Dublin City University

AI summary This paper proposes a bilevel-optimization approach to neural network calibration, evaluates it on toy and simulated datasets, and compares it with isotonic regression; the reported results show reduced calibration error with accuracy preserved.

Abstract

Handling uncertainty is critical for ensuring reliable decision-making in intelligent systems. Modern neural networks are known to be poorly calibrated, resulting in predicted confidence scores that are difficult to use. This article explores improving confidence estimation and calibration through the application of bilevel optimization, a framework designed to solve hierarchical problems with interdependent optimization levels. A self-calibrating bilevel neural-network training approach is introduced to improve a model's predicted confidence scores. The effectiveness of the proposed framework is analyzed using toy datasets, such as Blobs and Spirals, as well as more practical simulated datasets, such as Blood Alcohol Concentration (BAC). It is compared with a well-known and widely used calibration strategy, isotonic regression. The reported experimental results reveal that the proposed bilevel optimization approach reduces the calibration error while preserving accuracy.
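The abstract does not say which calibration metric the authors report. One widely used choice is the expected calibration error (ECE), sketched here under the assumption of equal-width confidence bins; the function name, bin count, and example data are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += len(bucket) / n * abs(avg_conf - accuracy)
    return ece


# Four predictions at 95% confidence (three correct) and two at 55% (one correct).
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confs, hits)
```

A well-calibrated model has confidence close to accuracy in every bin, so its ECE is near zero; the overconfident 95% bin above dominates the score.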

2503.00794 2026-05-18 cs.RO

Detecting Heel Strike and toe off Events Using Kinematic Methods and LSTM Models

Longbin Zhang, Zhizhang Li, Xinyi Fu, Yi Xie, Xiaoyue Yan, Suiyuan Wang, Te Zhang, Hui Zhang, Kailun Yang, Tsung-Lin Wu, Prayook Jatesiktat, Ananda Sidarta, Wei Tech Ang

Affiliations School of Artificial Intelligence and Robotics, Hunan University; School of Mechanical and Aerospace Engineering, Nanyang Technological University; Fourth Hospital of Changsha, Hunan Normal University; Department of Stomatology, Hunan Provincial People’s Hospital, Hunan Normal University; Rehabilitation Research Institute of Singapore, Lee Kong Chian School of Medicine, Nanyang Technological University

AI summary This paper evaluates seven kinematics-based methods and an LSTM model for detecting heel strike and toe-off events. The Zeni et al. method is the most accurate of the kinematics-based methods, while the LSTM model offers a data-driven alternative without systematic bias.

Abstract

Accurate gait event detection is crucial for gait analysis, rehabilitation, and assistive technology, particularly in exoskeleton control, where precise identification of stance and swing phases is essential. This study evaluated the performance of seven kinematics-based methods and a Long Short-Term Memory (LSTM) model for detecting heel strike and toe-off events across 4363 gait cycles from 588 able-bodied subjects. The results indicated that while the Zeni et al. method achieved the highest accuracy among kinematics-based approaches, other methods exhibited systematic biases or required dataset-specific tuning. The LSTM model performed comparably to Zeni et al., providing a data-driven alternative without systematic bias. These findings highlight the potential of deep learning-based approaches for gait event detection while emphasizing the need for further validation in clinical populations and across diverse gait conditions. Future research will explore the generalizability of these methods in pathological populations, such as individuals with post-stroke conditions and knee osteoarthritis, as well as their robustness across varied gait conditions and data collection settings to enhance their applicability in rehabilitation and exoskeleton control.
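The abstract does not spell out the Zeni et al. rule. As it is usually described in the gait literature, the coordinate-based version marks heel strike at the peaks of the heel marker's forward position relative to the sacrum, and toe-off at the troughs of the toe marker's forward position relative to the sacrum. The sketch below follows that description and should be read as an assumption about the method rather than this study's implementation. The marker trajectories are synthetic and the x axis is assumed to point in the walking direction.

```python
import math


def local_extrema(signal, kind):
    """Indices of interior local maxima ("max") or minima ("min")."""
    found = []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if kind == "max" and cur > prev and cur >= nxt:
            found.append(i)
        elif kind == "min" and cur < prev and cur <= nxt:
            found.append(i)
    return found


def detect_gait_events(heel_x, toe_x, sacrum_x):
    """Return (heel_strike_frames, toe_off_frames) from forward marker positions."""
    heel_rel = [h - s for h, s in zip(heel_x, sacrum_x)]
    toe_rel = [t - s for t, s in zip(toe_x, sacrum_x)]
    return local_extrema(heel_rel, "max"), local_extrema(toe_rel, "min")


# Synthetic walk: the sacrum moves forward steadily while the foot markers
# oscillate around it with a 20-frame gait cycle.
frames = range(60)
sacrum_x = [0.05 * i for i in frames]
heel_x = [s + 0.30 * math.sin(2 * math.pi * i / 20) for i, s in zip(frames, sacrum_x)]
toe_x = [s + 0.15 + 0.30 * math.sin(2 * math.pi * (i - 2) / 20)
         for i, s in zip(frames, sacrum_x)]
heel_strikes, toe_offs = detect_gait_events(heel_x, toe_x, sacrum_x)
```

Real marker data is noisy, so a practical version would low-pass filter the trajectories or enforce a minimum spacing between events before picking extrema.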

2412.06853 2026-05-18 cs.LG cs.AI

Tube Loss: A Novel Approach for Prediction Interval Estimation

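Pritam Anand, Tathagata Bandyopadhyay, Suresh Chandra

Affiliations Dhirubhai Ambani University (formerly DA-IICT); Indian Institute of Technology, Delhi

AI summary This paper proposes the Tube Loss function for simultaneously estimating both bounds of a prediction interval in regression. The method attains the specified confidence level asymptotically and lets the user shift the interval to tune coverage and width, which is useful for skewed distributions.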

Abstract

This paper proposes a novel loss function, called 'Tube Loss', for simultaneous estimation of the bounds of a Prediction Interval (PI) in the regression setup. The PIs obtained by minimizing the empirical risk based on the Tube Loss are shown to be of better quality than the PIs obtained by existing methods in the following sense. First, it yields intervals that attain the prespecified confidence level $t \in (0,1)$ asymptotically. A theoretical proof of this fact is given. Second, the user is allowed to move the interval up or down by controlling the value of a parameter. This helps the user choose a PI that captures denser regions of the probability distribution of the response variable inside the interval, thus reducing its width. This is shown to be especially useful when the conditional distribution of the response variable is skewed. Further, the Tube Loss-based PI estimation method can trade off coverage against average width by solving a single optimization problem. It enables further reduction of the average width of the PI through re-calibration. Also, unlike a few existing PI estimation methods, gradient descent (GD) can be used to minimize the empirical risk. Through extensive experiments, we demonstrate the effectiveness of Tube Loss-based PI estimation in both kernel machines and neural networks. Additionally, we show that Tube Loss-based deep probabilistic forecasting models achieve superior performance compared to existing probabilistic forecasting techniques across several benchmark and wind datasets. Finally, we empirically validate the advantages of the Tube Loss approach within the conformal prediction framework. Code is available at https://github.com/ltpritamanand/Tube_loss.
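The abstract evaluates prediction intervals by their coverage and average width but does not give the Tube Loss formula, so it is not reproduced here. The sketch below only shows how those two quality measures are commonly computed for a set of intervals; the function name and the data are illustrative.

```python
def interval_quality(y_true, lower, upper):
    """Return (empirical coverage, mean interval width)."""
    n = len(y_true)
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_width


y = [1.0, 2.0, 3.0, 4.0]
lo = [0.5, 1.5, 3.5, 3.0]
hi = [1.5, 2.5, 4.5, 5.0]
coverage, width = interval_quality(y, lo, hi)
```

For a target confidence level t, a good method keeps coverage close to t while making the mean width as small as possible, which is the trade-off the abstract refers to.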

2405.13901 2026-05-18 cs.CV cs.LG eess.SP

Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers

Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Ahmet Enis Cetin, Ulas Bagci

Affiliations Machine and Hybrid Intelligence Lab, Northwestern University; Department of Electrical and Computer Engineering, University of Illinois Chicago

AI summary This paper proposes DCT-based decorrelated attention, improving the efficiency and performance of Vision Transformers through a DCT-based initialization strategy and a compression technique; experiments on Swin Transformer show substantially lower computational overhead with comparable performance.

Comments This paper has been accepted to IJCAI-ECAI 2026

Abstract

Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by introducing a simple yet effective DCT-based initialization strategy for self-attention, where projection weights are initialized using DCT coefficients. This structure-preserving approach consistently improves classification accuracy on the CIFAR-10 and ImageNet-1K benchmarks. Second, we propose a DCT-based attention compression technique that exploits the decorrelation properties of the frequency domain. By observing that high-frequency DCT coefficients typically correspond to noise, we truncate high-frequency components of the input patches, thereby reducing the dimensionality of the query, key, and value projections without sacrificing accuracy. Experiments on Swin Transformer models demonstrate that the proposed compression method achieves a substantial reduction in computational overhead while maintaining comparable performance.
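The compression idea, dropping high-frequency DCT coefficients to shrink the input dimension, can be sketched on a single feature vector. The abstract does not say along which axis or at what stage the authors apply the transform, so the code below is a generic orthonormal DCT-II truncation with hypothetical function names, not the paper's layer.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ])
    return rows


def dct_truncate(vector, keep):
    """Transform to the frequency domain and keep the `keep` lowest frequencies."""
    basis = dct_matrix(len(vector))
    coeffs = [sum(b * x for b, x in zip(row, vector)) for row in basis]
    return coeffs[:keep]


# A constant signal has all of its energy in the lowest-frequency coefficient.
flat = [1.0] * 8
low = dct_truncate(flat, keep=2)

# Keeping every coefficient preserves energy because the basis is orthonormal.
full = dct_truncate([3.0, -1.0, 2.0, 0.5], keep=4)
```

Feeding the shorter coefficient vector into the query, key, and value projections means those projections operate on `keep` inputs instead of `n`, which is where the reduction in computation would come from.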

2405.01557 2026-05-18 cs.LG

An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification

对不平衡分类中平衡方法拉什蒙效应的实验研究

Mustafa Cavus, Przemysław Biecek

Affiliations: Eskisehir Technical University, Department of Statistics, Turkiye; Warsaw University of Technology, Faculty of Mathematics and Information Science; University of Warsaw, Faculty of Mathematics, Informatics and Mechanics

AI summary: This paper studies how balancing methods affect predictive multiplicity through the Rashomon effect, finds that balancing methods inflate predictive multiplicity, and proposes using an extended performance-gain plot when balancing training data.

Comments 16 pages, 6 figures

Journal ref In Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2024, Communications in Computer and Information Science

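Details
AI abstract (translated from Chinese)

Predictive models may produce biased predictions when classifying imbalanced datasets: when the model favors the majority class, predictive performance on the minority class drops. To address this, balancing or resampling methods are key data-centric AI approaches for improving predictive performance. In recent years, however, the usefulness of these methods has been debated. In particular, many candidate models may show very similar predictive performance during model selection, known as the Rashomon effect, yet produce different predictions for the same observations. Selecting a model without considering predictive multiplicity can amount to blind selection. This paper examines the impact of balancing methods on predictive multiplicity through the Rashomon effect. This matters because, in data-centric AI, choosing blindly from a set of almost equally accurate models is risky and can cause serious problems in model selection, validation, and explanation. We run experiments on real datasets using a newly proposed metric, obscurity, alongside the existing ambiguity and discrepancy metrics, to observe how balancing methods affect predictive multiplicity. Our findings show that balancing methods inflate predictive multiplicity and yield varying results. To monitor the trade-off between predictive performance and predictive multiplicity, and to conduct modeling responsibly, we propose using an extended version of the performance-gain plot when balancing the training data.

English abstract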

Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to low performance in accurately predicting the minority class. To address this issue, balancing or resampling methods are critical data-centric AI approaches in the modeling process to improve prediction performance. However, there have been debates and questions about the functionality of these methods in recent years. In particular, many candidate models may exhibit very similar predictive performance, called the Rashomon effect, in model selection, and they may even produce different predictions for the same observations. Selecting one of these models without considering the predictive multiplicity -- which is the case of yielding conflicting models' predictions for any sample -- can result in blind selection. In this paper, the impact of balancing methods on predictive multiplicity is examined using the Rashomon effect. It is crucial because the blind model selection in data-centric AI is risky from a set of approximately equally accurate models. This may lead to severe problems in model selection, validation, and explanation. To tackle this matter, we conducted real dataset experiments to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect by using a newly proposed metric obscurity in addition to the existing ones: ambiguity and discrepancy. Our findings showed that balancing methods inflate the predictive multiplicity and yield varying results. To monitor the trade-off between the prediction performance and predictive multiplicity for conducting the modeling process responsibly, we proposed using the extended version of the performance-gain plot when balancing the training data.
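
The two existing metrics named in the abstract can be illustrated with a short sketch. It assumes the usual definitions from the predictive multiplicity literature: ambiguity is the share of samples on which at least one competing model disagrees with a baseline model, and discrepancy is the largest share of samples on which any single competing model disagrees with it. The newly proposed obscurity metric is not defined in the abstract, so it is left out. Function names and the toy predictions are hypothetical.

```python
def ambiguity(baseline, competitors):
    """Share of samples where at least one competing model disagrees with the baseline."""
    n = len(baseline)
    conflicted = sum(
        1 for i in range(n) if any(model[i] != baseline[i] for model in competitors)
    )
    return conflicted / n


def discrepancy(baseline, competitors):
    """Largest share of samples on which a single competing model disagrees with the baseline."""
    n = len(baseline)
    return max(
        sum(1 for i in range(n) if model[i] != baseline[i]) / n
        for model in competitors
    )


# Toy Rashomon set: three models with near-identical accuracy but different predictions.
baseline_preds = [0, 1, 1, 0]
rashomon_set = [[0, 1, 0, 0], [1, 1, 1, 0]]
amb = ambiguity(baseline_preds, rashomon_set)    # two of four samples are contested
disc = discrepancy(baseline_preds, rashomon_set)  # each competitor flips one sample
```

Ambiguity is never smaller than discrepancy, since every sample counted for the worst single competitor is also contested by at least one model.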

2312.05975 2026-05-18 cs.CV cs.AI cs.LG

FM-G-CAM: A Holistic Approach for Explainable AI in Computer Vision

FM-G-CAM:计算机视觉中可解释AI的综合方法

Ravidu Suien Rammuni Silva, Jordan J. Bird

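Affiliations: Department of Computer Science, Nottingham Trent University

AI summary: This paper proposes FM-G-CAM, which considers multiple top-predicted classes to give a holistic explanation of a CNN's decisions, addressing the single-class limitation of conventional Grad-CAM.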

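Details
AI abstract (translated from Chinese)

Explainability is a key factor for modern AI in real-world applications. This paper emphasizes the need to understand the predictions of computer vision models, particularly convolutional neural networks (CNNs). Existing methods are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus on a single target class, neglecting a large part of the CNN's prediction process. This paper proposes a holistic method, Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), which considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN. We also give a detailed mathematical and algorithmic description. In addition, quantitative and qualitative comparisons on real-world use cases show the advantages of FM-G-CAM over Grad-CAM. Finally, we provide an open-source Python library with an FM-G-CAM implementation for generating saliency maps for CNN model predictions.

English abstract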

Explainability is a vital aspect of modern AI for real-world impact and usability. The main objective of this paper is to emphasise the need to understand the predictions of Computer Vision models, specifically Convolutional Neural Network (CNN) models. Existing methods for explaining CNN predictions are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus solely on a single target class; this assumption about the target class selection neglects a large portion of the predictor CNN's prediction process. In this paper, we present an exhaustive methodology, called Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), that considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN's rationale. We also provide a detailed mathematical and algorithmic description of our method. Furthermore, alongside a concise comparison of existing methods, we compare FM-G-CAM with Grad-CAM, quantitatively and qualitatively highlighting its benefits through real-world practical use cases. Finally, we present an open-source Python library with an FM-G-CAM implementation to conveniently generate saliency maps for CNN-based model predictions.

2308.06822 2026-05-18 cs.LG cs.AI cs.CR math.OC

Approximate and Weighted Data Reconstruction Attack in Federated Learning

联邦学习中的近似和加权数据重建攻击

Yongcun Song, Ziqi Wang, Enrique Zuazua

Affiliations: Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University; Chair for Dynamics, Control, Machine Learning and Numerics – Alexander von Humboldt Professorship, Department of Mathematics, Friedrich-Alexander-Universität Erlangen-Nürnberg; Chair of Computational Mathematics, Fundación Deusto; Departamento de Matemáticas, Universidad Autónoma de Madrid

AI summary: This paper proposes an interpolation-based approximation method for attacking the Federated Averaging (FedAvg) scenario in federated learning. It generates the intermediate model updates of clients' local training to improve data reconstruction quality, and experiments confirm its advantage in image data reconstruction.

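Details
AI abstract (translated from Chinese)

Federated Learning (FL) is a distributed learning paradigm that lets multiple clients collaboratively build a machine learning model without sharing their private data. Although FL is designed to preserve privacy, recent data reconstruction attacks show that an attacker can recover clients' training data from the parameters shared in FL. However, most existing methods cannot attack the most widely used horizontal Federated Averaging (FedAvg) scenario, in which clients share model parameters after multiple local training steps. To address this, we propose an interpolation-based approximation method that makes attacking FedAvg feasible by generating the intermediate model updates of the clients' local training processes. We then design a layer-wise weighted loss function to improve reconstruction quality, assigning different weights to model updates in different layers according to the neural network structure, with the weights tuned by Bayesian optimization. Finally, experiments show that the proposed approximate and weighted attack (AWA) outperforms other state-of-the-art methods across evaluation metrics, with substantial improvements in image data reconstruction.

English abstract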

Federated Learning (FL) is a distributed learning paradigm that enables multiple clients to collaborate on building a machine learning model without sharing their private data. Although FL is considered privacy-preserved by design, recent data reconstruction attacks demonstrate that an attacker can recover clients' training data based on the parameters shared in FL. However, most existing methods fail to attack the most widely used horizontal Federated Averaging (FedAvg) scenario, where clients share model parameters after multiple local training steps. To tackle this issue, we propose an interpolation-based approximation method, which makes attacking FedAvg scenarios feasible by generating the intermediate model updates of the clients' local training processes. Then, we design a layer-wise weighted loss function to improve the data quality of reconstruction. We assign different weights to model updates in different layers concerning the neural network structure, with the weights tuned by Bayesian optimization. Finally, experimental results validate the superiority of our proposed approximate and weighted attack (AWA) method over the other state-of-the-art methods, as demonstrated by the substantial improvement in different evaluation metrics for image data reconstructions.

2306.04321 2026-05-18 cs.AI cs.MM

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

生成语义通信:扩散模型超越位恢复

Eleonora Grassucci, Sergio Barbarossa, Danilo Comminiello

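Affiliations: Dept. of Information Engineering, Electronics, and Telecommunication, Sapienza University of Rome

AI summary: This paper proposes a new generative diffusion framework for semantic communication that uses diffusion models to synthesize multimedia content while preserving semantic features, generates semantically consistent scenes through spatially-adaptive normalizations, and improves image generation quality under channel noise.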

Journal ref IEEE Transactions on Cognitive Communication and Networking, 2026

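Details
AI abstract (translated from Chinese)

Semantic communication is regarded as one of the cores of next-generation AI-based communications. It may allow the receiver to regenerate images or videos that are semantically equivalent to the transmitted ones without recovering the transmitted bit sequence. Current solutions still lack the ability to build complex scenes from the limited information received. This paper proposes a novel generative diffusion-guided framework that leverages the strong ability of diffusion models to synthesize multimedia content while preserving semantic features, and reduces bandwidth usage by sending only highly compressed semantic information. The diffusion model then learns to generate semantically consistent scenes from the denoised semantic information through spatially-adaptive normalizations. An in-depth evaluation over multiple scenarios shows that the method still generates high-quality images with preserved semantic information when the received content is significantly degraded. Specifically, objects, locations, and depths remain recognizable even when the communication channel is extremely noisy. The code is available at https://github.com/ispamm/GESCO

English abstract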

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.

2210.13455 2026-05-18 cs.LG cs.AI

Epistemic Monte Carlo Tree Search

认知蒙特卡洛树搜索

Yaniv Oren, Viliam Vadocz, Matthijs T. J. Spaan, Wendelin Böhmer

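Affiliations: Delft University of Technology

AI summary: This paper proposes Epistemic MCTS (EMCTS), which accounts for epistemic uncertainty during search to drive exploration, and reports better results on sparse-reward tasks such as writing code.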

Details
English abstract

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse reward environments. MCTS does not account for the propagation of this uncertainty however. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the Assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency over baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea - which baseline A/MZ are practically unable to solve - much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

2605.15769 2026-05-18 cs.RO cs.AI

Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics

动态环境中的拉马克继承:关键变量如何影响进化动态

K. Ege de Bruin, Kyrre Glette, Kai Olav Ellefsen

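Affiliations: Department of Informatics, University of Oslo, Norway; RITMO, University of Oslo, Norway

AI summary: This paper studies how key variables affect evolutionary dynamics in dynamic environments. Using virtual soft robots and two learning methods, it finds that Lamarckian inheritance underperforms when environmental changes are both conflicting and unpredictable, and that adding a sensor for environmental changes restores its advantage.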

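Details
AI abstract (translated from Chinese)

Co-optimizing a robot's body and controller is a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, using Lamarckian inheritance to pass learned controller parameters from parent to offspring. For dynamic environments, the existing literature presents conflicting evidence: traditional evolutionary theory often holds that Lamarckian inheritance brings no benefit, while recent evolutionary robotics studies show that it can improve performance. We hypothesize that this is because previous studies did not include all the variables relevant to dynamic environments. In this work, we find that the benefit of Lamarckian inheritance depends on two variables: how strongly the environmental changes conflict with respect to robot control, and how predictable those changes are for the robotic agent. Using virtual soft robots and two learning approaches, Bayesian optimization and reinforcement learning, we find that Lamarckian inheritance underperforms Darwinian inheritance only when the environmental changes are both conflicting and unpredictable. Adding a sensor that detects environmental changes restores the advantage of Lamarckian inheritance in conflicting environments by letting robotic agents anticipate the need for a different behavior and thereby generalize their control.

English abstract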

The co-optimization of a robot's body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. In dynamic environments, existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables with dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how conflicting the environmental changes are to robot control, and the predictability of those changes for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits for Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.

2605.15764 2026-05-18 cs.CV cs.AI

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

GRASP:学习多个人非语言互动中的社会推理

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

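Affiliations: University of Illinois Urbana-Champaign; Georgia Institute of Technology; Amazon AGI; Korea University

AI summary: GRASP connects high-level social QA with fine-grained gaze and deictic gesture events to improve social reasoning over multi-person non-verbal interactions. It contains 290K question-answer pairs and proposes Social Grounding Reward (SGR) to improve model performance.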

Comments Project page: https://social-reaoning.github.io/grasp/

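Details
AI abstract (translated from Chinese)

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who is interacting with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized into 16 categories spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus only on isolated cues or high-level social QA, GRASP builds social events from identity-consistent gaze trajectories, deictic gestures, and their joint compositions. We also propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about each participant in an interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

English abstract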

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

2605.15763 2026-05-18 cs.CL cs.AI

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

CompactQE: 通过小规模开源大语言模型实现可解释的翻译质量估计

Kamil Guttmann, Zofia Fraś, Artur Nowakowski, Krzysztof Jassem

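Affiliations: Laniqo; Faculty of Mathematics and Computer Science, Adam Mickiewicz University

AI summary: This paper proposes CompactQE, which uses small open-weight LLMs for translation quality estimation, generating quality scores, error annotations, suggested corrections, and full post-editions. Its system-level correlations with human judgments exceed those of traditional neural metrics, fine-tuned models, and human inter-annotator agreement.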

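Details
AI abstract (translated from Chinese)

Current state-of-the-art Quality Estimation (QE) for machine translation relies on large proprietary LLMs, which raises data privacy concerns. We show that smaller open-source LLMs (<30B parameters) are a viable, cost-effective, and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows that these models achieve high system-level correlations with human judgments, outperforming traditional neural metrics, fine-tuned models, and human inter-annotator agreement, and effectively approximating the capabilities of much larger proprietary LLMs.

English abstract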

Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.

2605.15761 2026-05-18 cs.LG

A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation

用于分析排行榜稳定性和操纵的统一扰动框架

Hosna Oyarhoseini, Jimmy Lin, Amir-Hossein Karimi

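Affiliations: University of Waterloo

AI summary: This paper proposes a unified perturbation framework for analyzing the robustness of Bradley-Terry leaderboards under structured data modifications. It studies how perturbations such as Drop, Add, and Flip affect leaderboard stability, shows that modern leaderboards are non-robust on three objectives, and provides an auditing tool.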

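Details
AI abstract (translated from Chinese)

Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences, yet the robustness of these rankings remains poorly understood. We propose a unified perturbation framework that uses influence-based approximations to analyze the robustness of Bradley-Terry leaderboards under structured data modifications. The framework studies three match-level perturbations (Drop, Add, and Flip) together with player removal, and evaluates their effects on top-k membership, global ranking consistency (via Kendall's tau), and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust on all three objectives: sub-1% targeted perturbations can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals. Beyond robustness auditing, we show that the same influence scores enable efficient targeted perturbations that promote or demote specific models and reduce target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects with normalized dataset-level robustness scores, our framework provides a practical tool for auditing leaderboard stability and motivating more robust evaluation protocols.

English abstract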

Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remains poorly understood. We present a unified perturbation framework for analyzing Bradley-Terry leaderboards under structured data modifications using influence-based approximations. Our framework studies three match-level perturbations -- Drop, Add, and Flip -- together with player removal, and evaluates their effects on top-k membership, global ranking consistency via Kendall's tau, and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust across all three objectives: sub-1% targeted perturbations can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals. Beyond robustness auditing, we show that the same influence scores enable efficient targeted perturbations, promoting or demoting specific models and reducing target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects with normalized dataset-level robustness scores, our framework provides a practical and helpful tool for auditing leaderboard stability and motivating more robust evaluation protocols.
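
The match-level perturbations can be illustrated on a toy leaderboard. The sketch below is only an illustration under stated assumptions: it fits Bradley-Terry strengths with the classic minorization-maximization update and refits from scratch after each perturbation, whereas the paper uses influence-based approximations. All function names and the toy match list are hypothetical.

```python
def fit_bradley_terry(matches, n_players, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with MM updates."""
    wins = [0] * n_players
    pair_counts = {}
    for winner, loser in matches:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1
    p = [1.0 / n_players] * n_players
    for _ in range(iters):
        denom = [0.0] * n_players
        for (i, j), n_ij in pair_counts.items():
            d = n_ij / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        # MM update: wins divided by the expected-comparison term, then renormalize.
        p = [wins[i] / denom[i] if denom[i] > 0 else p[i] for i in range(n_players)]
        total = sum(p)
        p = [v / total for v in p]
    return p


def perturb(matches, idx, kind):
    """Apply a single Drop or Flip perturbation to the match at position idx."""
    if kind == "drop":
        return matches[:idx] + matches[idx + 1:]
    if kind == "flip":
        winner, loser = matches[idx]
        return matches[:idx] + [(loser, winner)] + matches[idx + 1:]
    raise ValueError(f"unknown perturbation: {kind}")


def top_model(matches, n_players):
    """Index of the top-ranked player under the fitted strengths."""
    p = fit_bradley_terry(matches, n_players)
    return max(range(n_players), key=lambda i: p[i])


# Toy leaderboard: model 0 leads 3-2, so flipping one match hands the top spot to model 1.
base = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]
before = top_model(base, 2)
after = top_model(perturb(base, 0, "flip"), 2)
```

Brute-force refitting costs one full fit per candidate perturbation, which is the expense that an influence-based approximation is meant to avoid on real leaderboards.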

2605.15760 2026-05-18 cs.CV

Learn2Splat: Extending the Horizon of Learned 3DGS Optimization

Learn2Splat: 扩展学得3DGS优化的视野

Naama Pearl, Stefano Esposito, Haofei Xu, Amit Peleg, Patricia Gschossmann, Lorenzo Porzi, Peter Kontschieder, Gerard Pons-Moll, Andreas Geiger

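Affiliations: University of Tübingen, Tübingen AI Center; ETH Zurich; Meta Reality Labs

AI summary: This paper proposes a learned optimizer for 3DGS whose meta-learning scheme extends the optimization horizon, improving reconstruction quality and stability in sparse and dense view settings and enabling zero-shot generalization.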

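Details
AI abstract (translated from Chinese)

3D Gaussian Splatting (3DGS) optimization usually relies on standard optimizers (Adam, SGD). Although stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that cannot capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent work introduced learned optimizers that predict correlated updates from inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of iterations and rely on manually scheduled learning rates to avoid degradation. This paper proposes a learned optimizer that avoids degradation over extended optimization horizons without auxiliary mechanisms. To this end, we propose a meta-learning scheme that extends the optimization horizon through a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information. Results show improved early novel view synthesis quality with stability over long horizons and zero-shot generalization. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers in sparse and dense view settings. Code and models will be released publicly. Project page: https://naamapearl.github.io/learn2splat

English abstract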

3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at https://naamapearl.github.io/learn2splat .

2605.15755 2026-05-18 cs.CV

Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

基于属性的选型推理用于艺术品情感理解的多模态大语言模型

Cheng Zhang, Yuer Liu, Zhiyu Zhou, Hongxia Xie, Wen-Huang Cheng

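Affiliations: Department of Computer Science and Technology, Jilin University; Department of Computer Science, National Taiwan University

AI summary: This paper proposes Attribute-Grounded Selective Reasoning for artwork emotion understanding with multimodal large language models, introducing a formal-attribute bottleneck-guided framework that improves emotion prediction accuracy and makes explanations more compact.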

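Details
AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) can generate fluent explanations of artwork emotion, but they often suffer from attribute flooding: they list many visible formal attributes without identifying which cues actually support the affective judgment. This paper therefore formulates artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), in which predefined formal attributes serve as evidence units and only emotionally relevant attributes should enter the final interpretation. To make the problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a resource of 132,664 artworks with content, formal-attribute, valence-arousal, and emotion annotations, by adding a human salience extension of 1,400 artworks annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then restricts downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, agrees more strongly with human-marked salient attributes under the Dice and Tversky metrics, and produces more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/

English abstract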

Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
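
The Dice and Tversky agreement scores mentioned above can be sketched for sets of selected attributes. This is an illustration only: it assumes predicted and human-marked salient attributes are compared as plain sets, and the attribute names and the `alpha`/`beta` values are hypothetical rather than taken from the paper.

```python
def tversky(predicted, reference, alpha=0.5, beta=0.5):
    """Tversky index between two sets; alpha weights false positives, beta false negatives."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    false_pos = len(predicted - reference)
    false_neg = len(reference - predicted)
    denom = true_pos + alpha * false_pos + beta * false_neg
    # Two empty sets are treated as perfect agreement (a convention assumed here).
    return true_pos / denom if denom else 1.0


def dice(predicted, reference):
    """Dice coefficient, which is the Tversky index with alpha = beta = 0.5."""
    return tversky(predicted, reference, alpha=0.5, beta=0.5)


# A model that floods the explanation with an extra attribute is penalized for it.
model_attrs = {"color", "line", "texture"}
human_attrs = {"color", "line"}
dice_score = dice(model_attrs, human_attrs)
strict_score = tversky(model_attrs, human_attrs, alpha=0.7, beta=0.3)
```

Raising `alpha` above `beta` punishes over-selection more heavily than omission, which suits a setting where the concern is listing too many attributes.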

2605.15753 2026-05-18 cs.RO cs.CV

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

层次化和整体化的开放词汇功能3D场景图用于室内空间

Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

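Affiliations: Tsinghua University; ETH Zürich; MPI for Informatics; Dalian University of Technology; Microsoft; Stanford University; University of Lugano

AI summary: This paper proposes an open-vocabulary pipeline that combines 2D visual grounding with 3D graph optimization to handle scene graph inference over small, dense, and similar instances, using temporal graph optimization and global hierarchy shaping to improve functional 3D scene graph generation for indoor spaces.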

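Details
AI abstract (translated from Chinese)

Functional 3D scene graphs offer a flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored because existing benchmarks have limited coverage and previous pipelines are overly simple. This paper therefore extends benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. The extension brings key challenges: handling small, dense, and similar instances, the lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence and associate nodes across frames in 3D using multiple cues. Edge association is formulated as temporal graph optimization that integrates evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping recovers the hierarchical graph structure. Extensive experiments show that the method reliably infers functional 3D scene graphs in challenging real-world scenes, further unlocking their potential for practical applications.

English abstract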

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack of hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, with lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

2605.15737 2026-05-18 cs.CV

BARRIER: Bounded Activation Regions for Robust Information Erasure

BARRIER:基于鲁棒信息擦除的有界激活区域

Jan Miksa, Patryk Krukowski, Przemysław Spurek, Dawid Damian Rymarczyk, Marcin Sendera

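Affiliations: Jagiellonian University; IDEAS Research Institute; National Research Institute

AI summary: BARRIER operates on the dynamic geometry of hidden-layer activations and uses interval arithmetic to protect neutral concepts, achieving stable information erasure while keeping other representations intact.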

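Details
AI abstract (translated from Chinese)

Machine unlearning faces a critical bottleneck. Traditional methods focus on erasing the target concept but often suppress other important representations unintentionally. BARRIER therefore moves the intervention from static model weights to the dynamic geometry of hidden-layer activations. Using interval arithmetic on SVD-based projections of the activation space, it encloses the target region in a bounding hypercube, ensuring rigorous protection of the retain distribution. This geometric construction turns knowledge preservation from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits aggressive unlearning updates within the forget region. Experiments show that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing erasure of the target concept while protecting the integrity of other representations. Code: https://github.com/OneAndZero24/BARRIER

English abstract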

Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at https://github.com/OneAndZero24/BARRIER.
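
For readers unfamiliar with interval arithmetic, the sketch below shows only its basic building block: pushing an axis-aligned box (a hypercube of lower and upper bounds) through an affine map so that every point inside the input box is guaranteed to land inside the output box. It is not the BARRIER method, which applies interval arithmetic to SVD-based projections of activations; the function name and the numbers are hypothetical.

```python
def interval_linear(lower, upper, weights, bias):
    """Propagate the box [lower, upper] through y = W x + b and return sound output bounds."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, x_lo, x_hi in zip(row, lower, upper):
            if w >= 0:
                # A non-negative weight maps the lower input bound to the lower output bound.
                lo += w * x_lo
                hi += w * x_hi
            else:
                # A negative weight swaps the roles of the two input bounds.
                lo += w * x_hi
                hi += w * x_lo
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


# Box with x0 in [-1, 1] and x1 in [0, 2], mapped through y = x0 - 2 * x1 + 0.5.
y_lo, y_hi = interval_linear([-1.0, 0.0], [1.0, 2.0], [[1.0, -2.0]], [0.5])
```

Bounds of this kind are what make it possible to state a guarantee about a whole region of activations rather than about individual samples.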

2605.15736 2026-05-18 cs.CV cs.AI

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

BiomedAP: 一种基于视觉的双锚框架与门控跨模态融合用于鲁棒的医学视觉-语言适应

Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

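Affiliations: Wenzhou University; Wenzhou Business College

AI summary: BiomedAP uses gated cross-modal fusion and a dual-anchor constraint to make medical vision-language models more robust to prompt variations. Experiments show that it outperforms baseline methods on multiple benchmarks.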

Comments CVPR2026 Workshop

Details
English abstract

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

2605.15734 2026-05-18 cs.AI

Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

我们能否信任AI推断的用户状态。一种用于验证由LLMs在操作环境中对用户状态分类的可靠性的人格测量框架

Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska

发表机构 * Orange Research, AI Center(Orange研究院、人工智能中心)

AI总结 本文通过实证测试检验了使用大语言模型评估用户状态的假设,探讨了AI测量在人格测量中的可靠性问题,并提出可复制的评估框架以提高适应性系统的AI设计可靠性。

Comments Full survey article with data tables for futher possible replicabilty and comparison

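Details
AI abstract (translated from Chinese)

Using large language models to assess user states in conversational and adaptive systems rests on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper tests that assumption empirically, focusing on the psychometric reliability of artificial intelligence (AI) measures. The study uses replication evaluation procedures to assess the repeatability of a broad set of metrics across three bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). The analyses cover both individual score reliability and aggregated reliability, which lets us distinguish metrics that may be useful for real-time adaptation from those that retain value only in aggregated analyses. The results show that metric reliability cannot be treated as a default property in interpretive domains. Instability at the level of individual scores rules out interpreting such scores as indicators of user state in real-time adaptive systems, even when the metrics are stable after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, for identifying rules of interaction and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is a replicable evaluation framework that enables measurable assessment of metric applicability. This approach supports more responsible AI design, in which interpreting results requires explicit validation of reliability and monitoring for violations over time.

English abstract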

The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.