arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.17746 2026-05-19 cs.AI cs.HC

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

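Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

Affiliations: Guanghua School of Management, Peking University; Xi'an Jiaotong University; Cheung Kong Graduate School of Business

AI summary: This paper proposes SEED, a framework that represents experimental conditions as typed actor-flow graphs to support describing experimental designs, evaluating their structural novelty, and generating candidates. A lightweight feasibility test on a medical-triage design task supports its viability as a design aid, and the paper discusses governance tensions around novelty, replication, and related issues.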

Abstract

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.
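
To make the idea of a typed actor-flow graph more concrete, here is a minimal Python sketch of how an experimental condition might be encoded and compared against prior designs. Every class, field, and type name below is hypothetical and not taken from the paper, and the Jaccard-based score is only an assumed stand-in for whatever structural novelty measure SEED actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    """A typed, directed edge between two actors (type names are illustrative)."""
    source: str
    target: str
    kind: str  # e.g. "recommend", "override", "feedback"


@dataclass
class Condition:
    """One experimental condition encoded as a typed actor-flow graph."""
    name: str
    actors: dict[str, str] = field(default_factory=dict)  # actor id -> actor type
    flows: set[Flow] = field(default_factory=set)

    def typed_edges(self) -> set[tuple[str, str, str]]:
        # Drop actor ids so that only the interaction structure remains.
        return {(self.actors[f.source], self.actors[f.target], f.kind) for f in self.flows}


def structural_novelty(candidate: Condition, prior: list[Condition]) -> float:
    """Smallest Jaccard distance between the candidate and any encoded prior design."""
    cand = candidate.typed_edges()
    distances = []
    for design in prior:
        other = design.typed_edges()
        union = cand | other
        distances.append(1.0 - len(cand & other) / len(union) if union else 0.0)
    return min(distances, default=1.0)  # no prior designs: treat as fully novel


baseline = Condition(
    "ai_advises_nurse",
    actors={"ai": "ai_agent", "nurse": "human"},
    flows={Flow("ai", "nurse", "recommend")},
)
candidate = Condition(
    "nurse_can_override",
    actors={"ai": "ai_agent", "nurse": "human"},
    flows={Flow("ai", "nurse", "recommend"), Flow("nurse", "ai", "override")},
)
print(structural_novelty(candidate, [baseline]))  # 0.5
```

Encoding conditions this way is what makes them comparable and auditable: two designs described in different prose reduce to the same set of typed edges if their interaction structure is the same.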

2605.17743 2026-05-19 cs.CV

MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

Ronyu Zhang, Aosong Cheng, Gaole Dai, Yulin Luo, Jiaming Liu, Li Du, Huanrui Yang, Dan Wang, Leyuan Fang, Yuan Du, Shanghang Zhang

Affiliations: Nanjing University and The Hong Kong Polytechnic University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; University of Arizona; Hong Kong University of Science and Technology; School of Artificial Intelligence and Robotics, Hunan University; Nanjing University

AI summary: This paper proposes MoASE++, which combines a mixture of Activation Sparsity Experts with Domain-Adaptive On-Policy Distillation to separate domain-agnostic structure from domain-specific texture in continual test-time adaptation, improving continual adaptation in dynamic visual environments.

Abstract

Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.
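
The two generic ingredients named in the distillation step, an EMA teacher and a reverse KL objective, can be sketched in a few lines of plain Python. This is only an illustration of the standard definitions under the assumption that teacher probabilities are nonzero; the function names, the decay value, and the example logits are hypothetical, and the paper's actual loss, augmentation policy, and parameterization are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student's own predictions."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    return sum(
        s * (math.log(s) - math.log(t))
        for s, t in zip(p_student, p_teacher)
        if s > 0.0
    )


def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_params, student_params)]


student = [2.0, 0.5, -1.0]   # illustrative logits only
teacher = [1.5, 0.7, -0.8]
loss = reverse_kl(student, teacher)                 # non-negative; zero when the distributions match
teacher_weights = ema_update([1.0, 1.0], [0.0, 2.0])  # teacher drifts slowly toward the student
```

Because the teacher changes slowly, it acts as a stable anchor: the student is penalized for putting probability mass where the teacher puts little, which is one way to limit the confirmation bias the abstract mentions.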

2605.17742 2026-05-19 cs.CV cs.HC

UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

Tianhao Han, Haoyang Zhang, Liang Xie, Haochen Chang, Kun Gao, Yuan Cheng, Pengfei Ren, Erwei Yin

Affiliations: School of Computer Science, Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications; Sun Yat-sen University; Peking University; Defense Innovation Institute, Academy of Military Sciences; Tianjin Artificial Intelligence Innovation Center

AI summary: This paper proposes UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of hand pose and builds a probabilistic point cloud feature space to model complex spatiotemporal relationships more stably. It achieves state-of-the-art performance on three challenging datasets, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).

Comments Accepted by CVPR 2026

Abstract

Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables modeling of complex spatiotemporal relationships. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diverse hypotheses, facilitating robust learning under noisy pseudo-label supervision with enhanced stability. These multiple hypotheses are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).

2605.17737 2026-05-19 cs.SD

Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

Jun Xue, Tong Zhang, Zhuolin Yi, Yihuan Huang, Yi Chai, Yiyang Zhang, Yanzhen Ren

Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University

AI summary: This paper proposes Phoneme-based Voice Profiling (PVP), a framework that uses micro-phonetic modeling to capture speaker-specific articulatory patterns, enabling data-efficient detection of speech deepfakes with fine-grained, phoneme-level interpretability.

Comments Accepted by IJCAI 2026

Abstract

The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
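
The scoring idea behind per-phoneme profiling can be illustrated with a small Python sketch: each phoneme gets its own Gaussian mixture fitted on bona fide speech, and test audio is scored by how well its phoneme realizations fit that profile. The sketch below assumes one scalar acoustic feature per phoneme token and already-fitted mixtures; the feature, the mixture parameters, and the function names are all hypothetical, and the authors' released code should be consulted for the real pipeline.

```python
import math


def gmm_log_likelihood(x, components):
    """Log-likelihood of scalar x under a 1-D GMM given as (weight, mean, variance) triples."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
        for w, mu, var in components
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))  # log-sum-exp


def phoneme_scores(observations, profile):
    """Average log-likelihood per phoneme; low values flag phonemes that deviate from the profile."""
    scores = {}
    for phoneme, values in observations.items():
        if phoneme in profile and values:
            total = sum(gmm_log_likelihood(v, profile[phoneme]) for v in values)
            scores[phoneme] = total / len(values)
    return scores


# Hypothetical speaker profile: one mixture per phoneme over a single scalar feature.
profile = {
    "a": [(0.6, 700.0, 900.0), (0.4, 760.0, 1600.0)],
    "i": [(1.0, 300.0, 400.0)],
}
# Hypothetical test utterance: "a" looks typical for this speaker, "i" does not.
observations = {"a": [705.0, 690.0], "i": [520.0]}
scores = phoneme_scores(observations, profile)
```

Since each phoneme is scored separately, the output points to which sounds look unlike the reference speaker, which is the kind of phoneme-level interpretability the abstract describes.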

2605.17734 2026-05-19 cs.AI

Harnessing LLM Agents with Skill Programs

Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

Affiliations: New York University; Salesforce AI Research

AI summary: This paper proposes HASP, a framework that improves LLM agents on complex tasks by turning skills into executable Program Functions (PFs). PFs activate on failure-prone states and correct the next action, and the modular design applies to inference, post-training, and self-improvement.

Comments 40 pages, 7 figures

Abstract

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.
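
The contrast between advisory text and an executable guardrail can be sketched as a trigger predicate paired with an intervention. The Python below is a hypothetical illustration only: the `ProgramFunction` class, the state layout, and the repeated-query skill are invented for this example and are not the paper's interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProgramFunction:
    """A skill expressed as code: when to step in, and how to change the next action."""
    name: str
    trigger: Callable[[dict], bool]        # fires on a failure-prone state
    intervene: Callable[[dict, str], str]  # rewrites the proposed next action


def apply_program_functions(state: dict, proposed_action: str, pfs: list) -> str:
    """Return the first triggered PF's replacement action, or the original action."""
    for pf in pfs:
        if pf.trigger(state):
            return pf.intervene(state, proposed_action)
    return proposed_action


def repeated_query(state: dict) -> bool:
    # Failure-prone state: the agent issued the same search twice in a row.
    history = state.get("queries", [])
    return len(history) >= 2 and history[-1] == history[-2]


avoid_repeat = ProgramFunction(
    name="avoid_repeated_search",
    trigger=repeated_query,
    intervene=lambda state, action: "search: rephrase(" + state["queries"][-1] + ")",
)

stuck = {"queries": ["capital of X", "capital of X"]}
fresh = {"queries": ["capital of X"]}
print(apply_program_functions(stuck, "search: capital of X", [avoid_repeat]))
print(apply_program_functions(fresh, "search: capital of X", [avoid_repeat]))
```

Unlike a prompt hint, the guardrail here decides for itself when it applies, so the agent cannot simply ignore it in the failure state.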

2605.17733 2026-05-19 cs.AI cs.LG

Divergence-Suppressing Couplings for Rectified Flow

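Yimeng Min, Carla P. Gomes

Affiliations: Department of Computer Science

AI summary: This paper proposes divergence-suppressing couplings for Rectified Flow, which attenuate the divergent component of the learned velocity field during coupling generation, reducing trajectory distortion and improving generation quality.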

Abstract

The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we identify that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuates the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.
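
For readers who want to see what "divergence of a velocity field" and "plain Euler" mean in code, the following Python sketch estimates divergence by central differences and integrates a trajectory with fixed Euler steps. It assumes a time-independent 2D field for simplicity, and the two example fields are hypothetical; it only measures divergence and does not implement the paper's correction.

```python
def divergence(velocity, x, y, h=1e-4):
    """Central-difference estimate of d(vx)/dx + d(vy)/dy at the point (x, y)."""
    dvx_dx = (velocity(x + h, y)[0] - velocity(x - h, y)[0]) / (2.0 * h)
    dvy_dy = (velocity(x, y + h)[1] - velocity(x, y - h)[1]) / (2.0 * h)
    return dvx_dx + dvy_dy


def euler_trajectory(velocity, x, y, steps=10, dt=0.1):
    """Integrate a particle with fixed-step (plain) Euler updates."""
    path = [(x, y)]
    for _ in range(steps):
        vx, vy = velocity(x, y)
        x, y = x + dt * vx, y + dt * vy
        path.append((x, y))
    return path


def expanding(x, y):
    return (x, y)    # pushes particles outward: divergence is 2 everywhere


def rotating(x, y):
    return (-y, x)   # pure rotation: divergence is 0 everywhere


div_expanding = divergence(expanding, 0.3, -0.2)
div_rotating = divergence(rotating, 0.3, -0.2)
path = euler_trajectory(expanding, 1.0, 0.0, steps=3, dt=0.1)
```

In the expanding field, nearby particles spread apart as they move, which is the kind of local expansion the abstract associates with distorted trajectories.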

2605.17729 2026-05-19 cs.CV cs.AI cs.LG

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

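Danu Kim

Affiliations: Danu Kim

AI summary: This paper proposes a replay-based domain-incremental continual learning method that keeps pneumonia detection robust and consistent under cross-domain variation. It uses class-aware balanced replay to maintain balanced class representation and a class-aware loss for dynamic reweighting, reaching 88.66% average accuracy on a domain-shifted PneumoniaMNIST dataset and outperforming Experience Replay, Fine-Tuning, and Joint Training baselines.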

Comments Published in Korea Software Congress (2025)

Abstract

Deep learning models have achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning method designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.
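
One plausible way to keep a fixed-size replay memory balanced across classes, together with inverse-frequency loss weights, is sketched below in Python. The eviction rule (drop a random sample from the largest class) and the weighting formula are assumptions chosen for illustration; the abstract does not specify either, and the class and function names are hypothetical.

```python
import random


class ClassBalancedReplay:
    """Fixed-capacity replay memory that evicts from whichever class is currently largest."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.by_class = {}
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(items) for items in self.by_class.values())

    def add(self, sample, label):
        self.by_class.setdefault(label, []).append(sample)
        if len(self) > self.capacity:
            # Over budget: remove a random sample from the most represented class.
            largest = max(self.by_class, key=lambda c: len(self.by_class[c]))
            bucket = self.by_class[largest]
            bucket.pop(self.rng.randrange(len(bucket)))

    def class_counts(self):
        return {label: len(items) for label, items in self.by_class.items()}


def inverse_frequency_weights(counts):
    """Per-class loss weights that grow as a class becomes rarer."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


memory = ClassBalancedReplay(capacity=4)
for i in range(10):
    memory.add(f"normal_{i}", label=0)      # majority class arrives first
for i in range(2):
    memory.add(f"pneumonia_{i}", label=1)   # minority class arrives later
print(memory.class_counts())                 # {0: 2, 1: 2}
```

With this rule the minority class is never crowded out by a stream dominated by the majority class, which is the property a class-aware replay needs under a tight memory budget.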

2605.17727 2026-05-19 cs.CV

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

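Zesheng Li, Chengchang Pan, Honggang Qi

Affiliations: University of the Chinese Academy of Sciences

AI summary: This paper studies how embedding length can be turned into a controllable semantic access interface. It proposes GraSP-VL, which learns a shared near-orthogonal prefix transform to give vision-language embeddings a coarse-to-fine semantic prefix interface, and validates it on several datasets.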

Comments Preprint

Abstract

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below $10^{-6}$. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.
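
The claim that a shared orthogonal transform can change prefix behavior while leaving full-dimensional geometry intact follows from a basic property of rotations, which the toy Python sketch below demonstrates in two dimensions. It uses an exactly orthogonal rotation with an arbitrary angle, whereas the paper learns a near-orthogonal transform; the vectors and function names are hypothetical.

```python
import math


def rotate(v, theta):
    """Apply a 2-D rotation, the simplest orthogonal transform."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def dot(a, b, k=None):
    """Inner product over the first k coordinates (all coordinates if k is None)."""
    return sum(x * y for x, y in zip(a[:k], b[:k]))


image = [1.0, 0.0]   # stand-in for an image embedding
text = [0.6, 0.8]    # stand-in for a text embedding
theta = 0.7          # arbitrary angle; the same transform is shared by both modalities

image_t, text_t = rotate(image, theta), rotate(text, theta)

full_before, full_after = dot(image, text), dot(image_t, text_t)            # unchanged by the rotation
prefix_before, prefix_after = dot(image, text, k=1), dot(image_t, text_t, k=1)  # changed by the rotation
```

The full inner product is the same before and after the rotation, so nothing about the original space is rewritten, yet the similarity computed from the first coordinate alone is different. Choosing the transform therefore amounts to choosing what a short prefix exposes.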

2605.17721 2026-05-19 cs.AI

EXG: Self-Evolving Agents with Experience Graphs

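Yuxin Jin, Siyuan Zhang, Hanchen Wang, Lu Qin, Ying Zhang, Wenjie Zhang

Affiliations: University of Technology Sydney; The University of New South Wales

AI summary: This paper proposes EXG, an experience graph framework for self-evolving agents that organizes accumulated successes and failures into a structured representation, improving solution quality and resource efficiency on complex tasks.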

Abstract

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

2605.17719 2026-05-19 cs.CV

Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation

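Diego Adame, Fabian Vazquez, Jose A. Nunez, Huimin Li, Jinghao Yang, Erik Enriquez, DongChul Kim, Haoteng Tang, Bin Fu, Pengfei Gu

Affiliations: University of Texas Rio Grande Valley

AI summary: This paper proposes Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture that addresses two problems in existing Mamba segmentation models: pixel-wise directional scanning disrupts local 2D spatial structure, and simple summation-based fusion of scan directions adapts poorly to diverse object sizes, shapes, and boundaries.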

Abstract

CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.

2605.17714 2026-05-19 cs.CL

From Documents to Segments: A Contextual Reformulation for Topic Assignment

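Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi

Affiliations: LG AI Research; University of Illinois Urbana-Champaign

AI summary: This paper proposes segment-based topic allocation (SBTA), which assigns topics to short, coherent text segments rather than whole documents. This addresses the topic contamination that multi-topic documents cause in traditional topic models and yields cleaner, more interpretable topics.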

Comments Findings of ACL 2026

Abstract

Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. URL: https://huggingface.co/datasets/LG-AI-Research/SemEval-STM

2605.17710 2026-05-19 cs.CL eess.AS

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

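Sewade Ogun

Affiliations: Nigerian Languages

AI summary: This paper proposes SBPN, a multilingual ASR model that uses two-stage knowledge distillation to improve speech recognition for several Nigerian languages, substantially reducing word error rate and outperforming existing multilingual models.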

Comments 25 pages

Abstract

Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self-improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29% over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120M parameters) and SBPN-Large (600M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.
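
Since the headline result is a relative WER reduction, it may help to see how WER and a relative reduction are computed. The Python sketch below uses the standard word-level edit distance definition and assumes a non-empty reference; the sentences and WER values are made-up illustrations, not figures from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


wer = word_error_rate("the cat sat down", "the cat sat")  # one deletion over four words
gain = relative_reduction(0.40, 0.284)                     # illustrative values only
print(wer)  # 0.25
```

A relative reduction of 29% therefore means the error rate falls to 71% of the baseline value, for example from 40% to about 28.4%, rather than dropping by 29 percentage points.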

2605.17704 2026-05-19 cs.LG

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

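Alon Bebchuk, Nir Shavit

Affiliations: Tel-Aviv University; MIT and Red Hat AI

AI summary: This paper studies the lottery ticket hypothesis in early feature space. Using combinatorial toy models, it identifies what a winning ticket preserves in feature space and suggests that lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.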

Abstract

The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, "winning tickets," that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that the lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.

2605.17698 2026-05-19 cs.LG cs.MA

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

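Seth Karten, Cameron Crow, Chi Jin

Affiliations: Princeton University

AI summary: This work proposes Agent Bazaar, a framework for evaluating the economic alignment of multi-agent systems. Analyzing two failure modes (Algorithmic Instability and Sybil Deception), it finds that models largely fail to self-regulate, and it proposes training methods for economic alignment and the Economic Alignment Score (EAS).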

Comments 17 pages, 9 figures

Abstract

The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market ("The Crash"), where firms amplify price volatility until the market collapses, and (2) Sybil Deception in a C2C market ("The Lemon Market"), where a single deceptive agent controlling multiple coordinated seller identities floods the market with fraudulent listings, eroding trust and consumer welfare. We evaluate frontier and open-weight models across both scenarios and find that models largely fail to self-regulate, with failure severity varying by model rather than by size. We propose economically aligned harnesses, Stabilizing Firms and Skeptical Guardians, that improve outcomes but remain fragile under harder market conditions. To close this gap, we train agents with REINFORCE++ using an adaptive curriculum, producing a 9B model that outperforms all evaluated frontier and open-weight models. We propose the Economic Alignment Score (EAS), a 4-component scalar metric aggregating stability, integrity, welfare, and profitability, enabling direct cross-model comparison. Our results show that economic alignment is orthogonal to general capability and can be directly trained with targeted RL.

2605.17693 2026-05-19 cs.LG cs.AI

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

通过去噪策略优化微调口袋感知扩散模型

Yuan Xue, Daniel Kudenko, Megha Khosla

发表机构 * L3S Research Center(L3S研究中心) Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出DEPPA方法,基于去噪扩散策略优化,通过强化学习微调预训练的口袋感知扩散模型,以优化结合亲和力、类药性、可合成性和多样性等多种属性。

详情
AI中文摘要

基于结构的药物设计已因口袋感知3D生成模型而加速,但大多数方法主要拟合训练分布,可能无法满足真实世界治疗药物发现所需的多种属性。最近,基于结构的分子优化(SBMO)受到越来越多的关注,其目标是对多个指定的分子属性进行细粒度控制。在本文中,我们提出DEPPA,一种新的SBMO方法,它基于去噪扩散策略优化,通过强化学习微调预训练的口袋感知扩散模型。DEPPA能够优化多个属性,包括结合亲和力、类药性、可合成性和多样性。我们将预训练口袋感知扩散模型的反向去噪过程建模为多步马尔可夫决策过程,其中作为奖励信号的期望属性在最终生成的配体分子上进行评估。DEPPA在RL微调期间采用粗粒度的去噪调度器,以实现高效且有效的分子优化。在CrossDocked2020基准上的实验结果表明,DEPPA在结合亲和力(Vina Score -8.5 kcal/mol)、类药性和多样性方面优于基线,同时在可合成性方面表现出有竞争力的性能。源代码可在 https://github.com/xy9485/DePPA 获得。

英文摘要

Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at https://github.com/xy9485/DePPA .

2605.17691 2026-05-19 cs.CL cs.AI

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

验证你的权威:在多标签先例处理分类上对LLM进行基准测试

M. Mikail Demir, M. Abdullah Canbaz

发表机构 * Department of Information Science and Technology(信息科学与技术系) College of Emergency Preparedness, Homeland Security, and Cybersecurity(应急准备、国土安全与网络安全学院) University at Albany, SUNY(纽约州立大学奥尔巴尼分校)

AI总结 本文提出了一种新的评估框架,通过专家标注的数据集对现代大语言模型进行基准测试,引入了平均严重性误差指标,以更准确地衡量分类错误的实践影响。

Comments Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP

详情
AI中文摘要

自动化法律先例中负面处理的分类是一个关键但复杂的自然语言处理任务,误分类可能带来重大风险。为了解决标准准确率的不足,本文介绍了一种更稳健的评估框架。我们对239个真实世界法律引用的新专家标注数据集上的现代大语言模型进行了基准测试,并提出了一种新的平均严重性误差度量标准,以更好地衡量分类错误的实践影响。我们的实验揭示了性能的分裂。Google的Gemini 2.5 Flash在高层次分类任务上达到了最高准确率(79.1%),而OpenAI的GPT-5-mini则在更复杂的细粒度模式上表现最佳(67.7%)。本工作建立了关键基准,提供了一个新的上下文丰富的数据集,并引入了一个针对这一复杂法律推理任务的评估度量标准。

英文摘要

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
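The abstract does not define the Average Severity Error, so the following is only a minimal sketch of one plausible reading: each treatment label gets an ordinal severity rank, and the metric averages the rank distance between gold and predicted labels. The label names, the ranks in `SEVERITY`, and the function names are illustrative assumptions, not the paper's schema or formula.

```python
from typing import Sequence

# Hypothetical ordinal severity ranks (assumption, not taken from the paper).
SEVERITY = {
    "followed": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of citations whose predicted label matches the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_severity_error(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Mean absolute distance between gold and predicted severity ranks."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    total = sum(abs(SEVERITY[g] - SEVERITY[p]) for g, p in zip(gold, pred))
    return total / len(gold)


gold = ["followed", "overruled", "criticized", "distinguished"]
near_miss = ["followed", "criticized", "criticized", "distinguished"]
far_miss = ["followed", "followed", "criticized", "distinguished"]
```

Under these assumed ranks, `near_miss` and `far_miss` have the same accuracy, but the second confuses "overruled" with "followed" and therefore gets a larger severity error. This is the kind of distinction plain accuracy cannot express.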

2605.17686 2026-05-19 cs.CV

Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision

脑启发式脉冲时间依赖性可塑性用于可靠的标签高效事件相机视觉

Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

发表机构 * School of Electrical and Computer Engineering, University of Oklahoma(俄克拉荷马大学电气与计算机工程学院)

AI总结 本文提出了一种基于脑启发式脉冲时间依赖性可塑性(STDP)的事件相机视觉方法,通过三个局部STDP模块实现无需GPU支持的单线程处理,提升了标签效率和检测性能。

详情
AI中文摘要

部署事件相机目标检测器受到逐帧标注需求和GPU计算需求的限制。本文引入了三个局部脉冲时间依赖性可塑性(STDP)模块,包括序列、候选和管(tube)可靠性模块,这些模块在单个CPU线程上运行而无需GPU支持。在FRED无人机基准测试中,所提出的框架覆盖了三个标签高效的监督层级。严格零标签检测器实现了53.8%的mAP@30,约26个由训练数据导出的比特实现了76.9%的mAP@30,而STDP候选可靠性门实现了78.60±0.42%的mAP@30。在采集顺序漂移下,群体门(cohort gate)比流式k-means高出2.03±0.58个百分点,20次试验中20次为正,而无漂移对照实验证伪了该效应。STDP将单模型方差降低了6.6倍,单个训练好的门可与44个种子的集成上界相当。该门可迁移到Intel Lava上,前两名一致率为89%。在EVUAV基准测试中,管级STDP层在Pd≥88%的条件下将误报从454降至331e-4。密集梯度训练检测器无法提供这种梯度训练、密集矩阵乘法和无局部可塑性操作的组合。

英文摘要

Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%. Dense gradient-trained detectors cannot provide this combination of gradient training, dense matrix multiplication, and local plasticity-free operation by construction.

2605.17685 2026-05-19 cs.CV cs.AI cs.CR cs.SY eess.SP eess.SY

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

基于注意力引导的1D和2D CNN融合用于鲁棒的基于ECG的生物识别

Islameddine Arioua, Amir Benzaoui, Abdelhafid Zeroual, Lotfi Houam

发表机构 * PIMIS Laboratory, Electronics and Telecommunications Department(PIMIS实验室,电子与电信系) Université du 8 Mai 1945(1945年5月8日大学) Electrical Engineering Department, University of 20 August 1955(电气工程系,1955年8月20日大学) Department of Electrical Engineering, Faculty of Science and Applied Sciences(电气工程系,科学与应用科学学院) Larbi Ben M'hidi University(拉尔比·本·姆希迪大学) Department of Electronics and Communications, University of Larbi Tebessi(电子与通信系,拉尔比·特贝西大学)

AI总结 本文提出了一种结合1D和2D CNN的混合框架,通过注意力引导融合机制提升ECG生物识别的鲁棒性和性能,实验表明该方法在多个数据集上均取得了较高的识别准确率。

Journal ref Digital Signal Processing 2026

详情
AI中文摘要

基于心电图(ECG)的生物识别已成为安全身份验证和活体检测的一种有前景的解决方案。然而,大多数现有方法依赖于单模态深度学习架构,单独处理一维(1D)时间信号或二维(2D)时频表示,限制了鲁棒性和泛化能力。为了解决这个问题,本文提出了一种将1D和2D卷积神经网络(CNNs)整合到统一端到端架构中的混合框架。1D分支从原始ECG信号中提取时序和形态学特征,而2D分支从时频表示中捕获判别性的频谱信息。注意力引导的融合机制根据输入特性动态加权两种模态,克服了传统静态融合策略的局限性。该框架在三个基准数据集(ECG-ID、MIT-BIH和PTB)上进行了评估,其中包括健康受试者和患有心脏疾病的患者,分别实现了99.56%、100.00%和99.89%的识别准确率。为了评估长期生物特征持久性,还在跨越十年的多会话Heartprint数据集上进行了实验。所提出的方法在同会话评估中实现了98.54%(S1)、99.09%(S2)、94.93%(S3R)和96.08%(S3L)的准确率,跨会话评估达到了56.33%(S1-S2)和53.27%(S2-S3R),表明其能够捕获随时间稳定的生物特征。最优配置结合了用于1D处理的InceptionTime、用于2D分析的ResNet-34以及基于注意力的融合。消融研究证实,所提出的注意力机制始终优于传统融合方法。总体而言,所提出的框架为ECG生物识别提供了一种稳健、可扩展且高性能的解决方案。

英文摘要

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.
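To make "dynamically weights both modalities according to input characteristics" concrete, here is a minimal sketch of input-dependent fusion: each branch's feature vector is scored, the two scores pass through a softmax, and the fused feature is the weighted sum. The scoring vectors `w_1d` and `w_2d` stand in for learned parameters and are assumptions; the paper's actual attention module is not specified in the abstract and may differ.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(
    feat_1d: Sequence[float],
    feat_2d: Sequence[float],
    w_1d: Sequence[float],
    w_2d: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse two equally sized branch features with input-dependent weights.

    w_1d and w_2d are hypothetical learned scoring vectors (assumption).
    Returns the fused feature and the two modality weights.
    """
    weights = softmax([dot(w_1d, feat_1d), dot(w_2d, feat_2d)])
    fused = [weights[0] * a + weights[1] * b for a, b in zip(feat_1d, feat_2d)]
    return fused, weights


fused_even, weights_even = attention_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
fused_skew, weights_skew = attention_fuse([1.0, 0.0], [0.0, 1.0], [10.0, 0.0], [0.0, 0.0])
```

A static fusion scheme would fix the two weights in advance. Here they change per input, so a sample whose 1D features score highly leans on the 1D branch.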

2605.17684 2026-05-19 cs.AI cs.SE

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

EGI:一种多模态情感AI框架,用于增强Scrum Master的实时自我意识

Jingni Huang, Peter Bloodsworth

发表机构 * Department of Computer Science(计算机科学系) University of Oxford(牛津大学)

AI总结 本文提出一种多模态情感AI框架EGI,通过整合四个精选的AI模型,实时监测Scrum Master和会议组织者无意识表达的情绪,提升团队动态中的情绪感知能力。

详情
AI中文摘要

尽管越来越多的研究关注敏捷团队成员的情绪福祉,但在Scrum Master和会议组织者的情绪监测研究中仍存在显著差距,这些角色对团队动态的影响至关重要。本文提出了一种新的应用,整合四个精心选择和推荐的AI模型,通过实时语音转文本模型进行实时转录;通过阈值分析检测语气中的情绪线索;通过基于情绪的词汇匹配识别语音内容中的情感;并通过开源的多模块AI API提供上下文感知的建议,包含情绪关键词。系统在模拟会议环境中实现了10%的ASR词错误率。我们的评估表明,实时反馈显著提高了模拟敏捷会议中的情绪感知能力,为Scrum Master和会议组织者提供实时和实用的建议,帮助他们快速识别并减少负面情绪的表达,促进更积极有效的团队互动。

英文摘要

While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizers, whose impact on team dynamics is crucial. This paper proposes a novel application integrating four carefully selected and recommended AI models to monitor the unconsciously expressed emotions of these key roles. This is achieved through: real-time transcription using a speech-to-text model; thresholding for intonation analysis to detect emotional cues in prosody; applying emotion-based vocabulary matching to identify sentiment in spoken content; and providing context-aware suggestions containing emotion keywords using an open-source, multi-module AI API. The system achieved an ASR word error rate (WER) of 10% in simulated meeting environments. Our evaluation shows that real-time feedback significantly improves emotion awareness during simulated agile meetings, providing Scrum Masters and meeting organizers with real-time and practical suggestions to help them quickly identify and minimize the expression of negative emotions, fostering more positive and effective team interactions.

2605.17682 2026-05-19 cs.CV

GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

GEM:用于占用预测和运动规划的高斯演化模型

Cheng Chen, Hao Huang, Saurabh Bagchi

发表机构 * Purdue University(普渡大学) New York University Abu Dhabi(纽约大学阿布扎比分校)

AI总结 该研究提出GEM模型,通过高斯演化模型实现高效的占用预测和运动规划,解决了传统方法在时间灵活性、场景演化和连续时间动态匹配上的不足。

详情
AI中文摘要

未来3D语义占用预测和运动规划是自动驾驶的核心,因为它们要求模型推理周围场景如何演变以及自车应如何行动。现有占用世界模型通常将场景离散化为潜在嵌入、体素特征或量化标记,并通过固定步长的自回归生成来预测未来状态。这限制了时间灵活性,掩盖了场景演变,在长时间范围内累积误差,并且难以匹配真实驾驶场景的连续时间动态。我们提出了GEM,一种用于非自回归占用世界建模的高斯演化模型,其中驾驶场景被表示为具有学习动态的显式连续4D高斯基元。GEM不是逐步推演未来占用状态,而是直接在任意时间戳查询高斯世界表示,并将相应的条件3D高斯泼溅(splat)到语义占用体中。这使得能够高效地进行全时间范围预测,同时保留紧凑且可解释的场景表示。通过解耦空间几何、时间支撑和基元运动,GEM使预测的世界更容易检查,因为每个基元的演变可以随时间被连续跟踪。同一表示也支持运动规划,即从学习到的高斯世界预测未来的自车轨迹。大量实验表明,GEM在未来语义占用预测上达到了最先进的水平,并具有强大的运动规划性能,同时提供灵活的时间查询。

英文摘要

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.
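The phrase "queries the Gaussian world representation at arbitrary timestamps" can be illustrated with textbook Gaussian conditioning. The sketch below slices a single 4D Gaussian over (x, y, z, t) at a query time, which yields a conditional 3D Gaussian plus a temporal weight. The parameterization (a full space-time covariance split into `cov_xyz`, `cov_xt`, `var_t`) and the function name are assumptions for illustration; GEM decouples geometry, temporal support, and motion, so its actual parameterization may look different.

```python
import math
from typing import List, Sequence, Tuple


def condition_on_time(
    mean_xyz: Sequence[float],
    mean_t: float,
    cov_xyz: Sequence[Sequence[float]],
    cov_xt: Sequence[float],
    var_t: float,
    t: float,
) -> Tuple[List[float], List[List[float]], float]:
    """Condition a 4D Gaussian over (x, y, z, t) on a query time t.

    Standard result for a jointly Gaussian vector:
      mean_{x|t} = mean_x + cov_xt * (t - mean_t) / var_t
      cov_{x|t}  = cov_xx - cov_xt cov_xt^T / var_t
    The temporal weight is the unnormalized marginal density of t.
    """
    if var_t <= 0.0:
        raise ValueError("var_t must be positive")
    dt = t - mean_t
    cond_mean = [m + c * dt / var_t for m, c in zip(mean_xyz, cov_xt)]
    cond_cov = [
        [cov_xyz[i][j] - cov_xt[i] * cov_xt[j] / var_t for j in range(3)]
        for i in range(3)
    ]
    temporal_weight = math.exp(-0.5 * dt * dt / var_t)
    return cond_mean, cond_cov, temporal_weight


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A primitive whose x position is correlated with time, i.e. it drifts along x.
mean_at_2, cov_at_2, weight_at_2 = condition_on_time(
    [0.0, 0.0, 0.0], 0.0, identity, [0.5, 0.0, 0.0], 1.0, 2.0
)
```

Because the query time is a plain argument, any timestamp can be evaluated directly without rolling forward through intermediate steps. That is the contrast with fixed-step autoregressive forecasting drawn in the abstract.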

2605.17681 2026-05-19 cs.RO

PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots

PRIME: 面向腿足机器人和人形机器人的物理一致机器人惯性与运动估计

Jiarong Kang, Kunzhao Ren, Tao Pang, Xiaobin Xiong

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Shanghai Innovation Institute(上海创新研究院)

AI总结 该研究提出PRIME方法,通过结合可微接触动力学和平滑互补约束,从机载传感器数据中获得物理一致的运动轨迹和惯性参数估计,从而提升机器人运动估计的准确性。

Comments Robotics: Science and Systems 2026

详情
AI中文摘要

人形和腿足机器人通过间歇性接触与环境互动,使准确的运动估计从根本上依赖于对接触动力学的推理。然而,标准的传感流程(无论是基于机载本体感觉的扩展卡尔曼滤波器(EKFs),还是外部运动捕捉系统)只能恢复运动学信息,而接触力、接触时序和惯性参数仍未被观测。因此,纯运动学重建往往违反刚体动力学,尤其是在接触丰富的运动中。为了在真实世界部署中从机载运动学数据实现准确的运动估计,我们提出PRIME(Physically-consistent Robotic Inertial and Motion Estimation),一种最大后验(MAP)公式,将测量的运动学和执行器命令细化为动力学一致的轨迹,同时联合估计摩擦接触力和物理一致的惯性参数。我们的方法结合了带平滑互补约束的可微接触动力学和Anitescu风格的摩擦模型,得到一个平滑的优化问题,在多样的接触转换中保持可处理性。我们在接触丰富的运动任务上使用四足机器人和Unitree G1人形机器人评估了PRIME,展示了改进的轨迹一致性和准确的惯性参数识别。除了利用校准后的惯性参数改善状态估计和反馈控制外,PRIME还能从部署中的真实机器人生成带有力和接触标注的运动重建,可为下游学习应用(包括大规模行为建模和机器人基础模型)提供高质量数据。

英文摘要

Humanoid and legged robots interact with the environment through intermittent contacts, making accurate motion estimation fundamentally dependent on reasoning about contact dynamics. However, standard sensing pipelines, whether based on onboard proprioception with Extended Kalman Filters (EKFs) or external motion capture systems, recover only kinematics, while contact forces, contact timing, and inertial parameters remain unobserved. As a result, purely kinematic reconstructions often violate rigid-body dynamics, particularly during contact-rich motions. To enable accurate motion estimation from onboard kinematics in real-world deployment, we propose PRIME (Physically-consistent Robotic Inertial and Motion Estimation), a Maximum A Posteriori (MAP) formulation that refines measured kinematics and actuator commands into a dynamically consistent trajectory while jointly estimating frictional contact forces and physically consistent inertial parameters. Our approach incorporates differentiable contact dynamics with smoothed complementarity constraints and an Anitescu-style friction model, yielding a smooth optimization problem that remains tractable across versatile contact transitions. We evaluate PRIME on contact-rich locomotion with quadrupedal robots and the Unitree G1 humanoid, demonstrating improved trajectory consistency and accurate inertial parameter identification. Beyond improving state estimation and feedback control with calibrated inertial parameters, PRIME produces force- and contact-annotated motion reconstructions from real robots in deployment, which can be used to provide high-quality data for downstream learning applications, including large-scale behavior modeling and robot foundation models.

2605.17673 2026-05-19 cs.CV

A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures

一种简单的生物识别方法:基于Sobel滤波器和相似性度量的指节纹识别

E. O. Rodrigues, T. M. Porcino, Aura Conci, Aristofanes C. Silva

发表机构 * Department of Computer Science, Universidade Federal Fluminense(弗鲁米嫩塞联邦大学计算机科学系) Department of Electrical Engineering, Universidade Federal do Maranhão(马拉尼昂联邦大学电气工程系)

AI总结 本文提出了一种简单的指节纹识别方法,利用基于Sobel算子的边缘检测和噪声抑制生成便于处理和存储的二值图像,并评估了若干相似性度量;实验表明在大规模数据集上最高达到17.02%的正确识别率。

Journal ref 2016 International Conference on Systems, Signals and Image Processing (IWSSIP)

详情
AI中文摘要

本文的目标是提出一种新的指节纹识别方法,指节纹本质上是手指指节区域的数字照片。我们采用了非常简单的视觉计算概念,如基于Sobel算子的边缘检测滤波器和简单的噪声抑制算法。这些操作非常快速,并产生二值图像,这些图像在处理和存储上都非常高效。此外,除了预处理之外,还考虑并评估了若干相似性度量以用于该任务。输入的手指图像经过预处理后,与数据集中的所有手指图像逐一进行比较。我们在一个大规模数据集上获得了最高17.02%的成功识别率(真阳性率)。

英文摘要

The objective of this work is to propose a novel methodology for the finger knuckle print recognition, which is essentially a digital photo of the finger-knuckle region. We have employed very simple concepts of visual computing such as a filter based on the Sobel operator for finding edges and a simple noise reduction algorithm. These operations are exceptionally fast and produce binary images, which are very efficient to process and to store. Furthermore, alongside this preprocessing, some similarity measures were also regarded and evaluated for the task. After preprocessing an input finger it is compared to all the images of fingers in the dataset, one by one. We have obtained up to 17.02% of successful recognitions (true positive rate) with a large dataset.
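The pipeline described here (Sobel edges, binarization, then one-by-one comparison against the dataset) can be sketched in a few lines of pure Python. The Sobel kernels are the standard ones. The threshold value, the choice of Jaccard overlap as the similarity measure, and the names `sobel_binary_edges`, `jaccard`, and `identify` are assumptions for illustration, since the abstract does not say which measures or parameters the authors used. The noise-reduction step is omitted.

```python
from typing import Dict, List

Image = List[List[float]]   # grayscale image as rows of pixel intensities
Binary = List[List[int]]    # binary edge map with values 0 or 1

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_binary_edges(img: Image, threshold: float) -> Binary:
    """Mark interior pixels whose Sobel gradient magnitude reaches threshold."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                out[y][x] = 1
    return out


def jaccard(a: Binary, b: Binary) -> float:
    """Overlap of two equally sized binary maps: |A and B| / |A or B|."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += pa & pb
            union += pa | pb
    return inter / union if union else 1.0


def identify(query: Binary, gallery: Dict[str, Binary]) -> str:
    """Compare the query to every enrolled edge map and return the best match."""
    return max(gallery, key=lambda name: jaccard(query, gallery[name]))


# Two toy 5x5 "images": a vertical and a horizontal intensity step.
vertical = [[0.0, 0.0, 10.0, 10.0, 10.0] for _ in range(5)]
horizontal = [[0.0] * 5, [0.0] * 5, [10.0] * 5, [10.0] * 5, [10.0] * 5]
edges_v = sobel_binary_edges(vertical, threshold=20.0)
edges_h = sobel_binary_edges(horizontal, threshold=20.0)
```

Binary edge maps are cheap to store and compare, which matches the efficiency argument in the abstract. The exhaustive comparison in `identify` is linear in the number of enrolled images.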

2605.17672 2026-05-19 cs.CL

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

推理收敛即停止:面向推理模型的语义保留式提前退出

Dehai Min, Giovanni Vaccarino, Huiyi Chen, Yongliang Wu, Gal Yona, Lu Cheng

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) Google Research(谷歌研究) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Politecnico di Milano(米兰理工学院)

AI总结 本文提出PUMA框架,通过识别推理层面的语义冗余信号,实现语义保留的早退出,从而在保持准确性和推理链完整性的同时减少token消耗。

Comments under review

详情
AI中文摘要

大型推理模型(LRMs)通过生成长思维链(CoT)实现强大的性能,但常常过度思考,在解答已经稳定之后仍继续推理,从而浪费token并增加延迟。现有的推理时提前退出方法主要依赖答案层面的信号(如置信度或试探答案的一致性)来决定何时停止。然而,这些信号主要反映答案是否就绪,而非推理是否收敛:它们可能在模型尚未完成探索或自我纠正之前触发,导致过早退出,从而降低最终答案的准确性,并使保留下来的推理链在语义上不完整。我们将推理层面的语义冗余识别为语义保留式提前退出的互补信号:当连续的步骤不再带来新的进展,而是重复已确立的结论时,推理轨迹很可能已经收敛。基于这一见解,我们提出了PUMA,一个即插即用的框架,它将轻量级的冗余检测器与答案层面的验证相结合。检测器标记语义冗余的候选退出点,而验证确认停止是否安全,使PUMA能够删除冗余的后续推理,同时保持答案准确性和连贯的推理前缀。在五个LRMs和五个具有挑战性的推理基准上,PUMA平均减少了26.2%的token,同时保持了准确性和所保留CoT的质量。在代码生成、零样本视觉-语言推理和学习式停止策略内化上的额外实验进一步表明,推理层面的冗余是实现高效推理的稳健、可迁移且可学习的信号。我们的代码可在 https://github.com/giovanni-vaccarino/PUMA 获得。

英文摘要

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.
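The two-stage idea (a redundancy detector proposes exit points, answer-level verification confirms them) can be sketched as below. This is not PUMA's detector: token-set Jaccard overlap is swapped in as a crude stand-in for a learned semantic redundancy model, and `is_answer_safe`, `threshold`, and `patience` are hypothetical names and parameters introduced only for this illustration.

```python
from typing import Callable, List, Optional, Set


def _tokens(step: str) -> Set[str]:
    return set(step.lower().split())


def redundancy(step: str, history: List[str]) -> float:
    """Highest token-set Jaccard overlap between a step and any earlier step."""
    cur = _tokens(step)
    best = 0.0
    for prev in history:
        prev_tokens = _tokens(prev)
        union = cur | prev_tokens
        if union:
            best = max(best, len(cur & prev_tokens) / len(union))
    return best


def early_exit_index(
    steps: List[str],
    is_answer_safe: Callable[[List[str]], bool],
    threshold: float = 0.8,
    patience: int = 2,
) -> Optional[int]:
    """Return how many leading steps to keep, or None if no safe exit is found.

    An exit is proposed once `patience` consecutive steps are redundant; the
    kept prefix ends just before that redundant run. The proposal is accepted
    only if the answer-level check approves the prefix.
    """
    streak = 0
    for i, step in enumerate(steps):
        if i > 0 and redundancy(step, steps[:i]) >= threshold:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            keep = i - patience + 1
            if is_answer_safe(steps[:keep]):
                return keep
    return None


trace = [
    "x plus 2 equals 5",
    "so x equals 3",
    "check 3 plus 2 equals 5",
    "so x equals 3",
    "so x equals 3",
    "final answer",
]
kept = early_exit_index(trace, is_answer_safe=lambda prefix: True)
```

In this toy trace the last novel step is the check, so the kept prefix is the first three steps. If the verifier rejects the proposal, generation simply continues, which mirrors the abstract's point that redundancy alone is a candidate signal rather than a stopping decision.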

2605.17671 2026-05-19 cs.LG cs.AI

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

PEIRA: 通过视图间回归器对齐学习预测编码器

Michael Arbel, Basile Terver, Jean Ponce

发表机构 * Univ. Grenoble Alpes, Inria CNRS, Grenoble INP, LJK(格勒诺布尔-阿尔卑斯大学、法国国家信息与自动化研究所、法国国家科学研究中心、格勒诺布尔INP、LJK实验室) Ecole Normale Supérieure / PSL Inria Paris(巴黎高等师范学院/PSL、法国国家信息与自动化研究所巴黎中心) New York University(纽约大学)

AI总结 本文提出PEIRA方法,通过显式目标函数和线性回归器对齐来实现非对比自监督学习,通过理论分析和实验验证其在ImageNet-1K和CIFAR-10上的有效性。

详情
AI中文摘要

非对比自监督学习(SSL)是预测表示学习的有效框架,但SimSiam、BYOL、I-JEPA或DINO等流行(且在实践中有效)的方法依赖某种形式的自蒸馏来训练教师-学生网络,它们通常并不最小化一个定义明确的目标函数,因此仍未被充分理解。我们分析了联合嵌入预测架构(JEPA)的一个变体的动力学,该变体使用正则化的线性回归器在数据的两个视图之间相互预测学习到的表示,并完整刻画了其稳定性:非坍塌的稳定平衡点与主导的非线性典型相关子空间对齐,而坍塌的平衡点也可能是稳定的吸引子。受此结果启发,我们引入PEIRA,一种非对比SSL方法,其显式目标函数通过最优线性回归器的迹来定义。我们证明其唯一的稳定平衡点是非平凡的全局最小值点,并且恢复相同的典型相关子空间,其中正则化决定有效维度。在ImageNet-1K和CIFAR-10上的实验表明,PEIRA与VICReg和LeJEPA基线相比具有竞争力,定性实验结果也支持该理论。

英文摘要

Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.

2605.17669 2026-05-19 cs.AI

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

利用语言与视觉模型的多模态文化遗产知识图谱扩展

Yang Zhang, Nada Mimouni, Jean-Claude Moissinac, Fayçal Hamdi

发表机构 * Center for Studies and Research in Computer Science and Communication, CNAM(法国国立工艺学院(CNAM)计算机科学与通信研究中心)

AI总结 本文提出了一种多模态方法,利用语言和视觉模型扩展文化遗产知识图谱,通过构建多模态知识图谱WJoconde并建立评估基准,提高知识图谱的扩展效率和可靠性。

详情
AI中文摘要

文化遗产的保护与解读日益依赖数字技术,其中知识图谱(KGs)因能够结构化海量数据而脱颖而出。然而,由于文化遗产信息的多样性和复杂性,这些KGs的构建和扩展往往面临挑战。本文提出了一种用于扩展文化遗产领域KG资源的新方法,并将其应用于法语数据。首先,我们引入了法国文化遗产领域的一个新知识图谱WJoconde,其特点是多模态,整合了实体的文本和图像信息。我们进一步引入了WJoconde的三个变体,以促进知识图谱补全(KGC)等下游研究。我们还在该数据集上为KGC方法建立了一个全面的基准。其次,我们提出了一种利用大型语言模型(LLMs)和视觉-语言模型(VLMs)的多模态方法来扩展文化遗产KGs的新框架,其中包括从非结构化资源中自动提取数据,并结合专门的验证流程为两类模型的输出提供依据(grounding),以进一步扩展WJoconde。我们的结果表明,通过整合文化遗产数据中丰富的文本和图像信息,可以高效且高可靠地增强KGs。我们开源了所有代码和包含文本与图像的基准数据集,以及带有交互式访问入口的原始数据。

英文摘要

The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast amounts of data. However, the construction and expansion of these KGs often face challenges due to the diverse and complex nature of cultural heritage information. In this paper, we propose a novel approach for extending KG resources in the domain of cultural heritage, which we applied to French data. First, we introduce a new knowledge graph in the domain of French cultural heritage, WJoconde, which is distinguished by its multimodality as it integrates both textual and image information of the entities. We further introduce three variants of WJoconde to facilitate downstream research, such as Knowledge Graph Completion (KGC). We also built a comprehensive benchmark for KGC methods on our dataset. Second, we propose a new framework for extending cultural heritage KGs using multi-modal approaches leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), which includes automated data extraction from unstructured resources combined with a special validation pipeline for grounding the output of both models, to further extend WJoconde. Our results show that by integrating the rich text and image information in cultural heritage data, we can efficiently enhance KGs with high reliability. We open-source all code and benchmark datasets with text and images, as well as the original data with an interactive access point

2605.17668 2026-05-19 cs.CV

Deep learning-based compression of giga-resolution whole slide images

基于深度学习的十亿像素级全切片图像压缩

Maren Høibø, Etienne Gaucher, Ingerid Reinertsen, Marit Valla, Erik Smistad

发表机构 * Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU)(临床与分子医学系,挪威科技大学(NTNU)) Clinic of Laboratory Medicine, St. Olavs hospital, Trondheim University Hospital(检验医学科,圣奥拉夫医院,特隆赫姆大学医院) Department of Health Research, SINTEF Digital(健康研究部,SINTEF Digital) Department of Circulation and Medical Imaging, Norwegian University of Science and Technology (NTNU)(循环与医学成像系,挪威科技大学(NTNU))

AI总结 本文研究了基于深度学习的全切片图像压缩方法,通过比较深度学习与传统编码方式(JPEG、JPEG-2000、JPEG-XL)在去除玻璃和压缩效果上的差异,发现深度学习压缩在减少文件大小方面更有效,但解压时间更长。

详情
AI中文摘要

数字病理学的实施导致全切片图像(WSI)数量增加。WSI体积庞大,带来了挑战。目前,WSI使用JPEG等编解码器压缩,每个WSI仍占用数GB空间,并且大量空间被浪费在存储玻璃上。本研究探讨了用于去除玻璃的基于深度学习的组织分割以及深度学习压缩方法,并与JPEG、JPEG-2000和JPEG-XL进行了比较。我们创建了三类图像金字塔(N=21):保留完整玻璃、将玻璃替换为单色像素、以及将玻璃替换为零字节图块,并使用JPEG、JPEG-XL和一个深度学习模型进行压缩。此外,在一个组织图像块数据集上评估了若干压缩模型,并与JPEG、JPEG-2000和JPEG-XL进行了比较。去除玻璃显著减小了JPEG和JPEG-XL的文件大小。与JPEG压缩相比,基于深度学习的图像压缩将WSI大小减少了43-72%,而基于深度学习的玻璃去除在仅使用单色像素和去除全玻璃图块两种方式下分别将WSI大小减少了0.3-33%和6-62%。将两者结合仅带来小幅提升,总大小减少44-80%,这表明基于深度学习的图像压缩能够高效压缩玻璃图块,而JPEG则不能。在组织图像块数据集上,最好的基于深度学习的压缩模型与JPEG相比平均每个图像块节省约35-40%,同时保持平均SSIM高于0.95,而JPEG-XL和JPEG-2000分别节省17%和14%,同时保持SSIM为0.96。然而,深度学习模型的解压时间比JPEG和JPEG-XL更长。

英文摘要

Implementation of digital pathology leads to an increased number of whole slide images (WSIs). The large size of WSIs is challenging. Today, WSIs are compressed with codecs like JPEG resulting in several gigabytes per WSI, and large amounts of space are wasted storing glass. In this study, deep learning-based tissue segmentation for glass removal, and deep learning compression methods were explored and compared with JPEG, JPEG-2000 and JPEG-XL. Image pyramids (N=21) with intact glass, glass replaced by single-colored pixels, and glass replaced by zero-byte tiles were created and compressed with JPEG, JPEG-XL and a deep learning model. Additionally, several compression models were evaluated on a tissue patch dataset and compared with JPEG, JPEG-2000 and JPEG-XL. Removing glass reduced file sizes considerably for JPEG and JPEG-XL. Deep learning-based image compression reduced the WSI size by 43-72% compared to JPEG compression, whereas deep learning-based glass removal reduced the WSI size by 0.3-33%, and 6-62% using only single-colored pixels and removing all-glass tiles, respectively. Combining the two gave a small improvement to a 44-80% total size reduction which indicates that deep learning-based image compression is able to efficiently compress glass tiles, whereas JPEG is not. On the tissue patch dataset, the best deep learning-based compression models saved on average ~35-40% per patch compared to JPEG, while keeping an average SSIM above 0.95, whereas JPEG-XL and JPEG-2000 saved 17% and 14%, respectively while keeping an SSIM of 0.96. However, the deep learning models had higher decompression times than JPEG and JPEG-XL.

2605.17661 2026-05-19 cs.RO cs.CV

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

Mono-Hydra++: 基于多任务学习的实时单目场景图构建用于3D室内映射

U. V. B. L. Udugama, George Vosselman, Francesco Nex

发表机构 * Department of Earth Observation Science, University of Twente(特文特大学地球观测科学系)

AI总结 本文提出Mono-Hydra++,一种基于多任务学习的实时单目RGB加IMU流水线,用于3D室内度量语义映射和分层3D场景图构建,通过结合M2H-MX多任务模型和深度特征视觉惯性里程计前端,实现了在资源受限的机器人平台上无需主动深度传感器的实时度量语义映射和场景图构建。

Comments Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. 50 pages, figures and tables included. Code: https://github.com/BavanthaU/mono-hydra-pp.git

详情
AI中文摘要

自主敏捷机器人需要的不仅仅是度量几何:它们必须理解物体、房间、地点和空间关系,以进行搜索、检查、探索和人机交互。传统度量地图支持定位和避障,但不提供这种语义和关系结构。3D场景图通过将几何与物体级和房间级的理解连接起来,填补了这一空白。在敏捷平台上构建此类表示仍然困难,因为空中和轻量级机器人受到严格的载荷、电力和计算限制,使RGB-D相机和LiDAR传感器在许多机载设置中不切实际。我们提出了Mono-Hydra++,一种实时单目RGB加IMU流水线,用于室内度量语义映射和分层3D场景图构建。该系统结合了M2H-MX,一种基于DINOv3的多任务模型,用于深度和语义,以及深度特征视觉惯性里程计前端,稀疏预测深度约束在VIO推导的姿态图中,语义遮蔽用于动态区域,以及在Mono-Hydra后端体积融合前的姿态感知时间对齐。在Go-SLAM ScanNet评估子集中,Mono-Hydra++在仅使用单目RGB加IMU输入的情况下,其平均轨迹误差比我们比较中的最强RGB-D基线低1.6%,在校准的7-Scenes中,其平均ATE比最强的竞争校准基线提高了29.8%。我们进一步在真实ITC建筑部署中验证了Mono-Hydra++,使用RealSense RGB加IMU,并通过在Jetson Orin NX 16GB上部署ONNX/TensorRT FP16 M2H-MX-L感知模型,以25.53 FPS的速度证明了嵌入可行性。这些结果表明,Mono-Hydra++可以在不依赖主动深度传感器的情况下,为资源受限的机器人平台提供实时度量语义映射和场景图构建。

英文摘要

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object level and room level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real time monocular RGB plus IMU pipeline for indoor metric semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3 based multi-task model for depth and semantics, with a deep feature visual inertial odometry front end, sparse predicted depth constraints in the VIO derived pose graph, semantic masking for dynamic regions, and pose aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real time metric semantic mapping and scene graph construction for resource constrained robotic platforms without relying on active depth sensors.

2605.17658 2026-05-19 cs.LG

When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering

当零样本射手作弊时:通过激活引导提升年龄估计

Erik Imgrund, Pia Hanfeld, Klim Kireev, Konrad Rieck

发表机构 * BIFOLD & TU Berlin(BIFOLD与柏林工业大学)

AI总结 本文研究了基于视觉语言模型的零样本年龄估计中出现的'身份捷径'现象,提出激活引导方法以提高年龄估计的准确性,将平均绝对误差最多降低25%。

详情
AI中文摘要

人们已提出多种与年龄相关的法规,以保护未成年人免受网络上有害内容和互动的影响。自动年龄估计是执行此类法规的关键,而视觉语言模型(VLMs)在该任务上实现了最先进的性能。然而,我们发现基于VLM的年龄估计的零样本特性会产生一个意外的副作用,我们称之为'身份捷径':VLMs往往不是从视觉特征中估计年龄,而是识别出图中的人物,并根据记忆中的知识推断其年龄。当非名人被误认为名人时,这种现象会导致严重错误的预测。它还会在名人图像上对噪声和对抗扰动表现出具有欺骗性的高鲁棒性,而名人图像在流行基准中占主导地位。为缓解这一问题,我们提出了一种激活引导方法,通过干预VLM的隐藏状态来抑制该捷径。该方法提高了对已记忆身份和未见身份的年龄估计准确性,在多个流行基准上将平均绝对误差最多降低25%。

English abstract

Different age-related regulations have been proposed to protect minors from harmful content and interactions online. Automated age estimation is central to enforcing such regulations, and vision-language models (VLMs) achieve state-of-the-art performance on this task. However, we find that the zero-shot nature of VLM-based age estimation produces an unexpected side effect we call the identity shortcut: Instead of estimating age from visual features, VLMs tend to identify the depicted person and infer their age from memorized knowledge. This phenomenon leads to substantially incorrect predictions when non-celebrities are misidentified as celebrities. It also produces deceptively high robustness to noise and adversarial perturbations on celebrity images, which dominate popular benchmarks. To mitigate this, we propose an activation steering method that suppresses the shortcut by intervening on the hidden states of the VLM. This method improves age estimation accuracy for both memorized and unseen identities, reducing mean absolute error by up to 25% across popular benchmarks.
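The abstract does not say how the steering intervention is constructed, so the following is only a minimal sketch of one common form of activation steering: removing the component of a hidden state along a single direction. The names `steer`, `shortcut_direction`, and `alpha` are hypothetical, the vectors are toy values rather than real VLM activations, and the `mean_absolute_error` helper simply spells out the metric the abstract reports.

```python
from math import sqrt


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, alpha=1.0):
    """Subtract alpha times the projection of `hidden` onto `direction`.

    With alpha=1.0 the result is orthogonal to `direction`.
    Assumption: the shortcut is captured by one direction; the paper
    may use a different intervention.
    """
    norm = sqrt(dot(direction, direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = dot(hidden, unit)
    return [h - alpha * coeff * u for h, u in zip(hidden, unit)]


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true ages."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy example with made-up numbers (not real model activations).
hidden_state = [3.0, 4.0, 0.0]
shortcut_direction = [1.0, 0.0, 0.0]  # hypothetical "identity" direction
steered = steer(hidden_state, shortcut_direction)
```

In a real model the same subtraction would be applied to the hidden states of chosen layers during the forward pass; which layers to use and how the direction is estimated are left open here.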

2605.17653 2026-05-19 cs.LG cs.AI

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

LLMForge: 多后端硬件感知的神经架构搜索与无限头注意力用于边缘语言模型

Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

Affiliations: Brown University; University of Michigan; Google Research

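AI summary: This paper presents LLMForge, a multi-backend, hardware-aware neural architecture search framework. Infinite-Head Attention expands the per-layer attention configuration space, and Forge-Former together with Forge-DSE enables efficient architecture search for edge language models. The searches yield differently shaped architectures on different hardware substrates, with variants optimized for different performance metrics.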

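AI Chinese abstract (translated to English)

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets, making architectural choices and accelerator-specific costs central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, because different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, the number of KV groups, and the per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by roughly 400x over grouped-query attention within our search-space ranges. Forge-Former is an encoder-based surrogate for ranking architecture candidates that outperforms MLP and random-forest baselines. Forge-DSE is an NSGA-II-based design-space-exploration engine that pairs Forge-Former with a multi-backend hardware cost model covering GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is retrained on FineWeb-Edu-10BT under a matched recipe and compared against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss (2.798) and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.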

English abstract

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.
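To make the "variants on the Pareto front" idea concrete, here is a minimal sketch of non-dominated filtering over several objectives that are all minimized. It is not Forge-DSE: NSGA-II additionally sorts candidates into successive fronts and uses crowding distance for selection, and the candidate names and objective values below are invented purely for illustration (hypothetical validation loss, energy per token, latency).

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical candidates: (validation loss, energy per token, latency).
# These numbers are made up and are not results from the paper.
candidates = {
    "accurate": (2.9, 1.0, 1.0),
    "energy_optimized": (3.0, 0.6, 1.0),
    "latency_optimized": (3.1, 1.0, 0.6),
    "dominated": (3.2, 1.1, 1.1),
}

front = pareto_front(candidates)
```

Each surviving candidate wins on at least one objective, which mirrors how the abstract describes separate accuracy-, energy-, and latency-oriented variants; the pairwise check is quadratic in the number of candidates, which is fine for a small illustration.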

2605.17652 2026-05-19 cs.CL

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

超越转录:借助音频的迭代同行编辑解锁高质量的对话语音总结

Kaavya Chaparala, Thomas Thebaud, Jesús Villalba López, Laureano Moro-Velazquez, Peter Viechnicki, Najim Dehak

Affiliations: Johns Hopkins University

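AI summary: This study compares the quality of audio-based and transcript-based summaries and examines how peer editing improves speech summaries. It finds that audio-based summaries are less informative and more compressed than transcript-based summaries, but that iterative peer-editing with audio closes this gap.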

Comments: Accepted in LREC 2026

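AI Chinese abstract (translated to English)

There are not enough benchmarks for the speech summarization task. Creating new benchmarks requires human annotation, because LLMs could embed systematic errors and biases into datasets. We test ten annotation workflows that vary the input modality (audio, transcript, or both) and the inclusion of editing (self-editing or peer-editing) to investigate possible quality tradeoffs when human annotators summarize audio. We compare human audio-based summaries with human transcript-based summaries to track how the different information modalities affect summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to determine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed, but iterative peer-editing with audio mitigates this difference, making audio-based summaries as informative as transcript summaries and LLM summaries. These findings validate iterative peer-editing among human annotators for creating benchmarks that combine lexical and prosodic information. This makes it possible to collect crucial datasets even when transcripts are unavailable.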

English abstract

There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in settings where transcripts are unavailable.