arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.11889 2026-06-11 cs.CV cs.AI cs.RO New submission

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

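Everett Richards

Affiliations: Everett Richards

AI summary: Studies how embedding drift relates to changes in a task-aligned hazard score for vision-language models in autonomous driving hazard detection, finds that different corruption types lead to different failure modes, and recommends that benchmarks include task-aligned stability metrics.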

Comments 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)


Abstract

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
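
The two quantities compared in this abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the hazard score is assumed to be the similarity to a hazard prompt minus the similarity to a safe prompt, embedding drift is assumed to be cosine distance, and the 2-D vectors are made up. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hazard_score(image_emb, hazard_text_emb, safe_text_emb):
    """Assumed score: similarity to a hazard prompt minus similarity to a safe prompt."""
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)

def embedding_drift(clean_emb, corrupted_emb):
    """Task-agnostic drift: cosine distance between clean and corrupted embeddings."""
    return 1.0 - cosine(clean_emb, corrupted_emb)

def margin_drift(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    """Task-aligned drift: change in hazard score under the corruption."""
    return (hazard_score(corrupted_emb, hazard_text_emb, safe_text_emb)
            - hazard_score(clean_emb, hazard_text_emb, safe_text_emb))

# Toy embeddings (invented for illustration only).
hazard_text = [1.0, 0.0]
safe_text = [0.0, 1.0]
clean = [0.6, 0.5]       # slightly closer to the hazard prompt
corrupted = [0.5, 0.6]   # slightly closer to the safe prompt

print("embedding drift:", embedding_drift(clean, corrupted))
print("margin drift:", margin_drift(clean, corrupted, hazard_text, safe_text))
```

In this toy case the embedding moves very little, yet the sign of the hazard score flips, which is the kind of decoupling the abstract describes.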

2606.11886 2026-06-11 cs.SD cs.OS New submission

Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation

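Bowen Zheng, Andrew H. Yang, Jiaqi Ruan, Jia He, Xinyue Li, Yuan-Hsin Chen, Ziyu Wang, Xiaosong Ma

Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes StreamMUSE, a system that performs frame-synchronous streaming inference in a client-server architecture, and uses a live music accompaniment task to validate real-time synchronization under different latency environments.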

Comments Accepted to RTAS 2026. 14 pages, 5 figures, 3 tables


Abstract

Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synthesis, also require precise alignment between generation and external signals, both in terms of generation content and timing. We refer to this problem as frame-synchronous streaming inference. To address it, we present StreamMUSE, an inference system that performs LM generation in response to an external signal stream within a client-server architecture. The client continuously sends high-frequency inference requests based on the most recent inputs and receives outputs synchronized to the external clock, while the server executes model inference. We demonstrate the framework through a live music accompaniment task, showing how real-time synchronization can be achieved across different deployment environments with varying round-trip latencies. We further model the relationship between system hyperparameters and round-trip latency, and evaluate how different environments affect optimal configurations to achieve real-time performance. Experimental results show a consistent correspondence between system real-time performance and music quality, demonstrating the effectiveness of the proposed framework. The project is open source. Relevant code and the latest updates are available at https://stream-muse-webpage.vercel.app/#audio-library.

2606.11884 2026-06-11 cs.CV cs.CR New submission

Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality

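Gregor Grote, Juan E. Tapia, Christian Rathgeb

Affiliations: da/sec - Biometrics and Internet Security Research Group, Hochschule Darmstadt

AI summary: Applies capture-related quality measures from the OFIQ standard to ID card images, proposes a preprocessing pipeline, and analyzes how these measures correlate with the performance of three presentation attack detection algorithms, indicating that quality assessment based on some OFIQ measures can significantly improve PAD performance.

Comments Presented at IWBF 2026 (14th International Workshop on Biometrics and Forensics)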


Abstract

This paper addresses the challenge of assessing image quality in ID cards in remote verification systems by applying capture-related quality measures from the Open Face Image Quality (OFIQ) standard to ID card images. Our preprocessing pipeline includes corner detection, perspective normalization, and comprehensive foreground masking to ensure accurate and unbiased quality measure computation. We evaluate the effectiveness of these measures by analyzing their correlation with the performance of three presentation attack detection (PAD) algorithms across four diverse ID card datasets, where two datasets contain bona fide, i.e. pristine, images and two contain printed mock ID cards. Our results suggest that quality assessment based on some OFIQ measures can significantly improve PAD performance.

2606.11880 2026-06-11 cs.CV New submission

SG2Loc: Sequential Visual Localization on 3D Scene Graphs

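Nicole Damblon, Olga Vysotska, Federico Tombari, Marc Pollefeys, Daniel Barath

Affiliations: ETH Zurich, Google, TU Munich, Microsoft

AI summary: Proposes a lightweight sequential visual localization method that represents the environment with a compact 3D scene graph and localizes efficiently via particle filtering and semantic matching, significantly reducing storage requirements.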

Comments The code will be available at https://github.com/DmblnNicole/sg2loc


Abstract

Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features, predicting object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity between the per-patch features of the input image and the object features from the scene graph determines the weight of a particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be available at https://github.com/DmblnNicole/sg2loc.
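
The particle weighting step described in this abstract follows the general shape of a particle filter update, which can be sketched as below. This is a generic illustration under stated assumptions, not the SG2Loc implementation: the similarity scores are given as plain non-negative numbers, the pose labels are placeholders, and `reweight` and `resample` are hypothetical helper names.

```python
import random

def normalise(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def reweight(weights, similarities):
    """Multiply each particle's weight by its similarity score, then normalise."""
    return normalise([w * s for w, s in zip(weights, similarities)])

def resample(particles, weights, rng):
    """Draw a new particle set with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))

# Three candidate camera poses (placeholders) with uniform prior weights.
particles = ["pose_a", "pose_b", "pose_c"]
weights = [1 / 3, 1 / 3, 1 / 3]

# Assumed per-particle similarity between image patch features and
# the object features each pose predicts should be visible.
similarities = [0.1, 0.6, 0.3]

weights = reweight(weights, similarities)
survivors = resample(particles, weights, random.Random(0))
print(weights, survivors)
```

Repeating the reweight and resample steps for each new image is what concentrates the particle set around poses that stay consistent over the sequence.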

2606.11875 2026-06-11 cs.CL cs.SD New submission

I Understand How You Feel: Enhancing Deeper Emotional Support Through Multilingual Emotional Validation in Dialogue System

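Zi Haur Pang, Yahui Fu, Koji Inoue, Tatsuya Kawahara

Affiliations: Graduate School of Informatics, Kyoto University

AI summary: Introduces emotional validation for dialogue systems, builds the multilingual corpus M-EDESConv and the test set M-TESC, designs MEGUMI (a Multilingual Emotion-aware Gated Unit for Mutual Integration) for timing detection, and evaluates how current LLMs perform at generating validating responses.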

Comments This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2026 (SIGDIAL 2026)


Abstract

Emotional validation - explicitly acknowledging that a user's feelings make sense - has proven therapeutic value but has received little computational attention. Emotional validation in dialogue systems can be decomposed into (i) validating response identification, (ii) validation timing detection, and (iii) validating response generation. To support research on all three subtasks, we release M-EDESConv, a 120k English-Japanese multilingual corpus created through hybrid manual and automatic annotation, and M-TESC, a multilingual spoken-dialogue test set. For timing detection, we propose MEGUMI, a Multilingual Emotion-aware Gated Unit for Mutual Integration, that fuses frozen XLM-RoBERTa semantics with language-specific emotion encoders via cross-modal attention and gated fusion. MEGUMI shows superior performance on both the M-EDESConv and M-TESC datasets, both objectively and subjectively. Finally, our EmoValidBench benchmarks of GPT-4.1 Nano and Llama-3.1 8B indicate that current LLMs generate contextually similar and diverse validating responses, but emotional understanding remains a major area for improvement. Project page: https://github.com/zihaurpang/Multilingual-Emotional-Validation

2606.11874 2026-06-11 cs.AI New submission

AutoMine Solution for AV2 2026 Scenario Mining Challenge

Songliang Cao, Jiele Zhao, Yuru Wang, Hao Li, Daqi Liu, Zehan Zhang, Fangzhen Li, Yu Wang, Yue Zhang, Bing Wang, Guang Chen, Hao Lu, Hangjun Ye

Affiliations: Xiaomi EV, Huazhong University of Science and Technology

AI summary: Proposes AutoMine, a self-refining scenario mining method based on LLMs and VLMs that combines semantics-preserving prompt augmentation, robust trajectory atomic functions paired with VLM-based functions, and refinement through execution feedback, achieving leading performance in the CVPR 2026 challenge.

Comments CVPR 2026 Scenario Mining Challenge (Temporal Track Winners)


Abstract

With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-world visual cues, and refines generated code through execution feedback from real logs. In the Argoverse 2 Scenario Mining Competition at CVPR 2026, AutoMine achieves a HOTA-Temporal score of 36.38 and a Timestamp BA score of 77.21.

2606.11868 2026-06-11 cs.LG q-bio.QM New submission

MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

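Dongxin Lyu, Jingbo Zhou, Hongxin Xiang, Yuqiang Li, Jun Xia

Affiliations: Westlake University, Hunan University, Shanghai Artificial Intelligence Laboratory, HKUST-GZ & HKUST

AI summary: Targets the tendency of existing Transformer models for de novo peptide sequencing to over-rely on generated-sequence priors while neglecting spectral evidence, and proposes MemNovo, a training-free plug-and-play mechanism that builds a persistent spectral memory bank and injects spectral features at the decoding stage through an ultra-conservative residual connection, significantly improving amino acid and peptide precision.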

Comments Code: https://github.com/AIMS-Lab-HKUSTGZ/MemNovo

Journal ref
Knowledge Discovery and Data Mining (KDD), 2026

Abstract

De novo peptide sequencing from tandem mass spectrometry is pivotal in proteomics, enabling identification of novel peptides without reference databases. While recent Transformer-based encoder-decoder models have achieved remarkable performance, we uncover a critical pathology in their inference dynamics. Through comprehensive feature scaling experiments, we demonstrate that existing auto-regressive peptide decoders tend to over-rely on generated-sequence priors while progressively under-utilizing fine-grained physical evidence from the input mass spectrum. This phenomenon leads to suboptimal results, where generated peptide sequences are biologically plausible yet not faithful to the input spectrum. To rectify this, we propose MemNovo, a training-free and plug-and-play mechanism that re-balances peptide and spectral contributions at inference time. MemNovo alleviates the information bottleneck by establishing a persistent spectral memory bank and injecting retrieved features directly into the final decoding stage via an ultra-conservative residual connection. Theoretical analysis confirms that this mechanism restores the mutual information between the decoder state and the raw spectrum. Extensive experiments on the Nine Species benchmark with two representative baselines, Casanovo and InstaNovo, demonstrate that MemNovo consistently improves both amino acid precision and peptide precision, achieving up to 39.1% relative improvement in peptide precision for Casanovo and up to 3.9% for InstaNovo, with negligible computational overhead.
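
The idea of retrieving from a memory bank and injecting the result through a conservative residual connection can be sketched roughly as follows. This is an illustration only, not MemNovo's code: dot-product attention is assumed for retrieval, "ultra-conservative" is assumed to mean a small fixed scale `alpha`, and all names and vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query, memory):
    """Attention-weighted average of memory rows, scored by dot product with the query."""
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    attn = softmax(scores)
    dim = len(memory[0])
    return [sum(a * row[i] for a, row in zip(attn, memory)) for i in range(dim)]

def inject(decoder_state, memory, alpha=0.05):
    """Residual update: state + alpha * retrieved features (alpha is an assumed small scale)."""
    retrieved = retrieve(decoder_state, memory)
    return [h + alpha * r for h, r in zip(decoder_state, retrieved)]

# Toy spectral memory bank with two stored feature vectors.
memory_bank = [[1.0, 0.0], [0.0, 1.0]]
state = [0.0, 0.0]

updated = inject(state, memory_bank)            # small nudge toward the memory
unchanged = inject(state, memory_bank, alpha=0.0)  # alpha = 0 leaves the state as it was
print(updated, unchanged)
```

Because no weights are trained here, the only knob is `alpha`, which matches the training-free framing in the abstract at a very coarse level.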

2606.11860 2026-06-11 cs.LG New submission

RePAIR: Predictive Self-Supervised Representation Learning in Chess

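Christoph Koller, Johannes Fürnkranz, Timo Bertram

Affiliations: University of Zurich

AI summary: Proposes the RePAIR architecture, which combines MAE, JEPA, and BERT and learns compact representations of chess position sequences through masking and iterative refinement, allowing it to reason about piece movements without reinforcement learning.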

Comments Accepted for oral presentation at IEEE Conference on Games 2026


Abstract

In this paper, we introduce Representation Prediction via Autoencoding using Iterative Refinement (RePAIR) - a novel self-supervised representation learning architecture that synthesizes Masked Autoencoders (MAE), Joint Embedding Predictive Architectures (JEPA), and Bidirectional Encoder Representations from Transformers (BERT). We demonstrate how it can be used to encode objects in sequential data like consecutive chess positions into compact yet meaningful representations. The basic principle of the architecture is to mask large portions of a sequence of latent states, similar to BERT and MAE. Then, we apply a lightweight Predictor to the latent representations that repairs gaps in the sequence in a lower-dimensional embedding space akin to JEPA. Our experiments in the domain of chess show that the Encoder refines the board representations such that meaningful chess concepts emerge clustered in the latent space. Furthermore, reconstructions of the masked board states show that the model is able to reason about the piece movements without relying on costly reinforcement learning methods. Lastly, we find that the resulting representation space allows for quick and intuitive dissections of chess games by observing the game path trajectories in this semantically rich space.

2606.11854 2026-06-11 cs.LG cs.AI cs.CL New submission

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

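Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

Affiliations: University of Stavanger, NORCE Research

AI summary: Proposes ART, which injects information into a frozen multimodal LLM by optimizing its raw visual input, achieving soft-prompt-style fine-tuning without modifying the computational graph and reaching accuracy comparable to LoRA on mathematics and tool-use benchmarks.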


Abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
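
The core mechanism, updating only an input pixel array while the model stays frozen, can be shown with a toy example. This is a minimal sketch under strong assumptions: the "model" is a fixed linear readout rather than an MLLM, the objective is a squared error chosen for simplicity, pixel range clipping is ignored, and every name is hypothetical.

```python
def frozen_model(pixels, weights):
    """Stand-in for a frozen model: a fixed linear readout of the pixel array."""
    return sum(w * p for w, p in zip(weights, pixels))

def optimise_pixels(pixels, weights, target, lr=0.1, steps=200):
    """Gradient descent on the pixels only; the weights are never updated."""
    pixels = list(pixels)
    for _ in range(steps):
        error = frozen_model(pixels, weights) - target
        # Gradient of (w . p - target)^2 with respect to p_i is 2 * error * w_i.
        pixels = [p - lr * 2.0 * error * w for p, w in zip(pixels, weights)]
    return pixels

weights = [0.5, -0.25, 1.0]   # frozen parameters
start = [0.0, 0.0, 0.0]       # initial "image"
tuned = optimise_pixels(start, weights, target=1.0)

print(tuned, frozen_model(tuned, weights))
```

The tuned pixel array is then an ordinary input, which is why an approach of this shape needs no change to the serving engine's computational graph.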

2606.11853 2026-06-11 cs.CV cs.AI New submission

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

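Zhirui Chen, Ziwei Chen, Ling Shao

Affiliations: Zhihui Chen, Ziwei Chen, Ling Shao

AI summary: Proposes the TASM framework, which uses task-vector guided compression, semantics-aware token merging, and a hierarchical memory structure to address the semantic disruption and static memories caused by memory compression in in-context learning for multi-modal LLMs.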

Comments Accepted to ICML 2026


Abstract

Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.
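
Token merging via bipartite matching, as opposed to pruning, can be sketched in a generic form. This is not TASM's algorithm: the alternating split into two sets, the cosine similarity, and plain averaging are all assumptions made for illustration, and the task-vector guidance from the abstract is not modelled. Tokens are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bipartite_merge(tokens, r):
    """Merge the r most similar (A, B) pairs by averaging; other tokens are kept unchanged."""
    set_a, set_b = tokens[0::2], tokens[1::2]
    # For each token in A, find its most similar partner in B.
    matches = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = sims.index(max(sims))
        matches.append((sims[j], i, j))
    matches.sort(reverse=True)

    merged_into = {}   # index in B -> list of A tokens merged into it
    absorbed = set()   # indices in A that were merged away
    for _, i, j in matches[:r]:
        merged_into.setdefault(j, []).append(set_a[i])
        absorbed.add(i)

    kept_a = [a for i, a in enumerate(set_a) if i not in absorbed]
    new_b = []
    for j, b in enumerate(set_b):
        group = [b] + merged_into.get(j, [])
        new_b.append([sum(col) / len(group) for col in zip(*group)])
    return kept_a + new_b

tokens = [[1, 0], [1, 0], [0, 1], [1, 1]]
merged = bipartite_merge(tokens, r=1)
print(merged)
```

Here four tokens become three: the two identical tokens are averaged into one, while the dissimilar ones survive, so no information is simply discarded.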

2606.11851 2026-06-11 cs.AI New submission

StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

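Jiayao Chen, Shi Liu, Linyi Yang

Affiliations: Southern University of Science and Technology

AI summary: Proposes the StatefulDiscovery framework, which externalizes investigation state to coordinate frontier selection, evidence acquisition, and claim adjudication, producing more high-value, well-supported claims across 40 real-data tasks.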


Abstract

Open-ended scientific discovery asks agents to move beyond executing analyses for predefined questions. Across multiple rounds of exploration, a discovery agent must decide which phenomena warrant investigation while avoiding overinterpretation, where emerging claims exceed the evidential scope of the analyses supporting them. This creates an evidence-calibration problem: the exploration trajectory must be coupled with claim status so that evidence can guide both what to investigate next and what can be claimed. We introduce StatefulDiscovery, a discovery framework that externalizes investigation state and uses it to coordinate frontier selection, evidence acquisition, and claim adjudication. We evaluate StatefulDiscovery across 40 real-data discovery tasks. Compared with several baselines, StatefulDiscovery produces more claims overall judged to be both well-supported and high-value. Ablations indicate that structured hypotheses, local adjudication, and frontier control contribute to performance. Together, these results suggest that explicit discovery state can couple exploration with evidence-calibrated claim formation.

2606.11846 2026-06-11 cs.CV New submission

SheafStain: Sheaf-Theoretic Schrödinger Bridge for Spatially and Biologically Coherent Virtual Staining

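Hyeongyeol Lim, Hongjun Yoon, Eunjin Jang, Daeky Jeong, Won June Cho, Hwamin Lee

Affiliations: Department of Medical Informatics, College of Medicine, Korea University; DEEPNOID Inc.

AI summary: Addresses the spatial discontinuity and context contamination caused by patch-wise inference in virtual staining by proposing SheafStain, which reinterprets vision foundation model features as sheaf-like sections and combines them with a Schrödinger Bridge framework for spatially and biologically coherent virtual staining, reporting better results than six existing methods on HER2 and other markers.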

Comments 32 pages


Abstract

Current virtual staining approaches offer the potential for time- and cost-efficient biomarker quantification in cancer diagnostics and prognostics. However, patch-wise inference for gigapixel whole slide images (WSIs) fails to maintain spatial continuity, yielding artifacts that cause catastrophic mismatches with ground-truth images. Although pathology Vision Foundation Models (VFMs) offer rich representations, their self-attention causes varying global contexts to produce inconsistent embeddings for the same physical region. We formalize and validate this "context contamination" as a sheaf-theoretic problem where these embeddings form a presheaf that violates the gluing axiom. To address this, we propose SheafStain, a new approach that reinterprets VFM features as sheaf-like sections for spatially and biologically coherent virtual staining. Specifically, SheafStain integrates class and patch tokens into a Schrödinger Bridge framework as sheaf-like sections. While the class token anchors biological consistency, patch tokens form a per-position spatial map. A backbone co-pretrained on Hematoxylin & Eosin (H&E) and Immunohistochemistry (IHC) yields non-degenerate cross-stain stalks, so a single VFM feature space supervises both input conditioning and output stain alignment. Departing from prior work that evaluates on isolated 256 × 256 patches and either random-crops or resizes the 1024 × 1024 ground truth, we translate at 256 × 256 and evaluate on the stitched 1024 × 1024 outputs across HER2, ER, PR, and Ki-67. SheafStain demonstrates promising results against six prior methods while mitigating patch-boundary stitching artifacts. Code will soon be released.
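
For readers unfamiliar with the gluing axiom mentioned in this abstract, its standard statement for a sheaf is given below. Reading the open sets as overlapping image regions or context windows and the sections as embeddings is only an interpretation of the abstract, not the paper's own formalism.

```latex
% Gluing axiom for a sheaf \mathcal{F} on a topological space X (standard statement).
% \{U_i\} is an open cover of an open set U \subseteq X.
\[
  s_i \in \mathcal{F}(U_i), \quad
  s_i|_{U_i \cap U_j} = s_j|_{U_i \cap U_j} \;\; \forall\, i, j
  \;\Longrightarrow\;
  \exists\, s \in \mathcal{F}(U) \ \text{such that}\ s|_{U_i} = s_i \;\; \forall\, i .
\]
```

Under that reading, embeddings that disagree on the overlap of two regions cannot satisfy the hypothesis, so no consistent global section can be glued from them.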

2606.11844 2026-06-11 cs.LG New submission

TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

Dayananda Herurkar, Federico Raue, Joachim Folz, Jörn Hees, Andreas Dengel

Affiliations: German Research Center for Artificial Intelligence (DFKI), RPTU Kaiserslautern-Landau, Hochschule Bonn-Rhein-Sieg (H-BRS)

AI summary: Proposes TaskFusion, which uses the AGF model, TaskFusion augmentation, and outlier exposure to address changing feature spaces, distribution shift, and class imbalance when learning continually from heterogeneous tabular data, significantly improving continual anomaly detection on 21 datasets.

Comments 22 pages


Abstract

Continual anomaly detection in tabular data is challenging and remains largely underexplored, particularly in settings with heterogeneous feature schemas, distribution shifts, and severe class imbalance. In many real-world applications, data arrive sequentially from diverse domains, rendering conventional continual learning methods ineffective due to their reliance on a fixed input space. We propose a continual learning (CL) method that can overcome these challenges and continually learn from different tasks. Our method consists of three main parts: our AGF model, TaskFusion augmentation, and outlier exposure. The AGF model maps task-specific features into a shared space, then aligns distributions to reduce representation drift, and learns anomaly decision boundaries in the aligned space. To improve stability, we introduce TaskFusion augmentation, combining boundary-aware interpolation within tasks to refine the model's anomaly boundaries and cross-task mixing to transfer anomaly structure across datasets. To handle class imbalance and memory constraints, we employ tabular dataset distillation to store compact synthetic replay samples, which are jointly used with augmented data in an outlier exposure objective for robust anomaly detection. We evaluate the approach on 21 heterogeneous datasets across multiple domains. Results show that our approach substantially improves continual anomaly detection performance over sequential fine-tuning and other CL baselines while reducing catastrophic forgetting and maintaining stable detection across heterogeneous datasets.
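
Interpolation-based augmentation of the kind this abstract mentions can be illustrated with a plain convex combination of two samples that already live in the same shared feature space. This is a generic mixing sketch, not the paper's TaskFusion augmentation: how the mixing coefficient is chosen and what "boundary-aware" involves are not specified here, and `mix` is a hypothetical helper.

```python
def mix(sample_a, sample_b, lam):
    """Convex combination lam * a + (1 - lam) * b of two shared-space feature vectors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(sample_a, sample_b)]

# Two samples assumed to be already mapped into the shared space,
# for example one from each of two different tasks.
from_task_1 = [0.0, 0.0]
from_task_2 = [1.0, 2.0]

synthetic = mix(from_task_1, from_task_2, lam=0.25)
print(synthetic)
```

Mixing only makes sense after the features share a space, which is why the abstract places the mapping and alignment steps before the augmentation.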

2606.11841 2026-06-11 cs.CV New submission

Scene-Adaptive Nonlinear Tone Curves for Pseudo Ground-Truth Generation in Low-Light 3D Gaussian Splatting

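Mingzhe Lyu, Jinqiang Cui, Hong Zhang

Affiliations: Southern University of Science and Technology, Pengcheng Laboratory

AI summary: Addresses pseudo ground-truth generation for low-light 3D reconstruction with a scene-adaptive nonlinear tone-curve framework that replaces linear gain with two curves (ASE and AP3), improving PSNR by up to 4.34 dB across three benchmarks.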


Abstract

Low-light novel view synthesis is challenging because dark multi-view images contain noise, weak structural detail, and compressed dynamic range. Recent 3D Gaussian Splatting (3DGS) methods address these challenges by generating pseudo ground-truth (pseudo-GT) images as supervision targets when paired normal-light references are unavailable. Existing pseudo-GT methods apply a uniform linear gain to all pixels, which clips bright regions while providing insufficient enhancement in dark regions, limiting reconstruction quality. We observe that nonlinear tone mappings, long established in 2D low-light enhancement, have not been explored for pseudo-GT generation in 3D reconstruction. Accordingly, we propose a scene-adaptive nonlinear tone-curve framework that replaces linear pseudo-GT with nonlinear alternatives. The framework introduces percentile-based normalisation for scene-agnostic curve application, a scene-adaptive offset for automatic black-level adjustment, and two complementary curves: Adaptive SoftExp (ASE), a bounded exponential curve, and Adaptive Poly3 (AP3), a data-driven cubic polynomial. The module changes only the pseudo-GT computation and leaves the 3DGS backbone unchanged. Experiments on three benchmarks covering 21 scenes show that both curves consistently outperform the linear baseline with PSNR improvements up to +4.34 dB on LOM and +3.25 dB on RealX3D. Both curves achieve similar performance despite their different mathematical forms, suggesting the improvement is curve-agnostic. Code is available at https://github.com/lvmingzhe/adaptiveToneCurve
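
The contrast between a uniform linear gain and a bounded nonlinear curve can be made concrete with a small sketch. The curve below is an illustrative bounded exponential, not the paper's ASE or AP3 definition; the nearest-rank percentile, the parameter `k`, and all function names are assumptions, and the percentile value is assumed to be positive.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def normalise(pixels, q=99.0):
    """Scale so the q-th percentile maps to 1.0, then clamp to [0, 1]."""
    scale = percentile(pixels, q)
    return [min(p / scale, 1.0) for p in pixels]

def linear_gain(x, gain):
    """Uniform linear gain with clipping, the baseline the abstract criticises."""
    return min(gain * x, 1.0)

def bounded_exp(x, k):
    """Illustrative bounded exponential curve mapping [0, 1] onto [0, 1]."""
    return (1.0 - math.exp(-k * x)) / (1.0 - math.exp(-k))

dark_image = [0.0, 0.1, 0.2, 0.4]           # toy intensities
norm = normalise(dark_image, q=100.0)       # [0.0, 0.25, 0.5, 1.0]

for x in norm:
    print(x, linear_gain(x, 4.0), bounded_exp(x, 4.0))
```

With a gain of 4, the linear mapping sends every normalised value of 0.25 or more to 1.0, erasing the differences between them, while the bounded curve lifts dark values and still keeps brighter values distinct.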

2606.11838 2026-06-11 cs.CV New submission

Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding

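Hyomin Kim, Junghye Kim, Joanie Hayoun Chung, Yoonjin Oh, Kyungjae Lee, Sungbin Lim, Sungwoong Kim

Affiliations: Korea University

AI summary: Proposes SG-PVR, a video reward model that uses plan-and-verify reasoning and spatio-temporal scene graphs to systematically verify every condition in the prompt, achieving fine-grained semantic alignment and improving compositional alignment in text-to-video generation.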


Abstract

Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses these limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, ensuring every requirement is checked. The spatio-temporal scene graph, encoding entities, attributes, and temporally-grounded relations, is extracted from the video and maintained as a persistent structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
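
The plan-and-verify idea, decomposing a prompt into atomic claims and checking each against a scene graph, can be sketched with hand-written data. This is only an illustration: in SG-PVR both the graph and the plan come from a model, whereas here they are typed in by hand, and using the fraction of supported claims as the reward is an assumption. All names are hypothetical.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str
    start: int   # first frame where the relation holds
    end: int     # last frame where the relation holds

def holds(graph, subject, predicate, obj):
    """Return the matching relations, which double as explicit evidence for the claim."""
    return [r for r in graph
            if (r.subject, r.predicate, r.obj) == (subject, predicate, obj)]

def verify_plan(graph, claims):
    """Check every atomic claim; the reward is the fraction of claims that are supported."""
    results = {claim: bool(holds(graph, *claim)) for claim in claims}
    reward = sum(results.values()) / len(results)
    return results, reward

# Hand-written spatio-temporal scene graph for a short clip.
graph = [
    Relation("dog", "on", "sofa", 0, 40),
    Relation("dog", "jumps_off", "sofa", 41, 55),
]

# Atomic claims that a prompt might decompose into.
claims = [
    ("dog", "on", "sofa"),
    ("dog", "jumps_off", "sofa"),
    ("cat", "on", "sofa"),
]

results, reward = verify_plan(graph, claims)
print(results, reward)
```

Because every claim is checked and each verdict points at concrete graph entries, a missing condition such as the cat shows up as an explicit failed check rather than being lost in free-form reasoning.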

2606.11837 2026-06-11 cs.CV cs.AI New submission

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li

Affiliations * Beijing University of Posts and Telecommunications

AI summary Proposes LASA, which aggregates Vision Transformer attention maps across layers to perform open-vocabulary semantic segmentation of scene sketches under weak supervision, markedly improving segmentation accuracy and spatial coherence.

English abstract

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74, respectively, over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
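The idea of combining attention maps from several layers can be sketched as a plain weighted average. This is an illustrative assumption, not the LASA accumulation rule (which the abstract does not specify); the function name `accumulate_attention` and the uniform default weights are hypothetical.

```python
def accumulate_attention(layer_maps, weights=None):
    """Weighted mean of per-layer attention maps.

    layer_maps: list of 2D maps (lists of rows), all with the same shape.
    weights:    optional per-layer weights; uniform if omitted (assumption).
    """
    if not layer_maps:
        raise ValueError("need at least one layer map")
    n_layers = len(layer_maps)
    if weights is None:
        weights = [1.0] * n_layers
    total = float(sum(weights))
    rows, cols = len(layer_maps[0]), len(layer_maps[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for w, amap in zip(weights, layer_maps):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += (w / total) * amap[i][j]
    return out

if __name__ == "__main__":
    shallow = [[1.0, 0.0], [0.0, 0.0]]  # toy "global layout" map
    deep = [[0.0, 1.0], [0.0, 0.0]]     # toy "local stroke" map
    print(accumulate_attention([shallow, deep]))
```

A location that is highlighted by only one layer keeps a reduced weight in the aggregate, so no single layer decides the structural prior on its own.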

2606.11833 2026-06-11 cs.LG q-bio.NC New submission

Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

Sam Gijsen, Michał Łukomski, Marc-André Schulz, Kerstin Ritter

Affiliations * Hertie Institute for AI in Brain Health, University of Tübingen; Tübingen AI Center, University of Tübingen; Charité – Universitätsmedizin Berlin, Department of Psychiatry and Psychotherapy; German Center for Mental Health (DZPG), partner site Tübingen

AI summary Proposes a per-timestep conditioned diffusion transformer that injects compositional language and optional spatial priors to generate fMRI brain dynamics zero-shot for unseen cognitive tasks, supporting counterfactual neuroscience.

Comments Code and pretrained models available at https://github.com/SamGijsen/pinc-flows

English abstract

Flow matching and diffusion models enable conditional generation across domains ranging from images to proteins, with recent extensions to out-of-distribution contexts. Yet generative models of neural time series have largely remained restricted to categorical conditioning, precluding compositional and zero-shot generalization. In this work, we propose a per-timestep conditioned diffusion transformer for generating realistic fMRI brain dynamics during unseen cognitive tasks by injecting both compositional language and optional spatial priors in-context. Such zero-shot generation could enable counterfactual neuroscience by supporting in-silico design and evaluation of novel cognitive experiments before empirical validation. Leveraging this model, we evaluate across hundreds of held-out task conditions and characterize predictive performance in relation to the training manifold. From language alone, the model recovers region-specific recruitment across tasks and held-out spatial activation patterns. Spatial priors, when available, complement the text pathway by anchoring generation in regions of task space where language alone degrades, while retaining the compositional structure needed for counterfactual task specification. To our knowledge this is the first generative model of whole-cortex fMRI dynamics for unseen cognitive tasks, advancing counterfactual neuroscience and data-driven experimental design.

2606.11831 2026-06-11 cs.LG cs.AI New submission

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

Qi Shao, Hao Guo, Jiawen Chen, Duxin Chen, Wenwu Yu

Affiliations * School of Mathematics, Southeast University

AI summary Proposes Diff-prior, a diffusion-parameterized adaptive prior that calibrates edge posteriors through learnable denoising-style calibration, improving the reliability of structure discovery in neural relational inference methods.

Comments 15 pages, 3 figures, Accepted by KDD 2026

English abstract

Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational inference over discrete latent edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, which are typically close to uniform, treat edges as independent entities. This systematic mismatch with real-world systems yields diffuse, indecisive edge posteriors, limiting the reliability of structure discovery. To address this, we propose Diff-prior, a diffusion-parameterized adaptive prior used to calibrate the latent graph distribution rather than to generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration, trainable with a diffusion model, that organizes scattered, uncertain edge posteriors into a more reliable overall structure. Diff-prior learns an adaptive structural prior that performs structured calibration of the edge posteriors during inference, guiding them towards a distribution closer to the underlying structure. Diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validate our framework: Diff-prior improves structure-inference performance and produces more decisive edge posteriors across multiple NRI-family architectures. The code is available at https://github.com/Hardy158118/Diffprior.

2606.11830 2026-06-11 cs.AI New submission

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Affiliations * AIPOCH PTE. LTD.

AI summary Using a non-small cell lung cancer immunotherapy biomarker task, evaluates whether skill-augmented AI agents produce higher-quality transcriptomic research-analysis outputs than native AI; the quality signal is directional but does not reach statistical significance.

English abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.
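For readers unfamiliar with the statistics reported above, the following is a minimal sketch of a percentile-bootstrap confidence interval for a difference in means and of Welch's t statistic, using only the Python standard library. The ratings in the example are made up for illustration and are not the study's data; the resampling scheme, number of resamples, and function names are assumptions, since the abstract does not describe the exact procedure.

```python
import random
import statistics

def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]  # resample group a with replacement
        rb = [rng.choice(b) for _ in b]  # resample group b with replacement
        diffs.append(statistics.fmean(ra) - statistics.fmean(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(a) - statistics.fmean(b), lo, hi

def welch_t(a, b):
    """Welch's t statistic (unequal variances); the p-value is not computed here."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se

if __name__ == "__main__":
    skill = [6, 5, 6, 5, 6, 5]   # hypothetical ratings, not from the paper
    native = [5, 5, 5, 5, 5, 5]  # hypothetical ratings, not from the paper
    print(bootstrap_mean_diff_ci(skill, native))
    print(welch_t(skill, native))
```

An interval that includes zero, as in the abstract's reported results, is what makes the observed difference directional rather than confirmatory.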

2606.11828 2026-06-11 cs.SD cs.AI cs.CR cs.MM New submission

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

Haiyun Li, Shuhai Peng, Zhisheng Zhang, Jingran Xie, Xiaofeng Xie, Hanyang Peng, Zhiyong Wu

Affiliations * Shenzhen International Graduate School, Tsinghua University; Shenzhen Key Laboratory of Intelligent Media and Content Understanding; Tencent AI Lab

AI summary Proposes a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, raising watermark energy while preserving imperceptibility and strengthening robustness against speech reconstruction models.

Comments Accepted by ICME2026

English abstract

Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.

2606.11826 2026-06-11 cs.RO New submission

Modular Anthropomorphic Hand Design via Multi-Parameter Finger Benchmarking and Selection

Yu Zhang, Huijiang Wang, Josie Hughes

Affiliations * The CREATE Lab, Institute of Mechanical Engineering, Swiss Federal Institute of Technology in Lausanne (EPFL)

AI summary Proposes a modular anthropomorphic hand design framework that optimizes finger designs through multi-parameter benchmarking to improve whole-hand performance, validated on multi-object grasping and light bulb screwing tasks.

Comments 14 pages, 13 figures. Submitted to an IEEE journal for possible publication

English abstract

Designing anthropomorphic dexterous robotic hands remains challenging because the design space spans morphology, actuation, and sensing properties, and performance metrics include both task-dependent and task-agnostic measures. Existing optimization methods are often unstructured or consider only a single performance metric, limiting systematic comparison and targeted refinement. While the design considerations of the entire hand are significant, the individual finger properties play a key role in dexterity. By developing a robotic hand platform where fingers can be modularly integrated into a full teleoperated hand, we propose that optimizing the fingers can significantly improve overall hand performance. This approach enables rapid screening of different finger-level prototypes through a number of quantitative benchmarks before their integration into the hand for task-level validation. Candidate finger designs (incorporating variations in joint, bone, skin, and sensor placement) are assessed using both mechanism-oriented and task-relevant metrics, which establish a quantitative link between component design and full hand embodiment. The framework is validated through the development of an anthropomorphic robotic hand with optimized fingers, demonstrating how these fingers enable performance improvements across tasks, including multi-object grasping and light bulb screwing.

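2606.11818 2026-06-11 cs.RO New submission

Human-Guided Co-Manipulation of Carbon Fiber Plies

Rami Ojanen, James Fant-Male, Roel Pieters

Affiliations * Automation Technology and Mechanical Engineering, Tampere University

AI summary To address the difficulty of automating the handling of flexible materials, proposes a multimodal approach combining speech commands, vision-based wrist tracking, and force/compliant control for efficient human-robot co-manipulation of carbon fiber plies.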

Comments Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

English abstract

The handling of flexible materials is a difficult task to fully automate due to the challenges caused by the deformability of these types of objects. Meanwhile, a fully manual process can be ergonomically challenging, tedious and inefficient. Thus, human-robot collaboration (HRC) and cooperative manipulation (co-manipulation) have received increasing interest in this field as they enable human involvement when needed while also improving productivity. To enable efficient co-manipulation and interaction between the human operator and the robot, different modalities and control methods are required. In this paper, we present and examine different control methods for co-manipulation of carbon fiber plies, evaluating the pros and cons of each method in a controlled setting. We propose that a multimodal combination of speech commands, wrist-tracking through vision, and force with compliant control would provide the best solution for complete and intuitive control of the task.

2606.11816 2026-06-11 cs.CL cs.AI New submission

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

Yizhou Chi, Eric Chamoun, Zifeng Ding, Andreas Vlachos

Affiliations * Department of Computer Science and Technology, University of Cambridge

AI summary Proposes the WorldReasoner framework, which evaluates the event-forecasting ability of language-model agents along three dimensions (temporally valid retrieval, evidence quality, and causal-graph reasoning) and finds that temporally valid retrieval is the strongest driver of outcome accuracy.

English abstract

Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
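Scoring a submitted probability against a resolved outcome is commonly done with the Brier score; the sketch below shows that calculation. Whether WorldReasoner uses the Brier score is an assumption, since the abstract does not name its outcome metric, and the function name is hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and a constant 0.5 forecast
    always scores 0.25.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

if __name__ == "__main__":
    # Hypothetical forecasts for three resolved yes/no questions.
    print(brier_score([0.9, 0.2, 0.6], [1, 0, 0]))
```

A forecaster can cite the right evidence and still score poorly here if the stated probabilities are over- or under-confident, which is the calibration gap the abstract describes.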

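2606.11806 2026-06-11 cs.CL New submission

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

Lin Sun, Heming Zhang, Xiangzheng Zhang

Affiliations * Qiyuan Tech

AI summary Studies the quality-cost trade-off of injecting external experience into production LLM systems, finding that selective retrieval outperforms global injection, that retrieval quality matters more than increasing Top-K, and that cost-benefit varies with task output length.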

English abstract

Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-K, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.

2606.11805 2026-06-11 cs.CV cs.AI New submission

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

Zixiong Hao, Zhencun Jiang

Affiliations * Technical University of Munich; Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems

AI summary Proposes the TextHOI-3D framework, which links text-conditioned generation and geometric recovery through a discrete multi-view representation to generate text-driven 3D hand-object meshes, substantially reducing object Chamfer distance and penetration volume.

Comments 11 pages, 8 figures, 3 tables

English abstract

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object Chamfer distance (CD) from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm³ to 0.2193 cm³ compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
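The object CD figure above refers to Chamfer distance between point sets. A minimal sketch of one common definition follows: the mean nearest-neighbour Euclidean distance in each direction, summed. Conventions differ (squared distances, averaging instead of summing the two directions), and the abstract does not say which one the paper uses, so the exact variant here is an assumption.

```python
import math

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Sums the mean nearest-neighbour distance from p to q and from q to p.
    Brute force, O(len(p) * len(q)); fine for small illustrative inputs.
    """
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return one_way(p, q) + one_way(q, p)

if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
    print(chamfer_distance(predicted, reference))
```

Because each point is matched only to its nearest neighbour in the other set, the metric measures surface proximity without requiring point correspondences between the predicted and reference meshes.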

2606.11804 2026-06-11 cs.AI cs.CR cs.LG New submission

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

Yuefang Lian, Longkun Guo, Zhongrui Zhao, Zhigang Lu, Yanan Cai, Shuchao Pang, Dachuan Xu, Jason Xue

Affiliations * Nankai University; James Cook University; Western Sydney University; Beijing University of Technology; Fuzhou University; Nanjing University of Science and Technology; CSIRO's Data 61; The University of Adelaide

AI summary Studies adversarial attacks on continuous data summarization under similarity-level perturbations via DR-submodular optimization and proposes approximation algorithms for multi-target attack generation and robust defense; experiments show that the attack is effective and that the defense improves the robustness-mitigation trade-off.

Comments Submitted to IEEE Transactions on Information Forensics and Security (IEEE TIFS)

English abstract

Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data summarization determines which information is retained and passed to subsequent learning or decision modules. Therefore, adversarial perturbations to the summarization process can compromise trustworthy AI in an upstream manner: they may alter the selected summary, reduce its representativeness, and further degrade the utility of subsequent learning tasks. In this paper, we study adversarial attacks on continuous data summarization under similarity-level perturbations through DR-submodular optimization. We show that a class of multi-resolution image summarization objectives can be formulated as multilinear extensions of non-negative submodular set functions and satisfy DR-submodularity with m-weak monotonicity. We then formulate multi-target attack generation as a min-max problem, where one admissible perturbation of the similarity structure is optimized to degrade multiple target summarization models. To mitigate such perturbations, we formulate robust defense against mixed attack types as a regularized max-min problem. For both problems, we develop approximation algorithms with theoretical guarantees. Experiments on real-data and controlled clustered benchmarks show that the proposed attack is effective in representative low-to-moderate budget regimes and can induce downstream task-performance loss. The proposed defense improves the robustness-mitigation trade-off in structured settings, while also revealing the parameter sensitivity of robust protection on real data.
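For reference, the two standard notions the abstract relies on can be written out as below. These are the textbook definitions of the multilinear extension and of DR-submodularity for a differentiable function; the paper's specific objective, its notion of m-weak monotonicity, and its min-max formulation are not given in the abstract and are not reproduced here. The symbols V, f, F, x, and y are generic notation chosen for this sketch.

```latex
% Multilinear extension of a set function f : 2^V \to \mathbb{R}
F(x) \;=\; \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j),
\qquad x \in [0,1]^{V}.

% DR-submodularity (diminishing returns) of a differentiable F on [0,1]^V
x \le y \ \text{(coordinate-wise)} \;\Longrightarrow\; \nabla F(x) \ \ge\ \nabla F(y).
```

The multilinear extension of a submodular set function is DR-submodular, which is why continuous relaxations of summarization objectives can be analysed with DR-submodular optimization tools.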

2606.11797 2026-06-11 cs.LG New submission

Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

Felix Störck, Fabian Hinder, Barbara Hammer

Affiliations * CITEC, Faculty of Technology, Bielefeld University

AI summary Inspired by forgetting behavior in rodents, proposes Space-sampled Value Decay as an explicit forgetting mechanism that helps deep reinforcement learning cope with environment drift, and examines its benefits and limitations on DQN and SAC.

Comments Accepted at The 2nd Workshop on Epistemic Intelligence in Machine Learning, EIML@ICML 2026, (non-archival)

English abstract

Studies on rodents such as mice have shown that they can adapt their behavior to changing parameters ("drift") of the environment even when no information about the change is provided (uncertainty), a behavior that can be modeled by forgetting mechanisms. Non-stationary Reinforcement Learning (NSRL) adapts state-of-the-art RL methods to changing environments; these methods, however, usually require (partially) perfect information about the drift, such as "task IDs" or "context". To mitigate the effects of drift, this work develops Space-sampled Value Decay, a simple yet effective explicit forgetting mechanism for value-based deep RL architectures. In particular, we demonstrate and discuss positive effects, but also limitations, in the returns achieved by modified Deep Q-Networks (DQN) and Soft Actor-Critic (SAC) when evaluated on non-stationary environments.

2606.11794 2026-06-11 cs.LG cs.AI New submission

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

Affiliations * Department of Neuroradiology, LMU University Hospital, Ludwig Maximilian University of Munich; Department of Psychiatry and Psychotherapy, LMU University Hospital, Ludwig Maximilian University of Munich

AI summary Proposes an attention-enhanced multimodal ordinal regression framework that integrates MRI, demographic, and genetic data for automated, interpretable AD severity staging; validated on ADNI and other datasets, the ordinal model performs best on adjacent-stage accuracy (0.970) and agreement with clinical staging (QWK 0.549).

Comments 18 pages. Submitted to journal for review

English abstract

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.
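The two ordinal metrics quoted above can be sketched in a few lines of Python. Quadratic weighted kappa (QWK) follows its standard definition; treating adjacent-stage accuracy as the share of predictions within one stage of the true label is an assumption, since the abstract does not define it. Stage labels are taken to be integers 0..n_classes-1, and the function names are hypothetical.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights.

    Undefined (division by zero) if all labels and predictions fall in one
    and the same class.
    """
    n = len(y_true)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic penalty
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

def adjacent_accuracy(y_true, y_pred):
    """Share of predictions at most one stage away from the true stage."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

if __name__ == "__main__":
    truth = [0, 1, 2, 3, 1, 2]      # hypothetical stage labels
    predicted = [0, 1, 3, 3, 2, 0]  # hypothetical model outputs
    print(quadratic_weighted_kappa(truth, predicted, 4))
    print(adjacent_accuracy(truth, predicted))
```

Because QWK penalises errors by the squared distance between stages, a model can have high adjacent-stage accuracy and still a moderate QWK if its few large errors are far from the true stage.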

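2606.11786 2026-06-11 cs.CL New submission

Lius: Translation Model Based Instructional Lingustic Using Continual Instruction Tuning In Kupang Malay

Joanito Agili Lopo, Yunita Sari, Guntur Budi Herwanto

Affiliations * Universitas Gadjah Mada

AI summary For the low-resource language Kupang Malay, proposes designing instructions from the lexical and semantic features of a bilingual dictionary and fine-tuning large language models with Continual Instruction Tuning (CIT), beating the baseline by 4-6 points on several metrics and NMT and multilingual LLMs by 10-13 points.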

Comments This paper is the result of the Master Thesis in Master of Artificial Intelligence at Universitas Gadjah Mada

English abstract

Large Language Models (LLMs) offer new potential for translation tasks but often experience performance degradation when handling low-resource languages. To address this limitation, we propose an approach for fine-tuning LLMs on a low-resource language, Kupang Malay. Our approach involves designing a set of instructions by leveraging explicit lexical and semantic features from a bilingual dictionary, and introducing Continual Instruction Tuning (CIT), a training paradigm that enables iterative instruction-based training. Experimental results demonstrate that our model, named Lius, outperforms standard instruction-tuned models by 4-6 points and surpasses both Neural Machine Translation (NMT) and multilingual LLM models by 10-13 points on several evaluation metrics. These findings highlight the potential of our approach to mitigate the reliance on large-scale parallel data in low-resource language translation.

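2606.11782 2026-06-11 cs.CV New submission

Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting

He-Bi Yang, Jing-Zhong Chen, Yen-Kuan Ho, Sang NguyenQuang, Fan-Yi Hsu, Yun-Yu Lee, Jui-Chiu Chiang, Wen-Hsiao Peng

Affiliations * National Yang Ming Chiao Tung University; National Chung Cheng University

AI summary To address the difficulty 3D Gaussian Splatting has in synthesizing high-frequency textures in memory-constrained and rate-distortion-optimized pipelines, proposes a 2D perceptual wrapper that uses pseudo-random Gaussian noise and Wasserstein Distortion supervision to enhance rendered outputs in a content- and view-dependent manner, markedly improving perceptual quality while reducing model size.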

Comments 18 pages, 9 figures

English abstract

While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.