arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17832 2026-06-17 cs.LG New submission

From Drift to Coherence: Stabilizing Beliefs in LLMs

SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee

Affiliations: Department of Statistics, Seoul National University; Korea Advanced Institute of Science & Technology; Department of AI, Kookmin University; University of Hong Kong

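AI summary: Studies belief drift in LLMs on multiple-choice question answering. The authors propose prompted predictive resampling (PPR), find that the belief process self-stabilizes and converges, and then propose a seed-answer prompting strategy and a self-consistency loss to accelerate stabilization and improve predictive coherence.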

Abstract

Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.
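The martingale condition mentioned above can be checked exactly when the answer space is discrete: the expected next-step predictive, averaged over the model's own next answer, should equal the current predictive. The sketch below illustrates this with two toy predictors standing in for an LLM. The names `polya_predict`, `sharpened_predict`, and `one_step_drift` are hypothetical and are not the paper's code or its drift metric.

```python
from collections import Counter

OPTIONS = ("A", "B", "C", "D")  # assumed discrete answer space

def polya_predict(history, prior=1.0):
    """Toy coherent predictor (Dirichlet-multinomial / Polya urn)."""
    counts = Counter(history)
    total = prior * len(OPTIONS) + len(history)
    return {a: (prior + counts[a]) / total for a in OPTIONS}

def sharpened_predict(history, prior=1.0):
    """Toy incoherent predictor: over-commits to answers it has already given."""
    counts = Counter(history)
    weights = {a: (prior + counts[a]) ** 2 for a in OPTIONS}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def one_step_drift(predict, history):
    """Max deviation between E[p_{t+1} | history] and p_t, computed exactly."""
    current = predict(history)
    expected_next = {a: 0.0 for a in OPTIONS}
    for sampled, p_sampled in current.items():
        nxt = predict(list(history) + [sampled])
        for a in OPTIONS:
            expected_next[a] += p_sampled * nxt[a]
    return max(abs(expected_next[a] - current[a]) for a in OPTIONS)

print(one_step_drift(polya_predict, ["A", "B", "A"]))  # numerically zero
print(one_step_drift(sharpened_predict, ["A"]))        # strictly positive
```

With a real model, `predict` would be replaced by the model's answer distribution given the question and the previously sampled answers; that wiring is left out here.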

2606.17831 2026-06-17 cs.RO cs.HC New submission

Accountability in Autonomous Drone-Based Firefighting: Insights From a Field Trial

Dzmitry Katsiuba, Anna Katharina Boos, Robin Hany, Mateusz Dolata, Gerhard Schwabe

Affiliations: University of Zurich; Zeppelin University

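AI summary: Through field trials, studies how autonomous drones affect accountability in firefighting, finds role uncertainty and new human-drone interaction issues, and offers recommendations for integrating drones responsibly.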

Comments: Accepted for publication at the International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/ethical_is/ethical_is/10/

Journal ref: International Conference on Information Systems, 2025, ICIS2025-2162

Abstract

There is a growing research field exploring how autonomous drones can enhance emergency response effectiveness. Integrating these (artificial) agents into existing emergency teams and workflows may significantly impact established accountability relationships. This paper examines how autonomous drones affect accountability attribution within complex socio-technical systems. Drawing on two real-life field trials in firefighting, the study reveals substantial uncertainty around accountability when drones are organizationally deployed. Using Bovens' accountability framework, two challenges are identified: (1) uncertainty about the role of drones within hierarchical structures, leading to confused accountability ascriptions; and (2) new forms of human-drone interactions introducing additional accountability-relevant issues. Based on these insights, the paper proposes actionable recommendations to support the responsible integration of autonomous drones into firefighting operations without undermining accountability. These findings offer practical guidance for policymakers and contribute to further research on accountability in autonomous systems.

2606.17830 2026-06-17 cs.LG cs.AI New submission

Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

Viet-Hoang Tran, Vinh Khanh Bui, Van-Hoan Trinh, Tan Lai Ngoc, Tan M. Nguyen

Affiliations: National University of Singapore; Center for AI Research, VinUniversity; Independent Researcher; Technical University of Munich

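AI summary: Formally studies how positional encodings affect functional equivalence in Transformers. Sinusoidal encodings preserve the symmetries of vanilla attention, whereas rotary encodings significantly reduce the symmetry group and thereby enhance expressivity; an alignment algorithm shows empirically that positional encoding plays a key role in linear mode connectivity.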

Comments: Published at the International Conference on Machine Learning (ICML 2026)

Abstract

Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants, sinusoidal and rotary positional encodings (RoPE), we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.
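To make the kind of symmetry at stake concrete, the following is a standard observation about a single attention head rather than a result taken from the paper. The symbols $W_Q$, $W_K$, $A$, and $R_m$ are notation assumed here, and the rotary case is only a sufficient-condition sketch.

```latex
% Vanilla head: for any invertible A, the attention scores are unchanged.
\[
(X W_Q A)\,(X W_K A^{-\top})^{\top}
  = X W_Q \, A A^{-1} \, W_K^{\top} X^{\top}
  = X W_Q W_K^{\top} X^{\top}.
\]
% Rotary head (sketch): with q_i = W_Q^T x_i, k_j = W_K^T x_j and rotations R_m,
% the same reparameterization maps the score as
\[
s_{ij} = q_i^{\top} R_{j-i}\, k_j
\quad\longmapsto\quad
q_i^{\top} A\, R_{j-i}\, A^{-1} k_j ,
\]
% which equals s_{ij} whenever A R_m = R_m A for every offset m.
```

Under these assumptions, the rotary case admits only those $A$ that commute with the rotations, which is consistent with the abstract's statement that rotary encodings shrink the symmetry group.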

2606.17826 2026-06-17 cs.CL cs.AI New submission

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

Affiliations: AITRICS; University of Copenhagen; KAIST

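AI summary: To address multiscript variability affecting ASR in non-English clinical settings, introduces the MultiClin benchmark, which measures recognition quality more fairly through multiscript-aware evaluation, and finds that script unification improves ASR performance.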

Comments: Interspeech 2026

Abstract

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
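One simple way to make a string-matching metric tolerant of orthographic variants is to score a hypothesis against every valid reference and keep the best match. The sketch below does this for character error rate. It illustrates the general idea only and is not MultiClin's actual metric; the function names and the Korean example strings are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    """Single-reference character error rate."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def multi_reference_cer(hypothesis, references):
    """Best CER over all accepted orthographic variants of the reference."""
    return min(cer(hypothesis, ref) for ref in references)

hyp = "MRI 검사"
variants = ["엠알아이 검사", "MRI 검사"]  # Hangul and Latin spellings of the same term
print(cer(hyp, variants[0]))               # penalized under a single reference
print(multi_reference_cer(hyp, variants))  # 0.0 once both scripts are accepted
```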

2606.17824 2026-06-17 cs.CV cs.AI New submission

Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows

Paul Julius Kühn, Saptarshi Neil Sinha, Jakob Hansen, Robin Horst

Affiliations: Fraunhofer IGD; Hochschule RheinMain

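AI summary: Proposes a human-in-the-loop pipeline that produces a segmented atlas through greedy view selection, interactive segmentation with SAM 2, and UV back-projection. The atlas supports downstream tasks such as material assignment and style transfer, and the pipeline is validated on eight cultural heritage objects.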

Abstract

Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model's UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
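The view-selection step can be pictured as classic greedy set cover: repeatedly pick the view that sees the most still-uncovered surface samples. The sketch below assumes visibility has already been computed as a mapping from view name to the set of sample indices it sees. The function name, the `coverage_target` parameter, and the example views are hypothetical, and the paper's actual scoring may differ.

```python
def greedy_view_selection(visibility, coverage_target=1.0):
    """Pick views greedily until the target fraction of samples is covered.

    visibility: dict mapping view name -> set of visible sample indices.
    Returns (chosen view names in pick order, set of covered samples).
    """
    all_points = set().union(*visibility.values())
    needed = coverage_target * len(all_points)
    covered, chosen = set(), []
    while len(covered) < needed:
        # View contributing the most samples not yet covered.
        best = max(visibility, key=lambda v: len(visibility[v] - covered))
        gain = visibility[best] - covered
        if not gain:  # nothing left to gain; stop early
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered

views = {
    "front": {0, 1, 2, 3},
    "back": {4, 5, 6},
    "top": {2, 3, 4},
    "left": {0, 6},
}
print(greedy_view_selection(views))  # picks "front", then "back"
```

Greedy selection is not guaranteed to find the smallest possible view set, but it is simple and tends to give compact covers in practice.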

2606.17821 2026-06-17 cs.AI New submission

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

Esteban Schafir, Xu Zheng, Hojat Allah Salehi, Zhuomin Chen, Mo Sha, Wei Cheng, Dongsheng Luo

Affiliations: Florida International University; NEC-Labs; Singapore Management University

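AI summary: Proposes the DecoSearch framework, which uses complexity-aware routing to send each query to either direct generation or DAG decomposition and adds a Topology Refiner to repair execution failures, reaching high accuracy on BIRD and Spider while substantially reducing token consumption.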

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.
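The routing idea described above can be sketched as a small dispatcher: simple questions take one generation call, complex ones are split into sub-questions that are solved in dependency order. This is a structural illustration only. The callables `is_complex`, `decompose`, and `generate_sql` are hypothetical stand-ins for the LLM-backed components, and schema pruning, retrieval, and the Topology Refiner are omitted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def route(question, is_complex, generate_sql, decompose):
    """Direct path for simple questions, DAG path for complex ones."""
    if not is_complex(question):
        return [generate_sql(question, {})]
    # decompose returns (node -> sub-question text, node -> set of prerequisite nodes)
    sub_questions, deps = decompose(question)
    results = {}
    for node in TopologicalSorter(deps).static_order():
        # Each step sees the outputs of the steps it depends on.
        context = {d: results[d] for d in deps.get(node, ())}
        results[node] = generate_sql(sub_questions[node], context)
    return list(results.values())

# Stub components, for illustration only.
def is_complex(q):
    return " and " in q

def decompose(q):
    first, second = q.split(" and ", 1)
    return {"q1": first, "q2": second}, {"q1": set(), "q2": {"q1"}}

def generate_sql(q, context):
    return f"-- SQL for: {q} (uses: {sorted(context)})"

print(route("list all singers", is_complex, generate_sql, decompose))
print(route("find the oldest singer and count their concerts",
            is_complex, generate_sql, decompose))
```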

2606.17820 2026-06-17 cs.CL New submission

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

Reihaneh Amooie, Yun Hao, Wietse de Vries, Jelske Dijkstra, Matt Coler, Martijn Wieling

Affiliations: University of Groningen; Fryske Akademy; Vrije Universiteit Brussel

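AI summary: Studies the effect of bilingual fine-tuning on ASR for low-resource languages, evaluated on nine language pairs with a language identification token distinguishing the two languages. Bilingual fine-tuning helps when language identification accuracy is high; when it is low, supplying the token at inference improves performance.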

Abstract

This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of language families and writing systems. To distinguish the two languages, during training, we prepend a language identification token to each input text. At inference, the model jointly predicts both the language and transcription from the speech input alone. As texts for which the language is incorrectly determined show low ASR performance, we also conduct a follow-up experiment in which the language identification token is provided both during training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that in cases where language identification performance is low, including the language identification token at inference helps to improve ASR performance.

2606.17816 2026-06-17 cs.LG cs.AI New submission

Conservation Laws for Modern Neural Architectures

Viet-Hoang Tran, Vinh Khanh Bui, Tan Lai Ngoc, Nam Nguyen, Tuan Dam, Tan M. Nguyen

Affiliations: National University of Singapore; Center for AI Research, VinUniversity; Independent Researcher; Hanoi University of Science and Technology

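AI summary: Proposes a unified framework that characterizes gradient-flow conservation laws in feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention, and Mixture-of-Experts models; experiments validate the invariants the theory predicts.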

Comments: Published at the International Conference on Machine Learning (ICML 2026)

Abstract

Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
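For readers unfamiliar with conservation laws in gradient flow, the well-known two-layer linear case gives the flavor. The derivation below is the textbook example for linear networks, not one of the new results announced in the abstract; the symbols $W_1$, $W_2$, $G$, and $L$ are notation assumed here.

```latex
% Two-layer linear network under gradient flow on a loss L.
\[
f(x) = W_2 W_1 x, \qquad G = \frac{\partial L}{\partial (W_2 W_1)}, \qquad
\dot W_1 = -W_2^{\top} G, \qquad \dot W_2 = -G\, W_1^{\top},
\]
\[
\frac{d}{dt}\bigl(W_1 W_1^{\top}\bigr)
  = -W_2^{\top} G W_1^{\top} - W_1 G^{\top} W_2
  = \frac{d}{dt}\bigl(W_2^{\top} W_2\bigr)
\;\Longrightarrow\;
\frac{d}{dt}\bigl(W_1 W_1^{\top} - W_2^{\top} W_2\bigr) = 0 .
\]
```

The quantity $W_1 W_1^{\top} - W_2^{\top} W_2$ therefore stays at its initial value throughout training, whatever the data or loss.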

2606.17810 2026-06-17 cs.LG cs.AI New submission

No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems

Khoat Than

Affiliations: Hanoi University of Science and Technology

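AI summary: Presents the No-Free-Fairness theorems, which identify three inherent sources of disparity in learning systems: irreducible task cost forces a trade-off between performance and fairness, finite samples induce subgroup disparity, and limited model-class expressivity makes fairness unattainable. Unfairness thus stems from the structure of the decision problem, the finiteness of data, and model expressivity.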

Abstract

In this paper, we establish a set of theoretical impossibility results, termed the No-Free-Fairness theorems, that identify three fundamental sources of disparity in learning systems. First, we show that when a task exhibits irreducible cost on a subgroup, any decision rule must trade off overall performance with disparity, yielding an inherent fairness-cost frontier. Second, we prove that even in ideal, noise-free settings where a perfectly fair and accurate solution exists, finite-sample learning alone induces nontrivial subgroup disparity, ruling out distribution-free fairness guarantees. More seriously, enforcing strict relative fairness creates a statistical bottleneck: achieving low cost may require exponentially many samples. Third, we show that limitations of the model class can independently induce disparity: if the model cannot represent accurate solutions for a subgroup, fairness remains unattainable regardless of data or training procedure. Overall, these results demonstrate that unfairness is not solely a consequence of biased data or suboptimal optimization, but arises from the intrinsic structure of decision problems, the constraints of finite data, and the expressivity of models. Our framework applies broadly beyond standard supervised learning, and suggests that achieving fairness requires explicit trade-offs and should be treated as a core design consideration.

2606.17809 2026-06-17 cs.CV New submission

Million-scale multimodal pollen microscopy with expert-guided foundation models

András Biricz, Björn Gedda, Donát Magyar, Antonio Spanu, János Fillinger, Péter Pollner, István Csabai

Affiliations: Department of Physics of Complex Systems, ELTE Eötvös Loránd University; The Palynological Laboratory at the Swedish Museum of Natural History; National Centre for Public Health and Pharmacy; INRAE, UR 546 BioSP, Site Agroparc; National Korányi Institute for Pulmonology; Health Data Science and AI Knowledge Centre, Health Services Management Training Centre, Faculty of Health and Public Administration, Semmelweis University; Department of Biological Physics, ELTE Eötvös Loránd University

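AI summary: Presents Pollen AI Atlas, a million-scale multimodal pollen microscopy dataset that uses expert-guided vision-language models to generate morphological descriptions, enabling high-accuracy pollen recognition and retrieval across regions and scanner settings.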

Comments: 31 pages, 5 main figures, supplementary information included. Submitted to Scientific Reports

Abstract

Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address this gap, we present a million-scale multimodal pollen microscopy resource, Pollen AI Atlas, assembled from pure-species whole-slide bright-field images spanning four geographic origins, four scanner settings and 46 taxon labels across 31 botanical families. Seeded by one manually selected exemplar per source slide, token-level mining and filtering produced 1,511,390 released grain detections with 99.6% proposal precision in expert-curated test regions. Each detection was paired with machine-generated grain-level morphological captions from five open-weight vision-language models, guided by expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape and size. Among the evaluated models, Gemma4 provided the most controlled primary caption set, combining tight length control, no leakage and the strongest text-retrieval performance. Baseline benchmarks with frozen visual features reached 88.16% top-1 accuracy, while cross-regional retrieval showed that caption-derived text embeddings remained robust when image similarity degraded (mAP@20 0.811 versus 0.262). Released data, annotations, captions, splits, code, and weights provide a benchmark for pollen recognition, cross-regional domain adaptation and domain-specific multimodal microscopy learning.

2606.17805 2026-06-17 cs.LG New submission

QueryMarket: Cost-Aware Online Active Learning in Data Markets

Xiwen Huang, Pierre Pinson

Affiliations: Dyson School of Design Engineering, Imperial College London; Halfspace (part of Accenture); Technical University of Denmark (DTU Management); Aarhus University (CoRE)

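AI summary: Proposes the QueryMarket framework and the OVBAL algorithm, which estimates marginal utility via a D-optimality criterion and performs cost-aware online active learning under rolling budget constraints, adapting to nonstationary streams and heterogeneous label costs.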

Comments: 10 pages, 8 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems

Abstract

Data acquisition is a major bottleneck for learning in real-time streams: analysts must decide on the fly which labels to purchase while respecting a rolling budget. However, existing online active learning rarely unifies pricing, information gain, and rolling budget constraints under concept drift. We introduce QueryMarket, a market-inspired framework that queries each incoming data point based on its estimated utility to the model and its price. Within this framework, we propose OVBAL (online variance-based active learning), which integrates data pricing with information-driven selection by estimating each sample's marginal utility via a D-optimality criterion with exponential forgetting and executing cost-aware purchases under rolling budget constraints. OVBAL yields a simple, fully online decision rule that adapts to nonstationary streams and heterogeneous label costs. Experiments on synthetic data and a real-world solar power generation forecasting task show that OVBAL is particularly effective under seller-centric pricing and yields a more favorable long-run error-cost trade-off in the real-world task under both pricing schemes.
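A D-optimality gain with exponential forgetting has a closed form for a linear model: adding a feature vector x to the discounted information matrix λA raises its log-determinant by log(1 + xᵀA⁻¹x / λ). The sketch below pairs that gain with a gain-per-price threshold and a rolling spend cap. It is a guess at the general shape of such a rule, not OVBAL itself: the class name, the threshold test, the window logic, the choice to forget only on purchase, and all default values are assumptions.

```python
import math
from collections import deque

class CostAwareBuyer:
    """Toy label buyer: D-optimality gain per unit price, rolling budget."""

    def __init__(self, dim, forgetting=0.98, threshold=0.5, budget=3.0, window=10):
        # P tracks the inverse information matrix, starting from the identity.
        self.P = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.forgetting, self.threshold = forgetting, threshold
        self.budget, self.window = budget, window
        self.purchases = deque()  # (time step, price) of recent purchases
        self.t = 0

    def _scaled_Px(self, x):
        # (P / forgetting) @ x, i.e. the inverse of the discounted information matrix times x
        return [sum(p * xi for p, xi in zip(row, x)) / self.forgetting for row in self.P]

    def gain(self, x):
        # log det(lam*A + x x^T) - log det(lam*A) = log(1 + x^T A^{-1} x / lam)
        return math.log1p(sum(a * b for a, b in zip(x, self._scaled_Px(x))))

    def offer(self, x, price):
        """Decide whether to buy the label for features x at a positive price."""
        self.t += 1
        while self.purchases and self.purchases[0][0] <= self.t - self.window:
            self.purchases.popleft()  # drop spend that left the rolling window
        spent = sum(p for _, p in self.purchases)
        if self.gain(x) / price < self.threshold or spent + price > self.budget:
            return False
        # Sherman-Morrison update of P for lam*A + x x^T.
        px = self._scaled_Px(x)
        denom = 1.0 + sum(a * b for a, b in zip(x, px))
        n = len(x)
        self.P = [[self.P[i][j] / self.forgetting - px[i] * px[j] / denom
                   for j in range(n)] for i in range(n)]
        self.purchases.append((self.t, price))
        return True

buyer = CostAwareBuyer(dim=2)
print(buyer.offer([1.0, 0.0], price=1.0))  # novel direction: bought
print(buyer.offer([1.0, 0.0], price=1.0))  # same direction again: gain too low
print(buyer.offer([0.0, 1.0], price=1.0))  # new direction: bought
```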

2606.17803 2026-06-17 cs.LG New submission

Continual Self-Improvement with Lightweight Experiential Latent Memories

Vaggelis Dorovatas, Nancy Kalaj, Rahaf Aljundi

Affiliations: Toyota Motor Europe; University of Trento

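AI summary: Proposes an online method that converts inference-time compute into lightweight, modular latent memories trained with self-generated test-time signals, enabling continual improvement while avoiding catastrophic forgetting.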

Abstract

Large language models achieve strong reasoning performance by scaling inference-time compute, yet remain fundamentally stateless, discarding the rich, self-produced reasoning traces generated during this process. We investigate whether models can instead learn online from this experience, converting transient computation (reasoning traces) into persistent reusable knowledge, and without external supervision or access to future data. We show that In-Context Learning (ICL) over raw reasoning traces fails to generalize, reflecting a fundamental limitation of token-level reuse: individual traces lack the abstraction needed for transfer, even after refinement (e.g. self-reflection). In contrast, drawing inspiration from recent works on unsupervised reinforcement learning, we find that lightweight per-instance training with self-generated test-time signals (majority voting) as rewards yields substantial gains, often surpassing full-dataset offline training, motivating a shift from raw traces to learned latent representations. Building on this insight, we propose an online method that distills inference-time compute spent on encountered problems into compact modular latent memories capturing the underlying reasoning structure. These memories are stored and retrieved for future inputs, enabling continual improvement while avoiding catastrophic forgetting through modular design. Importantly, our method is highly efficient, parametrized as extremely lightweight soft prompt memories (~0.001% of model parameters) and trained with only a few gradient steps, yet achieving performance competitive with full parametric updates and offline training. Across challenging mathematical reasoning benchmarks, our approach significantly outperforms zero-shot and raw data ICL baselines, while transferring effectively across datasets.
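The self-generated training signal mentioned above, majority voting over sampled answers, can be sketched in a few lines: the most frequent final answer serves as a pseudo-label, and each sample is rewarded for agreeing with it. The function name and the 0/1 reward values are assumptions; how the paper turns these rewards into soft-prompt updates is not shown.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (pseudo-label, per-sample rewards) for a non-empty list of answers.

    Ties are broken in favor of the answer that appears first.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]

samples = ["42", "41", "42", "42", "7"]  # final answers from five sampled traces
print(majority_vote_rewards(samples))    # ('42', [1.0, 0.0, 1.0, 1.0, 0.0])
```

No ground-truth label enters the computation, which is what makes the signal usable at test time; the obvious risk is that a confidently wrong majority gets reinforced.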

2606.17800 2026-06-17 cs.CV New submission

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie

Affiliations: Catnip AI Team

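AI summary: Presents MaineCoon, described as the first real-time audio-visual autoregressive model with 22B parameters, which supports streaming generation at up to 47.5 FPS and sub-second interaction on a single GPU. It is optimized for social-interactive applications and introduces self-resampling, cross-modal alignment, domain-aware preference optimization, and reinforced online-policy distillation.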

Comments: 32 pages, 13 figures, 3 tables

Abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

2606.17798 2026-06-17 cs.CV cs.AI New submission

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

Affiliations: IEEE

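AI summary: Proposes LiveStarPro, which combines three components (Streaming Verification Decoding, Streaming Causal Attention Masks, and Tree-Structured Hierarchical Memory) for proactive understanding of long-horizon video streams, improving semantic correctness by 28.9% and reducing timing error by 18.2%.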

Abstract

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.
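A perplexity-based response gate can be pictured as follows: score a candidate response under the model in one pass and speak only when the model finds it unsurprising. The abstract states only that SVeD uses single-pass perplexity verification, so the threshold rule, the function names, and the default value below are assumptions, and the per-token log-probabilities would come from the actual model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of each token (non-empty list)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def should_respond(token_logprobs, max_perplexity=4.0):
    """Respond now only if the candidate response is scored as confident."""
    return perplexity(token_logprobs) <= max_perplexity

print(should_respond([-0.1, -0.2, -0.1]))  # low perplexity: respond
print(should_respond([-3.0, -2.5, -4.0]))  # high perplexity: stay silent
```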

2606.17791 2026-06-17 cs.CL cs.CV New submission

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Samar Ansari

Affiliations: School of Computing and Engineering Sciences, University of Chester

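AI summary: Uses controlled experiments to measure the information degradation caused by AI rewriting of radiology reports. EHR summarization destroys the most content yet preserves image-text alignment, whereas standardized rewriting and teaching case preparation do the reverse and cause larger alignment losses, which the author terms the slop paradox.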

Abstract

AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
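As a quick check of the "six to seven times" figure, the reported alignment drops for the two cleaning-oriented tasks can be divided by the 2.5% drop reported for EHR summarization. This only restates numbers already given in the abstract:

$$
\frac{14.9}{2.5} \approx 6.0, \qquad \frac{16.5}{2.5} = 6.6
$$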

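2606.17782 2026-06-17 cs.LG New submission

Blind Recovery of Latent Domains via Unsupervised Symmetry Discovery

Onur Efe, Arkadas Ozakin

Affiliations Bogazici University

AI summary Proposes an unsupervised framework that recovers latent domains and signals from unstructured observations by discovering symmetries of the data distribution, using a shallow group-convolutional network with stationarity and locality regularization.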

Abstract

The primary motivation in blind inverse problems is to recover signals of interest from corrupted observations without knowing the obfuscating mechanism. Blind deconvolution is a prominent approach when the corruption is convolutional, but it is not applicable when general linear transformations obfuscate the domain structure. In this work, we propose an unsupervised framework for recovering latent domains and signals by discovering symmetries of the data distribution. Our framework models observations as linear measurements of signals sampled from a latent random field, and optimizes a shallow group-convolutional network by imposing stationarity and locality regularization at the model output. The model learns a latent symmetry action and an appropriate filter, thereby mapping unstructured observations to a symmetry-based representation that reveals latent signals. Experiments on stochastic processes, Ising models, shuffled and bit-scrambled images, and neural recordings show that the method recovers latent domains and signals from unstructured observations, suggesting symmetry discovery as a new direction for unsupervised structure learning and blind inverse problems.

2606.17775 2026-06-17 cs.SD cs.AI cs.NE New submission

A Neuromorphic Trigger for Efficient Audio Event Detection

Benjamin Hatton, Oliver Rhodes, Luca Peres

Affiliations ICNS, University of Manchester

AI summary Proposes a low-cost front-end trigger based on a spiking neural network (SNN) that selectively screens audio segments, achieving an F1 score of 0.97 on anomalous sound detection and a 42.6x reduction in FLOPs on sound event detection.

Comments 9 pages, 4 figures, 6 tables

Abstract

Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constrained systems. This paper introduces a neuromorphic trigger for audio event detection, based on a spiking neural network (SNN) that selectively gates input to downstream models. The proposed trigger acts as a low-cost front-end, identifying salient audio segments and forwarding only these to a more computationally intensive model for tasks such as classification. The trigger is implemented as a lightweight fully connected SNN and evaluated on two representative tasks: Anomalous Sound Detection (ASD) and Sound Event Detection (SED). For ASD, the trigger achieves a one-second segment-based F1 score of 0.97 on a class-agnostic form of the URBAN-SED dataset, demonstrating high reliability in identifying relevant audio regions. For SED, the trigger is combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset, showing a potential $42.6\times$ reduction in FLOPs while reducing the lower bound of the event-based error rate from 0.41 to 0.25. These results highlight the potential of neuromorphic triggers as real-time, energy-efficient front-end filters, enabling substantial reductions in computational cost.

2606.17756 2026-06-17 cs.LG New submission

A fairness-aware extension of Stochastic Multicriteria Acceptability Analysis for ranking

Guilherme Dean Pelegrina, Renata Pelissari

Affiliations Engineering School, Mackenzie Presbyterian University

AI summary Proposes SMAA-Fair, which reweights simulated rankings to improve group fairness, using the Statistical Parity, rKL, and nDKL metrics, and improves the representation of protected groups in favourable positions while preserving robustness.

Abstract

Fairness has become a central concern in ranking problems involving individuals or social groups, particularly under the Responsible Artificial Intelligence agenda. In Multi-Criteria Decision Analysis, Stochastic Multicriteria Acceptability Analysis (SMAA) provides a robust framework for handling uncertainty and incomplete preference information, but it does not explicitly address fairness in the resulting rankings. This paper proposes SMAA-Fair, a fairness-aware extension of SMAA for ranking problems. The approach reweights the simulated rankings generated by SMAA according to their level of group fairness, so that fairer rankings contribute more strongly to the acceptability indices and central weights vector. The framework is independent of the aggregation model and can incorporate different fairness metrics. In this study, Statistical Parity, normalized discounted Kullback--Leibler divergence (rKL) and normalized discounted cumulative Kullback--Leibler divergence (nDKL) are adopted. Rankings are derived from the fairness-adjusted acceptability matrix using expected ranking and maximum acceptability ranking. We also derive the central weight according to the degree of fairness in the obtained rankings. Numerical experiments with synthetic and real data show that SMAA-Fair improves the representation of protected groups among favourable ranking positions, while preserving robustness to preference uncertainty.
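The reweighting idea can be illustrated with a minimal Monte Carlo sketch. It is not the authors' implementation: the weighted-sum aggregation, the Dirichlet weight sampling, the top-k parity gap used as the fairness weight, and the names `smaa_fair`, `top_k`, and `protected` are all assumptions made for illustration.

```python
import numpy as np

def smaa_fair(scores, protected, n_sim=2000, top_k=2, seed=0):
    """Fairness-weighted rank acceptability matrix (illustrative sketch).

    scores:    (n_alternatives, n_criteria) array, higher is better.
    protected: (n_alternatives,) array of 0/1 group membership flags.
    Returns a matrix A where A[i, r] is the fairness-weighted share of
    simulations in which alternative i obtained rank r.
    """
    rng = np.random.default_rng(seed)
    n_alt, n_crit = scores.shape
    acceptability = np.zeros((n_alt, n_alt))
    total_weight = 0.0
    target_share = protected.mean()  # protected share in the whole population

    for _ in range(n_sim):
        # Sample criteria weights uniformly from the simplex (assumed prior).
        w = rng.dirichlet(np.ones(n_crit))
        utility = scores @ w                 # weighted-sum aggregation (assumed)
        order = np.argsort(-utility)         # order[r] = alternative at rank r

        # Hypothetical fairness weight: 1 minus the gap between the protected
        # share in the top-k positions and the protected share overall.
        top_share = protected[order[:top_k]].mean()
        fairness = 1.0 - abs(top_share - target_share)

        # Fairer rankings contribute more to the acceptability indices.
        acceptability[order, np.arange(n_alt)] += fairness
        total_weight += fairness

    return acceptability / total_weight
```

Because each simulated ranking adds the same weight to exactly one cell per row and per column, every row and column of the returned matrix sums to 1, so it can be read like a standard SMAA rank acceptability matrix.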

2606.17742 2026-06-17 cs.CV q-bio.NC New submission

BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics

Junfeng Xia, Wenhao Ye, Junxiang Zhang, Xuanye Pan, Mo Wang, Quanying Liu

Affiliations Department of Biomedical Engineering, Southern University of Science and Technology; School of Biomedical Engineering, Shenzhen University

AI summary Proposes BrainWorld, a model that conditions on structural MRI as an anatomical prior and generates whole-brain 4D fMRI dynamics through a denoising process. Across 22 datasets it stably generates trajectories of up to 400 frames and improves downstream task performance through generated-example augmentation.

Abstract

Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.

2606.17739 2026-06-17 cs.RO cs.AI cs.CV cs.MA New submission

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

Lina Magoula, Nikolaos Koursioumpas, Nancy Alonistioti, Ramin Khalili

Affiliations Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens; Huawei Heisenberg Research Center (Munich)

AI summary Proposes the ED3R framework, which combines hierarchical cooperation between a robot and a remote controller with predictions from distributed neural regression models to detect wildfires under uncertainty at minimal energy cost, achieving a mission success rate of up to 97.18%, up to 36.4% lower energy consumption, and up to 41% faster detection.

Comments 14 pages, 9 figures

Abstract

Robotics is expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. In the most demanding missions in particular, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.

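2606.17735 2026-06-17 cs.AI New submission

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu

Affiliations SenseTime; Shanghai Jiao Tong University

AI summary Proposes dynamic epistemic entropy orchestrated erasable reinforcement learning (E³RL), which treats the model's endogenous local autoregressive cross-entropy as a coordinate of epistemic uncertainty and uses segment-level adaptive dynamic thresholds and advantage allocation to precisely excise logical defects while reusing the KV cache, addressing autoregressive cascade collapse in long-sequence reasoning.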

Abstract

Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349% and 6.514%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).

2606.17730 2026-06-17 cs.CV New submission

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Zhexiao Xiong, Yizhi Song, Hao Kang, Qing Yan, Liming Jiang, Jenson Yang, Zhoujie Fu, Stathi Fotiadis, Angtian Wang, Zichuan Liu, Bo Liu, Yiding Yang, Xin Lu, Nathan Jacobs

Affiliations Washington University in St. Louis; Intelligent Creation, ByteDance

AI summary Proposes ActWorld, which builds a 100K interaction video dataset and a hierarchical action-aware memory mechanism to extend a navigation model with object interaction inside a chunk-autoregressive framework, addressing the data-scarcity and memory bottlenecks.

Comments Project page: https://interactwm.github.io/ActWorld

Abstract

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.

2606.17727 2026-06-17 cs.AI New submission

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

Yi Zhao, Zhen Yang, Mengpan Chen, Mingde Xu, Shanghui Gong, Xijun Liu, Jibing Gong, Jie Tang

Affiliations Tsinghua University; Yanshan University; University of Waterloo; Beihang University

AI summary Proposes the LongWebBench benchmark, which evaluates long webpage generation by structural fidelity and functional executability, and finds that generations with high visual similarity often fail multi-step interactions.

Comments 49 pages, 38 figures

Abstract

Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at https://github.com/zheny2751-dotcom/LongWebBench.

2606.17722 2026-06-17 cs.CV New submission

GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening

Fangyi Li, Xiaoyuan Yang, Yixiao Li, Zongyang Sui, Kangqing Shen, Gemine Vivone

Affiliations Beihang University; Tsinghua University; National Research Council - Institute of Methodologies for Environmental Analysis, CNR-IMAA

AI summary Proposes the GSPan framework, which brings 2D Gaussian Splatting to pansharpening by representing residual details as continuous, learnable 2D Gaussian primitives, enabling arbitrary-scale fusion without retraining.

Abstract

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.

2606.17713 2026-06-17 cs.CV New submission

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

Jiangong Xu, Weibao Xue, Xiaoyu Yu, Jun Pan, Xinlian Liang, Mi Wang

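Affiliations State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing; School of Computer Science and Information Engineering; Hangzhou International Innovation Institute; Oriental Space Port Research Institute; Hubei Luojia Laboratory

AI summary To address the unreliability of optical remote sensing under cloud contamination, proposes CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that performs near-real-time LULC mapping through optical reliability modulation, heterogeneous information adaptive aggregation, and a unified semantic mapping transformer. Also builds CloudLULC-Set, a global benchmark dataset of 40,223 triplets, and outperforms existing methods on multiple metrics.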

Abstract

Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions. The project is publicly available at: https://github.com/RSIIPAC/CloudLULC

2606.17711 2026-06-17 cs.CV cs.AI New submission

Structured Adversarial Camouflage via Voronoi Diagrams

Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer

Affiliations Fraunhofer IOSB and Fraunhofer Center for Machine Learning; Karlsruhe Institute of Technology

AI summary Proposes generating structured camouflage patterns by optimizing seed-point locations through a soft assignment. Under a fixed palette the patterns effectively reduce person-detection AP, and the attack transfers across domains.

Abstract

Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria -> COCO) performs comparably poorly, while garment-level application via a segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (<=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code is available at https://github.com/JensBayer/Voronoi. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee (IST-224-RSY) and held in Bath, United Kingdom, 12-13 May 2026.
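To make the soft-assignment idea concrete, here is a minimal NumPy sketch of a soft Voronoi renderer. It is a forward pass only and not the authors' code: the softmax over negative squared distances, the `temperature` parameter, and the function name `soft_voronoi_render` are assumptions. In an actual attack the same computation would be written in an autodiff framework so that gradients reach the seed coordinates.

```python
import numpy as np

def soft_voronoi_render(seeds, seed_color_idx, palette, height, width,
                        temperature=0.05):
    """Render a palette-constrained pattern from Voronoi seed points.

    seeds:          (S, 2) seed coordinates in [0, 1] x [0, 1], as (x, y).
    seed_color_idx: (S,) index into the palette for each seed (kept fixed).
    palette:        (P, 3) fixed RGB palette with values in [0, 1].
    Returns the (height, width, 3) image and the (H*W, S) soft weights.
    """
    # Normalised pixel-centre coordinates on a regular grid.
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs / (width - 1), ys / (height - 1)], axis=-1)
    pixels = pixels.reshape(-1, 2)

    # Squared distance from every pixel to every seed.
    d2 = ((pixels[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)

    # Softmax over seeds: a smooth stand-in for the hard nearest-seed rule.
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each pixel is a convex combination of the seeds' palette colours.
    image = weights @ palette[seed_color_idx]
    return image.reshape(height, width, 3), weights
```

As `temperature` approaches zero the weights approach a one-hot nearest-seed assignment, so the output converges to a hard Voronoi diagram painted with palette colours only.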

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG New submission

Vision-language models for chest radiography do not always need the image

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Affiliations Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich; Lab for AI in Medicine, RWTH Aachen University; Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen

AI summary Using a causal audit, this paper finds that many medical vision-language models rely on text priors rather than the image in chest radiography tasks and that a text-only model performs close to multimodal models, and it proposes an evaluation framework based on image dependence.

Abstract

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

2606.17706 2026-06-17 cs.LG cs.AI New submission

Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

Savini Kommalage, Sanka Mohottala, Asiri Gawesha, Dulara Madhusanka, Menan Velayuthan, Dharshana Kasthurirathna, Mahima Milinda Alwis Weerasinghe, Charith Abhayaratne

Affiliations Faculty of Computing, Sri Lanka Institute of Information Technology, Sri Lanka; Faculty of Engineering, University of Sri Jayewardenepura, Sri Lanka; Faculty of Engineering, Sri Lanka Institute of Information Technology, Sri Lanka; University of Sheffield, United Kingdom; Utrecht University, The Netherlands

AI summary Proposes a confusion-aware difficulty score and disentangles the scoring and pacing effects of curriculum learning through stage-wise test subsets and a random-order baseline. On CIFAR-10 the scoring is shown to be interpretable, but there is no accuracy gain at full data, with data-efficiency gains only in low-data regimes.

Comments Accepted at International Conference on Machine Learning (ICML) GlobalSouthML Workshop (2026)

Abstract

Curriculum learning couples two design choices, how samples are scored by difficulty and how harder samples are paced into training, making it difficult to attribute observed gains to either component. We disentangle these factors with two evaluation protocols: stage-wise test subsets that validate scoring functions independently of curriculum training, and a baseline that applies the same pacing schedule to randomly ordered data. Within the Transfer Teacher framework (TTF), we use these protocols to evaluate a confusion-aware difficulty score that considers both correct-class confidence and the probability distribution over incorrect classes. On CIFAR-10 with ResNet-18 and VGG-16, the proposed score produces model-interpretable difficulty rankings that align with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improves accuracy over standard training, indicating that improving the scoring function alone is insufficient to overcome the known failure modes of curriculum learning in TTF. In contrast, we find that confusion-aware curriculum ordering yields consistent data-efficiency benefits, outperforming random ordering by up to 8.7 percentage points in the 20% data regime, suggesting the potential of TTF as a data-efficient training method.
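
The abstract does not give the scoring formula, so the following is only an illustrative sketch of what a confusion-aware score and its random-order control might look like. The function names, the entropy-based confusion term, and the `weight` parameter are all assumptions made for this example, not the authors' definitions.

```python
import math
import random


def confusion_aware_difficulty(probs, true_class, weight=0.5):
    """Hypothetical difficulty score (not the paper's formula).

    Combines (a) lack of confidence in the correct class with
    (b) how evenly the remaining probability is spread over the
    incorrect classes, measured by normalized entropy.
    """
    miss = 1.0 - probs[true_class]
    wrong = [p for i, p in enumerate(probs) if i != true_class]
    wrong_mass = sum(wrong)
    if wrong_mass <= 0.0 or len(wrong) < 2:
        confusion = 0.0
    else:
        q = [p / wrong_mass for p in wrong]
        entropy = -sum(x * math.log(x) for x in q if x > 0.0)
        confusion = entropy / math.log(len(wrong))  # scaled to [0, 1]
    # The confusion term is scaled by the missed mass so that it
    # vanishes for confidently correct samples.
    return miss + weight * confusion * miss


def curriculum_order(scores):
    """Sample indices sorted from easiest to hardest."""
    return sorted(range(len(scores)), key=scores.__getitem__)


def random_order(n, seed=0):
    """Control ordering: same samples, shuffled, for use with the same pacing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx
```

Feeding both `curriculum_order` and `random_order` through an identical pacing schedule is what would let an accuracy difference be attributed to the scoring function rather than to pacing.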

2606.17702 2026-06-17 cs.CV cs.AI New submission

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

Affiliations: Department of Data Science and Artificial Intelligence, School of Computing and Artificial Intelligence, Faculty of Engineering and Technology, Sunway University; Faculty of Dentistry, Universiti Malaya

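AI summary: Proposes the SegTME-UNI2 framework, which combines the UNI2-H pathology foundation model with a dual-head UperNet decoder for six-class semantic segmentation and nuclear instance segmentation, addresses scarce annotations with a three-stage pseudo-label curriculum, and uses an LLM to generate clinically interpretable TME reports.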

Abstract

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, so improvement comes entirely from better pseudo-labels. Stage 1 uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 µm/pixel). Stage 2 uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 µm/pixel). Stage 3 uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 µm/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These metrics are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.
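
The abstract does not say how the entropy filter is defined, so the sketch below only shows one plausible reading of "entropy-filtered pseudo-labels". It assumes the filter acts per patch on mean per-pixel predictive entropy; the function names, the data layout, and the `max_entropy` threshold are hypothetical.

```python
import math


def mean_pixel_entropy(prob_map):
    """Average Shannon entropy (nats) over a patch.

    prob_map is a list of per-pixel class-probability lists.
    """
    total = 0.0
    for pixel in prob_map:
        total += -sum(p * math.log(p) for p in pixel if p > 0.0)
    return total / len(prob_map)


def filter_pseudo_labels(patches, max_entropy):
    """Keep only patches the teacher model is confident about.

    patches maps a patch id to its prob_map. Returns (patch_id, labels)
    pairs, where labels are the per-pixel argmax classes.
    """
    kept = []
    for patch_id, prob_map in patches.items():
        if mean_pixel_entropy(prob_map) <= max_entropy:
            labels = [max(range(len(px)), key=px.__getitem__) for px in prob_map]
            kept.append((patch_id, labels))
    return kept
```

Under this reading, the Stage 1 model's confident patches would form the Stage 2 training set, and low-confidence patches would be dropped rather than relabelled.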

2606.17698 2026-06-17 cs.AI cs.CL New submission

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

Zeyao Du, Tong Li, Haibo Zhang

Affiliations: Shopee

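AI summary: Proposes the EComAgentBench benchmark of 662 tasks based on real Amazon products, in which an agent must, within 100 tool calls, uncover hidden intent from a visible query, a tool-gated profile, and scripted clarification, verify candidate products, and commit to a final choice. Typed, source-tagged rubrics attribute each failure.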

Abstract

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.