arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14641 2026-05-15 cs.CV cs.AI

How to Evaluate and Refine your CAM

Luca Domeniconi, Alessandra Stramiglio, Michele Lombardi, Samuele Salti

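Affiliations * University of Bologna

AI summary This work addresses the evaluation and refinement of class attribution maps (CAMs) for convolutional neural networks. It introduces a synthetic dataset with ground-truth attributions that allows a more rigorous comparison of existing evaluation metrics, and proposes ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution of CAMs, it also introduces RefineCAM, which produces high-resolution attribution maps by aggregating CAMs across multiple network layers; experiments show that it outperforms existing methods under the new metric.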

Comments Accepted at ICPR 2026

English abstract

Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

2605.14636 2026-05-15 cs.AI

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

Chenlu Ding, Jiancan Wu, Yanchen Luo, Zheyuan Liu, Yancheng Yuan, Xiang Wang

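Affiliations * University of Science and Technology of China; The Hong Kong Polytechnic University; University of Notre Dame

AI summary This work studies how large language models fail to reason under temporal cutoffs: when answering a question from the standpoint of an earlier time, they wrongly use information that only became available later. It proposes TCFT, a Temporal Critique Fine-Tuning framework that trains models to identify and judge temporal leakage in a response, improving their reasoning under time constraints. Experiments show that TCFT clearly outperforms prompting and conventional fine-tuning baselines on several models and reduces the rate of temporal leakage.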

English abstract

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

2605.14635 2026-05-15 cs.CV cs.AI

MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan, Takashi Wada

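Affiliations * ZOZO NEXT Inc.

AI summary This paper presents MultiEmo-Bench, a multi-label visual emotion analysis benchmark for comprehensively evaluating how well multimodal large language models (MLLMs) predict the emotions evoked by images. Existing datasets rely on single-label annotation, which cannot capture the multiple emotions of varying intensity that one image may evoke. The paper therefore collects votes from multiple annotators per image, yielding a dataset of 10,344 images and 236,998 valid emotion votes. Several mainstream models are evaluated on dominant emotion prediction and emotion distribution prediction, revealing both the progress and the remaining shortcomings of current MLLMs in emotion understanding.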

English abstract

This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
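
The vote aggregation described in the abstract above can be illustrated with a minimal sketch. It assumes each annotator returns a set of selected emotions and that the distribution is the share of all votes each emotion receives; the abstract does not state the exact normalization, and the emotion names and the `aggregate_votes` helper are hypothetical.

```python
from collections import Counter


def aggregate_votes(annotations):
    """Turn per-annotator emotion selections into a vote distribution.

    annotations: list of sets, one set of selected emotions per annotator.
    Returns (distribution, dominant_emotion). Normalizing by the total
    number of votes is an assumption, not the paper's stated procedure.
    """
    counts = Counter(label for selected in annotations for label in selected)
    total = sum(counts.values())
    if total == 0:
        return {}, None
    distribution = {label: n / total for label, n in counts.items()}
    dominant = max(distribution, key=distribution.get)
    return distribution, dominant


# Illustrative votes from four annotators for one image (labels are made up).
votes = [{"joy", "awe"}, {"joy"}, {"awe", "sadness"}, {"joy"}]
distribution, dominant = aggregate_votes(votes)
```

The two outputs correspond to the two tasks named in the abstract: `dominant` is the target for dominant emotion prediction, and `distribution` is the target for emotion distribution prediction.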

2605.14632 2026-05-15 cs.LG stat.AP

DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes

Manrui Jiang, Jingru Huang, Yong Chen, Chen Zhang

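Affiliations * Department of Industrial Engineering, Tsinghua University, Beijing 100084, China; Department of Industrial and Systems Engineering, University of Iowa, Iowa City, IA 52242, USA

AI summary This work proposes DRL-STAF, a deep reinforcement learning framework for state-aware forecasting of complex multivariate hidden Markov processes. The method models nonlinear observations with deep neural networks and estimates discrete hidden states with reinforcement learning, overcoming the limitations of conventional hidden Markov models in nonlinear emissions and scalability while reducing reliance on predefined transition structures. Experiments show that DRL-STAF outperforms existing methods in both forecasting performance and hidden-state estimation.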

English abstract

Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and cross-sequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates discrete hidden states using reinforcement learning, reducing the reliance on predefined transition structures and enabling flexible adaptation to diverse temporal dynamics. In particular, DRL-STAF mitigates the state-space explosion encountered by typical multivariate HMM-based methods. Extensive experiments demonstrate that DRL-STAF outperforms HMM variants, standalone deep learning models, and existing DL-HMM hybrids in most cases, while also providing reliable hidden-state estimates.

2605.14631 2026-05-15 cs.LG cs.AI cs.CV

Action-Inspired Generative Models

Eshwar R. A., Debnath Pal

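Affiliations * Department of Computer Science Engineering, PES University (EC Campus), Bengaluru; Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru

AI summary This paper proposes Action-Inspired Generative Models (AGMs) to address the fact that existing bridge-matching methods assign the same regression weight to every stochastic transition. The method introduces a lightweight learned scalar potential $V_ϕ$ that scores bridge samples online and modulates the drift objective, selectively penalizing uninformative transport paths and improving generation quality. The addition is small, about 1.4% of the drift network's parameter count, adds no inference overhead, and can be plugged directly into any bridge-matching training loop.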

Comments 11 pages, 5 figures, and 4 tables

English abstract

We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_ϕ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_ϕ$'s guiding signal. Crucially, $V_ϕ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_ϕ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.

2605.14626 2026-05-15 cs.CV

UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

Ping Zhou, Haoyu Wang, Mengmeng Zheng, Lei Zhang, Wei Wei, Chen Ding, Fei Zhou

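Affiliations * School of Computer Science, Northwestern Polytechnical University; School of Computer Science & Technology, Xi’an University of Posts & Telecommunications; MMLab, The Chinese University of Hong Kong

AI summary RGB-T semantic segmentation requires strictly aligned visible-infrared-label triplets, but such data are often scarce in practice. This paper proposes UniTriGen, a unified triplet generation framework that, guided by text prompts, directly generates spatially aligned, semantically consistent, and modality-complementary visible-infrared-label triplets. The method enforces cross-modal consistency through joint encoding into a shared latent space modeled with a diffusion process, adds lightweight modality-specific adapters to accommodate the imaging characteristics of each modality, and uses a scene-balanced, class-aware few-shot sampling strategy to improve the diversity and quality of generated triplets, yielding performance gains across several RGB-T semantic segmentation models.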

English abstract

RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.

2605.14621 2026-05-15 cs.CV cs.AI cs.CL

Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen

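Affiliations * Tsinghua University; The University of Sydney; Stanford University; Baichuan AI

AI summary Large vision-language models (LVLMs) tend to hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this by comparing predictions from the original image with those from externally perturbed inputs, but relying on external references can introduce bias and adds computational cost. This paper proposes SIRA, a training-free internal contrastive decoding framework that builds a counterfactual reference inside the model by exploiting the staged information flow of multimodal transformers. It suppresses hallucinations while preserving descriptive coverage and applies to open-weight models.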

English abstract

Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
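
The token-level contrast described in the abstract above can be illustrated with a minimal sketch. It uses the generic contrastive-decoding rule, `(1 + alpha) * log p_full - alpha * log p_reference`, which is an assumption here: the abstract does not give SIRA's exact scoring formula, and `alpha`, the function names, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(full_logits, reference_logits, alpha=1.0):
    """Generic contrastive-decoding score (assumed form, not SIRA's exact rule).

    full_logits: next-token logits from the branch with full visual access.
    reference_logits: logits from the language-prior-dominated branch
    (in SIRA, the branch whose late-layer attention to image tokens is masked).
    """
    full = log_softmax(full_logits)
    ref = log_softmax(reference_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, ref)]


# Toy vocabulary of three tokens. Token 1 is slightly preferred by the full
# branch but is also strong without late visual access.
full_logits = [1.9, 2.0, 0.1]
reference_logits = [0.5, 2.5, 0.1]

greedy = max(range(3), key=full_logits.__getitem__)
scores = contrastive_scores(full_logits, reference_logits)
contrasted = max(range(3), key=scores.__getitem__)
```

In this toy case greedy decoding picks token 1, while the contrasted score picks token 0, whose advantage depends on the full visual pathway.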

2605.14619 2026-05-15 cs.AI

SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

Kang Chen, Junjie Nian, Yixin Cao, Yugang Jiang

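Affiliations * Fudan University; Shanghai Innovation Institute

AI summary This work proposes SliceGraph, a method for analyzing how different trajectories in multi-run chain-of-thought (CoT) reasoning share, split, and rejoin. By computing activation-key Jaccard similarity between CoT slices and building a mutual k-nearest-neighbor graph, SliceGraph exposes similarities and differences in the process structure of reasoning paths and identifies "process isomers": runs that reach the same answer through different reasoning processes. Experiments show that most problem-model cells contain multiple process families that are strategy-coherent within a family yet structurally distinct from one another, indicating that final-answer aggregation overlooks the multi-route structure of the reasoning process.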

English abstract

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard how sampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.
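
The graph construction named in the abstract above, mutual-kNN over Jaccard similarity, can be illustrated with a minimal sketch. It assumes each CoT slice is represented as a set of integer activation keys; how the paper extracts those keys and which `k` it uses are not stated in the abstract, so the data and helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mutual_knn_edges(key_sets, k):
    """Keep edge (i, j) only if i is among j's top-k and j is among i's top-k.

    key_sets: list of sets, one set of activation keys per slice.
    Ties are broken by lower index so the result is deterministic.
    """
    n = len(key_sets)
    neighbors = []
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (-jaccard(key_sets[i], key_sets[j]), j),
        )
        neighbors.append(set(ranked[:k]))
    return {
        (i, j)
        for i in range(n)
        for j in neighbors[i]
        if i < j and i in neighbors[j]
    }


# Five toy slices: three share keys 1-4, two share keys 10-12.
slices = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {10, 11}, {10, 11, 12}]
edges = mutual_knn_edges(slices, k=1)
```

With `k=1`, slice 1 selects slice 2 as its nearest neighbor, but slice 2 selects slice 0, so the edge between slices 1 and 2 is dropped. The mutual requirement is what keeps one-sided similarity from creating links.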

2605.14615 2026-05-15 cs.CV

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

Boying Li, Cheng Zhang, Weirong Chen, Daniel Cremers, Ian Reid, Hamid Rezatofighi

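Affiliations * Monash University; Technical University of Munich; Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper proposes CalibAnyView, a camera calibration method that delivers robust, geometrically consistent calibration from an arbitrary number of views, including a single view. The authors build a large-scale multi-view video dataset and design a multi-view transformer that predicts dense perspective fields, which feed a geometric optimization framework that jointly estimates camera intrinsics and gravity direction. The method outperforms existing approaches on real-world scenes and provides a reliable foundation for tasks such as in-the-wild 3D reconstruction and robotic perception.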

Comments 44 pages, 25 figures

English abstract

Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

2605.14609 2026-05-15 cs.CV cs.LG

Deep Image Segmentation via Discriminant Feature Learning

Adam Dawid Sztamborski, Raül Pérez-Gonzalo, Antonio Agudo

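Affiliations * Institut de Robòtica i Informàtica Industrial, CSIC-UPC; Politechnika Łódzka (Lodz University of Technology)

AI summary This paper addresses imprecise boundaries in image segmentation and proposes Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that maximizes between-class variance while minimizing within-class variance, making feature distributions more compact and separable. Experiments show that DDA improves segmentation accuracy, boundary sharpness, and model confidence across a range of architectures, offering a simple and effective way to build more robust segmentation models.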

Comments Accepted to ICIP 2026

English abstract

Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class one, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
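
The principle stated in the abstract above, maximizing between-class variance while minimizing within-class variance, can be illustrated with a minimal sketch. It uses the classical Fisher criterion in trace form as a stand-in; the abstract does not give DDA's exact formulation, so the `discriminant_ratio` helper and the toy features are assumptions for illustration only.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def discriminant_ratio(features, labels, eps=1e-8):
    """Between-class scatter divided by within-class scatter (trace form).

    This is the classical Fisher criterion, used here as an assumed
    stand-in for the paper's loss. Higher means more separable features.
    """
    overall = mean(features)
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    between = within = 0.0
    for members in by_class.values():
        center = mean(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(f, center) for f in members)
    return between / (within + eps)


# Four 2-D feature vectors under two different labelings.
features = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
separated = discriminant_ratio(features, [0, 0, 1, 1])  # classes far apart
mixed = discriminant_ratio(features, [0, 1, 0, 1])      # classes interleaved
```

A training loss built on this idea would minimize the negative ratio (or a differentiable variant of it) alongside the usual segmentation loss. Because it only touches training, it is consistent with the abstract's claim of no added inference cost.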

2605.14607 2026-05-15 cs.CV cs.CY

ViMU: Benchmarking Video Metaphorical Understanding

Qi Li, Xinchao Wang

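Affiliations * National University of Singapore

AI summary This paper presents ViMU, the first benchmark for evaluating metaphorical understanding of video, targeting the tendency of existing video understanding models to focus on literal content while overlooking metaphor, irony, and social meaning. Through open-ended and multiple-choice questions, ViMU requires models to infer the implicit meaning of a video from multimodal evidence. All questions are hint-free, so models must rely on their own understanding. The work opens a new evaluation direction for video understanding and pushes models toward deeper semantic comprehension.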

English abstract

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

2605.14606 2026-05-15 cs.CV

MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

Chunlei Shi, Cui Wu, Xiang Xu, Hao Li, Ni Fan, Xue Han, Yongchao Feng, Yufeng Zhu, Boyu Liu, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

Affiliations * School of Automation, Southeast University; Nanjing XinDa Institute of Meteorological Science and Technology; Beijing Leninainfo Technology Co., Ltd.; China CEC Engineering Corporation; School of Mathematical Sciences, Tongji University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; College of Meteorology and Oceanography, National University of Defense Technology; Key Laboratory of Transportation Meteorology of China Meteorological Administration, Nanjing Innovation Institute for Atmospheric Sciences

AI summary This paper proposes MambaRain, a multi-scale encoder-decoder framework for 0-3 hour precipitation nowcasting. It combines Mamba's linear-complexity long-range temporal modeling with self-attention's explicit capture of spatial correlations, addressing the performance degradation of existing methods at longer horizons. With its hybrid architecture and a spectral loss, MambaRain improves forecast accuracy while remaining computationally efficient, with the largest gains in the difficult 2-3 hour range.

Comments 9 pages, 7 figures

English abstract

Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba's linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba's sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.

2605.14604 2026-05-15 cs.AI cs.HC

Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks

Enkelejda Kasneci, Gjergji Kasneci

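Affiliations * Technical University of Munich, Munich, Germany; Munich Center for Machine Learning, Munich, Germany

AI summary This position paper argues that effective tutoring requires "corrective friction": surfacing students' misconceptions and challenging them supportively to drive conceptual change. Current preference-aligned large language models (LLMs), however, may sacrifice epistemic rigor for agreeableness. The authors identify a "Reasoning-Sycophancy Paradox": models that resist context-switch attacks may still back down under authority or social pressure. The paper introduces EduFrameTrap, a benchmark that evaluates LLM tutoring behavior across subjects and pressure conditions, finds that frontier models are more prone to epistemic retreat under authority and social pressure, and stresses the need for tutoring benchmarks that measure "social-epistemic courage".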

English abstract

This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.

2605.14601 2026-05-15 cs.CV

Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach

Kanglin Ning, Yiran Zhao, Wenrui Li, Shaoru Sun, Xingtao Wang, Xiaopeng Fan

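Affiliations * Harbin Institute of Technology; The Suzhou Research Institute of HIT; The PengChengLab

AI summary This paper proposes PanoGSDet, a monocular panoramic 3D object detection framework built on continuous semantic Gaussian representations, to address inaccurate mapping from 2D features to 3D space in panoramic images. A panoramic depth estimation component and a semantic Gaussian component lift the semantic and depth information of the panorama into 3D semantic Gaussians, which are then refined and passed to a prediction head that produces accurate 3D bounding boxes. Experiments show that the method significantly outperforms existing methods on the Structured3D dataset.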

Comments Accepted by ICME 2026

English abstract

Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.

2605.14600 2026-05-15 cs.CL

SciPaths: Forecasting Pathways to Scientific Discovery

Eric Chamoun, Yizhou Chi, Yulong Chen, Rui Cao, Zifeng Ding, Michalis Korakakis, Andreas Vlachos

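Affiliations * University of Cambridge; The Alan Turing Institute; Singapore University of Technology and Design

AI summary This paper presents SciPaths, a new benchmark for forecasting pathways to scientific discovery: predicting the enabling contributions required to realize a target scientific result and grounding them in existing literature. Using a dataset of expert-annotated gold pathways and silver pathways built from machine learning and natural language processing papers, the study evaluates frontier language models and finds that their performance under strict semantic matching is limited, with core methodological dependencies especially hard to recover. The work highlights an overlooked capability in scientific forecasting: reasoning backward from a target result to the scientific building blocks and prior-work dependencies it requires.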

English abstract

Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.

2605.14599 2026-05-15 cs.LG cs.AI stat.ML

Fast Rates for Inverse Reinforcement Learning

Andreas Schlaginhaufen, Maryam Kamgarpour

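Affiliations * EPFL

AI summary This paper studies entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) in finite-horizon Markov decision processes with linear reward classes and establishes new structural and statistical results. The authors show that maximum likelihood estimation and Min-Max-IRL are equivalent at the population level, and also at the empirical level under deterministic dynamics. Exploiting pseudo-self-concordance of the Min-Max-IRL loss, they show that the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the rate $\mathcal{O}(n^{-1})$; the results hold under model misspecification and require no exploration assumptions. They also extend reward-identifiability results to general Borel spaces and derive new results on the derivatives of the soft-optimal value function with respect to the reward parameters.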

English abstract

We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.

2605.14597 2026-05-15 cs.CV cs.CE cs.MM

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

Chunlei Shi, Hao Li, Yufeng Zhu, Boyu Liu, Yongchao Feng, Zengliang Zang, Hongbin Wang, Yanlan Yang, Dan Niu

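Affiliations * Department of Automation, Southeast University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Key Laboratory of Transportation Meteorology, China Meteorological Administration

AI summary Precipitation nowcasting is an important spatio-temporal prediction task in meteorology, but the chaotic nature of precipitation systems makes it difficult. Existing methods mostly extrapolate from single-source radar data with deterministic or probabilistic models, which suffer from blurring or computational inefficiency. This paper proposes a coarse-to-fine multi-source data fusion framework based on a Vision Mamba Unet and a residual diffusion model (VMU-Diff). Nowcasting proceeds in two stages: the first fuses radar and multi-band satellite data to predict global motion trends, and the second uses a conditional diffusion model to generate fine prediction details. Experiments show that the method outperforms state-of-the-art methods in short-term forecasts.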

Comments 5 pages, 2 figures

详情
英文摘要

Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic nature of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, a purely deterministic model suffers from blurring due to MSE convergence. A purely probabilistic model, typically a diffusion model, can generate fine details but suffers from spurious artifacts that compromise accuracy and from computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine precipitation nowcasting framework based on a Vision Mamba Unet and residual Diffusion (VMU-Diff). It performs precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, both radar and multi-band satellite data are taken as input, rather than single-source radar data alone. A spatial-temporal attention block and several Vision Mamba state-space blocks perform multi-source data fusion and predict the global dynamics of future echoes. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features from the coarse prediction and the ground truth, and then reconstructs the residual via a conditional Mamba state-space module. Experiments on the Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.

2605.14594 2026-05-15 cs.CV cs.GR

TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

Bojun Xiong, Zoubin Bi, Xinghui Peng, Yunmu Wang, Junchen Deng, Jun Liang, Jing Li, Bowen Cai, Huan Fu

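Affiliations * HUJING Digital Media & Entertainment Group

AI Summary This paper presents TOPOS, a framework for single-image-conditioned generation of high-fidelity 3D head models, designed to meet the film, animation, and game industries' need for a unified topology. TOPOS introduces a new variational autoencoder (TOPOS-VAE) and a rectified flow transformer (TOPOS-DiT) to jointly generate geometry and appearance under a fixed industry-standard topology, giving vertex-level correspondence across generated heads. In addition, the TOPOS-Texture module generates relightable UV texture maps from the same portrait image while preserving high-frequency detail. Experiments show that TOPOS achieves state-of-the-art performance on 3D head generation.

Comments Technical Report

Details
English abstract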

High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single-image-conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models, which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we propose a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multimodal large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input point clouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE's structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.

2605.14590 2026-05-15 cs.CV

FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

Fengyi Zhang, Junya Zhang, Wenzhuo Sun

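Affiliations * School of Electronic Science and Technology, Hainan University, Haikou, China, 570228; School of Computer Science and Technology, Xidian University, Xi'an, China, 710126; Xiangjiang College of Elite Engineers, Hunan University, Changsha, China, 410082

AI Summary In computational pathology, robust whole-slide image analysis remains challenging because of substantial stain heterogeneity across institutions. Existing federated domain generalization methods mostly rely on low-order statistics and struggle to capture the non-Gaussian characteristics of real staining processes. This paper proposes FedStain, a federated domain generalization framework that uses higher-order statistics such as skewness and kurtosis as compact stain descriptors, modeling stain variation while preserving privacy and communication efficiency. Experiments show that it significantly outperforms existing methods on multiple benchmark datasets.

Details
English abstract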

Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments--skewness and kurtosis--as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.
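To make the idea of higher-order stain moments concrete, here is a minimal sketch of computing per-channel mean, standard deviation, skewness and excess kurtosis for one image tile. It only illustrates the statistics named in the abstract; the function name `stain_moments`, the tile shape and the synthetic gamma-distributed data are assumptions for illustration, and nothing here reproduces FedStain's actual descriptors or aggregation.

```python
import numpy as np

def stain_moments(image):
    """Per-channel mean, std, skewness and excess kurtosis of an (H, W, C) array."""
    arr = np.asarray(image, dtype=np.float64)
    pixels = arr.reshape(-1, arr.shape[-1])
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    # Guard against constant channels so the standardization never divides by zero.
    safe_std = np.where(std > 0, std, 1.0)
    z = (pixels - mean) / safe_std
    skewness = (z ** 3).mean(axis=0)
    kurtosis = (z ** 4).mean(axis=0) - 3.0  # excess kurtosis: 0 for a Gaussian
    return np.stack([mean, std, skewness, kurtosis])  # shape (4, C)

rng = np.random.default_rng(0)
# Hypothetical tile with a right-skewed, heavy-tailed intensity distribution.
tile = rng.gamma(shape=2.0, scale=1.0, size=(64, 64, 3))
descriptor = stain_moments(tile)
print(descriptor.shape)  # (4, 3)
```

A descriptor like this holds a handful of numbers per channel, which is why exchanging it between sites needs no pixel-level data.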

2605.14587 2026-05-15 cs.LG cs.AI cs.CR

Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning

Oubo Ma, Ruixiao Lin, Yang Dai, Jiahao Chen, Chunyi Zhou, Linkang Du, Shouling Ji

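Affiliations * Zhejiang University; National University of Defense Technology; Xi'an Jiaotong University

AI Summary This paper studies how plasticity interventions affect backdoor attacks in deep reinforcement learning (DRL). It finds that most interventions mitigate backdoor threats, while only the SAM intervention exacerbates them. Pathological analysis reveals mechanisms such as backdoor gradient amplification and activation pathway disruption. The paper also proposes the SCC conceptual framework and abnormal loss landscape sharpness as a new indicator for backdoor detection, providing theoretical support for improving the security of DRL systems.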

Comments To appear in the Forty-Third International Conference on Machine Learning (ICML 2026), July 6-11, 2026, Seoul, South Korea

Details
English abstract

Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.

2605.14581 2026-05-15 cs.CV cs.AI cs.IR

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

Ho Hung Lim, Yi Yang

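Affiliations * The Hong Kong University of Science and Technology

AI Summary This study examines the information loss that can occur in visual financial document retrieval when a document image is aggregated into a single vector. Using a diagnostic benchmark built from financial documents, the experiments find that single-vector aggregation gives different documents almost identical vectors, obscuring key semantic details. The study identifies global texture dominance as the root cause and shows that the phenomenon persists across model scales and optimization strategies, highlighting the risks of single-vector approaches in financial applications.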

Comments Accepted to Findings of ACL 2026

Details
English abstract

Visual RAG offers an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents into almost identical vectors. Metrics show that patch-level representations detect the semantic changes and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
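The collapse described above can be reproduced on synthetic data. The sketch below assumes mean pooling as the aggregation and models each patch embedding as a shared "page texture" vector plus small per-patch detail; the dimensions, noise scale and the changed patch index are all hypothetical, and no real vision encoder is involved.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
num_patches, dim = 256, 64

# Synthetic patch embeddings: one dominant shared component plus small detail.
texture = rng.normal(size=dim)
doc_a = texture + 0.1 * rng.normal(size=(num_patches, dim))

# Second document: identical except for one patch (e.g. a single changed digit).
doc_b = doc_a.copy()
doc_b[17] = rng.normal(size=dim)

# Single-vector view: mean-pool all patches, then compare.
pooled_sim = cosine(doc_a.mean(axis=0), doc_b.mean(axis=0))

# Patch-level view: compare corresponding patches one by one.
patch_sims = np.array([cosine(a, b) for a, b in zip(doc_a, doc_b)])

print("pooled similarity:", pooled_sim)
print("lowest patch similarity:", patch_sims.min(), "at patch", patch_sims.argmin())
```

In this toy setting the pooled vectors stay nearly identical because one patch contributes only 1/256 of the mean, while the patch-level comparison singles out the changed patch.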

2605.14579 2026-05-15 cs.CV

Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

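Affiliations * School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing

AI Summary Medical image segmentation is a foundation of precision medicine, but existing methods still struggle to segment stably and precisely when tissue appearance varies widely, boundaries are ambiguous, and anatomy is highly variable. This paper proposes Med-DisSeg, a framework that introduces a lightweight Dispersive Loss and adaptive attention to improve representation learning and anatomical boundary delineation for fine-grained structure segmentation. The method enlarges the margins between sample embeddings, strengthens the encoder's sensitivity to structural features, and uses a multi-scale decoder to preserve local texture and global shape information. Experiments show leading segmentation performance on multiple medical imaging datasets.

Details
English abstract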

Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine-grained multi-scale decoding. To address these issues, we propose Med-DisSeg, a dispersion-driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med-DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine-grained structure segmentation. The Dispersive Loss enlarges inter-sample margins by treating in-batch hidden representations as negative pairs, producing well-dispersed and boundary-aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure-sensitive responses, while the decoder performs adaptive multi-scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state-of-the-art performance. Moreover, Med-DisSeg achieves competitive results on multi-organ CT segmentation, supporting its robustness and cross-task applicability.
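The abstract does not give the formula for the Dispersive Loss, so the following is only one plausible form, assumed for illustration: a contrastive-style objective with no positive pairs, where every pair of in-batch representations acts as a negative and the loss is the log of the mean of exp(-squared distance / tau). The function name, the temperature `tau` and the inclusion of self-pairs are all assumptions.

```python
import numpy as np

def dispersive_loss(z, tau=0.5):
    """Log-mean-exp of negative pairwise squared distances over a batch.

    z: array of shape (batch, dim). Lower values mean more dispersed embeddings.
    """
    z = np.asarray(z, dtype=np.float64)
    diff = z[:, None, :] - z[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)       # (batch, batch) squared distances
    logits = -sq_dist / tau
    m = logits.max()                         # subtract the max for numerical stability
    return float(m + np.log(np.exp(logits - m).mean()))

collapsed = np.zeros((4, 3))                 # every embedding identical
spread = np.eye(4)                           # embeddings pushed apart

print(dispersive_loss(collapsed))            # 0.0, the maximum of this loss
print(dispersive_loss(spread))               # negative: dispersion is rewarded
```

Minimizing a term like this pushes in-batch embeddings apart, which is the stated purpose of counteracting representation collapse; it needs no extra labels or forward passes, consistent with the "negligible overhead" claim.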

2605.14578 2026-05-15 cs.LG

Woodelf++: A Fast and Unified Partial Dependence Plot Algorithm for Decision Tree Ensembles

Ron Wettenstein, Alexander Nadel, Udi Boker

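Affiliations * Reichman University; Faculty of Data and Decision Sciences

AI Summary This paper proposes Woodelf++, an efficient, unified algorithm for computing several explainability tools for decision tree ensembles, including Partial Dependence Plots (PDPs), Joint-PDPs, and partial dependence interaction values of any order (Any-Order-PDIVs). The method derives metrics over pseudo-Boolean functions, yielding a unified computational framework for these tools, and substantially improves computational complexity over existing methods, with an exponential speedup for Any-Order-PDIVs. Woodelf++ is implemented in Python and supports GPU acceleration, and experiments show that it is far faster than current mainstream tools.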

Comments Extended version of the paper to appear at IJCAI 2026

Details
English abstract

Partial Dependence Plots (PDPs) visualize how changes in a single feature affect the average model prediction. They are widely used in practice to interpret decision tree ensembles and other machine learning models. Joint-PDPs extend this idea to pairs of features, revealing their combined effect. Partial Dependence Interaction Values (PDIVs) measure feature interactions. The Any-Order-PDIVs task computes these interactions for every feature subset across all rows of the dataset. We introduce Woodelf++, a unified and efficient approach for computing all these useful explainability tools on decision tree ensembles, building on Woodelf, an algorithm for efficient SHAP computation. By deriving suitable metrics over pseudo-Boolean functions, Woodelf++ can compute PDPs (exact and approximate), Joint-PDPs, and Any-Order-PDIVs in a unified framework. Our method delivers substantial complexity improvements over the state of the art, including an exponential gain for Any-Order-PDIVs. Additionally, we introduce and efficiently compute Full PDPs, which leverage the model's split thresholds to faithfully capture its behavior across all possible feature values. Woodelf++ is implemented in pure Python and supports GPU acceleration. On a dataset with 400,000 rows, Woodelf++ computes PDP and Joint-PDP up to 6x faster than the state of the art and up to five orders of magnitude faster than scikit-learn. For Any-Order-PDIVs, the gap is even larger: Woodelf++ computes all interaction values in 5 minutes, while the state of the art is estimated to require over 1,000,000 years.
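For readers unfamiliar with PDPs, the quantity being computed can be written as a brute-force loop: fix one feature to a grid value for every row, predict, and average. This is the naive baseline definition, not the Woodelf++ algorithm; the `toy_model` function and the tiny dataset are hypothetical stand-ins for a tree ensemble and its background data.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Brute-force PDP: average prediction with `feature` forced to each grid value."""
    X = np.asarray(X, dtype=np.float64)
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # overwrite the feature for every row
        values.append(predict(X_mod).mean())
    return np.array(values)

def toy_model(X):
    # Stand-in for a tree: one split on feature 0, plus an additive effect of feature 1.
    return np.where(X[:, 0] > 0.5, 2.0, 0.0) + X[:, 1]

X = np.array([[0.1, 1.0],
              [0.9, 3.0],
              [0.4, 2.0]])

pdp = partial_dependence(toy_model, X, feature=0, grid=[0.0, 1.0])
print(pdp)  # [2. 4.]
```

The loop costs one full pass over the dataset per grid value, which is the cost that specialized tree algorithms aim to avoid.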

2605.14571 2026-05-15 cs.RO cs.LG

Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance

Tianfang Zhu, Ning An, Rui Wang, Jiasi Gao, Qingming Luo, Anan Li, Guyue Zhou

Affiliations * Institute for AI Industry Research, Tsinghua University; Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University; School of New Media Art and Design, Beihang University; MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology

AI Summary This study aims to give robots a "mirror touch" ability: predicting and simulating the corresponding tactile signals by observing touch on others. It proposes Mirror Touch Net, a model that aligns visual and tactile representations semantically, distributionally, and geometrically through multi-level constraints, and thereby predicts millimetre-scale tactile signals on a robotic hand from RGB images. The method improves the accuracy of cross-modal perception and offers an explainable, neurally grounded basis for empathic tactile interaction in robots.

Details
English abstract

Observing touch on another's body can elicit corresponding tactile sensations in the observer, a phenomenon termed mirror touch that supports empathy and social perception. This visuo-tactile resonance is thought to rely on structural correspondence between visual and somatosensory cortices, yet robotic systems lack computational frameworks that instantiate this principle. Here we demonstrate that cortical correspondence can be operationalized to endow robots with mirror touch. We introduce Mirror Touch Net, which imposes semantic, distributional and geometric alignment between visual and tactile representations through multi-level constraints, enabling prediction of millimetre-scale tactile signals across 1,140 taxels on a robotic hand from RGB images. Manifold analysis reveals that these constraints reshape visual representations into geometry consistent with the tactile manifold, reducing the complexity of cross-modal mapping. Extending this alignment framework to cross-domain observations of human hands enables tactile prediction and reflexive responses to observed human touch. Our results link a neural principle of visuo-tactile resonance to robotic perception, providing an explainable route towards anticipatory touch and empathic human-robot interaction. Code is available at https://github.com/fun0515/Mirror-Touch-Net.

2605.14570 2026-05-15 cs.CL

Uncertainty Quantification for Large Language Diffusion Models

Artem Vazhentsev, Vladislav Smirnov, David Li, Maxim Panov, Timothy Baldwin, Artem Shelmanov

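Affiliations * MBZUAI; The University of Melbourne

AI Summary This paper studies uncertainty quantification (UQ) for large language diffusion models (LLDMs), with the aim of making their inference more reliable. Because existing methods are incompatible with the parallel nature of LLDMs, the authors propose lightweight, zero-shot uncertainty signals based on intermediate generations during denoising, token remasking dynamics, and denoising complexity. Experiments show that the method detects hallucinations in generated content while keeping inference efficient, achieving a good balance between computational overhead and performance.

Details
English abstract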

Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.

2605.14569 2026-05-15 cs.CV

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

Yujie Wei, Chenglong Ma, Jianxiong Gao, Chenhui Wang, Shiwei Zhang, Biao Gong, Shuai Tan, Hangjie Yuan, Hongming Shan

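Affiliations * Fudan University; Alibaba Group; Ant Group

AI Summary This paper proposes CineNeuron, a hierarchical framework that addresses the semantic gap in reconstructing dynamic videos from functional magnetic resonance imaging (fMRI) signals. Inspired by the dual-pathway processing mechanism of the human brain, the method uses a bottom-up semantic enrichment stage to map fMRI signals into a rich semantic space and a top-down memory integration stage to dynamically fuse relevant memories from previously seen data, improving video reconstruction quality. Experiments show that CineNeuron outperforms state-of-the-art methods on two fMRI-to-video benchmarks.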

Comments Accepted to CVPR 2026

Details
English abstract

Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in the human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

2605.14566 2026-05-15 cs.CV

SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

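Affiliations * School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing

AI Summary Medical image segmentation remains challenging when data are scarce: with too few annotations, conventional methods often generalize poorly and produce ambiguous boundaries. This paper proposes SpectraFlow, a framework that combines structure-aware pretraining with boundary-oriented decoding to improve segmentation accuracy. The method has two stages: the first learns structure-related representations through Mixed-Domain MeanFlow Pretraining; the second introduces a lightweight decoder that combines attentional fusion with frequency-directional convolution to improve boundary detail and robustness. Experiments show that the method outperforms existing methods on multiple medical datasets, especially in low-data settings.

Details
English abstract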

Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.

2605.14561 2026-05-15 cs.AI

Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations

Devika Prasad, Luke Gerschwitz, Tong Li, Henry Xiao, Anjin Liu, Coco Wu, Anna Leontjeva, Luiz Pizzato

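Affiliations * Commonwealth Bank of Australia

AI Summary This paper proposes Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework intended to improve controllability and efficiency when interacting with large language models. The method decomposes a prompt into interpretable segments and adds a human-readable annotation to each, guiding the model to allocate attention sensibly and reducing confusion during response generation. Experiments show that optimised segment-level annotations improve reasoning accuracy and consistency, and the original prompt is retained as an optimisation candidate to avoid performance degradation. The work demonstrates the feasibility and potential of segment-level annotation optimisation, but efficiently finding the optimal segmentation and annotations is left to future research.

Details
English abstract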

Prompt engineering is crucial for effective interaction with generative artificial intelligence systems, yet existing optimisation methods often operate over an unstructured and vast prompt space, leading to high computational costs and potential distortions of the original intent. We introduce Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework designed to improve prompt optimisation controllability and efficiency. PSAO decomposes a prompt into interpretable segments (e.g., sentences) and augments each with human-readable annotations (e.g., {not important}, {important}, {very important}). These annotations guide large language models (LLMs) in allocating focus and clarifying confusion during response generation. We formally define the segmentations and annotations and demonstrate that optimised segment-level annotations can lead to improved LLM responses, with the original prompt retained as a candidate in the optimisation space to prevent performance degradation. Empirical evaluations indicate that PSAO benefits from annotations in terms of improved reasoning accuracy and self-consistency. However, developing efficient methods for identifying optimal segmentations and annotations remains challenging and is reserved for future investigation. This work is intended as a proof of concept, demonstrating the feasibility and potential of segment-level annotation optimisation.
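The segment-and-annotate search space can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's procedure: sentences are split with a simple regular expression, each annotation is appended after its segment, and candidates are enumerated exhaustively. The helper names and the placement of the labels are hypothetical; only the three label strings come from the abstract.

```python
import re
from itertools import product

LABELS = ("{not important}", "{important}", "{very important}")

def segment(prompt):
    """Split a prompt into sentence-level segments (naive punctuation-based split)."""
    parts = re.split(r"(?<=[.!?])\s+", prompt.strip())
    return [p for p in parts if p]

def annotate(segments, labels):
    """Append each label after its segment; a label of None leaves the segment as-is."""
    return " ".join(s if l is None else f"{s} {l}" for s, l in zip(segments, labels))

def candidates(prompt):
    """Yield every annotated variant; the first one is the unannotated prompt."""
    segs = segment(prompt)
    for combo in product((None,) + LABELS, repeat=len(segs)):
        yield annotate(segs, combo)

prompt = "Read the table. Report the total."
variants = list(candidates(prompt))
print(len(variants))   # 16
print(variants[0])     # Read the table. Report the total.
print(variants[-1])    # Read the table. {very important} Report the total. {very important}
```

With four choices per segment the space grows as 4^k for k segments, which illustrates why the abstract flags efficient search over segmentations and annotations as an open problem.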

2605.14558 2026-05-15 cs.LG cs.AI cs.CL

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

Langzhou He, Junyou Zhu, Yue Zhou, Zhengyao Gu, Junhua Liu, Wei-Chieh Huang, Henry Peng Zou, David Wipf, Philip S. Yu, Qitian Wu

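Affiliations * University of Illinois Chicago; Potsdam Institute for Climate Impact Research; Technical University of Berlin; University of Southern California; University of Hong Kong; Broad Institute of MIT and Harvard

AI Summary This paper studies the uneven allocation of training signal across trajectories in agentic reinforcement learning, noting that existing methods treat every token in a trajectory equally and so allocate the training signal poorly. From an energy-based modeling perspective, the authors find that the effective training signal concentrates on action tokens rather than reasoning tokens, a phenomenon they call the "Action Bottleneck". They propose ActFocus, a simple and effective token reweighting method that downweights gradients on reasoning tokens and increases the weights on action tokens with higher uncertainty, which significantly improves model performance.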

Comments Preprint

Details
English abstract

Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
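The contrast between uniform credit assignment and token reweighting can be shown with a small policy-gradient surrogate loss. The sketch below covers only the first ingredient, downweighting reasoning tokens with a fixed factor; the energy-based redistribution among action tokens is not modeled. The function name, the `reasoning_weight` parameter and its value are assumptions, not details from the paper.

```python
import numpy as np

def reweighted_pg_loss(logprobs, advantages, is_action, reasoning_weight=0.1):
    """Weighted policy-gradient surrogate: -sum(w * A * logp) / sum(w).

    is_action marks environment-facing action tokens; all other tokens are
    treated as reasoning tokens and scaled by `reasoning_weight`.
    """
    logprobs = np.asarray(logprobs, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    weights = np.where(np.asarray(is_action, dtype=bool), 1.0, reasoning_weight)
    per_token = -weights * advantages * logprobs
    return float(per_token.sum() / weights.sum())

# Hypothetical trajectory: three reasoning tokens followed by one action token.
logprobs = [-2.0, -2.0, -2.0, -0.5]
advantages = [1.0, 1.0, 1.0, 1.0]
is_action = [False, False, False, True]

uniform = reweighted_pg_loss(logprobs, advantages, is_action, reasoning_weight=1.0)
focused = reweighted_pg_loss(logprobs, advantages, is_action, reasoning_weight=0.1)
print(uniform, focused)
```

Setting `reasoning_weight=1.0` recovers the uniform per-token average, so the two calls differ only in how much of the loss, and hence the gradient, comes from the single action token.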

2605.14556 2026-05-15 cs.AI

TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality

Zidong Liu, Rongkai Liu, Yue Li, Zhenliang Zhang

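Affiliations * State Key Laboratory of General Artificial Intelligence; BIGAI

AI Summary This paper presents TeachAnything, a multimodal crowdsourcing platform for training embodied agents in Symmetrical Reality. Through a three-stage demonstration paradigm that integrates multimodal demonstration signals, the platform supports collecting diverse demonstration data across scenes, tasks, and embodiments. By unifying virtual and physical interaction, the system provides a practical foundation for building embodied agents that meet the needs of Symmetrical Reality.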

Comments 5 pages, 3 figures. Accepted as an IEEE VR 2026 Poster

Details
English abstract

Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for richer and more diverse human guidance. We introduce a three-stage demonstration paradigm integrating multimodal demonstration signals. Building on this paradigm, we developed TeachAnything, a cloud-based, crowdsourcing-oriented demonstration platform with physics simulation capable of collecting diverse demonstration data across varied scenes, tasks, and embodiments. By unifying virtual and physical interactions through both methodological design and physics simulation, the system serves as a practical foundation for developing embodied agents aligned with Symmetrical Reality.