arXivDaily: daily arXiv digest, updated Monday through Friday
2605.09906 2026-05-12 cs.AI cs.SD

Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang

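Affiliations: Tianjin Key Laboratory of Cognitive Computing; Tianjin University; Huiyan Technology Company, Ltd.; Chinese Academy of Sciences; Tencent

AI summary: This work targets cross-modal interference in the reasoning of audio-visual large language models and proposes a reasoning framework called Separate First, Fuse Later (SFFL). The method enforces modality-specific reasoning, generates separate audio and visual reasoning traces, and integrates the information at a later stage to answer, which reduces interference between modalities. Experiments show that it significantly improves accuracy and robustness on multiple benchmarks.

Abstract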

Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.

2605.09905 2026-05-12 cs.LG cs.AI

Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging

Guisong Liu, Xin Gao, Martin Dresler, Jiansong Zhang, Pengfei Wei

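Affiliations: School of Biological Science and Medical Engineering; Southeast University; University of Bath; Donders Institute for Brain, Cognition and Behaviour; Radboud University Medical Center; School of Computer Science & Software Engineering

AI summary: This paper re-examines the role of randomly initialized Transformers in sleep staging and points to an overlooked property of sleep signals: strong local temporal continuity. It finds that an untrained random Transformer substantially improves sleep staging performance and outperforms conventional smoothing. Through the Random Attention Prior Kernel (RAPK), the paper shows that random self-attention adaptively balances global averaging against content similarity while preserving stage transitions, which indicates that the gains stem mainly from the architecture's inductive bias rather than parameter learning. The finding suggests a route to efficient sleep monitoring systems suited to edge devices.

Abstract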

Automatic sleep staging commonly adopts Transformers under the assumption that they learn complex long-range dependencies. We challenge this view by revealing a neglected property of sleep sequences: strong local temporal continuity. We show that a randomly initialized Transformer, without any training, substantially improves sleep staging performance and consistently outperforms heuristic smoothing. We formalize this effect via a Random Attention Prior Kernel (RAPK), showing that random self-attention acts as an adaptive smoother by balancing global averaging and content-based similarity while preserving stage transitions. Using two metrics, the Local Smoothness Influence Index (LSII) and the Weighted Transition Entropy (WTE), we provide evidence that most performance gains in Transformer-based sleep staging arise from architectural inductive bias rather than parameter learning. Our results suggest that sleep staging can be effectively addressed with structure-driven smoothing mechanisms rather than complex dependency modeling, enabling more efficient and edge-deployable healthcare systems for large-scale physiological monitoring.
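The idea of an untrained attention layer acting as a smoother can be illustrated with a toy, standard-library sketch. This is not the paper's RAPK formalization or its model: the single head, the Gaussian initialization scale, and the function name `random_attention_smooth` are all assumptions made for illustration. Each output vector is a convex combination of the input vectors, so a constant sequence passes through unchanged, while small random projections keep the weights close to a global average that is tilted by content similarity.

```python
import math
import random


def random_attention_smooth(sequence, seed=0, scale=0.1):
    """Smooth a sequence of feature vectors with one untrained attention head.

    `sequence` is a list of equal-length lists of floats (one per epoch).
    The query/key projections are random and never trained (assumption:
    Gaussian init with standard deviation `scale`).
    """
    rng = random.Random(seed)
    d = len(sequence[0])
    w_q = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(d)]
    w_k = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(d)]

    def project(x, w):
        # Row vector x times matrix w.
        return [sum(x[i] * w[i][j] for i in range(d)) for j in range(d)]

    queries = [project(x, w_q) for x in sequence]
    keys = [project(x, w_k) for x in sequence]

    smoothed = []
    for q in queries:
        # Scaled dot-product scores followed by a numerically stable softmax.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The values are the raw inputs, so each output is a weighted average.
        smoothed.append(
            [sum(w * x[j] for w, x in zip(weights, sequence)) for j in range(d)]
        )
    return smoothed
```

With a small `scale` the scores stay near zero and the layer behaves almost like a global mean; larger values let content similarity dominate, which is the trade-off the abstract describes in words.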

2605.09902 2026-05-12 cs.CV

Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang

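Affiliations: Sun Yat-sen University; Nanyang Technological University; Shenzhen MSU-BIT University

AI summary: Addressing the security of multimodal large language models (MLLMs), this work proposes PRAF-Attack, a targeted transfer-based attack that uses adversarial examples to mislead a model's judgment of image content. The method introduces progressive resolution processing and adaptive feature alignment, uses intermediate-layer features to strengthen transferability and robustness, and selects transferable hierarchical features via gradient consistency. Experiments show that PRAF-Attack transfers better than existing methods across a range of black-box MLLMs.

Abstract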

Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.

2605.09900 2026-05-12 cs.AI cs.CL cs.CV

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu, Jicheng Liu

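Affiliations: Department of Psychology; New York University; Department of Computer Science; University of Southern California

AI summary: This paper introduces KnotBench, a benchmark for evaluating vision-language models on knot-diagram tasks. Using a large corpus of knot images paired with canonical signatures, it defines 14 tasks covering equivalence judgment, move prediction, identification, and cross-modal grounding, and exposes a gap between perception and operation in current models. Experiments show that even frontier models such as Claude Opus 4.7 and GPT-5 perform near chance without thinking mode, and that thinking mode helps but still leaves the models unable to simulate knot moves accurately.

Comments: 41 pages, 18 figures

Abstract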

A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.

2605.09899 2026-05-12 cs.CV cs.AI

Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

Kanglin Ning, Wenrui Li, Houde Quan, Qifan Li, Xingtao Wang, Xiaopeng Fan

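Affiliations: Faculty of Computing, Harbin Institute of Technology; Suzhou Research Institute of HIT; PengChengLab

AI summary: This paper proposes HGC-Det, a cross-modal knowledge distillation method constrained by hyperbolic geometry, to improve multimodal 3D object detection. An image branch and a point cloud branch extract semantic features, and three core components (semantic-guided voxel optimization, hyperbolic geometry constrained cross-modal feature transfer, and feature aggregation-based geometry optimization) mitigate modality heterogeneity, spatial misalignment, and the representation crisis. Experiments show a good trade-off between detection accuracy and computational cost on indoor and outdoor datasets.

Comments: The current version has been submitted to IEEE Transactions on Multimedia and is under review.

Abstract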

Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficiency of these cross-modal distillation methods. To address these limitations, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO component compensates for potential spatial feature degradation introduced by the SGVO component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.

2605.09893 2026-05-12 cs.CL cs.AI

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Sushrita Rakshit, Hanwen Zhang, Hua Shen

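Affiliations: New York University

AI summary: This study examines the value-action gap in large language models, where a model's stated values are inconsistent with its actual behavior. It identifies a new failure mode, Pseudo-Deliberation, in which the model shows seemingly sound reasoning while its behavior remains unaligned with its values. The authors build the VALDI framework to systematically evaluate how well models adhere to values in dialogue generation, and find marked inconsistency between values and behavior in both proprietary and open-source models. They also propose VIVALDI, a multi-agent auditing system that intervenes during generation to improve alignment.

Comments: 9 pages

Abstract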

Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed the "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo-Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.

2605.09887 2026-05-12 cs.LG cs.AI math.DG

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta

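Affiliations: ARC Training Centre for Information Resilience (CIRES); School of Mathematics and Physics, University of Queensland; Institute for Molecular Bioscience, University of Queensland; Profenso; QIMR Berghofer Medical Research Institute

AI summary: This study investigates the geometric reasons why sparse autoencoder (SAE) reconstruction error varies across network layers, arguing that differences in the curvature and intrinsic dimension of activation space produce behavior that existing single-layer scaling laws cannot explain. By analyzing geometric features of layers in multiple models, it finds that SAE width-sparsity scaling depends on each layer's manifold structure, and it proposes a geometric scaling law that transfers across models. Experiments show that manifold geometry determines the per-layer width exponent, and that higher curvature and intrinsic dimension correspond to a higher reconstruction-error floor: SAEs face a "geometric wall" set by manifold structure rather than a resource-limited ceiling.

Abstract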

Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.

2605.09886 2026-05-12 cs.RO

Network-Efficient World Model Token Streaming

Shatadal Mishra, Ahmadreza Moradipari, Nejib Ammar

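Affiliations: InfoTech Labs, Toyota Motor North America R&D, Mountain View, CA, USA

AI summary: This study examines how to efficiently transmit and synchronize discrete world model state representations across distributed compute and connected vehicles. It proposes a network-efficient streaming method built on a VQ-U-Net tokenizer, together with a label-free, fully online algorithm that prioritizes changed state via cosine distance and adaptively triggers keyframes to cope with bandwidth limits and packet loss. Experiments show that, at matched bitrates, the method reduces state-embedding distortion and improves downstream prediction, supporting its practical value in vehicular networks.

Comments: Accepted at IEEE VNC 2026

Abstract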

Generative driving world models rely on compact latent state representations that must be efficiently transmitted and synchronized across distributed compute and connected vehicles. We study network-efficient streaming of a discrete world model state, where a stride-16 VQ-U-Net tokenizer (codebook size 8,192) maps each 288x512 frame to an 18x32 grid of token IDs (576 tokens/frame), equivalent to 936 bytes/frame under fixed-length coding. We consider a keyframe-delta protocol under strict per-message payload budgets and packet loss, and propose a fully online, label-free algorithm that prioritizes delta updates via cosine distance in codebook embedding space and triggers keyframes adaptively using a Hamming-drift threshold. The adaptive algorithm consistently improves the rate-distortion frontier over periodic keyframes at matched bitrates: at 0.024 Mb/s (200-byte budget) dynamic-only embedding distortion drops from 0.0712 to 0.0661 (7.2%), and at 0.036 Mb/s (400-byte budget) from 0.0427 to 0.0407 (4.8%). Under 10% delta packet loss at 200 bytes, dynamic-only distortion is 0.0757 versus 0.0789 for a matched periodic baseline. To connect state fidelity to world model usefulness, we train a lightweight next-token predictor and evaluate perplexity conditioned on streamed receiver states: at 0.024 Mb/s, dynamic-position perplexity improves from 206.0 to 193.1 (6.3%), and at 0.036 Mb/s from 158.9 to 155.6 (2.1%). These results support discrete token-state streaming as a practical systems layer for bandwidth-aware synchronization and improved downstream token-dynamics utility under vehicular networking constraints.
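The frame-size figures in the abstract can be checked with a few lines of arithmetic, and the keyframe trigger can be sketched alongside it. The byte calculation below follows directly from the stated stride, resolution, and codebook size. The drift definition (fraction of grid positions whose token ID differs from the last keyframe) and the 0.5 threshold are assumptions for illustration; the abstract does not give the paper's exact rule, and the function names are hypothetical.

```python
def fixed_length_bytes(num_tokens, codebook_size):
    """Bytes needed to send every token ID with a fixed-length binary code."""
    bits_per_token = (codebook_size - 1).bit_length()  # 8192 entries -> 13 bits
    total_bits = num_tokens * bits_per_token
    return (total_bits + 7) // 8  # round up to whole bytes


def hamming_drift(reference, current):
    """Fraction of positions whose token ID differs from the reference grid."""
    changed = sum(1 for a, b in zip(reference, current) if a != b)
    return changed / len(reference)


def choose_message(reference, current, drift_threshold=0.5):
    """Return 'keyframe' when drift exceeds the (assumed) threshold, else 'delta'."""
    if hamming_drift(reference, current) > drift_threshold:
        return "keyframe"
    return "delta"


# A 288x512 frame at stride 16 gives an 18x32 grid of token IDs.
tokens_per_frame = (288 // 16) * (512 // 16)
frame_bytes = fixed_length_bytes(tokens_per_frame, 8192)
print(tokens_per_frame, frame_bytes)  # prints: 576 936
```

The 936-byte figure is 576 tokens at 13 bits each, which is 7,488 bits. A 200-byte budget therefore cannot carry a full keyframe in one message, which is why the protocol in the abstract has to prioritize which deltas to send.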

2605.09879 2026-05-12 cs.AI

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Junjian Wang, Xin Zhou, Qiran Xu, Kun Zhan

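Affiliations: Institute of Automation, Chinese Academy of Sciences; Li Auto Inc.

AI summary: This study proposes M2A, which aims to combine mathematical reasoning and agentic reasoning in large language models and to address their poor synergy under multi-task learning. M2A merges models in parameter space, injecting mathematical reasoning capability only along the subspace that does not affect agent behavior, so that reasoning depth grows without disturbing existing behavior. Experiments on a real-world coding agent task show substantial gains; for example, on Qwen3-8B the SWE-Bench Verified resolved rate rises from 44.0% to 51.2%.

Abstract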

While reasoning has become a central capability of large language models (LLMs), the reasoning patterns required for different scenarios are often misaligned. Mathematical reasoning typically relies on intrinsic logic to solve closed-world problems in a single response, whereas agentic reasoning requires not only internal reasoning but also multi-turn interaction with external environments, interleaving thought and action. This misalignment prevents mathematical and agentic reasoning from effectively benefiting from each other, often yielding unstable reasoning behavior and only limited performance gains under multi-task learning. In this paper, we propose M2A, a novel paradigm that synergizes mathematical and agentic reasoning via model merging. To avoid overfitting to superficial reasoning patterns under joint training, M2A operates directly in parameter space: it identifies the feature subspace critical for agent behavior, and merges the mathematical reasoning task vector only along its null space, thereby injecting reasoning capability along directions that do not perturb agent behavior. Unlike SFT or RL, M2A requires no additional gradient updates and exposes the merging coefficient as a simple knob for controlling reasoning length. Experiments in a challenging real-world coding agent setting show that our method effectively extends agentic reasoning depth and delivers substantial performance improvements. Applied to a fine-tuned Qwen3-8B, M2A improves its SWE-Bench Verified resolved rate from 44.0% to 51.2% without retraining the model. Code is available at https://github.com/laplucky/M2A.git.
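The null-space merging step can be pictured with a toy vector example. This is a minimal sketch, not the released M2A code: it assumes the agent-critical subspace is already available as an orthonormal basis and treats parameters as a flat vector, whereas the paper operates on model weights and derives the subspace itself. The names `project_out` and `merge_along_null_space` are hypothetical.

```python
def project_out(vector, orthonormal_basis):
    """Remove the components of `vector` lying in span(orthonormal_basis).

    Assumes the basis vectors are orthonormal, so each coefficient can be
    computed independently from the original vector.
    """
    residual = list(vector)
    for u in orthonormal_basis:
        coeff = sum(v * ui for v, ui in zip(vector, u))
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return residual


def merge_along_null_space(agent_params, math_task_vector, agent_basis, coeff=1.0):
    """Add the math task vector only along directions orthogonal to the agent subspace."""
    safe_update = project_out(math_task_vector, agent_basis)
    return [p + coeff * s for p, s in zip(agent_params, safe_update)]


# Toy example: the first coordinate is "agent-critical" and must not move.
agent_basis = [[1.0, 0.0, 0.0]]
agent_params = [1.0, 1.0, 1.0]
math_task_vector = [2.0, 3.0, 4.0]
merged = merge_along_null_space(agent_params, math_task_vector, agent_basis, coeff=0.5)
print(merged)  # prints: [1.0, 2.5, 3.0]
```

The protected coordinate is untouched, and `coeff` plays the role of the merging coefficient that the abstract describes as a knob; no gradient step is involved.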

2605.09875 2026-05-12 cs.AI

Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

Su-Hyeon Kim, Yo-Sub Han

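Affiliations: Department of Artificial Intelligence; Yonsei University; Department of Computer Science

AI summary: Large language models from different families use different hidden dimensions, tokenizers, and training procedures, which makes behavioral directions hard to compare or transfer across models. This paper proposes an anchor-projection framework that maps each model's hidden representations into a shared anchor coordinate space (ACS), where behavioral directions can be extracted and aligned across models. Experiments show good alignment across several model families and behavioral axes and stable transfer to downstream tasks, offering a new perspective on cross-family interpretability.

Abstract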

Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.

2605.09874 2026-05-12 cs.CV cs.AI cs.CL

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao, Jaehong Yoon, Hyunji Lee, Gedas Bertasius, Mohit Bansal

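Affiliations: UNC Chapel Hill; NTU Singapore

AI summary: EgoMemReason is a memory-driven reasoning benchmark for long-horizon egocentric video understanding, designed to evaluate a model's ability to accumulate, recall, and reason over visual information spanning multiple consecutive days. It introduces three complementary memory types (entity memory, event memory, and behavior memory) to assess recognition of object state changes, activity order, and long-term behavior patterns. Experiments show that the best current model reaches only 39.6% overall accuracy, indicating that long-horizon memory reasoning remains a major challenge.

Comments: The first two authors contributed equally. Project website: https://egomemreason.github.io/

Abstract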

Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.

2605.09870 2026-05-12 cs.LG cs.AI

Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

Tsuyoshi Okita

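Affiliations: Kyushu Institute of Technology

AI summary: This paper proposes SVAR-FM, an intervention-based framework for time series causal discovery that treats a physics simulator as a realization of Pearl's do operator and uses simulator-generated interventional data to learn nonlinear causal relationships. It proves that the structural VAR model is identifiable under certain conditions, and experiments across several scientific domains show that the method outperforms conventional observational methods; in particular, it correctly predicts the sign reversal of the causal effect when simulator accuracy is insufficient.

Comments: 54 pages, 6 figures

Abstract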

We propose SVAR-FM (Structural VAR with Flow Matching), a framework for time series causal discovery that treats a physics-based simulator as a mechanical realization of Pearl's do operator. Clamping a variable inside the simulator physically severs confounding paths, producing interventional data by construction. Conditional Flow Matching then learns the nonlinear interventional conditionals. Theoretically, we prove that the full structural VAR becomes identifiable under a coverage condition on the simulator-clampable variables, and derive an end-to-end error bound that decomposes into Monte Carlo, simulator fidelity, and Flow Matching terms. A sign-flip corollary predicts that when simulator accuracy falls below a threshold, the estimated causal effect reverses sign. Empirically, a benchmark across four scientific domains confirms that SVAR-FM recovers the correct causal sign where observational methods produce sign-reversed estimates due to confounding. A case study in ultrafast laser physics verifies the sign-flip prediction by physically varying the accuracy level of a first-principles quantum solver: the low-accuracy setting reverses the causal sign, while the high-accuracy setting recovers the correct direction (R-squared = 0.983, zero bias).

2605.09867 2026-05-12 cs.LG cs.AI

Continuous Latent Contexts Enable Efficient Online Learning in Transformers

Emile Anand, Abdullah Ateyeh, Xinyuan Cao, Max Dabagia

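Affiliations: Georgia Institute of Technology; University of California, Berkeley; Columbia University

AI summary: This study explores how to make Transformers more effective at online learning by introducing continuous latent context tokens that strengthen adaptation. It constructs constant-depth Transformers that store algorithmic state as linear combinations, implementing foundational online decision procedures such as the weighted majority algorithm and Q-learning. Experiments show that a lightweight Transformer with latent contexts outperforms larger, more complex language models on long online prediction sequences, demonstrating the potential of latent contexts as an effective state representation for online learning algorithms.

Comments: 37 pages, 15 figures, 3 tables

Abstract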

Large language models (LLMs) exhibit a strong capacity for in-context learning: given labeled examples, they can generate good predictions without parameter updates. However, many interactive settings go beyond static prediction to online decision-making, in which effective behavior demands adaptation over long multi-turn horizons in response to feedback, and efficient algorithms in these domains must use compact representations of what they have learned. Recently, continuous transformer architectures with latent chain of thought have shown promise for offline iterative tasks such as directed graph-reachability. Motivated by this, we study whether continuous latent context tokens equip transformers to more effectively realize online learning. We give explicit constructions of constant-depth transformers that implement two foundational online decision-making procedures, the weighted majority algorithm and Q-learning, by storing their algorithmic state as linear combinations of feature embeddings, using a small number of latent context tokens. We further train a small GPT-2-style transformer with latent contexts using a multi-curriculum objective that does not directly supervise the latent states. On long synthetic online prediction sequences, this model outperforms larger and more complex LLMs, including Qwen-3-14B and DeepSeek-V3. Our results suggest that continuous latent contexts provide a simple and effective persistent state for transformers to implement online learning algorithms.
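For readers unfamiliar with the first of the two procedures named above, here is a plain reference sketch of the classic deterministic weighted majority algorithm for binary prediction. It is not the paper's transformer construction; it only shows the compact state (one weight per expert) that such a construction would need to carry. The penalty factor `beta=0.5` and the tie-breaking rule toward 1 are illustrative choices.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run deterministic weighted majority on binary (0/1) predictions.

    expert_predictions[t][i] is expert i's prediction at round t and
    outcomes[t] is the true label. Returns (learner mistakes, final weights).
    """
    num_experts = len(expert_predictions[0])
    weights = [1.0] * num_experts
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        # Weighted vote for each label; ties are broken toward 1.
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        if guess != y:
            mistakes += 1
        # Multiplicatively penalize every expert that was wrong this round.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights


# Expert 0 is always right; experts 1 and 2 are always wrong.
outcomes = [1, 0, 1, 0]
expert_predictions = [[y, 1 - y, 1 - y] for y in outcomes]
print(weighted_majority(expert_predictions, outcomes))
# prints: (2, [1.0, 0.0625, 0.0625])
```

The learner errs twice while the two bad experts still hold a majority of the weight, then follows the reliable expert. The whole state is the weight vector, which is the kind of small persistent summary the abstract argues latent context tokens can store.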

2605.09864 2026-05-12 cs.CV cs.LG

DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment

Kevin Zhu, William Tang, Raphael Hay Tene, Zesheng Liu, Nhut Le, Maryam Rahnemoonfar

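Affiliations: Bina Labs, Lehigh University

AI summary: This paper proposes DA-SegFormer, a semantic segmentation method for fine-grained disaster assessment that targets the difficulty of recognizing subtle damage in UAV imagery under texture degradation and class imbalance. Built on the SegFormer architecture, it adds a class-aware sampling strategy and combines online hard example mining with Dice loss to strengthen learning of rare damage features, and it uses a resolution-preserving inference protocol to retain native texture detail. On the RescueNet dataset, DA-SegFormer reaches 74.61% mIoU, outperforming the baseline, with large gains in critical damage classes.

Comments: Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Abstract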

Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine-grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA-SegFormer, a damage-aware adaptation of the SegFormer architecture optimized for high-resolution disaster imagery. Our method introduces a Class-Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution-preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA-SegFormer achieves 74.61% mIoU, outperforming the baseline by 2.55%. Notably, our improvements yield double-digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
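The two loss ingredients named in the abstract are standard and can be sketched on flat lists of per-pixel values. This is a generic illustration, not the DA-SegFormer training code: how the paper combines OHEM with Dice loss, the smoothing constant `eps`, and the `keep_ratio` value are all assumptions here.

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class.

    `probs` are predicted probabilities in [0, 1] and `targets` are 0/1
    labels, both flattened over pixels. `eps` avoids division by zero.
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def ohem_select(per_pixel_losses, keep_ratio=0.5):
    """Return indices of the hardest pixels (largest loss), hardest first."""
    k = max(1, int(len(per_pixel_losses) * keep_ratio))
    ranked = sorted(
        range(len(per_pixel_losses)),
        key=lambda i: per_pixel_losses[i],
        reverse=True,
    )
    return ranked[:k]


print(ohem_select([0.1, 0.9, 0.5, 0.2], keep_ratio=0.5))  # prints: [1, 2]
```

Dice loss is computed from overlap relative to the combined size of the predicted and true regions, so a rare class is not swamped by the many background pixels, while OHEM restricts the update to the pixels the model currently gets most wrong. Both address the class imbalance the abstract highlights.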

2605.09862 2026-05-12 cs.LG cs.AI

UFO: A Unified Flow-Oriented Framework for Robust Continual Graph Learning

Danhui Zhang, Zhe Wang, Qing Qing, Jiarui Liu, Wentao Gao, Ziqi Xu, Mingliang Hou, Xikun Zhang, Renqiang Luo

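Affiliations: Jilin University; Adelaide University; RMIT University; Jinan University

AI summary: This paper studies robust continual graph learning, where graph data keeps evolving and newly arriving portions are often noisy, so that models must handle catastrophic forgetting and noisy supervision at the same time. The authors propose UFO, a unified flow-oriented framework that models conditional feature distributions with flow-based models to generate replay representations and mitigate forgetting, and that uses instance-level reliability scores to separate noisy nodes and reduce the impact of noisy supervision. Experiments show that UFO outperforms existing methods on multiple benchmark graph datasets, with higher accuracy and better control of forgetting.

Abstract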

Graph learning research has increasingly shifted toward continual graph learning (CGL), which better reflects real-world scenarios where graphs evolve over time. However, existing CGL methods largely assume clean supervision and overlook a critical challenge: the newly arriving portions of the graph are often noisy, due to annotation errors or adversarial corruption. This mismatch limits their applicability in practice. In this work, we study robust continual graph learning, where models must simultaneously handle catastrophic forgetting and noisy supervision in evolving graph data. We show that label noise introduces a new failure mode, catastrophic remembering, where models persistently reinforce corrupted knowledge across tasks. To address these challenges, we propose a Unified Flow-Oriented framework (UFO). First, UFO models conditional feature distributions via flow-based generative modeling and produces replay representations, mitigating forgetting without storing historical data. Second, UFO estimates instance-level reliability scores to distinguish clean from noisy nodes, reducing the impact of corrupted supervision and alleviating catastrophic remembering. Extensive experiments on four benchmark graph datasets under varying noise ratios demonstrate that UFO consistently outperforms existing methods in both accuracy and forgetting metrics. Code is available at: https://anonymous.4open.science/r/UFO.

2605.09861 2026-05-12 cs.LG cs.AI

Flag Varieties: A Geometric Framework for Deep Network Alignment

Jingchuan Xiao, Xinyi Sui, Cihan Ruan

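Affiliations: Department of Mathematics and Computer Studies, Mary Immaculate College, Ireland; Department of Computer Science and Engineering, Santa Clara University, USA

AI summary: This paper studies the alignment of adjacent weight matrices in deep neural networks and uncovers the underlying geometric structure. Using geometric invariant theory, the authors prove that alignment geometry has a canonical structure defined by a flag variety and that subspace intersection dimension is the unique reparameterization-invariant observable, which elevates subspace metrics from empirical convention to mathematical necessity. The work also characterizes how regularization and nonlinear activations affect alignment, and provides a way to analyze internal alignment structure without forward passes.

Abstract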

Alignment, the tendency of adjacent weight matrices in deep networks to develop compatible subspace orientations, underlies gradient flow, Neural Collapse, and representation similarity across architectures. Despite extensive empirical documentation, these phenomena have resisted unified theoretical treatment: existing explanations are post-hoc, each fitted to a specific observation with whatever mathematics is at hand. We reverse this direction by deriving the mathematical structure that layerwise alignment inherently demands. Using geometric invariant theory, we prove that alignment geometry has a canonical closed, polystable stratum given by a flag variety, and that subspace intersection dimension is its unique reparameterization-invariant observable, establishing that subspace metrics are not empirical conventions but mathematical necessities. This unified framework yields two dynamical consequences: ridge regularization drives subspace alignment at an exponential rate set by weight decay, whereas nonlinear activations induce a commutator obstruction to exact basis alignment, generically present in nonlinear networks and absent in linear ones. Together these give a geometric explanation of the Level-2/3 hierarchy in Neural Collapse from first principles rather than post-hoc analysis. The commutator magnitude and head subspace overlap further serve as weight-space windows into internal alignment structure, requiring no forward passes. Experiments on multilayer perceptrons, residual networks, and pretrained language models support the proposed diagnostics and delineate their scope.

2605.09859 2026-05-12 cs.CV

Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

Shijie Wang, Yadan Luo, Zijian Wang, Xin Yu, Zi Huang

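Affiliations * The University of Queensland, Australia; The University of Adelaide, Australia

AI summary This paper addresses how to improve retrieval performance on unseen categories in fine-grained image retrieval and proposes GAPan, a method based on aligning generative appearance priors. It reformulates the learning objective with an invertible density model, shifting from category prediction to appearance modeling: normalizing flows map features into a latent density space optimized under class-conditional Gaussian priors, which preserves richer appearance details. Reverse sampling then produces appearance-aware anchors that guide retrieval embeddings to align with category-specific appearance distributions, improving generalization to unseen categories.

Details
Abstract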

Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that GAPan achieves state-of-the-art performance on widely used fine- and coarse-grained benchmarks.

2605.09858 2026-05-12 cs.CV

Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking

Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura, Ryuichi Tanida

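Affiliations * NTT, Inc.

AI summary This paper studies how active learning (AL) can improve annotation efficiency for end-to-end multi-object tracking (MOT) in dynamic environments. Because existing frame-level AL methods are mismatched in temporal granularity with modern Transformer-based end-to-end trackers, it proposes CUTAL, a clip-level active learning method that scores each clip with uncertainty metrics derived from multi-frame predictions and adds a temporal diversity constraint to select informative, low-redundancy clips. Experiments show that CUTAL outperforms existing methods at the same label budget and, for MeMOTR, reaches performance comparable to full supervision with only 50% of the labeled data.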

Comments Accepted to 2026 IEEE International Conference on Image Processing (ICIP). Copyright 2026 IEEE. Published in 2026 IEEE International Conference on Image Processing (ICIP), scheduled for 13-17 September 2026 in Tampere, Finland

Details
Abstract

Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
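The idea of combining clip-level uncertainty with temporal diversity can be sketched as a greedy selection: rank clips by an uncertainty score and skip any clip that starts too close in time to one already chosen from the same video. This is a simplified stand-in, not CUTAL itself; the tuple layout, the `min_gap` rule, and the scores are hypothetical.

```python
def select_clips(clips, budget, min_gap):
    """Greedy pick of high-uncertainty clips that are temporally spread out.

    clips: list of (video_id, start_frame, uncertainty) tuples (hypothetical layout).
    budget: maximum number of clips to send for annotation.
    min_gap: minimum start-frame distance between chosen clips of the same video.
    """
    chosen = []
    for clip in sorted(clips, key=lambda c: c[2], reverse=True):
        if len(chosen) >= budget:
            break
        video, start, _ = clip
        # Temporal diversity: reject clips near an already selected clip.
        too_close = any(
            v == video and abs(start - s) < min_gap for v, s, _ in chosen
        )
        if not too_close:
            chosen.append(clip)
    return chosen


# Hypothetical uncertainty scores for four candidate clips.
candidates = [("a", 0, 0.9), ("a", 5, 0.8), ("a", 40, 0.5), ("b", 3, 0.7)]
print(select_clips(candidates, budget=3, min_gap=10))
```

Here the second-ranked clip is skipped because it overlaps in time with the top-ranked one, which is the redundancy the diversity constraint is meant to avoid.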

2605.09856 2026-05-12 cs.CV cs.AI

MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery

Tao Tang, Hong Liu, Xinshun Wang, Wanruo Zhang

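Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, China

AI summary Despite notable recent progress, human mesh recovery remains insufficiently robust to occlusion, often yielding inaccurate poses and motion jitter. This paper proposes MoPO, which incorporates a motion prior to improve occluded human mesh recovery. MoPO comprises a motion de-occlusion module, which predicts occluded joint positions from historical poses, and a motion-aware fusion and refinement module, which combines image features with the predicted poses to estimate human shape and pose and further refines the final pose through inverse kinematics, improving accuracy and temporal consistency in occluded scenes.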

Comments 35 pages

Details
Abstract

Although recent studies have made remarkable progress in human mesh recovery, they still exhibit limited robustness to occlusions and often produce inaccurate poses and severe motion jitter because spatial features for occluded body parts are insufficient. Inspired by rapid advances in human motion prediction, we find that, compared with occluded image features, pose sequences inherently contain a reliable motion prior for estimating occluded body parts. In this paper, we incorporate a Motion Prior for Occluded human mesh recovery, called MoPO. MoPO mainly consists of two components: 1) The motion de-occlusion module, in which a spatial-temporal occlusion detector detects joint visibility and a lightweight motion predictor completes the occluded body parts by predicting the most plausible joint positions from historical poses. 2) The motion-aware fusion and refinement module, which fuses the completed joint sequence with image features to estimate human shape and initial human pose. Moreover, the completed joint sequence is further used to refine the final human pose through inverse kinematics, which provides an occlusion-free motion prior for regressing human poses. Extensive experiments demonstrate that MoPO achieves state-of-the-art performance on both occlusion-specific and standard benchmarks, significantly enhancing the accuracy and temporal consistency of occluded human mesh recovery. Our code and demo can be found in the supplementary material.

2605.09853 2026-05-12 cs.LG

Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

Changhao Li, Yuchen Zhuang, Chenxiao Gao, Haotian Sun, Rushi Qiang, Chao Zhang, Bo Dai

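Affiliations * Georgia Institute of Technology

AI summary This work targets the tension between inference-time methods, which benefit from diverse sampling, and RL-based post-training, which sharpens the output distribution of large language models. It proposes Exploration-Driven Optimization (EDO), which brings reward-biasing exploration objectives into iterative post-training to improve solution diversity and reasoning ability at inference time. Experiments show that EDO improves iDPO and GRPO, raising accuracy on multiple benchmarks (1.0-1.3% on in-distribution tasks and 1.5% on average on out-of-distribution tasks) while helping preserve model entropy and training stability, providing a practical framework for test-time reasoning optimization.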

Comments Accepted by TMLR 2026

Details
Abstract

Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3% improvement over the strongest baselines, and delivers an additional 1.5% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.
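Self-consistency, the test-time technique mentioned in the abstract, is commonly implemented as a majority vote over the final answers of several sampled reasoning paths. The sketch below shows only that voting step, plus a crude diversity measure; the function names and the sample answers are hypothetical, and EDO's training objective is not reproduced here.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer and its vote share."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def distinct_ratio(answers):
    """Fraction of distinct answers: a crude proxy for sampling diversity."""
    return len(set(answers)) / len(answers)


# Hypothetical final answers extracted from five sampled reasoning paths.
samples = ["42", "41", "42", "42", "7"]
print(self_consistency(samples))
print(distinct_ratio(samples))
```

A flatter sampling distribution tends to raise the distinct ratio, which is why a sharpened policy can limit what majority voting gains from extra samples.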

2605.09852 2026-05-12 cs.AI cs.CE cs.CY cs.LG

Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

Gideon Popoola, John Sheppard

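Affiliations * Montana State University

AI summary This paper examines the fairness of explanations in AI. It points out that algorithmic fairness and explainable AI (XAI) research have developed largely independently and overlook that a model can satisfy fairness criteria in its outputs while being deeply unfair in its reasoning process, termed "procedural bias". The authors propose a conditional invariance framework that formalizes explanation fairness as a conditional independence requirement with respect to protected attributes, and present a seven-dimensional taxonomy and a six-step evaluation workflow, offering theoretical grounding and practical guidance for responsible AI.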

Comments 53 pages, 1 figure

Details
Abstract

Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent research fields: algorithmic fairness, which targets equitable outcomes, and explainable AI (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection: a model can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its reasoning process. We refer to this as procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical and literature review of this emerging field and elucidate the drawbacks of post-hoc explainers in certifying explanation fairness. Our central contribution is a conditional invariance framework formalizing explanation fairness as the requirement that explanations be invariant to the protected attributes: $P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$ for all task-relevant $x$, a single principle from which all existing explanation fairness metrics emerge as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.

2605.09850 2026-05-12 cs.CV cs.AI

Probing Routing-Conditional Calibration in Attention-Residual Transformers

Wenhao Liang, Lin Yue, Wei Emma Zhang, Miao Xu, Mingyu Guo, Olaf Maennel, Weitong Chen

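Affiliations * Adelaide University; Australian Institute for Machine Learning (AIML), Adelaide University; The University of Queensland

AI summary This paper studies whether routing information affects model calibration in Attention-Residual Transformers. Using a matched-confidence diagnostic suite, the authors find that routing summaries do not provide stable evidence of routing-conditional miscalibration, and that a calibration probe using routing-depth variance does not outperform confidence-only controls across several evaluation metrics. The experiments indicate that apparent routing-aware calibration gains may stem from other factors, and should be accepted as genuine internal-state calibration only after controlling for matched confidence, bandwidth, model capacity, and permutation.

Comments Under review

Details
Abstract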

Post-hoc calibration is usually evaluated as a function of logits or softmax confidence alone, even as routing-augmented architectures increasingly accompany predictions with sample-specific internal routing traces and pair them with claims of calibration-relevant uncertainty. We ask a basic question: do these traces provide stable routing-specific evidence for post-hoc calibration beyond confidence? We study this in Attention-Residual transformers (Kimi Team, 2026) through a matched-confidence diagnostic suite that stratifies examples by routing-derived state, compares subgroup gaps against within-bin routing-permutation nulls, and evaluates matched post-hoc probes differing only in their auxiliary feature. Across our completed AR runs, scalar routing summaries do not provide stable evidence of routing-conditional miscalibration: weighted gaps remain small or seed-sensitive, and only $1$ of $30$ within-bin permutation tests rejects the conditional-null at $\alpha=0.05$ (only on one seed; not stable across seeds in that cell). AR-CondCal, a minimal $2$-D Nadaraya-Watson probe on confidence and routing-depth variance, lies within the seed-variance band of matched confidence-only and predictive-entropy controls and does not reliably improve worst-routing-tertile ECE; bandwidth-sensitivity checks (Scott multiples, CV-NLL, global-ECE oracle) do not change this. A full-vector MLP over $(c, H_1, \ldots, H_L)$ can appear to improve over a linear confidence baseline, but the apparent gain disappears once a capacity-matched confidence-only MLP is included as a control, and shuffled routing profiles achieve comparable performance. Apparent routing-aware calibration gains in this AR setting should not be read as internal-state calibration until matched-confidence, bandwidth, capacity, and permutation controls rule out common confounds.
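To make the "2-D Nadaraya-Watson probe" concrete, the sketch below estimates the probability of a correct prediction at a query point as a kernel-weighted average of held-out correctness labels, using a product Gaussian kernel over (confidence, routing statistic). It is a generic Nadaraya-Watson regressor under assumed inputs, not the paper's AR-CondCal code; the function names, the toy data, and the use of Scott's rule as the default bandwidth are assumptions.

```python
import math
import statistics


def scott_bandwidths(points):
    """Per-dimension Scott's-rule bandwidths: std * n^(-1/(d+4))."""
    n, d = len(points), len(points[0])
    factor = n ** (-1.0 / (d + 4))
    return [statistics.stdev(col) * factor for col in zip(*points)]


def nw_calibrate(query, points, labels, bandwidths):
    """Nadaraya-Watson estimate of P(correct) at `query`.

    points: held-out (confidence, routing_stat) pairs; labels: 0/1 correctness.
    """
    num = den = 0.0
    for p, y in zip(points, labels):
        # Product Gaussian kernel with one bandwidth per dimension.
        z = sum(((q - x) / h) ** 2 for q, x, h in zip(query, p, bandwidths))
        w = math.exp(-0.5 * z)
        num += w * y
        den += w
    return num / den if den > 0 else float("nan")


# Hypothetical held-out data: (softmax confidence, routing-depth variance).
held_out = [(0.95, 0.10), (0.90, 0.30), (0.60, 0.20), (0.55, 0.40)]
correct = [1, 1, 0, 1]
h = scott_bandwidths(held_out)
print(nw_calibrate((0.9, 0.2), held_out, correct, h))
```

Dropping the second coordinate from `points` and `query` gives the confidence-only control, which is the comparison the abstract argues any routing-aware probe should be matched against.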

2605.09848 2026-05-12 cs.LG

Efficient Neural Architectures for Real-Time ECG Interpretation on Limited Hardware

Ashery Mbilinyi, Callum O'Riley, Julia Handra, Ashley Moller-Hansen, Jason Andrade, Marc Deyell, Cameron Hague, Nathaniel Hawkins, Kendall Ho, Jonathan Leipsic, Roger Tam

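Affiliations * Department of Computer Science; University of Victoria; Department of Electrical and Computer Engineering; University of British Columbia; Faculty of Medicine; School of Biomedical Engineering; Division of Cardiology; Department of Radiology; Department of Emergency Medicine

AI summary This paper studies efficient neural network architectures for real-time electrocardiogram (ECG) interpretation on limited hardware. After benchmarking existing models, the authors propose three lightweight CNN models that balance diagnostic accuracy against computational efficiency. The models are evaluated on three publicly available 12-lead ECG datasets, and a unified Efficiency Score is introduced, offering a scalable basis for deploying AI systems in cardiovascular care.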

Comments 9 pages, 6 figures, 3 tables. Published in: 2025 IEEE International Conference on Big Data (BigData), pp. 3275-3284. DOI: 10.1109/BIGDATA66926.2025.11402097

Journal ref 2025 IEEE International Conference on Big Data (BigData), pp. 3275-3284

Details
Abstract

Electrocardiogram (ECG) interpretation is essential for diagnosing a wide range of cardiac abnormalities. While deep learning has shown strong potential for automating ECG classification, many existing models rely on large, computationally intensive architectures that hinder practical deployment. In this paper, we present an empirical study of convolutional neural network (CNN) architectures, exploring tradeoffs between diagnostic accuracy and computational efficiency. We benchmark two established baselines: AttiaNet, a compact model composed of sequential temporal and spatial blocks, and DeepResidualCNN, the winning architecture of the 2021 PhysioNet/Computing in Cardiology Challenge. Building on these, we propose three lightweight models: (i) ParallelCNN, which employs dual temporal and spatial branches for parallel pattern extraction; (ii) ParallelCNNew, a variant with symmetric weight initialization for balanced feature learning; and (iii) SimpleNet, a streamlined architecture that jointly processes temporal and spatial dimensions. Our experiments span three publicly available 12-lead ECG datasets from Germany, China, and the United States, covering binary, multiclass, and multilabel classification tasks across diverse patient populations. We further evaluate the impact of integrating low-cost demographic metadata (age and sex) to improve performance with minimal overhead. To ensure fair comparison, we introduce a unified Efficiency Score that integrates model size, inference speed, memory usage, and AUC performance. By balancing diagnostic performance and efficiency, our models offer a scalable and viable foundation for next-generation AI systems in cardiovascular care.

2605.09846 2026-05-12 cs.SD cs.AI

ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

Yakun Liu, Hai Luan, Dong Liu, Zhiyu Jin

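Affiliations * Department of Composition; Education Information Center; Department of Musicology

AI summary In new media art creation, the mapping between vision and hearing is often subjective. This paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. It builds a dataset based on Kirchhoff-Love plate theory and uses a lightweight CNN with a CBAM module for high-accuracy, low-latency pattern classification. An end-to-end system built in Python and Max/MSP maps each recognized pattern to its corresponding sine-wave frequency, matching the theoretical frequency with zero deviation and supporting real-time interaction.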

Comments 9 pages, 5 figures, IEEE conference format

Details
Abstract

In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers for simulation, offline computing failing real-time interaction, and uncontrollable mapping rules in general sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.
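The two ends of such a pipeline, a pattern and a tone, can be sketched in a few lines. The mode-shape formula below is a commonly used closed-form approximation for a square plate, not the Kirchhoff-Love numerical model or the ANSYS calibration described in the abstract, and the pattern-to-frequency table is entirely hypothetical.

```python
import math


def chladni_amplitude(x, y, m, n, side=1.0):
    """Approximate square-plate mode shape; nodal lines lie where this is near 0."""
    k = math.pi / side
    return (math.cos(n * k * x) * math.cos(m * k * y)
            - math.cos(m * k * x) * math.cos(n * k * y))


def sine_wave(freq_hz, duration_s=0.01, sample_rate=44100):
    """Samples of a unit-amplitude sine tone at freq_hz."""
    count = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


# Hypothetical lookup from a recognized pattern label (m, n) to a frequency in Hz.
PATTERN_TO_FREQ = {(1, 2): 220.0, (2, 3): 440.0}

label = (1, 2)  # stand-in for the classifier's output
tone = sine_wave(PATTERN_TO_FREQ[label])
print(len(tone), chladni_amplitude(0.3, 0.3, *label))
```

Because the mapping is a table lookup, the emitted frequency equals the tabulated value exactly, which is consistent with the abstract's point that mapping rules stay controllable.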

2605.09845 2026-05-12 cs.LG

Sub-Footprint Effect Correction in FW-LiDAR Point Clouds via Intra-Footprint Target Unmixing

Zhen Xiao, Yanfeng Gu, Xian Li

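Affiliations * School of Electronics and Information Engineering, Harbin Institute of Technology

AI summary This paper addresses the intensity uncertainty caused by sub-footprint target mixing in full-waveform LiDAR (FW-LiDAR) point clouds. It proposes a physics-based framework that explicitly models the mixing of multiple targets within a footprint to achieve sub-footprint intensity correction. Combining waveform parameters and surface geometry, the method casts the mixing process as an inverse unmixing problem, separating the contributions of different sub-targets within each footprint and recovering more accurate intensities. Experiments show improved semantic separability across heterogeneous targets and better intensity consistency across homogeneous targets.

Comments 11 pages, 7 figures

Details
Abstract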

Sub-footprint target mixing within a laser footprint significantly increases LiDAR intensity uncertainty, especially in complex environments where heterogeneous materials inside one footprint cause nonlinear distortions that impair intensity-based applications. However, the forward mixing inherent to the single-pixel detection mode of LiDAR systems blurs sub-footprint contributions, making sub-footprint effects difficult to address effectively in existing studies. To address this issue, we introduce a novel, physics-based framework that explicitly resolves sub-footprint intensity correction in full-waveform LiDAR (FW-LiDAR) point clouds. The key innovation is to make the otherwise implicit intra-footprint mixing process explicit: we first develop a spatiotemporal laser-beam distribution model to physically characterize within-footprint forward mixing of multi-target returns. Building on this formulation, we incorporate ancillary information including waveform parameters and surface geometry as constraints to pose a well-defined inverse unmixing problem and decompose each footprint into fractional contributions from multiple sub-targets. We then recover sub-footprint-corrected intensities by inverting the observed mixtures through a unified combination of parametric and model-driven approaches. To the best of our knowledge, few prior studies explicitly establish sub-footprint inversion and correction within a single laser footprint, and our framework offers a principled, physics-grounded solution. Experiments on both controlled and real-world LiDAR datasets demonstrate that the proposed method significantly enhances semantic separability across heterogeneous targets and intensity consistency across homogeneous targets.

2605.09844 2026-05-12 cs.AI cs.CL cs.LG

The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

Rafael C. T. Oliveira

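Affiliations * Independent Researcher

AI summary This study presents the Metacognitive Probe, a diagnostic for assessing the confidence behaviour of large language models (LLMs), decomposed into five behavioural dimensions that include confidence calibration and knowledge boundary. The instrument is evaluated on 8 frontier models and 69 humans, and examines how well confidence aligns with correctness across tasks, showing that a model can look well calibrated in aggregate while remaining overconfident in narrow pockets. In Gemini 2.5 Flash, the study observes a 47-point within-model dissociation, highlighting inconsistent confidence judgment across tasks.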

Comments 27 pages, 13 tables. Code, data, prompts, and rubrics released with the paper. OSF deposit pending; DOI in v2

Details
Abstract

The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM's confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response. They are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).
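The abstract reports confidence-correctness alignment as a Spearman rho. As a minimal sketch of that statistic, the code below ranks both series (averaging ranks over ties) and takes the Pearson correlation of the ranks. The helper names and the toy confidence data are hypothetical, and the instrument's 0-100 task scores are not reproduced.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else float("nan")


def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical stated confidences and 0/1 correctness for four items.
confidence = [0.9, 0.8, 0.3, 0.2]
correct = [1, 1, 0, 0]
print(spearman(confidence, correct))
```

A value near +1 means higher stated confidence goes with correct answers; constant confidence yields an undefined rho, which the sketch returns as NaN.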

2605.09842 2026-05-12 cs.AI

Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis

Aman Singh, Tokunbo Ogunfunmi, Sanjiv Das

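Affiliations * Department of Electrical and Computer Engineering, Santa Clara University, USA; School of Business, Santa Clara University, USA

AI summary This paper compares econometric, classical machine learning, and deep learning methods for forecasting the U.S. Treasury yield curve, using daily data spanning 47 years. Traditional econometric models such as ARIMA, together with naive benchmarks, perform best overall, while TimeGPT, LGBM, and RNNs are the strongest of the machine learning methods. The paper also examines whether stationary or nonstationary data are more appropriate as input to deep learning models.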

Comments 18 pages, 12 figures, comparative study of econometric, machine learning, and deep learning methods for U.S. Treasury yield curve forecasting

Journal ref Journal of Investment Management, vol. 23, no. 4, Fourth Quarter 2025

Details
Abstract

While machine learning has revolutionized many fields such as natural language processing (NLP) and computer vision, its impact on time-series forecasting is still widely disputed, especially in the finance domain. This paper compares forecasting performance on U.S. Treasury yield curve data across econometrics/time-series analysis, classical machine learning, and deep learning methods, using daily data over 47 years. The Treasury yield curve is important because it is widely used by every participant in the bond markets, which are larger than equity markets. We examine a variety of methods that have not been tested on yield curve forecasting, especially deep learning algorithms. The algorithms include the Autoregressive Integrated Moving Average (ARIMA) model and its extensions, naive benchmarks, ensemble methods, Recurrent Neural Networks (RNNs), and multiple transformers built for forecasting. ARIMA and naive econometric models outperform other models overall, except in one time block. Of the machine learning methods, TimeGPT, LGBM and RNNs perform the best. Furthermore, the paper explores whether stationary or nonstationary data are more appropriate as input to deep learning models.

2605.09839 2026-05-12 cs.LG cs.AI

Free Energy Manifold: Score-Based Inference for Hybrid Bayesian Networks

Cheol Young Park, Shou Matsumoto

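Affiliations * ATOS Co., Ltd.; C5I Center, George Mason University

AI summary This paper proposes the Free Energy Manifold (FEM), a conditional energy model designed for inference in hybrid Bayesian networks with discrete and continuous variables. By learning embeddings of discrete parents and energy landscapes over continuous observations, FEM supports posterior evaluation, generative sampling, and compositional inference across multiple continuous leaves. The study also finds that standard conditional energy models can form low-energy ridges between modes of the same class, producing overconfident posteriors at off-data points, and proposes valley regularization to correct this. Experiments show that FEM outperforms classical methods and a vanilla conditional energy model on multimodal and compositional inference tasks.

Details
Abstract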

We introduce the Free Energy Manifold (FEM), a score-trained conditional energy model specialized for inference in hybrid Bayesian networks with discrete and continuous variables. FEM represents each conditional factor as an energy landscape over learned discrete-parent embeddings and continuous observations, enabling posterior evaluation, generative sampling, and compositional inference across multiple continuous leaves by energy addition under conditional independence. A central finding is the mode-bridge artifact: standard conditional energy models can create low-energy ridges between separated modes of the same class, producing overconfident posteriors at off-data interior points. We analyze this failure and propose valley regularization, an off-data calibration term that restores near-uniform posteriors in such regions while preserving in-data fit. Across synthetic multimodal hybrid-BN benchmarks, FEM substantially reduces KL divergence relative to classical baselines and a vanilla conditional EBM, including large gains at mode-bridge midpoint queries and in multi-leaf evidence composition. We also evaluate high-cardinality discrete-parent settings and a UCI Breast Cancer sanity check, showing that FEM is most useful when multimodal or compositional Bayesian-network inference is required, while discriminative classifiers remain preferable for closed-world classification tasks.
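The phrase "energy addition under conditional independence" can be illustrated with a toy case. If each continuous leaf has a normalized energy E_i(x_i, c) = -log p(x_i | c), then the posterior over the discrete parent is proportional to exp(log p(c) - sum_i E_i(x_i, c)). The sketch below uses Gaussian negative log-densities as stand-in energies; this normalization is an assumption, and FEM's learned energies and embeddings are not modeled here.

```python
import math


def gaussian_energy(x, mean, std):
    """Negative log-density of N(mean, std^2): a normalized toy energy."""
    return 0.5 * ((x - mean) / std) ** 2 + math.log(std * math.sqrt(2 * math.pi))


def posterior(log_prior, energies_per_class):
    """p(c | x_1..x_k) from log p(c) minus the summed per-leaf energies."""
    scores = [lp - sum(es) for lp, es in zip(log_prior, energies_per_class)]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Two classes with leaf means +1 and -1 (hypothetical), two observed leaves.
observations = [1.0, 1.0]
class_means = [1.0, -1.0]
energies = [[gaussian_energy(x, mu, 1.0) for x in observations] for mu in class_means]
print(posterior([math.log(0.5), math.log(0.5)], energies))
```

Adding a leaf only appends one more energy term per class, so evidence from several leaves composes without retraining. With unnormalized energies, class-dependent partition functions would also have to be accounted for.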

2605.09838 2026-05-12 cs.CL cs.LG

The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care

Douglas K. Faust, Peter Awad, Alexandre Vaz, Tony Rousmaniere

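Affiliations * Sentio University; Western Washington University

AI summary This study examines how Transformer-based sentiment analysis relates to symptom distress and deterioration in routine psychotherapy care. Analyzing a large corpus of psychotherapy sessions, it extracts utterance-level and session-level sentiment features and finds that they correlate with several components of the OQ-45 psychometric instrument, most strongly those related to emotional valence. It also reports statistically significant differences in sentiment distributions for patients flagged as at risk of deterioration or dropout, suggesting that the proposed features can serve as adjunctive measures of patient distress.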

Comments 20 pages, 4 figures

Journal ref (2026) Front. Digit. Health 8:1792536

Details
Abstract

Sentiment analysis has been of long-standing interest in psychotherapy research. Recently, the Transformer deep learning architecture has produced text-based sentiment analysis models that are highly accurate and context-aware. These models have been explored as proxies for emotion measurement instruments in psychotherapy, but not investigated as stand-alone psychometric tools. Using proposed utterance-level and session-level sentiment features derived from a fine-grained sentiment model on a large corpus of psychotherapy sessions (N = 751), we investigate the distribution of session aggregated sentiment scores. Further, we characterize the relationship of these features to individual components and the overall score of the OQ-45 instrument and find that this sentiment feature is most strongly correlated to components related to emotional valence in directionally intuitive ways. Finally, we report that there are statistically significant differences between the sentiment distributions for patients flagged as at risk of deterioration or dropping out of care via either the OQ Rational or Empirical outcome models. These correlations to a fully-validated psychometric instrument demonstrate that these proposed sentiment features are, at least, adjunctive measures of client distress and deterioration.

2605.09832 2026-05-12 cs.LG

Modeling Atomic Conformational Ensembles of Proteins via Test-Time Supervision of Boltz-2 on Cryo-EM Density Maps

Jay Shenoy, Miro Astore, Axel Levy, Frédéric Poitevin, Sonya M. Hanson, Gordon Wetzstein

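Affiliations * Flatiron Institute; SLAC National Lab; Center for Computational Biology & Center for Computational Mathematics

AI summary This work addresses data scarcity in predicting atomic conformational ensembles of proteins. It proposes fine-tuning Boltz-2, a pre-trained static structure prediction model, directly on raw cryo-electron microscopy (cryo-EM) density maps to generate atomic conformations, bypassing the traditional two-stage training pipeline. The method, CryoSampler, achieves higher model-building accuracy than prior work and shows preliminary evidence of generalizing to unseen sequences within the same protein family, pointing to a way of training next-generation conformation prediction models on raw cryo-EM data.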

Comments Project page: https://jayshenoy.com/cryosampler

Details
Abstract

Knowledge of a protein's atomic conformational ensemble is critical to determining its function, yet state-of-the-art ensemble prediction models are limited by lack of high-quality conformational data from simulation or experiment. Recent advances in heterogeneous reconstruction for cryo-electron microscopy (cryo-EM) have enabled scientists to visualize ensembles of density maps for larger proteins and complexes not typically accessible through simulation, but building atomic models into these maps remains a challenge. Traditionally, ensemble prediction models are trained via a two-stage process: experimental density maps are converted into atomic structural ensembles through model building, after which these structures are used to train sequence-to-atomic ensemble predictors. In this work, we propose a new principle for fine-tuning pre-trained static structure prediction models such as Boltz-2 directly on raw cryo-EM maps, bypassing the two-stage process. We apply this technique to the problem of atomic model building by fine-tuning Boltz-2 to generate atomic conformations from an input ensemble of cryo-EM maps, achieving superior model building accuracy compared to prior work. Beyond overfitting to individual map ensembles, our method, CryoSampler, also shows preliminary evidence of in-domain generalization after fine-tuning, sampling diverse atomic conformations for unseen sequences within the same protein family without requiring cryo-EM data. These capabilities indicate that CryoSampler holds the potential to train next-generation atomic ensemble prediction models directly on raw cryo-EM measurements.