arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.13988 2026-05-13 cs.AI cs.LG

Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning

Halimat Afolabi, Zainab Afolabi, Elizabeth Friel, Jude Roberts, Antonio Ji-Xu, Lloyd Chen, Egheosa Ogbomo, Emiliomo Imevbore, Phil Eneje, Wissal El Ouahidi, Aaron Sohal, Alisa Kennan, Shreya Srivastava, Anirudh Vairavan, Laura Napitu, Katie McClure

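Affiliations Stratified Precision; Harvard Medical School; Imperial College London; National Health Service; Ipsen France; University College London

AI summary This paper studies the faithfulness of explanations produced by closed-source large language models (such as ChatGPT and Gemini) in medical reasoning tasks, noting that generated explanations may look plausible without reflecting the actual reasoning process. The authors design three perturbation-based probes (causal ablation, positional bias, and hint injection) to assess how the models' reasoning relates to their predictions, and use a human evaluation to analyze the relationship between explanation faithfulness and user trust. They find that reasoning steps often do not directly affect predictions and that models are easily swayed by external hints, and argue that faithfulness, alongside accuracy, should be a core consideration when evaluating models for medical use.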

Journal ref Proceedings of Machine Learning Research, Vol. 297, pp. 1562-1591, 2026

English abstract

Closed-source large language models (LLMs), such as ChatGPT and Gemini, are increasingly consulted for medical advice, yet their explanations may appear plausible while failing to reflect the model's underlying reasoning process. This gap poses serious risks as patients and clinicians may trust coherent but misleading explanations. We conduct a systematic black-box evaluation of faithfulness in medical reasoning among three widely used closed-source LLMs. Our study consists of three perturbation-based probes: (1) causal ablation, testing whether stated chain-of-thought (CoT) reasoning causally influences predictions; (2) positional bias, examining whether models create post-hoc justifications for answers driven by input positioning; and (3) hint injection, testing susceptibility to external suggestions. We complement these quantitative probes with a small-scale human evaluation of model responses to patient-style medical queries to examine concordance between physician assessments of explanation faithfulness and layperson perceptions of trustworthiness. We find that CoT reasoning steps often do not causally drive predictions, and models readily incorporate external hints without acknowledgment. In contrast, positional biases showed minimal impact in this setting. These results underscore that faithfulness, not just accuracy, must be central in evaluating LLMs for medicine, to ensure both public protection and safe clinical deployment.

2603.05947 2026-05-13 cs.CV

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution

Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu

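Affiliations The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary This paper proposes LucidNFT, a multi-reward reinforcement learning framework for flow-matching-based real-world image super-resolution. It introduces LucidConsistency, an LR-referenced evaluator that is invariant to degradation and sensitive to semantic hallucinations, together with a decoupled reward normalization strategy and LucidLR, a large-scale collection of real degraded images, to address the trade-off between staying faithful to the low-resolution input and improving visual quality. Experiments show that LucidNFT improves perceptual quality on several benchmarks while remaining consistent with the real low-resolution input.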

English abstract

Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence, exhibiting semantic or structural hallucinations. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidate restorations. However, effective alignment in Real-ISR is hindered by three coupled challenges: (i) the lack of an LR-referenced faithfulness signal that is robust to degradation yet sensitive to localized hallucinations, (ii) a rollout-group optimization bottleneck where scalarizing heterogeneous rewards before normalization compresses objective-wise contrasts and weakens DiffusionNFT-style reward-weighted updates, and (iii) limited coverage of real degradations, which restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-invariant and hallucination-sensitive LR-referenced evaluator trained with content-consistent degradation pools and original-inpainted hard negatives; a decoupled reward normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion; and LucidLR, a large-scale collection of real-world degraded images for robust RL fine-tuning. Extensive experiments show that LucidNFT improves perceptual quality on strong flow-based Real-ISR baselines while generally maintaining LR-referenced consistency across diverse real-world scenarios.
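
The contrast between scalarizing rewards before normalization and normalizing each objective separately can be illustrated with a small sketch. This is not the LucidNFT implementation: the z-score normalization, the equal weights, and the reward values are assumptions chosen only to show how a large-scale reward can dominate a fused score.

```python
import statistics

def zscore(values, eps=1e-8):
    """Standardize one list of scores within a single rollout group."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def fuse_then_normalize(rewards, weights):
    """Baseline: scalarize the rewards first, then normalize the sum."""
    fused = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return zscore(fused)

def normalize_then_fuse(rewards, weights):
    """Decoupled variant: normalize each objective on its own, then fuse."""
    channels = [list(ch) for ch in zip(*rewards)]  # one list per objective
    normed = [zscore(ch) for ch in channels]
    return [sum(w * ch[i] for w, ch in zip(weights, normed))
            for i in range(len(rewards))]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical rollout group: (quality score, LR-consistency score) per candidate.
# Candidate 0 is the sharpest but least consistent with the LR input.
group = [(82.0, 0.10), (80.0, 0.90), (81.0, 0.20), (79.0, 0.80)]
weights = (0.5, 0.5)  # assumed equal weighting

coupled = fuse_then_normalize(group, weights)
decoupled = normalize_then_fuse(group, weights)
best_coupled = argmax(coupled)
best_decoupled = argmax(decoupled)
```

In this toy group the fused-first score prefers candidate 0 because the quality reward has the larger numeric range, while per-objective normalization lets the consistency reward count equally and prefers candidate 1.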

2602.22347 2026-05-13 cs.CV cs.AI

Enabling clinical use of foundation models for computational pathology

Audun L Henriksen, Ole-Johan Skrede, Lisa van der Schee, Enric Domingo, Karolina Cyll, Sepp de Raedt, Ilyá Kostolomov, Jennifer Hay, Wanja Kildal, Joakim Kalsnes, Robert W Williams, Manohar Pradhan, John Arne Nesheim, Hanne Askautrud, Maria Isaksen, Karmele Saez de Gordoa, Miriam Cuatrecasas, Joanne Edwards, TransSCOT group, Arild Nesbakken, Neil A Shepherd, Ian Tomlinson, Daniel-Christoph Wagner, Rachel Kerr, Tarjei Sveinsgjerd Hveem, Knut Liestøl, Yoshiaki Nakamura, Marco Novelli, Masaaki Miyo, Sebastian Försch, David N Church, Miangela M Lacle, David J Kerr, Andreas Kleppe

Affiliations Institute for Cancer Genetics and Informatics, Oslo University Hospital; Department of Pathology, University Medical Center Utrecht; Department of Oncology, University of Oxford; CRUK Beatson Institute of Cancer Research, Garscube Estate; Glasgow Tissue Research Facility, University of Glasgow, Queen Elizabeth University Hospital; Area for Improvement and Digital Transformation, Norwegian Offshore Directorate; Pathology Department, Hospital Clínic, Barcelona, Spain; Institut d’Investigacions Biomèdiques August Pi I Sunyer (IDIBAPS), Barcelona, Spain; Department of Clinical Foundations, Universitat de Barcelona; School of Cancer Sciences, Wolfson Wohl Cancer Research Centre, University of Glasgow; Institute of Clinical Medicine, University of Oslo; Department of Gastrointestinal Surgery, Oslo University Hospital

AI summary This study examines how to make foundation models for computational pathology better suited to clinical settings, addressing the problem that current models capture scanner and pre-analytic variation that degrades downstream task performance. It introduces new robustness losses during downstream model training to reduce sensitivity to technical variation, and validates the approach through experiments on a large set of clinical pathology images. Without retraining the foundation models, the method improves robustness and classification accuracy, supporting the development of deep learning systems better suited to real clinical environments.

English abstract

Foundation models for computational pathology are expected to facilitate the development of high-performing, generalisable deep learning systems. However, in addition to biologically relevant features, current foundation models also capture pre-analytic and scanner-specific variation that biases the predictions made by downstream task-specific models trained on these features. Here we show that introducing novel robustness losses during downstream model training reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 whole-slide images from 6,155 patients is used to train thousands of models from the features of eight well-known foundation models for computational pathology. In addition to a substantial improvement in robustness, our approach improves classification accuracy by focusing on biologically relevant features. It mitigates robustness limitations of foundation models for computational pathology without retraining the foundation models themselves, enabling development of models that are more suitable for real-world clinical use.

2602.21625 2026-05-13 cs.RO

Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map

Lei Su, Zhijie Peng, Renyuan Ren, Shengping Mao, Juan Du, Kaifeng Zhang, Xuezhou Zhu

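Affiliations Sharpa; HKUST; NVIDIA

AI summary This paper presents Tacmap, a high-fidelity, computationally efficient tactile simulation framework that targets the sim-to-real gap faced by vision-based tactile sensors in robotic manipulation. The method aligns simulation and the real world in a shared geometric space through a unified deform map representation: in simulation it computes 3D contact volumes from volumetric penetration depth, and in the real world it uses an automated data-collection rig to learn a mapping from tactile images to ground-truth depth maps. Experiments show that Tacmap closely matches real measurements across a range of contact scenarios and achieves zero-shot transfer from simulation to a physical robot.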

Comments 8 pages

English abstract

Vision-Based Tactile Sensors (VBTS) are essential for achieving dexterous robotic manipulation, yet the tactile sim-to-real gap remains a fundamental bottleneck. Current tactile simulations suffer from a persistent dilemma: simplified geometric projections lack physical authenticity, while high-fidelity Finite Element Methods (FEM) are too computationally prohibitive for large-scale reinforcement learning. In this work, we present Tacmap, a high-fidelity, computationally efficient tactile simulation framework anchored in volumetric penetration depth. Our key insight is to bridge the tactile sim-to-real gap by unifying both domains through a shared deform map representation. Specifically, we compute 3D intersection volumes as depth maps in simulation, while in the real world, we employ an automated data-collection rig to learn a robust mapping from raw tactile images to ground-truth depth maps. By aligning simulation and real-world in this unified geometric space, Tacmap minimizes domain shift while maintaining physical consistency. Quantitative evaluations across diverse contact scenarios demonstrate that Tacmap's deform maps closely mirror real-world measurements. Moreover, we validate the utility of Tacmap through an in-hand rotation task, where a policy trained exclusively in simulation achieves zero-shot transfer to a physical robot.
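
As a rough illustration of what a penetration depth map is, the sketch below computes one for the simplest possible contact: a sphere pressed into a flat pad. It is not Tacmap's simulator; the flat pad at z = 0, the grid layout, and all names are assumptions made for this example.

```python
import math

def sphere_penetration_depth_map(center, radius, size=9, pitch=1.0):
    """Depth map of a sphere pressed into a flat pad lying in the plane z = 0.

    The pad is sampled on a size x size grid with spacing `pitch`, centred at
    the origin. Each cell stores how far the sphere surface lies below z = 0,
    or 0.0 where the sphere does not reach the pad.
    """
    cx, cy, cz = center
    half = (size - 1) / 2.0
    depth = []
    for i in range(size):
        row = []
        for j in range(size):
            x = (j - half) * pitch
            y = (i - half) * pitch
            r2 = (x - cx) ** 2 + (y - cy) ** 2
            if r2 >= radius ** 2:
                row.append(0.0)  # outside the sphere's footprint
                continue
            lowest_z = cz - math.sqrt(radius ** 2 - r2)
            row.append(max(0.0, -lowest_z))
        depth.append(row)
    return depth

# Hypothetical contact: radius 3, centre 2 above the pad, so the deepest
# penetration is radius - height = 1 at the middle of the grid.
depth_map = sphere_penetration_depth_map(center=(0.0, 0.0, 2.0), radius=3.0)
max_depth = max(max(row) for row in depth_map)
```

A real simulator would intersect arbitrary meshes with the sensor geometry, but the output has the same shape: one non-negative depth value per sensor cell.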

2602.15473 2026-05-13 cs.LG

POP: Prior-Fitted First-Order Optimization Policies

Jan Kobiolka, Christian Frey, Gresa Shala, Arlind Kadra, Erind Bedalli, Josif Grabocka

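Affiliations Department of Computer Science; University of Technology Nuremberg; Albert Ludwig University of Freiburg; University of Elbasan

AI summary This paper proposes POP, a meta-learned, reinforcement-learning-based optimization policy that predicts adaptive learning rates for gradient descent. The policy learns from contextual information in the optimization trajectory, and the method introduces a new reward function, a function-scaling strategy, and a prior distribution for generating large numbers of synthetic optimization problems. Experiments show that POP significantly outperforms conventional gradient-based optimizers on a benchmark of 43 optimization functions of varying complexity, and generalizes well without task-specific tuning.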

Comments Under Review

English abstract

Gradient-based optimizers are highly sensitive to design choices in their adaptive learning rate mechanisms. To address this limitation, we introduce POP, a meta-learned Reinforcement Learning (RL) policy that predicts adaptive learning rates for gradient descent, conditioned on the contextual information provided in the optimization trajectory. Our method introduces a novel RL reward formulation, a new function-scaling strategy for in-distribution generalization, and a novel prior that is used to sample millions of synthetic optimization problems. We evaluate POP on an established benchmark including 43 optimization functions of various complexity, where it significantly outperforms gradient-based methods. Our evaluation demonstrates strong generalization capabilities without task-specific tuning.

2602.13690 2026-05-13 cs.LG

Physics Aware Neural Networks: Denoising for Magnetic Navigation

Aritra Das, Yashas Shende, Muskaan Chugh, Reva Laxmi Chauhan, Arghya Pathak, Debayan Gupta

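Affiliations Ashoka University

AI summary This paper studies denoising for navigation with geomagnetic anomalies when GPS is unavailable, and proposes a neural network approach based on physical constraints. It introduces two constraints, divergence-free vector fields and E(3)-equivariance, so that the learned magnetic field obeys Maxwell's equations and transforms correctly in space. Using synthetic datasets and a comparison of several neural network architectures, the experiments show that the method outperforms conventional approaches in predictive accuracy and physical plausibility.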

English abstract

Magnetic-anomaly navigation, leveraging small-scale variations in the Earth's magnetic field, is a promising alternative when GPS is unavailable or compromised. Airborne systems face a key challenge in extracting geomagnetic field data: the aircraft itself induces magnetic noise. Although the classical Tolles-Lawson model addresses this, it inadequately handles stochastically corrupted magnetic data required for navigation. To handle stochastic noise, we propose using two physics-based constraints: divergence-free vector fields and E(3)-equivariance. These ensure the learned magnetic field obeys Maxwell's equation and that outputs transform correctly with sensor position and orientation. The divergence-free constraint is implemented by training a neural network to output a vector potential A, with the magnetic field defined as its curl. For E(3)-equivariance, we use tensor products of geometric tensors represented via spherical harmonics with known rotational transformations. Enforcing physical consistency and restricting the admissible function space acts as an implicit regularizer that improves spatiotemporal performance. We present ablation studies evaluating each constraint alone and jointly across CNNs, MLPs, LTCs, and Contiformers. Continuous-time dynamics and long-term memory are critical for modelling magnetic time series; the Contiformer, which provides both, outperforms existing methods. To mitigate data scarcity, we generate synthetic datasets using the World Magnetic Model (WMM) and time-series conditional GANs, producing realistic, temporally consistent magnetic sequences across varied trajectories and environments. Experiments show that embedding these constraints significantly improves predictive accuracy and physical plausibility, outperforming classical and unconstrained deep learning approaches. Acknowledgement: This work was done in collaboration with Dirac Labs.
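
The divergence-free construction rests on the vector identity that the divergence of a curl is zero, so any field defined as B = curl A satisfies div B = 0 regardless of what A is. The sketch below checks this numerically with central differences. The hand-written `vector_potential` is a hypothetical stand-in for a network output, and the step size and test point are arbitrary.

```python
import math

H = 1e-3  # finite-difference step (an arbitrary choice for this sketch)

def vector_potential(x, y, z):
    """A hypothetical smooth vector potential A; stands in for a network."""
    return (math.sin(x * y), x * x * z + y * y, math.cos(x) * y * z)

def partial(f, comp, axis, p):
    """Central difference of component `comp` of field `f` along `axis`."""
    hi = list(p)
    lo = list(p)
    hi[axis] += H
    lo[axis] -= H
    return (f(*hi)[comp] - f(*lo)[comp]) / (2 * H)

def curl(f):
    """Return the field curl f as a function of (x, y, z)."""
    def b(x, y, z):
        p = (x, y, z)
        return (
            partial(f, 2, 1, p) - partial(f, 1, 2, p),  # dAz/dy - dAy/dz
            partial(f, 0, 2, p) - partial(f, 2, 0, p),  # dAx/dz - dAz/dx
            partial(f, 1, 0, p) - partial(f, 0, 1, p),  # dAy/dx - dAx/dy
        )
    return b

def divergence(f, p):
    """Sum of d f_k / d x_k over the three axes."""
    return sum(partial(f, k, k, p) for k in range(3))

point = (0.3, -1.2, 0.7)
magnetic_field = curl(vector_potential)
div_b = divergence(magnetic_field, point)    # vanishes up to round-off
div_a = divergence(vector_potential, point)  # generally nonzero
```

The potential itself has a clearly nonzero divergence at the test point, while its curl does not, which is why parameterizing the output as a curl enforces the constraint by construction instead of through a penalty term.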

2602.12139 2026-05-13 cs.LG

Oscillators Are All You Need: Irregular Time Series Modelling via Damped Harmonic Oscillators with Closed-Form Solutions

Yashas Shende, Aritra Das, Reva Laxmi Chauhan, Arghya Pathak, Debayan Gupta

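Affiliations Department of Physics; Ashoka University; Department of Computer Science

AI summary This paper proposes a new time series modelling method based on damped harmonic oscillators for data sampled at irregular intervals. By modelling the Transformer's keys and values as driven, damped oscillators and expanding the query in a sinusoidal basis, it interprets attention as a resonance phenomenon, preserving expressivity while avoiding the computational cost of conventional ODE solvers. The method comes with theoretical guarantees and achieves accurate, efficient performance on irregular time series tasks.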

English abstract

Transformers excel at time series modelling through attention mechanisms that capture long-term temporal patterns. However, they assume uniform time intervals and therefore struggle with irregular time series. Neural Ordinary Differential Equations (NODEs) effectively handle irregular time series by modelling hidden states as continuously evolving trajectories. ContiFormers (arXiv:2402.10635) combine NODEs with Transformers, but inherit the computational bottleneck of the former by using heavy numerical solvers. This bottleneck can be removed by using a closed-form solution for the given dynamical system - but this is known to be intractable in general! We obviate this by replacing NODEs with a novel linear damped harmonic oscillator analogy - which has a known closed-form solution. We model keys and values as damped, driven oscillators and expand the query in a sinusoidal basis up to a suitable number of modes. This analogy naturally captures the query-key coupling that is fundamental to any transformer architecture by modelling attention as a resonance phenomenon. Our closed-form solution eliminates the computational overhead of numerical ODE solvers while preserving expressivity. We prove that this oscillator-based parameterisation maintains the universal approximation property of continuous-time attention; specifically, any discrete attention matrix realisable by ContiFormer's continuous keys can be approximated arbitrarily well by our fixed oscillator modes. Our approach delivers both theoretical guarantees and scalability, achieving state-of-the-art performance on irregular time series benchmarks while being orders of magnitude faster. Acknowledgement: This work was done in collaboration with Dirac Labs.
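
To make the appeal of a closed-form solution concrete, the sketch below evaluates a damped harmonic oscillator at irregular timestamps in two ways: directly from the textbook formula and with a fixed-step RK4 solver. It covers only the unforced, underdamped case x'' + 2*gamma*x' + omega0^2*x = 0, not the driven oscillators or the attention mechanism described in the abstract; the parameter values and timestamps are made up.

```python
import math

def closed_form_position(t, x0, v0, gamma, omega0):
    """Underdamped solution of x'' + 2*gamma*x' + omega0**2 * x = 0 at time t."""
    omega_d = math.sqrt(omega0 ** 2 - gamma ** 2)  # requires gamma < omega0
    envelope = math.exp(-gamma * t)
    return envelope * (x0 * math.cos(omega_d * t)
                       + (v0 + gamma * x0) / omega_d * math.sin(omega_d * t))

def rk4_position(t_end, x0, v0, gamma, omega0, steps=2000):
    """Reference answer from a fixed-step RK4 solver started at t = 0."""
    def deriv(x, v):
        return v, -2.0 * gamma * v - omega0 ** 2 * x
    dt = t_end / steps
    x, v = x0, v0
    for _ in range(steps):
        k1x, k1v = deriv(x, v)
        k2x, k2v = deriv(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = deriv(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = deriv(x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
    return x

irregular_times = [0.13, 0.5, 0.74, 2.9, 3.05]  # unevenly spaced samples
params = dict(x0=1.0, v0=0.0, gamma=0.3, omega0=2.0)
closed = [closed_form_position(t, **params) for t in irregular_times]
numeric = [rk4_position(t, **params) for t in irregular_times]
max_gap = max(abs(a - b) for a, b in zip(closed, numeric))
```

The formula costs a handful of floating-point operations per timestamp no matter how far apart the samples are, whereas the solver above takes thousands of steps per query to reach comparable accuracy.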

2602.11126 2026-05-13 cs.LG

The Offline-Frontier Shift: Diagnosing Distributional Limits in Generative Multi-Objective Optimization

Stephanie Holly, Alexandru-Ciprian Zăvoianu, Siegfried Silber, Sepp Hochreiter, Werner Zellinger

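Affiliations LIT AI Lab and Institute for Machine Learning, JKU Linz, Austria; School of Computing, Engineering and Technology, RGU, Aberdeen, Scotland; Linz Center of Mechatronics GmbH, Linz, Austria

AI summary This paper studies the distributional limits of generative methods in offline multi-objective optimization, showing that although generative models perform well on hypervolume, they clearly lag behind evolutionary algorithms on other key metrics such as generational distance. The gap is traced to the displacement between the offline dataset and the Pareto front, the "offline-frontier shift", which constitutes a fundamental limitation of offline multi-objective optimization. The authors argue that overcoming it requires out-of-distribution sampling in objective space, and observe that generative methods stay conservatively close to the offline objective distribution and struggle to move beyond the boundary of the data.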

English abstract

Offline multi-objective optimization (MOO) aims to recover Pareto-optimal designs given a finite, static dataset. Recent generative approaches, including diffusion models, show strong performance under hypervolume, yet their behavior under other established MOO metrics is less understood. We show that generative methods systematically underperform evolutionary alternatives with respect to other metrics, such as generational distance. We relate this failure mode to the offline-frontier shift, i.e., the displacement of the offline dataset from the Pareto front, which acts as a fundamental limitation in offline MOO. We argue that overcoming this limitation requires out-of-distribution sampling in objective space (via an integral probability metric) and empirically observe that generative methods remain conservatively close to the offline objective distribution. Our results position offline MOO as a distribution-shift-limited problem and provide a diagnostic lens for understanding when and why generative optimization methods fail.
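
Generational distance measures how far a set of candidate solutions sits from a reference Pareto front in objective space. The sketch below uses the simple mean-of-nearest-distances form; other definitions aggregate the distances with a p-norm, and the abstract does not say which variant the paper reports. The linear front and the candidate points are invented for illustration.

```python
import math

def generational_distance(solutions, reference_front):
    """Mean Euclidean distance from each solution to its nearest front point."""
    total = 0.0
    for s in solutions:
        total += min(math.dist(s, r) for r in reference_front)
    return total / len(solutions)

# Hypothetical two-objective minimization problem with front f2 = 1 - f1.
front = [(i / 10, 1 - i / 10) for i in range(11)]
near_front = [(0.2, 0.8), (0.5, 0.5)]    # candidates lying on the front
offline_like = [(0.5, 0.9), (0.8, 0.6)]  # candidates displaced from the front

gd_near = generational_distance(near_front, front)
gd_offline = generational_distance(offline_like, front)
```

A method whose samples stay close to an offline dataset that is itself displaced from the front behaves like `offline_like` here: its generational distance remains bounded away from zero even if other metrics look favourable.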

2602.09587 2026-05-13 cs.CV cs.AI

MieDB-100k: A Comprehensive Dataset for Medical Image Editing

Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo, Fan Wang, Bohan Zhuang, Shenda Hong

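Affiliations State Key Laboratory of General Artificial Intelligence, Beijing, China; School of Intelligence Science and Technology, Peking University, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China; DAMO Academy, Alibaba Group, Zhejiang, China; Hupan Lab, Zhejiang Province; Zhejiang University, Zhejiang, China

AI summary To address the scarcity of high-quality data for medical image editing, this paper proposes MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing. The dataset categorizes editing tasks from the perspectives of Perception, Modification, and Transformation, covering both understanding and generation abilities; it is built with expert models and rule-based synthesis and undergoes rigorous manual review to ensure clinical fidelity. Experiments show that models trained on it outperform existing open-source and proprietary models in both performance and generalization, providing a foundation for research on medical image editing.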

English abstract

The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.

2602.09368 2026-05-13 cs.RO

Certified Gradient-Based Contact-Rich Manipulation via Smoothing-Error Reachable Tubes

Wei-Chen Li, Glen Chou

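Affiliations Georgia Institute of Technology

AI summary This paper studies how to achieve gradient-based controller optimization with safety guarantees in contact-rich manipulation by smoothing the dynamics model and compensating for the resulting error. The core method smooths both contact dynamics and contact geometry in a differentiable simulator, quantifies the model mismatch as a set-valued discrepancy, and uses analytical reachable sets to optimize time-varying affine feedback policies, yielding robust constraint satisfaction for the closed-loop system under the original nonsmooth dynamics. The method is validated on several contact-rich tasks, where it shows lower safety violation rates and smaller goal errors.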

Comments Robotics: Science & Systems (RSS) 2026

English abstract

Gradient-based methods can efficiently optimize controllers by leveraging differentiable simulation and physical priors. However, contact-rich manipulation remains challenging because hybrid contact dynamics often produce discontinuous or vanishing gradients. Although smoothing the dynamics can restore informative gradients, the resulting model mismatch can cause controller failures when deployed on real systems. We address this trade-off by planning with smoothed dynamics while explicitly quantifying and compensating for the induced error, providing formal guarantees on safety and task completion under the original nonsmooth dynamics. Our approach applies smoothing to both contact dynamics and contact geometry within a differentiable simulator based on convex optimization, allowing us to characterize the deviation from the nonsmooth dynamics as a set-valued discrepancy. We incorporate this discrepancy into the optimization of time-varying affine feedback policies through analytical reachable sets, enabling robust constraint satisfaction for the closed-loop hybrid system while relying solely on the informative gradients of the smoothed model. By bridging differentiable simulation with set-valued robust control, our method produces affine feedback policies that respect the unilateral nature of contact. We evaluate our method on several contact-rich tasks, including planar pushing, object rotation, and in-hand dexterous manipulation, achieving certified constraint satisfaction with lower safety violations and smaller goal errors than baseline approaches.

2602.08813 2026-05-13 cs.LG

Robust Policy Optimization to Prevent Catastrophic Forgetting

Mahdi Sabbaghi, George Pappas, Adel Javanmard, Hamed Hassani

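Affiliations University of Pennsylvania; University of Southern California

AI summary This paper studies "catastrophic forgetting" in multi-stage fine-tuning of large language models, where later updates can undo previously learned behaviors such as safety. To address this, the authors propose FRPO, a robust reinforcement learning framework that optimizes reward over a KL-divergence neighborhood of the policy so that the policy remains stable under subsequent fine-tuning. Experiments show that the method reduces safety degradation across multiple base models and downstream tasks while preserving task performance.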

English abstract

Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.

2602.05830 2026-05-13 cs.AI cs.LG

Learning Compact Boolean Networks

Shengpu Wang, Yuhao Mao, Yani Zhang, Martin Vechev

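Affiliations Department of Information Technology and Electrical Engineering; Department of Computer Science

AI summary This paper studies how to learn Boolean networks that are both compact and accurate, to meet the need for efficient inference in resource-constrained settings. To handle the learning difficulty caused by the discrete structure of Boolean networks, the authors propose three complementary contributions: a parameter-free strategy for learning effective connections, a compact convolutional Boolean architecture that exploits spatial locality, and an adaptive discretization procedure that reduces the accuracy loss when a continuous network is discretized. Experiments show a better accuracy-compute trade-off on several vision tasks, with up to 47× fewer Boolean operations than existing methods, along with higher accuracy and lower inference latency on an FPGA.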

English abstract

Floating-point neural networks dominate modern machine learning but incur substantial inference costs, motivating emerging interest in Boolean networks for resource-constrained deployments. Since Boolean networks use only Boolean operations, they can achieve nanosecond-scale inference latency. However, learning Boolean networks that are both compact and accurate remains challenging because of their discrete, combinatorial structure. In this work we address this challenge via three novel, complementary contributions: (i) a new parameter-free strategy for learning effective connections, (ii) a novel compact convolutional Boolean architecture that exploits spatial locality while requiring fewer Boolean operations than existing convolutional kernels, and (iii) an adaptive discretization procedure that reduces the accuracy drop incurred when converting a continuously relaxed network into a discrete Boolean network. Across standard vision benchmarks, our method improves the Pareto frontier over prior state-of-the-art methods, achieving higher accuracy with up to 47× fewer Boolean operations. This advantage also extends to other modalities. Further, on an FPGA, our model on MNIST achieves 99.38% accuracy with 6.48 ns latency, surpassing the prior state-of-the-art in both accuracy and runtime, while generating a 7× smaller circuit. Code and models are available at https://github.com/eth-sri/CompactLogic.

2602.04476 2026-05-13 cs.CV

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

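AI summary Although multi-modal large language models have advanced on a variety of understanding tasks, they still fall short on problems that require multi-step reasoning, mainly because visual information is progressively diluted during long-context generation. This paper proposes Vision-aligned Latent Reasoning (VaLR), a reasoning framework that dynamically generates vision-aligned latent tokens before each reasoning step, guiding the model to reason from perceptual cues in the latent space. Experiments show that VaLR performs strongly on several benchmarks requiring long-context understanding and precise visual perception, raising performance on VSI-Bench from 33.0% to 52.9%, well above existing models.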

Comments Published as conference proceeding for ICML 2026. Last two authors advised equally

English abstract

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.

2602.02282 2026-05-13 cs.LG

MoLF: Mixture-of-Latent-Flow for Pan-Cancer Spatial Gene Expression Prediction from Histology

Susu Hu, Stefanie Speidel

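Affiliations Translational Surgical Oncology, National Center for Tumor Diseases (NCT/UCC) Dresden, Germany; Faculty of Medicine University Hospital Carl Gustav Carus, Dresden University of Technology; German Cancer Research Center (DKFZ), Heidelberg, Germany

AI summary This study proposes MoLF, a generative model that predicts the spatial distribution of gene expression across cancer types from histology images. MoLF uses a conditional flow matching objective with a mixture-of-experts architecture to map noise to the gene latent space, handling the heterogeneity of different cancer types. Experiments show that MoLF outperforms existing methods on pan-cancer benchmarks and generalizes zero-shot to cross-species data, suggesting that it captures conserved histo-molecular mechanisms.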

Comments Accepted at Proceedings 43rd International Conference on Machine Learning, Seoul, South Korea

Journal ref Proceedings 43rd International Conference on Machine Learning 2026

English abstract

Inferring spatial transcriptomics (ST) from histology enables scalable histogenomic profiling, yet current methods are largely restricted to single-tissue models. This fragmentation fails to leverage biological principles shared across cancer types and hinders application to data-scarce scenarios. While pan-cancer training offers a solution, the resulting heterogeneity challenges monolithic architectures. To bridge this gap, we introduce MoLF (Mixture-of-Latent-Flow), a generative model for pan-cancer histogenomic prediction. MoLF leverages a conditional Flow Matching objective to map noise to the gene latent manifold, parameterized by a Mixture-of-Experts (MoE) velocity field. By dynamically routing inputs to specialized sub-networks, this architecture effectively decouples the optimization of diverse tissue patterns. Our experiments demonstrate that MoLF establishes a new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. Furthermore, MoLF exhibits zero-shot generalization to cross-species data, suggesting it captures fundamental, conserved histo-molecular mechanisms.

2602.00297 2026-05-13 cs.LG

From Observations to States: Latent Time Series Forecasting

Jie Yang, Yifan Hu, Yuante Li, Kexin Zhang, Kaize Ding, Philip S. Yu

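Affiliations University of Illinois Chicago; Tsinghua University; Carnegie Mellon University; Northwestern University

AI summary This paper studies a representation paradox in time series forecasting: models that predict accurately often learn temporally disordered latent representations. The authors propose Latent Time Series Forecasting (LatentTSF), which maps observations into a latent state space and forecasts in that space, so the model learns more structured temporal dynamics. Experiments show that the method mitigates latent chaos and yields clear improvements in both forecasting accuracy and representation quality.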

Comments Accepted at ICML 2026

English abstract

Deep learning has achieved strong performance in Time Series Forecasting (TSF). However, we identify a critical representation paradox, termed Latent Chaos: models with accurate predictions often learn latent representations that are temporally disordered and lack continuity. We attribute this to the dominant observation-space forecasting paradigm, where minimizing point-wise errors on noisy and partially observed data encourages shortcut solutions instead of the recovery of underlying system dynamics. To address this, we propose Latent Time Series Forecasting (LatentTSF), a paradigm that shifts TSF from observation regression to latent state prediction. LatentTSF employs an AutoEncoder to project each observation into a learned latent state space and performs forecasting entirely in this space, allowing the model to focus on learning structured temporal dynamics. We provide an information-theoretic analysis showing that the latent objectives can be motivated as surrogates for maximizing mutual information between predicted and ground-truth latent states and future observations. Extensive experiments on widely-used benchmarks confirm that LatentTSF effectively mitigates latent chaos, yielding consistent improvements in both forecasting accuracy and representation quality. Our code is available at https://github.com/Muyiiiii/LatentTSF.

2601.09448 2026-05-13 cs.SD cs.AI

One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization

Ioannis Stylianou, Jon Francombe, Pablo Martinez-Nuevo, Sven Ewan Shepstone, Zheng-Hua Tan

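Affiliations Bang & Olufsen A/S, Struer, Denmark; Department of Electronic Systems, Aalborg University; Pioneer Centre for AI, Copenhagen, Denmark

AI summary This paper proposes a large language model (LLM)-based audio equalization method that maps natural language prompts to equalization settings, enabling conversational control of sound systems. Using data from a controlled listening experiment together with in-context learning and parameter-efficient fine-tuning, the models reliably align with population-preferred equalization settings. Results show significant improvements in distributional alignment over random sampling and static preset baselines, indicating the potential of LLMs as "artificial equalizers" and pointing toward more accessible, context-aware, and expert-level audio tuning.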

Comments 13 pages, 15 figures, 2 tables, IEEE JSTSP submission

English abstract

Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users' varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as "artificial equalizers," contributing to the development of more accessible, context-aware, and expert-level audio tuning methods.

2512.24985 2026-05-13 cs.CV cs.AI cs.LG cs.RO

DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes

Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

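Institutions * Korea Advanced Institute of Science and Technology (KAIST); Pohang University of Science and Technology (POSTECH)

AI summary This paper presents DarkQA, an open-source benchmark for evaluating the visual-primitive question-answering ability of vision-language models in low-light indoor scenes. Using multi-level illumination control, the benchmark generates 9,400 verifiable question-image pairs that simulate realistic illumination drop and sensor noise, exposing the performance degradation of existing models under low light. The study also systematically evaluates a range of vision-language models and low-light image enhancement methods, demonstrating DarkQA's usefulness for analyzing model robustness.

Comments This work has been submitted to the IEEE for possible publication

Abstract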

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and LLIE models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io

2512.20865 2026-05-13 cs.LG cs.SY eess.SY

Robustness Certificates for Neural Networks against Adversarial Attacks

Sara Taheri, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Majid Zamani

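Institutions * LMU Munich; Technical University of Munich; University of Colorado Boulder

AI summary As machine learning is increasingly used in safety-critical domains, the risk from adversarial threats, especially data poisoning attacks, is growing. This paper proposes a formal robustness certification framework that models gradient-based training as a discrete-time dynamical system and adapts barrier certificates from control theory to provide formal worst-case robustness guarantees. The barrier certificates are parameterized as neural networks, and generalization guarantees are derived via a scenario convex program. The framework is the first to unify formal certification against training-time and test-time attacks, and experiments show effective robustness certification on several datasets.

Journal ref IEEE Open Journal of Control Systems, 2026

Abstract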

The increasing use of machine learning in safety-critical domains amplifies the risk of adversarial threats, especially data poisoning attacks that corrupt training data to degrade performance or induce unsafe behavior. Most existing defenses lack formal guarantees or rely on restrictive assumptions about the model class, attack type, extent of poisoning, or point-wise certification, limiting their practical reliability. This paper introduces a principled formal robustness certification framework that models gradient-based training as a discrete-time dynamical system (dt-DS) and formulates poisoning robustness as a formal safety verification problem. By adapting the concept of barrier certificates (BCs) from control theory, we introduce sufficient conditions to certify a robust radius ensuring that the terminal model remains safe under worst-case ${\ell}_p$-norm based poisoning. To make this practical, we parameterize BCs as neural networks trained on finite sets of poisoned trajectories. We further derive probably approximately correct (PAC) bounds by solving a scenario convex program (SCP), which yields a confidence lower bound on the certified robustness radius generalizing beyond the training set. Importantly, our framework also extends to certification against test-time attacks, making it the first unified framework to provide formal guarantees in both training and test-time attack settings. Experiments on MNIST, SVHN, and CIFAR-10 show that our approach certifies non-trivial perturbation budgets while being model-agnostic and requiring no prior knowledge of the attack or contamination level.

2512.17637 2026-05-13 cs.AI cs.FL cs.LO

About Time: Model-free Reinforcement Learning with Timed Reward Machines

Rajarshi Roy, Anirban Majumdar, Ritam Raha, David Parker, Marta Kwiatkowska

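Institutions * University of Liverpool

AI summary In reinforcement learning, reward specification is essential for guiding agent behavior. Reward machines have been introduced to express non-Markovian rewards, but traditional reward machines cannot model precise timing constraints. This paper proposes timed reward machines (TRMs), which build timing constraints into the reward structure and support richer reward logic, such as penalizing delays or rewarding timely actions. Within a model-free reinforcement learning framework (e.g., tabular Q-learning), the work uses abstractions of timed automata and counterfactual-imagining heuristics to learn optimal policies that satisfy the timing constraints, and validates the approach on several benchmark tasks.

Comments Extended version of paper accepted at IJCAI 2026

Abstract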

Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
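To make the idea of a timing-constrained reward structure concrete, here is a minimal sketch of a reward machine with a single integer clock and guarded transitions, in the spirit of the digital semantics mentioned above. It is an illustration only, not the authors' formalism or algorithm: the class names, the single-clock restriction, the guard encoding, and the delivery example are all assumptions.

```python
from dataclasses import dataclass

NO_LIMIT = 10**9  # hypothetical stand-in for "no upper bound" on the clock


@dataclass(frozen=True)
class Transition:
    source: str        # state the machine must be in
    event: str         # label observed from the environment
    min_time: int      # guard: clock >= min_time
    max_time: int      # guard: clock <= max_time
    target: str        # next state
    reward: float      # reward emitted when the transition fires
    reset_clock: bool  # whether the clock restarts at zero


class TimedRewardMachine:
    """Single-clock reward machine with integer (digital) time; illustrative only."""

    def __init__(self, initial, transitions):
        self.state = initial
        self.clock = 0
        self.transitions = transitions

    def step(self, event, elapsed=1):
        """Advance the clock by `elapsed` ticks, then fire the first enabled transition."""
        self.clock += elapsed
        for t in self.transitions:
            if (t.source == self.state and t.event == event
                    and t.min_time <= self.clock <= t.max_time):
                self.state = t.target
                if t.reset_clock:
                    self.clock = 0
                return t.reward
        return 0.0  # no transition enabled: stay put, no reward


# Hypothetical task: reward a delivery within 5 ticks of pickup, penalize a late one.
DELIVERY_TRM = [
    Transition("start", "pickup", 0, NO_LIMIT, "carrying", 0.0, True),
    Transition("carrying", "deliver", 0, 5, "done", 1.0, True),
    Transition("carrying", "deliver", 6, NO_LIMIT, "done", -1.0, True),
]
```

In a Q-learning loop, one would typically learn over the product of the environment state, the machine state, and a bounded clock value, which is roughly what a timed-automaton abstraction provides.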

2512.11114 2026-05-13 cs.LG cs.AI stat.ML

In-Context Multi-Objective Optimization

Xinyu Zhang, Conor Hassan, Julien Martinelli, Daolang Huang, Samuel Kaski

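Institutions * Department of Computer Science, Aalto University, Finland; ELLIS Institute Finland; Department of Computer Science, University of Manchester, UK

AI summary Balancing multiple competing objectives is a pervasive challenge in multi-objective optimization, particularly in areas such as drug design and autonomous systems. This paper proposes TAMO, a fully amortized, universal policy that uses a transformer architecture to perform multi-objective black-box optimization across varying input and objective dimensions without retraining for each task. Pretrained with reinforcement learning, TAMO proposes the next design in a single forward pass, substantially improving computational efficiency, and achieves strong Pareto-front quality on multiple benchmarks and real-world tasks.

Abstract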

Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.
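The training signal mentioned above, hypervolume improvement, can be illustrated for the two-objective case. The sketch below computes the area dominated by a set of points relative to a reference point and the gain from adding one new design. It assumes maximization of both objectives; the function names and the sweep-based computation are illustrative choices and are not taken from the paper.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming both objectives are maximized."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by the reference point `ref`."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    area, prev_y = 0.0, ref[1]
    # Sweep from the largest first objective down; the second objective rises
    # along the front, so each point adds one horizontal strip.
    for x, y in sorted(front, reverse=True):
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def hypervolume_improvement(points, new_point, ref):
    """Gain in dominated area from adding `new_point` to the evaluated set."""
    return hypervolume_2d(points + [new_point], ref) - hypervolume_2d(points, ref)
```

Summing these improvements along a trajectory telescopes to the final hypervolume minus the initial one, which is why the cumulative quantity rewards approximating the whole Pareto frontier rather than a single point.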

2512.07150 2026-05-13 cs.LG cs.AI cs.CV

FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers

Jonghyun Park, Jong Chul Ye

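Institutions * KAIST

AI summary This paper proposes FlowLPS, a training-free inverse problem solver for latent flow models based on Langevin-Proximal Sampling, aimed at the finite-step trade-off that deep generative models face in imaging inverse problems. At each reverse step, the method applies a few Langevin updates to perturb the model-predicted clean estimate, providing posterior-oriented stochastic initializations; it then applies local MAP-style proximal refinement to rapidly improve measurement consistency, and uses controlled pCN-style re-noising to keep the trajectory stable. Experiments show that FlowLPS achieves a good balance between measurement fidelity and perceptual quality on several linear inverse problems.

Abstract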

Deep generative models are powerful priors for imaging inverse problems, but training-free solvers for latent flow models face a practical finite-step trade-off. Optimization-heavy methods quickly improve measurement consistency, but in highly nonlinear latent spaces, their results can depend strongly on where local refinement is initialized, often degrading perceptual realism. In contrast, stochastic sampling methods better preserve posterior exploration, but often require many iterations to obtain sharp, measurement-consistent reconstructions. To address this trade-off, we propose FlowLPS, a training-free latent flow inverse solver based on Langevin-Proximal Sampling. At each reverse step, FlowLPS uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, providing stochastic initializations for local refinement. It then applies local MAP-style proximal refinement to rapidly improve measurement consistency from the Langevin-updated estimate. We additionally use controlled pCN-style re-noising to stabilize the reverse trajectory while retaining trajectory coherence. Experiments on FFHQ and DIV2K across five linear inverse problems show that FlowLPS achieves a strong balance between measurement fidelity and perceptual quality, with additional experiments on pixel-space inverse problems and phase retrieval.

2512.01675 2026-05-13 cs.CV

GRASP: Guided Residual Adapters with Sample-wise Partitioning

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

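Institutions * Friedrich-Alexander-Universität Erlangen-Nürnberg; Imperial College London

AI summary In long-tail settings, text-to-image flow matching transformers show degraded generation quality on tail classes. This paper proposes GRASP, which combines a deterministic partition of the conditioning space with group-specific residual adapters to improve tail-class generation quality while leaving the original optimization objective and sampler unchanged. Experiments show that GRASP markedly improves the diversity and tail-class coverage of generated images on several datasets and outperforms existing methods on downstream classification tasks.

Comments 16 pages, 6 figures, 6 tables

Abstract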

Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning- and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization, suitable as a gradient alignment proxy. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme long-tail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80% and lifts tail-class coverage by up to 44% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.

2511.22663 2026-05-13 cs.CV

AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Hongsheng Li

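Institutions * MMLab, CUHK; Meituan; USTC; TJU

AI summary Unified multimodal models have made notable progress in image generation and understanding, but conflicting objectives between the tasks make it hard to establish an optimal training paradigm. To ease these conflicts, existing methods mostly rely on architecture decoupling, which can cost the model its interleaved generation ability. This paper proposes a strategy that avoids architecture decoupling: by analyzing models' cross-modal attention behavior, it shows that decoupling improves performance essentially by steering models toward task-specific interaction patterns, and it introduces the Attention Interaction Alignment (AIA) loss, which refines cross-modal attention patterns and improves both generation and understanding performance.

Comments Project page: https://zhengdian1.github.io/AIA-project/ Code: https://github.com/zhengdian1/AIA

Abstract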

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm, due to inherently conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward the cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

2511.18152 2026-05-13 cs.CV cs.AI

UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration

Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, Chengyu Fang, Yunlong Lin, Yulun Zhang, Fengyang Xiao, Sina Farsiu

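Institutions * Duke University; Shanghai Jiao Tong University; Peking University; Tsinghua University; Xiamen University

AI summary This paper proposes UnfoldLDM, a blind image restoration method that addresses the shortcomings of existing deep unfolding networks in modeling unknown degradations and in over-smoothing. The method combines deep unfolding networks with a latent diffusion model: a multi-granularity degradation-aware module estimates the unknown degradation, and a degradation-resistant diffusion model together with an over-smoothing correction module recovers high-frequency details and textures. Experiments show that UnfoldLDM performs strongly on a variety of blind image restoration tasks and is compatible with existing methods as a general framework.

Comments 6 figures, 11 tables

Abstract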

Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) degradation-specific dependency, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) over-smoothing bias, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.

2511.16814 2026-05-13 cs.AI cs.HC

Stable diffusion models reveal a persisting human and AI gap in visual creativity

Silvia Rondini, Claudia Alvarez-Martin, Paula Angermair-Barkai, Olivier Penacchio, M. Paz, Matthew Pelowski, Dan Dediu, Antoni Rodriguez-Fornells, Xim Cerda-Company

Institutions * Cognition and Brain Plasticity Unit, Bellvitge Biomedical Research Institute; Bridging AI and Neuroscience, Computer Vision Center; Department of Cognition, Development and Educational Psychology, University of Barcelona; Vienna Cognitive Science Hub; Faculty of Psychology, University of Vienna; Computer Science Department, Universitat Autonoma de Barcelona; University of Barcelona Institute for Complex Systems (UBICS); Department of Catalan Philology and General Linguistics, University of Barcelona; Catalan Institution for Research and Advanced Studies (ICREA); Aix-Marseille University; Institute of Neurosciences (UBNeuro), University of Barcelona

AI summary Although recent research suggests that large language models now match human creativity in divergent thinking tasks, visual creativity has not been studied systematically. This study compares images produced by visual artists, non-artists, and a generative AI model under two prompting conditions (Human Inspired and Self Guided). It finds that humans still clearly outperform AI in visual creativity, and that the AI's creativity improves with more human guidance but still falls short of the non-artist level. The study also reveals clearly different creativity judgment patterns between human and AI raters, suggesting that visual creativity depends on perceptual nuance and contextual sensitivity, capacities that may not transfer readily from language models to visual generative models.

Journal ref Advanced Science, 2026, e24142

Abstract

While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared image generation by human participants (Visual Artists and Non Artists) and by an image-generation AI model under two prompting conditions with varying human input (high for Human Inspired, low for Self Guided). Human raters (N=255) and GPT-4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language-centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.

2511.11935 2026-05-13 cs.LG

SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis

Munib Mesinovic, Tingting Zhu

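Institutions * Department of Engineering Science, University of Oxford

AI summary SurvBench is an open-source preprocessing pipeline that provides standardized data processing for survival analysis on multi-modal electronic health records (EHRs). It addresses the difficulty of comparing deep-learning survival models on EHR data by unifying the preprocessing steps, including cohort definition, time discretisation, missingness handling, and censoring rules. SurvBench supports several critical-care databases and multiple input modalities, offers a unified configuration interface and support for cross-dataset validation, and provides a reliable benchmark platform for future multi-modal EHR survival analysis research.

Abstract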

Deep-learning survival models for electronic health record (EHR) data are hard to compare across papers because the upstream preprocessing step, which includes cohort definition, time discretisation, missingness handling, and censoring rules, is typically undocumented and inconsistent. A reported difference in concordance between two mortality models can therefore reflect any of these choices rather than a modelling contribution. We present SurvBench, an open-source preprocessing pipeline that converts raw PhysioNet exports into model-ready tensors for survival analysis. SurvBench covers four critical-care databases (MIMIC-IV, eICU, MC-MED, HiRID) and four input modalities: time-series vitals and laboratory values, static demographics, International Classification of Diseases (ICD) codes, and radiology report embeddings. Every preprocessing decision is controlled through YAML configuration. Imputation, scaling, and feature filtering are fit on the training fold only. Missingness is recorded as a binary mask alongside each feature tensor. The pipeline handles single-risk endpoints (in-hospital and in-ICU mortality) and competing-risks endpoints (a three-way emergency-department admission pathway, with home discharge treated as administrative censoring). We also provide support for harmonised cross-dataset external validation between eICU and MIMIC-IV. SurvBench is publicly available at https://github.com/munibmesinovic/SurvBench, providing a robust platform that future deep-learning EHR survival work, especially nascent multi-modal approaches, can be measured against under matched preprocessing.
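Two of the preprocessing rules described above, fitting transforms on the training fold only and recording missingness as a binary mask, can be sketched in a few lines. This is a generic illustration and not SurvBench's actual code or API: the function names, the use of `None` for missing values, mean imputation, and the mask convention (1 = missing) are all assumptions.

```python
import math


def fit_stats(train_rows):
    """Per-feature (mean, std) from observed training values only."""
    stats = []
    for j in range(len(train_rows[0])):
        observed = [row[j] for row in train_rows if row[j] is not None]
        if not observed:
            stats.append((0.0, 1.0))  # feature never observed in the training fold
            continue
        mean = sum(observed) / len(observed)
        var = sum((v - mean) ** 2 for v in observed) / len(observed)
        stats.append((mean, math.sqrt(var) or 1.0))  # avoid dividing by zero
    return stats


def transform(rows, stats):
    """Standardize with training statistics; return (values, missingness mask)."""
    values, mask = [], []
    for row in rows:
        mask.append([1 if v is None else 0 for v in row])  # 1 marks a missing entry
        values.append([
            0.0 if v is None else (v - mean) / std  # mean imputation is 0 after scaling
            for v, (mean, std) in zip(row, stats)
        ])
    return values, mask
```

Calling `fit_stats` on the training fold and then `transform` on every fold keeps validation and test statistics out of the fitted parameters, which is the leakage the train-only rule is meant to prevent.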

2511.11412 2026-05-13 cs.CL cs.CY stat.OT

MajinBook: An open catalogue of digitally mediated world literature

Antoine Mazières, Thierry Poibeau

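Institutions * Lattice (ENS-PSL, CNRS); LEDa (Dauphine-PSL, CNRS)

AI summary This paper introduces MajinBook, an open catalogue of digital literature intended to facilitate the use of shadow libraries (such as Library Genesis and Z-Library) in computational social science and cultural analytics. By linking metadata from these crowd-sourced archives with structured bibliographic data from Goodreads, the authors build a high-precision corpus of more than 539,000 English-language books, enriched with first publication dates, genres, and popularity information. The work uses natively digital EPUB files to ensure machine readability, addresses biases in traditional corpora, and provides secondary datasets for French, German, and Spanish.

Comments 9 pages, 5 figures, 1 table

Abstract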

This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries, such as Library Genesis and Z-Library, for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritises natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.

2511.10670 2026-05-13 cs.CL cs.AI cs.SD

Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Yan Gao, Yazheng Yang, Zhibin Lan, Yidong Chen, Min Zhang, Daimeng Wei, Derek F. Wong, Jinsong Su

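Institutions * School of Informatics, Xiamen University, China; Huawei Translation Services Center, Beijing, China; NLP2CT Lab, Department of Computer and Information Science, University of Macau

AI summary This work tackles fine-grained semantic modeling in code-switching speech translation. It proposes a speech projector with a Mixture-of-Experts (MoE) structure in which language expert groups model the semantic space of each language at a fine granularity. A language-specific loss and an intra-group load balancing loss are introduced to make the model more efficient, and a multi-stage training strategy that draws on existing automatic speech recognition and monolingual translation data strengthens alignment and translation performance. Experiments show that the method clearly outperforms existing models on several datasets, with notable gains in BLEU and COMET.

Comments Accepted to IJCAI 2026 Main Track

Abstract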

Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into a target language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily available automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation performance. To bridge the data gap for smooth domain transfer, a transition loss is employed to improve adaptation to CS scenarios. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach, achieving average improvements of 0.86 BLEU and 0.93 COMET over SeamlessM4T, with maximum improvements of 1.49 BLEU and 1.41 COMET across different test sets.

2511.07767 2026-05-13 cs.LG

Taking the Road Less Scheduled with Adaptive Polyak Steps

Dimitris Oikonomou, Matthew Buchholz, Yuen-Man Pun, Robert M. Gower, Nicolas Loizou

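Institutions * Johns Hopkins University; University of British Columbia; Australian National University; Center for Computational Mathematics; Flatiron Institute, Simons Foundation

AI summary This paper studies adaptive optimization methods that do not require the training horizon in advance, and proposes Polyak step-size rules for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the current loss, gradient, and iterates alone, without manual tuning. By introducing an oracle variant and a robust variant that does not need the optimal values, the authors prove convergence rates for convex Lipschitz objectives. Experiments show that the method performs well on language model pretraining and distillation and is more robust to hyperparameter choices.

Abstract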

Schedule-Free SGD, proposed in [Defazio et al., 2024], achieves optimal convergence rates without requiring the training horizon in advance, by replacing learning rate schedules with a principled form of iterate averaging. However, the method still requires tuning a base learning rate whose optimal value depends on unknown problem constants. In this work, we continue down this road by deriving Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates alone. We first propose an oracle variant that uses per-sample optimal function values and prove an $O(1/\sqrt{t})$ anytime last-iterate rate for convex Lipschitz objectives. We then remove the oracle requirement with a safeguarded variant that replaces the unknown optimal values with any available lower bound, achieving the same rate up to a neighborhood that vanishes under interpolation. Both step sizes reduce to existing Polyak rules for standard SGD when momentum is set to zero, unifying standard and schedule-free Polyak methods. Numerical experiments on language modeling, including pretraining and distillation, show that the proposed methods match or surpass tuned Schedule-Free baselines while offering greater robustness to hyperparameter choices.
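For orientation, the classical Polyak step size for plain gradient descent, which the abstract says the proposed rules reduce to when momentum is zero, can be sketched as follows. This is not the paper's Schedule-Free variant: the function names, the toy quadratic objective, and the use of 0 as the lower bound are assumptions made for illustration.

```python
def polyak_step(loss, grad, lower_bound=0.0, eps=1e-12):
    """Classical Polyak step size: (f(x) - f_lb) / ||g||^2, clipped at zero.

    `lower_bound` stands in for the unknown optimal value; any valid lower
    bound on the loss can be used, in the spirit of a safeguarded rule.
    """
    sq_norm = sum(g * g for g in grad)
    if sq_norm <= eps:
        return 0.0  # gradient is numerically zero: do not move
    return max(loss - lower_bound, 0.0) / sq_norm


def quadratic(x, center):
    """Toy objective f(x) = 0.5 * ||x - center||^2 and its gradient."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    return 0.5 * sum(d * d for d in diff), diff


def run(x, center, steps=60):
    """Gradient descent where each learning rate comes from `polyak_step`."""
    for _ in range(steps):
        loss, grad = quadratic(x, center)
        lr = polyak_step(loss, grad)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x
```

On this quadratic the rule yields a step size of 0.5 at every iteration, so the distance to the minimizer halves each step without any hand-tuned learning rate.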

2510.27055 2026-05-13 cs.CL cs.AI

Detecting Data Contamination in LLMs via In-Context Learning

Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta

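Institutions * NVIDIA

AI summary This paper proposes CoDeC, a method for detecting and quantifying training data contamination in large language models. By measuring how in-context learning affects model performance, the method distinguishes data memorized during training from data outside the training distribution. Experiments show that CoDeC produces interpretable contamination scores that effectively separate seen and unseen datasets, and reveals substantial memorization in open-weight models whose training corpora are undisclosed. The method is simple, automated, and applicable across models and datasets, making it easy to integrate into benchmark evaluations.

Abstract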

We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
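One way to turn the observation above into a number is to count how often a sample's confidence drops once in-context examples from the same dataset are added. The sketch below does this on precomputed log-probabilities. It is a guess at the general shape of such a score, not the paper's exact definition: the function name, the inputs, and the "fraction of drops" aggregation are assumptions, and obtaining the log-probabilities from a model is left out.

```python
def contamination_score(baseline_logprobs, in_context_logprobs):
    """Fraction of samples whose confidence drops when in-context examples are added.

    baseline_logprobs[i]   : model log-probability of sample i on its own
    in_context_logprobs[i] : log-probability of sample i with examples from the
                             same dataset prepended (hypothetical inputs)
    """
    if len(baseline_logprobs) != len(in_context_logprobs):
        raise ValueError("both lists must describe the same samples")
    if not baseline_logprobs:
        raise ValueError("at least one sample is required")
    drops = sum(
        1 for base, ctx in zip(baseline_logprobs, in_context_logprobs) if ctx < base
    )
    return drops / len(baseline_logprobs)
```

Under this reading, a score near 0 would suggest that context mostly helps (unseen data), while a score near 1 would suggest that context mostly disrupts memorized patterns (seen data).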