arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2511.02342 2026-05-18 cs.RO

Whole-body motion planning and safety-critical control for aerial manipulation

Lin Yang, Jinwoo Lee, Domenico Campolo, H. Jin Kim, Jeonghyun Byun

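Affiliations: School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore; Department of Aerospace Engineering, Seoul National University, South Korea

AI summary: This paper studies whole-body motion planning and safety-critical control for aerial manipulator systems. To address obstacle avoidance and dynamic trajectory generation in complex environments, it proposes a planning and control framework based on superquadrics (SQs). By combining differentiable, geometry-accurate surface modeling with a maximum-clearance planner and high-order control barrier functions, the method generates and tracks trajectories efficiently, safely, and smoothly. Experiments show strong performance both in simulation and on a physical platform, surpassing conventional ellipsoid-based baselines.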

Comments To be presented at the 23rd IFAC World Congress 2026

Abstract

Aerial manipulation combines the maneuverability of multirotors with the dexterity of robotic arms to perform complex tasks in cluttered spaces. Yet planning safe, dynamically feasible trajectories remains difficult due to whole-body collision avoidance and the conservativeness of common geometric abstractions such as bounding boxes or ellipsoids. We present a whole-body motion planning and safety-critical control framework for aerial manipulators built on superquadrics (SQs). Using an SQ-plus-proxy representation, we model both the vehicle and obstacles with differentiable, geometry-accurate surfaces. Leveraging this representation, we introduce a maximum-clearance planner that fuses Voronoi diagrams with an equilibrium-manifold formulation to generate smooth, collision-aware trajectories. We further design a safety-critical controller that jointly enforces thrust limits and collision avoidance via high-order control barrier functions. In simulation, our approach outperforms sampling-based planners in cluttered environments, producing faster, safer, and smoother trajectories and exceeding ellipsoid-based baselines in geometric fidelity. Actual experiments on a physical aerial-manipulation platform confirm feasibility and robustness, demonstrating consistent performance across simulation and hardware settings. The video can be found at https://youtu.be/hQYKwrWf1Ak.
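To make the geometric idea concrete, here is a minimal sketch of the standard superquadric inside-outside function, which is below 1 inside the shape, exactly 1 on the surface, and above 1 outside. It is a textbook formula rather than the authors' implementation; the function name, the axis-aligned pose at the origin, and the sample points are assumptions made for illustration.

```python
def superquadric_F(p, scale, eps):
    """Inside-outside function of an axis-aligned superquadric centered at the origin.

    p     : (x, y, z) query point
    scale : (a1, a2, a3) semi-axis lengths
    eps   : (e1, e2) shape exponents; (1, 1) gives an ellipsoid,
            values near 0 give a box-like shape
    """
    # Absolute values keep the fractional powers real in every octant.
    x, y, z = (abs(c) / a for c, a in zip(p, scale))
    e1, e2 = eps
    xy = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy + z ** (2.0 / e1)


# Illustrative check: a point near the corner of a unit box.
corner = (0.8, 0.8, 0.8)
box_like = superquadric_F(corner, (1.0, 1.0, 1.0), (0.2, 0.2))   # < 1: inside
ellipsoid = superquadric_F(corner, (1.0, 1.0, 1.0), (1.0, 1.0))  # > 1: outside
```

With identical semi-axes, the box-like superquadric contains the near-corner point while the ellipsoid does not, so an ellipsoid has to be inflated to enclose a box-shaped body. That inflation is the kind of conservativeness the abstract attributes to ellipsoid abstractions.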

2510.22665 2026-05-18 cs.CV cs.AI

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li

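Affiliations: School of Artificial Intelligence and Robotics, Hunan University; Yuelushan Center for Industrial Innovation; School of Medical Information Engineering, Jining Medical University

AI summary: This paper proposes SARVLM, the first vision-language foundation model designed for synthetic aperture radar (SAR) imagery, aiming to improve semantic understanding of SAR images. To address the scarcity of multimodal SAR data and weak cross-modal representations, the authors build SARVLM-1M, a large-scale dataset of over one million image-text pairs, and design a two-stage domain transfer training strategy that uses optical remote sensing data as a bridge to improve performance in the SAR domain. Experiments show that SARVLM outperforms existing models on multiple benchmark tasks and advances semantic understanding of SAR imagery.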

Comments 13 pages, 13 figures

Abstract

Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is utilized to improve the cross-scene generalization capability of the model. Moreover, SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released on https://github.com/KlayMa527/SARVLM.git.

2510.18814 2026-05-18 cs.LG cs.AI

A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li

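Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Jiao Tong University

AI summary: This paper asks whether a language model can improve its reasoning ability using only its own generated responses, without any external reward signal. It proposes Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on the generated data to improve the model step by step. Experiments show that SePT improves reasoning on several math reasoning benchmarks, supporting the feasibility of self-evolution from self-generated supervision alone.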

Abstract

Can language models improve their reasoning performance without external rewards, using only their own sampled responses for training? We show that they can. We propose Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on self-generated responses. It repeatedly samples questions, uses the model itself to generate responses under a specified sampling temperature, and then trains the model on the self-generated data. In this self-training loop, we use an online data refresh mechanism, where each new batch is generated by the most recently updated model. Across six math reasoning benchmarks, SePT improves a strong no-training baseline, defined as the untuned base model evaluated at its best swept decoding temperature, on several tested models. Additional ablations demonstrate the importance of online data refresh and temperature dynamics. Overall, our results identify a practical regime where reasoning can be improved using self-generated supervision alone. Our code is available at https://github.com/ElementQi/SePT.
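The loop described in the abstract can be sketched with a toy stand-in for the model. This is only an illustration of the control flow (sample questions, self-generate at a fixed temperature, train, repeat with the updated model); `ToyModel`, `self_training_loop`, and all parameter values are hypothetical, and the actual training objective is not specified here.

```python
import random


class ToyModel:
    """Hypothetical stand-in for an LLM; `version` counts training updates."""

    def __init__(self):
        self.version = 0

    def generate(self, question, temperature):
        # A real model would sample a response at this temperature.
        return f"v{self.version} answer to {question} (T={temperature})"

    def train_on(self, pairs):
        # A real implementation would take gradient steps on the pairs.
        self.version += 1


def self_training_loop(model, questions, rounds, batch_size, temperature, seed=0):
    rng = random.Random(seed)
    log = []
    for _ in range(rounds):
        batch = rng.sample(questions, batch_size)
        # Online data refresh: responses always come from the latest model.
        pairs = [(q, model.generate(q, temperature)) for q in batch]
        model.train_on(pairs)
        log.append(pairs)
    return log


log = self_training_loop(ToyModel(), ["q1", "q2", "q3", "q4"],
                         rounds=3, batch_size=2, temperature=0.8)
```

Each round's responses carry the version of the model that produced them, which makes the online refresh visible: round 0 data comes from the untrained model, round 2 data from the model after two updates.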

2510.10454 2026-05-18 cs.AI

Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Sihang Zeng, Yujuan Fu, Sitong Zhou, Zixuan Yu, Lucas Jing Liu, Jun Wen, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

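Affiliations: University of Washington; Fred Hutch Cancer Center; Harvard University; Google

AI summary: This paper presents Traj-CoA, a multi-agent system that models patient trajectories with a chain of agents to improve lung cancer risk prediction. A sequence of worker agents processes electronic health record (EHR) data step by step, distilling key events into a shared long-term memory module, EHRMem, which reduces noise while preserving a complete visit timeline; a manager agent then synthesizes this information to make the prediction. Experiments show that Traj-CoA outperforms four categories of baselines on zero-shot one-year lung cancer risk prediction and shows temporal reasoning consistent with clinical practice.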

Comments Accepted by NeurIPS 2025 GenAI4Health Workshop

Abstract

Large language models (LLMs) offer a generalizable approach for modeling patient trajectories, but suffer from the long and noisy nature of electronic health records (EHR) data in temporal reasoning. To address these challenges, we introduce Traj-CoA, a multi-agent system involving chain-of-agents for patient trajectory modeling. Traj-CoA employs a chain of worker agents to process EHR data in manageable chunks sequentially, distilling critical events into a shared long-term memory module, EHRMem, to reduce noise and preserve a comprehensive timeline. A final manager agent synthesizes the worker agents' summary and the extracted timeline in EHRMem to make predictions. In a zero-shot one-year lung cancer risk prediction task based on five-year EHR data, Traj-CoA outperforms baselines of four categories. Analysis reveals that Traj-CoA exhibits clinically aligned temporal reasoning, establishing it as a promisingly robust and generalizable approach for modeling complex patient trajectories. Implementation of Traj-CoA is available on https://github.com/zengsihang/Traj-CoA.

2510.08008 2026-05-18 cs.LG

Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts

Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong

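Affiliations: University of Science and Technology of China (USTC); Microsoft Research Asia (MSRA); Shanghai Jiao Tong University (SJTU); Xiamen University (XMU)

AI summary: As the computational demands of pre-training large language models (LLMs) keep rising, training efficiency becomes increasingly important. This paper proposes an "orthogonal growth" strategy that "recycles" existing pre-trained checkpoints by strategically expanding their parameters before continued training. The method grows Mixture-of-Experts (MoE) models along two dimensions, depth and width. Experiments on models of up to 70B parameters and 1T tokens show a 10.6% accuracy improvement under the same compute budget, offering an efficient and practical route to sustainable large-scale LLM development.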

Comments Accepted to ICML 2026

Abstract

As the computational demands for pre-training Large Language Models (LLMs) continue to surge, the need for efficient training paradigms becomes critical. Despite the vast resources already invested in existing pre-trained checkpoints, these assets often remain under-leveraged due to architectural limitations. We introduce an "orthogonal growth" strategy designed to "recycle" these checkpoints by strategically expanding their parameters prior to continued training. Our method focuses on optimizing converged Mixture-of-Experts (MoE) models through two dimensions: interpositional layer copying for increased depth and noisy expert duplication for expanded width. Through extensive scaling laws analysis, we demonstrate a strong positive correlation between the "sunk cost" (prior investment) and the final model accuracy. Empirical results on models up to 70B parameters and 1T tokens show that our recycling approach yields a 10.6% accuracy improvement compared to training from scratch under identical extra compute budgets. This work provides a cost-effective blueprint for sustainable large-scale LLM development.
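The two growth dimensions can be pictured with plain Python lists. This is a rough sketch under one reading of the abstract: "interpositional" is taken to mean that each copied layer sits directly after its source rather than being stacked at the end, and "noisy" is taken to mean small Gaussian perturbations of the copied expert weights. The function names and the noise scale are hypothetical, and router handling is omitted.

```python
import copy
import random


def interpositional_copy(layers):
    """Depth growth: place each copy right after its source layer."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))
    return grown


def noisy_duplicate(experts, noise_std, seed=0):
    """Width growth: append a slightly perturbed copy of every expert."""
    rng = random.Random(seed)
    copies = [[w + rng.gauss(0.0, noise_std) for w in expert] for expert in experts]
    return experts + copies


deeper = interpositional_copy(["L0", "L1", "L2"])
wider = noisy_duplicate([[1.0, 2.0], [3.0, 4.0]], noise_std=0.01)
```

Here `deeper` interleaves originals and copies (`L0, L0, L1, L1, L2, L2`), and `wider` doubles the expert count while keeping the copies close to, but not identical with, their sources so that the duplicates can diverge during continued training.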

2510.06062 2026-05-18 cs.CL

When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL

Jiakang Wang, Runze Liu, Qingpeng Cai, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai, Ling Pan

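Affiliations: The Hong Kong University of Science and Technology; Kuaishou Technology; Tsinghua University

AI summary: In outcome-supervised reinforcement learning (OSRL), the importance sampling (IS) ratio implicitly changes role when a shared advantage signal is allocated across the tokens of a response. This produces unbalanced weighting between positive-advantage and negative-advantage tokens, which over-reinforces high-probability tokens and suppresses updates to low-probability ones. To address this, the paper proposes Asymmetric Importance Sampling Policy Optimization (ASPO), a simple and effective strategy that reverses the ratio-induced weighting of positive-advantage tokens while stabilizing updates and maintaining gradient flow, improving training stability and model performance. Experiments show that ASPO significantly mitigates entropy collapse on math reasoning and coding tasks and outperforms existing GRPO baselines.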

Abstract

Reinforcement learning (RL) has shown great promise in post-training large language models (LLMs), where methods typically rely on token-level clipping to maintain stability during optimization. Despite the empirical success of GRPO-style methods, we identify a fundamental and previously overlooked challenge in this popular Outcome-Supervised RL (OSRL) paradigm. We reveal that in OSRL, where advantages are shared across tokens within a response, importance sampling (IS) ratios deviate from their traditional purpose of distribution correction in classic RL and instead become token-level weights that allocate the shared advantage signal across tokens. We show that this hidden role shift induces a critical mismatch for positive-advantage tokens, leading to unbalanced token weighting between positive and negative tokens. Specifically, it suppresses the update of underrepresented tokens that are lagging behind, while over-amplifying already high-probability tokens. This mismatch results in rich-get-richer dynamics that over-reinforce confident tokens and weaken catch-up learning, driving entropy collapse, excessive repetition, and premature convergence. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), a simple yet effective strategy that reverses the ratio-induced weighting of positive-advantage tokens, while stabilizing extreme updates and maintaining gradient flow. This mismatch correction aligns their update direction with the learning dynamics of negative ones. Comprehensive experiments across math reasoning and coding benchmarks demonstrate that ASPO significantly mitigates entropy collapse, improves training stability, and enhances performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting ratio-induced weighting in LLM RL.
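The role shift can be seen in a few lines. In the unclipped region of a PPO/GRPO-style surrogate, the per-token gradient is the shared advantage times the ratio times the gradient of the log-probability, so the ratio acts as a token weight. The sketch below computes those weights and, purely as an assumed illustration of "reversing" the weighting, uses the reciprocal ratio for positive advantages; this is not claimed to be ASPO's exact formula, clipping is left out, and all names and probabilities are hypothetical.

```python
import math


def token_weights(logp_new, logp_old, advantage, flip_positive=False):
    """Per-token weights implied by importance sampling ratios.

    With a shared advantage, weight_t = pi_new(t) / pi_old(t).
    flip_positive=True swaps in the reciprocal for positive advantages
    (an illustrative assumption, not the published update rule).
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    if flip_positive and advantage > 0:
        return [1.0 / r for r in ratios]
    return ratios


# Token 0 has already gained probability (0.5 -> 0.6); token 1 lags (0.10 -> 0.08).
logp_old = [math.log(0.5), math.log(0.10)]
logp_new = [math.log(0.6), math.log(0.08)]

standard = token_weights(logp_new, logp_old, advantage=1.0)
flipped = token_weights(logp_new, logp_old, advantage=1.0, flip_positive=True)
```

Under the standard weighting the already-favored token receives the larger weight (about 1.2 versus 0.8), which is the rich-get-richer pattern the abstract describes; the flipped variant gives the lagging token the larger weight instead.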

2510.05676 2026-05-18 cs.LG cs.SI

Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection

Félix Vandervorst, Bruno Deprez, Wouter Verbeke, Tim Verdonck

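Affiliations: Data Office, Allianz Benelux; KU Leuven; University of Antwerp-imec

AI summary: This paper proposes a novel inductive graph gradient boosting machine (G-GBM) for insurance fraud detection, targeting fraud class imbalance in heterogeneous, dynamic graph data. The method combines gradient boosting's robustness to class imbalance with interpretable path-level features encoded from the graph structure, while keeping the original tabular feature space accessible. Experiments show that G-GBM performs on par with or better than state-of-the-art methods on a public and a real-world insurance dataset, and the associated dataset is released to support reproducibility.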

Abstract

Graph-based methods are becoming increasingly popular in machine learning due to their ability to model complex data and relations. Insurance fraud is a prime use case, since fraudulent claims are often the work of organised criminals who stage accidents or of the same persons filing erroneous claims on multiple policies. One challenge is that graph-based approaches struggle to find meaningful representations of the data because of the high class imbalance present in fraud data. In addition, insurance graphs are heterogeneous and dynamic, given the changing relations among people, companies and policies. As a result, gradient-boosted tree approaches on tabular data still dominate the field. Therefore, we present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. G-GBM combines the class-imbalance robustness of gradient boosting with heterogeneous graph information encoded through interpretable path-level feature concatenations, while preserving access to the original tabular feature space. In addition, the explicit representation of neighbourhood information enables transparent SHAP-based explanations at the metapath and feature level. We demonstrate G-GBM for insurance fraud detection on an open-source and a real-world, proprietary dataset, and find that G-GBM performs on par with or better than the state-of-the-art. The associated insurance fraud dataset is publicly released to facilitate reproducibility.

2510.04124 2026-05-18 cs.CL

Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Nuwan I. Senaratna

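Affiliations: Independent Researcher

AI summary: This paper introduces a collection of open, machine-readable multilingual document datasets from Sri Lanka, covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics. The collection contains 269,194 documents totaling 79.5 GB in Sinhala, Tamil, and English. The datasets are updated daily and hosted on GitHub and Hugging Face, and aim to support computational linguistics, legal analytics, socio-political research, and multilingual natural language processing. The paper also describes the data sources, collection pipeline, formats, and potential use cases, and discusses licensing and ethical considerations.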

Comments 4 pages. 269,194 documents (79.5 GB) across 26 datasets in Sinhala, Tamil, and English. Last updated on 2026-05-15

Abstract

We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises 269,194 documents (79.5 GB) across 26 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2026-05-15-0811.

2510.02307 2026-05-18 cs.CV cs.AI

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

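Affiliations: Rice University

AI summary: Text-to-image diffusion models often degrade when generating images at resolutions outside their training setup. Targeting low-resolution generation, this paper proposes NoiseShift, a training-free noise recalibration method that adjusts the denoiser's noise-conditioning index to restore consistency between the forward and reverse processes, reducing the train-test mismatch. Experiments show that NoiseShift consistently improves low-resolution generation quality across several mainstream diffusion models, is simple to implement, and adds no inference overhead.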

Abstract

Text-to-image diffusion models often degrade when sampled at resolutions outside the final training resolution set. Prior work has largely emphasized higher resolution generation, enabling pretrained diffusion models to extrapolate beyond the resolutions seen during training. In this work, we instead target lower-resolution generation, performing inference at reduced resolution to significantly cut computational cost. We show that network conditioning of the noise level induces a train-test mismatch that directly degrades low-resolution generation: the same scheduled noise level can correspond to a different perceptual corruption level at lower resolutions, mis-calibrating the denoiser timestep and noise embedding. To this end, we propose NoiseShift, a training-free recalibration method that keeps the original noise sampling schedule unchanged and instead re-indexes the noise conditioning of the denoiser to restore local forward-reverse consistency. Using a lightweight coarse-to-fine calibration on a small set of image-text pairs, NoiseShift learns a resolution-specific mapping from scheduler noise to conditioning noise, reducing train-test mismatch and improving lower-resolution generation quality. When NoiseShift is applied to Stable Diffusion 3 (SD3), Stable Diffusion 3.5 (SD3.5), and Flux-Dev, generation quality at low resolutions improves consistently. Particularly, SD3 generation at 128x128 resolution gets an improved FID score from 203 to 171, and SD3.5 gets an improved FID score from 310 to 277 on LAION-COCO. Even Flux-Dev which already implements a complementary time-shifting strategy gets a modest boost from NoiseShift with an improved FID score from 120 to 113 at 64x64 resolution. More importantly, NoiseShift achieves such improvements with minimal implementation changes and no additional inference overhead.

2509.24798 2026-05-18 cs.CV cs.AI

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

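Affiliations: Centre for AI, DS&AI, AstraZeneca, UK; Institute for Imaging, Data and Communications (IDCOM), School of Engineering, University of Edinburgh, Edinburgh, UK; Shanghai Artificial Intelligence Laboratory

AI summary: This paper presents Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation. The method applies causal interventions to target attributes and propagates their effects consistently to causally dependent attributes while preserving the core identity of the image. Unlike approaches that rely on prompt engineering, Causal-Adapter introduces a structural causal model and attribute-regularization strategies, achieving more accurate semantic control and high-fidelity image generation, with strong results on multiple datasets.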

Comments Project Page: https://leitong02.github.io/causaladapter/

Journal ref ICML 2026

Abstract

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method supports causal interventions on target attributes and consistently propagates their effects to causal dependents while preserving the core identity of the image. Unlike prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling with two attribute-regularization strategies: (i) prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and (ii) a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, including up to a 91% reduction in MAE on Pendulum for accurate attribute control and up to an 87% reduction in FID on ADNI for high-fidelity MRI generation. These results demonstrate robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation. Code and models will be released at: https://leitong02.github.io/causaladapter/.

2509.22267 2026-05-18 cs.LG eess.SP

Towards a more realistic evaluation of machine learning models for bearing fault diagnosis

João Paulo Vieira, Victor Afonso Bauler, Rodrigo Kobashikawa Rosa, Danilo Silva

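Affiliations: Department of Electrical and Electronic Engineering, Federal University of Santa Catarina; Department of Mechanical Engineering, Federal University of Santa Catarina

AI summary: This paper addresses the data leakage that is widespread in vibration-based bearing fault diagnosis, examines its effect on model evaluation, and proposes a strict bearing-wise data partitioning scheme that prevents physical components from overlapping between training and test data. The study also recasts classification as a multi-label problem, which supports joint detection of multiple fault types, and introduces evaluation metrics based on the ROC curve. Experiments on four widely used datasets validate the methodology and provide guidance for building more reliable, better-generalizing machine learning systems for industrial fault diagnosis.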

Comments Revised version submitted to Mechanical Systems and Signal Processing

Abstract

Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. While recent advances in machine learning (ML), particularly deep learning, have shown strong performance in controlled settings, many studies fail to generalize to real-world applications due to methodological flaws, most notably data leakage. This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing. Additionally, we reformulate the classification task as a multi-label problem, enabling the detection of co-occurring fault types and the use of prevalence-independent metrics based on the ROC curve. Beyond preventing leakage, we also examine the effect of dataset diversity on generalization, showing that the number of unique training bearings is a decisive factor for achieving robust performance. We evaluate our methodology on four widely adopted datasets: CWRU, Paderborn University (PU), University of Ottawa (UORED-VAFCLS) and HUST bearing. This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, fostering the development of more trustworthy ML systems for industrial fault diagnosis applications.
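A bearing-wise split can be sketched in a few lines of standard-library Python. The idea is to shuffle bearing identifiers, not individual segments, so that every segment from a given physical bearing ends up on one side only. The function name, the `(bearing_id, segment)` data layout, and the toy data are assumptions for illustration and do not reproduce the paper's protocol.

```python
import random


def bearing_wise_split(segments, test_fraction=0.25, seed=0):
    """Split (bearing_id, segment) pairs so no bearing appears in both sets."""
    bearings = sorted({bearing for bearing, _ in segments})
    rng = random.Random(seed)
    rng.shuffle(bearings)
    n_test = max(1, round(len(bearings) * test_fraction))
    test_ids = set(bearings[:n_test])
    train = [s for s in segments if s[0] not in test_ids]
    test = [s for s in segments if s[0] in test_ids]
    return train, test


# Toy data: 4 bearings with 5 vibration segments each (segment payloads are placeholders).
segments = [(bearing, i) for bearing in "ABCD" for i in range(5)]
train, test = bearing_wise_split(segments)
```

A segment-wise split would instead shuffle the 20 segments directly, which places segments from the same bearing on both sides and lets a model score well by recognizing the bearing rather than the fault.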

2509.05030 2026-05-18 cs.CV

LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Cong Cao, Xianhang Cheng, Jingyuan Liu, Yujian Zheng, Zhenhui Lin, Ren Li, Meriem Chkir, Hao Li

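Affiliations: The University of Tokyo

AI summary: This paper presents LUIVITON, a fully automated virtual try-on system that addresses the mismatch in skeletons, templates, and dense correspondences between real-world garments and body models, automatically dressing complex multi-layer garments onto humanoid characters of varied pose and shape. Using SMPL as an intermediate proxy, the method decomposes garment-to-body mapping into two correspondence tasks, handled respectively by a geometry-driven model and a diffusion-based approach using multi-view appearance features, and finally produces physically plausible draping on the target character. The system handles complex garment topology, applies to a wide range of humanoid characters, is computationally practical, and requires no manual intervention.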

Abstract

To enable large-scale reuse of real-world 3D assets, where garments and characters rarely share skeletons, templates, or dense correspondences, we present a fully automated virtual try-on system that dresses complex, multi-layer garments onto diverse, arbitrarily posed humanoids. Our key idea is to use SMPL as an intermediate proxy and decompose clothing-to-body transfer into two correspondence tasks with distinct challenges: (1) clothing-to-SMPL (partial-to-complete alignment) and (2) body-to-SMPL (large pose/shape variation and stylization). We address clothing-to-SMPL using a geometry-driven correspondence model, and introduce a diffusion-based body-to-SMPL correspondence approach that leverages multi-view consistent appearance features together with a pretrained 2D foundation model. Using these correspondences, we register SMPL/SMPL+D (Displacement) to the garment and target body and then perform simulator-driven fitting by transferring the garment along a smooth SMPL-to-SMPL+D transition, producing physically plausible draping on the target. Our system handles complex garment topology (including non-manifold meshes) and generalizes to a wide range of humanoid characters (e.g., humans, robots, cartoons, and creatures) while remaining computationally practical. Upon draping, our system also supports fast customization of clothing size. We show that our system can produce high-quality 3D clothing fittings without any human labor, even when 2D clothing sewing patterns are not available. Our project page is: https://cao-cong0.github.io/LUIVITON-Learned-Universal-Interoperable-VIrtual-Try-ON/.

2508.20810 2026-05-18 cs.AI cs.CL

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture

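Affiliations: Gates Foundation

AI summary: This paper proposes a graph-based evaluation harness for rigorous evaluation of language models in specific domains. It converts structured clinical guidelines into a queryable knowledge graph and generates evaluation questions dynamically through graph traversal, aiming for comprehensive, contamination-resistant, and maintainable evaluation. Applied to the WHO IMCI guidelines, the harness generates multiple-choice questions covering symptom recognition, treatment, severity classification, and follow-up care, and reveals systematic capability gaps across language models on clinical decision tasks.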

Abstract

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.
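As a rough picture of how questions might be instantiated from a graph, the sketch below walks every edge of a tiny placeholder graph and builds one multiple-choice item per edge, drawing distractors from other targets of the same relation. The graph contents are deliberately abstract placeholders (not clinical guidance), and `EDGES`, `make_question`, and the question wording are hypothetical rather than taken from the harness.

```python
import random

# Hypothetical toy graph: (source, relation, target) triples with placeholder names.
EDGES = [
    ("condition_A", "treated_with", "treatment_X"),
    ("condition_B", "treated_with", "treatment_Y"),
    ("condition_C", "treated_with", "treatment_Z"),
]


def make_question(edge, edges, n_options=3, seed=0):
    """Build one multiple-choice item from a single graph edge."""
    source, relation, answer = edge
    rng = random.Random(seed)
    # Distractors: other targets reachable through the same relation type.
    distractors = sorted({t for _, r, t in edges if r == relation and t != answer})
    options = rng.sample(distractors, n_options - 1) + [answer]
    rng.shuffle(options)
    return {
        "stem": f"Which option is linked to {source} by '{relation}'?",
        "options": options,
        "answer": answer,
    }


# One question per edge gives full coverage of the relationships in the graph.
questions = [make_question(edge, EDGES, seed=i) for i, edge in enumerate(EDGES)]
```

Iterating over all edges corresponds to the coverage property, and changing the seed regenerates the same underlying fact with different distractor choices and option order, which is one simple form of surface-level variation.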

2508.18167 2026-05-18 cs.CL cs.HC

DiscussLLM: Teaching Large Language Models When to Speak

Deep Anil Patel, Iain Melvin, Christopher Malon, Martin Renqiang Min

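Affiliations: NEC Laboratories America

AI summary: This paper proposes DiscussLLM, a framework that addresses the passive, respond-only behavior of large language models in dynamic conversations by enabling them to judge when speaking up would add value. The authors design a two-stage data generation pipeline that builds a large-scale dataset of realistic multi-turn discussions, each annotated with one of five intervention types and an explicit trigger point. Models are trained to predict when to stay silent and when to intervene, with the aim of improving situational awareness and responsiveness in conversation.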

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce DiscussLLM, a framework designed to bridge this gap by training models to proactively decide not just what to say, but critically, when to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.

2508.17218 2026-05-18 cs.LG cs.AI

Generalized Policy Gradient with History-Aware Decision Transformer for Reliable Routing over Graph Signals

Xing Wei, Yuanhang Wang, Duoxiang Zhao, Zezhou Zhang, Hao Qin, Yuqi Ouyang

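Affiliations: Sichuan University-Pittsburgh Institute; Sichuan University; College of Computer Science; University College Dublin; School of Electrical and Electronic Engineering; School of Electronics and Information Engineering

AI summary: This study addresses reliable path planning in stochastic transportation networks and proposes GPG-HT, a policy framework that combines a history-aware Decision Transformer with generalized policy gradient optimization. By attending to historical node-edge-time observations, the method captures non-Markovian spatial-temporal dependencies and makes more context-aware routing decisions under uncertainty. Experiments on representative transportation networks show consistent gains in on-time arrival probability over conventional optimization and reinforcement learning baselines.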

Abstract

Reliable path planning in stochastic transportation networks requires decisions that account for uncertain and correlated travel times on irregular road graphs, rather than only minimizing expected delay. Such networks exhibit strong spatial-temporal coupling, where link travel times evolve as stochastic processes over graph edges, making the problem inherently sequential under uncertainty. Existing stochastic on-time arrival (SOTA) methods primarily depend on the current node and remaining budget, which limits their ability to exploit trajectory-level temporal structure and history-dependent correlations. This work proposes GPG-HT, a history-aware graph-signal policy framework that integrates a Decision Transformer with generalized policy gradient optimization for reliable routing. By attending to historical node-edge-time observations, GPG-HT captures non-Markovian spatial-temporal dependencies and enables context-aware decision making under uncertainty. Experiments on the Sioux Falls and Anaheim networks demonstrate consistent gains in on-time arrival probability over representative optimization and reinforcement learning baselines.

2508.17034 2026-05-18 cs.RO cs.CV

DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

Jiayi Li, Yuxin Yao, Qiuhang Lu, Juyong Zhang

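Affiliations: University of Science and Technology of China; City University of Hong Kong; University of Chinese Academy of Sciences

AI summary: Targeting noisy data, partial overlap, and real-time requirements in rigid registration, this paper proposes DualReg, a dual-space filtering and reinforcement method. It combines the strengths of feature-based matching and local geometry-based matching: an efficient filtering mechanism removes unreliable feature correspondences, and geometric proxies are used to build an objective function for estimating the transformation. Experiments show a 32x CPU-time speedup over MAC on the KITTI dataset with comparable accuracy.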

Comments Accepted to CVPR 2026, Project page: https://ustc3dv.github.io/DualReg/

Abstract

Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: https://ustc3dv.github.io/DualReg/.

2508.01014 2026-05-18 cs.RO cs.CV

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu, Zhuoli Zhuang, Nguyen Thanh Trung Le, Da Xiao, Yu-Cheng Chang, Thomas Do, Srinath Sridhar, Chin-teng Lin

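Affiliations: University of Technology Sydney; Brown University

AI summary: Hestia is a viewpoint planning method for efficient 3D reconstruction that addresses the reliance of conventional pipelines on manual image capture or fixed trajectories. By introducing a voxel-face-aware hierarchical structure together with a more diverse dataset, a close-greedy strategy, and a geometry-aware design, it improves the robustness of view planning and the quality of reconstruction. Experiments show that Hestia outperforms existing methods in coverage and reconstruction accuracy while maintaining real-time inference, indicating good practical potential.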

Comments Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

Abstract

Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, image acquisition for reconstruction is still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for real-world application. Our project page is https://johnnylu305.github.io/hestia web.

2507.16806 2026-05-18 cs.LG cs.AI cs.CL

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

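Affiliations: Massachusetts Institute of Technology

AI summary: This paper studies how to use reinforcement learning to train language models to better assess their own uncertainty when generating reasoning chains. Conventional approaches use binary reward functions that score only output correctness, so models tend to give wrong answers when they are uncertain. The authors propose RLCR, a training method that combines a binary correctness reward with a Brier score to jointly optimize accuracy and confidence calibration. Experiments show that RLCR substantially improves calibration on multiple datasets without sacrificing accuracy, outperforming ordinary RL training and post-hoc confidence calibration methods.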

Details
Abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models. Code, models, and further info are available at https://rl-calibration.github.io/.
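The reward described in this abstract (a binary correctness score augmented with a Brier score) can be illustrated with a small sketch. The exact combination used here, correctness minus the squared error between the stated confidence and the correctness indicator with equal weights, is an assumption read off the abstract's wording; the function names `rlcr_reward` and `expected_reward` are hypothetical and are not taken from the authors' code.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - y) ** 2  # squared error of the stated confidence
    return y - brier_penalty


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is correct with probability p_correct."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))


# For a fixed chance of being correct, scan candidate confidence reports.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.7, q))
print(best_confidence)
```

Under this assumed form the expected reward equals p² − (q − p)² for true success probability p and reported confidence q, so reporting q = p is optimal. That is the sense in which a proper scoring rule rewards calibrated confidence, while a confident wrong answer is penalized more heavily than an uncertain one.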

2507.15778 2026-05-18 cs.CL

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou, Ling Pan

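Affiliations: The Hong Kong University of Science and Technology; Kuaishou Technology; Tsinghua University

AI summary: This work targets Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning ability of large language models. It proposes Archer, a framework with dual-token constraints that applies different optimization strategies to high-entropy (reasoning-related) and low-entropy (knowledge-related) tokens. While preserving the sequential dependencies of generation, it controls update strength differently for each token type. It outperforms existing methods on mathematical reasoning and code generation tasks, supporting the value of fine-grained optimization strategies.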

Details
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method introduces response-level entropy normalization for stable token classification and applies differentiated clipping ranges and KL regularization to encourage exploration on reasoning tokens while preserving knowledge tokens. Experiments on mathematical reasoning and code generation benchmarks show that Archer consistently outperforms strong baselines across multiple model scales, improving both pass@1 and pass@K performance. These results highlight the importance of respecting sequence-level dependencies when designing fine-grained RL optimization strategies for LLMs.

2507.01679 2026-05-18 cs.LG cs.AI cs.CL

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

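Affiliations: ILCC, University of Edinburgh; Fudan University; Qwen Team, Alibaba Group; ILLC, University of Amsterdam

AI summary: This paper studies how to combine supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) in the post-training of large language models. It proposes Prefix-RFT, a hybrid strategy that uses prefix sampling to learn jointly from demonstration data and exploration. The method performs well on mathematical reasoning tasks, surpassing standalone SFT and RFT as well as other hybrid strategies. The results support the complementarity of SFT and RFT and show robustness to changes in the quality and quantity of demonstration data.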

Comments ICML 2026

Details
Abstract

Existing post-training techniques for LLMs are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

2507.01201 2026-05-18 cs.LG cs.CV

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Lauren Hyoseo Yoon, Yisong Yue, Been Kim

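Affiliations: Computation and Neural Systems; California Institute of Technology; Computation and Mathematical Sciences; Google DeepMind

AI summary: This paper studies how to align independently trained vision and language models. It proposes JAM, a method that achieves cross-modal alignment by jointly training modality-specific autoencoders. JAM introduces a multimodal Spread Loss that improves alignment, and the paper systematically analyzes how the alignment objective, layer depth, and foundation model scale affect representational convergence. The study offers theoretical insight into shared semantic structure and practical guidance for building specialist multimodal models.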

Details
Abstract

Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.

2507.00275 2026-05-18 cs.LG cs.AI

Deep Double Q-learning

Prabhat Nagarajan, Martha White, Marlos C. Machado

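Affiliations: Department of Computing Science; University of Alberta; Alberta Machine Intelligence Institute; CIFAR AI Chair; Edmonton, AB, Canada

AI summary: This paper proposes Deep Double Q-learning (DDQL), a deep reinforcement learning algorithm that targets the overestimation problem found in deep Q-network methods. It explicitly trains two Q-functions and stabilizes training with techniques such as lower replay ratios and longer target network update intervals. Experiments show that across 57 Atari 2600 games DDQL improves aggregate performance over Double DQN, outperforms it on 47 of the games, and further reduces overestimation.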

Comments 44 pages

Details
Abstract

Double Q-learning is a classical control algorithm that mitigates the maximization bias of Q-learning. To do so, it explicitly trains two independent action-value functions and uses them to decouple action-selection and action-evaluation when computing bootstrap targets. Double DQN adapts target bootstrap decoupling to deep reinforcement learning (RL), but explicitly trains only a single action-value function and does not fully decouple its estimators. Consequently, the two estimators remain correlated, and overestimation persists. In this paper, we introduce Deep Double Q-learning (DDQL), a deep RL algorithm that explicitly trains two Q-functions through Double Q-learning. DDQL stabilizes training through a combination of techniques, including lower replay ratios, longer target network update intervals, and shared layers. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games while further reducing overestimation. In addition, we study key design choices when adapting Double Q-learning to deep RL, including the network architecture, replay ratio, and minibatch sampling strategies.
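The decoupling of action selection from action evaluation that this abstract describes can be sketched with the classical tabular Double Q-learning update. This is an illustration of the tabular algorithm only, not of DDQL's deep-network training; the function name `double_q_update`, the dictionary-based tables, and the default hyperparameters are assumptions made for the example.

```python
import random
from collections import defaultdict


def double_q_update(q_a, q_b, state, action, reward, next_state, actions,
                    done=False, alpha=0.1, gamma=0.99, rng=random):
    """Apply one tabular Double Q-learning update in place."""
    # A fair coin decides which of the two tables is updated on this step.
    if rng.random() < 0.5:
        q_update, q_other = q_a, q_b
    else:
        q_update, q_other = q_b, q_a

    if done:
        target = reward
    else:
        # Action selection uses the table being updated ...
        best_action = max(actions, key=lambda a: q_update[(next_state, a)])
        # ... while action evaluation uses the other table.
        target = reward + gamma * q_other[(next_state, best_action)]

    q_update[(state, action)] += alpha * (target - q_update[(state, action)])


# Usage: both tables start at zero, so exactly one of them moves toward the reward.
q_a, q_b = defaultdict(float), defaultdict(float)
double_q_update(q_a, q_b, "s0", "left", 1.0, "s1", ["left", "right"])
```

Because the table that picks the greedy action is not the one that scores it, noise that inflates one table's maximum is less likely to inflate the bootstrap target, which is how the method mitigates maximization bias.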

2506.23552 2026-05-18 cs.CV cs.SD eess.AS

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

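Affiliations: Yonsei University; CineLingo; Seoul National University

AI summary: This paper proposes JAM-Flow, a unified framework that generates facial motion and speech jointly, addressing the fact that talking head synthesis and speech synthesis are usually treated as separate tasks. The method combines flow matching with a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, using selective joint attention layers to enable cross-modal interaction while preserving the strengths of each modality. JAM-Flow supports multiple conditioning inputs in a single model, such as text, reference audio, and reference motion, enabling tasks including synchronized talking head generation from text and audio-driven animation.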

Comments Project page: https://joonghyuk.com/jamflow-web (under review; preprint published on arXiv)

Details
Abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

2506.06739 2026-05-18 cs.AI cs.LG

Honey, I shrunk the hypothesis space (through logical preprocessing)

Andrew Cropper, Filipe Gouveia, David M. Cerna

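Affiliations: ELLIS Institute; University of Helsinki; Czech Academy of Sciences Institute of Computer Science; Dynatrace Research

AI summary: This study proposes a way to shrink the hypothesis space of inductive logic programming (ILP) through logical preprocessing. Using background knowledge, the method removes, before learning, rules that cannot appear in an optimal hypothesis regardless of the training data, for example rules that violate facts such as "even numbers cannot be odd". Experiments show that the approach substantially reduces learning times while maintaining predictive accuracy; for instance, with only 10 seconds of preprocessing it reduces learning times from over 10 hours to only 2 seconds.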

Comments Published in JAIR

Journal ref Journal of Artificial Intelligence Research, Vol. 85 (2026)

Details
Abstract

Inductive logic programming (ILP) is a form of logical machine learning. The goal is to search a hypothesis space for a hypothesis that generalises training examples and background knowledge. We introduce an approach that 'shrinks' the hypothesis space before an ILP system searches it. Our approach uses background knowledge to find rules that cannot be in an optimal hypothesis regardless of the training examples. For instance, our approach discovers relationships such as "even numbers cannot be odd" and "prime numbers greater than 2 are odd". It then removes violating rules from the hypothesis space. We implement our approach using answer set programming and use it to shrink the hypothesis space of a constraint-based ILP system. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can substantially reduce learning times whilst maintaining predictive accuracies. For instance, given just 10 seconds of preprocessing time, our approach can reduce learning times from over 10 hours to only 2 seconds.
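The idea of removing rules that cannot hold regardless of the training examples can be illustrated with a toy sketch. The abstract states that the authors implement their approach with answer set programming; the Python below is not that implementation. It assumes a deliberately simplified setting: unary predicates given as finite sets of constants, exclusivity inferred from disjoint extensions, and rules represented as (head, body) pairs. All names (`find_exclusive_pairs`, `is_unsatisfiable`, `shrink`) are hypothetical.

```python
from itertools import combinations

# A body atom is (predicate name, tuple of variable names), e.g. ("even", ("A",)).
Atom = tuple[str, tuple[str, ...]]


def find_exclusive_pairs(background: dict[str, set]) -> set[frozenset[str]]:
    """Pairs of unary predicates whose extensions never overlap in the background."""
    return {
        frozenset((p, q))
        for p, q in combinations(sorted(background), 2)
        if not background[p] & background[q]
    }


def is_unsatisfiable(body: list[Atom], exclusive: set[frozenset[str]]) -> bool:
    """True if the body applies two mutually exclusive predicates to the same variables."""
    return any(
        args1 == args2 and frozenset((p1, p2)) in exclusive
        for (p1, args1), (p2, args2) in combinations(body, 2)
    )


def shrink(rules, exclusive):
    """Keep only the rules whose bodies can be satisfied."""
    return [rule for rule in rules if not is_unsatisfiable(rule[1], exclusive)]


background = {"even": {0, 2, 4}, "odd": {1, 3, 5}, "prime": {2, 3, 5}}
exclusive = find_exclusive_pairs(background)

rules = [
    ("f", [("even", ("A",)), ("odd", ("A",))]),    # even(A), odd(A): never true
    ("f", [("even", ("A",)), ("odd", ("B",))]),    # different variables: kept
    ("f", [("even", ("A",)), ("prime", ("A",))]),  # 2 is even and prime: kept
]
kept = shrink(rules, exclusive)
print(len(rules), "->", len(kept))
```

The pruning happens once, before any search, which is why it can pay for itself: every rule removed here is a rule the learner never has to consider.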

2506.05878 2026-05-18 cs.LG

A projection-based framework for gradient-free and parallel learning

Andreas Bergmeister, Manish Krishan Lal, Stefanie Jegelka, Suvrit Sra

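Affiliations: TU Munich; MCML; MIT CSAIL; MIT LIDS

AI summary: This paper proposes a projection-based framework for training neural networks. Unlike conventional gradient descent, it recasts training as a large-scale feasibility problem and uses iterative projection algorithms to find network parameters that satisfy local constraints. The projection operators act locally, which supports parallel computation and allows non-differentiable operations to be handled. The authors develop PJAX, a software framework that implements this approach with GPU/TPU acceleration, and validate it on several network architectures, showing advantages in parallelism and generality.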

Details
Abstract

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.
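To make the notion of a feasibility problem solved by iterative projection concrete, here is a minimal sketch of the classical method of alternating projections on two convex sets, a hyperplane and a box. It is a generic textbook illustration in plain Python, not PJAX's API and not necessarily the projection algorithm the authors use; the function names and the toy constraint sets are assumptions chosen for the example.

```python
def project_hyperplane(x, a, b):
    """Euclidean projection of x onto the hyperplane {z : a . z = b}."""
    scale = (sum(ai * xi for ai, xi in zip(a, x)) - b) / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]


def project_box(x, lo, hi):
    """Euclidean projection of x onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def alternating_projections(x, a, b, lo, hi, iters=200):
    """Alternate between the two projections to seek a point in both sets."""
    for _ in range(iters):
        x = project_box(project_hyperplane(x, a, b), lo, hi)
    return x


# Find a point with x0 + x1 = 1 and 0 <= x0, x1 <= 1, starting from an infeasible guess.
a, b = [1.0, 1.0], 1.0
solution = alternating_projections([2.0, -3.0], a, b, lo=0.0, hi=1.0)
print(solution)
```

Each projection only needs to know about its own constraint set, which hints at why a formulation built from local constraints can be parallelized and does not require the constraints to be differentiable.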

2505.21698 2026-05-18 cs.CV

Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging

Yitong Li, Morteza Ghahremani, Christian Wachinger

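Affiliations: Lab for AI in Medical Imaging, Technical University of Munich (TUM); Munich Center for Machine Learning (MCML)

AI summary: This study addresses the difficulty of applying foundation vision-language models to medical image diagnosis. It proposes MedBridge, a lightweight adaptation framework that combines domain alignment, resolution preservation, and multi-label reasoning to mitigate the domain gap between medical and natural images. MedBridge uses pretrained vision-language models as multi-view query encoders, introduces learnable query tokens for non-destructive domain adaptation, and dynamically integrates heterogeneous models through a mixture-of-experts architecture for multi-label diagnosis, improving performance on both cross-domain and in-domain tasks. Experiments show that the method outperforms existing approaches on several chest X-ray diagnosis benchmarks and that it is model-agnostic and extensible.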

Details
Abstract

Vision-language foundation models achieve promising performance in natural image classification, yet their direct application to medical imaging is limited by severe domain shifts, resolution mismatches, and the multi-label nature of clinical diagnosis. Training dedicated medical foundation models from scratch, however, is costly and data-intensive. Here, we propose MedBridge, a lightweight adaptation framework that opens a new direction in domain-gap mitigation by jointly combining domain alignment, resolution preservation, and multi-label reasoning via complementary VLM experts for medical image diagnosis. Specifically, MedBridge transforms pretrained VLMs into multi-view query encoders that inject a compact set of learnable query tokens into intermediate layers, enabling non-destructive domain alignment while preserving fine-grained pathological cues via multi-view high-resolution sampling. These query tokens further act as routing signals for a mixture-of-experts, dynamically integrating heterogeneous foundation models for multi-label reasoning without requiring a shared representation space. We evaluated MedBridge on five chest radiograph benchmarks in three key adaptation tasks. MedBridge demonstrates superior performance in both cross-domain generalization (out-of-distribution transfer) and in-domain specialization (same-distribution tuning) settings, yielding a significant 6-15% AUC improvement over state-of-the-art adaptation methods for multi-label thoracic disease diagnosis. Furthermore, MedBridge is model-agnostic and demonstrates broad extensibility across eight diverse VLMs (e.g., CLIP, LLaVA, Qwen-VL, MedGemma), highlighting its ability to flexibly adapt arbitrary foundation models into a powerful medical diagnostic tool. Our code will be released upon acceptance.

2505.21535 2026-05-18 cs.CV cs.AI cs.LG

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

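Affiliations: University of Arizona; TetraMem, Inc.

AI summary: This paper proposes FAR, a function-preserving attention replacement framework that targets the inefficiency of Transformer inference on ReRAM-based in-memory computing (IMC) devices. FAR replaces the attention mechanism in pretrained DeiT models with a multi-head bidirectional LSTM structure compatible with IMC dataflows and combines block-wise knowledge distillation with structured pruning, retaining functional equivalence while reducing latency and parameter count. Experiments show that FAR maintains accuracy comparable to the original models on ImageNet and multiple downstream tasks, indicating its potential for efficient deployment of Transformer models on edge devices.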

Comments 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

Details
Abstract

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

2505.19241 2026-05-18 cs.LG cs.AI

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low

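Affiliations: Department of Computer Science, National University of Singapore; Singapore-MIT Alliance for Research and Technology Centre; The Chinese University of Hong Kong, Shenzhen, China; CSAIL, Massachusetts Institute of Technology; Institute of Data Science, National University of Singapore

AI summary: This paper proposes ActiveDPO, an active direct preference optimization method that aims to improve sample efficiency when aligning large language models. The method relies on a theoretically grounded data selection criterion that applies to non-linear reward functions, and it uses the LLM being aligned to parameterize the reward model, which guides data selection more effectively. Experiments show that ActiveDPO outperforms existing methods across various models and real-world preference datasets, improving both alignment quality and data efficiency.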

Comments Accepted at ICLR 2026

Details
Abstract

The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality datasets of human preferences. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive assumptions about the reward function, such as linear latent reward functions. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the LLM's influence on data selection, unlike methods that select data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-world preference datasets.

2505.18511 2026-05-18 cs.LG math.AP physics.comp-ph

SPDEBench: An Extensive Benchmark for Learning Stochastic PDEs

Yuantu Zhu, Zheyan Li, Dai Shi, Luke Thompson, Oliver Nash, Jose Miguel Lara Rangel, Siran Li, Bingguang Chen, Rongchan Zhu, Qi Meng, Hao Ni

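Affiliations: Shanghai Jiao Tong University; University of Pennsylvania; University of Cambridge; University of Sydney; Imperial College London; University College London; Fujian Normal University; Beijing Institute of Technology; Chinese Academy of Sciences

AI summary: This paper introduces SPDEBench, the first unified benchmark for learning stochastic partial differential equations (SPDEs), which addresses the lack of standardized datasets and evaluation protocols in this area. The benchmark covers physically and mathematically significant SPDEs on 1-3D domains with periodic or Dirichlet boundary conditions, including both regular and singular SPDEs, and provides several machine learning baselines together with seven evaluation metrics. Experiments show that SPDE-aware models outperform generic operator-learning methods in accuracy and generalization, and SPDEBench offers a reproducible and extensible resource for related research.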

Details
Abstract

Stochastic Partial Differential Equations (SPDEs) driven by random noise play a central role in modeling physical processes with rough spatio-temporal dynamics, such as turbulent flows, superconductors, and quantum dynamics. Although machine learning (ML)-based surrogate models have shown promise for efficiently approximating such dynamics, progress remains limited by the lack of a unified benchmark with controlled data generation and comprehensive evaluation. This gap is particularly significant for singular SPDEs, for which benchmark datasets are largely unavailable and reliable simulation requires numerically delicate schemes based on renormalization. Moreover, subtle differences in data-generation procedures, such as noise approximation, basis choice, and the inclusion of renormalization, can significantly affect the resulting datasets and, consequently, model evaluation. We introduce SPDEBench, the first unified benchmark for ML-based SPDE learning. SPDEBench provides ready-to-use datasets for physically and mathematically significant SPDEs on 1-3D domains with periodic or Dirichlet boundary conditions. Both regular and singular SPDEs are taken into consideration. SPDEBench also incorporates representative ML baselines in operator learning, together with 7 evaluation metrics, including Sobolev and distributional metrics beyond the standard $L^2$-error. Supported by SPDEBench, we conduct systematic evaluations of model accuracy, robustness, and out-of-distribution generalization under controlled data variations. Our numerical results show that SPDE-aware architectures generally achieve stronger performance than generic operator-learning baselines. These findings establish SPDEBench as a reproducible and extensible resource, paving the way for principled benchmarking and architecture design for stochastic spatio-temporal dynamics.

2505.18134 2026-05-18 cs.AI cs.CL cs.CV

VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

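Affiliations: Princeton University

AI summary: VideoGameBench is a benchmark for evaluating whether vision-language models (VLMs) can complete popular video games. It comprises 10 games from the 1990s, with which models interact in real time using only raw visual inputs and a high-level description of objectives and controls. The study finds that current frontier VLMs perform poorly on real-time game tasks and struggle to progress beyond the beginning of each game, with inference latency a major limitation. The authors therefore also introduce VideoGameBench Lite, in which the game pauses while waiting for the model's next action, and report that even the best-performing models achieve very low completion rates on both settings.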

Comments 10 pages, 38 pages including supplementary

Details
Abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.