arXivDaily arXiv daily academic digest, updated Monday to Friday
2606.04310 2026-06-04 cs.LG cs.SE

Latent Anchor-Driven Test Generation for Deep Neural Networks

Bin Duan, Matthew B. Dwyer, Guowei Yang

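Affiliations School of Electrical Engineering and Computer Science, The University of Queensland, Australia; Department of Computer Science, University of Virginia, United States

AI summary Proposes Latte, a framework that uses a pre-trained VQ-VAE to perform anchor-guided mutation in latent space, generating semantically proximate, diverse, and fault-revealing test cases and improving fault exposure and behavioral diversity.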

Abstract

Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches explore either the input space or a learned latent space. While latent-space generation can better maintain plausibility than direct input-space mutation, current methods still face a trade-off among exploration controllability, failure diversity, and seed-relative semantic drift. To overcome these limitations, we propose Latte, a black-box testing framework that generates semantically proximate, diverse, and fault-revealing test cases by leveraging the latent space. Specifically, Latte encodes each input seed with a pre-trained VQ-VAE and performs a seed-centered, one-step latent mutation along directions defined by anchors sampled from alternative classes, followed by quantization and decoding back to the input space. This explores local neighborhoods around each seed within the learned latent manifold, resulting in a larger number and broader diversity of oracle-triggering prediction discrepancies under the same budget. We evaluated Latte on 5 datasets and 10 DNN models in single-model and multi-model testing scenarios. Across the evaluated datasets and models, Latte improves fault exposure and behavioral diversity under matched testing budgets. Under the single-model setting, it also maintains low seed-relative semantic drift with respect to the source seeds.
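The mutation step in the abstract can be pictured with a small sketch. It is an illustration only, not the authors' implementation: it assumes latents are flat vectors, that the one-step mutation is a linear move from the seed toward the anchor, and that quantization snaps to the nearest codebook entry. The names `mutate_toward_anchor`, `quantize`, and `step` are hypothetical.

```python
"""Illustrative sketch of a seed-centered, one-step latent mutation.

Assumptions (not from the abstract): latents are flat vectors, the move is
linear, and quantization picks the nearest codebook vector.
"""


def mutate_toward_anchor(seed, anchor, step):
    """Move the seed latent a fraction `step` of the way toward the anchor."""
    return [s + step * (a - s) for s, a in zip(seed, anchor)]


def quantize(latent, codebook):
    """Return the codebook vector closest to `latent` (squared Euclidean)."""
    def sq_dist(code):
        return sum((x - c) ** 2 for x, c in zip(latent, code))
    return min(codebook, key=sq_dist)


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seed_latent = [0.1, 0.1]      # hypothetical encoder output for the seed
anchor_latent = [0.9, 1.0]    # hypothetical latent of a sample from another class

small_step = quantize(mutate_toward_anchor(seed_latent, anchor_latent, 0.2), codebook)
large_step = quantize(mutate_toward_anchor(seed_latent, anchor_latent, 0.8), codebook)
```

In this toy setup a small step stays on the seed's own code while a larger step lands on a different code, which a decoder would then map back to input space as a candidate test case.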

2606.04307 2026-06-04 cs.LG stat.CO stat.ME

Folded Transport MCMC: Certifiable Quotient Posterior Computation for Symmetric Bayesian Models

Jun Hu

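Affiliations Wuhan University of Technology

AI summary Addresses the way redundant multimodality in symmetric Bayesian models degrades MCMC convergence diagnostics. Proposes Folded Transport MCMC, which infers the quotient posterior directly by constructing an independence sampler on the fundamental domain of the symmetry group, and uses the LCNF oscillation-based certification framework to give provable certified lower bounds under the quotient metric.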

Comments 48 pages (including supplementary material), 5 figures, 6 tables. Submitted to Journal of the Royal Statistical Society: Series B

Abstract

Bayesian models with finite symmetry - mixture models with exchangeable components, structural identification with closely-spaced modes - define posteriors that are invariant under a group of label permutations, creating redundant multimodality that degrades MCMC convergence diagnostics. We introduce Folded Transport MCMC (FolT-MCMC), which performs inference directly on the quotient posterior by constructing an independence sampler on the fundamental domain of the symmetry group. The quotient proposal is formed by symmetrising a learned normalising flow over the group orbits. We prove that the LCNF oscillation-based certification framework transfers to the quotient metric with a stabiliser-corrected ball-mass bound and improved covering radius, and that the quantile-core certified lower bound improves whenever the unfolded flow exhibits cross-mode proposal deficiency. On Gaussian mixtures (d = 2 - 20), label-switching targets (up to 24 equivalent modes), and a standard Bayesian three-component mixture posterior, the quantile-core certified improvement ratio ranges from 2x to 145x, with the folded certificate empirically nearly dimension-free. On real accelerometer data from a supertall building during Typhoon Mangkhut, FolT-MCMC yields a non-vacuous quantile-core certificate where the unfolded certificate is vacuous.
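A toy sketch may help with the two ideas of folding onto a fundamental domain and symmetrising a proposal over group orbits. It assumes one scalar parameter per mixture component, takes the sorted ordering as the fundamental domain, and uses a product of Gaussians as a stand-in for the learned normalising flow. None of these choices come from the paper, and all function names are hypothetical.

```python
import itertools
import math


def fold(theta):
    """Map a parameter vector to the fundamental domain by sorting its entries."""
    return sorted(theta)


def base_density(theta, means, sigma=1.0):
    """Stand-in proposal: independent Gaussians, one per component label."""
    norm = (sigma * math.sqrt(2.0 * math.pi)) ** len(theta)
    quad = sum((t - m) ** 2 for t, m in zip(theta, means))
    return math.exp(-quad / (2.0 * sigma ** 2)) / norm


def symmetrised_density(theta, means):
    """Average the base density over every permutation of the labels."""
    perms = list(itertools.permutations(theta))
    return sum(base_density(p, means) for p in perms) / len(perms)


means = [-2.0, 0.0, 3.0]
theta = [2.9, -1.8, 0.1]
swapped = [theta[1], theta[0], theta[2]]   # same point with two labels exchanged
```

Because the average runs over the whole orbit, relabelled copies of a point receive the same density and fold to the same representative, so a sampler only has to cover one copy of each mode.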

2606.04305 2026-06-04 cs.LG stat.ML

Offline-to-Online Learning in Linear Bandits

Kushagra Chandak, Toshinori Kitamura, Xiaoqi Tan

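Affiliations University of Tokyo

AI summary For stochastic linear bandits, proposes an algorithm that balances offline data against online exploration, achieving sublinear regret while its regret relative to an offline reference decreases as the number of offline samples grows.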

Abstract

We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.

2606.04302 2026-06-04 cs.CL cs.LG

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park

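Affiliations Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Google; Amazon

AI summary Addresses the poor reusability of position-encoded KV caches in retrieval-augmented generation. Proposes LazyAttention, which kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse, reducing time-to-first-token by 1.37x and raising inference throughput by 1.40x over Block-Attention under skewed document distributions.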

Comments ICML 2026

Abstract

Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37$\times$ and increases inference throughput by 1.40$\times$ compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
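The idea of deferring positional encoding can be illustrated with a minimal sketch. It assumes a rotary-style encoding with a single 2-D frequency, which the abstract does not specify, and it is plain Python, not an attention kernel. The names `rotate`, `score`, and the angle step `theta` are hypothetical.

```python
import math


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by angle position * theta (one rotary frequency)."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


def score(query, q_pos, raw_key, k_pos):
    """Attention logit with the key's position applied when the key is read."""
    q = rotate(query, q_pos)
    k = rotate(raw_key, k_pos)   # rotation deferred until use, not baked into the cache
    return q[0] * k[0] + q[1] * k[1]


raw_key = [0.3, -1.2]   # cached once, without positional rotation
query = [0.7, 0.5]

# The same cached key serves two requests that place the document differently.
logit_a = score(query, q_pos=5, raw_key=raw_key, k_pos=3)
logit_b = score(query, q_pos=105, raw_key=raw_key, k_pos=103)
```

Under this assumption the logit depends only on the relative offset between query and key, so one unrotated copy of the key can be reused at any absolute position without rewriting the cache.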

2606.04301 2026-06-04 cs.CV

XSSR: Cross-Domain Self-Supervised Representative Selection for Efficient Annotation in Medical Image Segmentation

Byunghyun Ko, Aleksei Anisimov, Kobe Ke, Suhas Bharthepude, Jeongkyu Lee

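Affiliations Northeastern University, San Jose, CA 95113, USA; Northeastern University, New York, NY 10021, USA

AI summary Proposes XSSR, a framework that uses self-supervised learning to automatically select representative target-domain samples for annotation, reaching 99.3% of full-data performance on Chest X-ray with only a 5% annotation budget.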

Comments Accepted to the Third International Conference on AI in Healthcare (AIiH 2026). This is the preprint version of the paper

Abstract

Acquiring labeled medical image data is resource-intensive and a challenge further exacerbated in cross-domain scenarios where source and target datasets differ in imaging equipment, population, or clinical site. This study introduces XSSR (Cross-Domain Self-Supervised Representative Selection), a framework designed to minimize annotation effort in the target domain while maintaining robust segmentation performance. XSSR comprises three stages: first, a Masked Autoencoder (MAE) is trained on unlabeled source data to establish a shared embedding space without requiring target labels; second, a greedy selection algorithm scores unlabeled target samples based on a composite density, novelty, and diversity criterion; and third, a U-Net segmentation model is trained exclusively on the selected subset. The novelty-diversity trade-off parameter, alpha, is automatically calibrated by minimizing embedding-space coverage, eliminating manual tuning. We evaluate XSSR on three public benchmarks: Chest X-ray, RIGA+ retinal fundus imaging, and multi-site Prostate MRI, each under a fixed 5% annotation budget. XSSR achieves 99.3% of full-data performance on Chest X-ray using only 22 labeled samples, surpasses random selection by up to 2.5 Dice points on Prostate MRI, and consistently outperforms the CoreSet baseline by 0.4 to 1.2 Dice points across all datasets. Ablation studies indicate that diversity is the most influential scoring component, and per-site analysis shows that performance correlates with scanner similarity to the source domain.

2606.04299 2026-06-04 cs.CV cs.LG

Efficient and Training-Free Single-Image Diffusion Models

Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell

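Affiliations Department of Computer Science, University of Toronto; Vector Institute

AI summary Proposes a training-free single-image diffusion model built on a multi-scale patch dataset; a closed-form optimal denoiser enables efficient generation with quality and diversity comparable to trained models.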

Comments CVPR 2026; Project Page: https://haojunqiu.github.io/efficient-SID/

Abstract

We consider the problem of generating images whose internal structure -- defined by the distribution of patches across multiple scales -- matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
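For a finite dataset, the optimal denoiser has a well-known closed form: the posterior mean is a softmax-weighted average of the data points. The sketch below shows that formula on toy 2-D "patches", assuming additive Gaussian noise of standard deviation `sigma` and a uniform prior over the dataset. It is not the authors' multi-scale pipeline, and `optimal_denoise` is a hypothetical name.

```python
import math


def optimal_denoise(noisy, patches, sigma):
    """Posterior mean E[x0 | x_t] when x0 is uniform over `patches`
    and x_t = x0 + sigma * standard Gaussian noise."""
    logits = [
        -sum((n - p) ** 2 for n, p in zip(noisy, patch)) / (2.0 * sigma ** 2)
        for patch in patches
    ]
    top = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [
        sum(w * patch[d] for w, patch in zip(weights, patches)) / total
        for d in range(len(noisy))
    ]


patches = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
noisy = [0.9, 1.2]

sharp = optimal_denoise(noisy, patches, sigma=0.05)    # low noise level
blurry = optimal_denoise(noisy, patches, sigma=100.0)  # high noise level
```

At a low noise level the output collapses onto the nearest patch, and at a high noise level it approaches the dataset mean. No network is trained at any point, which is the property the abstract relies on.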

2606.04296 2026-06-04 cs.AI

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

Manvendra Modgil

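Affiliations manvendramodgil.ai

AI summary Uses HEART, an 18-dimensional affective-dynamics engine, to diagnose the intervention-timing problem for autonomous agents. Finds a state saturation trap, a capability-and-context floor for LLM judges, and very low agreement among human annotators on when to intervene, indicating that intervention timing is a low-reliability construct.

Comments 11 pages, 5 tables. Code and data: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap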

Abstract

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
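The pairwise agreement statistic quoted in the abstract, Cohen's kappa, compares observed agreement with the agreement expected by chance. The sketch below computes it for two raters. The labels are made up for illustration and are not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# 1 = "intervene at this action", 0 = "do not intervene" (made-up labels)
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the raters agree on three of four items, yet kappa is only 0.5 because half of that agreement is expected by chance. This is why raw agreement rates can overstate reliability when one label dominates.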

2606.04291 2026-06-04 cs.CV

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Hongyang Du, Zongxia Li, Dawei Liu, Runhao Li, Haoyuan Song, Qingyu Zhang, Yubo Wang, Jingcheng Ni, Shihang Gui, Congchao Dong, Tao Hu

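Affiliations Brown University; University of Maryland, College Park; University of Pennsylvania; University of Southern California; New York University; The University of Sydney; Stability AI

AI summary Presents a data-centric taxonomy of 3D vision. By analysing geometric representations (point clouds, meshes, voxels, and 3D Gaussians) and their acquisition pipelines, together with dataset design, benchmark construction, and supervision regimes, it unifies the relationships among representations, learning paradigms, and downstream tasks (reconstruction, generation, and video modeling).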

Comments Accepted to the CVPR 2026 OpenSUN3D Workshop. Official version available at CVF Open Access. https://openaccess.thecvf.com/content/CVPR2026W/OpenSUN3D/html/Du_A_Cookbook_of_3D_Vision_Data_Learning_Paradigms_and_Application_CVPRW_2026_paper.html

Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026
Abstract

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data--point clouds, meshes, voxels, and 3D Gaussians--along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.

2606.04290 2026-06-04 cs.LG math.OC

PE-MHL: Physics-Encoded Modular Hybrid Layers for Scalable Learning of Complex Systems

Ismail Hassaballa, Mircea Lazar

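Affiliations TUE (Eindhoven University of Technology)

AI summary Proposes the Physics-Encoded Modular Hybrid Layer (PE-MHL) framework, which adds sub-models incrementally while guaranteeing a monotonically non-increasing training error, giving scalable and robust hybrid modeling that outperforms equivalently sized monolithic networks on a nonlinear NARX benchmark and the Quanser Aero 2 platform.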

Abstract

Hybrid models that combine physics-based and data-driven components have shown strong potential for achieving accuracy and interpretability in control applications. While recent methods have made progress in incorporating physical consistency, challenges remain in scalability, robustness to noise, and control of model complexity. This paper proposes a Physics-Encoded Modular Hybrid Layer (PE-MHL) framework, in which a baseline physics-based model is incrementally refined through the addition of new sub-models, where each new component adds complexity while preserving what previous components have already learned. We establish a theoretical guarantee for this construction: with a least-squares initialization of each new sub-model, the training error is monotonically non-increasing in the number of sub-models and provably converges. Empirical evaluations on a nonlinear NARX benchmark and the Quanser Aero 2 platform demonstrate that PE-MHL outperforms equivalently sized monolithic networks in both accuracy and generalization, while also providing more stable training dynamics and better preservation of underlying data structures.
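A toy sketch can show why a least-squares fit of each new component to the current residual cannot increase the training error. It uses scalar-weighted fixed features in place of neural sub-models, so it is a simplification under stated assumptions rather than the PE-MHL architecture. The baseline, the features, and all names are hypothetical.

```python
def least_squares_coefficient(residual, feature):
    """Best scalar c minimising sum((residual - c * feature)^2)."""
    denom = sum(f * f for f in feature)
    if denom == 0.0:
        return 0.0
    return sum(r * f for r, f in zip(residual, feature)) / denom


def stagewise_errors(targets, baseline, features):
    """Add one sub-model at a time and record the training SSE after each."""
    residual = [t - b for t, b in zip(targets, baseline)]
    errors = [sum(r * r for r in residual)]
    for feature in features:
        c = least_squares_coefficient(residual, feature)
        residual = [r - c * f for r, f in zip(residual, feature)]
        errors.append(sum(r * r for r in residual))
    return errors


inputs = [0.0, 1.0, 2.0, 3.0]
targets = [0.1, 1.4, 4.9, 10.2]
baseline = [1.0 * x for x in inputs]               # stand-in physics model
features = [[x ** 2 for x in inputs],              # sub-model 1 (hypothetical)
            [1.0 for _ in inputs]]                 # sub-model 2 (hypothetical)
errors = stagewise_errors(targets, baseline, features)
```

Each step projects the residual onto the new feature, which removes a non-negative amount from the sum of squared errors. Earlier components are left untouched, matching the abstract's description of preserving what previous components have learned.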

2606.04287 2026-06-04 cs.LG cs.AI

Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models

Alessio Barboni, Massimiliano Lupo Pasini, Bishal Lakha, Edoardo Serra

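Affiliations Boise State University; Oak Ridge National Laboratory

AI summary Proposes a lightweight autoregressive framework that uses structure-guided topological ordering and a two-phase training strategy to generate graphs with high novelty, validity, and uniqueness on molecular and non-molecular benchmarks.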

Abstract

Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cybersecurity, and beyond. However, current graph generative models remain limited by scalability and novelty. Diffusion-based methods often require costly full-adjacency operations and long denoising chains, while many autoregressive and hybrid models have at least quadratic complexity. In addition, these models often imitate training graphs rather than generalize beyond them. We propose a lightweight autoregressive framework to address these issues. It uses a structure-guided topological ordering to serialize graphs into regular edge sequences, enabling near log-linear generation, and a two-phase training strategy that combines exploration-oriented augmentation with iterative refinement to reduce overfitting and promote controlled novelty. Experiments on molecular and non-molecular benchmarks show that our approach improves novelty while preserving high validity and uniqueness. The framework also supports both LSTM and Mamba-style causal sequence backbones, with large-memory accelerators enabling longer graph-sequence experiments beyond typical GPU limits.

2606.04286 2026-06-04 cs.CL

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

Linsen Li, Aron Culotta, Nicholas Mattei

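Affiliations Department of Computer Science, Tulane University

AI summary Proposes a text-based causal analysis method built on CausalBERT, enhanced with temperature scaling, hyperparameter optimization, and interpretability methods, to disentangle the effect of each factor on overall ratings using over 600K reviews of U.S. K-12 schools.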

Comments HLT/NAACL 2025

Journal ref
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
Abstract

Online reviews provide valuable insights into the perceived quality of facets of a product or service. While aspect-based sentiment analysis has focused on extracting these facets from reviews, there is less work understanding the impact of each aspect on overall perception. This is particularly challenging given correlations among aspects, making it difficult to isolate the effects of each. This paper introduces a methodology based on recent advances in text-based causal analysis, specifically CausalBERT, to disentangle the effect of each factor on overall review ratings. We enhance CausalBERT with three key improvements: temperature scaling for better calibrated treatment assignment estimates; hyperparameter optimization to reduce confound overadjustment; and interpretability methods to characterize discovered confounds. In this work, we treat the textual mentions in reviews as proxies for real-world attributes. We validate our approach on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools. We find that the proposed enhancements result in more reliable estimates, and that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.

2606.04284 2026-06-04 cs.LG cs.AI cs.CL

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg

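Affiliations Saarland University; Independent Researcher; Bielefeld University; Max Planck Institute for Software Systems; Max Planck Institute for Informatics

AI summary Proposes a sparse Mixture-of-Experts reward model trained with sparse routing and expert diversity; it learns interpretable expert patterns from binary preference data and improves test-time adaptation and interpretability for personalized preference modeling.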

Abstract

Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
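The combination of sparse routing and binary preference training can be sketched as follows. The sketch assumes top-k softmax gating and a Bradley-Terry loss, which are common choices but are not confirmed by the abstract. It also reuses one gate for both responses for brevity, whereas a real router would depend on the input. All names and numbers are hypothetical.

```python
import math


def top_k_gate(router_logits, k):
    """Softmax over the k largest router logits; all other experts get weight 0."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    top = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - top) for i in chosen}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0
            for i in range(len(router_logits))]


def moe_reward(expert_rewards, gate_weights):
    """Combine per-expert scalar rewards with the sparse gate weights."""
    return sum(w * r for w, r in zip(gate_weights, expert_rewards))


def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood for one binary preference pair."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


gate = top_k_gate([2.0, -1.0, 0.5, 1.5], k=2)
r_chosen = moe_reward([1.2, 0.0, -0.3, 0.8], gate)
r_rejected = moe_reward([0.1, 0.4, 0.9, -0.5], gate)
loss = preference_loss(r_chosen, r_rejected)
```

Only two experts receive non-zero weight, so the gate vector itself shows which experts a given input relies on. That is the kind of routing pattern the abstract describes as interpretable.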

2606.04282 2026-06-04 cs.CV

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne

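Affiliations Tuebingen AI Center, University of Tuebingen; Woven by Toyota, Inc.; Toyota Motor Europe

AI summary Presents the first comprehensive benchmark of promptable localization in generalist multimodal LLMs, covering four core tasks with standardized input and output formats, and shows that models are sensitive to format constraints.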

Abstract

Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models' ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.

2606.04280 2026-06-04 cs.LG cs.AI cs.IR

The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning

Justinas Zaliaduonis, Patrick Putzky, Till Richter, Sergios Gatidis

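Affiliations ETH Zürich

AI summary Formalizes the diversity condition in contrastive learning within a measure-theoretic framework, proposes a support-corrected InfoNCE variant, and experimentally examines how sampling diversity interacts with encoder inductive bias.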

Abstract

Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers meaningful latent geometry remain incompletely understood. We develop a measure-theoretic framework formalizing the diversity condition, a support requirement on positive-pair sampling that is necessary for isometric latent recovery. We show that the standard full-support von Mises-Fisher setting implies the satisfaction of the diversity condition and as a consequence global contrastive loss minimizers recover latent geometry up to orthogonal transformation, while restricted conditionals can make non-orthogonal maps attain strictly lower asymptotic contrastive loss. We introduce a support-corrected Information Noise Contrastive Estimation (InfoNCE) variant as a theoretical fix: this correction makes orthogonal latent space recovery achievable but does not uniquely select it. Experiments on synthetic benchmarks validate the identifiability predictions, and CIFAR-10 experiments are consistent with the qualitative prediction that architectural inductive bias becomes more important when sampling diversity is limited. Together, our results clarify how sampling mechanisms and encoder inductive bias interact in contrastive representation learning.
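For reference, the standard InfoNCE loss that the paper analyses can be written in a few lines. The sketch below covers one anchor with cosine similarity and a temperature; it is the usual baseline form, not the support-corrected variant the abstract introduces. The vectors and the default `temperature` are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)                                   # stable log-sum-exp
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


aligned = info_nce([1.0, 0.0], [1.0, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [-1.0, 0.1], [[1.0, 0.2], [0.0, 1.0]])
```

The loss is small when the positive is the most similar candidate and large otherwise. Which positives the sampler can ever produce is exactly the support question the diversity condition addresses.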

2606.04279 2026-06-04 cs.LG quant-ph

Derivative Informed Learning of Exchange-Correlation Functionals

Eike S. Eberhard, Luca A. Thiede, Abdul Aldossary, Andreas Burger, Nicholas Gao, Vignesh Bhethanabotla, Alán Aspuru-Guzik, Stephan Günnemann

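Affiliations Technical University of Munich; Munich Data Science Institute; Munich Center for Machine Learning; University of Toronto; Vector Institute; CuspAI; California Institute of Technology

AI summary Proposes the Derivative Informed XC-Loss (DI-Loss), which supervises first and second derivatives of the energy on the Grassmannian of density matrices to train O(N^3)-scaling machine-learned exchange-correlation functionals to reproduce B3LYP/def2-SVP targets. Averaged across architectures, total-energy MAE falls by 66%, and hybrid-functional SCF iterations drop by up to 50%.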

Comments Proceedings of the 43rd International Conference on Machine Learning

English abstract

Machine-learned (ML) exchange-correlation (XC) functionals aim to replace human-designed density functional approximations by learning directly from reference data, but they still do not consistently outperform traditional $\mathcal{O}(N^4)$-scaling hybrid functionals. We study a hybrid-distillation setting in which $\mathcal{O}(N^3)$-scaling ML-XC functionals are trained to reproduce B3LYP/def2-SVP targets. We introduce Derivative Informed XC-Loss (DI-Loss), a loss that incorporates additional information from the reference hybrid functional by supervising first and second derivatives of the energy on the Grassmannian of admissible density matrices. Rather than only matching the self-consistent fixed point, DI-Loss aligns the local first- and second-order response of the learned functional with that of the target functional. Across four evaluated architectures, DI-Loss consistently improves the main energy metrics. Averaged uniformly across architectures, the total-energy MAE decreases by 66% relative to energy and density supervision alone. The density-sensitive mean-field energy metric $E_\rho$ improves from $1.2$ to $0.8$ mEh on average, while dipole and $\mathcal{L}_2$ density errors do not improve uniformly. We further show that densities from the distilled functionals reduce hybrid-functional SCF iterations by up to 50%. In downstream TDDFT calculations, Hessian supervision improves excited-state predictions, with XCdiff reducing the mean excitation-energy MAE by 19-35%.

2606.04275 2026-06-04 cs.LG cs.AI

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

从时钟节拍到流:连续环境中神经强化学习的动力学

Saket Tiwari, Tejas Kotwal, George Konidaris

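Affiliations: Brown University

AI summary: Models deep reinforcement learning as a continuous-time stochastic process and, drawing on stochastic control theory, derives for the first time an equation for how the state distribution evolves under overparametrized neural actor-critic algorithms in continuous environments in the infinite-width limit.

Comments: Presented at ICLR 2026: https://openreview.net/forum?id=TdiRLe3rPA

Details
AI Chinese abstract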

我们提出了一种新颖的深度强化学习(RL)在连续环境中的理论框架,通过借鉴随机控制的思想,将问题建模为连续时间随机过程。在先前工作的基础上,我们引入了一个可行的演员-评论家算法模型,该模型同时包含探索和随机转移。对于单隐藏层神经网络,我们表明环境状态可以表述为两个时间尺度的过程:环境时间和梯度时间。在此框架下,我们描述了表示环境状态和累积折扣回报估计的时间相关随机变量如何在两层网络的无限宽度极限下随梯度步长演化。利用随机微分方程理论,我们首次在连续RL中推导出一个方程,描述了在极小的学习率下,每个梯度步长上状态分布的无穷小变化。总体而言,我们的工作为研究过参数化神经演员-评论家算法提供了一种新颖的非参数化表述。我们通过一个简单的连续控制任务实证验证了我们的理论结果。

English abstract

We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a two-time-scale process, with an environment time and a gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.

2606.04274 2026-06-04 cs.CL cs.CY

Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

长存微调:在Reddit上,任务特定Transformer在错误信息响应分类中优于零样本LLM

JooYoung Lee, Lin Tian, Angela Brillantes, Adriana-Simona Mihăiţă, Marian-Andrei Rizoiu

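Affiliations: University of Technology Sydney

AI summary: Compares fine-tuned models with zero-shot LLMs on classifying Reddit comments that respond to misinformation, and finds that fine-tuned RoBERTa clearly beats the best zero-shot model on macro-F1 at lower cost, indicating that task-specific fine-tuning remains more reliable for detecting implicit misinformation.

Details
AI Chinese abstract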

随着大型语言模型(LLM)成为在线信息验证的默认工具,一个隐含的假设随之而来:规模和通用能力足以对错误信息话语进行细致分类。我们直接在900条Reddit评论上测试这一假设,这些评论涵盖三个经PolitiFact验证的错误信息主张(环境、健康、移民),并标记为相信(传播主张)、事实核查(纠正主张)或其他。我们比较了三种范式下的九个模型——BART-MNLI、三种Llama变体、三种商业前沿LLM(Claude Haiku 4.5、Gemini Flash Lite 2.5、Claude Sonnet 4.6),以及微调的DistilBERT和RoBERTa——在通用和主题特定标签方案下。该假设不成立。微调RoBERTa达到0.62的宏F1,而最佳零样本结果为0.50(Claude Haiku 4.5),且每次查询成本极低;监督优势集中在相信类别,这是每个零样本模型都检测不足的隐式情感类别。规模无帮助:Llama-3-8B与Llama-3-70B表现相当,Claude Sonnet 4.6在通用标签下表现逊于较小的Haiku,将相信检测降至0.17,并直接拒绝部分被标记为敏感的评论。这是安全对齐的伪影,而非能力限制。标签方案和主题共同塑造零样本性能,同一模型在匹配标签下不同主题间的宏F1差异超过0.13。在验证场景中,遗漏相信是代价更高的错误,尽管大型生成模型激增,任务特定微调仍是更可靠的选择。

English abstract

As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.
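The headline numbers above are macro-$F_1$ scores. As a minimal sketch of how that metric is computed over the three labels (belief, fact-check, other), assuming the common convention that a class with no true or predicted instances scores 0; the function name and the toy data are illustrative and are not taken from the paper:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the class never appears at all.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: one missed "belief" comment pulls down two of the three class scores.
labels = ["belief", "fact-check", "other"]
y_true = ["belief", "belief", "fact-check", "other"]
y_pred = ["other", "belief", "fact-check", "other"]
score = macro_f1(y_true, y_pred, labels)
```

Because each class contributes equally, a model that under-detects a single class such as belief is penalized even when that class is a small share of the data.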

2606.04273 2026-06-04 cs.AI

Characterizing initial human-AI proof formalization workflows

表征初始人机交互的证明形式化工作流

Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Sam Looi, Joshua B. Tenenbaum, Umang Bhatt, Adrian Weller, Jose Hernandez-Orallo, Cameron E. Freer, Valerie Chen, Ilia Sucholutsky

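Affiliations: Massachusetts Institute of Technology; University of Cambridge; Princeton University; University of Oxford; Caltech; Carnegie Mellon University; Universitat Politècnica de València; New York University

AI summary: Uses a mixed-methods analysis to study what people want from AI tools when formalizing proofs, the barriers they perceive, and how they actually use the tools. It finds that AI assistance raises formalization accuracy and that preferences are diverse but generally favor keeping high-level human control over the proof discovery process.

Details
AI Chinese abstract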

几个世纪以来,人类数学家通过书写证明来支撑其数学论证;然而,自动验证证明有效性的能力长期以来一直是一个挑战。AI系统在生成代码和进行日益高级的数学推理方面的进步,有望改变人们形式化并进而验证证明的能力。虽然许多工作聚焦于对当前前沿进行基准测试,但我们转而研究人们如何使用这些工具。我们采用混合方法分析,研究AI对人们形式化工作流的初始影响:人们声称想要什么,他们认为这些愿景的障碍是什么,以及他们在实践中如何实际使用和适应AI。一项定性调查显示,人们的偏好是多样化的,但普遍希望AI辅助形式化,同时保留人类对证明发现过程的高层控制。为了评估在这种限制下人们如何实际使用AI进行形式化,我们进行了一项受控用户研究,参与者形式化非正式的数学问题及其证明,在有和没有AI的情况下,涉及不同难度和领域的多种数学问题。尽管当时用于自动形式化的工具有限,但参与者在使用AI工具时往往比单独形式化时获得更高的形式化准确率,大多数参与者灵活选择使用多种不同的AI工具。综合来看,我们的工作揭示了AI融入形式化工作流的早期阶段,涉及人类与AI参与的密切互动。

English abstract

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.

2606.04272 2026-06-04 cs.LG

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

预训练期间的强化学习探索:重新审视LLM训练中的策略优化

Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham Kakade

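Affiliations: Harvard University

AI summary: Questions the standard LLM pipeline's practice of applying reinforcement learning only after pre-training and supervised fine-tuning. By training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate checkpoints, it finds that RL is effective early and can match the full pipeline, and that merging the RL and SFT objectives by parallel averaging outperforms the other methods while preserving general capabilities.

Details
AI Chinese abstract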

标准的LLM训练流程仅在预训练和监督微调(SFT)之后应用强化学习(RL)。我们通过从头训练LLM,并直接在中间预训练检查点上应用RL、SFT以及SFT后接RL,来质疑这一现状。我们发现RL在早期非常有效,并且通常也能在早期匹配完整的SFT→RL流程。通过在更难问题上的实验,我们发现针对性的预训练数据组成是RL有效性的强大杠杆,甚至比模型规模更重要。除了推理准确性之外,直接将RL应用于基础检查点会扩展模型的分布;而最近工作中报告的锐化效应仅在RL跟随SFT时出现。RL基本不改变模型的通用能力,而SFT后通用能力会下降。最后,我们通过并行平均合并RL和SFT目标,该方法在所有其他训练方法中表现最佳,跨指标均优于其他方法,同时保持通用能力。这些结果表明,LLM训练可能受益于RL的更广泛使用。

English abstract

The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT$\to$RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model's distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms all other training methods discussed across metrics while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.

2606.04271 2026-06-04 cs.CV cs.AI

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

StandardE2E:端到端自动驾驶数据集的统一框架

Stepan Konev

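Affiliations: University of Cambridge

AI summary: Proposes the StandardE2E framework, which addresses format incompatibility among end-to-end autonomous driving datasets through a unified data schema, joint loading of multiple datasets, and a simplified process for adding new datasets.

Details
AI Chinese abstract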

自动驾驶已从模块化的感知-预测-规划堆栈转向端到端(E2E)模型,这些模型直接将传感器输入映射到车辆控制,通常通过辅助任务(如3D检测、运动预测和高清地图感知)进行正则化。进展由快速增长的传感器丰富驾驶数据集生态系统驱动,但每个数据集都有自己的文件格式、API、坐标约定和模态覆盖范围,导致跨数据集实验甚至基本的每个数据集预处理都需要为每个项目重新实现。我们提出StandardE2E,一个为E2E驾驶数据集提供统一接口的框架。StandardE2E (i) 在共享数据模式下标准化每个数据集的预处理;(ii) 在单个PyTorch DataLoader中组合多个数据集,用于跨数据集预训练、辅助任务监督和场景级过滤;(iii) 将添加新数据集简化为从原始帧到规范模式的单个数据集映射,而整个下游流程保持不变。该框架开箱即支持六个数据集:Waymo End-to-End、Waymo Perception、Argoverse 2 Sensor、Argoverse 2 LiDAR、NAVSIM (OpenScene-v1.1) 和 WayveScenes101,并作为开源的 standard-e2e Python 包发布,可在 https://github.com/stepankonev/StandardE2E 获取。

English abstract

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

2606.04269 2026-06-04 cs.RO cs.AI cs.CV

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Instant-Fold: 可变形物体操作的情境模仿学习

Yilong Wang, Cheng Qian, Edward Johns

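Affiliations: The Robot Learning Lab; Imperial College London

AI summary: Proposes Instant-Fold, an in-context imitation learning framework that infers and executes diverse deformable-object manipulation modes from a single human demonstration without gradient updates, and transfers zero-shot to the real world after training in simulation.

Details
AI Chinese abstract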

可变形物体操作(DOM)具有挑战性,因为其状态是高维、部分可观测的,并且通过长时间跨度、拓扑变化的交互演变,涉及多种有效的操作模式。我们引入了Instant-Fold,一个用于DOM的情境模仿学习框架。给定单次人类演示,我们的策略直接从演示中推断并执行多种操作模式,包括空间执行和顺序的变化,无需梯度更新。我们的方法首先通过时间对比预训练学习变形感知的视觉表示,然后基于演示的条件流匹配变换器策略预测执行预期操作模式的动作。完全在仿真中训练的Instant-Fold能够泛化到多种折叠模式,并零样本迁移到真实世界环境,无需额外的数据收集或微调。视频可在https://instant-fold.github.io获取。

English abstract

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

2606.04264 2026-06-04 cs.CV

UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

UniCanvas: 一种基于扩散的图文联合生成统一模型

Zeyuan Yang, Hao-Wei Chen, Xueyang Yu, Yuncong Yang, Haoyu Zhen, Ziqiao Ma, Maohao Shen, Chuang Gan

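Affiliations: UMass Amherst; University of Michigan; MIT

AI summary: Proposes UniCanvas, which uses a diffusion model to generate text and images jointly by rendering text inside the image on a shared pixel canvas, addressing the weaknesses of existing models in visual and text generation.

Details
AI Chinese abstract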

近年来,在单一架构内同时处理多模态理解与生成的统一视觉语言模型取得了显著进展。虽然自回归VLM能够跨模态推理,但无法生成高质量图像。相比之下,扩散模型能生成逼真的视觉效果,却难以生成连贯的文本,这使得开发一个能无缝处理视觉和文本生成的统一模型变得具有挑战性。最近的进展表明,语言可以有效地嵌入到视觉表示中,使模型能够直接从图像中推理文本语义。为此,我们提出了UniCanvas,这是首次尝试通过文本图像生成来统一扩散模型以生成交错多模态内容。扩散模型自然地捕捉共享像素画布上的变换,这可以视为视觉变化的世界模型。该模型不是生成离散的文本标记,而是学习将语言表示为图像内部的视觉模式,利用其固有的多模态嵌入空间。这种设计使得模型在图像合成过程中能够在单个像素画布上自然地“绘制”文本,实现无缝的多模态生成。实验表明,UniCanvas在性能上优于先前的统一模型,将基于扩散模型的文本图像生成定位为一种有前景的统一多模态生成范式。

English abstract

Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.

2606.04262 2026-06-04 cs.CL cs.AI

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

我可以再服一剂吗?评估LLM在OTC剂量问答中时间不确定性下的决策能力

Maroof Kousar, Yibo Hu

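Affiliations: Illinois Institute of Technology

AI summary: Introduces the DOSEBENCH benchmark, which evaluates how well large language models handle temporal reasoning, constraint following, and uncertainty in over-the-counter dosing question answering.

Comments: 16 pages, 7 figures

Details
AI Chinese abstract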

大型语言模型(LLM)越来越多地被用于日常健康问题,包括用户是否可以安全地再服用一剂非处方(OTC)药物。然而,这一常见的安全相关场景在现有的医学问答评估中仍未得到充分探索,其中正确答案需要跟踪剂量时间、计算滚动24小时摄入量、遵循产品标签约束以及处理不完整的用药史。我们引入了DOSEBENCH,这是一个包含81个精心策划的OTC剂量场景的聚焦基准测试,专注于成人对乙酰氨基酚和布洛芬的使用,并带有手动标注的金标准参考。我们使用决策正确性、一致性、解释可验证性、失败类型和置信度相关信号等指标,在多次运行中评估了四个LLM,共获得1620个模型响应。我们的结果表明,模型在滚动窗口推理和模糊敏感场景中经常遇到困难,且稳定或看似自信的响应仍可能违反剂量约束。这些发现表明,OTC剂量问答为评估医学问答中的时间推理、约束遵循和安全相关不确定性处理提供了一个狭窄但实用的测试平台。

English abstract

Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios covering adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
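The rolling 24-hour intake computation the abstract refers to can be sketched as follows. This is an illustration of the reasoning pattern only, not the benchmark's scoring code: the function names are hypothetical, and `max_daily_mg` and `min_interval_h` are caller-supplied placeholders that would come from the product label.

```python
from datetime import datetime, timedelta

def rolling_24h_total(doses, now):
    """Sum the mg taken in the 24 hours ending at `now` (doses: list of (time, mg))."""
    window_start = now - timedelta(hours=24)
    return sum(mg for taken_at, mg in doses if window_start < taken_at <= now)

def can_take(doses, now, next_mg, max_daily_mg, min_interval_h):
    """True only if both the minimum interval and the rolling 24 h cap are respected."""
    past = [taken_at for taken_at, _ in doses if taken_at <= now]
    if past and now - max(past) < timedelta(hours=min_interval_h):
        return False  # too soon after the most recent dose
    return rolling_24h_total(doses, now) + next_mg <= max_daily_mg
```

The window slides with `now` rather than resetting at midnight, which is the distinction the abstract reports models frequently getting wrong.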

2606.04261 2026-06-04 cs.AI cs.CL cs.CV cs.ET cs.LG

Can Generalist Agents Automate Data Curation?

通用智能体能否自动化数据筛选?

Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

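Affiliations: Virginia Tech; University of Illinois Urbana-Champaign; University of Wisconsin-Madison; University of California, Berkeley

AI summary: Introduces the Curation-Bench benchmark for testing whether generalist coding agents can automate the data-curation loop. Experiments show that out-of-the-box agents reach strong baselines but exhibit an execution-research gap, whereas scaffolded, method-guided agents autonomously compose a data-selection policy that beats strong baselines at one-tenth of the data budget.

Comments: Preprint

Details
AI Chinese abstract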

训练数据的筛选是现代AI开发中最重要但劳动密集的部分之一:实践者根据嘈杂的基准反馈迭代地提出、实施、评估和修订数据策略。我们探究通用编码智能体能否自动化这一数据筛选循环。我们引入了*Curation-Bench*,一个以智能体为中心的基准,它固定模型、训练配方和评估套件,同时赋予智能体命令行权限以检查数据、实施策略、提交到固定的训练/评估流水线并进行修订。在视觉-语言指令微调实例中,现成智能体在十次迭代内达到了已发表的强数据选择基线。然而,轨迹分析揭示了持续的*执行-研究差距*:即使提供了策略指南和论文参考,智能体主要调整局部策略变体,而非探索新的策略家族。要求每次迭代引用、实例化和改编先前方法的框架将智能体转向方法引导的探索。这种框架化的智能体自主组合——无需人工设计输入——一种数据选择策略,在十分之一的数据预算下优于已发表的强基线。总体而言,当前智能体可以运行筛选循环,但可靠的数据研究需要框架化的方法适应,而非仅靠开放式提示。代码和基准已开源。

English abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2606.04251 2026-06-04 cs.CV

SBP-Net: Learning Thin Structure Reconstruction with Sliding-Box Projections

SBP-Net: 基于滑动盒投影的薄结构重建学习

Ofir Gilad, Andrei Sharf

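Affiliations: Faculty of Computer and Information Science, Ben Gurion University of the Negev

AI summary: To address the reconstruction challenges posed by the sparsity, scale variation, and complex geometry of thin 3D structures, proposes SBP-Net, a method based on local depth projections: a sliding box generates local orthographic depth projections, a neural network reconstructs the missing thin structures, and the results are fused back into the 3D model. It outperforms existing methods on pulmonary artery and industrial pipeline reconstruction.

Comments: Accepted to IEEE ICIP 2026, 6 pages, 4 figures

Details
AI Chinese abstract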

重建薄3D结构因其稀疏性、尺度变化和复杂几何而具有挑战性。这类结构出现在广泛领域,包括血管系统的医学成像和工业管道系统。虽然最近的神经方法在密集表面上表现良好,但常常无法恢复精细的薄几何形状。我们提出了一种基于局部深度投影的重建方法,该方法为薄结构提供了高效且信息丰富的2D表示。具体来说,我们使用滑动盒遍历3D模型以生成局部正交深度投影,然后由神经网络处理以在2D中重建缺失的薄结构。随后,局部重建结果被融合回3D模型,以产生连贯且详细的形状。在CT体积的肺动脉重建以及合成和真实扫描的工业管道恢复上的实验表明,与现有方法相比,该方法更好地保留了精细结构细节。

English abstract

Reconstructing thin 3D structures is challenging due to their sparsity, scale variation, and complex geometry. Such structures arise in a wide range of domains, including medical imaging of vascular systems and industrial pipe systems. While recent neural methods perform well on dense surfaces, they often fail to recover fine thin geometries. We propose a reconstruction approach based on local depth projections, which provide an efficient and informative 2D representation of thin structures. Specifically, we traverse the 3D model with a sliding box to generate local orthographic depth projections, which are processed by a neural network to reconstruct missing thin structures in 2D. The local reconstructions are subsequently fused back into the 3D model to produce a coherent and detailed shape. Experiments on pulmonary artery reconstruction from CT volumes and industrial pipeline recovery from synthetic and real scans demonstrate improved preservation of fine structural details over existing methods.

2606.04249 2026-06-04 cs.CV eess.IV

Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement

基于潜空间运动跟踪的单次测量前瞻性动态3D MRI重建

Lixuan Chen, Zhongnan Liu, Jesse Hamilton, James M. Balter, Jeong Joon Park, Liyue Shen

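Affiliations: University of Michigan

AI summary: Proposes the PDMR framework, which learns a low-dimensional latent manifold of motion fields offline and uses a tri-plane representation for efficient encoding, achieving high-fidelity, temporally consistent prospective dynamic 3D MRI reconstruction from a single measurement.

Details
AI Chinese abstract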

前瞻性重建在许多临床应用中至关重要,例如MRI引导的放射治疗,这需要从当前获取的测量中实现精确的图像重建和快速运动估计。然而,由于超稀疏采样和严格的延迟要求,前瞻性重建仍然具有挑战性。在这项工作中,我们提出了PDMR,一种具有潜空间运动跟踪的前瞻性动态3D MRI重建框架。我们的核心思想是离线学习一个高效且可泛化的运动场潜流形,从而实现快速在线自适应以进行前瞻性重建。具体来说,我们将变形矢量场(DVF)参数化在低维流形上,有效减少了快速在线自适应的搜索空间,并采用三平面表示实现几何感知和内存高效的3D运动编码。在XCAT数字体模和内部腹部MRI数据集上的实验表明,PDMR在多个前瞻性场景(立即和2分钟后)中实现了高保真和时间一致的重建,优于最先进的回顾性和在线方法。我们的结果为临床实践中实现超快速、运动感知的前瞻性MRI重建提供了一条有前景的途径。

English abstract

Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.

2606.04248 2026-06-04 cs.RO

RSC: Decentralized Rigid Formation Flocking for Large-Scale Swarms via Hybrid Predictive Control and Online Reconfiguration

RSC:通过混合预测控制与在线重配置实现大规模集群的分散式刚性编队集群

Ganyu Zou, Linhan Wang, Chen Dai, Siji Chen, Chang-Tien Lu

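Affiliations: University of Science and Technology of China

AI summary: Proposes RSC, a decentralized control framework that combines finite-horizon trajectory prediction with a reactive artificial potential field safety controller and adds an online leader-follower reconfiguration mechanism, achieving an 83% success rate for formation maintenance, obstacle avoidance, and target tracking with 25 UAVs in cluttered environments.

Comments: 8 pages, 4 figures, two-column format

Details
AI Chinese abstract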

分散式刚性编队集群要求自主智能体集群在移动过程中仅依靠局部感知和通信来维持预定的几何构型。然而,现有的分散式控制方法在杂乱环境中难以保持严格的智能体间距离约束,常常遭遇局部极小死锁、高频控制振荡或避障时灵活性有限等问题,导致成功率低。为解决这些限制,我们提出了刚性集群控制(RSC),一种用于大规模刚性编队集群的分散式控制框架。为了通过鲁棒的长期规划逃离局部极小同时确保短期安全,RSC在混合架构中集成了有限时域轨迹预测与反应式人工势场(APF)安全控制器。此外,为了在穿越障碍后加速编队重组而不中断任务执行,RSC引入了一种基于稳定角色交换的在线领航-跟随重配置机制。在25架无人机的挑战性杂乱环境中的广泛评估表明,RSC可靠地统一了刚性编队保持、避障和目标跟踪。在严格的成功标准——无碰撞运行且最大相对边长度误差低于10%下,RSC实现了83%的成功率,显著优于成功率低于5%的现有启发式和基于学习的基线方法。

English abstract

Decentralized rigid formation flocking requires a swarm of autonomous agents to maintain a predetermined geometric configuration while moving, relying solely on local sensing and communication. However, existing decentralized control methods struggle to maintain strict inter-agent distance constraints in cluttered environments, often suffering from local-minima deadlocks, high-frequency control oscillations, or limited flexibility during obstacle navigation, resulting in a low success rate. To address these limitations, we propose Rigid Swarm Control (RSC), a decentralized control framework for large-scale rigid formation flocking. To escape local minima via robust long-term planning while ensuring short-term safety, RSC integrates finite-horizon trajectory predictions with a reactive artificial potential field (APF) safety controller within a hybrid architecture. Furthermore, to accelerate formation reassembly after obstacle traversal without interrupting task execution, RSC introduces an online leader-follower reconfiguration mechanism based on stable role exchange. Extensive evaluations in challenging cluttered environments with 25 UAVs demonstrate that RSC reliably unifies rigid formation maintenance, obstacle avoidance, and target tracking. Under strict success criteria (collision-free operation with a maximum relative edge-length error below 10%), RSC achieves an 83% success rate, significantly outperforming existing heuristic and learning-based baselines, which fall below 5%.
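As background for the reactive artificial potential field (APF) component, here is a minimal 2D sketch of the classic textbook formulation: a linear attraction to the goal plus a repulsion from each obstacle inside an influence radius. It is not RSC's controller; the gains `k_att` and `k_rep`, the radius `d0`, and the function name are illustrative assumptions.

```python
import math

def apf_velocity(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Commanded 2D velocity: attraction to the goal plus repulsion from nearby obstacles."""
    vx = k_att * (goal[0] - pos[0])
    vy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < d0:
            # Negative gradient of the repulsive potential 0.5 * k_rep * (1/d - 1/d0)^2
            mag = k_rep * (1.0 / d - 1.0 / d0) / (d * d)
            vx += mag * dx / d
            vy += mag * dy / d
    return vx, vy
```

When the attractive and repulsive terms cancel, the commanded velocity is zero away from the goal, which is the local-minimum deadlock the abstract says RSC's finite-horizon prediction is meant to escape.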

2606.04246 2026-06-04 cs.AI cs.AR cs.CL

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

StepPRM-RTL:基于逐步过程奖励引导的LLM微调以增强RTL综合

Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee

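Affiliations: IBM Research, San Jose, CA, USA

AI summary: Proposes the StepPRM-RTL framework, which combines stepwise trajectory modeling, a process reward model, and retrieval-augmented fine-tuning, using dense feedback and Monte Carlo Tree Search to explore reasoning paths. It improves the functional correctness and reasoning fidelity of LLM-generated RTL code, gaining over 10% on benchmark datasets relative to prior methods.

Comments: 6 pages, 2 figures, DAC'2026

Details
AI Chinese abstract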

由于Verilog和VHDL中的长程推理、多步依赖和严格正确性约束,数字硬件设计的RTL代码自动生成仍然具有挑战性。我们提出StepPRM-RTL,一种新颖的框架,结合逐步轨迹建模、过程奖励模型(PRM)和检索增强微调(RAFT),以增强基于LLM的RTL代码生成的功能正确性和推理保真度。StepPRM-RTL从规范解构建逐步推理轨迹,其中每一步包含一个理由和增量代码修改。过程奖励模型(PRM)评估中间步骤,提供密集反馈,指导RAFT微调期间的强化式更新。蒙特卡洛树搜索(MCTS)探索替代推理路径,用高质量轨迹丰富训练数据集。这种逐步和结果感知奖励的集成使模型能够学习如何以及为何构建正确的RTL,从而改善超出标准监督或基于结果训练的长程推理。在基准Verilog和VHDL数据集上的实验评估表明,StepPRM-RTL在功能正确性和推理保真度指标上优于先前最佳方法超过10%。消融研究证实,PRM引导奖励和逐步轨迹探索的结合是其性能的关键。StepPRM-RTL跨RTL语言泛化,并为高保真、可解释的代码生成提供了可扩展框架,为LLM辅助硬件设计自动化建立了新标准。

English abstract

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

2606.04244 2026-06-04 cs.AI cs.CL cs.CV cs.LG

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

VAMPS: 视觉辅助数学问题求解基准

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

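Affiliations: University of California, Berkeley

AI summary: Introduces the VAMPS benchmark, which uses 1,168 bilingual multiple-choice questions to evaluate multimodal large models on mathematical reasoning with plotting tools, and finds that direct analytical solving outperforms tool-assisted visual solving.

Details
AI Chinese abstract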

多模态大语言模型在复杂推理方面能力日益增强,但当它们必须通过工具外部化问题然后基于工具输出进行推理时,尤其是在依赖视觉辅助的情况下,其性能往往会下降。这一差距尤为重要,因为真实的工程和科学工作流程通常依赖可视化工具进行分析、验证和决策。为了研究这一差异,我们引入了VAMPS(视觉辅助数学问题求解),一个用于图辅助数学的基准。VAMPS包含1,168个多模态、双语选择题问答对,这些题目来自伊朗大学入学考试的代数和微积分问题,并通过人工审核的LLM生成的合成变体进行了扩展,所有题目都经过精心挑选,使得绘图能够通过揭示交点、极值、渐近线等提供自然的求解策略。VAMPS旨在用于基准测试和诊断,它超越了以往主要评估在固定视觉输入上进行推理的多模态基准,通过测试模型是否能够从构建有用的图形中受益并将其答案基于结果可视化。总体而言,我们发现,在一组多样化的模型中,直接解析求解出人意料地优于工具辅助的视觉求解,即使在绘图是自然策略的问题上也是如此。

English abstract

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2606.04240 2026-06-04 cs.CV cs.AI cs.CL

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Jingbiao Mei

Affiliations * University of Cambridge, Cambridge, United Kingdom

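AI summary This report presents the design, datasets, evaluation protocol, and final standings of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1), together with an analysis of the top three winning systems, all of which build on decoder-based multimodal LLM embedders from the Qwen2-VL family.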

Comments MDR Challenge Report at WWW2025

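AI-generated abstract (translated from Chinese)

Retrieval over visually rich documents, that is, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge is Track 1 of the MIR Challenge at the first EReL@MIR workshop, held alongside The Web Conference 2025. It asks participants to build a single retrieval system that handles two complementary regimes: closed-set retrieval of document pages within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average, over the two tasks, of mean Recall@{1,3,5}. The challenge attracted 455 entrants and 586 submissions from 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyzes the three winning teams' systems. All three build on decoder-based multimodal LLM embedders from the Qwen2-VL family rather than CLIP-style encoders. They differ mainly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system scored within 0.1 point of the fine-tuned winner.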

English abstract

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
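
The ranking metric mentioned in the abstract can be illustrated with a small sketch. The abstract states only that systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks; the exact averaging order, the per-query recall definition, and all names and toy data below (`recall_at_k`, `task_score`, `challenge_score`, the page and passage IDs) are assumptions made for illustration, not the organizers' evaluation code.

```python
from statistics import mean


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k ranked results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def task_score(queries, ks=(1, 3, 5)):
    """Mean over k of Recall@k, where each Recall@k is averaged over queries.

    `queries` is a list of (ranked_ids, relevant_ids) pairs.
    """
    return mean(
        mean(recall_at_k(ranked, relevant, k) for ranked, relevant in queries)
        for k in ks
    )


def challenge_score(tasks):
    """Macro-average: every task counts equally, whatever its query count."""
    return mean(task_score(queries) for queries in tasks.values())


# Hypothetical toy data: two page-retrieval queries, one passage-retrieval query.
mmdocir = [(["p3", "p1", "p7"], {"p1"}), (["p2", "p9", "p4"], {"p4"})]
m2kr = [(["a", "b"], {"a"})]

print(task_score(mmdocir))                                   # (0 + 1 + 1) / 3
print(challenge_score({"MMDocIR": mmdocir, "M2KR": m2kr}))   # (2/3 + 1) / 2
```

Under this reading, macro-averaging keeps a task with many queries from dominating the final score, which matters when the two tasks differ in size.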