arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30317 2026-05-29 cs.CV

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao

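AI summary: Proposes VPG, a training-free, inference-time guidance method that improves next-step prediction in autoregressive image and video generation by contrasting the model's outputs under the generated prefix and under a corrupted prefix, improving generation quality.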

Abstract

Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
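
The contrast-and-extrapolate step described in this abstract can be illustrated with a toy sketch. The abstract does not state VPG's exact update rule, so the linear extrapolation below (in the style of classifier-free guidance) is an assumption, and the function name `prefix_guided_logits`, the `scale` parameter, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_guided_logits(logits_generated, logits_corrupted, scale):
    """Push logits away from the corrupted-prefix prediction.

    Assumed rule (not taken from the paper): g + scale * (g - c).
    """
    return [g + scale * (g - c) for g, c in zip(logits_generated, logits_corrupted)]


# Made-up next-token logits over a 3-entry vocabulary.
logits_generated = [2.0, 1.0, 0.5]  # model output given the generated prefix
logits_corrupted = [1.0, 1.0, 0.5]  # model output given a corrupted prefix

guided = prefix_guided_logits(logits_generated, logits_corrupted, scale=1.5)
probs_before = softmax(logits_generated)
probs_after = softmax(guided)
```

In this toy case only the first token's logit depends on the prefix, so guidance raises that token's probability and leaves the other logits unchanged.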

2605.30315 2026-05-29 cs.CL cs.LG

Resolution Diagnostics for Paired LLM Evaluation

Anany Kotawala

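AI summary: Addresses the problem that pairwise rankings on public LLM leaderboards often fail to meet a conventional paired-test resolution target. Frames paired evaluation as hypothesis testing, introduces the resolution ratio q = N/N* as the primary diagnostic, and shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut is off by roughly a factor of two in the close-comparison regime.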

Comments 16 pages, 7 figures, 12 tables. Accepted to the ICML 2026 Workshop on Hypothesis Testing, Seoul, South Korea, 2026. Copyright 2026 by the author(s)

Abstract

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern persists under multiplicity correction and anytime-valid sequential testing.
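
As a rough illustration of the resolution ratio q = N/N*, the sketch below computes a required number of paired items with the textbook normal approximation for a two-sided paired z-test on per-item score differences. This is a stand-in, not the paper's small-effect expansion. The names `p10` and `p01` (the rates at which only model A or only model B answers an item correctly) are hypothetical, and reading q < 1 as "unresolved" is an interpretation of the abstract.

```python
from statistics import NormalDist


def required_pairs(p10, p01, alpha=0.05, power=0.8):
    """Approximate N* for a two-sided paired z-test on per-item differences.

    Each item contributes +1 (only A correct), -1 (only B correct), or 0,
    so the mean difference is p10 - p01 and the variance is
    p10 + p01 - (p10 - p01) ** 2.
    """
    delta = p10 - p01
    if delta == 0:
        raise ValueError("no effect to detect: p10 equals p01")
    variance = p10 + p01 - delta ** 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2 * variance / delta ** 2


def resolution_ratio(n_items, p10, p01, alpha=0.05, power=0.8):
    """q = N / N*: available items divided by required items."""
    return n_items / required_pairs(p10, p01, alpha, power)


# Hypothetical close comparison: models disagree on 10% of 1000 items.
n_star = required_pairs(p10=0.06, p01=0.04)
q = resolution_ratio(1000, p10=0.06, p01=0.04)
```

With these made-up rates the benchmark would need roughly twice as many items as it has, so q comes out near 0.5.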

2605.30311 2026-05-29 cs.CV cs.AI

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

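AI summary: Proposes Archon, a fully pretrained, human-centric unified multimodal model that uses modality-specific tokenizers, semantic video reparameterization, and a "Thinking in Modality" strategy for holistic digital human generation across seven modalities, including text, audio, motion, and visual content.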

Comments Accepted to CVPR 2026. Project Page: https://zju3dv.github.io/archon/

Abstract

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers and uses a native autoregressive unified multimodal model, pretrained on synchronized modalities and 72 diverse tasks, to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose "Thinking in Modality", a strategy that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modalities, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.

2605.30310 2026-05-29 cs.CV cs.AI cs.GR

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

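AI summary: Proposes City-Mesh3R, a framework that uses a divide-and-conquer strategy to reconstruct watertight surface meshes end to end from large unordered image collections, addressing incomplete geometry, irregular surfaces, and computational complexity in city-scale reconstruction.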

Comments Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: https://citymesh3r.github.io/

Abstract

City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on representations such as NeRF and Gaussian Splatting often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by distributed dense 3D reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. The partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As our qualitative and quantitative results demonstrate, the method yields high-fidelity watertight 3D meshes with regular geometry that capture fine surface details, and it scales to arbitrarily large scenes owing to end-to-end processing in a distributed setting.

2605.30307 2026-05-29 cs.CV

Grounded 3D-Aware Spatial Vision-Language Modeling

An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu Yin, Sifei Liu

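AI summary: Proposes GR3D, a model that combines three complementary grounding capabilities (explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding) in a single framework to support spatial chain-of-thought reasoning, with consistent gains on grounded and non-grounded spatial benchmarks.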

Comments CVPR 2026 https://www.anjiecheng.me/gr3d

Abstract

We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.

2605.30295 2026-05-29 cs.CL cs.AI

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

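AI summary: Proposes a pipeline that generates clinically realistic HL7 FHIR R4 data from unstructured text and uses it to build the MedCase-Structured dataset. LLM diagnostic accuracy is lower on structured FHIR inputs than on plain text, which underlines the importance of deployment-aligned benchmarking.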

Comments Accepted to ICML 2026 Structured Data for Health Workshop

Abstract

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.

2605.21235 2026-05-29 cs.CL

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Redacted by arXiv

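AI summary: Proposes LamPO, which uses a pairwise decomposed advantage with confidence weighting to improve credit assignment and training stability in reinforcement learning with verifiable rewards for reasoning language models.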

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose LamPO, a Lambda-Style Policy Optimization method that replaces scalar group advantages with a Pairwise Decomposed Advantage. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
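
One way to picture a pairwise decomposed advantage is the sketch below. The abstract only says that pairwise reward gaps are aggregated and weighted by a confidence term derived from sequence log-probability differences. The sigmoid weight used here, borrowed from LambdaRank-style losses, is an assumption, as are the function names and the averaging over group peers.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_advantages(rewards, seq_logps):
    """Advantage of each response as a weighted mean of reward gaps to its peers.

    Assumed weighting (not from the paper): a comparison counts less when the
    policy already assigns higher log-probability to the higher-reward response.
    """
    n = len(rewards)
    if n < 2:
        return [0.0] * n
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            gap = rewards[i] - rewards[j]
            if i == j or gap == 0:
                continue
            sign = 1.0 if gap > 0 else -1.0
            # Positive margin: the policy already agrees with the reward ordering.
            margin = sign * (seq_logps[i] - seq_logps[j])
            total += sigmoid(-margin) * gap
        advantages.append(total / (n - 1))
    return advantages


# One correct response and two incorrect ones, equally likely under the policy.
flat = pairwise_advantages([1.0, 0.0, 0.0], [-10.0, -10.0, -10.0])
# The policy already strongly prefers the correct response.
confident = pairwise_advantages([1.0, 0.0], [-1.0, -5.0])
```

When the policy is indifferent, every comparison gets weight 0.5. When it already prefers the better response, the weight shrinks and so does the advantage.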

2605.19416 2026-05-29 cs.CL

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Redacted by arXiv

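AI summary: Addresses GRPO's loss of fine-grained preference information caused by using the group mean as a baseline. Proposes LambdaPO, which decomposes advantage estimation into a pairwise preference structure and adds a semantic density reward, extracting finer-grained optimization signals from group rollouts and improving reasoning performance.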

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship

Abstract

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this issue, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that resolves this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.

2603.17942 2026-05-29 cs.CL

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott

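AI summary: Proposes ESP, a training-free multi-token prediction method that uses mask tokens in embedding space, with parallel verification and lightweight pruning for lossless decoding and higher throughput.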

Comments v2: Accepted at ICML 2026. Updated experiments replaced tok/s with speedup ratio over AR baseline; improved exposition in Section 3.1 (mask token initialization) and Section 4 (ablations)

Abstract

Large Language Models (LLMs) possess latent multi-token prediction (MTP) abilities despite being trained only for next-token generation. We introduce ESP (Embedding-Space Probing), a simple and training-free MTP method that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel future-token prediction without modifying weights or relying on draft models. ESP constructs a speculative token tree by sampling Top-K candidates from mask-token logits and applies a lightweight pruning rule to retain high-probability continuations. During generation, predictions are verified in parallel, yielding lossless decoding while significantly reducing model calls and increasing token throughput. ESP consistently outperforms existing training-free baselines, improving acceptance length by 7-11% over LADE on LLaMA3 and 7-8% on Qwen3, and increasing throughput by up to 15-19% over the strongest baseline. Finally, we provide theoretical insight and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
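
The tree-building and verification steps in this abstract can be sketched generically. The abstract does not spell out ESP's pruning rule, so the joint-probability floor below is an assumed stand-in, per-step candidate probabilities are treated as independent for simplicity, and all names and numbers are hypothetical.

```python
def build_pruned_paths(candidates_per_step, min_path_prob):
    """Enumerate draft continuations whose joint probability stays above a floor.

    candidates_per_step[k] holds (token, probability) pairs for the k-th
    future position, e.g. Top-K entries read from mask-token logits.
    """
    frontier = [((), 1.0)]  # (token tuple, joint probability)
    kept = []
    for step_candidates in candidates_per_step:
        next_frontier = []
        for tokens, prob in frontier:
            for token, token_prob in step_candidates:
                joint = prob * token_prob
                if joint >= min_path_prob:  # assumed pruning rule
                    next_frontier.append((tokens + (token,), joint))
        kept.extend(next_frontier)
        frontier = next_frontier
    return kept


def accepted_length(draft_tokens, verified_tokens):
    """Count leading draft tokens that match what the base model itself emits."""
    count = 0
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        count += 1
    return count


# Two future positions with Top-2 candidates each (made-up probabilities).
top_k = [[("the", 0.6), ("a", 0.4)], [("cat", 0.5), ("dog", 0.5)]]
paths = build_pruned_paths(top_k, min_path_prob=0.25)
best = max(accepted_length(tokens, ["the", "dog", "sat"]) for tokens, _ in paths)
```

Accepting only tokens that the base model would have produced anyway is what keeps this kind of speculative decoding lossless. Here the surviving path ("the", "dog") is accepted in full.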

2602.11389 2026-05-29 cs.AI

Causal-JEPA: Learning World Models through Object-Level Latent Masking

Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero

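AI summary: Proposes C-JEPA, an object-centric world model that extends joint embedding prediction with object-level latent masking. It improves counterfactual reasoning in visual question answering by about 20% and, on agent control tasks, plans efficiently using only 1% of the latent input features.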

Comments Project Page: https://hazel-heejeong-nam.github.io/cjepa/ ICML 2026 Accepted

Abstract

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By masking object-level latents and requiring each masked object state to be inferred from the surrounding context, C-JEPA imposes structured partial observability during training, creating counterfactual-like prediction queries that discourage shortcut solutions and make interaction-dependent prediction necessary under the learning objective. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces useful inductive bias by controlling observability. Our code is available at https://github.com/galilai-group/cjepa.

2601.22139 2026-05-29 cs.CL cs.AI

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

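AI summary: Proposes the Proactive Interactive Reasoning (PIR) paradigm, which uses uncertainty-aware fine-tuning and user-simulator-based policy optimization so that LLMs ask clarifying questions during reasoning to resolve premise- and intent-level uncertainty. On mathematical reasoning, code generation, and document editing it improves accuracy, pass rate, and BLEU while cutting reasoning computation and unnecessary interaction by nearly half.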

Comments ACL Main Conference

Abstract

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

2601.07525 2026-05-29 cs.CL cs.AI

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam

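AI summary: Proposes In-Writing, a hybrid method that uses a trigger token to decouple free-form reasoning from structured decoding, with accuracy gains of up to 27% on classification and reasoning tasks.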

Comments v2-EMNLP

Abstract

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning capabilities by imposing constraints too early in the generation process. We propose a hybrid approach, namely In-Writing, that combines free-form reasoning and structured generation in a single call. The model first performs unconstrained reasoning and only applies structured decoding after a trigger token is generated, explicitly decoupling reasoning from formatting. We establish that our trigger-token strategies virtually eliminate premature triggering, a failure mode in which constrained decoding interrupts ongoing reasoning. Evaluations across diverse datasets covering classification and reasoning tasks demonstrate that our approach outperforms the state of the art, achieving accuracy gains of up to 27% over natural generation. Our code is available at: https://github.com/Nokia-Bell-Labs/InWriting.
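
The reason-then-constrain control flow can be shown with a toy greedy decoder. This is only a sketch of the general idea, not the In-Writing implementation: the scripted `toy_model`, the `<answer>` trigger string, and the single-token label set are all hypothetical.

```python
def decode_with_trigger(next_scores, trigger, allowed_after_trigger, max_tokens):
    """Greedy decoding that is unconstrained until the trigger token appears.

    next_scores(prefix) returns a dict mapping candidate tokens to scores.
    After the trigger, only tokens in allowed_after_trigger may be chosen.
    """
    output = []
    constrained = False
    for _ in range(max_tokens):
        scores = next_scores(output)
        if constrained:
            scores = {t: s for t, s in scores.items() if t in allowed_after_trigger}
            if not scores:
                break
        token = max(scores, key=scores.get)
        output.append(token)
        if constrained:
            break  # this sketch expects a single-token structured answer
        if token == trigger:
            constrained = True
    return output


def toy_model(prefix):
    """Scripted stand-in for a language model (hypothetical scores)."""
    script = [
        {"think": 1.0},
        {"<answer>": 0.9, "think": 0.1},
        {"maybe": 0.6, "yes": 0.3, "no": 0.1},
    ]
    return script[min(len(prefix), len(script) - 1)]


result = decode_with_trigger(toy_model, "<answer>", {"yes", "no"}, max_tokens=10)
```

Without the constraint the toy model would emit "maybe" after the trigger. With it, the highest-scoring allowed label "yes" is chosen, and the free-form reasoning token before the trigger is left untouched.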

2511.14426 2026-05-29 cs.LG cond-mat.mtrl-sci cs.AI physics.comp-ph

MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Mikhail Lazarev, Andrey E Ustyuzhanin, Dmitry Vetrov

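AI summary: Proposes mirage infusion, a technique that lets diffusion models change the number of atoms during generation. It substantially improves crystal generation quality and reaches an 8.2% S.U.N. rate on the MP-20 dataset.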

Abstract

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models cannot change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to 2.5x compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an 8.2% S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. Code: https://github.com/andrey-okhotin/miad.git

2605.30289 2026-05-29 cs.LG stat.AP stat.ML

Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

M. Ross Kunz, John Merickel, Keith Wilson

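AI summary: Proposes a method that characterizes and compares numeric tabular datasets using structured exploratory data analysis descriptors, sentence-transformer embeddings, and Canonical Correlation Analysis (CCA), enabling cross-dataset similarity retrieval and interpretable variable-level alignment, with optional differential privacy.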

Abstract

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). Furthermore, a penalized formulation of CCA is applied to recover sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy is optionally applied to the descriptor set prior to embedding, supporting deployment in sensitive data contexts without requiring access to raw observations at time of comparison. The methodology is evaluated across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results demonstrate a total P@1 score of 0.9, with known nearest-neighbor retrieval and cluster structure remaining robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.

2605.30284 2026-05-29 cs.AI

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

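AI summary: Proposes the ProjectionBench framework, which uses progressive information disclosure to evaluate the innovativeness and reasoning ability of large language models in scientific discovery. Experiments show that GPT-5.4 maintains 0.7 F1 alignment with ground-truth conclusions even under minimal context.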

Comments 19 pages, 4 figures

Abstract

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, the innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question; these are compared with the conclusions from the original paper and evaluated via automated semantic similarity of constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment ranging from a model's innovativeness (under minimal information) to its grounded reasoning capabilities (under full experimental details), both critical for using LLMs for scientific discovery. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their previous-generation counterparts as expected, and GPT-5.4 in particular maintains 0.7 F1 score alignment with ground-truth conclusions even under minimal context.
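
An F1 score over atomic claims can be computed from a claim-to-claim similarity matrix, as in the sketch below. The abstract does not describe the matching procedure, so the thresholded best-match rule, the threshold value, and the similarity numbers here are assumptions for illustration only.

```python
def claim_f1(similarity, threshold=0.5):
    """F1 between generated and reference atomic claims.

    similarity[i][j] is a semantic similarity score between generated claim i
    and reference claim j. A generated claim counts toward precision if it
    matches some reference claim; a reference claim counts toward recall if
    some generated claim matches it (assumed matching rule).
    """
    if not similarity or not similarity[0]:
        return 0.0
    n_generated = len(similarity)
    n_reference = len(similarity[0])
    matched_generated = sum(1 for row in similarity if max(row) >= threshold)
    matched_reference = sum(
        1
        for j in range(n_reference)
        if max(similarity[i][j] for i in range(n_generated)) >= threshold
    )
    precision = matched_generated / n_generated
    recall = matched_reference / n_reference
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two generated claims scored against three reference claims (made-up numbers).
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.1],
]
f1 = claim_f1(scores)
```

In this example one of two generated claims matches (precision 1/2) and one of three reference claims is covered (recall 1/3), giving an F1 of 0.4.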

2605.30283 2026-05-29 cs.AI cs.ET

mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol

mcp-proto-okn:通过模型上下文协议实现对开放科学知识图谱的自然语言访问

Peter W. Rose, Benjamin M. Good, Amanda M. Saravia-Butler, Charlotte A. Nelson, James P. Balhoff, Yaphet Kebede, Patricia L. Whetzel, Christopher Bizon, Andrew I. Su, Sergio E. Baranzini

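AI summary Presents mcp-proto-okn, a server built on the Model Context Protocol that lets AI assistants discover, query, and integrate scientific knowledge graphs through natural language, lowering the barrier to cross-domain knowledge graph analysis.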

Comments 9 pages, 1 figure

AI中文摘要

MCP Server Proto-OKN (mcp-proto-okn) 是一个基于Python的模型上下文协议服务器,使AI助手能够通过自然语言发现、检查、查询和集成科学知识图谱。该服务器提供图路由、模式检查、SPARQL执行、本体扩展、多图查询和转录生成功能,降低了生物医学和科学用户进行跨领域知识图谱分析的门槛。mcp-proto-okn使用FastMCP框架在Python中实现,可在https://github.com/sbl-sdsc/mcp-proto-okn获取。GitHub仓库提供了文档、客户端配置说明和示例分析转录。

英文摘要

MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query and integrate scientific knowledge graphs through natural language. The server provides graph routing, schema inspection, SPARQL execution, ontology expansion, multi-graph querying, and transcript generation, lowering the barrier to cross-domain knowledge graph analysis for biomedical and scientific users. mcp-proto-okn is implemented in Python using the FastMCP framework and is available at https://github.com/sbl-sdsc/mcp-proto-okn. Documentation, client configuration instructions, and example analysis transcripts are provided in the GitHub repository.

2605.30282 2026-05-29 cs.RO

Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation

Gaze2Act: 基于注视条件的视觉-语言-动作策略用于交互式机器人操作

Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, Geng Li, Jianfei Yang

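AI summary Proposes the Gaze2Act framework, which treats human gaze as a dynamic intent signal and combines cross-view semantic matching with policy-level conditioning so that robots can carry out complex interactive tasks precisely.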

Comments Project page: https://zuo-kuangji.github.io/Gaze2Act/

AI中文摘要

视觉-语言-动作(VLA)模型近期在遵循语言指令的机器人学习方面展现出强大潜力。然而,在实践中,仅靠语言往往难以精确传达人类意图。很难描述在相似候选对象中具体要交互哪个对象、在对象上的何处操作,或目标在执行过程中如何变化。为解决这一局限,我们提出Gaze2Act,一种新颖的VLA框架,利用人类注视作为复杂交互操作中动态且直观的意图信号。Gaze2Act首先通过跨视角语义匹配将第一人称注视映射到机器人视角,弥合自我-外部视角差距,生成对象掩码和注视点,用于从粗到细的目标指定。然后,这些线索通过感知级提示和动作级条件化整合到策略中,使机器人能够关注相关区域并在动态意图下执行精确交互。在对Unitree G1人形机器人进行的七个任务类别和16个真实机器人任务的系统评估中,Gaze2Act在意图准确性和任务成功率方面均达到最先进水平。它在对象消歧、细粒度交互和动态意图引导方面显著优于基线方法。这些结果表明,人类注视为人在环VLA控制提供了一种自然、低负担且高表达性的模态。

英文摘要

Vision-Language-Action (VLA) models have recently shown strong potential for robot learning by following language instructions. However, in practice, language alone is often insufficient to precisely convey human intent. It is difficult to describe which exact object to interact with among similar candidates, where to act on the object, or how the target may change during execution. To address this limitation, we propose Gaze2Act, a novel VLA framework that leverages human gaze as a dynamic and intuitive intent signal for complex interactive manipulation. Gaze2Act first bridges the ego-exo view gap by mapping first-person gaze into the robot's perspective through cross-view semantic matching, producing both an object mask and a gaze point for coarse-to-fine target specification. These cues are then integrated into the policy through perception-level prompting and action-level conditioning, allowing the robot to attend to relevant regions and execute precise interactions under dynamic intent. In a systematic evaluation across seven task categories and 16 real-robot tasks on a Unitree G1 humanoid, Gaze2Act achieves state-of-the-art performance in both intent accuracy and task success rate. It notably outperforms baselines in object disambiguation, fine-grained interaction, and dynamic intent steering. These results demonstrate that human gaze provides a natural, low-burden, and highly expressive modality for human-in-the-loop VLA control.

2605.30277 2026-05-29 cs.LG physics.flu-dyn

Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor

基于神经算子的CFD代理模型:小型模块化反应堆中的螺旋管蒸汽发生器

Minseo Lee, Seongmin Oh, Chaehyeon Song, Bumjin Cho, Shilaj Baral, Sangam Khanal, Minseop Song, Joongoo Jeon

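AI summary To address the computational bottleneck of real-time CFD simulation in digital twins of small modular reactors, proposes a surrogate modeling framework that combines reduced-order models with neural operators (multi-scale L-DeepONet and FNO), achieving efficient prediction of instantaneous vortex dynamics and time-averaged flow fields in a helical coil steam generator.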

AI中文摘要

实时热工水力仿真对于支持小型模块化反应堆(SMR)安全高效运行的数字孪生(DT)技术至关重要。计算流体动力学(CFD)提供了高保真流动分析,但其计算成本阻碍了在DT中的直接应用。基于AI的代理建模已被积极研究以解决这一限制,但针对SMR特定几何结构的CFD级瞬态分析的神经算子代理尚未见报道。本研究提出了一个集成框架,结合了降阶模型(ROM)与神经算子,应用于系统集成模块化先进反应堆(SMART)的螺旋管蒸汽发生器(HCSG)。比较了针对每种CFD数据类型的两种ROM策略:用于非结构化网格数据的基于MLP的自编码器(AE)和用于结构化网格数据的卷积自编码器(CAE),并将每种策略与深度算子网络(DeepONet)耦合以构建潜在DeepONet(L-DeepONet)。此外,还采用了傅里叶神经算子(FNO)进行比较。两种框架中都引入了多尺度技术以减轻频谱偏差并改进对HCSG内部发展的卡门涡街的预测。多尺度L-DeepONet捕捉了速度和压力场中的瞬时周期性涡旋动力学,而FNO及其多尺度变体预测了时均平均流并提供了可靠的压降估计。这些互补特性提供了实用的模型选择指南,根据CFD数据类型和所需的流动分辨率水平将每种架构与特定的DT目标联系起来。

英文摘要

Real-time thermal-hydraulic simulation is essential for digital twin (DT) technology that supports the safe and efficient operation of small modular reactors (SMRs). Computational fluid dynamics (CFD) provides high-fidelity flow analysis, but its computational cost prevents direct use in DT applications. AI-based surrogate modeling has been actively investigated to address this limitation, yet neural operator-based surrogates for CFD-level transient analysis of SMR-specific geometries have not been reported. This study presents an integrated framework that combines a reduced-order model (ROM) with neural operators, applied to the helical coil steam generator (HCSG) of the System-integrated Modular Advanced Reactor (SMART). Two ROM strategies, each tailored to a CFD data type, were compared: an MLP-based autoencoder (AE) for unstructured mesh data and a convolutional autoencoder (CAE) for structured mesh data. Each was coupled with the deep operator network (DeepONet) to construct the latent DeepONet (L-DeepONet). The Fourier neural operator (FNO) was additionally adopted for comparison. A multi-scale technique was incorporated into both frameworks to mitigate spectral bias and improve the prediction of Kármán vortex streets developing inside the HCSG. The multi-scale L-DeepONet captured the instantaneous periodic vortex dynamics in both velocity and pressure fields, while the FNO and its multi-scale variant predicted the time-averaged mean flow and provided reliable pressure drop estimates. These complementary characteristics provide a practical model-selection guideline that links each architecture to specific DT objectives based on CFD data type and the required level of flow resolution.

2605.30275 2026-05-29 cs.LG q-bio.QM

Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

利用常规血液检测指标和临床病史对胰腺癌筛查人群进行数字富集

Chris Varghese, Leo Y. Li-Han, Richa Bisht, Ellen Larson, Frank Lee, Ryan M. Carr, Tanios S. Bekaii-Saab, Shounak Majumder, John D. Halamka, Mark Truty, Ajit H. Goenka, Hojjat Salehinejad, Cornelius A. Thiels

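AI summary Proposes a Transformer-based neural network with multi-head attention that predicts pancreatic cancer risk from longitudinal sequences of diagnosis codes and blood tests, stratifying risk 1 to 3 years ahead and laying the groundwork for population-level digitally enriched screening.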

AI中文摘要

早期检测胰腺癌是扩大治愈性治疗可及性和减少癌症死亡的关键;然而,目前筛查并不可行。病理的潜在指标体现在个体的疾病和血液检测轨迹中,可能预测胰腺癌的发展。利用患者在临床互动过程中积累的纵向诊断编码和血液检测值序列,训练了一个基于Transformer的定制神经网络,采用多头注意力机制,以提前多年预测胰腺癌风险,并对人群进行风险分层以进行靶向筛查。该队列包括6,017名胰腺癌成人患者和177,081名对照(总体中位年龄75岁,45%女性),在胰腺癌诊断前拥有中位12年(四分位距6.9-16.2)的病史。通过留一站点法进行外部验证,在诊断前1年、2年和3年预测胰腺癌,受试者工作特征曲线下面积均值分别为0.837(95%置信区间0.827-0.848)、0.797(95%置信区间0.782-0.813)和0.760(95%置信区间0.745-0.776)。估计的胰腺癌风险校准良好(校准图斜率1.08,截距-0.077;Brier评分0.025),贝叶斯人群胰腺癌患病率更新使得估计的癌症风险输出可跨环境迁移。在测试中,1年内胰腺癌风险>3.3%的筛查阈值提供了18.2的诊断优势比。因此,我们的工作为第一个人群级数字富集工具奠定了基础,以扩大胰腺癌治愈性管理的可及性。

英文摘要

Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual's disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict risk of pancreatic cancer with a multi-year lead time and risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75, 45% female) with a median of 12 years (interquartile range 6.9-16.2) of medical history prior to pancreatic cancer diagnosis. External validation via leave-one-site-out, out-of-sample testing predicting pancreatic cancer 1, 2, and 3 years prior to diagnosis demonstrated mean areas under the receiver operating characteristic curve of 0.837 (95% confidence interval 0.827-0.848), 0.797 (95% confidence interval 0.782-0.813), and 0.760 (95% confidence interval 0.745-0.776), respectively. Estimated pancreatic cancer risks were well calibrated (calibration plot slope 1.08, intercept -0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer within 1 year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.
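Two quantities in this abstract have simple closed forms that a short sketch can make concrete. The abstract does not give the exact form of its Bayesian prevalence update, so `recalibrate_risk` below uses the textbook prior-shift correction on the odds scale as an assumption; `diagnostic_odds_ratio` follows the standard definition from a 2x2 confusion table. Both function names are illustrative.

```python
def recalibrate_risk(risk: float, source_prevalence: float, target_prevalence: float) -> float:
    """Shift a predicted risk from the training prevalence to a new setting's prevalence.

    Assumes only the base rate differs between settings, so the update is a
    multiplication of the odds by the ratio of prior odds.
    """
    def odds(p: float) -> float:
        return p / (1.0 - p)

    adjusted = odds(risk) * odds(target_prevalence) / odds(source_prevalence)
    return adjusted / (1.0 + adjusted)


def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Odds of a positive test in cases divided by odds of a positive test in controls."""
    return (tp * tn) / (fp * fn)
```

Under this assumption, a model trained on an enriched cohort yields lower absolute risks once the target population's prevalence is lower, while the ranking of patients is unchanged.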

2605.30274 2026-05-29 cs.CL cs.AI

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong: 一种类人长文档翻译代理,具有观察与行动的适应性上下文选择

Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

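AI summary Proposes the Loong agent, which uses a 3E memory module and a context policy optimized with reinforcement learning to address context window limits and redundant information in long document translation, with average gains of up to 13.0 points on English ⇄ Chinese, German, and French translation.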

AI中文摘要

文档级翻译仍然是大型语言模型最具挑战性的任务之一,它们受到有限上下文窗口的限制,阻碍了全局连贯性,同时遭受冗余上下文信息的影响,降低了翻译质量。为了解决这个问题,我们提出了一种名为Loong的类人长文档翻译代理,它利用3E记忆模块(精华-示例-实体)存储摘要、句子对和实体记录作为历史上下文。Loong不是被动地关注所有历史,而是进行深度推理,自适应地识别翻译指导的最佳上下文。Loong通过强化学习优化其上下文策略,利用从其自身采样的观察与行动推理轨迹中得出的偏好数据。实证评估表明,Loong在英语⇄中文、德语和法语方向上实现了显著的翻译质量提升,在三个评估指标上平均提升高达13.0分。此外,Loong在跨领域和对抗上下文噪声方面表现出强大的泛化能力和鲁棒性,同时在超长文档翻译中保持显著的稳定性。我们的代码发布在https://github.com/YutongWang1216/LoongDocMT。

英文摘要

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.
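As a rough illustration of the 3E memory described above, the following sketch stores summaries, sentence pairs, and entity records and picks a small context for the next sentence. Everything here is hypothetical: the class name, field names, and the word-overlap selection rule are assumptions made for illustration, whereas the abstract describes a context policy learned with reinforcement learning.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeEMemory:
    """Hypothetical container for Essence, Exemplar, and Entity records."""
    essence: list[str] = field(default_factory=list)                # running summaries
    exemplars: list[tuple[str, str]] = field(default_factory=list)  # (source, translation) pairs
    entities: dict[str, str] = field(default_factory=dict)          # source entity -> translation

    def select_context(self, sentence: str, max_exemplars: int = 2) -> dict:
        """Pick context by word overlap (a stand-in for the learned selection policy)."""
        words = set(sentence.lower().split())

        def overlap(pair: tuple[str, str]) -> int:
            return len(words & set(pair[0].lower().split()))

        ranked = sorted(self.exemplars, key=overlap, reverse=True)
        chosen = [pair for pair in ranked[:max_exemplars] if overlap(pair) > 0]
        known = {src: tgt for src, tgt in self.entities.items() if src.lower() in sentence.lower()}
        return {
            "summary": self.essence[-1] if self.essence else "",
            "exemplars": chosen,
            "entities": known,
        }
```

The point of the sketch is the contrast the abstract draws: only the selected subset of history is handed to the translator, rather than the full record.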

2605.30269 2026-05-29 cs.CV eess.IV

Boosting Image Quality Assessment Performance: Unsupervised Score Fusion by Deep Maximum a Posteriori Estimation

提升图像质量评估性能:基于深度最大后验估计的无监督分数融合

Zhongling Wang, Raymond Zhou, Shahrukh Athar, Wenbo Yang, Zhou Wang

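AI summary Proposes an unsupervised score fusion framework for image quality assessment based on deep maximum a posteriori estimation, using fine-grained uncertainty estimation to improve the accuracy and reduce the uncertainty of fused predictions.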

Comments 2024 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

AI中文摘要

在过去的几十年中,出现了许多图像质量评估(IQA)模型,旨在预测图像的感知质量。然而,单个模型往往偏向于某些类型的图像内容或失真,具体取决于设计原则和过程。一个直观的想法是通过将多个模型的分数融合成一个更强的模型,来利用每个IQA模型的优势并减轻其弱点。在此,我们首次尝试为这一想法寻求最优解,并提出一个基于深度最大后验(MAP)估计的无监督IQA分数融合通用框架。所提出的模型在分数级别进行细粒度不确定性估计,以提高准确性并降低融合预测中的不确定性。综合实验表明,所提出的模型优于单个IQA模型和其他融合方法。它还在融合过程中展现出拒绝“坏”模型的有趣能力。

英文摘要

Over the past decades, numerous Image Quality Assessment (IQA) models have emerged, aiming to predict the perceptual quality of images. However, individual models are often biased toward certain types of image content or distortions, depending on the design principle and process. An intuitive idea is to harness the strengths and mitigate the weaknesses of each IQA model, by fusing the scores of multiple models into a stronger one. Here we make one of the first attempts to seek an optimal solution for the idea and propose a general framework for unsupervised IQA score fusion using deep Maximum a Posteriori (MAP) estimation. The proposed model conducts fine-grained uncertainty estimation at the score level to increase the accuracy and reduce the uncertainty in fused predictions. Comprehensive experiments demonstrate the superiority of the proposed model over individual IQA models and other fusion methods. It also exhibits an interesting capability of rejecting "bad" models in the fusion process.
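The intuition behind uncertainty-aware score fusion can be shown with the classical Gaussian case, which is not the paper's deep MAP model. Assuming each IQA model's score is the true quality plus independent Gaussian noise with a known variance, and a flat prior on quality, the MAP estimate is the precision-weighted mean. The function name and the example variances are illustrative assumptions.

```python
def fuse_scores(scores: list[float], variances: list[float]) -> tuple[float, float]:
    """Precision-weighted fusion of per-model quality scores.

    Returns the fused score and its posterior variance under the
    independent-Gaussian, flat-prior assumption.
    """
    weights = [1.0 / v for v in variances]  # precision of each model's score
    total = sum(weights)
    fused = sum(w * s for w, s in zip(weights, scores)) / total
    return fused, 1.0 / total
```

A model assigned a very large variance contributes almost nothing to the fused score, which mirrors the "rejecting bad models" behavior the abstract mentions; the fused variance is never larger than the smallest input variance.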

2605.30268 2026-05-29 cs.CV cs.AI

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

PhyGenHOI:物理感知的动态人-物交互4D生成

Omer Benishu, Gal Fiebelman, Sagie Benaim

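AI summary Proposes the PhyGenHOI framework, which combines a motion diffusion model with the Material Point Method and uses a windowed attraction loss, contact-driven re-simulation, and a masked video-SDS objective to generate physically consistent and visually realistic dynamic 4D human-object interaction scenes.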

AI中文摘要

我们解决了生成物理准确且视觉逼真的4D人-物交互(HOI)的任务。给定一个静态3D人体和以3D高斯泼溅(3DGS)表示的目标物体,我们的目标是合成动态场景,其中人体根据给定的输入文本主动与物体交互,例如拳击或踢腿。为此,我们引入了PhyGenHOI,一种新颖的框架,将生成式人体运动与显式物理物体模拟相结合。我们将人体建模为由运动扩散模型(MDM)驱动的语义智能体,将物体建模为通过物质点方法(MPM)模拟的物理智能体,并利用3D高斯作为统一的、可微分的表示。我们通过三种耦合机制监督它们的交互:(1)窗口吸引损失,时间上同步生成运动以拦截物体;(2)接触驱动重模拟步骤,在碰撞时触发物理一致动量传递;(3)掩码视频SDS目标,注入基于视频的先验以增强接触保真度。实验表明,PhyGenHOI在多种动作、人体和物体上生成物理一致的4D HOI,优于基线方法。项目页面和视频:https://omerbenishu.github.io/PhyGenHOI/

英文摘要

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/

2605.30265 2026-05-29 cs.CV cs.CL

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

LoMo: 局部模态替换以实现更深的视觉-语言融合

Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang

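AI summary Targeting the "carrier sensitivity" problem, in which vision-language models degrade under modality substitution, proposes Local Modality Substitution (LoMo), a data curation paradigm that dynamically renders text spans as images to train cross-modal representational invariance, significantly improving multimodal reasoning and fusion.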

AI中文摘要

视觉-语言模型(VLM)在广泛的理解和推理任务中取得了显著进展,这得益于旨在多模态融合的大规模图像-文本训练。理想情况下,将文本问题替换为其渲染图像对应物应基本不影响模型性能。然而,在实践中,这种模态替换会导致性能急剧下降。我们将这种“载体敏感性”问题归因于当前训练语料中固有的偏差。在图像描述、VQA、OCR和网络来源的交错数据等流行数据集中,文本和图像通常被组织成不同且不对称的角色,文本作为语言查询,图像作为视觉参考。这种数据偏差导致VLM在不同模态的信息获取上表现出不同的偏好。因此,VLM无法对齐语义等价内容在文本和视觉载体上的表示,使得模型推理在模态替换下变得脆弱。为了解决这个问题,我们提出了局部模态替换(LoMo),一种轻量级、架构无关的数据策展范式,旨在为语义等价的文本和图像载体之间的跨模态表示不变性提供监督。LoMo通过将单模态提示重新表述为无缝交错的跨模态序列来实现这一点。它动态选择目标文本跨度并将其重新表述为渲染图像,从而在“文本、视觉、文本”载体上保持相同的语义。在13个不同的多模态基准上的大量实验表明,LoMo显著改善了整体多模态推理,并实现了更深的跨模态融合。具体来说,它在基础模型上带来了一致的提升,在LLaVA-OneVision-1.5-8B上比标准SFT提高了2.67个百分点,在Qwen3.5-9B上提高了2.82个百分点。

英文摘要

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
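The span substitution step can be sketched as a small data transformation. This is a sketch under stated assumptions: the function name, the fixed span length, and the uniform random choice of span are hypothetical, and the "image" segment only carries the text that a real pipeline would rasterize into pixels.

```python
import random


def local_modality_substitution(prompt: str, span_words: int = 3, seed: int = 0) -> list[dict]:
    """Split a text prompt into text / image / text segments around one chosen span."""
    words = prompt.split()
    if len(words) <= span_words:
        # The prompt is too short to interleave, so the whole prompt becomes the image.
        return [{"type": "image", "content": " ".join(words)}]
    rng = random.Random(seed)
    start = rng.randrange(0, len(words) - span_words + 1)
    end = start + span_words
    segments = []
    if start > 0:
        segments.append({"type": "text", "content": " ".join(words[:start])})
    # Placeholder for the rendered image: it holds the text to be drawn.
    segments.append({"type": "image", "content": " ".join(words[start:end])})
    if end < len(words):
        segments.append({"type": "text", "content": " ".join(words[end:])})
    return segments
```

Concatenating the segment contents in order recovers the original prompt, which is the property the abstract relies on: the semantics stay the same while the carrier changes.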

2605.30263 2026-05-29 cs.CV

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

minWM: 用于实时交互式视频世界模型的全栈开源框架

Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

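AI summary Presents minWM, a full-stack open-source framework that uses the Causal Forcing / Causal Forcing++ pipeline to convert bidirectional video diffusion models into controllable, low-latency autoregressive world models, with support for camera control and multiple backbone architectures.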

AI中文摘要

最近的视频扩散基础模型在高品质视频生成方面取得了显著进展,但将其转化为实时交互式视频世界模型仍然具有挑战性。交互式世界模型需要可控、因果和低延迟的展开,这在实际中需要涵盖数据构建、可控微调、自回归训练、少步蒸馏和流式推理的完整流水线。在这项工作中,我们提出了minWM,一个用于构建实时交互式视频世界模型的全栈开源框架。minWM提供了一个端到端流水线,将现有的双向T2V/TI2V视频基础模型转化为相机可控的少步自回归世界模型。具体来说,minWM首先微调一个带有相机控制的双向视频扩散模型,然后应用因果强制/因果强制++流水线,包括AR扩散训练、因果ODE或因果一致性蒸馏以及非对称DMD,将其蒸馏为少步自回归生成器以实现低延迟展开。该框架是模块化和架构可扩展的:我们在代表性开源骨干上实例化它,包括Wan2.1-T2V-1.3B和HY1.5-TI2V-8B,覆盖了基于交叉注意力的条件注入和MMDiT风格架构。minWM还支持将现有的视频世界模型(如HY-WorldPlay)适应到新的数据分布、训练配方和延迟目标。除了发布可运行脚本、检查点、文档和推理代码外,我们还提供了关于相机轨迹质量、可控性训练步骤和最小批量大小要求的实际消融实验。我们希望minWM能够作为构建和适应实时交互式视频世界模型的可复现和可扩展的配方。

英文摘要

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: https://github.com/shengshu-ai/minWM

2605.30260 2026-05-29 cs.CL cs.AI cs.CV cs.LG

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

LoRA如何记忆?大语言模型微调的参数记忆定律

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

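AI summary Proposes the Parametric Memory Law, which reveals a power-law relationship linking loss reduction to parameter count and sequence length in LoRA fine-tuning, and builds on it to design MemFT, an optimization strategy that improves memory fidelity and efficiency.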

Comments Ongoing work

AI中文摘要

大型语言模型(LLM)必须持续学习和更新知识,以在动态的真实世界环境中保持有效。虽然低秩适应(LoRA)被广泛用于此类记忆更新,但现有研究主要依赖于定性的下游评估,使得精确参数记忆的定量容量限制和潜在动态在很大程度上未被探索。为了弥合这一差距,我们在潜在空间中使用LoRA作为受控记忆容量探针,以系统量化精确参数记忆。我们引入了参数记忆定律,这是一个将损失降低ΔL与有效参数和序列长度联系起来的稳健幂律。在令牌级别,细粒度分析揭示了确定性相变,表明在贪婪解码下,预测概率p > 0.5构成逐字回忆的充分条件。基于这些见解,我们引入了MemFT,一种阈值引导的优化策略,该策略动态地将训练预算重新分配给低于阈值的令牌。实证评估表明,MemFT可以提高记忆保真度和效率。代码将在https://github.com/zjunlp/ParametricMemoryLaw发布。

英文摘要

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction ΔL to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
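The p > 0.5 condition follows from a short argument: if the target token has probability above 0.5, all other tokens together share less than 0.5, so none of them can exceed the target and greedy decoding must select it. The sketch below checks this on hand-written distributions; the function names are illustrative and not taken from the paper's code.

```python
def greedy_token(probs: list[float]) -> int:
    """Index of the most probable token (the first index wins ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


def recall_guaranteed(step_probs: list[list[float]], target_ids: list[int], threshold: float = 0.5) -> bool:
    """True if every target token clears the threshold, which is sufficient for verbatim recall."""
    return all(probs[t] > threshold for probs, t in zip(step_probs, target_ids))


def greedy_decode(step_probs: list[list[float]]) -> list[int]:
    """Greedy choice at each step, given precomputed per-step distributions."""
    return [greedy_token(probs) for probs in step_probs]
```

The condition is sufficient but not necessary: a target token with probability 0.4 is still recalled when every competitor sits below 0.4, which is why tokens under the threshold are the uncertain ones rather than guaranteed failures.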

2605.30257 2026-05-29 cs.CV

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Stable-Layers: 使用VLM评分强化学习微调图像层分解模型

Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss

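AI summary Proposes the Stable-Layers framework, which fine-tunes a pretrained layer decomposition model with reinforcement learning (Flow-GRPO) and vision-language model (VLM) scoring without paired supervision, addresses insufficient variance in the scoring signal, and improves layer separation quality and reconstruction accuracy.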

Comments 25 pages, 8 figures, 4 tables. Project page: https://stability-ai.github.io/stable-layers.github.io/

AI中文摘要

我们提出了Stable-Layers,一个强化学习框架,通过仅使用视觉语言模型(VLM)的反馈来微调预训练的层分解模型,从而消除了对配对监督的需求。从Qwen-Image-Layered开始,我们应用带有LoRA适应的Flow-GRPO,对每张图像采样多个候选分解,用VLM进行评分,并根据组相对优势优化策略。关键挑战在于设计可靠的奖励信号:单独对样本评分的VLM倾向于将其判断压缩到一个狭窄的范围内,使得GRPO几乎没有组内方差可供学习。我们通过一个两阶段评估流水线解决了这个问题,该流水线将基于五个编辑中心标准的结构化逐样本评分与基于网格的校准步骤配对,在该步骤中VLM并排重新评分所有候选。与基础模型相比,Stable-Layers在Crello数据集上产生了具有更强层分离、更少空白或伪影层以及更低逐层重建误差的分解结果。

英文摘要

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
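The reward-variance problem described above is easy to see in the group-relative advantage computation. The sketch below uses the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); whether Stable-Layers or Flow-GRPO uses exactly this form is an assumption, and the function name and `eps` value are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each candidate's reward against the other candidates for the same image."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation within the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When a VLM gives every candidate the same score, for example `[7, 7, 7, 7]`, all advantages are zero and the policy receives no gradient signal from that group. A side-by-side re-scoring step that spreads the scores, for example to `[3, 5, 7, 9]`, restores a usable ranking.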

2605.30256 2026-05-29 cs.CV cs.CL cs.HC

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

VideoFDB: 评估对话代理中的全双工视觉-语音能力

Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

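AI summary Presents VideoFDB, the first benchmark for full-duplex audio-visual-to-audio-visual (AV2AV) conversation, which systematically evaluates agents on nonverbal conversational dynamics using 237 real-world video clips, a taxonomy of perception and generation behaviors, and a rubric-based LM-as-judge framework, and finds failure modes in existing systems such as captioning collapse and visual-stream ignorance.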

Comments Project page: https://research.nvidia.com/labs/amri/projects/video-fdb/

AI中文摘要

自然的人类对话是全双工且视听融合的:人们同时说话和倾听,同时持续解读并产生非语言线索,如点头、微笑和手势。为了支持成功的人机交互,代理必须建模全双工视听对话;然而,现有的全双工基准仅评估语音。在这项工作中,我们提出了VideoFDB,这是首个评估全双工视听到视听(AV2AV)对话代理的基准。VideoFDB贡献了:(i) 237个来自真实世界视频通话的二元片段,涵盖11种非语言对话动态;(ii) 将感知行为与生成行为分离的分类法;(iii) 基于评分规则的LM评判评估框架,具有可解释的轴,用于评估关于非语言对话动态的对话质量。在开源和闭源的视觉-语音代理中,我们发现了系统性的失败模式:字幕崩溃和视觉流忽视,并且我们表明当前系统利用视觉进行显式视觉问答,但不用于自然对话中所需的流式联合视听基础。我们进一步评估了级联的语音到虚拟形象系统,发现其架构从根本上排除了全双工非语言线索的产生。作为全双工AV2AV交互的首个基准,VideoFDB为系统评估奠定了基础,我们希望这将加速下一代多模态对话代理的进步和发展。

英文摘要

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

2605.30251 2026-05-29 cs.CL cs.AI

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

相同证据,不同答案:面向多轮语言模型的规范上下文在线策略蒸馏

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

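AI summary Proposes Canonical-Context On-Policy Distillation (CCOPD), a teacher-student method that aligns a model's behavior under incrementally revealed information with its behavior under the full prompt, reducing self-anchored drift; trained on multi-turn math conversations, it yields a 32% average relative improvement on RAW-SHARDED tasks.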

AI中文摘要

大型语言模型(LLMs)通常在单次提示中给出所有指令时能解决任务,但当相同信息在多个轮次中逐步揭示时却会失败。当干净的完整提示和原始分片对话包含相同的完整用户证据时,模型仍应得出相同的答案。我们认为造成这一差距的关键原因是自我锚定漂移:在部分信息下产生的响应引入了未经支持的假设,而这些假设随后扭曲了最终答案。为了减少这种影响,我们提出了规范上下文在线策略蒸馏(CCOPD)。在训练过程中,同一基础模型扮演两个角色:一个冻结的教师模型,以干净的完整提示为条件;一个可训练的学生模型,通过多轮对话逐步接收相同的证据;CCOPD将学生在其自身轨迹上的行为与教师的规范全上下文行为对齐。仅在数学问题对话上训练后,CCOPD在数学和五个零样本跨领域任务族上的原始分片性能相比原始基础模型平均提升32%,同时基本保持全上下文性能。进一步分析表明,CCOPD增强了基于用户证据的推理,并减少了对早期助手轮次污染的敏感性。

英文摘要

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
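The alignment step can be pictured as a per-token divergence between two next-token distributions computed on the same student-generated tokens: one from the student, which saw the sharded conversation, and one from the frozen teacher, which saw the full prompt. The abstract does not state the divergence used, so the reverse KL below (student relative to teacher, a common choice in on-policy distillation) is an assumption, as are the function names.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_steps: list[list[float]], teacher_steps: list[list[float]]) -> float:
    """Mean per-token KL(student || teacher) along one student-sampled trajectory."""
    per_token = [kl_divergence(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)
```

The loss is zero only when the student under sharded context predicts exactly what the teacher predicts under full context, which is the "same evidence, same answer" behavior the abstract argues for.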

2605.30250 2026-05-29 cs.CV cs.GR

Ambient-robust Inverse Rendering using Active RGB-NIR Imaging

使用主动RGB-NIR成像的环境鲁棒逆渲染

Hoon-Gyu Chung, Jinnyeong Kim, Hyunwoo Kang, Seung-Hwan Baek

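AI summary Proposes a three-stage inverse rendering method based on active RGB-NIR imaging that combines multi-view RGB images under ambient light with active NIR flash images to reconstruct geometry and reflectance robustly under changing ambient illumination.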

Comments 11 pages

AI中文摘要

逆渲染旨在从图像中重建物体的几何和反射率。尽管近期取得了进展,现有方法通常会产生不准确的重建,且对环境光照条件敏感。本文介绍了一种由主动RGB-NIR成像实现的环境鲁棒逆渲染方法。我们的关键洞察是利用近红外(NIR)闪光照明(对人眼不可见)来获得稳定的点光源阴影,该阴影在很大程度上不受环境光照影响。通过使用环境光照下的多视角RGB图像和主动NIR闪光照明获取的NIR图像,我们利用RGB和NIR图像的互补优势,通过三阶段逆渲染方法重建精确的几何和反射率。为了实现密集多视角采集,我们开发了一个主动成像系统,配备RGB-NIR相机和安装在移动底座上的NIR闪光灯。利用该系统,我们收集了首个在多种环境光照条件下捕获的多视角RGB-NIR逆渲染数据集。实验表明,我们的方法优于先前方法,在多种环境光照场景下实现了准确的几何和反射率估计。

英文摘要

Inverse rendering aims to reconstruct the geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we introduce an ambient-robust inverse rendering method enabled by active RGB-NIR imaging. Our key insight is to leverage near-infrared (NIR) flash illumination, which is imperceptible to human observers, to obtain stable point-light shading that is largely invariant to ambient illumination. By using multi-view RGB images illuminated by ambient light and NIR images acquired with active NIR flash illumination, we reconstruct accurate geometry and reflectance by exploiting the complementary benefits of RGB and NIR images via a three-stage inverse rendering method. To enable dense multi-view acquisition, we develop an active imaging system equipped with an RGB-NIR camera and an NIR flash mounted on a mobile base. Using this system, we collect the first multi-view RGB-NIR inverse rendering dataset captured under multiple ambient illumination conditions. Experiments demonstrate that our method outperforms prior approaches, achieving accurate geometry and reflectance estimation across multiple ambient lighting scenarios.

2605.30247 2026-05-29 cs.LG cs.MM

OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction

OOD-GraphLLM:面向分布外泛化的药物协同预测图大语言模型

Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu

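AI summary To address the out-of-distribution shifts that novel compounds cause in drug synergy prediction, proposes the OOD-GraphLLM framework, which achieves accurate prediction by jointly optimizing molecular graph representations and biomedical semantic language representations.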

Comments 12 pages, 9 figures, ACM KDD 2026

AI中文摘要

药物协同预测(DSP)旨在识别不同细胞环境下针对不同靶点的有效药物组合。然而,新化合物的不断出现导致分子骨架和大小发生变化,使得药物协同数据在拓扑结构上呈现分布外(O.O.D.)偏移。现有工作依赖于分布内(I.D.)假设,无法处理O.O.D.偏移。为解决此问题,我们首次通过图大语言模型研究分布外泛化的药物协同预测。尽管如此,O.O.D.泛化的DSP极具挑战性,面临以下难题:i) 如何发现与细胞靶点相关的结构相关和无关的分子表示;ii) 如何找到精确计算分子表示的最优图神经架构;iii) 如何联合利用LLM中的分子结构和语义信息。为应对这些挑战,我们提出OOD-GraphLLM,一种新颖的图LLM框架,通过统一方式联合优化分子图表示和生物医学语义语言表示,能够在O.O.D.设置下准确预测药物协同。此外,我们微调了生物医学LLM DrugSyn-LLM,并采用检索增强的生物医学指令调优策略,将分子拓扑信息和分子语义信息与基于语言的推理对齐,用于O.O.D.泛化的DSP。源代码(https://github.com/EkkoXiao/Bio-GraphLLM)和发布模型(https://mn.cs.tsinghua.edu.cn/bio-graphllm/)均已公开,用户可下载模型资源并通过Web界面交互式使用系统。

英文摘要

Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets. However, the continual emergence of novel compounds results in variations in molecular scaffolds and sizes, causing drug synergy data to exhibit out-of-distribution (O.O.D.) shifts with respect to topological structure. Existing works rely on the in-distribution (I.D.) assumption and therefore fail to handle O.O.D. shifts. To solve this problem, we study out-of-distribution generalized drug synergy prediction through a graph large language model for the first time. Nevertheless, O.O.D. generalized DSP is highly non-trivial, posing several challenges: i) how to discover structurally relevant and irrelevant molecular representations with respect to cell targets; ii) how to find the optimal graph neural architectures that accurately calculate molecular representations; and iii) how to jointly leverage molecular structural and semantic information in LLMs. To address these challenges, we propose OOD-GraphLLM, a novel graph LLM framework that accurately predicts drug synergy under O.O.D. settings by jointly optimizing molecular graph representations and biomedical semantic language representations in a unified manner. Furthermore, we fine-tune DrugSyn-LLM, a biomedical LLM, and employ a retrieval-augmented biomedical instruction tuning strategy to align molecular topological information and molecular semantic information with language-based reasoning for O.O.D. generalized DSP. Both the source code (https://github.com/EkkoXiao/Bio-GraphLLM) and the released model (https://mn.cs.tsinghua.edu.cn/bio-graphllm/) are publicly available; users can download model resources and use the system interactively through a web interface.