arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26121 2026-06-01 cs.LG cs.AI

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

Yue Min, Ziyun Qiao, Ruining Chen, Yujun Li

Affiliations: The Hong Kong University of Science and Technology, Hong Kong SAR, China; Peking University, Beijing, China; University of Science and Technology of China, Hefei, China

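AI summary: Proposes GEM, a framework that recasts data curation as a variational problem on the hypersphere and optimizes it with an MM algorithm, addressing categorization flaws and embedding anisotropy; with 1.1B-parameter models it improves average downstream accuracy by up to 1.2%.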

Comments ICML 2026 Poster

Abstract

LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts the cluster collapse to discover balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state-of-the-art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.

2605.22050 2026-06-01 cs.CV

Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang

Affiliations: Fudan University; East China University of Science and Technology

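AI summary: Reports for the first time that memorization in diffusion models induces internal numerical instability that shows up as visually "broken" artifacts. Building on this, it defines empirical stability regions from latent update norms to quantify stable behavior and designs an on-the-fly framework for step-wise detection and adaptive mitigation that suppresses memorization without altering prompts or guidance, reaching AUC > 0.999 detection and a 0.0% memorization rate on Stable Diffusion 1.4.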

Comments KDD 2026, extended version

Abstract

While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we identify for the first time that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves an AUC > 0.999 detection performance and a 0.0% memorization rate after mitigation with negligible overhead (≈ 0.01 s per image).
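
The step-wise detection idea can be illustrated with a minimal sketch. It assumes the sampler exposes the latent at each denoising step as a flat list of floats and that a stability region is available as a (low, high) interval on each update norm. The names `update_norms` and `unstable_steps` and the bound values are hypothetical, chosen for illustration; this is not the paper's implementation.

```python
import math


def update_norms(latents):
    """Return the L2 norm of each step-to-step latent update."""
    return [
        math.sqrt(sum((cur_x - prev_x) ** 2 for prev_x, cur_x in zip(prev, cur)))
        for prev, cur in zip(latents, latents[1:])
    ]


def unstable_steps(latents, bounds):
    """Return indices of updates whose norm lies outside its (low, high) bound.

    `bounds` stands in for an empirically estimated stability region;
    the values used below are made up for illustration.
    """
    norms = update_norms(latents)
    return [
        step
        for step, (norm, (low, high)) in enumerate(zip(norms, bounds))
        if not low <= norm <= high
    ]


# Toy trajectory of 2-D "latents" over three sampler states (two updates).
trajectory = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.5]]
flagged = unstable_steps(trajectory, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(flagged)
```

In this toy run only the first update (norm 5.0) leaves its interval, so a mitigation step would be triggered there and nowhere else.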

2605.21168 2026-06-01 cs.AI

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Qiyu Ruan, Yuxuan Wang, He Li, Zhenning Li, Cheng-zhong Xu

Affiliations: State Key Laboratory of Internet of Things for Smart City (SKL-IOTSC), University of Macau, Macau, China; Faculty of Science and Technology, University of Macau, Macau, China

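AI summary: Proposes ScenePilot, which formulates scenario generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score with an online-learned AV-risk predictor and adding step-level feasibility-aware shielding, to generate critical scenarios that are physically solvable yet still cause the autonomous driving system to fail.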

Abstract

Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $σ$ with an online-learned AV-risk predictor $Φ$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.
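
For context, the RSS (Responsibility-Sensitive Safety) model defines a minimum safe longitudinal gap between a rear and a front vehicle. The sketch below computes that standard quantity and turns it into a simple margin. How ScenePilot actually derives its score σ from RSS is not stated in the abstract, so `feasibility_margin` and all parameter values are assumptions for illustration only.

```python
def rss_min_gap(v_rear, v_front, rho, a_accel_max, a_brake_min, a_brake_max):
    """Minimum safe longitudinal gap (m) under the RSS model.

    The rear car may accelerate at a_accel_max during the response time rho,
    then brakes at least at a_brake_min; the front car brakes at most at
    a_brake_max. Speeds are in m/s, accelerations in m/s^2 (all positive).
    """
    v_after_response = v_rear + rho * a_accel_max
    gap = (
        v_rear * rho
        + 0.5 * a_accel_max * rho ** 2
        + v_after_response ** 2 / (2.0 * a_brake_min)
        - v_front ** 2 / (2.0 * a_brake_max)
    )
    return max(gap, 0.0)


def feasibility_margin(actual_gap, **kwargs):
    """Hypothetical score: positive when the actual gap exceeds the RSS minimum."""
    return actual_gap - rss_min_gap(**kwargs)


# Illustrative values only.
params = dict(v_rear=10.0, v_front=10.0, rho=1.0,
              a_accel_max=2.0, a_brake_min=4.0, a_brake_max=8.0)
print(rss_min_gap(**params))               # 22.75
print(feasibility_margin(30.0, **params))  # 7.25
```

A scenario whose margin hovers near zero sits close to the physical-feasibility boundary, which is the region the abstract calls the boundary band.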

2605.30288 2026-06-01 cs.AI

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu, Bryan Dai

Affiliations: Beihang University; IQuest Research; Shanghai Jiao Tong University; University of British Columbia; Langboat Multilingual-Multimodal-NLP/mira

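AI summary: Addresses selection across heterogeneous data sources in mid-training with MIRA, a framework built on self-anchored rubric discovery and scalable student scorers; in code-oriented mid-training it matches full-corpus performance with only half the tokens.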

Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

2605.30215 2026-06-01 cs.CV

Déjà View: Looping Transformers for Multi-View 3D Reconstruction

Alessandro Burzio, Tobias Fischer, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic, Haithem Turki

Affiliations: NVIDIA; University of Modena and Reggio Emilia, AImageLab; University of Toronto, Vector Institute; ETH Zürich

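AI summary: Proposes DéjàView, which applies a single transformer block recurrently for iterative refinement and matches or outperforms large feed-forward models on multiple 3D reconstruction benchmarks with fewer parameters and comparable or lower compute.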

Comments Project Page: https://research.nvidia.com/labs/dvl/projects/dvlt

Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
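
The looping idea, one block with shared parameters applied for K refinement steps, can be sketched in a few lines. Here `toy_block` is a hypothetical stand-in for the transformer block (it simply moves features halfway toward a fixed target), so the example shows only the control flow and the role of K as an inference-time knob, not the model itself.

```python
def refine(features, block, k):
    """Apply the same block k times; its parameters are shared across steps."""
    for _ in range(k):
        features = block(features)
    return features


def toy_block(features, target=(1.0, -2.0), step=0.5):
    """Stand-in for a transformer block: move each feature halfway to a target."""
    return [x + step * (t - x) for x, t in zip(features, target)]


start = [9.0, 6.0]
for k in (1, 2, 4):
    # Larger k spends more compute and lands closer to the target.
    print(k, refine(start, toy_block, k))
```

Because the block is reused, the parameter count stays fixed while compute scales with K, which is the trade-off the abstract contrasts with stacking independent layers.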

2605.30060 2026-06-01 cs.CV

Towards Consistent Video Geometry Estimation

Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

Affiliations: Zhejiang University; Tongyi Lab, Alibaba Group; Shanghai Jiao Tong University; Fudan University

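AI summary: Proposes ViGeo, a feed-forward foundation model built on a plain transformer architecture that uses dynamic chunking attention and a completion-based data refinement framework to estimate spatially dense and temporally consistent geometry (depth, normals, point maps) from video, achieving state-of-the-art performance on online, offline, and long-video tasks.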

Comments Project webpage: https://pkqbajng.github.io/ViGeo/

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.

2605.30039 2026-06-01 cs.AI

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

Affiliations: vivo AI Lab; Ant Group; Zhejiang University

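AI summary: Proposes DOMINO, which learns a minimal sufficient domain representation through contrastive disentanglement and uses it to guide generation of domain-aligned synthetic data, improving fine-tuning performance when the domain is defined only implicitly.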

Comments Accepted by KDD 2026

Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

2605.30018 2026-06-01 cs.CL cs.LG

Latent Performance Profiling of Large Language Models

Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty, Partha Pratim Das, Lipika Dey, Richa Singh, Mayank Vatsa

Affiliations: Department of Electrical Engineering, Indian Institute of Technology Delhi; Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi; Hewlett Packard Enterprise, India; Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur; A. K. Choudhury School of Information Technology, University of Calcutta, India; Department of Computer Science & Engineering, Indian Institute of Technology Bombay; Department of Computer Science, Ashoka University, India; Department of Computer Science & Engineering, Indian Institute of Technology Jodhpur

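AI summary: Proposes Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostic metrics from hidden activations and output distributions to reveal intrinsic model traits and complement conventional benchmark evaluation.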

Abstract

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues such as data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU-Pro, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend reporting LPP alongside benchmarks to provide a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.

2605.29879 2026-06-01 cs.CV cs.RO

DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

Affiliations: School of Computer Science, Beijing Institute of Technology, China

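AI summary: Proposes DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system that couples a probabilistic voxel grid with explicit 3D Gaussians for robust cross-modal instance fusion and incremental semantic mapping, and builds a hierarchical scene graph and the 3D Gaussian Mind for multimodal reasoning, with leading zero-shot 3D visual grounding results and strong results in open-vocabulary semantic segmentation and scene reconstruction.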

Comments 9 pages, 6 figures

Abstract

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3D visual grounding (3DVG) performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind.

2605.29852 2026-06-01 cs.CV cs.LG cs.MM

Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring

Youhan Huang, Jiajun Li, Yilin Fang, Shuai Wang, Chuheng Li

Affiliations: Beijing University of Posts and Telecommunications; Beijing University of Chemical Technology; Capital Medical University

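AI summary: Proposes a subspace-decoupled multi-task Vision Transformer that uses lightweight task-specific adapters and orthogonality constraints to build independent feature subspaces, reducing task interference while retaining shared representations and mitigating multi-task negative transfer.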

Comments 6 pages, 5 figures, 2 tables. IEEE ICME 2026 (Oral). Camera-ready version

Abstract

Histological scoring is essential for diagnosing Non-Alcoholic Fatty Liver Disease (NAFLD), yet its automation remains challenging due to the high annotation cost and negative transfer among the strongly correlated NAFLD Activity Score (NAS) indicators in multi-task learning. To address this issue, we propose a subspace-decoupled multi-task Vision Transformer (ViT) that integrates lightweight task-specific Adapters with orthogonality-based constraints. This design constructs independent feature subspaces for steatosis, ballooning, and inflammation, effectively reducing task interference while retaining shared representations. We further construct a curated multi-task mouse NAFLD histology dataset with expert annotations for all NAS components. Experimental results demonstrate that the proposed method improves multi-task stability and generalization with substantially reduced computational cost compared to training separate single-task models. The code and the curated dataset have been prepared and will be made publicly available upon acceptance to support reproducibility.
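
One common way to express an orthogonality-based constraint between two task subspaces is to penalize the squared Frobenius norm of AᵀB, where the columns of A and B span the two subspaces. The abstract does not give the exact form of the constraint, so the sketch below is an assumption for illustration; `orthogonality_penalty` and the toy bases are hypothetical.

```python
def orthogonality_penalty(basis_a, basis_b):
    """Squared Frobenius norm of A^T B for two matrices given as lists of rows.

    Columns are treated as subspace directions; the penalty is zero exactly
    when every column of A is orthogonal to every column of B.
    """
    cols_a = list(zip(*basis_a))
    cols_b = list(zip(*basis_b))
    return sum(
        sum(x * y for x, y in zip(col_a, col_b)) ** 2
        for col_a in cols_a
        for col_b in cols_b
    )


# Toy one-dimensional "task subspaces" in a 3-D feature space.
steatosis = [[1.0], [0.0], [0.0]]
ballooning = [[0.0], [1.0], [0.0]]
overlapping = [[1.0], [1.0], [0.0]]
print(orthogonality_penalty(steatosis, ballooning))   # 0.0
print(orthogonality_penalty(steatosis, overlapping))  # 1.0
```

Adding such a term to the training loss would push the adapters for different scores toward non-overlapping directions while the shared backbone stays common to all tasks.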

2605.29833 2026-06-01 cs.AI

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang, Jue Wang, Ran Sun, Zhuo Yang, Wanli Ouyang, Lei Bai, Tianfan Fu, Lu Chen, Xin Chen, Yuqiang Li

Affiliations: University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory; Fudan University; Southeast University; Nanjing University; Suzhou Laboratory; Shanghai Jiao Tong University

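AI summary: Existing benchmarks overlook the reasoning process from materials knowledge to application. OmniMatBench provides 3,171 expert-curated QA and calculation problems across 19 subfields; among 13 multimodal LLMs evaluated, the best model scores only 0.372, revealing a substantial gap in materials-science reasoning.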

Comments 22 Pages

Abstract

As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to application underexplored. To fill this gap, we present OmniMatBench, a human-calibrated multimodal reasoning benchmark for materials science. OmniMatBench contains 3,171 expert-curated QA and calculation problems across 19 materials-science subfields, spanning fundamental materials knowledge, structural and engineering materials, materials processing and manufacturing, and functional and applied materials. We evaluate 13 open-source and closed-source MLLMs and find that the best model achieves only a 0.372 overall score, revealing a substantial gap in current materials-science reasoning. Further analysis shows strong variation across subfields, fixed reasoning heuristics, uneven materials knowledge, and limited high-level knowledge application under formula-, retrieval-, and code-assisted settings. OmniMatBench provides crucial insights into the capabilities and limitations of current MLLMs and establishes a foundation for reliable AI assistants in materials-science research.

2605.29751 2026-06-01 cs.CL

DySem: Uncovering Dynamic Semantic Components of Large Language Models for Calculating Semantic Textual Similarity

Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang

Affiliations: College of Computer Science and Software Engineering

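AI summary: Proposes DySem, a training-free framework that uses multilingual consensus to identify the internal components of LLMs most related to semantics and builds a text-dependent joint semantic set to compute similarity over dynamic dimensions, outperforming baselines.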

Comments 18 pages, 23 figures, 5 tables

Abstract

Calculating semantic textual similarity is a foundational task in natural language processing. Current methods based on large language models (LLMs) typically rely on extracting last-layer hidden states with fixed dimensions to compute similarity for every text pair. We argue that this paradigm suffers from two limitations: (i) the last hidden layer encodes general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) the hidden dimensions of LLMs are generally very large, which introduces redundancy and noise when representing semantics. In this work, we propose DySem, a novel training-free framework that investigates more semantic-related internal components of LLMs via multilingual consensus. It shifts away from static representation spaces in favor of dynamic, sample-specific semantic dimensions by constructing a text-dependent joint semantic set and computing similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while maintaining lower dimensions for similarity calculation. The code is released at https://github.com/szu-tera/DySem.
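
Computing similarity over a sample-specific subset of dimensions can be sketched as follows. DySem selects dimensions through multilingual consensus over internal LLM components; the sketch substitutes a much simpler rule (each vector's top-k entries by magnitude) purely to show the "joint set, then restricted cosine" mechanics. `top_dims`, `joint_similarity`, and the toy vectors are hypothetical.

```python
import math


def top_dims(vec, k):
    """Indices of the k largest-magnitude entries of vec."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return set(order[:k])


def joint_similarity(u, v, k):
    """Cosine similarity restricted to the union of each vector's top-k dims."""
    dims = sorted(top_dims(u, k) | top_dims(v, k))
    dot = sum(u[i] * v[i] for i in dims)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in dims))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in dims))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 4-D "hidden states"; real LLM states have thousands of dimensions.
a = [0.9, 0.1, 0.0, 0.2]
b = [0.8, 0.0, 0.1, 0.3]
print(joint_similarity(a, b, k=2))
```

The set of dimensions changes with each text pair, so the effective similarity space is dynamic rather than the fixed full hidden size.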

2605.29655 2026-06-01 cs.CV cs.GR

SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

Yuan Li, Congyi Zhang, Xifeng Gao, Xiaohu Guo

Affiliations: University of Texas at Dallas; Tencent America

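AI summary: Proposes SuperVoxelGPT, which uses adaptive and ordered supervoxel tokenization to resolve the tension between sequence length and spatial ordering in autoregressive 3D generation, enabling high-quality, efficient shape generation.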

Abstract

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10× speedup over prior methods.

2605.24700 2026-06-01 cs.CV cs.GR

SRUG: Shadow-Guided Relightable Urban Scene with Generation Model

Yonghao Zhao, Zexin Yin, Jian Yang, Beibei Wang, Jin Xie

Affiliations: College of Computer Science, Nankai University; Nankai University; Nanjing University; School of Intelligence Science and Technology, Nanjing University

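AI summary: Proposes SRUG, which uses shadows to guide a 3D completion model that recovers the geometry of invisible regions and combines iterative material decomposition with a physically based lighting model to produce relightable urban scenes from sparse input views.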

Abstract

Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene's material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.

2605.22737 2026-06-01 cs.LG cs.AI

The Distillation Game: Adaptive Attacks & Efficient Defenses

蒸馏博弈:自适应攻击与高效防御

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri

发表机构 * Stanford University(斯坦福大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) National University of Singapore(新加坡国立大学)

AI总结 通过极小极大博弈框架研究蒸馏攻击中模型提供者的部署权衡,提出自适应评估规则和专家乘积(PoE)防御方法,实验表明自适应学生能恢复更多能力,且PoE在成本和质量上具有优势。

详情
AI中文摘要

蒸馏攻击为模型提供者带来了部署权衡:使模型更有用的相同输出也可能使其更容易被模仿。我们通过一个效用受限的教师和自适应学生之间的极小极大博弈来研究这种权衡。我们的框架给出了易于处理的单边响应规则:一个自适应评估规则,其中学生重新加权高价值示例,以及一个教师侧防御模板,抑制对蒸馏最有用的输出。从示例价值的廉价代理出发,我们推导出专家乘积(Product-of-Experts,PoE),一种仅需前向传递的简单防御,在生成过程中将教师与代理学生结合。实验上,自适应评估揭示了显著的被动-自适应差距:面对最先进的防御,自适应学生在GSM8K和MATH上恢复的能力远多于被动评估所显示的水平。在这种更强的评估下,昂贵防御和PoE之间表面上的鲁棒性差距显著缩小,而PoE的成本仍然低得多,并保留了更高质量的推理轨迹。总体而言,我们的结果表明,强大的蒸馏仍然难以阻止,并且反蒸馏的进展应该根据自适应学生而非被动学生来判断。我们的代码可在:https://github.com/ysfalh/distillation-game 获取。

英文摘要

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive--adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.
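
The abstract describes PoE as combining the teacher with a proxy student during generation, but it does not give the formula. The sketch below only illustrates the generic product-of-experts idea on next-token logits. The function names, the `weight` parameter, and its sign are assumptions for illustration, not the authors' defense.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def product_of_experts(teacher_logits, proxy_logits, weight):
    """Return p with p[i] proportional to p_teacher[i] * p_proxy[i] ** weight.

    Adding logits multiplies the underlying distributions; the per-model
    normalisation constants cancel in the final softmax.
    `weight` is a hypothetical knob (assumption): 0 recovers the teacher,
    a negative value down-weights tokens the proxy student already favours.
    """
    combined = [t + weight * s for t, s in zip(teacher_logits, proxy_logits)]
    return softmax(combined)
```

Because only logits from two forward passes are combined, a rule of this shape needs no gradient computation at generation time, which matches the "forward-pass-only" description in the abstract.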

2605.29417 2026-06-01 cs.CV

ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects

ParCo-SDF: 学习可变形物体的无先验部分到完整有符号距离场

Deokmin Hwang, Minseok Song, Daehyung Park

发表机构 * School of Computing, Korea Advanced Institute of Science and Technology, Korea(韩国科学技术院计算机学院)

AI总结 提出 ParCo-SDF 两阶段框架,通过时序几何编码和 FiLM 条件 SDF 预测,实现无需物体特定先验的可变形物体部分到完整几何重建。

Comments Accepted at the 23rd International Conference on Ubiquitous Robots (UR 2026), 6 pages

详情
AI中文摘要

本研究面向精确的可变形物体(DO)操作,研究从点云观测出发对 DO 进行部分到完整的几何重建。最近的 DO 重建方法通常采用隐式神经表示(INRs)来建模连续表面并捕捉结构变异性。然而,这些方法通常依赖于物体特定的形状先验,这虽然提高了训练稳定性,但限制了泛化能力。为了解决这个问题,我们引入了 ParCo-SDF,一个两阶段的部分到完整有符号距离场(SDF)重建框架,包括时序几何编码和随后的 FiLM 条件 SDF 预测。时序编码器捕捉 DO 序列中的结构相似性,实现无先验的稳定训练。基于 FiLM 的条件化在降低网络复杂度的同时保持了重建的表达能力。我们在橡皮筋操作数据集上将所提方法与最先进的 DO 表面重建基线进行了对比评估,证明了在严重遮挡下的鲁棒和高保真重建。

英文摘要

This study addresses the partial-to-complete geometry reconstruction of deformable objects (DOs) from point-cloud observations toward precise DO manipulation. Recent DO reconstruction approaches often adopt implicit neural representations (INRs) to model continuous surfaces and capture structural variability. However, these methods typically rely on object-specific shape priors that improve training stability but limit generalization. To address this, we introduce ParCo-SDF, a two-stage partial-to-complete signed distance field (SDF) reconstruction framework consisting of temporal geometry encoding followed by FiLM-conditioned SDF prediction. The temporal encoder captures structural similarity across DO sequences, enabling prior-free stable training. FiLM-based conditioning preserves reconstruction expressivity while reducing network complexity. We evaluate the proposed method against a state-of-the-art DO surface reconstruction baseline on a rubber band manipulation dataset, demonstrating robust and high-fidelity reconstruction under severe occlusions.
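
FiLM (feature-wise linear modulation) conditions a network by scaling and shifting each hidden feature with parameters predicted from a conditioning vector. The sketch below shows that operation in isolation. How ParCo-SDF wires it into its SDF predictor is not stated in the abstract, so the function name and the toy values are assumptions.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation: out[i] = gamma[i] * features[i] + beta[i].

    In a FiLM-conditioned network, gamma and beta would be predicted from a
    conditioning code (here, hypothetically, the temporal geometry encoding),
    while `features` are hidden activations of the SDF predictor.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [g * h + b for h, g, b in zip(features, gamma, beta)]
```

Since the conditioning enters only through one scale and one shift per feature, the conditioned network needs few extra parameters, which is consistent with the abstract's remark about reduced network complexity.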

2605.29373 2026-06-01 cs.LG cs.NA math.NA

Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems

逆问题中贝叶斯推理的深度自适应降维

Yueyang Wang, Xili Wang, Kejun Tang, Xiaoliang Wan, Tao Zhou, Chao Yang

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) School of Sciences, Great Bay University(大湾区大学理学院) Department of Mathematics, Louisiana State University(路易斯安那州立大学数学系) SKLMS & Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院SKLMS及计算数学与科学/工程计算研究所)

AI总结 提出基于变分流模型的深度自适应降维贝叶斯推理框架,结合VAE非线性降维、双归一化流和迭代先验更新策略,并自适应微调傅里叶神经算子代理,以高效求解高维PDE控制逆问题中的复杂非高斯后验分布。

Comments 25 pages, 5 figures

详情
AI中文摘要

求解高维PDE控制的逆问题通常具有挑战性,原因在于复杂的非高斯后验分布、昂贵的正演模型评估以及错误的先验信息。为了解决这些问题,我们提出了一种基于变分流(VF)模型的深度自适应降维贝叶斯推理框架。由于标准归一化流受双射映射限制且无法直接降维,VF通过将基于VAE的非线性降维与潜在先验和编码器的双归一化流相结合,克服了这一限制。该设计提供了严格高于VAE的证据下界,并允许更灵活地逼近复杂后验分布。我们进一步引入了一种迭代先验更新策略,该策略逐渐将先验均值移向高概率后验区域,避免了手动先验调整。这些组件与自适应微调的傅里叶神经算子(FNO)代理一起形成了一个闭环自适应循环:VF生成后验集中样本以改进代理,而更新的代理进一步改进后验推理。在100维Rosenbrock问题和三个标准PDE控制逆问题上的数值实验表明,与MCMC、UKI和SVGD基线相比,我们的方法在所有测试配置中均具有竞争性或更优的精度,在高噪声观测和高维参数空间等挑战性场景中优势最为明显。

英文摘要

Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information. To address these issues, we propose a deep adaptive dimension-reduction Bayesian inference framework based on the Variational Flow (VF) model. Since standard normalizing flows are restricted by bijective mappings and cannot directly reduce dimensions, VF overcomes this limitation by integrating VAE-based nonlinear dimension reduction with dual normalizing flows for the latent prior and encoder. This design provides a strictly higher evidence lower bound than VAE and allows more flexible approximation of complex posterior distributions. We further introduce an iterative prior updating strategy that gradually moves the prior mean toward high-probability posterior regions, avoiding manual prior tuning. These components form a closed adaptive loop together with an adaptively fine-tuned Fourier Neural Operator (FNO) surrogate: VF generates posterior-concentrated samples to refine the surrogate, while the updated surrogate further improves posterior inference. Numerical experiments on a 100-dimensional Rosenbrock problem and three standard PDE-governed inverse problems show that our method delivers competitive or superior accuracy compared with MCMC, UKI, and SVGD baselines across all tested configurations, with the most pronounced advantages emerging in challenging scenarios such as high-noise observations and high-dimensional parameter spaces.

2605.29343 2026-06-01 cs.CL

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Draft-OPD:用于推测草稿模型的在线策略蒸馏

Haodi Lei, Yafu Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 针对推测解码中草稿模型因离线训练与在线推理不匹配导致性能瓶颈的问题,提出Draft-OPD方法,通过目标辅助展开和重放验证暴露的错误位置实现在线策略蒸馏,在多种任务上实现超过5倍的无损加速。

详情
AI中文摘要

推测解码通过将目标模型与轻量级草稿模型配对,并行验证其提出的令牌,从而加速大型语言模型推理。构建草稿模型的常见方法(如EAGLE3或DFlash)是在目标生成轨迹上进行监督微调(SFT)。然而,我们观察到SFT很快达到平台期:草稿模型在测试数据上的接受长度停止提升。原因是离线到推理的不匹配:在SFT中,草稿模型从固定的目标生成轨迹学习,而在推测解码期间,它在其自身策略提出的块上进行评估。这激发了在线策略蒸馏(OPD),其中目标模型在草稿诱导的状态上监督草稿模型。然而,OPD对于草稿模型仍然困难,因为它们无法可靠地独立展开完整序列,而目标辅助生成使收集的序列遵循目标分布,从而消除了在线策略信号。因此,我们提出Draft-OPD,它使用目标辅助展开进行稳定延续,并从验证暴露的错误位置重放草稿。这使得草稿模型能够从接受和拒绝的提议中学习目标反馈,将训练集中在限制推测接受的草稿诱导错误上。实验表明,Draft-OPD在多种任务上对思考模型实现了超过5倍的无损加速,比EAGLE-3和DFlash分别提高了23%和13%。

英文摘要

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
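
To make "acceptance length" and "verification-exposed error positions" concrete, the sketch below uses the simplest verification rule: greedy matching, where a drafted token is accepted only if it equals the token the target model would pick at that position. This is an assumption for illustration. Lossless sampling-based verification uses a probabilistic acceptance test instead, and the function names are hypothetical.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target's greedy choices."""
    n = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        n += 1
    return n

def first_error_position(draft_tokens, target_tokens):
    """Index of the first rejected draft token, or None if the whole block is accepted."""
    n = accepted_length(draft_tokens, target_tokens)
    return n if n < len(draft_tokens) else None
```

For example, with a drafted block `[5, 7, 9, 2]` and target choices `[5, 7, 1, 2]`, two tokens are accepted and position 2 is the error position from which drafting would be replayed.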

2605.29317 2026-06-01 cs.CL

FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning

FoRA: 基于Fisher正交秩适配的参数高效微调

Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee

发表机构 * OptAI Inc(OptAI公司) LG Uplus

AI总结 提出FoRA方法,通过Fisher信息选择信息层并在Stiefel流形上训练LoRA下投影,在减少参数预算的同时保持性能,优于LoRA和DoRA。

Comments EMNLP 2026

详情
AI中文摘要

参数高效微调(PEFT)主要关注LoRA及其面向精度的变体,而减少可训练参数的原始目标相对较少受到关注。我们引入了FoRA,通过减少适配层数而非适配器秩来重新审视这一目标。FoRA通过单次对角Fisher评分(训练成本低于1%)选择任务信息层,并在Stiefel流形上训练所选层的LoRA下投影,保持列正交性和有效秩。在五个LLaMA系列骨干网络上,FoRA在参数预算减半的情况下始终优于LoRA和DoRA,在参数数量为AdaLoRA四分之一时,精度差距在0.7-0.8个点以内。在来自LLaMA、Qwen3和Gemma系列的十二个骨干网络上的跨架构实验证实了从270M到32B参数的一致增益。两个组件超加性地结合:Fisher选择本身在相同预算下匹配秩缩减,而Stiefel约束提供了决定性的额外增益。

英文摘要

Parameter-efficient fine-tuning (PEFT) has largely focused on LoRA and its accuracy-oriented variants, leaving the original goal of reducing trainable parameters with comparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.
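
A diagonal Fisher score is commonly estimated from squared per-sample gradients. The sketch below ranks layers by the mean squared gradient of their parameters and keeps the top `k`. The per-parameter averaging, the data layout, and the function names are assumptions, since the abstract does not specify how FoRA aggregates the score per layer.

```python
def diagonal_fisher_scores(per_sample_grads):
    """Map each layer to the mean squared gradient over its samples and parameters.

    per_sample_grads: {layer_name: [gradient vector for each sample]}
    (a hypothetical layout; real gradients would come from one backward pass
    per sample or mini-batch).
    """
    scores = {}
    for layer, grads in per_sample_grads.items():
        total = sum(g * g for sample in grads for g in sample)
        count = sum(len(sample) for sample in grads)
        scores[layer] = total / count if count else 0.0
    return scores

def select_layers(scores, k):
    """Return the k layers with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Only the selected layers would then receive adapters, which is how the parameter budget shrinks without lowering the adapter rank.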

2605.29299 2026-06-01 cs.CV cs.AI

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

口袋牙医:通过高效多模态大语言模型实现设备端牙科图像理解

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia

发表机构 * The University of Auckland, New Zealand(奥克兰大学) Shandong University, China(山东大学) The University of Melbourne, Australia(墨尔本大学)

AI总结 提出Pocket-Dentist基准,通过评估14种视觉语言模型发现紧凑模型(2B参数)在牙科图像理解中精度更高且计算成本更低,并在iPhone 17 Pro上实现低延迟部署。

详情
AI中文摘要

牙科视觉语言模型的评估在数据集、任务定义和指标上仍然分散,并且常常忽略其计算成本。这限制了它们在专科中心之外的广泛部署用于牙科筛查,而及时推理、有限的硬件以及对患者图像的本地处理对于实用、保护隐私的临床预筛查至关重要。本文提出了Pocket-Dentist,一个面向牙科多模态问答的效率感知基准,它汇集了三个数据集,涵盖约1159名患者、五种任务类型和七种指标。在典型的14种VLM上,我们的结果揭示了一个有趣的观察:紧凑型VLM(例如2B参数模型)在牙科图像理解中精度更高,同时所需计算成本大幅降低。在iPhone 17 Pro上本地部署时,我们微调的紧凑型VLM Pocket-Dentist-2B处理每个样本耗时4.31秒,与7B基线相比延迟降低4.9倍,内存使用减少2.3倍。

英文摘要

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. Across 14 representative VLMs, our results reveal an interesting observation: compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.

2605.29268 2026-06-01 cs.CL cs.AI cs.LG cs.NE

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

进化搜索中的计算分配:从深度-广度到多臂老虎机

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan

发表机构 * University of Notre Dame(圣母大学) Northeastern University(东北大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Southeast University(东南大学) Northwestern University(西北大学) Nanjing University(南京大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 针对LLM引导的进化搜索中固定预算的LLM调用分配问题,提出基于多臂老虎机的BaSE方法,通过跨并行轨迹分配调用,平均适应度提升12.3%。

详情
AI中文摘要

LLM引导的进化搜索(Evolve系统)在数学和组合任务上达到了最先进的结果,但现有系统通常只报告多次运行中的最佳结果,而未记录运行间的分布。我们探究应如何分配固定的LLM调用预算,以及单次运行达到报告数字的可靠性如何。通过扫描五个模型和三个任务的深度-广度网格,我们识别出两个经验规律:一条适应度-计算包络线,沿该包络线,不同模型的能力排序在按有效FLOPs衡量时基本消失;以及带有任务特定交互项的双线性深度-广度拟合;两者都受模型-任务能力门控。受这些规律启发,我们提出BaSE(基于老虎机的自进化),一种多臂老虎机方法,它在并行轨迹间分配LLM调用。在不改变模型、提示或评估器的情况下,BaSE在8个(模型,任务)单元上比最强的岛屿协议基线平均适应度提高12.3%,在方差高的设置上增益最大:仅通过分配实现可靠性提升。

英文摘要

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.
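
The abstract frames call allocation across parallel trajectories as a multi-armed bandit but does not name the bandit rule. As a stand-in, the sketch below uses the standard UCB1 rule, which is an assumption and not necessarily what BaSE implements. Each arm is a hypothetical trajectory whose callable returns the fitness gain of one more LLM call.

```python
import math

def allocate_calls(reward_fns, budget):
    """Spend `budget` calls across arms with UCB1; return the calls given to each arm.

    reward_fns: one zero-argument callable per trajectory (hypothetical
    stand-ins for "make one LLM call on this trajectory and score the result").
    """
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once before using the index
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        sums[arm] += reward_fns[arm]()
        counts[arm] += 1
    return counts
```

With two trajectories that always return 0.2 and 0.8, most of the budget flows to the second one while the first still receives occasional exploratory calls.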

2605.29198 2026-06-01 cs.CV

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

引导对比令牌信用分配用于离散策略优化

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yuta Kyuragi, Aditya Grover

发表机构 * UCLA Panasonic AI Research(松下人工智能研究) NVIDIA(英伟达)

AI总结 针对组优势强化学习方法中令牌级信用分配缺失的问题,提出引导对比策略优化(GCPO),通过正负提示下的对比预测分配令牌级优势,在文本到图像生成和思维链推理任务上优于GRPO和DAPO。

Comments 21 pages, 11 figures

详情
AI中文摘要

基于组优势的强化学习方法,如GRPO和DAPO,在包括数学推理和文本到图像生成在内的多个领域展示了强大的性能。然而,它们对样本级奖励的依赖引入了一个关键限制,即所有令牌的均匀信用分配无法捕捉细粒度的令牌级贡献。为了解决这个问题,我们提出了引导对比策略优化(GCPO),一种新颖的算法,通过对比正负提示下的模型预测来实现每个令牌的信用分配。GCPO不是均匀地广播样本级优势,而是分配与这些对比预测差异成比例的令牌级优势,从而提供更精确和信息丰富的学习信号。实验上,我们发现GCPO强调语义相关区域,例如文本到图像生成中与文本提示对齐的视觉区域,以及思维链任务中推理轨迹内的重要关键词。通过大量实验,GCPO在文本到图像生成和思维链推理基准测试上始终优于GRPO和DAPO基线,证明了其作为离散策略学习的通用且可扩展优化策略的有效性。

英文摘要

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
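
The abstract says token-level advantages are proportional to the difference between predictions under positive and negative prompts. One way to read that is sketched below: weight each token by its log-probability gap and rescale so the mean token advantage equals the sample-level advantage. The absolute-value gap, the mean normalisation, and the uniform fallback are all assumptions, not the published GCPO formula.

```python
def token_advantages(sample_advantage, logp_pos, logp_neg, eps=1e-8):
    """Spread one sample-level advantage over tokens by contrastive gap.

    logp_pos / logp_neg: per-token log-probabilities of the sampled tokens
    under the positive and negative prompt (hypothetical inputs).
    """
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    if not gaps:
        return []
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap < eps:
        # No contrast anywhere: fall back to uniform broadcasting.
        return [sample_advantage] * len(gaps)
    return [sample_advantage * g / mean_gap for g in gaps]
```

Tokens whose prediction barely changes between the two prompts receive little credit, while tokens that depend strongly on the prompt receive more.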

2605.29146 2026-06-01 cs.CL cs.AI

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

SafeRx-Agent: 基于知识的多智能体框架用于安全且可解释的药物推荐

Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song

发表机构 * McGill University(麦吉尔大学) McMaster University(麦克马斯特大学) University of Toronto(多伦多大学) Ohio University(俄亥俄大学)

AI总结 提出SafeRx-Agent,一种基于知识的多智能体框架,通过患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合,在MIMIC-III和MIMIC-IV数据集上提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

详情
AI中文摘要

药物推荐预测患者就诊时的用药,但现有方法仍面临两个关键挑战。在模型层面,传统药物推荐方法仅预测结构化的药物代码,证据基础有限,而LLM智能体可以利用更丰富的临床上下文,但可能缺乏安全验证和可追溯性。在任务层面,现有基准通常使用宽泛的药物类别,忽略了亚组级别的安全性差异,可能导致风险高估。我们引入了基于第四级ATC代码生成的第一个细粒度药物推荐设置。我们提出了安全处方智能体(SafeRx-Agent),一种基于知识的多智能体框架,利用患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合。在MIMIC-III和MIMIC-IV数据集上的实验结果表明,SafeRx-Agent提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

英文摘要

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

2605.28836 2026-06-01 cs.CL cs.AI

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

不让任何读者掉队:人人能理解的多智能体摘要

Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University(高丽大学计算机科学与工程系) Department of Computer Science and Engineering, Konkuk University(建国大学计算机科学与工程系)

AI总结 提出NRLB多智能体框架,通过模拟三类读者群体并结合模板规划与迭代优化,生成既忠实又易于理解的平实语言摘要。

详情
AI中文摘要

美国的《平实语言法案》要求政府文件使用清晰、简单的语言,以便公众易于理解,但现有的摘要系统难以应对普通读者中多样化的语言和认知障碍。我们提出了NRLB(不让任何读者掉队),一个用于平实语言摘要的多智能体框架,它模拟了三类代表性读者群体:小学生读者、非母语读者和注意力缺陷读者。NRLB结合了基于模板的规划与迭代的、面向读者的优化,能够系统地检测和解决难懂术语、缺失上下文和令人困惑的句子。在多个数据集上的评估显示,在保持事实准确性的同时,可读性持续提升。人工评估进一步验证了NRLB的效果,标注者偏好率在55%到76%之间,突显了NRLB在生成既忠实于原文又广泛适用于公众的平实语言摘要方面的潜力。

英文摘要

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2605.25134 2026-06-01 cs.LG cs.AI

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

重参数化、权重衰减和自适应学习率下稀疏优化的理论分析

Huangyu Xu, Jingqin Yang, Qianqian Xu, Jiaye Teng

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(人工智能安全国家重点实验室,计算技术研究所,中国科学院,北京,中国) School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学计算机科学与技术学院,北京,中国) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China(北京智源人工智能研究院(BAAI),北京,中国) IIIS, Tsinghua University, Beijing, China(清华大学交叉信息研究院,北京,中国) School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China(上海财经大学统计与管理学院,上海,中国) Institute of Data Science and Statistics, Shanghai University of Finance and Economics, Shanghai, China(上海财经大学数据科学与统计研究所,上海,中国)

AI总结 针对稀疏优化中的不稳定问题,提出基于重参数化、权重衰减和自适应学习率的ReWA方法,通过改善优化景观实现比ℓ1正则化更好的稀疏性,同时保持测试精度。

Comments 32 pages, 5 figures. Submitted to ICML 2026

详情
AI中文摘要

稀疏优化是各种实际应用中的一个基本挑战。一种流行的稀疏优化方法是ℓ_p正则化。然而,当0<p<1时,由于无界梯度,它可能遇到优化不稳定性。在本文中,我们介绍了一种新的稀疏优化方法,称为ReWA,它基于重参数化、权重衰减和自适应学习率。ReWA与ℓ_p正则化密切相关,但它揭示了一个不同的优化景观,有助于缓解不稳定性问题。在CIFAR-10和ImageNet上使用ResNets进行的实验表明,与ℓ_1正则化方法相比,ReWA在保持测试精度的同时显著提高了稀疏性。

英文摘要

Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ regularization. However, it may encounter optimization instability due to the unbounded gradients when $0<p<1$. In this paper, we introduce a novel approach to sparse optimization termed ReWA, based on Reparameterization, Weight decay, and Adaptive learning rate. ReWA is closely connected to $\ell_p$-regularization, yet it unveils a distinct optimization landscape that helps mitigate instability issues. Experiments on CIFAR-10 and ImageNet with ResNets demonstrate that ReWA leads to significant sparsity improvements over the $\ell_1$-regularization approach while preserving test accuracy.
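
The instability mentioned for $0<p<1$ follows directly from the derivative of the penalty, and a textbook identity shows why reparameterization combined with weight decay can act as a sparsity penalty with bounded gradients. The product form $w = uv$ below is a classical illustration and an assumption here. The abstract does not state which reparameterization ReWA uses.

```latex
% Gradient of the \ell_p penalty blows up at the origin when 0 < p < 1:
\[
\frac{\partial}{\partial w}\,|w|^{p} \;=\; p\,|w|^{p-1}\,\operatorname{sign}(w),
\qquad
p\,|w|^{p-1} \to \infty \quad \text{as } w \to 0 \text{ when } 0 < p < 1 .
\]

% Weight decay on a product reparameterization w = u v equals an \ell_1 penalty on w
% (AM-GM inequality, with equality at |u| = |v| = \sqrt{|w|}):
\[
\min_{u,\,v \,:\, uv = w} \; \tfrac{1}{2}\bigl(u^{2} + v^{2}\bigr) \;=\; |w| .
\]
```

The left-hand side of the second identity is smooth in $(u, v)$, so its gradients stay bounded near zero even though the induced penalty on $w$ is non-smooth.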

2604.22409 2026-06-01 cs.CV

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

SpaMEM:具身环境中通过感知-记忆集成进行动态空间推理的基准测试

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao

发表机构 * The University of New South Wales(新南威尔士大学) The University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Fudan University(复旦大学) Tsinghua University(清华大学) Zhejiang University(浙江大学) Xinjiang University(新疆大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出SpaMEM基准,通过动作条件场景变换和多模态数据,分层评估多模态大模型在具身环境中的空间信念演化能力,揭示坐标一致性和视觉记忆瓶颈。

详情
AI中文摘要

多模态大语言模型(MLLMs)在静态视觉-空间推理方面取得了进展,但在具身环境中,当信念必须根据环境变化下的自我中心观察不断修正时,它们往往无法保持长期的空间连贯性。我们引入了SpaMEM(动作序列的空间记忆),这是一个大规模诊断基准,通过长交互时间内的动作条件场景变换(生成、放置、移除)来隔离空间信念演化的机制。SpaMEM基于一个物理基础数据集构建,包含来自1000个程序生成房屋中25000多个交互序列的10,601,392张高保真图像,涵盖四种模态(RGB、深度、实例、语义分割)。我们将具身空间推理形式化为一个三级层次结构,包含15个诊断任务:第1级测量单次观察的原子空间感知;第2级利用神谕文本状态历史探测时间推理,以排除感知噪声;第3级要求在同一任务维度下从原始视觉流进行端到端的信念维护。我们还评估了短期(逐步)更新和长期(情节)重建。对代表性开源VLM系列的基准测试揭示了一个一致的堆叠瓶颈:坐标一致的定位仍然是一个硬上限,从第2级到第3级的急剧下降暴露了显著的符号脚手架依赖性,即模型在基于文本的记账中成功,但难以维持稳健的视觉记忆。SpaMEM提供了一个细粒度的诊断标准,并激发了状态表示、信念修正和长期情节集成的显式机制。SpaMEM的一个子集可在https://huggingface.co/datasets/mill-ct-liao/SpaMEM公开获取。

英文摘要

Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration. A subset of SpaMEM is publicly available at https://huggingface.co/datasets/mill-ct-liao/SpaMEM.

2603.09632 2026-06-01 cs.CV cs.CL

X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting

X-GS:基于3D高斯溅射的感知与思考可扩展框架

Yueen Ma, Zenglin Xu, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院)

AI总结 提出X-GS框架,包含感知器和思考器,统一多种3DGS技术实现实时在线SLAM与语义蒸馏,并支持多模态模型完成下游任务。

详情
AI中文摘要

3D高斯溅射(3DGS)已成为新颖视图合成的强大技术,随后扩展到众多空间AI应用。然而,大多数现有3DGS方法孤立运行,专注于特定领域。本文介绍X-GS,一个包含两个主要组件的可扩展框架。X-GS-感知器统一了广泛的3DGS技术,以实现具有语义蒸馏的实时在线SLAM。X-GS-思考器容纳多模态模型,使其能够与感知器无缝交互以完成下游任务。在我们的X-GS实现中,感知器利用最新的视觉基础模型提高在线SLAM性能,并采用三种关键机制加速语义蒸馏。思考器可以基于对比和生成视觉语言模型构建,并利用感知器的语义高斯溅射解锁3D视觉定位和场景描述等功能。在多个基准上的实验结果表明了X-GS框架的高效性和新解锁的多模态能力。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, focusing on specific domains. In this paper, we introduce X-GS, an extensible framework consisting of two major components. The X-GS-Perceiver unifies a broad range of 3DGS techniques to enable real-time online SLAM with semantic distillation. The X-GS-Thinker accommodates multimodal models, enabling them to seamlessly interface with the Perceiver to complete downstream tasks. In our implementation of X-GS, the Perceiver leverages the latest vision foundation models to improve online SLAM performance and employs three key mechanisms to accelerate semantic distillation. The Thinker can be built upon both contrastive and generative vision-language models and utilizes the Perceiver's semantic Gaussian splats to unlock capabilities such as 3D visual grounding and scene captioning. Experimental results on diverse benchmarks demonstrate the efficiency and newly unlocked multimodal capabilities of the X-GS framework.

2602.10388 2026-06-01 cs.CL cs.AI

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

少即足够:利用稀疏自编码器在LLM特征空间中合成多样化数据

Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

发表机构 * Department of Computing, University of Georgia, Georgia, United States(佐治亚大学计算机系) Computer Engineering, University of California San Diego, California, United States(加州大学圣地亚哥分校计算机工程系) Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学机器学习系) Department of Computing, Hong Kong Polytechnic University, Hong Kong, China(香港理工大学计算机系)

AI总结 提出基于稀疏自编码器的特征激活覆盖率(FAC)指标及数据合成框架FAC Synthesis,通过识别缺失特征并生成对应样本来提升数据多样性和下游任务性能。

详情
AI中文摘要

后训练数据的多样性对于大型语言模型(LLM)的有效下游性能至关重要。许多现有的后训练数据构建方法使用基于文本的指标来衡量多样性,这些指标捕捉语言变化,但此类指标仅能为决定下游性能的任务相关特征提供微弱信号。在这项工作中,我们引入了特征激活覆盖率(FAC),该指标在可解释的特征空间中衡量数据多样性。基于此指标,我们进一步提出了一个多样性驱动的数据合成框架,名为FAC Synthesis,该框架首先使用稀疏自编码器从种子数据集中识别缺失特征,然后生成明确反映这些特征的合成样本。实验表明,我们的方法在包括指令遵循、毒性检测、奖励建模和行为引导在内的各种任务上,持续提高了数据多样性和下游性能。有趣的是,我们识别出跨模型家族(即LLaMA、Mistral和Qwen)共享的可解释特征空间,从而实现了跨模型知识迁移。我们的工作为探索以数据为中心的LLM优化提供了坚实且实用的方法论。

英文摘要

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
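
A coverage metric over sparse-autoencoder features can be pictured as the fraction of reference features that at least one seed sample activates. The features that no sample activates are then the "missing" ones to target with synthesis. The sketch below follows that reading. The threshold, the data layout, and the exact definition are assumptions, since the abstract does not spell out how FAC is computed.

```python
def active_features(activations, threshold=0.0):
    """Indices of features that exceed `threshold` on at least one sample.

    activations: list of per-sample feature activation vectors (hypothetical
    output of a sparse autoencoder applied to each sample).
    """
    active = set()
    for sample in activations:
        for idx, value in enumerate(sample):
            if value > threshold:
                active.add(idx)
    return active

def coverage_and_missing(seed_activations, reference_features, threshold=0.0):
    """Return (fraction of reference features covered, set of uncovered features)."""
    covered = active_features(seed_activations, threshold) & reference_features
    missing = reference_features - covered
    coverage = len(covered) / len(reference_features) if reference_features else 0.0
    return coverage, missing
```

Measuring diversity this way depends on which features fire, not on surface wording, which is the contrast with text-based metrics drawn in the abstract.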

2509.21190 2026-06-01 cs.LG cs.AI

Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

面向零样本时间序列异常检测的基础模型:利用合成数据和相对上下文差异

Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

发表机构 * Department of Industrial Engineering, Tsinghua University, Beijing, China(清华大学工业工程系) Datadog AI Research, Paris, France. This work was completed prior to joining Datadog(Datadog AI 研究院) 2012 Lab, Huawei Technologies, Shenzhen, China(华为技术2012实验室)

AI总结 提出基于相对上下文差异(RCD)的预训练范式,通过合成数据训练Transformer模型比较查询模式与上下文,实现零样本时间序列异常检测,在多个基准上超越现有基础模型。

Comments This manuscript is withdrawn, as the authors intend to further extend and develop the work beyond its current scope

详情
AI中文摘要

时间序列异常检测(TSAD)是一项关键任务,但开发能够以零样本方式泛化到未见数据的模型仍然具有挑战性。现有的TSAD基础模型通常依赖推理时的重构误差评分,这可能会遗漏重构良好的细微异常,并可能错误地标记未见领域中复杂但正常的模式。我们引入了TimeRCD,这是一个基于相对上下文差异(RCD)构建的TSAD基础模型,RCD是一种预训练范式,通过比较查询模式与其周围上下文来训练模型检测异常。这种关系公式通过标准Transformer架构实现,使模型能够从输入上下文中推断正常性,而不是依赖固定的全局正常模式。我们进一步构建了一个大规模合成语料库,其中包含上下文相关的异常标签,为RCD提供监督预训练信号。跨多个基准的实验表明,在大多数零样本TSAD设置中,TimeRCD优于现有的通用和异常特定基础模型,同时与数据集特定的全样本基线保持竞争力。这些结果提供了实证证据,表明RCD是构建鲁棒且可泛化的TSAD模型的有效方向。

英文摘要

Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains challenging. Existing foundation models for TSAD often rely on reconstruction-error scoring at inference time, which can miss subtle anomalies that are well reconstructed and can falsely flag complex but normal patterns in unseen domains. We introduce TimeRCD, a foundation model for TSAD built on Relative Context Discrepancy (RCD), a pre-training paradigm that trains the model to detect anomalies by comparing a query pattern with its surrounding context. This relational formulation, implemented with a standard Transformer architecture, enables the model to infer normality from the input context rather than relying on fixed global normal patterns. We further construct a large-scale synthetic corpus with context-dependent anomaly labels to provide supervised pre-training signals for RCD. Experiments across diverse benchmarks show that TimeRCD outperforms existing general-purpose and anomaly-specific foundation models in most zero-shot TSAD settings, while remaining competitive with dataset-specific full-shot baselines. These results provide empirical evidence that RCD is an effective direction for building robust and generalizable TSAD models.

2605.25842 2026-06-01 cs.AI cs.CL

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

MuCRASP: 多模态思维链推理感知的结构化剪枝

Aritra Dutta, Somak Aditya

Affiliations * Indian Institute of Technology, Kharagpur

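AI Summary Addresses the drop in chain-of-thought reasoning accuracy that vision-language models suffer after structured pruning. Proposes the MuCRASP framework, which identifies reasoning-critical tokens and preserves cross-modal alignment to maintain reasoning quality under compression.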

Comments Preprint ver. 2

Details
AI Chinese Abstract

视觉语言模型(VLM)越来越依赖思维链(CoT)推理来解决复杂的多模态任务,但其庞大的参数量使得部署成本高昂。结构化剪枝提供了一种自然的解决方案;然而,现有方法无法在VLM中保持CoT推理的准确性。我们确定了两个关键原因:(1)CoT一致性依赖于生成轨迹中的稀疏过渡点(枢轴令牌),而现有剪枝方法对CoT不敏感;(2)为单模态LLM设计的剪枝方法未考虑视觉和文本模态之间的激活分布差异。基于这些观察,我们提出了MuCRASP,一种结构化剪枝框架,针对推理关键组件,同时保持跨模态对齐并在全局参数预算下考虑层间敏感性。在三个推理基准测试上的四个VLM实验表明,MuCRASP在不断增加压缩的情况下始终能保持推理质量。在Qwen2.5-VL-7B上剪枝30%时,MuCRASP在物理推理任务上获得了8.87的LLM-as-a-Judge评分,而最强基线为7.32。此外,MuCRASP在高达50%的剪枝率下仍保持高推理一致性,显著优于先前的剪枝方法,同时表现出更低的困惑度退化。

English Abstract

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.