2510.19255 2026-06-17 cs.CV Version update

Advances in 4D Representation: Geometry, Motion, and Interaction

Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang

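AI summary: This survey covers 4D generation and reconstruction. Organized around three core pillars (geometry, motion, and interaction), it analyzes the properties, challenges, and suitable use cases of different 4D representations, and discusses the role of large language models and video foundation models in the field.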

Comments CGF'26, 21 pages. Project Page: https://mingrui-zhao.github.io/4DRep-GMI/

Abstract

We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a distinctive perspective of 4D representations, which model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to readers is how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will revisit the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide dedicated coverage of which 4D datasets are currently available, and what is lacking, for driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/

2509.15626 2026-06-17 cs.SD eess.AS Version update

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura

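AI summary: To address the lack of a public corpus and the impression-leakage problem in numerical voice impression control, the authors build LibriTTS-VI, the first public corpus for this task, and propose disentangled training and a reference-free method that significantly improve control accuracy.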

Comments Accepted to INTERSPEECH 2026

Abstract

Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: no public corpus and impression leakage, where reference audio biases synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus built on LibriTTS-R. For the second, we hypothesize a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training with two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method controlling the impression solely via target VI. Experimentally, our best method improves controllability: 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, which our methods overcome.

2603.03485 2026-06-17 cs.CV cs.AI cs.RO Version update

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

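AI summary: Proposes the Phys4D pipeline, which learns physics-consistent 4D world representations from video diffusion models through three-stage training (pseudo-supervised pretraining, physics-supervised fine-tuning, and reinforcement-learning correction), significantly improving fine-grained spatiotemporal and physical consistency.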

Abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2602.23116 2026-06-17 cs.LG cs.GT stat.ML Version update

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun

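AI summary: Studies regularized best-response max-regret minimization in online RLHF. Using the Generalized Bilinear Preference Model, it proves that strong convexity yields polylogarithmic regret, showing that fast regret rates are not limited to KL divergence.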

Comments 48 pages, 3 figures (ver3: major revisions; ver2: more colorful boxes, fixed some typos)

Abstract

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM) -- capturing intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix -- to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, derived using only strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a generic polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a dimension-wise improved regret (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that "fast" regrets are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.
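
To make the skew-symmetric preference structure concrete, here is a minimal sketch rather than the paper's implementation. It assumes a logistic link on the bilinear score and a hand-picked 2x2 skew-symmetric matrix (rank 2, so r = 1); the names `THETA`, `pref_prob`, `unit`, and the three example items are hypothetical and do not come from the abstract.

```python
import numpy as np

# Hypothetical rank-2 skew-symmetric matrix (THETA.T == -THETA), so r = 1.
THETA = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])

def pref_prob(x, y, theta=THETA):
    """P(x is preferred over y), assuming a logistic link on x^T theta y."""
    score = x @ theta @ y
    return 1.0 / (1.0 + np.exp(-score))

def unit(deg):
    """2-D unit feature vector at the given angle in degrees."""
    rad = np.deg2rad(deg)
    return np.array([np.cos(rad), np.sin(rad)])

# Three illustrative items placed 120 degrees apart on the unit circle.
item_a, item_b, item_c = unit(0), unit(120), unit(240)

# Each probability exceeds 0.5, so a beats b, b beats c, and c beats a:
# a preference cycle that no single scalar reward per item can represent.
cycle = [pref_prob(item_a, item_b),
         pref_prob(item_b, item_c),
         pref_prob(item_c, item_a)]
```

Because the matrix is skew-symmetric, swapping the two items negates the score, so `pref_prob(x, y) + pref_prob(y, x)` equals 1 under the assumed logistic link.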

2603.03824 2026-06-17 cs.AI cs.CL cs.LG cs.MA Version update

Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

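AI summary: Proposes the MIRROR framework, which reduces VLM hallucinations through closed-loop visual reflection (draft, critique, region-based verification, revision), and builds the ReflectV dataset to train visually grounded multi-turn reflection.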

Abstract

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

2508.03250 2026-06-17 cs.CL cs.AI Version update

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

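AI summary: In a travel-assistant setting, analyzes the subjective choices of LLMs with multinomial logit models to infer their willingness to pay and compare it with human benchmarks. LLMs show systematic deviations at the attribute level and overestimate willingness to pay, but conditioning on prior preferences brings the estimates closer to human values.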

DOI: 10.1016/j.eswa.2026.133279

Abstract

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
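
As a rough illustration of how a willingness-to-pay value falls out of a fitted logit model, the sketch below uses the standard ratio for a utility that is linear in price and attributes: WTP = -(attribute coefficient) / (price coefficient). The abstract does not give the exact model specification, so the linear utility, the function name `willingness_to_pay`, and the coefficient values are all assumptions made for illustration.

```python
def willingness_to_pay(beta_attribute, beta_price):
    """WTP for one unit of an attribute under a linear-in-parameters utility.

    With U = beta_price * price + beta_attribute * x, a price change of
    -beta_attribute / beta_price offsets one extra unit of x in utility.
    """
    if beta_price == 0:
        raise ValueError("price coefficient must be non-zero")
    return -beta_attribute / beta_price

# Hypothetical fitted coefficients, not taken from the paper.
beta_view = 0.8      # utility gain from a room with a view
beta_price = -0.02   # utility change per currency unit of price

wtp_view = willingness_to_pay(beta_view, beta_price)  # about 40 currency units
```

A negative price coefficient (higher price, lower utility) combined with a positive attribute coefficient gives a positive WTP, which is the quantity that would then be compared with human benchmark values.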

2602.08939 2026-06-17 cs.AI Version update

CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

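AI summary: Proposes the CTK benchmark, which uses 5,147 cases to diagnose failure modes of large language models in causal reasoning, with annotations for causal rung, trap type, pressure sensitivity, and refusal quality, revealing flaws hidden by aggregate accuracy.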

Comments 12 pages, 17 tables, 4 figures

Abstract

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.

2509.21886 2026-06-17 cs.AI Version update

TRACE: Learning to Compute on Circuit Graphs

Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu

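AI summary: To address the architectural mismatch of graph representation learning in modeling circuit function, proposes TRACE, which uses a Hierarchical Transformer and function shift learning and significantly outperforms existing methods.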

Abstract

Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. The flawed assumption of permutation-invariant aggregation, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs, and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.
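
The idea of a function shift can be illustrated with a toy example. The sketch below is an assumption-laden reading of the abstract, not the authors' code: it takes the "function" of a node to be its probability of being 1 under uniformly random inputs, and it uses a hypothetical three-gate circuit y = (a AND b) AND (NOT a) in which the final gate's inputs are correlated through the shared input a.

```python
from itertools import product

def true_output_probability():
    """Exact P(y = 1) for y = (a AND b) AND (NOT a), by exhaustive enumeration."""
    outputs = [(a and b) and (not a)
               for a, b in product([False, True], repeat=2)]
    return sum(outputs) / len(outputs)

def local_output_probability(p_a=0.5, p_b=0.5):
    """Gate-by-gate estimate that treats every gate's inputs as independent."""
    p_n1 = p_a * p_b      # AND gate: n1 = a AND b
    p_n2 = 1.0 - p_a      # NOT gate: n2 = NOT a
    return p_n1 * p_n2    # final AND gate, ignoring that n1 and n2 share input a

# The target in this reading is the gap between the exact global behavior
# and the independence-based local estimate.
function_shift = true_output_probability() - local_output_probability()
```

Here the output can never be 1, so the exact probability is 0, while the independence-based estimate is 0.125, giving a shift of -0.125. In this interpretation a model would only need to learn that residual rather than the whole global function.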

2602.07429 2026-06-17 cs.LG cs.AI Version update

Brep2Shape: Boundary and Shape Representation Alignment via Self-Supervised Transformers

Yuanxu Sun, Yuezhou Ma, Haixu Wu, Guanyang Zeng, Muye Chen, Jianmin Wang, Mingsheng Long

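AI summary: Proposes Brep2Shape, a self-supervised pre-training method that uses a Dual Transformer backbone and topology attention to align the abstract boundary representation of B-rep with intuitive shape representations, achieving state-of-the-art accuracy and faster convergence on multiple downstream tasks.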

Abstract

Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks. Code is available at this repository: https://github.com/thuml/Brep2Shape.

2602.06257 2026-06-17 cs.LG cs.GT Version update

PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

Romain Cosentino

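AI summary: Proposes PLATE, a continual learning method that needs no old-task data. It exploits the geometric redundancy of pretrained networks and uses structured low-rank updates to explicitly control the plasticity-retention trade-off and improve worst-case retention guarantees.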

Abstract

We develop a continual learning method for pretrained models that requires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial geometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for where to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to PLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $\Delta W = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
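
The shape of the structured update can be sketched in a few lines of NumPy. This is only an illustration of the factorization stated in the abstract: the abstract says B and Q are derived from the pretrained weights but not how, so the random orthonormal bases below are placeholders, and the dimensions and the names `low_rank_update` and `adapted_weight` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank_b, rank_q = 8, 6, 3, 2

# Stand-in for one pretrained layer weight.
W = rng.standard_normal((d_out, d_in))

# Placeholders only: PLATE computes B and Q from the pretrained weights;
# here they are random orthonormal bases so the shapes can be demonstrated.
B = np.linalg.qr(rng.standard_normal((d_out, rank_b)))[0]  # (d_out, rank_b), frozen
Q = np.linalg.qr(rng.standard_normal((d_in, rank_q)))[0]   # (d_in, rank_q), frozen

# The only trainable factor; zero initialization leaves the layer unchanged.
A = np.zeros((rank_b, rank_q))

def low_rank_update(B, A, Q):
    """Structured update Delta W = B A Q^T."""
    return B @ A @ Q.T

def adapted_weight(W, B, A, Q):
    """Effective layer weight after adding the structured update."""
    return W + low_rank_update(B, A, Q)
```

With this factorization the trainable parameter count per layer is `rank_b * rank_q`, the update has rank at most `min(rank_b, rank_q)`, and any input orthogonal to the columns of `Q` is mapped exactly as it was by `W`.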

2602.03420 2026-06-17 cs.SD cs.LG Version update

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Yosub Shin, Michael Buriek, Igor Molybog

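AI summary: Proposes the m2sv benchmark, which evaluates VLM spatial reasoning by inferring camera direction from a north-up overhead map matched with a Street View image. The best model reaches 65.2% accuracy, below the human average of 72.0%, revealing gaps in geometric alignment and reasoning consistency.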

Abstract

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
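
For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa for two annotators. The abstract does not say which kappa variant was used, so treating it as Cohen's kappa is an assumption, and the direction labels and annotator answers are made-up illustration data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)

# Hypothetical answers from two annotators on eight items.
annotator_1 = ["N", "E", "S", "W", "N", "E", "S", "W"]
annotator_2 = ["N", "E", "S", "W", "N", "E", "W", "S"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy data the annotators agree on 6 of 8 items (0.75 observed agreement) against 0.25 expected by chance, so kappa is (0.75 - 0.25) / (1 - 0.25), roughly 0.67.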
