arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21059 2026-05-21 cs.CV cs.LG

Multimodal LLMs under Pairwise Modalities

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

AI summary: Proposes training multimodal LLMs from pairwise modality data alone. An identifiability analysis motivates a two-stage representation learning framework (latent alignment, then cross-modal recomposition) that the authors report achieves strong cross-modal performance.

Abstract

Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs using only multiple paired modalities as a surrogate for the full joint multimodal distribution. We first provide a theoretical analysis of the conditions under which the representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn the shared latent space across modalities through both self-modal reconstruction and pairwise contrastive learning. We also incorporate an inductive bias into the contrastive learning process through partial alignment and minimal latent specification. In the second stage, we integrate the encoders of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by adding 3D point cloud and tactile modalities to pre-trained MLLMs using three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.
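For readers who want a concrete picture of the pairwise contrastive step mentioned in the abstract, here is a minimal sketch of an InfoNCE-style loss over paired latents. It is an illustration only: the abstract does not specify the loss, so the InfoNCE form, the temperature value, the function names, and the toy vectors are all assumptions, and the partial-alignment bias the paper describes is not modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(za, zb, temperature=0.1):
    """Average InfoNCE loss where za[i] and zb[i] form a positive pair.

    Assumption: cosine similarity with a fixed temperature; every other
    item in the batch acts as a negative.
    """
    za = [normalize(v) for v in za]
    zb = [normalize(v) for v in zb]
    total = 0.0
    for i, anchor in enumerate(za):
        logits = [dot(anchor, other) / temperature for other in zb]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(za)


# Hypothetical latents from two modality encoders (values are made up).
modality_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(modality_a, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce(modality_a, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is small when each latent is closest to its own partner and large when the pairing is shuffled, which is the signal that pulls paired modalities into a shared space.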

2605.21058 2026-05-21 cs.LG

A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation

Yan Li, Yuewen Sun, Shaoan Xie, Gongxu Luo, Yunlong Deng, Kun Zhang, Guangyi Chen

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

AI summary: Argues that causal and traditional representation learning should be brought into dialogue, and introduces a unified formulation with a task component and a constraint component. Experiments indicate that the effectiveness of causal constraints depends on the task they are paired with.

Abstract

Causal representation learning (CRL) and traditional representation learning have largely developed along different trajectories. Traditional representation learning has been driven mainly by applications and empirical objectives, whereas CRL has focused more on theoretical questions, particularly identifiability. This difference in emphasis has created a gap between the two fields in terminology, problem formulation, and evaluation, limiting communication and sometimes leading to disconnected or redundant efforts. In this paper, we argue that these two fields should be brought into dialogue rather than treated as separate paradigms. To this end, we introduce a unified formulation in which the representation learning is characterized by two components: a task component, which specifies what information the learned representation is required to preserve, and a constraint component, which specifies what structure is imposed on the latent space. Under this formulation, the benefits run in both directions. CRL provides theoretical tools for understanding when structured latent constraints are useful or necessary, while traditional representation learning offers practical insights on task design and objective choice that can improve the development of CRL methods. To illustrate this interaction, we experimentally study how different task components affect the behavior of CRL methods under different structured constraints. Results on CausalVerse show that the effectiveness of causal constraints depends strongly on the tasks with which they are paired.

2605.21053 2026-05-21 cs.RO

Perception of Social Robots as Communication Partners in Healthcare for Older Adults

Hana Yamamoto, Carlotta Julia Mayer, Charlotte Raithel, Theresa Buchner, Christian Werner, Yasuhisa Hirata, Monika Eckstein, Katja Mombaur

Affiliations: Institute for Anthropomatics and Robotics; Karlsruhe Institute of Technology; Department of Robotics; Tohoku University; Institute of Medical Psychology; Heidelberg University; Institute of Psychology

AI summary: Examines whether social robots can serve as effective interaction partners for older adults in healthcare and whether positive prompts improve the interaction. The study finds no significant difference in stress levels between human and robot interactions and reports that the robot is accepted as a valid partner, which could help reduce caregiver burden.

Comments: 31 pages, 10 figures, under review at International Journal of Social Robotics

Abstract

Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults during human-robot interaction (HRI). This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+). Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.

2605.21049 2026-05-21 cs.CL

Cross-lingual robustness of LLM-brain alignment and its computational roots

Ni Yang, Rui He, Philipp Homan, Iris Sommer, Davide Staub, Wolfram Hinzen

Affiliations: Grammar and Cognition Lab, Department of Translation & Language Sciences, Universitat Pompeu Fabra; Department of Adult Psychiatry and Psychotherapy, University of Zurich; Neuroscience Center Zurich, University of Zurich and ETH Zurich; Center for Clinical Neuroscience and Cognition and Department of Psychiatry, University of Groningen, University Medical Center Groningen; Scalable Scientific Machine Learning Lab, Imperial College London, Department of Earth Science and Engineering; Institut Català de Recerca i Estudis Avançats (ICREA), Barcelona, Spain

AI summary: Studies the cross-lingual robustness of LLM-brain alignment with a multilingual whole-brain encoding framework applied to Mandarin, English, and French naturalistic story listening. The alignment overlaps spatially across languages but is not explained by predictive uncertainty or representational geometry.

Abstract

Large language models (LLMs) reliably predict neural activity during language comprehension, and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, whether it overlaps spatially across languages, and what its computational roots are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment during naturalistic story listening across three typologically distinct languages: Mandarin, English, and French. Our results show that, across languages, transformer-based models predicted activity in a distributed landscape spanning cortical functional networks, such as the limbic, ventral attention, and default mode networks, as well as subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but cannot be explained by predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.

2605.21042 2026-05-21 cs.CV

Dynamic Video Generation: Shaping Video Generation Across Time and Space

Shikang Zheng, Jingkai Huang, Jiacheng Liu, Guantao Chen, Lixuan, Yuqi Lin, Peiliang Cai, Linfeng Zhang

Affiliations: Shanghai Jiao Tong University; South China University of Technology; Tsinghua University

AI summary: Proposes DVG, a framework that jointly allocates computation across time and space and automatically selects content-aware acceleration strategies, achieving near-lossless acceleration of video generation.

Abstract

Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach that reduces latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to a 7x speedup on HunyuanVideo and HunyuanVideo-1.5, and 18x when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in the supplementary material and will be released on GitHub.

2605.21033 2026-05-21 cs.LG cs.DS

Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification

Guangyi Zhang, Lutz Oettershagen, Lixu Wang, Aristides Gionis

Affiliations: Shenzhen Technology University; University of Liverpool; Nanyang Technological University

AI summary: Presents efficient methods for computing Banzhaf values for $k$-nearest neighbor classifiers, addressing the computational cost of data valuation with a dynamic programming framework.

Comments: To appear at VLDB 2026

Abstract

Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
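To make the "exponential computational complexity" concrete, the sketch below computes Banzhaf values by direct enumeration of all subsets for a tiny unweighted $k$NN problem. This is the brute-force definition that the paper's dynamic programming algorithms are designed to avoid, not the authors' method. The utility function (share of the $k$ votes that match the test label, for a single test point) and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def knn_utility(subset, points, test_x, test_y, k):
    """Assumed utility: fraction of the k votes that match the test label."""
    nearest = sorted(subset, key=lambda i: abs(points[i][0] - test_x))[:k]
    return sum(points[i][1] == test_y for i in nearest) / k


def banzhaf_values(points, test_x, test_y, k):
    """Banzhaf value of each point: its marginal contribution averaged
    over all 2^(n-1) subsets of the remaining points."""
    n = len(points)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                with_i = knn_utility(subset + (i,), points, test_x, test_y, k)
                without_i = knn_utility(subset, points, test_x, test_y, k)
                total += with_i - without_i
        values.append(total / 2 ** (n - 1))
    return values


# Toy 1-D training set of (feature, label) pairs; all values are made up.
train = [(0.1, "a"), (0.2, "b"), (0.9, "a"), (1.5, "b")]
values = banzhaf_values(train, test_x=0.0, test_y="a", k=1)
print(values)
```

In this toy case the nearest correctly labeled point receives the largest value, the nearby mislabeled point receives a negative value, and the farthest point receives zero. The double loop visits $2^{n-1}$ subsets per point, which is why enumeration stops being practical beyond a few dozen points.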

2605.21032 2026-05-21 cs.CV

Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation

Bowyn Tan, Yutong Xie, Bai Huang, Fan Luo, Xiao Li, Naizheng Wang, Yang Guan, Shengbo Eben Li

Affiliations: Tsinghua University; Meituan; Central University of Finance and Economics

AI summary: Builds an information-geometric diagnostic framework that traces the failure of 3DGS methods to achieve both novel-view synthesis and time-varying modeling to a credit assignment dilemma between spatial and temporal parameters. Orthogonal Projected Gradient (OPG) and a temporal regularization strategy are introduced to improve the physical consistency of 4D scene reconstruction.

Comments: 20 pages, 4 figures

Abstract

High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities that facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to achieve both simultaneously. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, a Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments show that our method not only maintains stable NVS capabilities but also achieves superior performance on traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.
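The phrase "restricts temporal updates to the spatial null space" can be pictured with a generic projected-gradient step: remove from an update every component that lies in a protected subspace. The sketch below does this with Gram-Schmidt orthogonalization. It is not the authors' OPG implementation; how the protected spatial directions are obtained is not stated in the abstract, so `spatial_directions` and the example gradient are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt; drops vectors that are numerically dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(gradient, protected_directions):
    """Return the part of `gradient` orthogonal to the protected span,
    i.e. its projection onto the null space of the matrix whose rows
    are the protected directions."""
    g = list(gradient)
    for b in orthonormal_basis(protected_directions):
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


# Hypothetical directions that spatial parameters depend on.
spatial_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
temporal_gradient = [0.5, -2.0, 3.0]
update = project_out(temporal_gradient, spatial_directions)
print(update)
```

After projection the update has no component along either protected direction, so a step along it leaves anything that depends only on those directions unchanged to first order.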

2605.21029 2026-05-21 cs.CL

Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

Stephen Meisenbacher, Peter Norlander

Affiliations: Technical University of Munich; Loyola University Chicago

AI summary: Uses job postings data to study how to build a clearer taxonomy of AI skills and tasks. TaxonomyBuilder is proposed as a blueprint for the systematic study, which shows that filtering the input data yields more domain-specific coverage.

Comments: 14 pages, 2 figures, 8 tables. Accepted to CustomNLP4U 2026

Abstract

Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

2605.21026 2026-05-21 cs.RO

Component Influence-Driven Fastener Reduction for Robotic Disassemblability-Aware Design Simplification

Takuya Kiyokawa, Tomoki Ishikura, Shingo Hamada, Genichiro Matsuda, Kensuke Harada

Affiliations: Department of Systems Innovation, Graduate School of Engineering Science, The University of Osaka; Manufacturing Innovation Division, Panasonic Holdings Corporation

AI summary: Proposes an analytical framework that simplifies designs for robotic disassembly by reducing fasteners. Using a CAD model and an automatically generated Contact-Connection-Constraint (CCC) graph, it converts robotic disassembly sequence planning outcomes into component influence scores that guide the simplification.

Comments: 7 pages, 8 figures

Abstract

To accelerate automated remanufacturing, robotic disassembly must be considered during the product design phase. However, designers currently lack quantitative feedback to identify which structural elements hinder robotic operations. To address this, this study proposes an analytical framework that provides actionable redesign guidance focused on fastener reduction, as fasteners are numerous and ubiquitous components found in almost all manufactured products. Using a Computer-Aided Design (CAD) model and its automatically generated Contact-Connection-Constraint (CCC) graph, the framework translates robotic disassembly sequence planning outcomes into component influence scores. These scores reflect how often a component causes structural constraint violations or evaluation objective deteriorations in the robotic disassembly sequence. To visually highlight structural hindrances, the framework projects these scores onto the CAD geometry as 3D heatmaps. The system then analytically simulates the removal of highly influential fasteners. It reports the expected reductions in structural constraints, tool changes, and robot travel distances, while preventing structurally unsafe modifications by evaluating geometric stability metrics. Experiments on seven household appliances demonstrate that the framework successfully targets redundant fasteners. Removing the recommended fasteners simplified the structural dependencies by eliminating between 8 and 132 structural constraints on the graph depending on each product's structural configuration. Furthermore, it improved robotic operational efficiency by eliminating unnecessary tool change operations and shortening travel distances by 165 to 1675 millimeters wherever structurally permissible.

2605.21001 2026-05-21 cs.CV

DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars

Daniel Eskandar, Berna Kabadayi, Garvita Tiwari, Gerard Pons-Moll

Affiliations: University of Tübingen; Tübingen AI Center; Max Planck Institute for Intelligent Systems; Max Planck Institute for Informatics; Zuse School ELIZA

AI summary: Proposes DAMA, which uses a dedicated representation and reconstruction method to produce physically plausible clothed avatars with controllable layering, clean garment separation, and explicit stacking control.

Abstract

Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: https://danieleskandar.github.io/dama/
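The body-anchored parameterization in the abstract (barycentric in-plane coordinates plus a positive normal offset) can be read geometrically as follows. This sketch shows one plausible reading for a single triangle and a single point. The function names, the unit-normal convention, and the example triangle are assumptions; the paper may parameterize or constrain the offset differently.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def anchored_position(v0, v1, v2, bary, offset):
    """Point on triangle (v0, v1, v2) at barycentric weights `bary`,
    lifted along the unit face normal by a strictly positive `offset`."""
    if offset <= 0:
        raise ValueError("offset must be positive to stay outside the face")
    u, v, w = bary
    in_plane = [u * a + v * b + w * c for a, b, c in zip(v0, v1, v2)]
    normal = cross(sub(v1, v0), sub(v2, v0))
    length = math.sqrt(sum(x * x for x in normal))
    return [p + offset * x / length for p, x in zip(in_plane, normal)]


# Hypothetical triangle in the z = 0 plane with counter-clockwise winding.
center = anchored_position(
    [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
    bary=(0.25, 0.25, 0.5), offset=0.02,
)
print(center)
```

Because the point is expressed relative to the face, it moves with the body mesh when the mesh deforms, and ordering garments by offset is one way such a representation could support explicit layer stacking.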

2605.20998 2026-05-21 cs.CL cs.AI

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan

Affiliations: Universiti Malaya, Malaysia; Suzhou University of Technology, China; VinUniversity, Vietnam

AI summary: Proposes DABS, a framework that encodes each sentence once into a reusable, depth-ordered substrate, reducing end-to-end computation for multi-aspect sentiment analysis by up to 60% while keeping performance competitive.

Comments: Accepted at ACL 2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

Abstract

Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

2605.20997 2026-05-21 cs.CV cs.AI cs.LG physics.comp-ph

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou

Affiliations: German Aerospace Center (DLR); Institute of Environmental Engineering, ETH Zürich

AI summary: Proposes a hybrid approach that combines machine learning with a physical model and uses TanDEM-X interferometric coherence measurements together with optical Landsat data to estimate forest height. Extending the feature space reduces height/structure and baseline/terrain slope ambiguities; RMSE and MAE fall by 13.5% and 16.6%, respectively.

Abstract

Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, an ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed that constrains the learning process through a PM. While the features used for training and inversion were selected to ensure the physical consistency of the solutions, they could not resolve all height/structure and baseline/terrain slope ambiguities in the data. To address this, the feature space is extended with optical Landsat data, which can provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Lopé National Park site in Gabon and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.

2605.20996 2026-05-21 cs.LG math.OC

Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

Hojin Ko, Jeonggyu Huh

Affiliations: Department of Mathematics, Sungkyunkwan University, Suwon, Republic of Korea

AI summary: Proposes Pontryagin-Guided Direct Policy Optimization (PG-DPO) for non-exponential discounting. The framework abandons Bellman recursion and couples the Pontryagin Maximum Principle with Monte Carlo rollouts, improving accuracy and stability over equation-driven solvers and critic-based baselines.

Abstract

Most value-based and actor--critic reinforcement learning methods rely on Bellman-style recursions, yet these recursions collapse under non-exponential discounting common in human preferences and survival processes. We show the breakdown is structural: exponential discounting sits at a fragile intersection of multiplicativity and time homogeneity, and violating either property breaks standard dynamic programming. To overcome this, we propose Pontryagin-Guided Direct Policy Optimization (PG-DPO), a variational framework that abandons recursion and couples the Pontryagin Maximum Principle with Monte Carlo rollouts via an Adjoint-MC projection enforcing pointwise Hamiltonian maximization. Across multi-dimensional hyperbolic and survival-discount benchmarks, PG-DPO improves accuracy and stability where equation-driven solvers and critic-based baselines diverge.
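The structural claim in the abstract can be checked numerically in a few lines. The sketch below compares exponential discounting, $d(t) = \gamma^t$, with hyperbolic discounting, $d(t) = 1/(1 + kt)$, and shows that a one-step Bellman-style backup reproduces the directly discounted return only in the exponential case. The parameter values and reward sequence are arbitrary, and the code illustrates the breakdown only; it does not implement PG-DPO.

```python
def exponential(t, gamma=0.9):
    return gamma ** t


def hyperbolic(t, k=1.0):
    return 1.0 / (1.0 + k * t)


def is_multiplicative(discount, horizon=10, tol=1e-9):
    """Check d(s + t) == d(s) * d(t) on an integer grid."""
    return all(
        abs(discount(s + t) - discount(s) * discount(t)) <= tol
        for s in range(horizon)
        for t in range(horizon)
    )


def direct_return(rewards, discount):
    """Discounted return computed straight from the definition."""
    return sum(discount(t) * r for t, r in enumerate(rewards))


def recursive_return(rewards, discount):
    """Bellman-style backup that reuses the one-step factor d(1)."""
    one_step = discount(1)
    value = 0.0
    for r in reversed(rewards):
        value = r + one_step * value
    return value


rewards = [1.0, 1.0, 1.0]
exp_gap = abs(direct_return(rewards, exponential) - recursive_return(rewards, exponential))
hyp_gap = abs(direct_return(rewards, hyperbolic) - recursive_return(rewards, hyperbolic))
print(is_multiplicative(exponential), is_multiplicative(hyperbolic))
print(exp_gap, hyp_gap)
```

With hyperbolic discounting the weight on a reward depends on its absolute delay rather than on a product of one-step factors, so the backward recursion no longer matches the true objective.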

2605.20994 2026-05-21 cs.CL cs.AI

Towards Context-Invariant Safety Alignment for Large Language Models

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

Affiliations: Fudan University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China

AI summary: Proposes a context-invariant safety alignment method, Anchor Invariance Regularization (AIR), which improves robustness across contexts and makes safety constraints more resistant to adversarial framings.

Comments: ICML 2026

Abstract

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
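The asymmetry between a symmetric invariance penalty and a stop-gradient anchor can be seen in a scalar toy model. Below, `anchor` and `open_ended` stand for performance on verifiable and open-ended prompt variants, and the penalty is the squared gap between them. This is a hand-derived illustration, not the AIR loss from the paper: the squared-gap form, the learning rate, and the starting scores are all assumptions.

```python
def regularizer_gradients(anchor, open_ended, stop_gradient):
    """Gradients of (anchor - open_ended)^2 with respect to each score.

    With stop_gradient=True the anchor is treated as a constant target,
    so no gradient flows into it.
    """
    gap = anchor - open_ended
    grad_open = -2.0 * gap
    grad_anchor = 0.0 if stop_gradient else 2.0 * gap
    return grad_anchor, grad_open


def train(anchor, open_ended, stop_gradient, lr=0.25, steps=20):
    """Plain gradient descent on the penalty alone."""
    for _ in range(steps):
        g_a, g_o = regularizer_gradients(anchor, open_ended, stop_gradient)
        anchor -= lr * g_a
        open_ended -= lr * g_o
    return anchor, open_ended


symmetric = train(0.9, 0.5, stop_gradient=False)
anchored = train(0.9, 0.5, stop_gradient=True)
print(symmetric, anchored)
```

The symmetric penalty closes the gap by meeting in the middle, which lowers the reliable score, whereas the anchored version leaves the reliable score untouched and pulls only the open-ended score upward. This mirrors the failure mode and the remedy described in the abstract.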

2605.20989 2026-05-21 cs.LG q-bio.GN

Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport

用潜在高斯过程和最优传输建模时间序列scRNA-seq数据

Mehmet Yigit Balik, Harri Lähdesmäki

发表机构 * Department of Computer Science, Aalto University, Espoo, Finland(阿尔托大学计算机科学系,芬兰埃斯波)

AI总结 本文提出了一种生成框架,利用潜在异方差高斯过程建模种群趋势,并通过最优传输对齐生成和观测的种群分布,以捕捉生物异质性,从而在复杂插值和外推基准上实现最先进的性能。

详情
AI中文摘要

单细胞RNA测序提供了单细胞分辨率的基因表达见解,但从这些静态快照测量中推断时间过程仍然是一个根本性挑战。当前利用神经微分方程和流的方法容易过拟合且缺乏对生物变异性的仔细考虑。在本文中,我们提出了一种生成框架,利用希尔伯特空间方法近似潜在异方差高斯过程(GP)来建模种群趋势。为解决真实细胞轨迹的缺失问题,我们利用最优传输(OT)目标对齐生成和观测的种群分布。我们的方法通过引入细胞特异性潜在时间和细胞类型条件来捕捉生物异质性,从而解构时间异步性和不同细胞类型的轨迹。我们展示了在复杂插值和外推基准上的最先进性能,并引入了一种新的基于梯度的策略来推断扰动轨迹。

英文摘要

Single-cell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are prone to overfitting and lack careful consideration of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.
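
The idea of aligning generated and observed populations without cell-to-cell correspondence can be illustrated in one dimension, where optimal transport between two equal-size samples reduces to matching sorted values. This is a toy sketch: the paper's objective presumably operates on high-dimensional expression data, and the function name `wasserstein_1d` is hypothetical.

```python
def wasserstein_1d(generated, observed):
    """Wasserstein-1 distance between two equal-size 1D samples.

    For uniform weights in 1D, the optimal coupling pairs the i-th smallest
    generated value with the i-th smallest observed value.
    """
    if not generated or len(generated) != len(observed):
        raise ValueError("samples must be non-empty and of equal size")
    g_sorted = sorted(generated)
    o_sorted = sorted(observed)
    return sum(abs(g - o) for g, o in zip(g_sorted, o_sorted)) / len(g_sorted)


# The distance depends only on the two populations, not on which cell is which.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [2.0, 3.0, 1.0])
same = wasserstein_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])
```

Because the cost is computed between distributions rather than between tracked cells, it fits snapshot data in which each cell is measured only once.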

2605.20978 2026-05-21 cs.LG

Point Cloud Sequence Encoding for Material-conditioned Graph Network Simulators

用于材料条件化图网络模拟器的点云序列编码

Philipp Dahlinger, Balázs Gyenes, Niklas Freymuth, Luca Geminiani, Tobias Würth, Johannes Mitsch, Nadja Klein, Luise Kärger, Gerhard Neumann

发表机构 * Autonomous Learning Robots(自主学习机器人) Methods for Big Data(大数据方法) Institute of Vehicle System Technology(车辆系统技术研究所)

AI总结 本文提出PEACH框架,通过点云序列编码实现对未知物理属性的适应,提高了模拟到现实的零样本转移精度,并在实际部署中更具实用性。

Comments 9 pages + appendix, 7 figures. Submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026)

详情
AI中文摘要

图网络模拟器(GNSs)已作为复杂物理模拟的强大替代方案,提供内在可微性和比传统求解器快多个数量级的速度提升。然而,GNSs通常假设可以访问底层材料参数,如刚度或粘度,这严重限制了其在现实实验中的实用性。尽管最近的元学习方法通过从网格轨迹推断属性来解决参数依赖性,但从观察场景中重建网格具有挑战性。在本文中,我们介绍了Point Cloud Encoding for Accurate Context Handling(PEACH),一种新的框架,通过上下文学习在点云上适应学习的模拟器以适应未见过的物理属性。我们的方法依赖于一种新颖的时空点云序列编码器,以及两种形式的辅助监督来帮助提高模拟保真度。我们证明PEACH能够在具有挑战性的动态场景中实现准确的零样本模拟到现实转移。在模拟场景上的实验表明,PEACH在预测精度上甚至优于基于网格的基线,同时在实际部署中更加实用。

英文摘要

Graph Network Simulators (GNSs) have emerged as powerful surrogates for complex physics-based simulation, offering inherent differentiability and orders-of-magnitude speedups over traditional solvers. However, GNSs typically assume access to the underlying material parameters, such as stiffness or viscosity, severely limiting their utility in realistic experimental settings. While recent meta-learning approaches address the parameter dependency by inferring properties from mesh trajectories, reconstructing a mesh from an observed scene is challenging. In this work, we introduce Point Cloud Encoding for Accurate Context Handling (PEACH), a novel framework that applies in-context learning on point clouds to adapt a learned simulator to unseen physical properties during inference. Our approach relies on a novel spatio-temporal point cloud sequence encoder, as well as two forms of auxiliary supervision to help improve simulation fidelity. We demonstrate that PEACH is capable of accurate zero-shot sim-to-real transfer on a challenging, dynamic scene. Experiments on simulation scenes show that PEACH even outperforms mesh-based baselines on prediction accuracy, while being much more practical for real-world deployment.

2605.20973 2026-05-21 cs.CV

Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

迈向地下矿山3D点云中的集成岩体支护可视化

Dibyayan Patra, Simit Raval, Pasindu Ranasinghe, Bikram Banerjee, Ismet Canbulat

发表机构 * School of Minerals and Energy Resources Engineering, University of New South Wales(新南威尔士大学矿产与能源资源工程学院) School of Surveying and Built Environment, University of Southern Queensland(南昆士兰大学测绘与建筑环境学院)

AI总结 本文提出了一种自动化框架,利用地下矿山开挖空间的3D点云实现集成的岩体支护可视化。该框架将结构测绘、锚杆识别、结构面平面拟合和锚杆方向估计整合为统一工作流,生成拟合结构面与锚杆向量的集成3D可视化,用于评估二者的空间相交和几何关系,并通过互补的赤平投影分析评估锚杆布置的整体几何有效性。

详情
AI中文摘要

地下矿山岩体支护的有效性取决于已安装锚杆与围岩结构特征之间的相互作用。然而,结构面表征与锚杆识别通常被视为相互独立的任务,限制了它们在综合支护评估中的价值。本研究提出了一种自动化框架,利用地下矿山开挖空间的3D点云实现集成的岩体支护可视化。该框架将结构测绘、锚杆识别、结构面平面拟合和锚杆方向估计整合到一个兼顾准确性与计算效率的统一工作流中。其输出用于生成拟合结构面与锚杆向量的集成3D可视化,从而可以直接评估二者的空间相交与几何关系。此外,还对结构面极点和锚杆方向进行了互补的赤平投影分析,以评估锚杆布置相对于所测绘结构特征的整体几何有效性。同时,锚杆级别的质量指标(包括外露长度和相对局部顶板法线的偏离程度)也被可视化,以支持安装质量评估。所提出的框架在真实的地下金属矿扫描数据上进行了演示,在中等规模点云中取得了准确的结构测绘和锚杆识别结果。总体而言,本研究朝着无需人工测量或额外现场数据采集的自动化、集成化岩体支护有效性岩土工程评估迈出了实用的一步。

英文摘要

The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.
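
One of the bolt-level metrics mentioned above, deviation from the local roof normal, comes down to the angle between two 3D vectors. The sketch below assumes both vectors are already estimated and point the same way along the bolt axis; the function name `deviation_angle_deg` is hypothetical, and the paper's own estimation procedure is not shown in the abstract.

```python
import math


def deviation_angle_deg(bolt_vec, roof_normal):
    """Angle in degrees between a bolt direction and the local roof normal."""
    dot = sum(b * n for b, n in zip(bolt_vec, roof_normal))
    norm_b = math.sqrt(sum(b * b for b in bolt_vec))
    norm_n = math.sqrt(sum(n * n for n in roof_normal))
    if norm_b == 0.0 or norm_n == 0.0:
        raise ValueError("vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_b * norm_n)))
    return math.degrees(math.acos(cos_angle))


# A bolt tilted equally in x and z relative to a vertical roof normal.
tilt = deviation_angle_deg((1.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```

If bolt directions are unsigned axes rather than oriented vectors, taking the absolute value of the dot product would fold the result into the 0 to 90 degree range.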

2605.20971 2026-05-21 cs.CV cs.AI cs.CR

Comparative Evaluation of Deep Learning Models for Fake Image Detection

深度学习模型在虚假图像检测中的比较评估

Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed

发表机构 * University of East London(东伦敦大学)

AI总结 本研究采用统一的预处理和训练流程,比较了四种预训练CNN架构在虚假图像检测中的性能。VGG16的准确率最高;EfficientNetB0对虚假图像更敏感,但在真实图像上的可靠性较低。研究指出,开发可靠的虚假图像检测系统需要平衡的数据集、更先进的数据增强和公平性感知训练。

Comments Accepted at ICCIIoT26 and waiting to be indexed

Journal ref 6th International Conference on Computational Intelligence & Internet of Things (ICCIIoT), 2026

详情
AI中文摘要

随着基于GAN的图像篡改技术日益复杂,数字取证面临重大挑战。本研究采用统一的预处理和训练流程,比较了四种预训练CNN架构(VGG16、ResNet50、EfficientNetB0和XceptionNet)在虚假图像检测中的性能。由真实图像和篡改图像组成的数据集经过尺寸调整、归一化和数据增强处理,以缓解类别不平衡并提高泛化能力。模型评估采用准确率、精确率、召回率、F1分数和ROC-AUC。VGG16的准确率最高,为91%,XceptionNet、ResNet50和EfficientNetB0均达到90%。EfficientNetB0对虚假图像更敏感,但在真实样本上的可靠性较低,反映出由类别不平衡导致的偏差。局限性包括数据集不平衡、过拟合和可解释性有限,这些因素影响了跨域鲁棒性。本研究提供了一个可复现的基线,并强调需要平衡的数据集、更先进的数据增强和公平性感知训练,才能开发出可靠的虚假图像检测系统。

英文摘要

The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.
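
The point about imbalance-driven bias is easier to see when the threshold-based metrics are computed from a confusion matrix. The sketch below treats "fake" as the positive class; the counts are invented purely for illustration and are not results from the paper, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,            # sensitivity to the positive (fake) class
        "specificity": specificity,  # reliability on the negative (real) class
        "f1": f1,
    }


# Made-up counts: 90 fake images, 10 real images (imbalanced).
m = binary_metrics(tp=80, fp=6, fn=10, tn=4)
```

With these invented counts, accuracy is 0.84 and F1 is about 0.91, yet only 4 of the 10 real images are classified correctly, which is why accuracy alone can hide weak performance on the minority class.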

2605.20967 2026-05-21 cs.CL

ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization

ArPoMeme:一个标注的阿拉伯多模态数据集用于政治意识形态和极化

Wajdi Zaghouani, Kais Attia, Md. Rafiul Biswas, Fadhl Eryani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学) Independent Researcher(独立研究员) Hamad Bin Khalifa University(哈马德·本·哈利法大学) University of Tübingen(图宾根大学)

AI总结 本文提出ArPoMeme数据集,用于分析阿拉伯语政治迷因(meme)的多模态和意识形态维度,借助自定义工具完成大规模标注,并揭示不同意识形态群体的极化特征。

Comments Accepted at LREC 2026 Main Conference

详情
AI中文摘要

迷因(meme)已成为阿拉伯世界政治传播的重要媒介,反映了幽默、图像和文本如何相互作用以表达意识形态和文化立场。尽管迷因在网络政治话语中占据核心地位,但目前缺乏系统整理的资源来分析阿拉伯语迷因的多模态和意识形态维度。本文提出了ArPoMeme,一个包含约7,300个阿拉伯语政治迷因的大规模数据集,按意识形态取向分类,包括左翼、伊斯兰主义、泛阿拉伯主义和讽刺视角。该数据集以制作和传播这些迷因的公开Facebook页面和群组的自我定位作为分类依据,从而刻画阿拉伯语迷因生态的多样性。为兼顾规模和准确性,我们设计了一个半自动化数据收集流程,结合基于Playwright的Facebook抓取和Google Drive同步,随后使用Qwen2.5-VL-7B视觉语言模型进行文本提取。提取的文本经过人工核验,并针对三个极化维度进行标注:“我们与他们”框架、对外群体的敌意和行动号召。标注通过基于Streamlit的自定义界面进行,支持分布式标注、实时跟踪和版本控制。最终的数据集将视觉内容、文本信息和意识形态取向关联起来,使对政治对抗、动员和幽默的细粒度分析成为可能。对标注语料的定量分析显示,不同意识形态群体在对抗性框架上存在明显的不对称,其中伊斯兰主义和讽刺类迷因表现出最高水平的敌意和动员信号。该数据集和标注工具为研究阿拉伯语政治话语、多模态意识形态检测和极化动态提供了可复现且公开可用的资源。

英文摘要

Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offer a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.

2605.20965 2026-05-21 cs.CV cs.AI

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

在不遗忘的情况下寻找正确的视觉证据:通过层间视觉注意力差异减轻LVLMs中的幻觉

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China(东南大学计算机科学与工程学院) School of Artificial Intelligence, Shenzhen University, Shenzhen, China(深圳大学人工智能学院) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China(深圳大学计算机科学与软件工程学院) Engineering, South China University of Technology, Guangzhou, China(华南理工大学工程学院) National Engineering Laboratory for Big Data Systems Computing Technology, Shenzhen University, Shenzhen, China(深圳大学大数据系统计算技术国家工程实验室) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用重点实验室(东南大学),中华人民共和国教育部)

AI总结 本文提出了一种基于层间视觉注意力差异的幻觉缓解方法,通过增强视觉证据的注意力来减少视觉遗忘,从而在不遗忘的情况下找到正确的视觉证据。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)在广泛的视觉-语言任务上表现出色。尽管有进展,它们仍然容易产生幻觉,生成与视觉内容不一致的响应。在本工作中,我们发现LVLMs在对正确的视觉证据关注不足时容易产生幻觉,并在生成过程中逐渐遗忘它。我们实证发现,尽管LVLMs整体对视觉证据关注不足,但在特定层中表现出对正确视觉证据的敏感性,存在显著的层间差异。受此观察启发,我们提出了一种新的幻觉缓解方法,通过层间视觉注意力差异(ILVAD)增强视觉证据。具体来说,我们从早期生成的token到视觉token在各层中获取注意力权重,并识别被反复激活作为视觉证据的token,形成显著性图。然后通过显著性图在生成过程中增强对视觉证据的注意力,以减少视觉遗忘。此外,我们利用显著性图获得生成文本对视觉证据的注意力分数,以选择并强调强烈基于视觉证据的文本token。我们的方法是无训练的,即插即用。在五个最近发布的模型上进行的多个基准评估表明,我们的方法可以在不同架构的LVLMs上一致地缓解幻觉。代码可在https://github.com/ytx-ML/ILVAD上获得。

英文摘要

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
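
The two steps described above, building a saliency map from visual tokens that are repeatedly activated across layers and then boosting attention to them, can be sketched on plain lists. This is a simplified stand-in rather than ILVAD itself: the top-k voting rule, the multiplicative boost, and the names `saliency_from_layers` and `boost_attention` are all assumptions made for illustration.

```python
def saliency_from_layers(attn_by_layer, top_k, min_votes):
    """Mark visual tokens that rank in the top-k of at least `min_votes` layers.

    attn_by_layer: one list of attention weights over visual tokens per layer.
    Returns a 0/1 saliency map over visual tokens.
    """
    n_tokens = len(attn_by_layer[0])
    votes = [0] * n_tokens
    for layer in attn_by_layer:
        ranked = sorted(range(n_tokens), key=lambda i: layer[i], reverse=True)
        for idx in ranked[:top_k]:
            votes[idx] += 1
    return [1 if v >= min_votes else 0 for v in votes]


def boost_attention(weights, saliency, alpha):
    """Scale attention on salient tokens by (1 + alpha), then renormalize."""
    boosted = [w * (1.0 + alpha * s) for w, s in zip(weights, saliency)]
    total = sum(boosted)
    return [b / total for b in boosted]


layers = [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]]
saliency = saliency_from_layers(layers, top_k=1, min_votes=2)
reweighted = boost_attention([0.25, 0.25, 0.5], saliency, alpha=1.0)
```

In this toy input the second visual token is the top choice in two of three layers, so it alone is marked salient and receives a larger share of attention after renormalization.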

2605.20963 2026-05-21 cs.CV

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

面向现实世界的无人机检测:一个新的多光谱数据集UAVNet-MS和一个新方法

Yihang Luo, Jun Chen, Chao Xiao, Yingqian Wang, Zhaoxu Li, Qiang Ling, Xu He, Nuo Chen, Gaowei Guo, Hongge Li, Miao Li, Longguang Wang, Yulan Guo, Li Liu, Wei An, Zhijie Chen

发表机构 * College of Electronic Science and Technology, National University of Defense Technology(电子科学与技术学院,国防科技大学) Aviation University of Air Force(空军航空大学) Sun Yat-sen University(中山大学)

AI总结 本文提出了一种新的多光谱数据集UAVNet-MS和一种新的方法MFDNet,用于细粒度小无人机的检测,解决了传统RGB系统在小尺度下的性能问题。

Comments submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

详情
AI中文摘要

无人机(UAV)的普及催生了对精确无人机监测的迫切需求。现有的基于RGB的系统依赖空间线索,而这些线索在小尺度下会退化,尤其是在机型间相似度高、目标与杂波难以区分以及对比度低的情况下。多光谱成像(MSI)编码了与材料相关的光谱特征,但由于缺乏专用数据集,基于MSI的细粒度小型无人机检测仍未得到充分研究。我们提出了UAVNet-MS,这是首个用于细粒度小型无人机检测的多光谱数据集,包含15,618个时间同步的RGB-MSI数据立方体(1440x1080),并带有边界框标注。该数据集包含低对比度下具有挑战性的小目标(93.7%的目标不超过32^2像素,平均18^2像素,约占图像面积的0.02%)。我们提出MFDNet,一种双流基线方法,用于处理阵列引起的视差和空间-光谱融合。在RGB-only、MSI-only和RGB+MSI三种协议下与20种检测器的广泛对比表明,MFDNet相比最佳RGB-only方法取得了+6.2%的AP50提升,说明光谱线索能够提供空间线索之外的互补材料证据。本文为多光谱无人机监测研究提供了基础数据集、强基线和基准。

英文摘要

The proliferation of unmanned aerial vehicles (UAVs) has created an urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to a lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows that MFDNet achieves a +6.2% AP50 improvement over the best RGB-only methods, demonstrating that spectral cues provide complementary material evidence beyond spatial cues. This work provides a foundational dataset, a strong baseline, and a benchmark for multispectral UAV monitoring research.
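
The object-size figures quoted above can be checked with a line of arithmetic: a square object of side s pixels occupies s^2 / (1440 x 1080) of the frame. The helper name `area_fraction_percent` is hypothetical and used only for this check.

```python
def area_fraction_percent(side_px, width_px, height_px):
    """Percentage of the image area covered by a square of side `side_px`."""
    return 100.0 * side_px ** 2 / (width_px * height_px)


# Average object (18^2 pixels) and the small-object bound (32^2 pixels)
# on a 1440x1080 frame.
average_pct = area_fraction_percent(18, 1440, 1080)
bound_pct = area_fraction_percent(32, 1440, 1080)
```

The average case works out to roughly 0.02% of the image, consistent with the figure in the abstract, and even the 32^2-pixel bound stays below 0.07%.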

2605.20961 2026-05-21 cs.CV

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

保留、揭示、扩展:基于区域感知条件的忠实4D视频编辑

Zhangchi Hu, Wenzhang Sun, Xiangchen Yin, Jiahui Yuan, Chunfeng Wang, Hao Li, Kun Zhan, Xiaoyan Sun

发表机构 * University of Science and Technology of China(中国科学技术大学) Li Auto Inc.(理想汽车)

AI总结 本文提出PREX框架,通过区域感知分解目标时空体积,解决4D视频编辑中区域保持、揭示和扩展的问题,提升了视频编辑的准确性和稳定性。

Comments 23 pages, 13 figures

详情
AI中文摘要

现有的4D驱动视频扩散模型主要针对合理生成,但忠实的4D编辑需要在合成遮挡或视外内容时保留源观测区域。我们识别出证据角色不匹配问题:可靠的源支持证据、不可靠的渲染提示和不支持的区域在单一条件信号中交织,导致保留漂移、鬼影和不稳定的外推。我们提出PREX(保留、揭示、扩展),一个区域感知框架,根据观测支持和场景范围将目标时空体积分解为保留、揭示和扩展角色。PREX通过校准置信度构建观测支持的外观提示,并通过区域感知适配器注入到冻结的视频扩散骨干网络中,通过代理任务训练而无需配对编辑视频。我们进一步引入PREBench,一个诊断基准,包含精心编辑、区域角色掩码和人类对齐的指标,补充了全局视频质量和4D控制评估。实验表明,PREX在减少区域结构失败的同时,保持了强大的视觉质量和4D编辑控制能力。项目页面:https://ricepastem.github.io/PREX-Open

英文摘要

Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open

2605.20960 2026-05-21 cs.CL

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

JobArabi: 一个阿拉伯语语料库及来自社交媒体的招聘公告分析

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

发表机构 * Northwestern University in Qatar(卡塔尔西北大学) Carnegie Mellon University in Qatar(卡塔尔卡内基梅隆大学)

AI总结 本文介绍了JobArabi,一个从2024年1月至2025年10月期间收集的阿拉伯语招聘公告语料库,包含20,528条来自X平台的公开帖子,旨在分析阿拉伯语在线社区中的就业相关话语,揭示社交媒体在劳动力市场沟通和语言变化研究中的潜力。

Comments Accepted at LREC 2026 Main Conference

详情
AI中文摘要

本文介绍了JobArabi,一个大规模的阿拉伯语招聘公告语料库,该语料库从2024年1月至2025年10月期间收集自社交媒体。该数据集包含20,528条来自X平台的公开帖子,涵盖了阿拉伯语在线社区中超过两年的就业相关话语。该语料库使用了一个基于语言学的查询框架,覆盖了21个阿拉伯语关键词家族,这些关键词反映了招聘语言中性别化、复数、正式和方言化的表达。所得到的数据集包括来自机构、商业和个体账号的帖子,并提供了时间戳、参与指标和地理位置等元数据(如可用)。这使得能够对就业话语进行时间与区域分析。定量分析揭示了在线招聘中的若干社会语言学模式,包括性别化招聘语言的持续存在、地区职业需求的差异以及招聘信息的情感框架。这些发现突显了阿拉伯语社交媒体作为研究劳动力市场沟通和语言变化资源的潜力。JobArabi语料库,连同其文档和收集脚本,将被发布以支持阿拉伯语自然语言处理、计算社会科学和数字劳动研究领域的研究。

英文摘要

This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.

2605.20956 2026-05-21 cs.LG cs.CY

A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift

患病率偏移下共形分诊放行侧风险的部署审计

Chengze Li, Xiao Liu, Hanrong Zhang, Haiyang Peng, Yanghao Ruan, Huanhuan Ma, Chunyu Miao, Qichao Zhou, Xiangrong Qi, Philip Yu

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) Manteia Technologies Co., Ltd(Manteia技术有限公司) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出了一种面向放行侧共形分诊的泄漏感知部署审计方法,将目标受试者划分为患病率校正、共形校准和留出的放行安全性评估三个互不重叠的角色,从而直接评估在患病率偏移下,真正发生目标事件的患者是否在未经审查的情况下被放行。

Comments 18 pages, 4 figures, 5 tables

详情
AI中文摘要

共形分诊将预测分数转换为部署动作:放行病例、标记为需紧急关注,或转交人工审查。然而,在患病率偏移下,常用的边际覆盖率和人工审查率等汇总指标可能无法回答一个安全攸关的问题:真正发生目标事件的患者是否在未经审查的情况下被放行。为弥补这一空白,我们提出了一种面向放行侧共形分诊的泄漏感知部署审计。它首先将目标受试者分配到三个互不重叠的角色:患病率校正、共形校准和留出的放行安全性评估。这种划分使审计能够直接评估放行环节:有多少事件阳性患者未经审查即被放行,试点是否有足够的事件标签用于校准,以及安全性与审查量之间的权衡如何变化。将该审计应用于一项回顾性NSCLC试点,说明了较低的审查率为何可能具有误导性:经过患病率校正后,合并(pooled)共形分支通过放行更多患者来降低审查率,而其中一些患者是事件阳性的。在审计中,按类别(classwise)分支起到稀缺性诊断的作用:该试点的事件标签太少,不足以认证安全的低审查放行。

英文摘要

Conformal triage converts predictive scores into deployment actions that either release a case, flag it for urgent attention, or defer it to human review. Under prevalence shift, however, the usual summaries of marginal coverage and human-review rate can miss the safety-critical question of whether patients who truly experience the target event are released without review. To address this gap, we introduce a leakage-aware deployment audit for release-side conformal triage. It first assigns target subjects to three non-overlapping roles: prevalence correction, conformal calibration, and held-out release-safety evaluation. This separation then lets the audit evaluate release directly: how many event-positive patients are cleared without review, whether the pilot has enough event labels for calibration, and how the safety-review trade-off shifts. Applying this audit to a retrospective NSCLC pilot shows why lower review can be misleading: after prevalence correction, the pooled conformal branch lowers review by releasing more patients, some of whom are event-positive. Within the audit, the classwise branch acts as a scarcity diagnostic: the pilot has too few event labels to certify safe low-review release.
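
The release-side question, how many event-positive patients are cleared without review, can be made concrete with a small audit over held-out cases. The sketch below uses fixed score thresholds for the three actions; in a conformal procedure those cut-offs would come from calibration data, so the thresholds, the definition of review rate as the deferred fraction, and the names `triage` and `release_audit` are all assumptions.

```python
def triage(score, release_below, flag_above):
    """Map a risk score to one of three deployment actions."""
    if score < release_below:
        return "release"
    if score > flag_above:
        return "flag"
    return "defer"


def release_audit(scores, events, release_below, flag_above):
    """Summarize release-side risk on held-out cases (events: 1 = event-positive)."""
    actions = [triage(s, release_below, flag_above) for s in scores]
    released_events = [e for a, e in zip(actions, events) if a == "release"]
    n_positive = sum(events)
    missed = sum(released_events)
    return {
        "review_rate": actions.count("defer") / len(actions),
        "released": len(released_events),
        "released_event_positive": missed,
        "miss_rate_among_positives": missed / n_positive if n_positive else None,
    }


report = release_audit(
    scores=[0.05, 0.10, 0.40, 0.60, 0.95, 0.08],
    events=[0, 0, 1, 0, 1, 1],
    release_below=0.2,
    flag_above=0.9,
)
```

In this made-up example only a third of cases go to human review, yet one of the three event-positive cases is released without review, which is exactly the kind of outcome a review-rate summary alone would not reveal.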

2605.20955 2026-05-21 cs.CV

DrawMotion: Generating 3D Human Motions by Freehand Drawing

DrawMotion: 通过自由手绘生成3D人体动作

Tao Wang, Lei Jin, Zhihua Wu, Qiaozhi He, Jiaming Chu, Yu Cheng, Junliang Xing, Jian Zhao, Shuicheng Yan, Li Wang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of Science and Technology of China(中国科学技术大学) NLP Lab, School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院自然语言处理实验室) National University of Singapore(新加坡国立大学) Tsinghua University(清华大学) The Institute of AI (TeleAI), China Telecom(中国电信人工智能研究院) Northwestern Polytechnical University(西北工业大学)

AI总结 本研究提出DrawMotion,一种基于扩散模型的框架,通过自由手绘和文本条件生成3D人体动作,减少用户输入时间,提升生成精度。

详情
AI中文摘要

文本到动作生成将文本描述转换为人体动作,但用户往往难以仅凭文本准确表达想要的动作。为了解决这一问题,本文提出DrawMotion,一种面向多条件场景的高效扩散框架。DrawMotion同时基于传统的文本条件和新的手绘条件生成动作,二者分别提供对生成动作的语义控制和空间控制。具体而言,我们从三个方面处理细粒度动作生成任务:1) 自由手绘条件。为了在无需繁琐文本输入的情况下准确捕捉用户想要的动作,我们开发了一种算法,可在不同数据集格式上自动生成手绘火柴人草图;2) 多条件融合。我们提出了集成到扩散过程中的多条件模块(MCM),使模型能够利用所有可能的条件组合,同时与传统方法相比降低计算复杂度;3) 免训练引导。值得注意的是,DrawMotion中的MCM保证其中间特征位于连续空间中,使分类器引导的梯度可以更新这些特征,从而在保持保真度的同时使生成的动作与用户意图对齐。定量实验和用户研究表明,在生成符合用户想象的动作时,自由手绘方法可将用户耗时减少约46.7%。代码、演示和相关数据已在https://github.com/InvertedForest/DrawMotion公开。

英文摘要

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

2605.20948 2026-05-21 cs.CL

Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

Memory Grafting: 通过离线条件记忆实现语言模型预训练的扩展

Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出Memory Grafting方法,通过利用冻结的隐藏状态作为条件n-gram记忆,实现语言模型预训练的扩展,通过离线处理和高效检索机制提升模型容量,实验表明其在不同规模下均优于MoE和Vanilla Engram基线。

Comments 25 pages, 12 figures, 5 tables

详情
AI中文摘要

扩大条件记忆为提高语言模型容量提供了一种有前途的方法,但现有方法如Engram在预训练过程中从头学习大型记忆表,使记忆扩展成本高昂且有时效果不佳。我们提出Memory Grafting,一种利用冻结的隐藏状态作为条件n-gram记忆的条件记忆扩展方法。鉴于频繁的局部n-gram,我们离线运行grafting模型,将最终token的隐藏表示存储为记忆值,并让接收模型通过精确的最长匹配后缀查找来检索它们。检索到的记忆通过轻量级投影和门控进行适应,同时基于哈希的Engram回退机制保留了未匹配上下文的覆盖范围。由于grafting模型仅在离线运行,且精确查找的复杂度相对于内存库大小具有预期O(1)的复杂度,Memory Grafting在有限的训练和推理开销下扩展了外部潜在容量。在匹配的接收架构和预训练预算下进行的实验表明,Memory Grafting在MoE和Vanilla Engram基线之上有所改进。在2.8B规模下,其平均基准分数从MoE的51.95和Vanilla Engram的52.43提升到53.86。在0.92B规模下,所有grafting模型变体均优于基线,其中Qwen3.5-35B-A3B表现最佳。这些结果表明,预训练模型可以作为外部潜在记忆的可重用构造器,为未来语言模型超越仅可训练参数的扩展提供了实用步骤。

英文摘要

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
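
The exact longest-match suffix lookup described above can be sketched with a hash table keyed by n-gram tuples: the lookup probes the longest suffix first and shortens it until a stored n-gram is found. This is a minimal sketch; the string placeholders standing in for stored hidden states and the name `longest_suffix_lookup` are assumptions, and the projection, gating, and hash-based fallback are not modeled.

```python
def longest_suffix_lookup(context, memory, max_n):
    """Return (ngram, value) for the longest context suffix stored in `memory`.

    memory: dict mapping n-gram tuples to precomputed values.
    Each probe is a hash lookup, so the cost depends on max_n,
    not on the number of stored n-grams.
    """
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in memory:
            return key, memory[key]
    return None, None  # no match: a fallback path would handle this context


# Placeholder values stand in for hidden states computed offline.
memory = {
    ("york",): "h_york",
    ("new", "york"): "h_new_york",
    ("in", "new", "york"): "h_in_new_york",
}
hit = longest_suffix_lookup(["lives", "in", "new", "york"], memory, max_n=3)
short_hit = longest_suffix_lookup(["old", "york"], memory, max_n=3)
miss = longest_suffix_lookup(["paris"], memory, max_n=3)
```

Because every probe is a dictionary lookup, enlarging the memory bank adds storage but not per-token lookup time in expectation, which matches the O(1) claim with respect to memory-bank size.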

2605.20946 2026-05-21 cs.CL

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

边思考边说话:一种用于实时语音生成的受控交错推理方法

Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

发表机构 * Huawei Technologies(华为技术)

AI总结 本文提出了一种受控交错推理方法InterRS,用于实时语音生成,通过在自然语音生成过程中插入推理步骤,提高了语音流畅性和推理深度,实验表明其在数学和逻辑基准测试中表现更优,并生成更自然流畅的答案。

详情
AI中文摘要

“边思考边说话”范式旨在让AI的交流更接近人类。其关键挑战在于在进行深度推理的同时保持语音流畅。我们的方法InterRS通过仅在自然语音生成过程中插入推理步骤来解决这一问题。这需要推理与语音精确对齐、且二者长度比例受控的高质量数据。我们提出了一种新的流程来生成这种无缝交错的音频数据。为训练模型,我们将基于精炼数据的交错SFT与强化学习相结合,并使用两种新的奖励:用于管理时机和思考-回答比例的TA-Balance Reward,以及用于优化表达的Linguistic Quality Reward。实验表明,我们的方法在数学和逻辑基准上的性能提升了13%,同时能像输出快速CoT回复的口语指令模型一样即时响应。此外,与先前方法相比,我们的方法生成的回答更加自然流畅。

英文摘要

The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and their length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while generating instant responses, like a spoken-language instruct model that outputs a fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

2605.20942 2026-05-21 cs.CV

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

连接结构与语言:基于图的视觉推理用于自动驾驶道路理解

Lena Wild, Katie Z Luo, Marco Pavone

发表机构 * KTH Royal Institute of Technology(皇家理工学院) TRATON Stanford University(斯坦福大学) NVIDIA(英伟达)

AI总结 本文提出组合道路基底(CRS)框架,使几何道路结构与开放词汇语义能够在单一的图表示中联合执行,以解决自动驾驶道路理解中结构精度与语义灵活性难以兼顾的问题。

详情
AI中文摘要

对车道几何、拓扑和交通元素关系的结构化道路理解是安全自动驾驶的基础。尽管视觉-语言模型(VLMs)提供了可观的语义灵活性,但它们缺乏精确道路推理所需的几何和关系基础。相反,HD地图和拓扑道路图等传统模块化系统具有结构精度,但在语义上较为僵化。为弥合这一差距,我们提出了组合道路基底(Combined Road Substrate,CRS),一种以图为基础的框架,使几何道路结构和开放词汇语义能够在单一表示中联合执行。CRS能够通过递归图查询自动生成组合复杂、语言多样的问答对,并辅以一种“免费定位”(grounding for free)机制,确保逻辑可追溯到具体的地图元素,同时提供程序化提取的思维链监督轨迹。我们证明,最先进的VLMs(包括大型闭源模型)在结构化道路推理上表现明显不佳,而仅用20到80个CRS增强场景训练一个20亿或40亿参数的小模型,即可在不同深度的组合推理任务上获得稳定提升。借助可验证的推理轨迹对模型行为进行分析,发现失败模式发生了系统性转变:基线模型在关系型场景理解上失败,而经CRS训练的模型的失败则主要集中于属性识别,这表明道路理解的主要瓶颈不是模型规模,而是缺乏结构化监督。

英文摘要

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.
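
Generating a compositional question by recursive graph query, with a trace back to specific map elements, can be sketched on a tiny hand-written road graph. Everything here is hypothetical: the relation names, the node identifiers, and the functions `resolve` and `make_qa` illustrate the general idea and are not the CRS schema or API.

```python
def resolve(graph, node, relations):
    """Follow a chain of relations recursively; return (answer, trace of nodes)."""
    if not relations:
        return node, [node]
    next_node = graph.get(node, {}).get(relations[0])
    if next_node is None:
        return None, [node]  # the chain cannot be completed from this node
    answer, trace = resolve(graph, next_node, relations[1:])
    return answer, [node] + trace


def make_qa(graph, start, relations):
    """Build a question-answer pair whose trace grounds each reasoning hop."""
    answer, trace = resolve(graph, start, relations)
    phrase = " of the ".join(reversed(relations))
    question = f"What is the {phrase} of {start}?"
    return {"question": question, "answer": answer, "trace": trace}


road_graph = {
    "lane_1": {"left_neighbor": "lane_2"},
    "lane_2": {"successor": "lane_5"},
    "lane_5": {"controlling_signal": "traffic_light_3"},
}
qa = make_qa(road_graph, "lane_1",
             ["left_neighbor", "successor", "controlling_signal"])
```

The depth of the relation chain controls the compositional difficulty of the question, and the returned trace lists the map elements visited at each hop, which is what makes the reasoning checkable.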

2605.20941 2026-05-21 cs.CV cs.GR cs.HC

PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

Yunge Wen, Yuancheng Shen, Paul Pu Liang

Affiliations: MIT Media Lab, New York University

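AI summary: This paper presents PaintCopilot, a neural painting assistant that models painting as an open-ended autoregressive artistic behavior. It predicts future strokes from the evolving canvas state and the prior stroke history without requiring a target image, unlike existing neural painting methods, which treat painting as pixel reconstruction toward a predefined reference image.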

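AI abstract (translated from Chinese)

We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on the evolving canvas state and the prior stroke history, without requiring a target image. Existing neural painting methods treat painting as pixel reconstruction toward a predefined reference image; PaintCopilot instead predicts future strokes directly from learned artistic dynamics, much as a large language model continues a text sequence from its preceding context. The framework comprises three complementary models: a ViT-based Target Predictor that infers the artist's intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent strokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we show that PaintCopilot enables fluid co-creative painting workflows in which the artist and the AI continuously alternate control throughout the creative process.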

English abstract

We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.
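To make the "2D Gaussian" brush representation more concrete, here is a minimal sketch of stamping one anisotropic Gaussian footprint onto a grayscale canvas by alpha compositing. The abstract does not give the actual stroke parameterization, so the function `gaussian_stamp` and its parameters (center, two standard deviations, rotation, color, opacity) are assumptions for illustration only.

```python
import math


def gaussian_stamp(canvas, cx, cy, sigma_x, sigma_y, theta, color, opacity):
    """Alpha-composite one rotated 2D Gaussian footprint onto `canvas` in place.

    `canvas` is a list of rows of grayscale values in [0, 1]. Every pixel is a
    smooth function of the stroke parameters, which is what makes a
    representation of this kind differentiable.
    """
    c, s = math.cos(theta), math.sin(theta)
    for y, row in enumerate(canvas):
        for x in range(len(row)):
            dx, dy = x - cx, y - cy
            # Rotate the pixel offset into the brush's own coordinate frame.
            u = c * dx + s * dy
            v = -s * dx + c * dy
            alpha = opacity * math.exp(-0.5 * ((u / sigma_x) ** 2 + (v / sigma_y) ** 2))
            row[x] = (1.0 - alpha) * row[x] + alpha * color


# White 32x32 canvas, one dark elongated stroke rotated by 45 degrees.
canvas = [[1.0] * 32 for _ in range(32)]
gaussian_stamp(canvas, cx=16, cy=16, sigma_x=6.0, sigma_y=2.0,
               theta=math.pi / 4, color=0.0, opacity=0.9)
print(round(canvas[16][16], 3))
```

In this sketch the stroke is darkest at its center and fades smoothly with distance, so there is no hard edge; a gradient-based optimizer or a learned predictor could adjust any of the seven parameters continuously.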

2605.20940 2026-05-21 cs.CV

3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

Olivia Zumsteg, Jannis Widmer, Yann Bourdé, Norbert Kirchgessner, Andreas Hund, Lukas Roth, Paraskevi Nousi

Affiliations: ETH Zurich, Swiss Data Science Center

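AI summary: This paper proposes a hybrid 2D-3D approach that applies knowledge distillation during training so that inference needs images only. A rigid-invariant point cloud network built on distance-based histogram features is combined with the proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture, and the ensemble knowledge is then transferred to a purely image-based model through feature-based or label-based distillation, improving the accuracy and efficiency of wheat spike volume estimation.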

Comments 8 pages, 6 figures (Appendix: 4 pages, 5 figures)

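AI abstract (translated from Chinese)

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstruction is computationally expensive. Direct 2D image processing offers computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach that applies knowledge distillation during training while allowing efficient image-only inference. First, we train a rigid-invariant point cloud network on distance-based histogram features to obtain pose-robust geometric representations. We then combine this 3D model with the proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we transfer the ensemble knowledge to a purely image-based student model through feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm³ for the non-distilled RT to 639.93 mm³ and 644.62 mm³, and raise the correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time drops from 160 ms to 1.4 ms per spike. Distillation also mitigates volume-dependent bias and shifts the latent representation of the image model toward a geometry-aware shape. Our results show that 3D-informed training of a 2D Transformer enables scalable and efficient spike volume estimation for high-throughput field phenotyping.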

English abstract

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.
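As a rough illustration of why distance-based histogram features can give pose-robust representations, the sketch below builds a histogram of pairwise point distances and checks that it is unchanged by a rotation plus translation. The abstract does not specify the exact feature definition, so using all pairwise Euclidean distances, the bin count, and the helper names `pairwise_distance_histogram` and `rigid_transform` are assumptions.

```python
import math
import random


def pairwise_distance_histogram(points, bins=8, max_dist=4.0):
    """Normalized histogram of all pairwise Euclidean distances in a point cloud."""
    counts = [0] * bins
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            idx = min(int(d / max_dist * bins), bins - 1)
            counts[idx] += 1
    total = n * (n - 1) // 2
    return [c / total for c in counts]


def rigid_transform(points, angle, shift):
    """Rotate points about the z-axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


rng = random.Random(0)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
moved = rigid_transform(cloud, angle=0.7, shift=(5.0, -2.0, 1.0))

h_orig = pairwise_distance_histogram(cloud)
h_moved = pairwise_distance_histogram(moved)
# Rigid motions preserve distances, so the two histograms agree up to rounding.
print(max(abs(a - b) for a, b in zip(h_orig, h_moved)))
```

Rotations and translations leave every pairwise distance unchanged, so any feature computed only from those distances is the same regardless of how the spike is oriented in front of the sensor. A network trained on such features therefore does not need to learn pose invariance from data.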