arXivDaily: arXiv daily academic digest, updated Monday to Friday
All subject categories: 2136
2606.10194 2026-06-10 cs.LG cs.AI New submission

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

Muhammad Umer Sheikh, Hassan Abid, Khawar Shehzad, Ufaq Khan, Muhammad Haris Khan

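AI summary: Proposes MMClima, a multimodal climate question-answering framework with 100k+ expert-validated question-answer pairs covering text, video, and figures, for evaluating multimodal language models on climate science.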

Abstract

Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We introduce MMClima, a large-scale multimodal climate question answering framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five core climate science domains. MMClima is constructed via automated claim extraction and QA synthesis with human-in-the-loop validation to ensure both scale and reliability. Using MMClima, we benchmark state-of-the-art multimodal language models on tasks requiring factual recall, visual interpretation, and cross-modal synthesis. We additionally fine-tune on the textual split to produce mmclima-70b-txt, a domain-adapted baseline that outperforms strong open- and closed-source models on textual QA. We release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework to support standardized multimodal evaluation for climate science.

2606.10184 2026-06-10 cs.LG cs.AI New submission

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Wooil Jung

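AI summary: Addresses the problem that GRPO yields zero advantages in continuous latent-reasoning models because their trajectories are deterministic; injects stochasticity through structured dropout so that GRPO can optimize a Bayesian model-average policy, improving the accuracy of a Coconut baseline on GSM8K.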

Abstract

Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps of a given rollout, we generate the required trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method (including unbiasedness, variance reduction, and the well-definedness of the latent gradient) and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from $27.29\%$ to $29.01\%$ pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. We position this as a practical, theoretically grounded approach to post-training latent-reasoning LLMs.
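
A minimal pure-Python sketch of the two mechanisms named in this abstract, under stated assumptions: the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$, and a single Bernoulli mask sampled once per rollout and reused at every latent recurrence step. The toy recurrence, the inverted-dropout scaling, the "reward" (first latent coordinate), and all function names are hypothetical illustrations, not the author's Coconut implementation.

```python
import math
import random


def group_advantages(rewards):
    """Group-mean advantage A^(k) = r^(k) - mu_r for one GRPO group."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def sample_shared_mask(dim, p_drop, rng):
    """Sample ONE Bernoulli keep-mask for a whole rollout.

    Assumption: inverted-dropout scaling (kept units are scaled by 1/keep).
    """
    keep = 1.0 - p_drop
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(dim)]


def latent_rollout(h0, weights, steps, mask=None):
    """Hypothetical toy latent recurrence h_{t+1} = tanh(W (mask * h_t)).

    The same mask is applied at every step, so a rollout behaves like one
    sampled sub-network instead of re-randomizing at each step.
    """
    h = list(h0)
    for _ in range(steps):
        x = h if mask is None else [m * v for m, v in zip(mask, h)]
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return h


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0.9, -0.4, 0.2], [0.3, 0.8, -0.5], [-0.6, 0.1, 0.7]]
    h0 = [1.0, 0.5, -0.5]
    K = 4
    # Without dropout every rollout is identical, so advantages are (numerically) zero.
    det = [latent_rollout(h0, W, steps=3)[0] for _ in range(K)]
    print(group_advantages(det))
    # With one shared mask per rollout, the toy rewards can differ across the group.
    masks = [sample_shared_mask(3, 0.3, rng) for _ in range(K)]
    stoch = [latent_rollout(h0, W, steps=3, mask=m)[0] for m in masks]
    print(group_advantages(stoch))
```

The design point the sketch isolates is that the mask is drawn per rollout, not per step, which is what the abstract credits with turning each rollout into a sample from a variational posterior.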

2606.10183 2026-06-10 cs.CV cs.AI cs.MM New submission

Making Time Editable in Video Diffusion Transformers

Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

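AI summary: Proposes a temporal-control method that extends a pretrained DiT with a lightweight temporal module, enabling editing of motion speed and temporal structure without redesigning the backbone.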

Abstract

Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamics. We propose a temporal-control methodology that extends a pretrained DiT with explicit time editing, allowing control over motion speed and temporal structure without redesigning the backbone. Its core implementation augments the pretrained model with a lightweight temporal module, preserving the original generative prior while expanding its controllable dynamic range.

2606.10180 2026-06-10 cs.RO cs.AI cs.HC New submission

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

Jonathan C. Kao, Jason Chan, Andy Wang

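AI summary: Proposes flow control, which uses generic real-time inputs such as a keyboard to steer the actions of VLA models without retraining, improving task success rates and completion speed.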

Comments 10 pages, 5 figures

Abstract

We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real time through generic inputs, such as a keyboard. This method can be used out of the box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conforming to the action expert's distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.

2606.10174 2026-06-10 cs.CV New submission

A Large Scale Open-Source Image and Video Dataset for Robust Wildfire Detection and Classification

Emadeldeen Hamdan, Yingyi Luo, B. Ugur Toreyin, Erdem Koyuncu, Adam J. Watts, Ugur Gudukbay, Ahmet Enis Cetin

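AI summary: Introduces GWFP, a large-scale open-source wildfire image and video dataset, and combines multiple convolutional and Transformer architectures with the HTE-ResNet approach to achieve robust cross-domain detection.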

Abstract

Wildfire detection and monitoring are critical for mitigating fire spread and reducing environmental and infrastructural damage. In this work, we introduce GWFP (Global Wildfire Prevention Dataset), a large-scale, open-source dataset of wildfire images and videos designed to support early fire and smoke detection research. GWFP contains geographically diverse wildfire scenes, including flames, smoke, water mist/fog conditions, near-infrared (NIR) imagery, embers, and challenging negative samples collected from real-world scenarios worldwide. To evaluate dataset robustness and cross-domain generalization, we benchmark multiple convolutional and transformer-based architectures in both in-domain and cross-dataset settings. Additionally, we explore lightweight frequency-spatial feature interaction using Hadamard-enhanced residual connections (HTE-ResNet) to analyze representation robustness under domain shift. Experimental results demonstrate strong cross-dataset generalization and practical utility for real-world wildfire monitoring applications. The dataset and source code will be publicly released upon acceptance.

2606.10170 2026-06-10 cs.LG New submission

Learning Entropy and Spatial Adaptation Dynamics of Multilayer Perceptrons for Structural Point Extraction

Jan Glaser, Ivo Bukovsky, Marcel Jirina

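AI summary: Proposes Spatial Learning Entropy Maps (SLEM), which analyze MLP weight adaptation while learning from image samples to identify the image points and regions that matter most for network learning, offering a new perspective on feature extraction.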

Abstract

This paper extends the concept of Learning Entropy (LE) from temporal adaptive systems to spatial learning in multilayer perceptron networks (MLPs) applied to image data. Instead of evaluating image structure directly from gradients or covariance operators, as local neighborhood methods do, the proposed approach analyzes the learning process itself through Learning Entropy. An MLP is trained to predict the intensity of a center pixel from its surrounding spatial context, while LE is evaluated from the incremental adaptation of neural weights during learning across image-derived samples. The resulting Spatial Learning Entropy Maps (SLEM) identify unusual image points and regions that induce strong adaptation of the neural network and therefore play an important role in the learning process. The results indicate that spatial Learning Entropy provides a complementary perspective to conventional feature extraction and explainability methods: it identifies image points and regions according to their learning impact rather than their local structural properties, highlighting spatial locations that are particularly informative for network learning. The proposed framework may open new directions for learning-driven image or scene analysis in computer vision, manufacturing, and robotics.
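
For readers new to Learning Entropy, the sketch below follows the approximate per-sample formulation commonly used in earlier LE work; treat it as an assumption, since the abstract does not state the exact variant used for SLEM. For each new sample, it counts how many weight increments exceed α times their recent mean magnitude, over several sensitivity levels α, and normalizes the count. The helper `learning_entropy_map` is hypothetical: if samples are ordered by pixel position, its output could be reshaped into a spatial map.

```python
def learning_entropy(dw_window, dw_now, alphas):
    """Approximate Learning Entropy of one sample, a value in [0, 1].

    dw_window: list of recent weight-increment vectors (the last m samples)
    dw_now:    weight-increment vector produced by the current sample
    alphas:    detection sensitivities, e.g. [2, 4, 8]
    """
    n_w = len(dw_now)
    m = len(dw_window)
    # Mean absolute increment of each weight over the recent window.
    mean_abs = [sum(abs(dw[i]) for dw in dw_window) / m for i in range(n_w)]
    unusual = 0
    for alpha in alphas:
        for i in range(n_w):
            # An update is "unusual" if it is more than alpha times its recent mean.
            if abs(dw_now[i]) > alpha * mean_abs[i]:
                unusual += 1
    return unusual / (len(alphas) * n_w)


def learning_entropy_map(dw_sequence, window, alphas):
    """Hypothetical helper: LE for every sample after the first `window` ones."""
    return [
        learning_entropy(dw_sequence[k - window:k], dw_sequence[k], alphas)
        for k in range(window, len(dw_sequence))
    ]


print(learning_entropy([[0.1, 0.1], [0.1, 0.1]], [1.0, 0.05], [2, 5]))  # 0.5
```

Here only the first weight jumps well above its recent scale, and it does so at both sensitivity levels, so half of the (weight, α) checks fire.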

2606.10167 2026-06-10 cs.CV New submission

FlexPath: Learned Semantic Path Priors for Image-Based Planning

Taehyoung Kim, Tim Schoenbrod, David Eckel, Henri Meeß

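AI summary: Proposes FlexPath, a two-stage framework that decouples feasibility priors from preferences and achieves task adaptation through differentiable path shape objectives; it reduces search effort by 14.3% in shortest-path planning and supports zero-shot generalization and adaptation to multiple objectives.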

Abstract

Recent learning-based path planners use neural networks to process visual map representations and approximate heuristics for classical search algorithms, yielding near-optimal paths with reduced search effort. However, these methods are tied to the shortest-path objective implicit in their supervision, which limits their flexibility to accommodate alternative criteria. We introduce FlexPath, a two-stage framework that decouples feasibility from preference. In Stage 1, we use imitation learning to acquire a task-independent spatial prior over feasible paths from visual map inputs. In Stage 2, differentiable Path Shape Objectives (PSOs) adapt this prior toward task-specific criteria without relearning path structure, requiring only efficient objective-level adaptation. A single pretrained model can be adapted to multiple objectives. For shortest-path planning, FlexPath reduces search effort on TMP by 14.3% compared to the state-of-the-art TransPath, while also finding lower-cost paths on average and demonstrating strong zero-shot generalization across three unseen domains. For obstacle clearance with a minimum clearance distance of 2, it achieves 96.8% full obstacle avoidance while maintaining low search cost. The framework further extends to semantic-aware avoidance and waypoint guidance via objective-level adaptation, and remains compatible with classical planners at inference time. Data and code are available at https://github.com/FraunhoferIVI/FlexPath.
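
Planners in this family plug a predicted heuristic into a classical search such as A*, and search effort is commonly reported as the number of node expansions (an assumption about the metric used here). The sketch below is a plain grid A* whose `heuristic` argument stands in for a learned model; FlexPath's actual spatial prior and Path Shape Objectives are not reproduced, and the function names are illustrative.

```python
import heapq


def astar(grid, start, goal, heuristic):
    """A* on a 4-connected grid (0 = free, 1 = blocked).

    Returns (path cost, number of node expansions); cost is None if unreachable.
    `heuristic(node, goal)` is where a learned estimate would be plugged in.
    """
    rows, cols = len(grid), len(grid[0])
    best_g = {start: 0}
    # Heap entries: (f = g + h, h, g, node); ties on f prefer nodes closer to the goal.
    h0 = heuristic(start, goal)
    heap = [(h0, h0, 0, start)]
    closed = set()
    expansions = 0
    while heap:
        _, _, g, node = heapq.heappop(heap)
        if node in closed:
            continue  # stale duplicate entry
        closed.add(node)
        expansions += 1
        if node == goal:
            return g, expansions
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nxt, ng = (nr, nc), g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    h = heuristic(nxt, goal)
                    heapq.heappush(heap, (ng + h, h, ng, nxt))
    return None, expansions


def zero_heuristic(a, b):
    return 0  # reduces A* to Dijkstra's algorithm


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


grid = [[0] * 5 for _ in range(5)]
print(astar(grid, (0, 0), (4, 4), zero_heuristic))  # (8, 25)
print(astar(grid, (0, 0), (4, 4), manhattan))       # (8, 9)
```

Both runs return the same optimal cost, but the informed heuristic expands far fewer nodes, which is the kind of saving a learned heuristic aims to amplify on harder maps.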

2606.10166 2026-06-10 cs.CV New submission

Fusing Satellite Imagery and Planimetric Maps for Cross-View Localization

Quang Long Ho Ngo, Zimin Xia, Alexandre Alahi

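AI summary: Proposes a module that fuses satellite imagery and planimetric maps through cross-modal conditioning and a patch-level fusion rule, reducing localization error by 30.13%.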

Abstract

Current cross-view localization methods predominantly rely on satellite imagery as the aerial modality. Although recent work explores planimetric maps (e.g., OpenStreetMap tiles), these approaches often lag behind in performance. Yet both modalities are widely available and possess complementary properties. Satellite images are closer to ground-level camera imagery, offering finer detail, whereas planimetric maps contain annotated objects (e.g., streetlamps) and remain informative in areas where the ground is occluded, such as by foliage. Despite this, only one prior work provides an end-to-end method to fuse the two modalities, and it does not demonstrate their potential within state-of-the-art methods. To combine the strengths of both modalities, we propose a new fusion module that augments standard encoders, and we demonstrate that integrating satellite imagery with planimetric maps improves state-of-the-art single-modality methods. The module comprises (i) cross-modal conditioning, which processes each modality's encoding with awareness of the other, and (ii) a patch-level fusion rule that controls the granularity of information exchange. We achieve state-of-the-art results, reducing the mean localization error by 30.13%. Qualitatively, the fusion adaptively selects the more informative modality, improving overall accuracy.
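
To make the idea of a per-patch fusion rule concrete, here is a hypothetical sketch that is not the authors' module: each patch carries one feature vector per modality plus a scalar informativeness score, and a per-patch softmax over the two scores weights the features. When one score dominates, the rule effectively selects that modality for the patch, in line with the qualitative behaviour described in the abstract. The scores, the `temperature` parameter, and the softmax gate are all assumptions.

```python
import math


def fuse_patches(sat_feats, map_feats, sat_scores, map_scores, temperature=1.0):
    """Hypothetical softmax-gated fusion of two modalities, one gate per patch.

    Returns (fused features, satellite weight per patch).
    """
    fused, weights = [], []
    for fs, fm, ss, sm in zip(sat_feats, map_feats, sat_scores, map_scores):
        # Numerically stable two-way softmax over the modality scores.
        top = max(ss, sm)
        es = math.exp((ss - top) / temperature)
        em = math.exp((sm - top) / temperature)
        w_sat = es / (es + em)
        fused.append([w_sat * a + (1.0 - w_sat) * b for a, b in zip(fs, fm)])
        weights.append(w_sat)
    return fused, weights


# Patch 0: equally informative, so features are averaged.
# Patch 1: the satellite score dominates, so the satellite features are (almost) selected.
fused, w = fuse_patches([[2.0], [2.0]], [[4.0], [4.0]], [0.0, 8.0], [0.0, 0.0])
print(fused, w)
```

A lower `temperature` makes the gate closer to a hard per-patch selection; a higher one blends the modalities more evenly.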

2606.10159 2026-06-10 cs.CL cs.AI cs.CY cs.LG New submission

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

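AI summary: Finds that superficially rewriting a paper's abstract can significantly manipulate AI review outcomes, with a success rate of about 38%, at low cost and in a way that is hard to detect, which could distort the fairness of scientific evaluation.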

Abstract

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We observe this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. The effect extends beyond overall score inflation, also increasing review confidence and scores on core scientific criteria such as soundness, significance, and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards, and careful human oversight.

2606.10154 2026-06-10 cs.LG cs.CR New submission

Quality Is Not a Safety Proxy Under Quantization

Sahil Kadadekar

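AI summary: Finds that quality metrics for quantized checkpoints cannot substitute for direct safety testing, and proposes the Refusal Template Stability Index (RTSI) to flag dangerous rows.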

Comments 21 pages, 6 figures. Preprint

Abstract

Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints. In this matrix the shortcut fails: all 36 quality-safety pairings split in direction across models, and 9 hidden-danger rows plus 1 near-hidden-danger row show stable or improved quality while refusal falls by 12-68 percentage points. Seven of the 11 AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic follow-up over the 17 Hugging Face-backed FP16/AWQ/GPTQ cells does not rescue the shortcut: entropy, refusal-direction, and calibration probes are weak or null separators of dangerous rows, and although probe-identified safety-associated neurons absorb 1.39$\times$ more quantization error overall ($p < 5 \times 10^{-7}$), the effect is not regime-specific. Claude Sonnet 4 relabels 11,470 items in a predefined stratified set, agrees with the primary gemma3:12b judge on 89.9% of rows ($\kappa = 0.873$, 95% CI [0.866, 0.881]), and changes 0/10 hidden-danger cells. A calibrated study-internal behavioral screen, the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix, routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation. On the same matrix, the best single-feature baselines (unique-prefix-rate delta, raw refusal-rate delta) recover 9/10 and 8/10 respectively at matched bucket size, and cross-stack transfer requires recalibration. For the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.
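
Two of the statistics quoted in this abstract can be checked with a few lines of standard-library Python. The Wilson score interval for 10 successes out of 10 at 95% confidence (z ≈ 1.96) has a lower bound of about 0.72, matching the reported value. Cohen's kappa relates observed agreement $p_o$ to chance agreement $p_e$ via $\kappa = (p_o - p_e)/(1 - p_e)$; the confusion counts in the test are invented for illustration, since the paper's label tables are not reproduced here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: rater A, cols: rater B)."""
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / total
    row_marg = [sum(confusion[i]) / total for i in range(k)]
    col_marg = [sum(confusion[i][j] for i in range(k)) / total for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))
    return (p_o - p_e) / (1 - p_e)


lo, hi = wilson_interval(10, 10)
print(round(lo, 2))  # 0.72
```

Unlike the normal-approximation interval, which collapses to [1, 1] when every trial succeeds, the Wilson interval still gives a meaningful lower bound for 10/10.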

2606.10153 2026-06-10 cs.LG New submission

Compositional Generative Modeling from Decentralized Data

Mashrur M. Morshed, Vishnu Naresh Boddeti

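AI summary: Proposes Decentralized Compositional Flow Matching (DCFM), which uses structural constraints to compose generative factors across decentralized data without exchanging raw data, substantially outperforming baselines in conditional image generation, robotic spatial planning, and medical attribute co-occurrence modeling.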

Comments ICML 2026

Abstract

Learning the compositional nature of the physical world requires joint observation of interacting factors. However, because practical data is often decentralized, these factors are fragmented across isolated silos. Existing decentralized generative approaches focus only on modeling the union of siloed data, overlooking novel combinations implied by the collective whole. To bridge this gap, we introduce Decentralized Compositional Flow Matching (DCFM), a framework that enforces structural constraints across the global set of generative factors, without exchanging any raw data. DCFM enables novel combinations to emerge through peer interactions, even when no single data source can independently support the composition. Empirically, DCFM substantially outperforms federated learning and mixture-of-experts baselines across conditional image generation, robotic spatial planning, and medical attribute co-occurrence modeling.

2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD New submission

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

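AI summary: Studies the pathways and integration mechanisms of audio and visual information flow in audio-visual large language models (AVLLMs), identifies two routing patterns (sequential flow and parallel flow), and shows that tokens can be discarded once their information has been transferred, improving efficiency.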

Comments 40 pages, 29 figures

Abstract

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information-flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to distinct parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information has been transferred to the LLM, with minimal impact on the model's predictions or even a slight improvement; this generalizes across multiple tasks and datasets and enables more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B), leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10142 2026-06-10 cs.CV New submission

DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation

Nanshan Jia, Zhenyu Zhao, Sui Huang, Jingshen Wang, Zeyu Zheng

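AI summary: Introduces the DB-3DME dataset and benchmark, containing 2,619 synthetic 3D meshes with human ratings, and improves VLM evaluation performance by fine-tuning the visual encoder, substantially outperforming existing models.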

Comments CVPR 2026 workshop paper. 10 pages, 3 figures, 6 tables. Dataset available at GitHub and Hugging Face

Abstract

Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.

2606.10137 2026-06-10 cs.LG New submission

Ambiguous Strategic Classification

Ivri Hikri, Nir Rosenfeld

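AI summary: Studies how a learner jointly optimizes a classifier and the uncertainty surrounding it when regulation requires partial information disclosure, introducing the notion of ambiguity and developing efficient algorithms.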

Abstract

A common assumption in strategic classification is that the classifier is public knowledge. However, it remains unclear whether, and why, a system would choose to commit to full disclosure. We study a setting in which regulation requires the system to disclose some, but not all, of the information. This induces a learning task in which the learner must jointly optimize the classifier and the uncertainty surrounding it. To this end, we adopt from robust mechanism design the notion of ambiguity, which in our setting allows the learner to reveal a set or range of possible classifiers, while privately choosing which of them to ultimately realize. We investigate how ambiguity affects the learning task, develop efficient algorithms for computing best responses and for training, and empirically explore strategic learning and its outcomes in this novel setting using our approach.
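
One way to picture the agent's side of this setting is a maxmin (ambiguity-averse) best response, a standard modelling choice in robust mechanism design that is assumed here for illustration; the abstract does not state the paper's exact agent model. Given the disclosed set of possible classifiers, the agent picks the point that maximizes its worst-case classification outcome minus the cost of moving. The 1-D threshold classifiers, the linear cost, and the candidate moves are hypothetical.

```python
def best_response(x, candidates, classifiers, cost):
    """Maxmin best response of an agent at x facing an ambiguous classifier set.

    classifiers: callables mapping a point to 1 (accept) or 0 (reject)
    cost:        callable cost(x, z) of moving from x to z
    Staying at x is always an option.
    """
    def worst_case_utility(z):
        return min(h(z) for h in classifiers) - cost(x, z)

    return max([x] + list(candidates), key=worst_case_utility)


def threshold(t):
    """Hypothetical 1-D classifier: accept iff the feature is at least t."""
    return lambda z: 1 if z >= t else 0


def linear_cost(x, z):
    """Hypothetical movement cost, linear in distance."""
    return 0.5 * abs(z - x)


moves = [0.5, 1.0, 1.5]
print(best_response(0.0, moves, [threshold(0.5)], linear_cost))                  # 0.5
print(best_response(0.0, moves, [threshold(0.5), threshold(1.0)], linear_cost))  # 1.0
```

In this toy example, disclosing a range of thresholds instead of a single one pushes a worst-case-minded agent to move further.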

2606.10129 2026-06-10 cs.LG cs.NE New submission

Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

Tai Nguyen, Phong Le, Carola Doerr, Nguyen Dang

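AI summary: Addressing the lack of interpretable multi-parameter control policies for evolutionary algorithms, proposes deep reinforcement learning combined with action-space decomposition, reward shifting, and long-horizon discounting, and distills a symbolic control policy that outperforms existing baselines on OneMax.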

Comments arXiv admin note: text overlap with arXiv:2505.12982

Abstract

While deep reinforcement learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, interpretable multi-parameter policies amenable to formal study. We demonstrate how deep-RL can be leveraged to overcome this barrier, using as a representative case study the $(1+(\lambda,\lambda))$ genetic algorithm optimizing OneMax, one of the few problems where a super-constant speedup from dynamic control has been formally proven. We first show that standard approaches struggle to converge in this multi-parameter setting, and introduce algorithm-agnostic enhancements targeting action-space decomposition, reward shifting, and long-horizon discounting. With these in place, we compare common deep-RL methods and find that Double Deep Q-Networks uniquely avoid the policy collapse observed in Proximal Policy Optimization, yielding trajectories suitable for downstream analysis. Crucially, we move beyond the "black-box" nature of neural networks by distilling the learned behaviors into a transparent, symbolic control policy. The resulting policy not only offers interpretability for future theoretical analysis but also yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes.
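
For context, here is a minimal sketch of the $(1+(\lambda,\lambda))$ GA on OneMax with a pluggable multi-parameter policy returning $(\lambda_m, \lambda_c, p, c)$: mutation population size, crossover population size, mutation rate, and crossover bias. The default policy uses the well-known fitness-dependent choice $\lambda = \sqrt{n/(n - f(x))}$ with $p = \lambda/n$ and $c = 1/\lambda$ from the theory literature; a learned or distilled symbolic policy would replace `default_policy`. The paper's learned policy is not reproduced, and minor details (resampling a zero mutation strength, tie handling) are assumptions.

```python
import math
import random


def onemax(bits):
    """OneMax fitness: the number of ones."""
    return sum(bits)


def default_policy(n, fx):
    """Fitness-dependent (lambda_m, lambda_c, p, c); stand-in for a learned policy."""
    lam = math.sqrt(n / (n - fx))
    size = max(1, round(lam))
    return size, size, lam / n, 1.0 / lam


def one_plus_ll_ga(n, policy=default_policy, max_evals=100_000, seed=0):
    """(1+(lambda,lambda)) GA on OneMax; returns (final fitness, evaluations)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, evals = onemax(x), 1
    while fx < n and evals < max_evals:
        lam_m, lam_c, p, c = policy(n, fx)
        # Mutation strength ell ~ Bin(n, p), resampled until positive (assumption).
        ell = 0
        while ell == 0:
            ell = sum(rng.random() < p for _ in range(n))
        # Mutation phase: lam_m mutants, each flipping the same number ell of random bits.
        best_mut, best_mut_f = None, -1
        for _ in range(lam_m):
            y = list(x)
            for i in rng.sample(range(n), ell):
                y[i] = 1 - y[i]
            fy = onemax(y)
            evals += 1
            if fy > best_mut_f:
                best_mut, best_mut_f = y, fy
        # Crossover phase: take each bit from the best mutant with probability c.
        best_cross, best_cross_f = None, -1
        for _ in range(lam_c):
            z = [m if rng.random() < c else xi for m, xi in zip(best_mut, x)]
            fz = onemax(z)
            evals += 1
            if fz > best_cross_f:
                best_cross, best_cross_f = z, fz
        # Elitist selection: accept the offspring if it is not worse than the parent.
        if best_cross_f >= fx:
            x, fx = best_cross, best_cross_f
    return fx, evals


print(one_plus_ll_ga(50, seed=1))
```

Because the policy is a plain function of `(n, fx)`, swapping in a distilled symbolic rule only requires passing a different `policy` argument.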

2606.10124 2026-06-10 cs.LG cs.AI New submission

FedSteer: Taming Extreme Gradient Staleness in Federated Learning with Corrective Projections and Caching

Haoran Zhang, Cainã Figueiredo Pereira, Marie Siew, Xutong Liu, Carlee Joe-Wong, Rachid El-Azouzi

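AI summary: To address gradient staleness caused by uneven client participation in federated learning, proposes FedSteer, which builds a subspace from a cache of client gradients and corrects stale gradients via projection and caching strategies, significantly improving training stability and accuracy.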

Comments UAI 2026

Abstract

Federated learning (FL) often suffers from aggregation variance when clients do not consistently participate in training rounds. While reusing stale model updates from inactive clients is a common technique to reduce this variance, we find that with skewed client participation, the resulting update staleness can become severe enough to destabilize training. To remedy this, we propose FedSteer, a novel method that constructs a gradient subspace from a cache of recent client gradients to serve as a low-dimensional representation of the current optimization landscape. FedSteer projects an active client's true gradient onto this subspace to find a set of optimal coordinates. For an inactive client, FedSteer reuses these coordinates with the subspace, which has since evolved through updates from other active clients. This process effectively "steers" outdated gradients toward the current global objective. It is complemented by a selective caching strategy that identifies a representative subset of clients to form the subspace, reducing server memory. Experiments demonstrate that FedSteer significantly outperforms baselines, preventing performance collapse in challenging scenarios while delivering accuracy gains of over 7% in others.
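
A minimal pure-Python sketch of the projection-and-reuse idea, under simplifying assumptions not stated in the abstract: the subspace basis comes from Gram-Schmidt orthonormalization of the cached gradients (the paper may use a different factorization), the cache keeps a fixed number of slots so that coordinates stay aligned across rounds, and the steered gradient is simply the stored coordinates re-expanded in the current basis. All function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(cached_grads, eps=1e-10):
    """Gram-Schmidt over cached gradients; (near-)dependent vectors are skipped."""
    basis = []
    for g in cached_grads:
        w = list(g)
        for b in basis:
            d = dot(w, b)
            w = [wi - d * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def coordinates(grad, basis):
    """Coordinates of the orthogonal projection of grad onto span(basis)."""
    return [dot(grad, b) for b in basis]


def expand(coords, basis):
    """Map subspace coordinates back to parameter space."""
    out = [0.0] * len(basis[0])
    for a, b in zip(coords, basis):
        out = [o + a * bi for o, bi in zip(out, b)]
    return out


# Round t: the client is active, so store its coordinates in the current subspace.
old_basis = orthonormal_basis([[1, 0, 0], [0, 1, 0]])
coords = coordinates([3.0, 4.0, 5.0], old_basis)  # [3.0, 4.0]
# Round t + k: the client is inactive; the cache has been refreshed by other clients.
new_basis = orthonormal_basis([[1, 0, 0], [0, 0, 1]])
print(expand(coords, new_basis))  # [3.0, 0.0, 4.0]
```

Only the low-dimensional coordinates need to be stored per inactive client, while the basis, shared by all clients, tracks the current optimization landscape.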

2606.10115 2026-06-10 cs.CV New submission

Improving PET/CT-Based Whole-Body Lesion Segmentation Using Prediction Uncertainty-Augmented Models

利用预测不确定性增强模型改进PET/CT全身病灶分割

Bashirul Azam Biswas, Biratal Raj Wagle, Zhihan Yang, Marc A. Seltzer, Matthew E. Maeder, James B. Yu, Indrani Bhattacharya

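AI Summary Proposes an uncertainty-aware framework that combines Bayesian ensembling, voxel-wise uncertainty quantification, and uncertainty-augmented training to improve the robustness and lesion-detection capability of whole-body PET/CT lesion segmentation, validated on the AutoPET-III and Deep-PSMA datasets.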

Comments 32 pages, 10 figures, 5 tables

AI Chinese Abstract

准确的全身正电子发射断层扫描(PET)/计算机断层扫描(CT)病灶分割对于癌症分期和治疗计划至关重要。PET提供不同放射性示踪剂的功能代谢信息,而CT提供解剖定位。由于细微的影像特征、混杂因素和读者间差异,从PET/CT影像中勾画病灶在临床上具有挑战性。现有的深度学习方法存在训练随机性、预测不一致、高肿瘤负荷病例中病灶遗漏以及缺乏不确定性量化等问题,限制了其临床可靠性。以nnU-Net为基线,我们提出了一种用于全身PET/CT病灶分割的不确定性感知框架,该框架整合了(1)贝叶斯集成以减少训练随机性,(2)具有认知和偶然分解的体素级不确定性量化,以及(3)认知不确定性增强训练以提高病灶检测。使用两个公开数据集AutoPET-III(1,611次扫描)和Deep-PSMA(200次扫描),包含多种癌症类型的FDG和PSMA研究,进行训练和评估。在未见过的AutoPET-III测试集上,贝叶斯集成相比确定性nnU-Net模型提高了鲁棒性和性能。不确定性图突出了模型不一致的区域,并与错误分类(尤其是假阳性)相关。不确定性增强训练以增加假阳性体积为代价提高了病灶恢复,反映了精确率-召回率的权衡。一种病例自适应路由策略通过在基础模型和增强模型之间进行选择,进一步提高了Dice系数。据我们所知,这是第一项在多示踪剂、泛癌种PET/CT分割中系统研究不确定性量化,并将贝叶斯集成与不确定性感知建模相结合的工作。

English Abstract

Accurate lesion segmentation from whole-body Positron Emission Tomography (PET)/Computed Tomography (CT) scans is essential for cancer staging and treatment planning. PET provides functional metabolic information with different radiotracers, while CT offers anatomical localization. Lesion delineation from PET/CT imaging is clinically challenging due to subtle imaging features, confounders, and inter-reader variability. Existing deep learning approaches suffer from training-related stochasticity, inconsistent predictions, and missed lesions in high tumor-burden cases, and they lack uncertainty quantification, which limits their clinical reliability. Using nnU-Net as a baseline, we propose an uncertainty-aware framework for whole-body PET/CT lesion segmentation that integrates (1) Bayesian ensembling to reduce training stochasticity, (2) voxel-wise uncertainty quantification with epistemic and aleatoric decomposition, and (3) epistemic uncertainty-augmented training to improve lesion detection. Two public datasets, AutoPET-III (1,611 scans) and Deep-PSMA (200 scans), comprising FDG and PSMA studies across multiple cancer types, are used for training and evaluation. Bayesian ensembling improves robustness and performance over deterministic nnU-Net models on the unseen AutoPET-III test set. Uncertainty maps highlight regions of model disagreement and correlate with misclassifications, particularly false positives. Uncertainty-augmented training improves lesion recovery at the cost of increased false-positive volume (FPVol), reflecting a precision-recall trade-off. A case-adaptive routing strategy further improves Dice by selecting between the base and augmented models. To our knowledge, this is the first study to systematically investigate uncertainty quantification in multi-tracer, pan-cancer PET/CT segmentation and to combine Bayesian ensembling with uncertainty-aware modeling for this task.
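
The epistemic/aleatoric split described above is commonly computed from ensemble outputs with an entropy decomposition: total predictive entropy equals the average member entropy (aleatoric) plus the mutual information between prediction and member (epistemic). This is a generic sketch for binary foreground probabilities from M ensemble members, not the paper's code.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Elementwise entropy (in nats) of Bernoulli probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def decompose_uncertainty(member_probs):
    """Split voxel-wise predictive uncertainty of an ensemble.

    member_probs: array of shape (M, ...) with foreground probabilities from M members.
    Returns (mean probability, total, aleatoric, epistemic), each of shape (...).
    """
    mean_p = member_probs.mean(axis=0)
    total = binary_entropy(mean_p)                          # entropy of the averaged prediction
    aleatoric = binary_entropy(member_probs).mean(axis=0)   # average entropy of each member
    epistemic = total - aleatoric                           # mutual information (member disagreement)
    return mean_p, total, aleatoric, epistemic


# Two voxels, three members: voxel 0 has confident disagreement, voxel 1 has agreed ambiguity.
probs = np.array([[0.0, 0.5],
                  [1.0, 0.5],
                  [1.0, 0.5]])
mean_p, total, aleatoric, epistemic = decompose_uncertainty(probs)
```

Voxel 0 ends up with high epistemic uncertainty (the members disagree), while voxel 1 is purely aleatoric (every member is equally unsure), which is the distinction an epistemic-uncertainty-augmented training signal would exploit.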

2606.10113 2026-06-10 cs.CL cs.AI New submission

Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing

基于LLM的文学翻译中的情感特征:机器翻译与译后编辑的系统性转变

Antonio Castaldo, Johanna Monti, Sheila Castilho

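AI Summary Studies the emotional profiles of LLM translations and how post-editing brings them closer to human translation; comparing LLM translations of Oryx and Crake, their post-edited versions, and a human translation, it finds that MT systems introduce specific emotional fingerprints that weaken the author's voice.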

AI Chinese Abstract

本文研究LLM翻译是否表现出可识别的情感特征,以及译后编辑如何将其重塑为更接近人类的标准。我们比较了玛格丽特·阿特伍德《Oryx and Crake》的LLM翻译及其译后编辑版本和人类翻译,以当代意大利科幻小说的大规模语料库为基线。通过基于词典和多语言建模的方法,我们对不同系统的情感变化进行了细粒度分析。我们发现,机器翻译系统在翻译中引入了特定模型且统计显著的情感指纹,导致作者声音的保留有限。

English Abstract

This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood's Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to limited preservation of the author's voice.
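
Lexicon-based emotion profiling typically counts emotion-bearing words per category and normalizes by text length, so that profiles of different translations of the same text can be compared. A toy sketch, assuming a tiny hypothetical Italian lexicon (`TOY_LEXICON`) and whitespace tokenization; the study's actual lexicon and multilingual models are not reproduced here.

```python
from collections import Counter

# Hypothetical mini-lexicon: word -> emotion categories (a real study would use a curated resource).
TOY_LEXICON = {
    "paura": {"fear"},
    "gioia": {"joy"},
    "buio": {"fear", "sadness"},
    "lacrime": {"sadness"},
}


def emotion_profile(text, lexicon=TOY_LEXICON, per=1000):
    """Emotion-word rate per `per` tokens for each category in the lexicon."""
    tokens = text.lower().split()
    counts = Counter()
    for tok in tokens:
        for emotion in lexicon.get(tok, ()):
            counts[emotion] += 1
    n = max(len(tokens), 1)
    categories = sorted({e for emos in lexicon.values() for e in emos})
    return {e: per * counts[e] / n for e in categories}


profile = emotion_profile("il buio e la paura e la gioia")
```

Comparing such profiles for MT output, post-edited text, and a human translation is one way to expose systematic shifts; a statistical test would still be needed to judge significance.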

2606.10111 2026-06-10 cs.LG cs.SY eess.SY stat.ML New submission

Nonlinear Estimator: Dual Bayesian Affine Estimators for Parameter Learning

非线性估计器:用于参数学习的双贝叶斯仿射估计器

Sasan Vakili, Daniël Woonings, Pradyumna Paruchuri, Peyman Mohajerin Esfahani

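AI Summary Proposes a nonlinear parameter estimator for Wiener-type state-space models that uses a fixed-point architecture to couple two affine minimum mean-squared error estimators, one for the unknown parameters and one for the latent variables, and develops two dual-estimator frameworks; experiments show that the dual state-parameter estimator achieves lower parameter mean-squared error than the other methods.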

Comments 32 pages, 9 figures

AI Chinese Abstract

本文提出一种用于Wiener型状态空间模型的非线性参数估计器,该估计器采用固定点架构,耦合两个仿射最小均方误差(MMSE)估计器:一个用于未知参数,另一个用于潜在变量。该架构保留了最优仿射MMSE参数估计器的功能结构,同时引入了动态基统计(DBS)估计,以总结非线性基函数评估。开发了两种DBS构建策略,从而产生两种非线性估计器框架。双基-参数估计器将仿射基估计器与仿射参数估计器相结合,而双状态-参数估计器首先计算仿射状态估计及其协方差,然后通过高斯DBS算子映射这些状态估计统计量以获得DBS估计。两种双估计器都采用固定点表征,交替估计每个分量,使用另一个分量的更新先验(该先验来自前一次迭代中该分量的插件估计统计量)。通过广泛的蒙特卡洛实验检验了所提方法的有效性,结果表明双基-参数估计器获得的参数均方误差与纯仿射参数估计器相当,而双状态-参数估计器实现了最低的参数均方误差,优于双基-参数估计器、纯仿射参数估计器以及经典粒子吉布斯和期望最大化方案的顺序蒙特卡洛变体。

English Abstract

This paper presents a nonlinear parameter estimator for Wiener-type state-space models, formulated as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for the latent variables. The architecture retains the functional structure of the optimal affine MMSE parameter estimator while incorporating Dynamic Basis Statistics (DBS) estimates that summarize nonlinear basis-function evaluations. Two DBS construction strategies are developed, leading to two nonlinear estimator frameworks. The dual basis-parameter estimator combines an affine basis estimator with the affine parameter estimator, whereas the dual state-parameter estimator first computes affine state estimates and their covariances, then maps these state-estimate statistics through a Gaussian DBS operator to obtain DBS estimates. Both dual estimators admit fixed-point characterizations that alternate between the two components: each is estimated under an updated prior for the other, formed from the other component's plug-in estimate statistics at the previous iteration. The efficacy of the proposed methods is examined via extensive Monte Carlo experiments. The dual basis-parameter estimator attains parameter mean-squared errors comparable to those of the purely affine parameter estimator, while the dual state-parameter estimator achieves the lowest parameter mean-squared error, outperforming both the dual basis-parameter and purely affine parameter estimators, as well as sequential Monte Carlo variants of classical Particle Gibbs and Expectation-Maximization schemes.
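
For reference, the affine (linear) MMSE estimator that each building block is based on has the standard closed form below, where $y$ is the observation, $\mu$ denotes prior means, and $\Sigma$ denotes (cross-)covariances. This is the textbook estimator and its error covariance, not the paper's full dual fixed-point scheme.

```latex
\hat{\theta}(y) = \mu_\theta + \Sigma_{\theta y}\,\Sigma_{yy}^{-1}\,(y - \mu_y),
\qquad
P_\theta = \Sigma_{\theta\theta} - \Sigma_{\theta y}\,\Sigma_{yy}^{-1}\,\Sigma_{y\theta}
```

In the dual schemes, the moments entering this formula for one component are refreshed at each iteration from the other component's estimate statistics.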

2606.10107 2026-06-10 cs.CV q-bio.QM New submission

Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

最大匹配精度:利用全局最优匹配的实例分割评估指标

Kaden Stillwagon, Alexandra D. VandeLoo, Craig R. Forest

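AI Summary Proposes Maximum Matching Accuracy (MMA), which uses globally optimal one-to-one matching and per-pixel normalization to overcome the discontinuity, insensitivity, and non-optimal matching of existing metrics for evaluating cell segmentation, yielding more stable, sensitive, and interpretable scores.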

AI Chinese Abstract

可靠评估实例分割模型需要准确且一致反映分割质量的指标。然而,生物成像中最广泛使用的指标存在根本性的数学缺陷:硬交并比阈值导致不连续、低灵敏度的评分;逐对象归一化在对象大小变化下扭曲分数;以及贪婪或一对多匹配过程产生非最优、顺序依赖的对应关系。这些特性共同导致在常见失败模式(如细胞分裂、细胞合并和细胞边界不精确)下产生不直观且不可靠的模型排名。我们提出最大匹配精度(MMA),一种无阈值连续指标,它找到预测对象与真实对象之间的全局最优一对一匹配,并使用逐像素归一化聚合总重叠。我们在三个实验(合成失败案例、渐进式破坏测试和模型排名比较)中评估MMA与AP@50、PQ、SEG和AJI。MMA产生的分数比现有替代方案更稳定、更敏感、更可解释,为生物细胞成像中的公平实例分割基准测试提供了原则性基础。

English Abstract

Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low-sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground-truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
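
A sketch of the matching step using SciPy's Hungarian solver (`scipy.optimize.linear_sum_assignment`). Because the exact MMA formula is not given in the abstract, the following are assumptions: masks are integer label images with 0 as background, the matching maximizes total overlapping pixels, and the score divides matched overlap by the pixel union |P| + |G| - matched. `matching_accuracy` is a hypothetical name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def overlap_matrix(pred, gt):
    """Pixel intersections between every predicted and every ground-truth instance (0 = background)."""
    p_ids = [i for i in np.unique(pred) if i != 0]
    g_ids = [j for j in np.unique(gt) if j != 0]
    inter = np.zeros((len(p_ids), len(g_ids)))
    for a, i in enumerate(p_ids):
        p_mask = pred == i
        for b, j in enumerate(g_ids):
            inter[a, b] = np.logical_and(p_mask, gt == j).sum()
    return inter


def matching_accuracy(pred, gt):
    """Globally optimal one-to-one matching on overlap, normalized per pixel (assumed formula)."""
    inter = overlap_matrix(pred, gt)
    matched = inter[linear_sum_assignment(inter, maximize=True)].sum() if inter.size else 0.0
    denom = np.count_nonzero(pred) + np.count_nonzero(gt) - matched
    return float(matched / denom) if denom else 1.0


gt = np.array([[1, 1, 0, 2, 2]])
perfect = matching_accuracy(gt.copy(), gt)                   # identical instances
merged = matching_accuracy(np.array([[1, 1, 1, 1, 1]]), gt)  # one prediction covers both cells
```

Since each prediction can be matched to at most one ground-truth cell, a merge is penalized continuously through its unmatched pixels rather than by a hard IoU cutoff.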

2606.10099 2026-06-10 cs.LG cs.AI New submission

Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion

无监督风格表示学习用于通过释义反转检测AI文本

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

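AI Summary Proposes an unsupervised style encoder that learns discriminative style features by reconstructing the gap between human-written text and its machine-generated paraphrase, enabling few-shot and zero-shot AI-text detection that outperforms baselines.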

AI Chinese Abstract

大型语言模型(LLMs)的快速发展引发了对其滥用的担忧,如抄袭、错误信息和自动化影响操作,这促使需要鲁棒的检测器。最近的研究表明,写作风格的神经表示对于检测是有效的,并且至关重要的是,对于击败大多数现有检测器的对抗攻击具有鲁棒性。然而,当前的基于风格的检测器依赖作者标签进行训练,并且仅限于少样本推理进行检测,需要可能并不总是可用的分布内样本。我们通过训练风格编码器从机器生成的释义中重构人工文本,从而在没有作者标签的情况下学习判别性风格特征;在训练期间冻结语义编码器,使风格编码器偏向于仅捕获重构所需的非语义特征。我们通过两种检测策略评估学习到的表示:少样本检测器和基于DeepSVDD的零样本检测器。在基准测试中,我们的方法在少样本设置下匹配或优于所有基线,并且在零样本设置下,与完全监督的分类器在分布内测试数据上具有竞争力,同时对未见过的LLMs具有更好的泛化能力。除了检测之外,学习到的表示还能泛化到未见过的任务,在作者验证和细粒度风格区分上取得竞争性表现,尽管从未针对这两个目标进行训练。

English Abstract

The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training and are limited to few-shot inference for detection, which requires in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.
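
The zero-shot detector is described as DeepSVDD-based. DeepSVDD scores a sample by its squared distance to a center fixed from one-class training data (in full DeepSVDD, the encoder is also trained to pull that class toward the center). A minimal sketch of the scoring step only, assuming pre-computed style embeddings for a single reference class and a frozen encoder; `fit_center` and `svdd_score` are hypothetical names.

```python
import numpy as np


def fit_center(ref_embeddings):
    """Center c: mean of the reference-class embeddings (DeepSVDD fixes c before training)."""
    return ref_embeddings.mean(axis=0)


def svdd_score(embeddings, center):
    """Anomaly score: squared Euclidean distance to the center (larger = less like the reference class)."""
    return np.sum((embeddings - center) ** 2, axis=1)


ref = np.array([[0.0, 0.0], [2.0, 0.0]])    # toy reference-class style embeddings
c = fit_center(ref)
scores = svdd_score(np.array([[1.0, 0.0], [4.0, 4.0]]), c)
```

A detection threshold on the score still has to be chosen, for example from a quantile of the reference-class scores.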

2606.10094 2026-06-10 cs.AI New submission

Predictive Assistance and the Temporal Dynamics of Exploratory Compression

预测性辅助与探索性压缩的时间动态

Balaraju Battu

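AI Summary Proposes a geometric dynamical framework for studying how predictive AI alters the temporal dynamics of cognitive exploration through exogenous exploratory compression, finding that sustained stabilization reduces exploratory responsiveness, that asymmetric curvature accumulation produces hysteresis, and that early intervention limits later exploratory diversity.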

AI Chinese Abstract

经典认知理论将问题解决描述为通过结构化问题空间的探索性搜索,其中重复交互逐渐将搜索压缩为高效的表征结构。预测性人工智能系统引入了一种独特的机制,在这种机制中,稳定可能在探索性多样化展开之前发生,在内部生成搜索之前提供解决方案和决策轨迹。本文发展了一个几何动力学框架,其中注意力在由稳定漂移、内源探索性扰动和响应性门控学习塑造的策略景观上演化。预测性辅助被建模为外源探索性压缩的过程,在自生成探索拓宽策略空间的可达区域之前稳定轨迹。该框架产生三个主要结果。首先,持续的预测性稳定通过减弱内源扰动的有效影响来降低探索响应性,即使探索变异性仍然存在。其次,曲率不对称地积累和松弛,产生滞后效应和辅助撤除后探索移动性的延迟恢复。第三,发展结果关键取决于稳定的时机,早期干预在广泛的表征多样化发生之前缩小未来的探索遍历。该框架产生了关于探索熵、过早收敛和预测稳定后延迟恢复的经验可检验预测。更广泛地说,结果表明预测系统可能重塑探索性认知本身的几何结构。

English Abstract

Classical theories of cognition describe problem solving as exploratory search through structured problem spaces in which repeated interaction gradually compresses search into efficient representational structures. Predictive artificial intelligence systems introduce a distinct regime in which stabilization may occur before exploratory diversification unfolds, supplying solutions and decision trajectories prior to internally generated search. This paper develops a geometric dynamical framework in which attention evolves over a landscape of strategies shaped by stabilizing drift, endogenous exploratory perturbation, and responsiveness-gated learning. Predictive assistance is modeled as a process of exogenous exploratory compression that stabilizes trajectories before self-generated exploration broadens the accessible regions of strategy space. The framework yields three main results. First, sustained predictive stabilization reduces exploratory responsiveness by attenuating the effective influence of intrinsic perturbations even when exploratory variability remains present. Second, curvature accumulates and relaxes asymmetrically, producing hysteresis and delayed recovery of exploratory mobility after assistance withdrawal. Third, developmental outcomes depend critically on the timing of stabilization, with early intervention narrowing future exploratory traversal before broad representational diversification has occurred. The framework generates empirically testable predictions concerning exploratory entropy, premature convergence, and delayed recovery following predictive stabilization. More broadly, the results suggest that predictive systems may reshape the geometry of exploratory cognition itself.

2606.10092 2026-06-10 cs.LG econ.GN q-fin.EC New submission

Decision-Making under Combinatorial Risk

组合风险下的决策

Yifan Hong, Hongmiao Fan, Chen Wang

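AI Summary Studies decision-making under combinatorial risk with an investment-allocation task, finding that participants rely mainly on features such as the after-investment success probability rather than exactly evaluating the full distribution, and uses symbolic regression to discover compact descriptive models.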

AI Chinese Abstract

风险下的决策通常通过单次彩票选择来研究。然而,许多实际决策涉及组合风险,其中风险来自多个风险组件,因此结果上的彩票是诱导的而非直接给出的,并且精确评估可能代价高昂。我们引入了一项投资分配任务来研究组合风险下的决策,其中投资于一个组件会提高其成功概率,从而重塑结果分布。参与者倾向于选择概率增量较大的选项,当增量相等时,选择初始成功概率较高的选项。揭示诱导的概率质量函数(PMF)会显著改变行为,使参与者对组合风险特征的反应减弱,并减少选择方差。为了解释这些模式,我们超越标准基准和手工假设,使用符号回归发现简洁的描述模型。发现的模型主要依赖于组合风险特征,例如投资后的成功概率,而不是对完整诱导分布的精确评估。当显示PMF时,行为可以通过用前景理论残差模型增强该模型来很好地解释。结果表明,人们主要通过核心特征来导航组合风险,仅在显示诱导PMF时才转向彩票估值。

English Abstract

Decision-making under risk is typically studied through single-shot lottery choices. Yet many real decisions involve combinatorial risk, where risk arises from multiple risky components, so the lottery over outcomes is induced rather than given outright and can be costly to evaluate exactly. We introduce an investment-allocation task to study decision-making under combinatorial risk, where investing in a component raises its success probability and thereby reshapes the outcome distribution. Participants favor the option with the larger probability increment and, when increments are equal, the option with the higher initial success probability. Revealing the induced probability mass function (PMF) substantially changes behavior, making participants less responsive to combinatorial-risk features and reducing choice variance. To explain these patterns, we go beyond standard benchmarks and hand-crafted hypotheses, using symbolic regression to discover compact descriptive models. The discovered models rely mainly on combinatorial-risk features, such as the after-investment success probability, rather than on exact evaluation of the full induced distribution. Behavior under the displayed PMF is then well explained by augmenting this model with a prospect-theoretic residual model. The results show that people navigate combinatorial risk primarily through its core features, shifting toward lottery valuation only when the induced PMF is displayed.
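
To make the "induced lottery" concrete: if, as an illustrative assumption, the outcome is the number of components that succeed and components succeed independently, the induced PMF is a Poisson-binomial distribution, computable with the dynamic program below. Investing in one component changes a single probability yet reshapes the whole PMF. The functions and the additive `delta` investment model are hypothetical.

```python
def outcome_pmf(probs):
    """PMF of the number of successes among independent components (Poisson-binomial).

    probs: success probability of each component. Returns a list where entry k = P(k successes).
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this component fails
            nxt[k + 1] += mass * p       # this component succeeds
        pmf = nxt
    return pmf


def invest(probs, i, delta):
    """Hypothetical investment model: raise component i's success probability by delta (capped at 1)."""
    new = list(probs)
    new[i] = min(1.0, new[i] + delta)
    return new


before = outcome_pmf([0.2, 0.5])                 # approximately [0.4, 0.5, 0.1]
after = outcome_pmf(invest([0.2, 0.5], 0, 0.3))  # approximately [0.25, 0.5, 0.25]
```

Evaluating every candidate investment this way is exactly the costly full-distribution computation that, according to the findings, participants mostly bypass in favor of simpler features.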

2606.10089 2026-06-10 cs.LG cs.AI New submission

A Theory on Flow Matching with Neural Networks

基于神经网络的流匹配理论

Yihan He, Qishuo Yin, Yuan Cao, Jianqing Fan, Han Liu

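AI Summary Establishes a theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields: proves convergence of gradient descent for over-parameterized two-layer ReLU networks, derives generalization bounds for the conditional velocity-field matching objective, and provides Wasserstein-distance guarantees for the generated samples.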

AI Chinese Abstract

在这项工作中,我们为神经网络参数化的条件速度场流匹配建立了理论基础。我们证明了过参数化两层ReLU神经网络中梯度下降的收敛性保证。我们推导了条件速度场匹配目标的泛化界。基于这些结果,我们为诱导流生成的样本提供了Wasserstein距离保证。我们的分析基于具有无界损失的多任务表示学习的泛化界,这可能对流式生成建模之外的其他领域也有独立意义。这些理论结果通过在合成和真实图像基准上的大量实验得到了验证。

English Abstract

In this work, we develop a theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields. We establish convergence guarantees for gradient descent in the over-parameterized two-layer ReLU neural network regime. We derive generalization bounds for the conditional velocity-field matching objective. Building on these results, we provide Wasserstein-distance guarantees for the samples generated by the induced flow. Our analysis is based on a generalization bound for multi-task representation learning with unbounded losses, which may be of independent interest beyond flow-based generative modeling. These theoretical results are validated through extensive experiments on both synthetic and real-world image benchmarks.
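
For context, a common instance of the conditional velocity-field matching objective uses the linear path x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0; samples are then generated by integrating the learned field from t = 0 to t = 1. The sketch below assumes that path (the paper's theory may cover a different conditional path) and accepts any callable `velocity(x, t)` in place of the two-layer ReLU network.

```python
import numpy as np


def cfm_loss(velocity, x0, x1, t):
    """Monte Carlo conditional flow-matching loss on the linear path x_t = (1 - t) x0 + t x1.

    x0, x1: (n, d) paired source and data samples; t: (n,) times in [0, 1].
    The regression target for the velocity at x_t is x1 - x0.
    """
    t = t[:, None]
    xt = (1.0 - t) * x0 + t * x1
    residual = velocity(xt, t) - (x1 - x0)
    return float(np.mean(np.sum(residual ** 2, axis=1)))


def euler_sample(velocity, x0, steps=100):
    """Generate samples by integrating dx/dt = velocity(x, t) from t = 0 to 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity(x, t)
    return x


def shift_field(x, t):
    """Toy velocity field: constant shift by 3 in every coordinate."""
    return np.full_like(x, 3.0)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 2))   # source (noise) samples
x1 = x0 + 3.0                        # toy paired "data": a pure shift
loss = cfm_loss(shift_field, x0, x1, rng.uniform(size=256))
samples = euler_sample(shift_field, x0, steps=50)
```

For this shift coupling the ideal field is constant, so the loss vanishes up to rounding and Euler integration reproduces the data; in general, how well the network fits the objective is what links training error to the quality of the generated samples.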

2606.10088 2026-06-10 cs.CV New submission

Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinson's Disease Video Classification

可解释的时序面部区域运动分析用于野外帕金森病视频分类

Riyadh Almushrafy

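AI Summary Proposes temporal motion descriptors based on facial-region keypoints, achieving lightweight and interpretable PD video classification on the YouTubePD benchmark with a balanced accuracy of 0.826.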

Comments 22 pages, 6 figures. Submitted to Biomedical Signal Processing and Control

AI Chinese Abstract

面部表情减少是帕金森病(PD)常见的运动表现,通常描述为面部运动减退或面部运动迟缓。本文研究从面部区域关键点提取的时序运动描述符是否能够支持野外PD相关视频分类,并在YouTubePD基准上进行评估。每个视频使用来自14个预定义面部区域的几何描述符表示。在相同的二分类协议下,比较了静态几何、归一化几何、基于速度的描述符、相对速度描述符以及GRU序列基线。为了评估稳定性和可解释性,研究包括种子鲁棒性分析、区域级消融和排列重要性。最佳结果使用归一化速度描述符和随机森林分类器获得,在保留测试集上达到平衡准确率0.826和AUROC 0.855。在10个随机种子下,该表示保持稳定,平衡准确率为0.810 ± 0.018,AUROC为0.855 ± 0.005。总体而言,结果表明归一化的面部区域运动是YouTubePD视频分类的一种轻量级且可解释的表示。该研究作为基准级分析,不声称临床严重程度评估或MDS-UPDRS面部表情评分。

English Abstract

Reduced facial expressivity is a common motor manifestation of Parkinson's disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 ± 0.018 and AUROC of 0.855 ± 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.
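
One plausible way to compute normalized velocity descriptors per facial region is sketched below. Assumptions not taken from the paper: keypoints arrive as a (T, K, 2) array, normalization divides coordinates by a per-frame face-scale value, and each region is summarized by the mean and standard deviation of its keypoints' frame-to-frame speed. The region map and scale are hypothetical inputs; the resulting vectors could then feed a classifier such as a Random Forest.

```python
import numpy as np


def region_velocity_features(kpts, regions, scale):
    """Per-region speed statistics from landmark trajectories.

    kpts:    (T, K, 2) keypoint coordinates over T frames.
    regions: dict mapping region name -> list of keypoint indices (hypothetical grouping).
    scale:   (T,) per-frame normalization factor, e.g. a face-size measure (assumption).
    Returns dict region -> (mean speed, std of speed), in scale-normalized units per frame.
    """
    norm = kpts / scale[:, None, None]                       # remove face-size / zoom effects
    speed = np.linalg.norm(np.diff(norm, axis=0), axis=2)    # (T-1, K) per-keypoint speed
    feats = {}
    for name, idx in regions.items():
        region_speed = speed[:, idx].mean(axis=1)            # average over the region's keypoints
        feats[name] = (float(region_speed.mean()), float(region_speed.std()))
    return feats


kpts = np.zeros((3, 2, 2))
kpts[:, 0, 0] = [0.0, 1.0, 2.0]        # keypoint 0 moves right by 1 pixel per frame
feats = region_velocity_features(kpts, {"mouth": [0], "brow": [1]}, np.full(3, 2.0))
```

Keeping each region's statistics as separate named features is what makes region-level ablation and permutation importance straightforward to interpret.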

2606.10087 2026-06-10 cs.CL cs.LG New submission

CodeAlchemy: Synthetic Code Rewriting at Scale

CodeAlchemy:大规模合成代码重写

Ankit Gupta, Aditya Prasad, Rameswar Panda

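AI Summary Proposes the CodeAlchemy framework, which generates 500B+ tokens of synthetic code data through 5 strategies, and introduces the DevEval and TraceEval benchmarks; its 3B models outperform frontier models 10x their size on multiple tasks.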

AI Chinese Abstract

在原始代码上预训练可以学习语法,但为多样化的真实世界任务格式提供的信号稀疏。虽然合成数据已被证明对语言模型具有变革性,但代码领域除有限的质量改进外仍基本未被探索。我们提出CodeAlchemy,一个合成数据生成框架,通过5种策略将公开来源的代码转换为语义丰富的训练数据:CodeEnhance(质量感知重写)、CodeQA(基于模板的问题)、CodeDev(开发者任务)、CodeDialogue(多轮对话)和CodeTrace(执行轨迹)。我们处理了15种语言的3个语料库,生成了超过500B token的合成数据以及350B推理token,数量级远超先前工作。CodeTrace对14种语言和5K个库的1.3M+文件进行插桩和执行,捕获控制流、状态跟踪和库知识。我们引入了DevEval(开发者任务)和TraceEval(执行预测)基准;前沿模型如Claude Sonnet 4.5在TraceEval上仅达到5.6%的精确匹配,揭示了语义理解的关键差距。我们的3B模型在HumanEval上达到83.5%,在MBPP上达到63.2%,在DevEval上达到8.09%的胜率,在TraceEval上达到15.36 ROUGE-2,超越了包括27B Gemma-3和32B Granite-4.0在内的10倍大小的前沿模型。

English Abstract

Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, its use for code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.
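
The abstract does not spell out how CodeTrace instruments programs. For Python sources, one way to capture a line-level trace with local-variable state is the standard `sys.settrace` hook; the sketch below (hypothetical `trace_execution` helper) records, for a single function, each executed line offset together with a snapshot of its locals, the kind of signal an execution-prediction task could be built from.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record (line offset, locals snapshot) for each line executed in func."""
    events = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None                    # do not trace other functions' frames
        if event == "line":
            offset = frame.f_lineno - target.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer                      # keep tracing lines in the target frame

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)                 # always remove the hook
    return result, events


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, events = trace_execution(running_sum, 3)
```

Each event records the state just before its line runs, so a prediction target such as "the value of `total` when `return` executes" falls directly out of the last event.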

2606.10086 2026-06-10 cs.AI New submission

Exploratory Responsiveness and Adaptive Rigidity under AI-Assisted Optimization

AI辅助优化下的探索响应性与适应性刚性

Balaraju Battu

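AI Summary Develops a theory of exploratory adaptation under AI-assisted optimization, using a dynamical framework to analyze how predictive assistance affects a system's exploratory responsiveness; it shows that convergent predictive regimes reduce adaptivity and increase rigidity, whereas exploration-enhancing regimes promote adaptivity.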

AI Chinese Abstract

本文发展了AI辅助优化下的探索适应理论。核心论点是,AI系统的长期适应效应关键取决于预测辅助如何与探索响应性本身相互作用。我们使用一个动态框架形式化这一机制,其中认知、制度和技术系统在由多个局部强化配置构成的崎岖认知景观上演化。模型中的一个核心状态变量是适应响应性,它衡量系统在不断变化的条件下穿越不熟悉的概念和制度轨迹的能力。在收敛预测机制下,AI系统替代探索参与,降低适应响应性,并产生亚稳态陷阱、滞后、过早收敛和探索崩溃动力学,使系统局部高效但全局刚性。该框架还识别出对比的探索增强机制,其中AI系统放大探索搜索、概念穿越和适应流动性。因此,有效替代参数是响应性依赖的:拥有弱探索例程的系统更容易受到探索替代,而已经拥有高适应响应性的系统可能利用AI辅助在崎岖景观上扩展探索流动性。因此,AI的长期适应效应不仅取决于AI能力本身,还取决于制度结构、发展背景和人机交互架构。

English Abstract

This paper develops a theory of exploratory adaptation under AI-assisted optimization. The central argument is that the long-run adaptive effects of AI systems depend critically on how predictive assistance interacts with exploratory responsiveness itself. We formalize this mechanism using a dynamical framework in which cognitive, institutional, and technological systems evolve over rugged epistemic landscapes characterized by multiple locally reinforced configurations. A central state variable in the model is adaptive responsiveness, which measures the capacity of a system to traverse unfamiliar conceptual and institutional trajectories under changing conditions. Under convergent predictive regimes, AI systems substitute for exploratory engagement, reducing adaptive responsiveness and generating metastable trapping, hysteresis, premature convergence, and exploration-collapse dynamics in which systems become locally efficient but globally rigid. The framework also identifies contrasting exploration-enhancing regimes in which AI systems amplify exploratory search, conceptual traversal, and adaptive mobility. The effective substitution parameter is therefore responsiveness-dependent: systems possessing weak exploratory routines are more vulnerable to exploratory substitution, whereas systems already possessing high adaptive responsiveness may use AI assistance to expand exploratory mobility across rugged landscapes. The long-run adaptive effects of AI consequently depend not only on AI capability itself, but also on institutional structure, developmental context, and the architecture of human-machine interaction.

2606.10084 2026-06-10 cs.LG cs.AI New submission

Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark

CTF-4-Science Lorenz基准的分治建模策略

Shundong Li

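AI Summary Proposes a divide-and-conquer modeling strategy that designs separate models for the five scenario families of the CTF-4-Science Lorenz benchmark, using smoothing-based denoising, NG-RC/NVAR forecasting, a Lorenz transition correction, and a parametric prefix blend; its score of 79.63 shows that scenario-specific updates outperform a general-purpose model.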

AI Chinese Abstract

本文针对CTF-4-Science Lorenz基准提出了一种分治建模策略,该基准通过十二个隐藏分数和五个场景族评估混沌系统预测:干净预测、噪声重建、噪声输入预测、少样本学习和参数泛化。最终系统不是强制一个模型类处理所有场景,而是将每个预测块与其任务组的评估行为相匹配。主要贡献包括:基于平滑的重建用于噪声全轨迹去噪;针对噪声长时间吸引子预测调优的NG-RC/NVAR模型;限制在敏感干净短时间前缀上的拟合Lorenz过渡校正;以及用于插值任务的参数前缀混合。最终系统得分为79.63,表明在混合混沌预测基准上,有界、场景特定的更新可以优于广泛的模型替换。

English Abstract

This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction across twelve hidden scores and five scenario families: clean forecasting, noisy reconstruction, noisy-input forecasting, few-shot learning, and parametric generalization. Rather than forcing one model class to handle all regimes, the final system matches each prediction block to the evaluation behavior of its task group. The main contributions are: smoothing-based reconstruction for noisy full-trajectory denoising; NG-RC/NVAR models tuned for noisy long-time attractor forecasting; a fitted Lorenz transition correction restricted to the sensitive clean short-time prefix; and a parametric prefix blend for the interpolation task. The resulting system, with a final public score of 79.63, shows that bounded, scenario-specific updates can outperform broad model replacement on mixed chaotic forecasting benchmarks.
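
NG-RC/NVAR models build a feature vector from a constant, a few time-lagged copies of the state, and their unique quadratic monomials, then fit a linear readout by ridge regression. The sketch below follows that generic recipe for one-step increments; the lag count `k`, the ridge strength, the use of increments, and the damped-rotation toy data are illustrative choices, not the tuned settings or data from this work.

```python
import numpy as np
from itertools import combinations_with_replacement


def nvar_features(X, k=2):
    """NVAR features for each time t >= k-1: [1, x_t, ..., x_{t-k+1}, unique quadratic monomials]."""
    T = X.shape[0]
    rows = []
    for t in range(k - 1, T):
        lin = np.concatenate([X[t - j] for j in range(k)])
        quad = [lin[i] * lin[j] for i, j in combinations_with_replacement(range(lin.size), 2)]
        rows.append(np.concatenate([[1.0], lin, quad]))
    return np.array(rows)


def fit_nvar(X, k=2, ridge=1e-6):
    """Ridge readout W mapping features at time t to the increment x_{t+1} - x_t."""
    F = nvar_features(X, k)[:-1]          # features for t = k-1 .. T-2
    Y = X[k:] - X[k - 1:-1]               # matching one-step increments
    A = F.T @ F + ridge * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ Y)


def nvar_step(history, W, k=2):
    """One-step forecast from the last k states."""
    f = nvar_features(history[-k:], k)[-1]
    return history[-1] + f @ W


theta, rho = 0.3, 0.97
A = rho * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
X = [np.array([1.0, 0.0])]
for _ in range(59):
    X.append(A @ X[-1])
X = np.array(X)                           # toy damped rotation, not Lorenz data
W = fit_nvar(X, k=1, ridge=1e-9)
pred = nvar_step(X, W, k=1)
```

Because the readout is linear in a fixed feature vector, training reduces to one ridge solve, which is part of why such models are cheap to tune separately for each scenario family.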

2606.10080 2026-06-10 cs.LG cs.AI q-bio.QM New submission

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

VFUSE: 基于稀疏自编码器的毒力特征理解

Michael Yu, Matthew L. Olson

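AI Summary Proposes VFUSE, which trains sparse autoencoders (SAEs) on diffusion-transformer activations to identify hazard-related features in protein design, improving interpretability without sacrificing performance.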

AI Chinese Abstract

生成模型在蛋白质设计等领域取得了显著进展,但这种能力也使得危险蛋白质的生成变得不透明。在这项工作中,我们引入了VFUSE(基于稀疏自编码器的毒力特征理解),这是一种机制可解释性方法,通过在扩散-Transformer激活上训练SAE来审计蛋白质模型中的危险感知特征。我们将VFUSE应用于RoseTTAFold3和RFDiffusion3,这些是流行的开源蛋白质折叠和合成模型。我们发现,对于某些模块,线性探针在SAE潜在空间中的拟合效果显著优于原始模型表示,从而在不牺牲模型性能的情况下提高了可解释性。此外,我们识别出SAE中的单语义特征,这些特征仅在危险设计上激活,AUROC高达0.84(q < 10^{-13})。据我们所知,这是首次在全原子扩散模型上训练SAE,也是首次对蛋白质设计模型进行特征级毒力审计,为安全且可解释的蛋白质设计铺平了道路。

English Abstract

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than in the original model's representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge, this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.
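
For readers unfamiliar with SAEs, the sketch below shows a common ReLU sparse-autoencoder forward pass and loss (reconstruction error plus an L1 sparsity penalty) on activation vectors, together with a pairwise AUROC for checking whether a single SAE feature separates hazardous from benign designs. This is a generic formulation; VFUSE's exact architecture, penalty, and hyperparameters are not given in the abstract, and all names here are hypothetical.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """ReLU SAE: codes f = ReLU((x - b_dec) W_enc + b_enc), reconstruction x_hat = f W_dec + b_dec."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)   # (n, m) non-negative feature activations
    x_hat = f @ W_dec + b_dec                          # (n, d) reconstructed activations
    return f, x_hat


def sae_loss(x, f, x_hat, l1=1e-3):
    """Mean squared reconstruction error plus an L1 penalty that encourages sparse codes."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return float(recon + l1 * sparsity)


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = np.random.default_rng(0)
d, m = 8, 32                                   # activation width, dictionary size
x = rng.standard_normal((4, d))                # stand-in for model activations
f, x_hat = sae_forward(x, 0.1 * rng.standard_normal((d, m)), np.zeros(m),
                       0.1 * rng.standard_normal((m, d)), np.zeros(d))
loss = sae_loss(x, f, x_hat)
```

A feature-level audit would apply `auroc` to one feature's activations (a column such as `f[:, j]`) against hazard labels for the corresponding designs.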

2606.10071 2026-06-10 cs.LG cs.AI New submission

Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport

时序层神经网络与动态正交传输

Md Sadek Hossain Asif, Tanzila Khan, Md. Mosaddek Khan

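AI Summary Proposes Temporal Sheaf Neural Networks (TSNN), which perform temporal link prediction using dynamic orthogonal frames and explicit transport between local coordinate systems, surpassing existing methods on multiple benchmarks, especially on graphs with strong node-role heterogeneity.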

AI Chinese Abstract

我们引入了时序层神经网络(TSNN),这是一个时序链接预测框架,它为每个节点配备一个时变正交帧,并仅在局部坐标系之间进行显式传输后比较节点状态。与在共享全局嵌入空间中运行的现有连续时间图模型不同,TSNN通过动态局部帧建模节点特定且不断演化的交互语义。该模型通过高效的低秩Householder乘积参数化每个节点的帧,在帧更新下精确保留存储的隐藏状态,并使用几何残差解码器,该解码器基于传输距离锚定预测,同时学习残差校正。所有计算严格因果,仅使用事件前历史。我们证明了对称度归一化层拉普拉斯算子与对称归一化图拉普拉斯算子正交相似,而随机游走归一化形式在相应度度量下相似;TSNN使用的全激活、特征缩放扩散正是组合层Dirichlet能量上的度量梯度步,具有无度单调下降和非扩张保证。帧漂移仅线性扰动更新。在TGB v2链接预测和时序异质排行榜以及DGB基准套件上,TSNN在大多数基准上匹配或超越最强先前方法,在表现出强节点角色异质性的图上改进最大。消融实验证实了动态帧、正交传输和几何残差解码的独特优势。

English Abstract

We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.
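
Two ingredients named above can be illustrated directly: building an orthogonal frame as a product of a few Householder reflections, and comparing node states only after transporting one into the other's local coordinates. A minimal NumPy sketch, assuming each frame is a d x d orthogonal matrix built from r vectors and that transport from u's frame to v's frame is Q_v^T Q_u; the paper's exact parameterization and decoder are not reproduced.

```python
import numpy as np


def householder_frame(V):
    """Orthogonal d x d frame Q = H_1 H_2 ... H_r from r Householder vectors V of shape (r, d).

    Q - I has rank at most r, so a small r gives a cheap, low-rank parameterization.
    """
    d = V.shape[1]
    Q = np.eye(d)
    for v in V:
        v = v / np.linalg.norm(v)
        Q = Q - 2.0 * np.outer(Q @ v, v)   # right-multiply by the reflection I - 2 v v^T
    return Q


def transported_distance(h_u, Q_u, h_v, Q_v):
    """Move h_u from u's local frame into v's local frame, then compare with h_v."""
    h_u_in_v = Q_v.T @ (Q_u @ h_u)
    return float(np.linalg.norm(h_u_in_v - h_v))


rng = np.random.default_rng(0)
d, r = 6, 2
Q_u = householder_frame(rng.standard_normal((r, d)))
Q_v = householder_frame(rng.standard_normal((r, d)))
h_u = rng.standard_normal(d)
h_v = Q_v.T @ (Q_u @ h_u)      # same underlying vector, expressed in v's frame
dist = transported_distance(h_u, Q_u, h_v, Q_v)
```

Without transport, `h_u` and `h_v` here would look different even though they describe the same underlying vector; comparing after transport removes that frame-dependent discrepancy, and orthogonality guarantees that transport preserves norms.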