arXivDaily: daily arXiv digest, updated Monday through Friday
2602.00567 2026-05-25 cs.LG

Forget by Uncertainty: Orthogonal Entropy Unlearning for Quantized Neural Networks

Tian Zhang, Yujia Tong, Junhao Dong, Ke Xu, Yuze Wang, Jingling Yuan

Affiliations Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, China; College of Computing and Data Science, Nanyang Technological University, Singapore

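AI summary As quantized neural networks are deployed on edge devices and privacy regulations tighten, the need for machine unlearning in quantized models grows more pressing. This paper proposes OEU, an orthogonal entropy unlearning framework with two core ideas. First, maximizing prediction uncertainty on the forget data yields an unbiased forgetting direction that avoids steering predictions toward any specific wrong class. Second, gradient orthogonal projection removes interference between forgetting and retain gradients, which gives a theoretical guarantee of utility preservation under a first-order approximation. Experiments show that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.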

Comments Accepted by ICML 2026

Abstract

The deployment of quantized neural networks on edge devices, combined with privacy regulations like GDPR, creates an urgent need for machine unlearning in quantized models. However, existing methods face critical challenges: they induce forgetting by training models to memorize incorrect labels, conflating forgetting with misremembering, and employ scalar gradient reweighting that cannot resolve directional conflicts between gradients. We propose OEU, a novel Orthogonal Entropy Unlearning framework with two key innovations: 1) Entropy-guided unlearning provides an unbiased forgetting direction by maximizing prediction uncertainty on forgotten data, avoiding confident misprediction toward any specific class, and 2) Gradient orthogonal projection eliminates interference by projecting forgetting gradients onto the orthogonal complement of retain gradients, providing theoretical guarantees for utility preservation under first-order approximation. Extensive experiments demonstrate that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.
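The two ingredients named in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the abstract does not give the exact loss or projection, so the example uses Shannon entropy of a softmax output as the uncertainty measure and projects against a single retain gradient vector. The helper names (`entropy`, `project_orthogonal`) are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy; maximal when the prediction is uniform.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(g_forget, g_retain, eps=1e-12):
    # Assumption: one retain gradient vector. Removes from g_forget its
    # component along g_retain, leaving a vector orthogonal to g_retain.
    denom = dot(g_retain, g_retain)
    if denom < eps:
        return list(g_forget)
    scale = dot(g_forget, g_retain) / denom
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]

uniform = softmax([0.0, 0.0, 0.0, 0.0])   # maximally uncertain prediction
peaked = softmax([8.0, 0.0, 0.0, 0.0])    # confident prediction
g_f = [1.0, 2.0, 3.0]                     # toy forgetting gradient
g_r = [1.0, 0.0, 1.0]                     # toy retain gradient
g_proj = project_orthogonal(g_f, g_r)
```

Because the projected vector has zero inner product with the retain gradient, a small step along it leaves the retain loss unchanged to first order, which is the sense of the guarantee the abstract describes.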

2601.22324 2026-05-25 cs.LG cs.MA

Automatic Construction of Clinical Scoring Systems with LLM Agents

Silas Ruhrberg Estévez, Christopher Chiu, Mihaela van der Schaar

Affiliations DAMTP, University of Cambridge, Cambridge, UK

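AI summary This paper studies how to automatically construct scoring systems suited to clinical practice, which typically consist of a small number of interpretable decision rules. The authors propose AgentScore, a method based on large language model (LLM) agents that uses semantically guided optimization to search the vast combinatorial space of rule sets for scoring rules that satisfy statistical validity and clinical deployability requirements. Experiments show that AgentScore outperforms existing score-generation methods across clinical prediction tasks and, despite stronger structural constraints, achieves predictive performance comparable to more flexible interpretable models.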

Abstract

Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
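A unit-weighted checklist of the kind the abstract describes (sum binary rules, then threshold) is simple to write down. The sketch below is illustrative only: the three rules, the field names, and the threshold are hypothetical placeholders, not rules from the paper and not clinical guidance.

```python
def checklist_score(patient, rules):
    # Each satisfied binary rule contributes exactly one point (unit weights).
    return sum(1 for rule in rules if rule(patient))

def predict(patient, rules, threshold):
    # Flag the patient when the number of satisfied rules reaches the threshold.
    return checklist_score(patient, rules) >= threshold

# Hypothetical binary rules over hypothetical patient fields.
rules = [
    lambda p: p["age"] >= 65,
    lambda p: p["heart_rate"] > 100,
    lambda p: p["systolic_bp"] < 90,
]

patient = {"age": 72, "heart_rate": 110, "systolic_bp": 120}
score = checklist_score(patient, rules)
flagged = predict(patient, rules, threshold=2)
```

The model class is tiny to evaluate, but choosing which k rules to include out of a large candidate pool is a combinatorial search, which is the difficulty the abstract points to.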

2601.21766 2026-05-25 cs.CL cs.AI

CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair

Affiliations IBM Research

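AI summary This paper proposes CoFrGeNet, a new generative architecture based on continued fractions whose components replace the multi-head attention and feed-forward network modules of a standard Transformer with far fewer parameters. Custom gradient formulations make training more accurate and efficient. The approach is evaluated on GPT2-xl and Llama3: the resulting models match and sometimes exceed the originals on downstream tasks with two-thirds to one-half of the parameters and shorter pre-training time.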

Comments Earlier version accepted to ICML 2026

Abstract

Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in the training or inference procedures already in place for Transformer-based models, which makes our approach easy to incorporate into large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that on downstream classification, Q&A, reasoning, and text understanding tasks our models are competitive with, and sometimes even superior to, the original models with $\frac{2}{3}$ to $\frac{1}{2}$ the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.

2601.21692 2026-05-25 cs.AI

TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

Mingzu Liu, Hao Fang, Runmin Cong

Affiliations School of Control Science and Engineering, Shandong University, Jinan, China; Key Laboratory of Industrial Intelligent Systems, Shandong Province, China

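AI summary Under the fine-tuning-as-a-service (FTaaS) model, customizing multimodal large language models (MLLMs) exposes them to backdoor risks from poisoned data. This paper proposes TCAP, an unsupervised backdoor detection method. It analyzes how a model's attention is allocated across three functional components (system instructions, vision inputs, and user textual queries) and finds that poisoned samples share a common signature: a divergence in that allocation. TCAP profiles attention heads with Gaussian mixture models and aggregates their votes with an EM-based scheme to filter and isolate backdoor samples. Experiments show strong detection performance across multiple model architectures and attack methods.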

Comments ICML 2026

Abstract

Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint, attention allocation divergence: poisoned samples disrupt the balanced attention distribution across three functional components (system instructions, vision inputs, and user textual queries), regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.
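The first step, splitting attention into three component shares, can be sketched as follows. This is a minimal illustration, not TCAP itself: the token spans, the single attention row, and the function name `component_shares` are assumptions, and the GMM profiling and EM-based voting stages are not shown.

```python
def component_shares(attention_row, spans):
    # attention_row: attention weights from one query position over all tokens.
    # spans: component name -> (start, end) half-open token index range.
    total = sum(attention_row)
    return {
        name: sum(attention_row[start:end]) / total
        for name, (start, end) in spans.items()
    }

# Hypothetical layout: 2 system tokens, 4 vision tokens, 2 user-text tokens.
attention_row = [0.10, 0.10, 0.05, 0.05, 0.30, 0.20, 0.10, 0.10]
spans = {"system": (0, 2), "vision": (2, 6), "text": (6, 8)}
shares = component_shares(attention_row, spans)
```

A per-head vector of three shares like this is the kind of low-dimensional statistic that could then be profiled across samples to look for the divergence the abstract describes.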

2601.21500 2026-05-25 cs.LG

Task-Awareness Improves LLM Generations and Uncertainty

Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann

Affiliations School of Computation, Information & Technology, Technical University of Munich; Munich Data Science Institute; Munich Center for Machine Learning

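AI summary This paper studies how task-dependent latent structure can improve both the generation quality and the uncertainty estimates of large language models (LLMs). The authors model LLM outputs directly in a task-dependent latent structure and equip it with a dissimilarity measure, which lets them compute Bayes-optimal responses that are more accurate and structurally sound. Experiments show that the approach outperforms standard decoding strategies across a range of tasks, and that quantifying uncertainty through the induced Bayesian risk aligns better with output quality and correctness.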

Abstract

In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
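For the simplest latent structure mentioned in the abstract, numerical values, the idea can be made concrete. The sketch below assumes squared error as the dissimilarity measure, under which the risk-minimizing response is the sample mean; the sampled values and function names are hypothetical, and the paper's treatment of labels and graphs is not covered here.

```python
def squared_error(a, b):
    return (a - b) ** 2

def bayes_risk(samples, response, dissimilarity):
    # Expected dissimilarity between a response and the sampled generations,
    # estimated as an average over the samples.
    return sum(dissimilarity(response, s) for s in samples) / len(samples)

def bayes_optimal_squared(samples):
    # Under squared error, the minimizer of the expected dissimilarity
    # is the mean of the samples.
    return sum(samples) / len(samples)

# Hypothetical numeric answers parsed from three sampled LLM generations.
samples = [10.0, 11.0, 15.0]
best = bayes_optimal_squared(samples)
risk = bayes_risk(samples, best, squared_error)
```

Here the mean (12.0) is not one of the sampled answers, which mirrors the abstract's point that the Bayes-optimal response is synthesized rather than selected. The residual risk doubles as an uncertainty score: it grows as the samples disagree more.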

2601.17261 2026-05-25 cs.LG

AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

Wei Lin, Yining Jiang, Qingyu Song, Qiao Xiang, Hong Xu

Affiliations Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong; Xiamen University, China

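AI summary Zeroth-order (ZO) optimization is a promising way to fine-tune large language models under strict memory limits, but existing methods typically use isotropic perturbations and ignore the rich activation structure available in the forward pass. This paper proposes AGZO, an activation-guided zeroth-order method. It exploits the fact that the gradient of a linear layer is confined to the subspace spanned by its input activations: AGZO extracts a compact activation-aware subspace on the fly during the forward pass and restricts perturbations to that low-rank subspace. Theoretical analysis shows that AGZO optimizes a subspace-smoothed objective and that its update directions have higher cosine similarity to the true gradient than isotropic baselines. Experiments show that AGZO outperforms state-of-the-art ZO methods on multiple benchmarks and significantly narrows the gap with first-order fine-tuning while keeping almost the same peak memory footprint as other ZO methods.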

Comments 21 pages in total, including 9 pages of main text, with 4 figures and 3 tables. Accepted by ICML 2026

Abstract

Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.
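The subspace observation can be illustrated on a toy linear layer. For y = W x, the weight gradient is a sum of outer products of output errors with input activations, so every row of it lies in the span of the inputs. The sketch below restricts a two-point zeroth-order estimate to that span. It is an illustration under assumptions, not AGZO: the squared-error loss, the Gram-Schmidt basis, and all function names are hypothetical choices, and the abstract does not say how the paper extracts its subspace.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    # Gram-Schmidt basis for the span of the input activation vectors.
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def loss(W, inputs, targets):
    # Mean squared error of the linear layer y = W x (toy objective).
    total = 0.0
    for x, t in zip(inputs, targets):
        y = [dot(row, x) for row in W]
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(inputs)

def subspace_zo_gradient(W, inputs, targets, mu=1e-3, rng=random):
    # Two-point ZO estimate whose perturbation Z has every row inside the
    # activation subspace, so no backward pass is needed.
    basis = orthonormal_basis(inputs)
    d_in = len(W[0])
    Z = []
    for _ in W:
        coeffs = [rng.gauss(0.0, 1.0) for _ in basis]
        Z.append([sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(d_in)])
    W_plus = [[w + mu * z for w, z in zip(wr, zr)] for wr, zr in zip(W, Z)]
    W_minus = [[w - mu * z for w, z in zip(wr, zr)] for wr, zr in zip(W, Z)]
    slope = (loss(W_plus, inputs, targets) - loss(W_minus, inputs, targets)) / (2 * mu)
    return [[slope * z for z in zr] for zr in Z]

random.seed(0)
inputs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # activations span 2 of 4 dims
targets = [[1.0], [2.0]]
W = [[0.0, 0.0, 0.0, 0.0]]
grad_est = subspace_zo_gradient(W, inputs, targets)
```

In this toy setup the estimate never places mass on the two input dimensions the activations do not touch, whereas an isotropic perturbation would spend part of its variance there.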

2601.16027 2026-05-25 cs.AI

Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLMs for Live Streaming Risk Assessment

Yiran Qiao, Xiang Ao, Jing Chen, Yang Liu, Qiwei Zhong, Qing He

Affiliations Institute of Computing Technology, Chinese Academy of Sciences

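AI summary As live streaming platforms grow, detecting risks such as scams and malicious behavior becomes increasingly important, yet these risks often accumulate gradually and recur across different streaming sessions, which makes them hard to identify. The paper proposes CS-VAR, a cross-session evidence-aware retrieval-augmented detector in which a lightweight model is guided during training by a large language model that reasons over cross-session behavioral evidence, giving efficient risk identification and assessment. Experiments on large-scale industrial datasets show state-of-the-art performance, and the method provides interpretable signals that support real-world live streaming moderation.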

Comments SIGIR'26 Full Paper

Abstract

The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.

2601.15224 2026-05-25 cs.CV cs.CL

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

Affiliations Northwestern University; University of California, Santa Barbara

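AI summary This paper proposes ProgressLM to address the limited ability of vision-language models to reason about task progress. It introduces Progress-Bench, a benchmark for systematically evaluating progress reasoning, and explores a human-inspired two-stage reasoning paradigm through both training-free prompting and a training-based approach built on the ProgressLM-45K dataset. Experiments show that most existing models perform poorly at task progress estimation, whereas the training-based ProgressLM-3B achieves consistent improvements even at a small scale and generalizes to tasks disjoint from its training set.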

Comments ACL 2026 Camera Ready Version

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails. Website: https://progresslm.github.io/ProgressLM/

2601.14821 2026-05-25 cs.CV

POTR: Post-Training 3DGS Compression

Bert Ramlot, Martijn Courteaux, Peter Lambert, Glenn Van Wallendael

Affiliations IDLab-MEDIA research group; Ghent University; imec

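AI summary This paper proposes POTR, a post-training compression method for 3D Gaussian Splatting (3DGS) that targets its high storage requirements. POTR introduces an efficient pruning technique that uses a modified 3DGS rasterizer to compute the removal effect of every splat simultaneously, which substantially reduces the number of splats and speeds up inference. It also proposes a training-free method to recompute lighting coefficients that greatly lowers their entropy and increases their sparsity. Experiments show that POTR outperforms other post-training compression techniques in both rate-distortion performance and inference speed.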

Comments 15 pages, 12 figures. Submitted to IEEE TCSVT, under review

Journal ref IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026

Abstract

3D Gaussian Splatting (3DGS) has recently emerged as a promising contender to Neural Radiance Fields (NeRF) in 3D scene reconstruction and real-time novel view synthesis. 3DGS outperforms NeRF in training and inference speed but has substantially higher storage requirements. To remedy this downside, we propose POTR, a post-training 3DGS codec built on two novel techniques. First, POTR introduces a novel pruning approach that uses a modified 3DGS rasterizer to efficiently calculate every splat's individual removal effect simultaneously. This technique results in 2-4x fewer splats than other post-training pruning techniques and as a result also significantly accelerates inference with experiments demonstrating 1.5-2x faster inference than other compressed models. Second, we propose a novel method to recompute lighting coefficients, significantly reducing their entropy without using any form of training. Our fast and highly parallel approach especially increases AC lighting coefficient sparsity, with experiments demonstrating increases from 70% to 97%, with minimal loss in quality. Finally, we extend POTR with a simple fine-tuning scheme to further enhance pruning, inference, and rate-distortion performance. Experiments demonstrate that POTR, even without fine-tuning, consistently outperforms all other post-training compression techniques in both rate-distortion performance and inference speed.

2601.14300 2026-05-25 cs.LG cs.CR

Low-Cost Hard-Label Adversarial Attack with Theoretical Foundations

Jun Liu, Leo Yu Zhang, Fengpeng Li, Isao Echizen, Jiantao Zhou

Affiliations University of Macau; National Institute of Informatics; Griffith University; University of Tokyo

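AI summary This paper studies hard-label black-box adversarial attacks, which rely only on a model's top-1 prediction and are a practically significant threat. To address the weak initialization strategies and missing theoretical guarantees of existing methods, the authors propose a theoretically grounded unified framework together with a zero-query initialization strategy and a Pattern-Driven Optimization (PDO) algorithm, improving attack efficiency and success rate. Experiments show that the method outperforms state-of-the-art attacks across multiple datasets and defended models, generalizes well, and bypasses the stateful defense Blacklight.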

Abstract

Hard-label black-box attacks, relying solely on top-1 predictions, represent one of the most challenging yet practically relevant threat models. Despite recent progress, existing approaches face two key limitations: (1) they overlook the critical role of initialization, focusing primarily on optimization strategies; and (2) they rely heavily on empirical heuristics without theoretical guarantees. To bridge this gap, we establish a unified theoretical framework showing that existing sign-flipping hard-label attacks can be understood as approximating the true gradient sign. Guided by this principled analysis, we propose a novel attack framework featuring a zero-query initialization strategy and a Pattern-Driven Optimization (PDO) algorithm. We provide theoretical guarantees that our initialization yields higher cosine similarity to the true gradient sign than random baselines, and that our PDO module achieves significantly lower query complexity than baseline search methods. Extensive experiments across CIFAR-10, ImageNet, and ObjectNet (covering standard and adversarially trained models, commercial APIs, and CLIP models) demonstrate that our method consistently outperforms SOTA hard-label attacks in both success rate and efficiency, particularly under low query budgets. Furthermore, our method demonstrates robust generalization across corrupted data (ImageNet-C), biomedical images (PathMNIST), and dense prediction tasks such as segmentation. Notably, it bypasses the stateful defense Blacklight, achieving a 0% detection rate.

2601.07545 2026-05-25 cs.LG stat.ML

Near-Optimal Private Linear Regression via Iterative Hessian Mixing

Omri Lev, Moshe Shenfeld, Vishwak Srinivasan, Katrina Ligett, Ashia C. Wilson

Affiliations Department of EECS, Massachusetts Institute of Technology, US; School of Computer Science and Engineering, The Hebrew University of Jerusalem, IL

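AI summary This paper studies differentially private ordinary least squares regression with bounded data and proposes Iterative Hessian Mixing (IHM), an algorithm built on Gaussian sketching. IHM is differentially private and comes with excess empirical risk bounds that improve on those of existing methods such as AdaSSP by removing a multiplicative factor that can be as large as the square root of the data dimension. It also shows better empirical performance across many datasets.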

Abstract

We study differentially private ordinary least squares (DP-OLS) with bounded data $(X,Y)$ via sketching-based mechanisms. While Gaussian sketching approaches have been explored for DP-OLS (Sheffet, 2017), they are typically viewed as less competitive than the Adaptive Sufficient Statistics Perturbation (AdaSSP) method (Wang, 2018), which directly perturbs the sufficient statistics $(X^{\top}X, X^{\top}Y)$. This method was shown to be close to information-theoretically optimal, while also exhibiting strong empirical performance. In this work, we propose Iterative Hessian Mixing (IHM), an algorithm that builds on Gaussian sketching approaches to DP-OLS and is inspired by the Iterative Hessian Sketch of Pilanci and Wainwright. We prove that IHM is differentially private and provide utility guarantees in the form of excess empirical risk bounds. These bounds improve upon those of AdaSSP by removing a multiplicative factor that can be as large as the square root of the data dimension. The design of IHM is based on new accuracy guarantees that we present for prior Gaussian sketching approaches to DP-OLS, which clarify when these methods are expected to perform well and how IHM circumvents their inherent limitations. We also conduct a rigorous empirical evaluation on a large suite of datasets, demonstrating that IHM consistently outperforms prior baselines, including AdaSSP.

2601.00969 2026-05-25 cs.RO cs.AI

V-VLAPS: Value-Guided Planning for Vision-Language-Action Models

Ke Ren, Ali Salamatian, Kieran Pattison, Cyrus Neary

Affiliations The University of British Columbia

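AI summary This work proposes V-VLAPS, a value-guided planning method for vision-language-action (VLA) models, to address planning failures caused by biased policy priors in complex tasks. V-VLAPS adds a lightweight value head, trained on offline VLA rollouts to predict Monte Carlo returns, that steers Monte Carlo Tree Search toward higher-value branches. Across the LIBERO task suites, V-VLAPS matches the value-free baseline at the default search budget and outperforms it in all suites once the search budget is increased.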

Abstract

Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distribution shift and long-horizon task structure. Recent VLA-guided planning methods improve execution by using pretrained policies to guide tree search, yet node selection still depends heavily on policy priors and visit-count exploration. Consequently, when the policy favors poor actions, the planner lacks a learned value signal to correct this bias. Prior work has shown that VLA representations encode rollout success and failure information, suggesting that they may also support value estimation during planning. We introduce Value-Guided Vision-Language-Action Planning and Search (V-VLAPS), which augments VLA-guided planning with a lightweight value head trained on offline VLA rollouts to predict Monte Carlo returns. These predictions guide Monte Carlo Tree Search toward higher-value branches. Across five LIBERO suites, V-VLAPS matches the value-free planning baseline in aggregate at the default search budget, and analysis shows that many hard failures are root-level timeouts where predicted values are weakly separated. With a larger search budget, V-VLAPS improves over the baseline in all task suites, with gains of 6 percentage points on LIBERO-Object and 4 percentage points on LIBERO-10. Our results suggest that VLA representations can support not only failure prediction, but also value-guided planning when search reaches branches where value-based ranking matters.
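Two pieces of this pipeline are easy to sketch: the Monte Carlo return targets a value head would regress onto, and a selection rule that mixes a value estimate with the policy prior and visit counts. The abstract does not specify the selection formula, so the example assumes an AlphaZero-style PUCT score; the discount factor, the constant `c_puct`, and the toy numbers are hypothetical.

```python
import math

def monte_carlo_returns(rewards, gamma=1.0):
    # Discounted return-to-go for each step of one rollout, computed backward.
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def puct_score(value, prior, parent_visits, child_visits, c_puct=1.0):
    # Assumed PUCT form: value estimate plus a prior-weighted exploration bonus.
    return value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, parent_visits):
    return max(
        children,
        key=lambda ch: puct_score(ch["value"], ch["prior"], parent_visits, ch["visits"]),
    )

# Sparse success reward at the end of a three-step rollout.
targets = monte_carlo_returns([0.0, 0.0, 1.0], gamma=0.5)

# The policy prior prefers "a", but the value estimate prefers "b".
children = [
    {"name": "a", "value": 0.2, "prior": 0.6, "visits": 3},
    {"name": "b", "value": 0.8, "prior": 0.4, "visits": 3},
]
chosen = select_child(children, parent_visits=6)
```

With equal visit counts, the higher value estimate outweighs the higher prior, which is the corrective effect the abstract attributes to a learned value signal.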

2512.22597 2026-05-25 cs.LG physics.chem-ph

Energy-Guided Generative Modeling for Low-Energy Molecular Structure Discovery

Guikun Xu, Xiaohan Yi, Ziqiao Meng, Peilin Zhao, Yatao Bian

Affiliations School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Computer Science, National University of Singapore; Shenzhen International Graduate School, Tsinghua University

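AI summary This paper proposes EnFlow, an energy-guided generative model for efficiently discovering low-energy molecular conformations. It couples flow-based conformer generation with explicit energy landscape modeling to jointly generate conformational ensembles and identify ground states. By integrating the generative dynamics with a learned energy model, EnFlow produces structurally accurate, low-energy conformations in very few sampling steps and can rank the generated conformations by energy. Experiments on GEOM-QM9 and GEOM-Drugs show strong performance.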

Abstract

Exploring molecular energy landscapes and identifying ground-state conformations are central challenges in computational chemistry. However, generating diverse low-energy conformers from molecular graphs remains expensive with traditional physics-based pipelines. Existing learning-based approaches remain fragmented: generative models capture conformational diversity but often lack reliable energy calibration, whereas deterministic predictors focus on a single structure and fail to represent ensemble variability. Here we introduce EnFlow, to our knowledge the first energy-guided generative framework that couples flow-based conformer generation with explicit energy landscape modeling for joint conformational ensemble generation and ground-state identification. By integrating generative dynamics with a learned energy model, EnFlow guides sampling toward low-energy regions of the conformational landscape, improving structural fidelity under extremely few sampling steps while enabling energy-based ranking of generated conformations. Experiments on GEOM-QM9 and GEOM-Drugs show that EnFlow achieves strong performance in conformer generation and ground-state identification while requiring only 1 to 2 ODE sampling steps. Single-point GFN2-xTB evaluations further show that the learned energy scores preserve physically meaningful energetic rankings of generated conformations. These results support explicit energy landscape modeling as an effective strategy for low-energy molecular structure discovery through joint modeling of conformational ensembles and their associated energies.

2512.20298 2026-05-25 cs.CL cs.AI cs.CY cs.HC

Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz

Affiliations IDEAS Research Institute; Adam Mickiewicz University; AMU Center for Artificial Intelligence; Poznań University of Medical Sciences; Maria Curie-Skłodowska University

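AI summary This study examines how well large language models (LLMs) diagnose personality disorders from first-person narratives, comparing them with mental health professionals on Borderline (BPD) and Narcissistic (NPD) Personality Disorders. LLMs identify BPD well but severely underdiagnose NPD, which suggests a possible bias in how the models handle value-laden terms. LLMs tend to give elaborate justifications based on patterns and formal categories, whereas human experts focus on the patient's sense of self and temporal experience; overall diagnostic reliability still needs improvement.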

Abstract

Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-over-breadth case study directly compares state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.

2512.18363 2026-05-25 cs.CV

Enhancing 3D Semantic Scene Completion with a Refinement Module

使用精化模块增强3D语义场景补全

Dunxing Zhang, Jiachen Lu, Han Yang, Lei Bao, Bo Song

Affiliations * National Science Center for Earthquake Engineering, Tianjin University; School of Civil Engineering, Tianjin University; Chair of Robotics, Artificial Intelligence and Real-time Systems, Technical University of Munich

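AI summary This paper proposes ESSC-RM, an enhancement framework for semantic scene completion (SSC) whose plug-and-play refinement module can be integrated into existing SSC models. It works in two stages: a baseline network first produces a coarse voxel prediction, which is then refined under multiscale supervision by a 3D U-Net-based Prediction Noise-Aware Module and a Voxel-level Local Geometry Module. Experiments on SemanticKITTI show consistent gains in semantic prediction performance, supporting its use as a general refinement framework.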

Comments 19 pages, 8 figures

详情
AI中文摘要

我们提出ESSC-RM,一种即插即用的增强框架,用于带有精化模块的语义场景补全,可以无缝集成到现有的SSC模型中。ESSC-RM分两个阶段运行:基线SSC网络首先生成粗体素预测,随后由基于3D U-Net的预测噪声感知模块(PNAM)和体素级局部几何模块(VLGM)在多尺度监督下进行精化。在SemanticKITTI上的实验表明,ESSC-RM持续改善语义预测性能。当集成到CGFormer和MonoScene中时,平均IoU分别从16.87%提升到17.27%,以及从11.08%提升到11.51%。这些结果表明ESSC-RM作为一个通用的精化框架,可适用于广泛的SSC模型。

英文摘要

We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.

2512.11551 2026-05-25 cs.RO

CarlaNCAP: A Framework for Quantifying the Safety of Vulnerable Road Users in Infrastructure-Assisted Collective Perception Using EuroNCAP Scenarios

CarlaNCAP:使用EuroNCAP场景量化基础设施辅助集体感知中弱势道路使用者安全性的框架

Jörg Gamerdinger, Sven Teufel, Simon Roller, Oliver Bringmann

Affiliations * University of Tübingen, Faculty of Science, Department of Computer Science, Embedded Systems Group

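AI summary The growing number of road users has raised accident risk in recent years, and Vulnerable Road Users (VRUs) are especially exposed in urban environments where they are often occluded. This paper proposes a framework for evaluating how much infrastructure-assisted collective perception (CP) improves VRU safety, together with CarlaNCAP, a dataset of 11,000 frames of safety-critical EuroNCAP scenarios. In simulation, infrastructure-assisted CP substantially reduces accident rates, reaching up to 100% accident avoidance versus 33% for a vehicle that relies only on its own sensors.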

详情
AI中文摘要

近年来,道路使用者数量的增加显著提高了事故风险。弱势道路使用者(VRU)尤其危险,尤其是在城市环境中,他们经常被停放的车辆或建筑物遮挡。自动驾驶(AD)和集体感知(CP)是减轻这些风险的有前景的解决方案。特别是基础设施辅助的CP,其中传感器单元安装在交通信号灯或灯柱等基础设施元件上,可以通过提供增强的视角来帮助克服感知限制,从而显著减少遮挡。为了鼓励决策者采用这项技术,需要全面的研究和数据集来证明VRU的安全改进。在本文中,我们提出了一个评估基于基础设施的CP对VRU安全改进的框架,包括一个包含11000帧安全关键EuroNCAP场景的数据集(CarlaNCAP)。利用该数据集,我们进行了深入的仿真研究,并证明基础设施辅助的CP可以显著降低安全关键场景中的事故率,与仅配备传感器的车辆(事故避免率33%)相比,实现了高达100%的事故避免。代码可在https://github.com/ekut-es/carla_ncap获取。

英文摘要

The growing number of road users has significantly increased the risk of accidents in recent years. Vulnerable Road Users (VRUs) are particularly at risk, especially in urban environments where they are often occluded by parked vehicles or buildings. Autonomous Driving (AD) and Collective Perception (CP) are promising solutions to mitigate these risks. In particular, infrastructure-assisted CP, where sensor units are mounted on infrastructure elements such as traffic lights or lamp posts, can help overcome perceptual limitations by providing enhanced points of view, which significantly reduces occlusions. To encourage decision makers to adopt this technology, comprehensive studies and datasets demonstrating safety improvements for VRUs are essential. In this paper, we propose a framework for evaluating the safety improvement that infrastructure-based CP brings specifically to VRUs, including a dataset of safety-critical EuroNCAP scenarios (CarlaNCAP) with 11k frames. Using this dataset, we conduct an in-depth simulation study and demonstrate that infrastructure-assisted CP can significantly reduce accident rates in safety-critical scenarios, achieving up to 100% accident avoidance, compared with only 33% for a vehicle relying on its own sensors. Code is available at https://github.com/ekut-es/carla_ncap

2512.06404 2026-05-25 cs.AI cond-mat.mtrl-sci physics.chem-ph

GENIUS: An Agentic AI Framework for Autonomous Design and Execution of Simulation Protocols

GENIUS: 一种用于自主设计和执行模拟协议的智能AI框架

Mohammad Soleymanibrojeni, Roland Aydin, Diego Guedes-Sobrinho, Alexandre C. Dias, Maurício J. Piotrowski, Wolfgang Wenzel, Celso Ricardo Caldeira Rêgo

Affiliations * Karlsruhe Institute of Technology; Institute of Nanotechnology; Hamburg University of Technology; Federal University of Paraná; University of Brasília; Federal University of Pelotas; Institute of Physics and International Center of Physics

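AI summary GENIUS is an agentic framework that autonomously designs and executes simulation protocols, targeting the setup and debugging burden of computational materials science. It combines a knowledge graph for the quantum-mechanical simulation package Quantum ESPRESSO with a tiered hierarchy of language models supervised by a finite-state error-recovery machine, turning natural-language prompts into validated input files and repairing errors automatically. Compared with LLM-only baselines, GENIUS halves inference cost and nearly eliminates hallucinations, making electronic-structure density functional theory (DFT) simulations more accessible and supporting automated, large-scale materials engineering.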

Journal ref Communications Materials 7, 115 (2026)

详情
AI中文摘要

预测性原子模拟推动了材料发现,但常规设置和调试仍需计算机专家。这种知识差距限制了集成计算材料工程(ICME),因为最先进的代码存在但非专家使用起来仍然繁琐。我们通过GENIUS解决了这一瓶颈,这是一种AI智能体工作流,将智能的Quantum ESPRESSO知识图谱与由有限状态错误恢复机器监督的分层大语言模型层次结构融合。我们展示了GENIUS将自由形式的人类生成提示翻译成经过验证的输入文件,在295个多样化基准测试中约有80%运行完成,其中76%被自主修复,成功率呈指数衰减至7%的基线。与仅使用LLM的基线相比,GENIUS将推理成本减半,并几乎消除了幻觉。该框架通过智能自动化协议生成、验证和修复,使电子结构DFT模拟大众化,为全球学术界和工业界开放大规模筛选并加速ICME设计循环。

英文摘要

Predictive atomistic simulations have propelled materials discovery, yet routine setup and debugging still demand computer specialists. This know-how gap limits Integrated Computational Materials Engineering (ICME), where state-of-the-art codes exist but remain cumbersome for non-experts. We address this bottleneck with GENIUS, an AI-agentic workflow that fuses a smart Quantum ESPRESSO knowledge graph with a tiered hierarchy of large language models supervised by a finite-state error-recovery machine. Here we show that GENIUS translates free-form human-generated prompts into validated input files that run to completion on $\approx$80% of 295 diverse benchmarks, where 76% are autonomously repaired, with success decaying exponentially to a 7% baseline. Compared with LLM-only baselines, GENIUS halves inference costs and virtually eliminates hallucinations. The framework democratizes electronic-structure DFT simulations by intelligently automating protocol generation, validation, and repair, opening large-scale screening and accelerating ICME design loops across academia and industry worldwide.
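The supervision loop described in this abstract can be pictured as a small finite-state machine. The sketch below is illustrative only and is not the GENIUS implementation: the state names, the `run_protocol` function, the repair budget, and the three toy callables are all assumptions introduced here, and the toy "input file" is a plain dictionary rather than a real Quantum ESPRESSO input.

```python
from enum import Enum, auto


class State(Enum):
    GENERATE = auto()
    VALIDATE = auto()
    REPAIR = auto()
    DONE = auto()
    FAILED = auto()


def run_protocol(prompt, generate, validate, repair, max_repairs=3):
    """Drive one request through generate -> validate -> repair until it passes
    validation or the (hypothetical) repair budget is exhausted."""
    state, candidate, error, repairs = State.GENERATE, None, None, 0
    trace = []
    while state not in (State.DONE, State.FAILED):
        trace.append(state)
        if state is State.GENERATE:
            candidate = generate(prompt)
            state = State.VALIDATE
        elif state is State.VALIDATE:
            error = validate(candidate)  # None means the input is accepted
            if error is None:
                state = State.DONE
            elif repairs < max_repairs:
                state = State.REPAIR
            else:
                state = State.FAILED
        elif state is State.REPAIR:
            repairs += 1
            # The attempt counter could select a stronger model tier (assumption).
            candidate = repair(candidate, error, repairs)
            state = State.VALIDATE
    trace.append(state)
    return state, candidate, trace


# Toy stand-ins for the language-model calls and the validator (hypothetical).
def toy_generate(prompt):
    return {"calculation": "scf"}  # the cutoff is left out on purpose


def toy_validate(config):
    return None if "ecutwfc" in config else "missing ecutwfc"


def toy_repair(config, error, attempt):
    fixed = dict(config)
    fixed["ecutwfc"] = 40.0
    return fixed


final_state, final_config, trace = run_protocol(
    "relax bulk Si", toy_generate, toy_validate, toy_repair
)
```

Keeping the transitions explicit bounds the number of repair attempts and makes every run auditable through `trace`, which is one plausible reason to prefer a state machine over an open-ended retry loop.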

2511.18000 2026-05-25 cs.LG cs.AI q-bio.PE

Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

空间流行病模拟中的奖励工程:个体行为学习的强化学习平台

Radman Rakhshandehroo, Daniel Coombs

Affiliations * Department of Computer Science, University of British Columbia; Department of Mathematics and Institute of Applied Mathematics, University of British Columbia

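AI summary This paper introduces ContagionRL, a reinforcement learning platform for spatial epidemic simulation, built to study systematically how reward function design shapes learned individual behavior. The platform couples a configurable SIRS+D epidemic model with environments in which different reward schemes can be evaluated under varying conditions; the experiments identify directional guidance and explicit adherence incentives as key to robust policy learning. Agents trained with the potential field reward perform best, both in adhering to non-pharmaceutical interventions and in spatial avoidance, and the platform's modular design supports further exploration of reward-behavior relationships.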

Comments 38 pages, 15 figures and 18 tables; Accepted to TMLR. OpenReview: https://openreview.net/forum?id=yPEASsx3hk

Journal ref Transactions on Machine Learning Research, 2026

详情
AI中文摘要

我们提出了ContagionRL,一个与Gymnasium兼容的强化学习平台,专门用于空间流行病模拟中的系统奖励工程。与依赖固定行为规则的传统基于智能体的模型不同,我们的平台能够严格评估奖励函数设计如何影响在不同流行病场景中学到的生存策略。ContagionRL集成了空间SIRS+D流行病模型与可配置的环境参数,允许研究人员在包括有限可观测性、不同移动模式和异质人口动态等变化条件下对奖励函数进行压力测试。我们评估了五种不同的奖励设计,从稀疏生存奖励到一种新颖的势场方法,跨越多种RL算法(PPO、SAC、A2C)。通过系统的消融研究,我们发现方向性指导和明确的依从性激励是稳健策略学习的关键组成部分。我们在不同感染率、网格大小、可见性约束和移动模式下的全面评估表明,奖励函数的选择显著影响智能体行为和生存结果。使用我们的势场奖励训练的智能体始终获得优越性能,学习最大程度地遵守非药物干预,同时发展出复杂的空间规避策略。该平台的模块化设计使得能够系统地探索奖励-行为关系,弥补了这类模型中奖励工程关注有限的空白。ContagionRL是研究流行病背景下适应性行为反应的有效平台,并强调了奖励设计、信息结构和环境可预测性在学习中的重要性。我们的代码公开在https://github.com/redradman/ContagionRL。

英文摘要

We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning. Our code is publicly available at https://github.com/redradman/ContagionRL
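The abstract does not give the formula of the potential field reward, so the following is only a sketch of what a reward with directional guidance and an adherence incentive could look like. The function name, the inverse-distance repulsion, the weights, and the death penalty are all assumptions made for illustration and are not taken from ContagionRL.

```python
import math


def potential_field_reward(agent_xy, infected_xy, adherence, alive=True,
                           repulsion=1.0, adherence_weight=0.5, eps=1e-6):
    """Hypothetical reward: a survival bonus, minus a repulsive potential that
    grows as infected individuals get closer, plus a bonus for adherence
    (a value in [0, 1]) to non-pharmaceutical interventions."""
    if not alive:
        return -10.0  # assumed terminal penalty
    potential = 0.0
    for x, y in infected_xy:
        distance = math.hypot(agent_xy[0] - x, agent_xy[1] - y)
        potential += repulsion / (distance + eps)
    return 1.0 - potential + adherence_weight * adherence


near = potential_field_reward((0.0, 0.0), [(1.0, 0.0)], adherence=0.0)
far = potential_field_reward((0.0, 0.0), [(4.0, 3.0)], adherence=0.0)
far_adherent = potential_field_reward((0.0, 0.0), [(4.0, 3.0)], adherence=1.0)
```

Because the potential changes smoothly with distance, every step away from an infected individual raises the reward, which is one way a dense signal can supply the directional guidance that a sparse survival bonus lacks.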

2511.17171 2026-05-25 cs.CV cs.LG

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

FireScope: 基于思维链预言机的野火风险栅格预测

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel

Affiliations * ETH Zurich

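AI summary This paper proposes FireScope, a framework that predicts wildfire risk rasters by reasoning over visual, climatic, and geographic information. It introduces FireScope-Bench, a dataset that pairs Sentinel-2 satellite imagery and climate data with expert-defined risk maps for cross-continental evaluation. FireScope builds on a vision-language model trained with reinforcement learning and visual supervision, and produces risk rasters accompanied by reasoning traces, improving cross-continental generalization and interpretability. The authors describe it as, to their knowledge, the first framework to show that language-based reasoning can improve generalization in visual generation, and as proposing a high-resolution wildfire risk model that can be applied across continents.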

Comments CVPR 2026, Project Page: https://firescope.ai/research

详情
AI中文摘要

预测野火风险是一个推理密集型的空间问题,需要整合视觉、气候和地理因素来推断连续的风险地图。现有方法缺乏可靠泛化所需的因果推理和多模态理解。我们引入了FireScope-Bench,一个大规模数据集和基准,将Sentinel-2图像和气候数据与专家定义的全美风险栅格以及欧洲的真实野火事件配对,用于跨大陆评估。基于此数据集,我们提出了FireScope,一个基于VLM的推理到生成框架,从强化学习和视觉监督中学习,通过互补的推理轨迹预测风险栅格。当在美国训练并在欧洲测试时,FireScope取得了显著的性能提升,而专家反馈和自动化分析证实其推理轨迹是忠实且有语义意义的。我们的发现表明,推理可以支撑栅格预测模型,提高泛化性和可解释性。据我们所知,这是第一个(1)证明基于语言的推理可以改善视觉生成泛化性的框架,(2)提出一个可跨大陆应用的高分辨率野火风险模型,以及(3)能够系统研究多模态火灾风险模型稳健跨大陆泛化的框架。我们相信FireScope-Bench有潜力成为推动推理驱动、可解释和可泛化空间建模的基础。数据和源代码将公开提供。

英文摘要

Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

2511.16014 2026-05-25 cs.AI

MUSEKG: A Knowledge Graph Over Museum Collections

MUSEKG:博物馆藏品知识图谱

Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden

Affiliations * The University of Melbourne School of Computing; The University of Melbourne

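AI summary MuseKG is an interactive knowledge graph system for museum collection data. It integrates heterogeneous sources (structured catalogue records, images, and unstructured descriptions) into a unified, queryable representation. The system builds a typed graph that links objects, people, organisations, images, and extracted semantic entities, and supports natural-language queries and relation-aware retrieval. A demonstration on real museum collections shows that MuseKG supports common tasks such as attribute lookup and relation exploration, with answers that stay inspectable through the explicit graph structure.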

Comments SIGIR'26

详情
AI中文摘要

文化遗产领域的数字化产生了大量但分散的博物馆藏品数据存储库,涵盖结构化编目记录、图像和非结构化描述。现有的博物馆信息系统通常难以将这些来源整合成统一的、可查询的表示,以支持关系感知的探索。我们提出了MuseKG,一个交互式知识图谱系统,它将异构博物馆数据组织成一个类型化图,在连贯的模式下链接对象、人物、组织、图像、图像派生标签和提取的语义实体。MuseKG通过将用户问题映射到图实体并检索用于答案生成的紧凑证据邻域来支持自然语言查询。通过在真实博物馆藏品上的交互式演示,我们展示了MuseKG支持常见的探索任务,如属性查找、关系探索和关系感知检索,并且答案可以通过显式图结构进行检查。

英文摘要

Digitisation in the cultural heritage sector has produced large but fragmented repositories of museum collection data, spanning structured catalogue records, images, and unstructured descriptions. Existing museum information systems often make it difficult to integrate these sources into a unified, queryable representation that supports relation-aware exploration. We present MuseKG, an interactive knowledge graph system that organises heterogeneous museum data into a typed graph that links objects, people, organisations, images, image-derived labels, and extracted semantic entities within a coherent schema. MuseKG supports natural-language queries by grounding user questions to graph entities and retrieving a compact neighbourhood of evidence for answer generation. Through an interactive demonstration on real museum collections, we show that MuseKG supports common exploration tasks such as attribute lookup, relation exploration, and relation-aware retrieval, with answers that remain inspectable via explicit graph structures.
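To make the idea of a typed graph with a compact evidence neighbourhood concrete, here is a minimal sketch. It is not the MuseKG schema or API: the class `TypedGraph`, the `ground` helper, the substring-based grounding, and the node and relation names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class TypedGraph:
    """Toy typed graph: every node has a type, every edge a relation label."""

    def __init__(self):
        self.node_type = {}
        self.edges = defaultdict(list)  # node -> [(relation, neighbour)]

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, relation, dst):
        self.edges[src].append((relation, dst))
        self.edges[dst].append((f"inverse:{relation}", src))

    def neighbourhood(self, start, max_hops=1):
        """Breadth-first collection of (node, relation, neighbour) triples
        within max_hops of the start node."""
        seen, triples = {start}, []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for relation, other in self.edges[node]:
                triples.append((node, relation, other))
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
        return triples


def ground(question, labels):
    """Naive grounding (assumption): match entity labels by substring."""
    text = question.lower()
    return [node for node, label in labels.items() if label.lower() in text]


graph = TypedGraph()
graph.add_node("object:vase-1", "Object")
graph.add_node("person:potter-1", "Person")
graph.add_node("org:museum-1", "Organisation")
graph.add_edge("object:vase-1", "made_by", "person:potter-1")
graph.add_edge("object:vase-1", "held_by", "org:museum-1")

labels = {"object:vase-1": "blue vase", "person:potter-1": "Ada Potter"}
hits = ground("Who made the blue vase?", labels)
evidence = graph.neighbourhood(hits[0], max_hops=1)
```

The returned triples are the kind of explicit, inspectable evidence that could be handed to an answer generator; a real system would presumably use a more robust entity linker than substring matching.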

2511.14286 2026-05-25 cs.CV

NeuralBoneReg: An Instance-Specific Label-Free Point Cloud-Based Method for Multi-Modal Bone Surface Registration

NeuralBoneReg:一种用于多模态骨表面注册的实例特定无标签点云方法

Luohong Wu, Matthias Seibold, Nicola A. Cavalcanti, Yunke Ao, Roman Flepp, Aidana Massalimova, Lilian Calvet, Philipp Fürnstahl

Affiliations * Research in Orthopedic Computer Science, Balgrist University Hospital, University of Zurich; AI Center, ETH Zurich

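AI summary In computer-assisted orthopedic surgery, accurate registration between preoperative imaging and intraoperative data is essential for transferring the surgical plan. This paper proposes NeuralBoneReg, a label-free, point cloud-based neural registration method: an implicit neural network learns the preoperative bone model, and a multilayer perceptron performs global initialization and local refinement, yielding robust registration of bone surfaces across modalities. The method needs no inter-subject training data, and experiments on several datasets show that it matches or surpasses existing methods and generalizes well across anatomies and modalities.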

详情
AI中文摘要

在计算机和机器人辅助骨科手术(CAOS)中,基于术前影像的患者特定手术计划定义了目标位置和植入物轨迹。在手术过程中,这些计划必须准确传递,依赖于术前和术中数据之间的精确交叉注册。然而,不同成像模态之间的显著异质性使得这种注册具有挑战性且容易出错。因此,鲁棒、自动且与模态无关的骨表面注册在临床上非常重要。我们提出了NeuralBoneReg,一个自监督的基于表面的框架,使用3D点云作为与模态无关的表示来注册骨表面。NeuralBoneReg包括两个模块:一个学习术前骨模型的隐式神经无符号距离场(UDF),以及一个基于MLP的注册模块,通过生成变换假设来对齐术中点云与神经UDF,从而执行全局初始化和局部细化。与最先进的监督方法不同,NeuralBoneReg以自监督方式运行,无需跨受试者的训练数据。我们在两个公开的多模态数据集上评估了NeuralBoneReg与基线方法的性能:一个腓骨和胫骨的CT-超声数据集(UltraBones100k)和一个脊柱椎骨的CT-RGB-D数据集(SpineDepth)。评估还包括一个新引入的包含股骨和骨盆的尸体的CT-超声数据集(UltraBones-Hip),该数据集将公开提供。NeuralBoneReg在所有数据集上匹配或超越现有方法,在UltraBones100k上平均RRE/RTE为1.83°/2.02 mm,在UltraBones-Hip上为1.90°/1.56 mm,在SpineDepth上为3.78°/2.80 mm。这些结果证明了跨解剖结构和模态的强泛化能力,为CAOS提供了鲁棒且准确的跨模态对齐。

英文摘要

In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT-ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.83°/2.02 mm on UltraBones100k, 1.90°/1.56 mm on UltraBones-Hip, and 3.78°/2.80 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
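The abstract reports results as RRE/RTE without defining them. Assuming the usual registration definitions (relative rotation error as the angle of the residual rotation, relative translation error as the Euclidean distance between translations), the two metrics can be computed as in the sketch below. The function names and the `rotation_z` helper are introduced here for illustration and are not from the paper.

```python
import math


def rotation_z(degrees):
    """3x3 rotation about the z axis, used only to build test inputs."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rre_degrees(r_est, r_gt):
    """Angle of the residual rotation R_gt^T R_est, in degrees.
    Uses trace(R_gt^T R_est) = sum_ij R_gt[i][j] * R_est[i][j]."""
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))


def rte(t_est, t_gt):
    """Euclidean distance between translations, in the input unit (e.g. mm)."""
    return math.dist(t_est, t_gt)


rotation_error = rre_degrees(rotation_z(12.0), rotation_z(10.0))
translation_error = rte((1.0, 2.0, 2.0), (0.0, 0.0, 0.0))
```

In this toy case the estimated rotation is two degrees away from the ground truth about the same axis, and the translations differ by a vector of length three.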

2511.11051 2026-05-25 cs.CV

NP-LoRA: Null Space Projection for Subject-Style LoRA Fusion

NP-LoRA: 用于主题-风格LoRA融合的零空间投影

Chuheng Chen, Xiaofei Zhou, Geyuan Zhang, Yong Huang

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences

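AI summary This paper proposes NP-LoRA, a training-free method for fusing subject and style LoRAs for controllable generation. Taking a geometric view, it treats the content and style LoRAs as overlapping, non-orthogonal low-rank subspaces of a shared parameter space and regulates their interaction explicitly through projection. NP-LoRA defines a projection subspace from the principal directions of the style LoRA and projects the content LoRA onto its complement, suppressing interference along dominant style directions while preserving complementary information. Experiments on multiple pretrained LoRA pairs show more balanced content-style composition.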

详情
AI中文摘要

低秩适配(LoRA)融合能够组合主题和风格表示以实现可控生成,无需重新训练。然而,现有方法主要通过权重级合并操作,没有明确建模独立训练的LoRA如何在共享参数空间中交互。我们从几何角度看待LoRA融合,将内容和风格LoRA解释为占据重叠、非正交的低秩子空间,这种重叠可能导致参数更新冲突,影响生成质量。这一观察促使我们将LoRA融合重新表述为控制重叠子空间更新如何组合的问题,而不仅仅是参数组合。基于这一见解,我们提出零空间投影LoRA(NP-LoRA),一种无需训练的框架,采用投影作为融合算子来显式调节跨LoRA交互。具体而言,NP-LoRA使用风格LoRA的主方向定义投影子空间,并将内容LoRA投影到补子空间(即风格LoRA的零空间),从而抑制沿主导风格方向的干扰,同时保留互补信息。为避免硬投影的过度激进抑制,我们进一步将软投影表述为一个正则化优化问题,平衡内容保留与风格子空间抑制。该目标具有闭式解,产生一个由单一参数控制的投影算子,该参数在线性合并和硬投影之间连续插值。在多个预训练LoRA对上的大量实验表明,与强基线相比,NP-LoRA实现了更平衡的内容-风格组合,且无需重新训练。

英文摘要

Low-Rank Adaptation (LoRA) fusion enables the composition of subject and style representations for controllable generation without retraining. However, existing approaches primarily operate through weight-level merging, without explicitly modeling how independently trained LoRAs interact in the shared parameter space. We adopt a geometric perspective on LoRA fusion, interpreting content and style LoRAs as occupying overlapping, non-orthogonal low-rank subspaces, where such overlap can lead to conflicting parameter updates that affect generation quality. This observation motivates us to reformulate LoRA fusion not merely as parameter combination, but as a problem of controlling how updates from overlapping subspaces are combined. Based on this insight, we propose Null Space Projection LoRA (NP-LoRA), a training-free framework that employs projection as a fusion operator to explicitly modulate cross-LoRA interactions. Specifically, NP-LoRA uses principal directions of the style LoRA to define a projection subspace and projects the content LoRA onto the complementary subspace (i.e., the null space of the style LoRA), suppressing interference along dominant style directions while preserving complementary information. To avoid the overly aggressive suppression of hard projection, we further formulate soft projection as a regularized optimization problem that balances content preservation against style-subspace suppression. This objective admits a closed-form solution, yielding a projection operator controlled by a single parameter that continuously interpolates between linear merging and hard projection. Extensive experiments across multiple pretrained LoRA pairs show that NP-LoRA achieves more balanced content-style composition compared to strong baselines, without requiring retraining.
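The abstract states that the soft projection has a closed form but does not write it out. One regularized objective with the stated behaviour, assumed here purely for illustration, is to minimize ||x - c||^2 + mu * ||V^T x||^2 over x, where c is the content update and the columns of V are orthonormal style directions; its minimizer is x = (I - mu/(1+mu) * V V^T) c, which equals c at mu = 0 (plain merging) and approaches the hard projection (I - V V^T) c as mu grows. The sketch below applies this to plain vectors; the paper's actual objective, and whether the operator acts on the left or right of a weight matrix, may differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_project(content, style_dirs, mu):
    """Shrink the components of `content` along each orthonormal style
    direction by the factor mu / (1 + mu) (assumed closed form)."""
    shrink = mu / (1.0 + mu)
    result = list(content)
    for direction in style_dirs:
        coeff = dot(content, direction)
        result = [r - shrink * coeff * d for r, d in zip(result, direction)]
    return result


def hard_project(content, style_dirs):
    """Remove the style-direction components entirely (null space projection)."""
    result = list(content)
    for direction in style_dirs:
        coeff = dot(content, direction)
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result


def merge(content, style, style_dirs, mu):
    """Add the style update to the softly projected content update."""
    projected = soft_project(content, style_dirs, mu)
    return [p + s for p, s in zip(projected, style)]


content_update = [3.0, 4.0, 0.0]
style_update = [2.0, 0.0, 0.0]
style_dirs = [[1.0, 0.0, 0.0]]  # one orthonormal principal style direction

no_projection = soft_project(content_update, style_dirs, mu=0.0)
halfway = soft_project(content_update, style_dirs, mu=1.0)
hard = hard_project(content_update, style_dirs)
fused = merge(content_update, style_update, style_dirs, mu=1.0)
```

The single parameter `mu` plays the role of the interpolation knob mentioned in the abstract: small values keep the content update almost untouched, large values approach the hard projection.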

2511.02239 2026-05-25 cs.RO cs.AI

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

LACY: 基于视觉-语言模型的语言-动作循环用于自我改进的机器人操作

Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi

Affiliations * Department of Electrical and Computer Engineering, University of Minnesota

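AI summary This paper proposes LACY, a Language-Action Cycle framework built on a vision-language model and aimed at better policy generalization in robotic manipulation. A single model is trained jointly on three tasks: language to action (L2A), action to language (A2L), and language-to-language consistency verification (L2C). The robot can therefore both execute tasks and explain its own behavior, which supports richer internal representations. LACY uses an active augmentation strategy to generate and filter training data autonomously, without additional human labels; in pick-and-place experiments it improves task success rates by 56.46% on average and yields more robust language-action grounding.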

Comments Accepted to ICRA 2026. Project page: https://vla2026.github.io/LACY/

详情
AI中文摘要

学习机器人操作的可泛化策略越来越依赖于将语言指令映射到动作(L2A)的大规模模型。然而,这种单向范式通常产生执行任务而缺乏更深层次上下文理解的策略,限制了它们泛化或解释其行为的能力。我们认为,将动作映射回语言(A2L)的互补技能对于发展更全面的基础至关重要。一个既能行动又能解释其动作的智能体可以形成更丰富的内部表示,并开启自我监督学习的新范式。我们引入了LACY(语言-动作循环),一个统一的框架,在单个视觉-语言模型内学习这种双向映射。LACY在三个协同任务上联合训练:从语言生成参数化动作(L2A)、用语言解释观察到的动作(A2L)以及验证两个语言描述之间的语义一致性(L2C)。这实现了一个自我改进的循环,通过针对低置信度案例的主动增强策略自主生成和过滤新的训练数据,从而在没有额外人工标注的情况下改进模型。在仿真和真实世界的拾取-放置任务上的实验表明,LACY平均将任务成功率提高了56.46%,并为机器人操作产生了更稳健的语言-动作基础。项目页面:https://vla2026.github.io/LACY/

英文摘要

Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/

2510.26411 2026-05-25 cs.AI

MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

MedSAE: 用稀疏自编码器剖析MedCLIP表示

Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

Affiliations * University of Turin; École polytechnique

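AI summary This paper proposes MedSAE, which applies sparse autoencoders to the latent space of the medical vision-language model MedCLIP to make it more interpretable. The study introduces an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming. Experiments show that MedSAE neurons are more monosemantic and interpretable than raw MedCLIP features, helping to bridge performance and transparency in medical AI.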

Comments Accepted at ICIP 2026

详情
AI中文摘要

医疗保健中的人工智能需要既准确又可解释的模型。我们通过将医学稀疏自编码器(MedSAEs)应用于MedCLIP(一个在胸部X光片和报告上训练的视觉-语言模型)的潜在空间,推进了医学视觉中的机制可解释性。为了量化可解释性,我们提出了一个评估框架,该框架结合了相关性指标、熵分析以及通过MedGemma基础模型进行的自动神经元命名。在CheXpert数据集上的实验表明,MedSAE神经元比原始MedCLIP特征具有更高的单语义性和可解释性。我们的研究结果弥合了高性能医学AI与透明度之间的差距,为迈向临床可靠的表示提供了可扩展的一步。支持本研究结果的源代码可在https://github.com/EIDOSLAB/MedSAE获取。

英文摘要

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGemma foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations. The source code supporting the findings of this study is available at https://github.com/EIDOSLAB/MedSAE.

2510.22941 2026-05-25 cs.LG

Hazard-Responsive Digital Twin for Climate-Driven Urban Resilience and Equity

面向气候驱动的城市韧性与公平的灾害响应数字孪生

Zhenglai Shen, Hongyu Zhou

Affiliations * Buildings and Transportation Science Division, Oak Ridge National Laboratory; Civil and Environmental Engineering, University of Tennessee

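AI summary To address compounding climate hazards such as wildfire-induced outages and urban heatwaves, this paper proposes a Hazard-Responsive Digital Twin (H-RDT) that combines physics-informed neural networks, multimodal data fusion, and equity-aware risk analytics to improve the resilience and equity of urban hazard response. In a synthetic district, the system keeps indoor temperature predictions stable under partial sensor loss; a reinforcement learning module adaptively fuses IoT, UAV, and satellite data, and an equity-adjusted mapping isolates high-vulnerability clusters such as schools, clinics, and low-income housing. Interventions such as preemptive cooling-center activation and microgrid sharing reduce population-weighted thermal risk and tail risk, supporting adaptive, equity-centered decisions for urban climate adaptation.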

Comments 52 pages, 9 figures

Journal ref Sustainable Cities and Society 144 (2026) 107413

详情
AI中文摘要

复合气候灾害,如野火引发的停电和城市热浪,挑战着城市的稳定性和公平性。我们提出一种灾害响应数字孪生(H-RDT),它结合了物理信息神经网络建模、多模态数据融合和公平感知风险分析,用于城市尺度的响应。在一个包含多种建筑原型和人群的合成区域中,模拟的野火-停电-热浪级联事件表明,H-RDT 在部分传感器缺失的情况下能维持稳定的室内温度预测(约31至33°C),再现停电引发的温度激增和恢复。基于强化学习的融合模块自适应地重新加权物联网、无人机和卫星输入,以维持时空覆盖,而公平调整的映射则隔离出高脆弱性集群(学校、诊所、低收入住房)。前瞻性干预措施,如预防性冷却中心启动和微电网共享,将人口加权热风险降低11%至13%,将95百分位(尾部)风险缩小7%至17%,并将过热小时数减少高达9%。除了合成演示之外,该框架为实际城市实施建立了可迁移的基础,将物理灾害建模与社会公平和决策智能联系起来。H-RDT 推动数字城市韧性向自适应、基于学习和以公平为中心的决策支持发展,以应对气候适应。

英文摘要

Compounding climate hazards, such as wildfire-induced outages and urban heatwaves, challenge the stability and equity of cities. We present a Hazard-Responsive Digital Twin (H-RDT) that combines physics-informed neural network modeling, multimodal data fusion, and equity-aware risk analytics for urban-scale response. In a synthetic district with diverse building archetypes and populations, a simulated wildfire-outage-heatwave cascade shows that H-RDT maintains stable indoor temperature predictions (approximately 31 to 33 °C) under partial sensor loss, reproducing outage-driven surges and recovery. The reinforcement-learning-based fusion module adaptively reweights IoT, UAV, and satellite inputs to sustain spatiotemporal coverage, while the equity-adjusted mapping isolates high-vulnerability clusters (schools, clinics, low-income housing). Prospective interventions, such as preemptive cooling-center activation and microgrid sharing, reduce population-weighted thermal risk by 11 to 13 percent, shrink the 95th-percentile (tail) risk by 7 to 17 percent, and cut overheating hours by up to 9 percent. Beyond the synthetic demonstration, the framework establishes a transferable foundation for real-city implementation, linking physical hazard modeling with social equity and decision intelligence. The H-RDT advances digital urban resilience toward adaptive, learning-based, and equity-centered decision support for climate adaptation.

2510.21270 2026-05-25 cs.CL cs.AI cs.CV

Sparser Block-Sparse Attention via Token Permutation

通过令牌置换实现更稀疏的块稀疏注意力

Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

Affiliations * Fudan University; Shanghai Innovation Institute

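AI summary As the context length of large language models grows, computational cost rises sharply, with the quadratic complexity of self-attention as the main bottleneck. This paper proposes Permuted Block-Sparse Attention (PBS-Attn), a sparse attention method that reorders tokens to increase block-level sparsity, improving computational efficiency while preserving model accuracy. On several long-context datasets it outperforms existing block-sparse attention methods and achieves an end-to-end prefilling speedup of up to 2.75x.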

Comments ICML 2026

详情
AI中文摘要

扩展大语言模型(LLM)的上下文长度带来了显著的好处,但计算成本高昂。这种成本主要源于自注意力机制,其相对于序列长度的$O(N^2)$复杂度在内存和延迟方面构成了主要瓶颈。幸运的是,注意力矩阵通常是稀疏的,尤其是对于长序列,这为优化提供了机会。块稀疏注意力已成为一种有前景的解决方案,它将序列划分为块并跳过其中一部分块的计算。然而,该方法的有效性高度依赖于底层的注意力模式,这可能导致次优的块级稀疏性。例如,单个块内查询的重要键令牌可能分散在许多其他块中,导致计算冗余。在这项工作中,我们提出了置换块稀疏注意力(PBS-Attn),这是一种即插即用的方法,利用注意力的置换性质来增加块级稀疏性并提高LLM预填充的计算效率。我们在具有挑战性的真实世界长上下文数据集上进行了全面实验,结果表明PBS-Attn在模型精度上始终优于现有的块稀疏注意力方法,并紧密匹配全注意力基线。借助我们自定义的permuted-FlashAttention内核,PBS-Attn在长上下文预填充中实现了高达2.75倍的端到端加速,证实了其实用性。代码可在https://github.com/xinghaow99/pbs-attn获取。

英文摘要

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
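The permutation property that this abstract relies on can be checked on a toy example: reordering keys and values together leaves unmasked softmax attention unchanged, so tokens can be regrouped to concentrate the important keys in fewer blocks. The sketch below is a simplification, not the PBS-Attn algorithm: it ignores causal masking and score scaling, and the helper names `attention`, `permute`, and `blocks_touched` are introduced here for illustration.

```python
import math


def attention(queries, keys, values):
    """Plain softmax attention without masking or scaling, for illustration."""
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                        for d in range(dim)])
    return outputs


def permute(rows, order):
    return [rows[i] for i in order]


def blocks_touched(key_positions, block_size):
    """Number of key blocks containing at least one important key."""
    return len({pos // block_size for pos in key_positions})


queries = [[1.0, 0.0], [0.5, 0.5]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
order = [2, 0, 3, 1]

reference = attention(queries, keys, values)
shuffled = attention(queries, permute(keys, order), permute(values, order))
max_gap = max(abs(a - b) for r, s in zip(reference, shuffled) for a, b in zip(r, s))

# Important keys spread over four blocks versus gathered into one block.
scattered = blocks_touched([0, 5, 10, 15], block_size=4)
gathered = blocks_touched([0, 1, 2, 3], block_size=4)
```

Here `max_gap` is zero up to floating-point rounding, while gathering the important keys reduces the number of blocks that must be computed, which is the source of the extra block-level sparsity.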

2510.16335 2026-05-25 cs.CV

On the Provable Importance of Gradients for Language-Assisted Image Clustering

关于梯度在语言辅助图像聚类中可证明的重要性

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

Affiliations * University of Technology Sydney

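AI summary This paper studies Language-assisted Image Clustering (LaIC), which uses textual semantics to make image representations more discriminable and thereby improve clustering. Because true class names are unavailable, a core challenge is filtering positive nouns, those semantically close to the images, from an unlabeled corpus. The authors propose GradNorm, a gradient-based framework that scores each noun's positiveness by the magnitude of back-propagated cross-entropy gradients, provide a theoretical error bound, and prove that the method subsumes existing filtering strategies. Experiments show state-of-the-art clustering performance on multiple benchmarks.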

Comments revised and extended version of ICCV2025

详情
AI中文摘要

本文研究了最近出现的语言辅助图像聚类(LaIC)问题,其中利用文本语义来改善视觉表示的可区分性以促进图像聚类。由于真实类别名称不可用,LaIC的核心挑战之一在于如何从未标记的野生语料数据中筛选正名词,即那些与感兴趣图像语义接近的名词。现有的筛选策略主要基于CLIP学习的现成特征空间;然而,尽管直观,这些策略缺乏严格的理论基础。为了填补这一空白,我们提出了一种新颖的基于梯度的框架,称为GradNorm,该框架具有理论保证并表现出强大的实证性能。特别地,我们根据从预测目标分布与softmax输出之间的交叉熵反向传播的梯度大小来衡量每个名词的正性。理论上,我们提供了严格的误差界来量化GradNorm对正名词的可分离性,并证明GradNorm自然地将现有筛选策略作为其极端特例。实证上,大量实验表明GradNorm在各种基准测试上达到了最先进的聚类性能。代码公开于\href{https://github.com/60pen9/On-the-Provable-Importance-of-Gradients-for-Language-Assisted-Image-Clustering}{此处}。

英文摘要

This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks. Code is publicly available at https://github.com/60pen9/On-the-Provable-Importance-of-Gradients-for-Language-Assisted-Image-Clustering
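The scoring idea in this abstract rests on a standard identity: for a target distribution p that sums to one, the gradient of the cross-entropy between p and softmax(z) with respect to the logits z is softmax(z) - p. The sketch below computes that gradient and its L1 norm. It is an illustration only: the abstract does not say which parameters the gradient is taken with respect to or how the target distribution is built, so the logit-level gradient, the L1 norm, and the uniform target used in the example are all assumptions.

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


def logit_gradient(target, logits):
    """d CE / d logits = softmax(logits) - target, valid when target sums to 1."""
    return [p - t for p, t in zip(softmax(logits), target)]


def grad_norm_score(target, logits):
    """L1 norm of the logit gradient (norm choice is an assumption)."""
    return sum(abs(g) for g in logit_gradient(target, logits))


uniform = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]  # illustrative target only
flat_score = grad_norm_score(uniform, [0.0, 0.0, 0.0])
peaked_score = grad_norm_score(uniform, [2.0, 0.0, 0.0])
```

With this illustrative target, a flat prediction yields a zero score and a confident prediction yields a larger one, so the gradient magnitude separates the two cases; how that maps onto positive and negative nouns depends on details not given in the abstract.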

2510.12787 2026-05-25 cs.AI cs.MA

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics


Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund

Affiliations * Axiomatic_AI; Massachusetts Institute of Technology (MIT); Institut de Ciències Fotòniques (ICFO); Institució Catalana de Recerca i Estudis Avançats (ICREA)

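AI summary This paper presents Ax-Prover, a multi-agent system for automated theorem proving in the Lean proof assistant. It can solve problems across scientific domains such as mathematics and quantum physics, and it can run autonomously or in collaboration with human experts. The system combines the reasoning ability of large language models with the strict formal verification of Lean tools, connected through the Model Context Protocol so that knowledge and formal correctness work together. Experiments show that Ax-Prover performs well on several benchmarks and, on the newly introduced abstract algebra and quantum theory benchmarks in particular, clearly outperforms existing methods, which demonstrates its generality and effectiveness for cross-domain formal verification.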

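AI abstract (translated from Chinese)

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across different scientific domains and can run autonomously or in collaboration with human experts. To this end, Ax-Prover treats scientific problem solving as formal proof generation, a process that requires both creative reasoning and strict syntactic rigor. Ax-Prover addresses this challenge by using the Model Context Protocol (MCP) to combine large language models (LLMs), which supply knowledge and reasoning, with Lean tools, which ensure formal correctness. To evaluate it as an autonomous prover, we compare our approach with frontier LLMs and specialized prover models on two public mathematics benchmarks and on two Lean benchmarks that we introduce in abstract algebra and quantum theory. On the public datasets, Ax-Prover is competitive with state-of-the-art provers, and on the new benchmarks it surpasses them by a wide margin. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic approach to theorem proving offers a generalizable methodology for formal verification across scientific domains. We also demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.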

English abstract

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools, which ensure formal correctness, via the Model Context Protocol (MCP). To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it outperforms them by a wide margin on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.
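To illustrate what "formal correctness" means in this setting, here is a minimal Lean 4 sketch. The statement is a toy example chosen for illustration and is not taken from the paper's benchmarks; the theorem name is hypothetical. Lean accepts a proof only if every step type-checks, which is the kind of check a prover's candidate output has to pass.

```lean
-- Toy statement chosen for illustration; it is not from the paper's benchmarks.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- The same goal closed in tactic mode by rewriting with the library lemma.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```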

2510.12328 2026-05-25 cs.LG

Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand


Kiattikun Chobtham, Kanoksri Sarinnapakorn, Kritanai Torsri, Prattana Deeprasertkul, Jirawan Kamma

Affiliations * Hydro-Informatics Institute, Ministry of Higher Education, Science, Research and Innovation

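AI summary This paper proposes a new model that combines physics-informed graph neural networks with extreme-value analysis to improve the prediction of extreme rainfall in Thailand. The method represents rain-gauge stations as a graph to capture complex spatiotemporal patterns and uses teleconnections to improve interpretability. The model couples a graph attention mechanism grounded in orographic-precipitation physics with a long short-term memory network, and handles extremes with a spatial season-aware generalized Pareto distribution method. Experiments show that it predicts better than existing methods in several regions, especially those where extreme events are frequent, and it provides practical high-resolution forecasts in support of long-term water resource management.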

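AI abstract (translated from Chinese)

Accurate rainfall forecasting, especially of extreme events, remains a major challenge in climatology and the Earth system. This paper proposes novel physics-informed graph neural networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall prediction across Thailand. The model uses a graph-structured representation of the stations to capture complex spatiotemporal patterns and provides interpretability through teleconnections. We preprocess the relevant climate indices that may influence regional rainfall. The proposed graph attention network with long short-term memory (Attention-LSTM) applies the attention mechanism using initial edge features based on a simple orographic-precipitation physics formulation. The embeddings are then processed by LSTM layers. To handle extremes, we perform peak-over-threshold (POT) mapping with a novel spatial season-aware generalized Pareto distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments show that our method outperforms established baselines in most regions, including those prone to extreme events, and remains strongly competitive with state-of-the-art methods. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement for producing high-resolution maps that support long-term water management decisions.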

English abstract

Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from a simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce high-resolution maps that support decision-making in long-term water management.
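The peak-over-threshold step can be sketched with the standard library alone. This is a generic illustration, not the paper's Spatial Season-aware GPD method: the method-of-moments fit, the per-(station, season) quantile threshold, the synthetic exponential rainfall, and all names are assumptions. The fit uses the GPD identities mean = σ/(1-ξ) and variance = σ²/((1-ξ)²(1-2ξ)), which give ξ = (1 - m²/v)/2 and σ = m(1 + m²/v)/2 for sample mean m and sample variance v (valid for ξ < 0.5).

```python
import math
import random
import statistics

def excesses_over_threshold(values, threshold):
    """Return v - threshold for every value strictly above the threshold."""
    return [v - threshold for v in values if v > threshold]

def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit; returns (shape xi, scale sigma).

    Needs at least two excesses with non-zero variance, and assumes xi < 0.5.
    """
    m = statistics.fmean(excesses)
    v = statistics.variance(excesses)
    ratio = m * m / v
    return 0.5 * (1.0 - ratio), 0.5 * m * (1.0 + ratio)

def gpd_survival(y, xi, sigma):
    """P(excess > y) under a GPD with shape xi and scale sigma, for y >= 0."""
    if abs(xi) < 1e-9:                      # xi -> 0 is the exponential limit
        return math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    return base ** (-1.0 / xi) if base > 0 else 0.0

def fit_by_group(records, quantile=0.9):
    """Fit one GPD per (station, season), each with its own threshold.

    records: iterable of (station, season, rainfall). Grouping by station and
    season is an assumed stand-in for a "spatial season-aware" treatment.
    """
    groups = {}
    for station, season, rain in records:
        groups.setdefault((station, season), []).append(rain)
    fits = {}
    for key, vals in groups.items():
        vals = sorted(vals)
        threshold = vals[int(quantile * (len(vals) - 1))]
        exc = excesses_over_threshold(vals, threshold)
        if len(exc) >= 2:
            xi, sigma = fit_gpd_moments(exc)
            fits[key] = (threshold, xi, sigma)
    return fits

# Synthetic data (assumed): exponential daily rainfall with season-dependent scale.
rng = random.Random(0)
records = [("S1", "wet", rng.expovariate(1 / 10.0)) for _ in range(20000)]
records += [("S1", "dry", rng.expovariate(1 / 2.0)) for _ in range(20000)]
fits = fit_by_group(records)
for key, (threshold, xi, sigma) in sorted(fits.items()):
    print(key, round(threshold, 2), round(xi, 3), round(sigma, 2))
```

Fitting each group against its own threshold lets a wet-season tail and a dry-season tail receive different scale parameters instead of sharing one global fit.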

2510.09450 2026-05-25 cs.CV

Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement Under Extreme Noise


Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai

Affiliations * Visual Information Laboratory, University of Bristol, United Kingdom

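AI summary This paper studies low-light video enhancement (LLVE) under extreme noise. Because existing learning-based methods perform poorly under the heavy noise found in real scenes, it proposes DWTA-Net, a new deep-learning recurrent framework. The method uses a two-stage architecture: the first stage performs temporally consistent Mamba-based enhancement through multi-frame alignment, and the second stage performs recurrent refinement using optical-flow-driven temporal aggregation guided by dynamic weights, which improves the visual quality of the video. Experiments show that DWTA-Net outperforms existing state-of-the-art methods in noise suppression and detail preservation.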

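AI abstract (translated from Chinese)

Low-light video enhancement (LLVE) is challenging because of noise, low contrast, and color degradation. Although learning-based methods offer fast inference, they often fail under severe real-world noise because they do not make full use of long-term temporal cues. We propose DWTA-Net, a novel deep-learning LLVE framework with a recurrent design. DWTA-Net uses an integrated two-stage architecture: the first stage restores local structure and color through multi-frame alignment, achieving temporally consistent Mamba-based enhancement; the second stage performs recurrent refinement with a novel dynamic weight-based temporal aggregation (guided by optical flow), acting as a motion-adaptive recurrent denoiser. We further introduce a texture-adaptive loss that preserves detail in textured regions while suppressing noise in homogeneous regions. Experiments on real low-light videos show that DWTA-Net achieves stronger noise suppression and fewer artifacts, and delivers better visual quality than state-of-the-art methods.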

English abstract

Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradation. While learning-based methods enable fast inference, they often fail under heavy real-world noise because they do not sufficiently exploit long-term temporal cues. We propose DWTA-Net, a novel deep-learning LLVE framework with a recurrent design. DWTA-Net adopts an integrated two-stage architecture: Stage I restores local structure and color via multi-frame alignment for temporally consistent Mamba-based enhancement, while Stage II performs recurrent refinement using a novel dynamic weight-based temporal aggregation guided by optical flow, functioning as a recurrent denoiser that adapts to motion. We further introduce a texture-adaptive loss that preserves fine details in textured regions while suppressing noise in homogeneous areas. Experiments on real-world low-light footage show that DWTA-Net achieves stronger noise suppression and fewer artifacts, delivering superior visual quality compared with state-of-the-art methods.
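The intuition behind motion-adaptive temporal aggregation can be shown with a hand-crafted stand-in. This is not DWTA-Net: the network learns its weights and aligns frames with optical flow, whereas the sketch below assumes the previous output is already aligned, treats each frame as a flat list of pixel intensities, and uses a fixed Gaussian weight on the frame difference. The function names and the `sigma` parameter are hypothetical.

```python
import math

def aggregate(current, previous, sigma=0.1):
    """Blend the current frame with the (already aligned) previous output.

    Per-pixel weight w = exp(-(c - p)^2 / (2 * sigma^2)) is close to 1 where
    the two frames agree (static content, so averaging reduces noise) and
    close to 0 where they differ (motion, so the current pixel is kept).
    """
    out = []
    for c, p in zip(current, previous):
        w = math.exp(-((c - p) ** 2) / (2.0 * sigma ** 2))
        out.append((c + w * p) / (1.0 + w))
    return out

def recurrent_denoise(frames, sigma=0.1):
    """Run the aggregation recurrently, feeding each output into the next step."""
    state = list(frames[0])
    outputs = [state]
    for frame in frames[1:]:
        state = aggregate(frame, state, sigma)
        outputs.append(state)
    return outputs

# Toy sequence (assumed values): pixel 0 is static with small noise,
# pixel 1 jumps from dark to bright in the last frame.
frames = [[0.50, 0.10],
          [0.52, 0.10],
          [0.48, 0.90]]
for out in recurrent_denoise(frames):
    print([round(x, 3) for x in out])
```

Because each output is a convex combination of the current pixel and the previous output, the result always stays between the two, and a large frame difference drives the weight toward zero so that moving content is not blurred.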