arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.01662 2026-05-05 cs.CV

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models


Martin Q. Ma, Willis Guo, Aditya Agrawal, Ankit Gupta, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

Affiliations: Carnegie Mellon University; MIT

AI summary: This paper proposes Video Active Perception, a method grounded in active perception theory that improves long-form video question answering, achieving up to 5.6× higher frame efficiency than existing models.

Comments: ICCV 2025 workshop

Abstract

Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we introduce Video Active Perception (VAP), a training-free method to enhance long-form video QA using VLMs. Our approach treats keyframe selection as data acquisition in active perception and leverages a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, improving frame efficiency, measured in frames per question, by up to 5.6× over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. Moreover, VAP shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to questions. These findings highlight the potential of leveraging active perception to improve the frame effectiveness and efficiency of long-form video QA.
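The selection idea in this abstract, keeping the frames that differ most from what a prior model expects, can be sketched as a top-k ranking by surprise. This is an illustration only: `select_keyframes`, the toy feature vectors, and the squared-distance surprise score are assumptions rather than the paper's implementation, and in VAP the expectation would come from a text-conditioned video generation model.

```python
def select_keyframes(observed, expected, k):
    """Return indices of the k frames whose features differ most from the prior.

    observed, expected: lists of equal-length feature vectors, one per frame.
    Both are hypothetical inputs; `expected` stands in for a generative prior.
    """
    def surprise(a, b):
        # Squared Euclidean distance between observed and expected features.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = [surprise(o, e) for o, e in zip(observed, expected)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore temporal order for the downstream VLM


observed = [[0.0, 0.0], [1.0, 0.0], [0.1, 0.1], [3.0, 4.0]]
expected = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
keyframes = select_keyframes(observed, expected, k=2)
```

Frames 1 and 3 deviate most from the all-zero expectation, so they are the ones passed on; frames that match the prior carry little new information and are dropped.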

2605.01659 2026-05-05 cs.CV cs.AI

TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning


Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Affiliations: Universitat Pompeu Fabra; Universitat Autònoma de Barcelona

AI summary: TRIMMER is a two-stage self-supervised reinforcement learning framework for video summarization that uses entropy-based rewards to capture long-range temporal dependencies and semantic structure, achieving state-of-the-art results among unsupervised and self-supervised methods.

Abstract

The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.

2605.01657 2026-05-05 cs.CV

Act2See: Emergent Active Visual Perception for Video Reasoning


Martin Q. Ma, Yuxiao Qu, Aditya Agrawal, Willis Guo, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

Affiliations: Carnegie Mellon University; MIT

AI summary: This paper proposes Act2See, a framework that lets VLMs actively interleave video frames within their textual reasoning, enabling active visual perception for video reasoning and outperforming existing methods.

Comments: CVPR 2026

Abstract

Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.

2605.01653 2026-05-05 cs.CV

SteeringDiffusion: A Bottlenecked Activation Control Interface for Diffusion Models


Fangzheng Wu, Brian Summa

Affiliations: Department of Computer Science

AI summary: SteeringDiffusion controls the content-style trade-off of diffusion models through a bottlenecked activation-level interface. It keeps the U-Net frozen and learns a small prompt-conditioned latent code, giving controllable and stable generation that outperforms LoRA under matched parameter budgets.

Abstract

We introduce SteeringDiffusion, a bottlenecked activation-level control interface for diffusion models that exposes a smooth, monotonic, and runtime-adjustable control surface over the content-style trade-off. Our method keeps the U-Net backbone frozen and learns a small, prompt-conditioned latent code projected to FiLM/AdaGN-style modulation parameters. A zero-initialized design guarantees exact equivalence to the base model at zero scale, while timestep-aware gating restricts modulation to later denoising stages. A single scalar at inference continuously traverses the control surface without retraining. Across experiments on Stable Diffusion 1.5 and SDXL covering multiple artistic styles, we show that SteeringDiffusion produces smooth and monotonic content-style trade-offs. Under matched parameter budgets, it outperforms LoRA in controllability and stability, while ControlNet and rank-1 adapters do not expose a comparable control surface. We further introduce an inversion-stability diagnostic based on DDIM inversion, used as a post-hoc trajectory probe, which reveals strong correlations with intervention magnitude. These results position Steering Bottlenecked Explicit Control (S-BEC) as a practical, general-purpose control interface for frozen diffusion backbones.
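A minimal sketch of how a single runtime scalar can drive FiLM-style modulation with a timestep gate, under stated assumptions: the function names, the hard-threshold gate, the `t_start` value, and the convention that `t` is denoising progress in [0, 1] are all hypothetical and not taken from the paper.

```python
def gate(t, t_start=0.5):
    """Hypothetical timestep gate: off early in denoising, on once progress t >= t_start."""
    return 1.0 if t >= t_start else 0.0


def steer(h, gamma, beta, scale, t):
    """FiLM-style modulation h * (1 + a * gamma) + a * beta, with a = scale * gate(t)."""
    a = scale * gate(t)
    return [x * (1.0 + a * g) + a * b for x, g, b in zip(h, gamma, beta)]


h = [0.5, -1.0, 2.0]        # activations of the frozen backbone (illustrative values)
gamma = [0.2, 0.0, -0.1]    # modulation parameters projected from the latent code
beta = [0.1, 0.3, 0.0]

unchanged = steer(h, gamma, beta, scale=0.0, t=0.9)  # zero scale: base model
early = steer(h, gamma, beta, scale=1.0, t=0.1)      # gated off at early steps
styled = steer(h, gamma, beta, scale=1.0, t=0.9)     # modulation active
```

At zero scale, or while the gate is closed, the activations pass through untouched, which mirrors the exact-equivalence property the abstract describes; sweeping `scale` then moves continuously away from the base model.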

2605.01650 2026-05-05 cs.LG

Geospatial foundation-model embeddings improve population estimation unevenly across space and scale


Wenbin Zhang, Eimear Cleary, Francisco Rowe, Somnath Chaudhuri, Maksym Bondarenko, Shengjie Lai, Andrew J. Tatem

Affiliations: WorldPop, School of Geography and Environmental Sciences, University of Southampton, United Kingdom; Geographic Data Science Lab, Department of Geography and Planning, School of Environmental Sciences, University of Liverpool, United Kingdom

AI summary: This paper evaluates geospatial foundation-model embeddings for subnational population estimation in Brazil, Nigeria, and the United States. The embeddings improve predictions in data-poor areas but degrade under spatial scale mismatch, exposing a limitation of current geospatial AI.

Abstract

Reliable subnational population estimates are essential for applications, yet remain difficult where censuses are sparse, outdated or spatially coarse. Existing population-mapping workflows rely on hand-built geospatial covariates, such as settlement extent, night-time lights, and environmental conditions, which must be assembled and harmonised across scales and geographies. Geospatial foundation models offer an alternative by learning reusable representations of place from more multifaceted and heterogeneous data sources. Here, we benchmark Population Dynamics Foundation Model (PDFM) embeddings against the harmonised geospatial covariates for subnational population estimation in Brazil, Nigeria and the United States. Under geographically structured validation, PDFM improved predictive fit, with a median 20.1% reduction in unexplained variance (IQR: 10.0-33.2%, across country-model comparisons), and reduced Kullback-Leibler divergence by 23.2% (9.2-26.2%). However, these gains were uneven. PDFM was most advantageous where the geospatial covariates weakly characterised settlement context, such as larger and less-developed subnational areas. Moreover, PDFM performance was scale-coupled, with embeddings providing less flexible transfer across spatial aggregations than geospatial covariates. These findings show that geospatial foundation-model representations of place can improve population estimation in data-poor settings, but their benefits break down predictably under spatial scale mismatch, revealing a fundamental limitation of current geospatial AI.

2605.01647 2026-05-05 cs.CL

Beyond Perplexity: Character Distribution Signatures and the MDTA Benchmark for AI Text Detection


Priyadarshan Narayanasamy, Swastik Agrawal, Klint Faber, Fardina Fathmiul Alam

Affiliations: University of Maryland, College Park

AI summary: This paper proposes an AI text detection signal based on character distribution signatures and validates it on the MDTA benchmark, showing low correlation with perplexity-based methods and pronounced gains in specialized domains.

Comments: 11 figures, 10 tables, 24 pages, Under Review at COLM 2026

Abstract

Training-free AI text detection methods primarily rely on model log-probabilities, achieving strong performance through approaches like Binoculars and DNA-DetectLLM. However, these methods face a fundamental ceiling as models are optimized through RLHF to produce human-like probability distributions. We introduce an alternative detection signal based on character distribution signatures. We provide theoretical foundations showing that AI models, trained on massive domain-balanced corpora, approximate global character patterns while humans exhibit domain-specialized distributions, creating a "Wall of Separation" where human-AI divergence significantly exceeds AI-AI divergence. To enable systematic evaluation, we construct the Models-Domains-Temperatures-Adversarials (MDTA) benchmark comprising 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and 3 adversarial strategies, substantially expanding the HC3 dataset with modern model responses, temperature variation, and adversarial augmentation. We introduce the Letter Distribution Score (LD-Score), demonstrating low correlation (r = 0.08-0.13) with perplexity methods. When integrated with DNA-DetectLLM, Binoculars and FastDetectGPT via a non-linear classifier, LD-Score yields consistent improvements in AUROC and F1, with particularly pronounced gains in specialized domains where vocabulary constraints amplify the detection signal. The MDTA dataset can be accessed at: https://huggingface.co/datasets/nsp909/MDTA.
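To make the notion of a character distribution signature concrete, the sketch below compares letter-frequency distributions with the Jensen-Shannon divergence. It is not the paper's LD-Score, whose definition the abstract does not give; the choice of divergence, the restriction to ASCII letters, and the sample strings are assumptions made for illustration.

```python
import math
import string
from collections import Counter


def letter_distribution(text):
    """Relative frequency of each ASCII letter a-z, case-insensitive."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in string.ascii_lowercase]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


reference = letter_distribution("the quick brown fox jumps over the lazy dog")
same = js_divergence(reference, reference)
different = js_divergence(reference, letter_distribution("zzzz qqqq xxxx"))
```

A score of this kind depends only on surface character statistics, which is consistent with the abstract's observation that such a signal correlates weakly with log-probability-based detectors.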

2605.01640 2026-05-05 cs.LG cs.CL

Prescriptive Scaling Laws for Data Constrained Training


Justin Lovelace, Christian Belardi, Srivatsa Kundurthy, Shriya Sudhakar, Kilian Q. Weinberger

Affiliations: Department of Computer Science

AI summary: This paper proposes a scaling law for data-constrained training that adds an overfitting penalty for repeated data. It shows that repetition beyond a certain point is counterproductive and that compute is then better spent on model capacity, and it uses the penalty coefficient to compare training configurations such as weight decay settings.

Abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay ($λ=1.0$) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.

2605.01638 2026-05-05 cs.CV

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection


Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Xinze Li, Bingyu Zhu, Wuhui Duan, Congang Chen, Zeyu Fu, Yi Dong, Baoyuan Wu, Jason Li, Guangliang Cheng

Affiliations: University of Liverpool; University of Exeter; The Chinese University of Hong Kong, Shenzhen; Nanyang Technological University

AI summary: This paper presents Omni-Fake, a multimodal deepfake detection benchmark with a large-scale dataset and an out-of-distribution benchmark that supports a joint detection-localization-explanation protocol, and proposes a reinforcement-learning-driven multimodal detector that improves detection accuracy and explainability.

Comments: Accepted to CVPR 2026

Abstract

Multimodal deepfakes are proliferating on social media and threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: https://tianxiao1201.github.io/omni-fake-project-page/

2605.01637 2026-05-05 cs.LG cs.CC cs.DM math.CO

The Banach-Butterfly Invariant: Influence-Adaptive Walsh Geometry for Ternary Polynomial Threshold Functions


Gorgi Pavlov

Affiliations: Lehigh University; Johnson and Johnson

AI summary: This paper introduces the Banach-Butterfly invariant, an influence-adaptive Walsh geometry for analyzing ternary polynomial threshold functions. It proves that the invariant is strictly Schur-convex in the influence vector, so it separates functions with identical total influence, and establishes its properties as a contraction invariant.

Comments: 21 pages, 3 figures. Theory paper; LLM-application companion in preparation. Code, certificates, and 616,126 NPN-canonical n=5 representatives in supplementary repository

Abstract

We introduce the Banach-Butterfly Invariant (BBT), an influence-adaptive Banach geometry on the Walsh-Hadamard butterfly factorization. For a Boolean function $f:\{-1,+1\}^n\to\{-1,+1\}$ with coordinate influences $\mathrm{Inf}_\ell(f)$, BBT assigns exponent $p_\ell = 1+\mathrm{Inf}_\ell(f)$ to butterfly layer $\ell$, yielding the contraction invariant $μ(f)=\prod_\ell 2^{-\mathrm{Inf}_\ell/(1+\mathrm{Inf}_\ell)}$. We prove a Jensen lower bound $\log_2μ(f) \ge -I(f)/(1+I(f)/n)$ and that $μ$ is strictly Schur-convex in the influence vector (modulo permutation), giving scaling classes $μ\sim 2^{-n/2}$ (parity), $2^{-Θ(\sqrt{n})}$ (majority), $2^{-1/2}$ (dictators). $\log_2μ$ is rational but not polynomial in the Fourier coefficients while $μ$ is algebraic, and $μ$ separates functions with identical total influence (122 pairs at $n=3$). Using the certified $n \le 4$ ternary Walsh-threshold universe from a companion synthesis manuscript as a finite testbed, we compute exact MILP minimum-support certificates for all 65,536 Boolean functions at $n=4$ (mean 6.42, max 9, all-odd by a parity argument) and on 10,000 of the 616,126 NPN-canonical representatives we enumerate at $n=5$ (matching OEIS A000370). Conditional Spearman $ρ(μ,|\mathrm{supp}|)$ at fixed total influence is $+0.571$ in the largest stratum at $n=4$ but reverses to $-0.38$ at $n=5$ under both function-uniform and NPN-canonical sampling: $μ$ is a valid Schur-convex concentration invariant, not a universal monotone predictor of minimum support across $n$. A companion application paper validates a real-valued WHT activation-energy proxy inspired by this theory on five pretrained LLMs at W2A16, cutting wikitext-2 perplexity by 15-58% versus vanilla auto-round; the transfer from Boolean theory to the real-valued proxy is qualitative, not formal.
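The invariant and the Jensen bound stated in this abstract can be checked by brute force for small n. The sketch below follows the formulas as written there; the helper names and the choice of n = 3 test functions are illustrative, and nothing here reproduces the paper's own code.

```python
import itertools
import math


def influences(f, n):
    """Inf_l(f): fraction of inputs on which flipping coordinate l changes f."""
    points = list(itertools.product((-1, 1), repeat=n))
    result = []
    for l in range(n):
        flips = 0
        for x in points:
            y = x[:l] + (-x[l],) + x[l + 1:]
            flips += f(x) != f(y)
        result.append(flips / len(points))
    return result


def log2_mu(f, n):
    """log2 of mu(f) = prod_l 2^(-Inf_l / (1 + Inf_l))."""
    return -sum(i / (1 + i) for i in influences(f, n))


def jensen_bound(f, n):
    """Lower bound -I(f) / (1 + I(f)/n), where I(f) is the total influence."""
    total = sum(influences(f, n))
    return -total / (1 + total / n)


n = 3
parity = lambda x: math.prod(x)
dictator = lambda x: x[0]
majority = lambda x: 1 if sum(x) > 0 else -1
```

For parity every influence is 1, so `log2_mu(parity, n)` equals -n/2, and for a dictator it equals -1/2, matching the scaling classes quoted above. The bound is tight whenever all influences are equal (parity, majority) and strict for the dictator.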

2605.01634 2026-05-05 cs.LG

Chebyshev-Augmented One-Shot Transfer Learning for PINNs on Nonlinear Differential Equations


Yiqi Rao, Pavlos Protopapas

Affiliations: Harvard University

AI summary: This paper combines Chebyshev polynomial approximation with one-shot transfer learning, broadening the class of nonlinearities amenable to one-shot PINN transfer and enabling fast online adaptation through a decomposition of the polynomial nonlinearity.

Comments: 18 pages, 4 figures, 9 tables, accepted to ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations

Abstract

Physics-Informed Neural Networks (PINNs) offer a flexible paradigm for solving differential equations by embedding governing laws into the training objective. A persistent limitation is instance specificity: standard PINNs typically require retraining for each new forcing term, boundary/initial condition, or parameter setting. One-shot transfer learning (OTL) addresses this bottleneck for linear operators by freezing a pretrained latent representation and computing optimal output weights in closed form, but for nonlinear problems closed-form adaptation is generally unavailable because the loss is nonconvex in the output layer. In this paper we substantially broaden the class of nonlinearities amenable to one-shot PINN transfer by combining OTL with Chebyshev polynomial surrogates. We approximate general smooth weakly nonlinear terms by truncated Chebyshev expansions over a prescribed solution range, yielding a polynomial nonlinearity that can be handled by a perturbative decomposition into linear subproblems. A multi-head PINN learns a reusable latent space associated with the dominant linear operator; at test time, solutions to new instances are obtained via a sequence of closed-form linear solves in the output layer, without retraining the network body. We provide a unified derivation of the framework for ODEs and PDEs and demonstrate accuracy and fast online adaptation on nonlinear benchmarks, including non-polynomial and singular ODE nonlinearities as well as a reaction-diffusion PDE with saturating kinetics, demonstrating the method's utility in many-query regimes.
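The surrogate step described here, replacing a smooth nonlinearity by a truncated Chebyshev expansion over a prescribed solution range, can be sketched with NumPy's polynomial module. The nonlinearity (`np.sin`), the range [-1, 1], and the degree 7 are assumptions chosen for illustration; the perturbative decomposition and the closed-form output-layer solves are not shown.

```python
import numpy as np
from numpy.polynomial import Chebyshev, Polynomial

# Illustrative nonlinearity and solution range (assumptions, not from the paper).
nonlinearity = np.sin
u_min, u_max = -1.0, 1.0

# Degree-7 Chebyshev interpolant of the nonlinearity on [u_min, u_max].
surrogate = Chebyshev.interpolate(nonlinearity, deg=7, domain=[u_min, u_max])

# The same surrogate in the monomial basis: the polynomial nonlinearity
# that a perturbative scheme would then expand term by term.
poly_coeffs = surrogate.convert(kind=Polynomial).coef

# Worst-case surrogate error on a dense grid over the solution range.
u = np.linspace(u_min, u_max, 1001)
max_error = float(np.max(np.abs(surrogate(u) - nonlinearity(u))))
```

The error is only controlled inside the chosen range, which is why the abstract specifies a prescribed solution range: a solution that leaves [u_min, u_max] would see the polynomial diverge from the true nonlinearity.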

2605.01632 2026-05-05 cs.LG

Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy


Eleanor Quint

Affiliations: Department of Computing

AI summary: This paper proposes P&C, which applies random hidden-layer perturbations to a pretrained network followed by a least-squares correction, producing predictors that agree on calibration data but disagree away from it and achieving a strong ID/OOD trade-off.

Abstract

Models that are indistinguishable on in-distribution data can behave very differently under distribution shift. We introduce Perturb-and-Correct (P&C), a post-hoc method for constructing epistemically diverse predictors from a single pretrained network. P&C applies random hidden layer perturbations with a least-squares correction in the subsequent affine layer, producing predictors that agree on calibration data while remaining free to disagree away from it. We analyze this mechanism through the post-correction residual and its first-order sensitivity: the residual is controlled near the calibration distribution by a leverage term, while corrected sensitivity grows as inputs deviate from the calibration geometry. Empirically, P&C achieves a strong ID/OOD tradeoff across MuJoCo dynamics prediction and CIFAR-10 OOD detection, matching or outperforming standard post-hoc baselines while requiring only a single pretrained model. Our findings highlight the potential in further exploiting overparameterization as a strength of deep learning models.
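The perturb-then-correct mechanism can be illustrated on a toy two-layer network. Everything in this sketch is an assumption: random weights stand in for a pretrained model, the perturbation is Gaussian noise on the first-layer weights, and the correction refits the output layer with `numpy.linalg.lstsq`. It is meant to show the agree-on-calibration, disagree-elsewhere behaviour, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_cal = 3, 64, 20

# Toy "pretrained" two-layer network (random weights stand in for training).
W1, b1 = rng.normal(size=(d_in, d_hidden)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_hidden, 1)), rng.normal(size=1)


def hidden(x, W, b):
    return np.maximum(x @ W + b, 0.0)  # ReLU features


def base_predict(x):
    return hidden(x, W1, b1) @ W2 + b2


def perturb_and_correct(x_cal, sigma=0.3):
    """Perturb the hidden layer, then refit the next affine layer by least squares."""
    W1_p = W1 + sigma * rng.normal(size=W1.shape)

    def features(x):
        # Perturbed hidden features with an appended bias column.
        return np.hstack([hidden(x, W1_p, b1), np.ones((len(x), 1))])

    # Corrected output weights reproduce the base predictions on calibration data.
    theta, *_ = np.linalg.lstsq(features(x_cal), base_predict(x_cal), rcond=None)
    return lambda x: features(x) @ theta


x_cal = rng.normal(size=(n_cal, d_in))         # calibration inputs
x_far = 10.0 * rng.normal(size=(n_cal, d_in))  # inputs far from calibration data
member = perturb_and_correct(x_cal)

cal_gap = float(np.max(np.abs(member(x_cal) - base_predict(x_cal))))
far_gap = float(np.max(np.abs(member(x_far) - base_predict(x_far))))
```

Because the hidden layer is wider than the calibration set, the least-squares fit can match the base predictions there almost exactly, while nothing constrains the corrected predictor away from that data. Calling `perturb_and_correct` repeatedly would yield an ensemble from one model.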

2605.01630 2026-05-05 cs.CL cs.AI

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese


Roseval Malaquias Junior, Giovana Kerche Bonás, Thales Sales Almeida, Hugo Abonizio, Thiago Laitz, Ramon Pires, Marcos Piau, Celio Larcher, Rodrigo Nogueira

Affiliations: Maritaca AI; Jusbrasil

AI summary: This paper introduces the Prosa benchmark and shows that binary rubric scoring with multi-judge filtering reduces the sensitivity of LLM rankings to judge bias and improves the discriminative power of the evaluation.

Abstract

Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first benchmark of real-user, multi-turn chats in Brazilian Portuguese: 1,000 WildChat conversations, with responses from 16 models scored by three judges from three model families. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.
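One plausible reading of binary rubric scoring with multi-judge filtering is sketched below. The abstract does not state the filter criterion, so treating unanimity among judges as the filter is an assumption, as are the function names and the example rubric items.

```python
def filter_rubric(verdicts):
    """Keep rubric items on which every judge gives the same binary verdict.

    verdicts: dict mapping rubric item -> list of 0/1 verdicts, one per judge.
    Unanimity as the filter criterion is an assumption made for illustration.
    """
    return {item: v[0] for item, v in verdicts.items() if len(set(v)) == 1}


def rubric_score(verdicts):
    """Fraction of retained rubric items that the response satisfies."""
    kept = filter_rubric(verdicts)
    return sum(kept.values()) / len(kept) if kept else 0.0


verdicts = {
    "answers in Brazilian Portuguese": [1, 1, 1],
    "addresses every user request": [1, 0, 1],  # judges disagree: filtered out
    "contains no factual errors": [0, 0, 0],
    "follows the requested format": [1, 1, 1],
}
score = rubric_score(verdicts)
```

Under this reading, a response's score no longer depends on which judge is consulted, since only items with identical verdicts survive; that is one way the judge-insensitivity reported in the abstract could arise.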

2605.01609 2026-05-05 cs.LG cs.AI

Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations


Pratyush Acharya, Nuraj Rimal, Habish Dhakal

Affiliations: Pratyush Acharya; Nuraj Rimal; Habish Dhakal

AI summary: Testing cross-lingual concept transport, this study finds that Whitened Causal Alignment based on the causal inner product is indistinguishable from spectral regularization alone. It also reveals anti-concentration in residual-stream concept directions, concentration of static unembedding-row contrasts in high-variance directions, and evidence that syntax is preferentially encoded in the high-variance subspace.

Comments: 25 pages, 16 figures, 13 tables

Abstract

We test whether the causal inner product of Park et al. (2024), defined by the unembedding covariance $Σ$, enables cross-lingual concept transport. Across 17 models and 4 language pairs, a matched-spectrum randomization test finds that Whitened Causal Alignment is indistinguishable from spectral regularization alone ($p = 0.95$). However, this failure reveals a broader phenomenon: anti-concentration is observed in residual-stream difference-of-means vectors across five architecture families ($p < 10^{-33}$) and supported by SAE features (e.g., $p = 4.5 \times 10^{-19}$) and linear probes on Gemma and Llama. We discover a dual geometry: activation-space concept directions anti-concentrate in the spectral tail, while static unembedding-row contrasts concentrate in high-variance directions ($p < 10^{-4}$). Split-injection causal interventions support the functional basis on Gemma and Llama (Cohen's $d$ up to $1.80$), and POS-tag probing across 8 models shows syntax preferentially encodes in the high-variance subspace in 6 of 8 architectures ($p < 0.013$), with the Qwen 2.5 family showing a significant reversal consistent with architecture-specific spectral structure. These results suggest transformers may rotate semantic content into spectrally quiet regions during contextualized processing, encoding concepts where they can be manipulated with reduced grammatical disruption.

2605.01605 2026-05-05 cs.CL cs.AI

Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models


Zhuoyun Li, Boxuan Wang, Jinwei Hu, Zhenglin Huang, Qisong He, Xinmiao Huang, Guangliang Cheng, Xiaowei Huang, Yi Dong

Affiliations: School of Computer Science and Informatics, University of Liverpool, UK

AI summary: This paper proposes the S$^2$R$^2$ framework, which improves the robustness of LoRA-tuned models through segment-level semantic decomposition and optimal-transport alignment, addressing drift in critical entities or conclusions caused by prompt perturbations.

Comments: Under review

Abstract

Large language models are sensitive to minor prompt perturbations, yet existing robustness methods usually enforce consistency at the whole-sequence level. This holistic view can hide an important failure mode: a perturbed response may remain globally similar to the clean one while drifting on a critical entity, relation, or conclusion. We introduce S$^2$R$^2$, a segment-level framework for robust LoRA fine-tuning. S$^2$R$^2$ decomposes clean and perturbed generations into semantic segments, aligns them with an optimal-transport objective, and penalises the segments with the largest meaning drift. To connect this output-side objective with model adaptation, we add an adapter-stability regulariser motivated by segment-level attention reallocation, using LoRA norm control as a tractable proxy for limiting perturbation-amplified evidence shifts. A PAC-Bayesian complexity view further explains why controlling adapter growth may support transfer beyond observed perturbations. Experiments on summarisation benchmarks show that S$^2$R$^2$ improves robustness under typographical noise, deletion, synonym replacement, and paraphrasing, while maintaining competitive clean performance and stronger cross-dataset transfer than consistency-based baselines.

2605.01604 2026-05-05 cs.AI

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework


Mukund Pandey

Affiliations: Independent Researcher

AI summary: Addressing the challenge of evaluating agentic AI that runs continuously in production, this paper proposes a taxonomy of seven failure modes and the PAEF framework, and shows where traditional metrics fail to detect those failure modes.

Comments: 11 pages, 6 tables, 1 figure. Reference implementation: https://github.com/mukund1985/llm-eval-toolkit

Abstract

Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.

2605.01596 2026-05-05 cs.CL

Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection

针对AI生成代码检测的预训练代码模型微调

Jany-Gabriel Ispas, Sergiu Nisioi

发表机构 * Human Language Technologies Research Center(人类语言技术研究中心) Faculty of Mathematics and Computer Science(数学与计算机科学系) University of Bucharest(布加勒斯特大学)

AI总结 本文提出通过微调四个预训练代码模型,解决AI生成代码二分类和生成模型归因问题,取得较高宏F1分数。

Comments Archaeology at SemEval-2026 Task 13

详情
AI中文摘要

本文描述了团队Archaeology在SemEval-2026任务13中提交的系统,该共享任务包含三个子任务。我们参与子任务A(人类编写与AI生成代码的二分类)和子任务B(11类生成模型归因)。从TF-IDF和逻辑回归基线出发,我们分别针对每个子任务微调四个预训练代码模型(CodeBERT、GraphCodeBERT、UniXcoder和CodeT5+)。对于子任务A,我们使用留一语言交叉验证、代码增强、分块推理与截断均值聚合以及在困难数据集上的阈值校准。对于子任务B,我们使用三明治标记打包、类平衡损失和多种子集成与测试时增强。我们的最佳提交在子任务A上获得宏F1分数0.737(第6名/81支队伍),在子任务B上获得0.422(第7名/34支队伍)

英文摘要

This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
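The "chunked inference with trimmed-mean aggregation" step mentioned above can be sketched as follows. This is a minimal illustration, not the team's implementation: the function names, the 10% trim fraction, and the 0.5 threshold are all assumptions, and the per-chunk scores are taken to be probabilities that a chunk is AI-generated.

```python
def trimmed_mean(scores, trim_fraction=0.1):
    """Average the scores after dropping the lowest and highest trim_fraction of values."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    # Fall back to the plain mean if trimming would remove everything.
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(kept) / len(kept)


def classify_file(chunk_scores, threshold=0.5, trim_fraction=0.1):
    """Label a file as AI-generated when the trimmed mean of its chunk scores reaches the threshold."""
    return trimmed_mean(chunk_scores, trim_fraction) >= threshold
```

Trimming both tails makes the file-level score less sensitive to a single chunk that the classifier gets badly wrong, which is the usual motivation for preferring it to a plain mean.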

2605.01580 2026-05-05 cs.LG cs.AI

Model Merging: Foundations and Algorithms

模型融合:基础与算法

Donato Crisostomi

发表机构 * Sapienza University of Rome(萨皮恩扎罗马大学) Department of Computer Science(计算机科学系) GLADIA research lab(GLADIA研究实验室) Faculty of Information Engineering, Informatics, and Statistics(信息工程、信息学和统计学学院)

AI总结 本文研究了模型融合作为替代范式,提出C$^2$M$^3$和MASS等算法,通过参数空间整合多个网络,减少计算成本并提升模型性能。

Comments PhD thesis

详情
AI中文摘要

现代深度学习通常将模型视为独立的artifact:独立训练、专用于特定目的,并在改进版本出现时被替换。本论文研究了模型融合作为替代范式:在权重空间中直接结合独立训练的神经网络,几乎不进行优化,且不需要访问原始训练数据。论文考虑了两种主要场景。在单任务设置中,模型共享目标但初始化不同,我们引入C$^2$M$^3$,一种基于Frank-Wolfe优化的循环一致融合算法。C$^2$M$^3$将多个网络整合到共享的参考自由参数空间中,使权重平均有意义而不优先考虑任何单个模型。在多任务设置中,模型从共同的预训练初始化进行微调以适应不同的下游任务,我们首先发展了任务向量作为近似梯度的理论解释。这解释了任务算术的有效性和局限性。基于这一观点,我们展示任务向量继承梯度的低秩结构,并引入任务奇异向量(TSV),一种通过TSV-Merge实现压缩和干扰减少的分解方法。我们随后提出MASS,一种输入自适应的路由方法,利用TSV几何在推理时选择任务相关的子空间。最后,我们介绍MERGE$^3$,一种进化融合框架,利用项目反应理论将评估成本减少高达50$\times$,同时保持解决方案质量。

英文摘要

Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C$^2$M$^3$, a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C$^2$M$^3$ aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE$^3$, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50$\times$ while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models.
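For readers unfamiliar with task vectors, the generic task-arithmetic recipe can be sketched in a few lines. This is an illustrative sketch only: it shows plain task arithmetic and a truncated SVD of a task matrix, not the thesis's TSV-Merge, MASS, or C$^2$M$^3$ algorithms, and the dictionary-of-arrays parameter layout and the scaling factor `alpha` are assumptions.

```python
import numpy as np


def task_vector(theta_ft, theta_pre):
    """Task vector: fine-tuned weights minus the shared pretrained weights."""
    return {name: theta_ft[name] - theta_pre[name] for name in theta_pre}


def merge_task_arithmetic(theta_pre, task_vectors, alpha=1.0):
    """Add the scaled sum of task vectors back onto the pretrained weights."""
    merged = {}
    for name, weight in theta_pre.items():
        merged[name] = weight + alpha * sum(tv[name] for tv in task_vectors)
    return merged


def low_rank(matrix, rank):
    """Best rank-`rank` approximation of a 2-D task matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]
```

If task vectors are approximately low-rank, as the abstract argues, `low_rank` with a small rank keeps most of each task's update while storing far fewer numbers.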

2605.01574 2026-05-05 cs.LG

Hybrid Quantum Reinforcement Learning with QAOA for Improved Vehicle Routing Optimization

混合量子强化学习与QAOA用于改进车辆路径优化

T. Satyanarayana Murthy, B. Swathi Sowmya, Santhosh Voruganti, Sai Varshini Giridi, Chaitanyya Pratap Agarwal, Vanteddu Akshitha

发表机构 * Chaitanya Bharathi Institute of Technology(恰伊坦尼亚·巴拉提技术学院)

AI总结 本文提出一种结合QAOA的混合量子强化学习框架,用于解决车辆路径优化问题,展示更快收敛速度和更优解决方案。

详情
AI中文摘要

车辆路径问题(VRP)是运输和物流中最复杂的NP难组合优化问题之一,需要动态求解方法。本文提出一种新的混合方法,将量子近似优化算法(QAOA)整合到QRL策略网络中,替代传统的变分层。这种增强使智能体在学习策略时能够利用问题特定的量子相关性,从而更充分地探索路径解空间。QAOA增强的QRL框架在训练中收敛更快,能够处理超出Grover自适应搜索(GAS)和量子强化学习(QRL)方法范围的更大VRP实例。在标准VRP实例上的实验表明,该方法能获得更优的解决方案,收敛所需的episode更少,并在近期量子硬件模拟器上表现出良好的内存使用情况。这些发现证明了QAOA整合的QRL作为可扩展、高质量量子辅助组合优化方法的可行性。

英文摘要

The Vehicle Routing Problem (VRP) is one of the most complex NP-hard combinatorial optimization problems in transportation and logistics and requires a dynamic solution approach. In this paper, we present a new hybrid approach that integrates the Quantum Approximate Optimization Algorithm (QAOA) into the Quantum Reinforcement Learning (QRL) policy network, using QAOA mixing and cost Hamiltonian layers instead of the usual variational layers. This enhancement enables the agent to exploit problem-specific quantum correlations when learning policies, allowing richer exploration of the routing solution space. The QAOA-augmented QRL framework converges faster in training and can tackle larger VRP instances that are beyond the reach of Grover's Adaptive Search (GAS) and standard QRL approaches. Experiments on standard VRP instances demonstrate better solutions, fewer episodes to converge, and good memory usage on near-term quantum hardware simulators. These findings demonstrate QAOA-integrated QRL as a viable approach to scalable, high-quality quantum-assisted combinatorial optimization.

2605.01568 2026-05-05 cs.CV

Unifying Deep Stochastic Processes for Image Enhancement

统一深度随机过程用于图像增强

Wojciech Kozłowski, Radosław Kuczbański, Kamil Adamczewski, Karol Szczypkowski, Maciej Zięba

发表机构 * Wrocław University of Science and Technology(弗罗茨瓦夫科技大学)

AI总结 本文统一了图像增强中的深度随机过程,通过分类为无条件扩散模型、Ornstein-Uhlenbeck过程和扩散桥三类连续时间过程,揭示其共同的SDE框架,并通过实验分析不同设计对性能的影响。

Comments 27 pages, in Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea

详情
AI中文摘要

深度随机过程近年来已成为图像增强的核心范式,许多方法显式地将随机轨迹条件化于退化输入。然而,这些条件过程与标准扩散模型之间的关系尚不明确。本文通过将近期方法分为三类连续时间过程:无条件扩散模型、Ornstein-Uhlenbeck(OU)过程和扩散桥,提出统一视角。我们证明所有方法均源自共同的随机微分方程(SDE)公式。该框架明确指出看似不同的方法主要差异在于漂移和扩散项、终端分布及边界条件,而调度器和采样器则是正交的设计选择。借助这一统一,我们通过多个图像增强任务进行受控实证研究,使用相同架构和训练协议。结果表明没有一致占优的方法;相反,我们识别并分离出最强烈影响性能的具体设计选择。最后,我们发布ItoVision,一个模块化的PyTorch库,实现统一框架,支持快速原型设计和公平比较随机图像增强方法。

英文摘要

Deep stochastic processes have recently become a central paradigm for image enhancement, with many methods explicitly conditioning the stochastic trajectory on the degraded input. However, the relationship between these conditional processes and standard diffusion models remains unclear. In this work, we introduce a unified perspective on stochastic image enhancement by classifying recent methods into three families of continuous-time processes: unconditional diffusion models, Ornstein-Uhlenbeck (OU) processes, and diffusion bridges. We show that all of these approaches arise from a common stochastic differential equation (SDE) formulation. This framework makes explicit that seemingly disparate methods differ primarily in their drift and diffusion terms, terminal distributions, and boundary conditions, while schedulers and samplers constitute orthogonal design choices. Leveraging this unification, we conduct a controlled empirical study across multiple image enhancement tasks using identical architectures and training protocols. Our results reveal no consistently dominant method; instead, we identify and disentangle the specific design choices that most strongly influence performance. Finally, we release ItoVision, a modular PyTorch library that implements the unified framework and enables rapid prototyping and fair comparison of stochastic image enhancement methods.
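As a rough illustration of what a "common SDE formulation" looks like, the standard Itô form and a mean-reverting Ornstein-Uhlenbeck special case are written out below. The notation ($f$, $g$, $\theta_t$, $\sigma_t$, $\mu$) is the conventional one and is assumed here; the paper's own parameterization may differ.

```latex
% Generic Ito SDE with drift f and diffusion coefficient g:
dx_t = f(x_t, t)\,dt + g(t)\,dw_t

% Ornstein-Uhlenbeck special case: the drift pulls x_t toward a mean \mu
% (for enhancement tasks, \mu is typically tied to the degraded input):
dx_t = \theta_t\,(\mu - x_t)\,dt + \sigma_t\,dw_t
```

In this view, the method families named in the abstract differ in the choice of $f$, $g$, and the endpoint conditions, while the noise schedule and the sampler can be varied independently.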

2605.01566 2026-05-05 cs.AI

Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

多智能体推理提升计算效率:帕累托最优测试时缩放

Florian Valentin Wunderlich, Lars Benedikt Kaesberg, Jan Philip Wahle, Terry Ruas, Bela Gipp

发表机构 * University of Göttingen(哥廷根大学)

AI总结 本文分析了多智能体推理策略的计算效率,通过实验确定在不同模型规模下,帕累托最优方法能以最低计算成本获得最高准确性,尤其在复杂任务中多智能体方法表现更优。

Comments Accepted at SRW at ACL 2026, long paper

详情
AI中文摘要

推理方法的进步使语言模型能在不额外训练的情况下提升预测性能。这些方法往往优先考虑原始性能而非计算效率。然而,计算效率对资源受限的现实应用至关重要。我们系统分析了自一致性、自细化、多智能体辩论和混合智能体的推理缩放策略,研究其计算性能权衡。我们在两个推理基准(MMLU-Pro,BBH)上评估方法,并涵盖不同模型规模下的广泛参数配置(例如,扩展并行预测、智能体和辩论轮次的数量)。在34种配置和超过100次评估中,我们计算帕累托最优前沿,选择在最低计算预算下获得最佳准确性的方法。值得注意的是,在最高评估预算(20倍于CoT计算预算)下,推理缩放在MMLU-Pro上将准确性相对思维链最多提高7.1个百分点。在同等计算预算下,辩论和混合智能体分别比自一致性高出1.3和2.7个百分点。虽然自一致性较早饱和,但多智能体方法在更复杂任务中持续受益。我们提出一个简单的多智能体设计指南:混合智能体在并行生成数量超过顺序聚合数量时效率最高。

英文摘要

Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage. However, computational efficiency is key for real-world applications with resource constraints. We provide a systematic analysis of four inference scaling strategies (self-consistency, self-refinement, multi-agent debate, and mixture-of-agents) to study their compute-performance tradeoffs. We evaluate methods on two reasoning benchmarks (MMLU-Pro, BBH) and include extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto-optimal front to select methods that achieve the best accuracy with the lowest computational budget. Notably, inference scaling improves accuracy by up to 7.1 percentage points over chain-of-thought (CoT) at the highest evaluated budgets (20x the CoT compute budget) on MMLU-Pro. With an equal computing budget, debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points, respectively. While self-consistency saturates earlier, multi-agent gains persist, particularly on more complicated tasks. We identify a simple multi-agent design guideline: mixture-of-agents is most efficient when the number of parallel generations exceeds the number of sequential aggregations.
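Computing a Pareto-optimal front over (compute cost, accuracy) pairs is straightforward to sketch. The function below is an illustrative assumption about how such a front could be computed, not the authors' code; the tuple layout `(name, cost, accuracy)` is hypothetical.

```python
def pareto_front(configs):
    """Return the non-dominated configurations, sorted by increasing compute cost.

    configs: iterable of (name, compute_cost, accuracy) tuples.
    A configuration is kept only if no other one is at least as cheap
    and strictly more accurate (or equally accurate and cheaper).
    """
    front = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; at equal cost, try the most accurate first.
    for name, cost, accuracy in sorted(configs, key=lambda c: (c[1], -c[2])):
        if accuracy > best_accuracy:
            front.append((name, cost, accuracy))
            best_accuracy = accuracy
    return front
```

Reading the front from left to right gives, for each budget, the cheapest configuration that reaches a given accuracy, which is the selection criterion described in the abstract.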

2605.01563 2026-05-05 cs.CV

Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection

多数据集跨领域知识蒸馏用于统一的医学图像分割、分类和检测

Ceausescu Ciprian-Mihai, Anghelina Ion-Marian, Alexe Dumitru-Bogdan

发表机构 * Faculty of Mathematics and Computer Science, University of Bucharest(布加勒斯特大学数学与计算机科学学院) Gheorghe Mihoc-Caius Iacob Institute of Mathematical Statistics and Applied Mathematics of the Romanian Academy(罗马尼亚科学院格奥尔基·米霍茨-凯奥斯·伊阿科夫学院数学统计与应用数学研究所)

AI总结 本文提出了一种统一的跨领域迁移学习框架,通过多个异构医学影像数据集的知识提升分割、分类和目标检测性能,采用教师-学生范式,通过多级知识蒸馏训练任务特定的学生模型,扩展支持图像分类和目标检测,实现医学图像分析的多任务统一。

Comments Journal extension from the KES paper

详情
AI中文摘要

我们提出了一种统一的跨领域迁移学习框架,该框架利用多个异质医学影像数据集的知识来提升分割、分类和目标检测任务的性能。我们的方法采用教师-学生范式,在其中联合教师模型从多样源数据集学习领域不变表示,而任务特定的学生模型则通过多级知识蒸馏进行训练。最初为医学图像分割开发,该框架扩展以支持图像级分类和目标级检测,使医学图像分析能够实现通用的多任务形式。我们在广泛的多个数据集上评估了我们的方法,包括六个分割基准、BrainMetShare、ISLES、BraTS(MRI)和Lung MSD、LiTS、KiTS(CT),以及多个用于肺部疾病和痴呆症的分类数据集和带有原生边界框注释的检测数据集。在所有任务和模态中,所提出的方法在强数据集特定和多头基线上都表现出一致的改进,展示了对分布偏移的增强鲁棒性和更优的泛化能力。这些发现突显了多数据集知识蒸馏作为可扩展且任务无关的方法,在异构医学影像领域提升分割、分类和目标检测性能的潜力。

英文摘要

We propose a unified cross-domain transfer learning framework that leverages knowledge from multiple heterogeneous medical imaging datasets to improve performance across segmentation, classification, and object detection tasks. Our approach employs a teacher-student paradigm in which a joint teacher model aggregates domain-invariant representations learned from diverse source datasets, while a task-specific student model is trained via multi-level knowledge distillation. Originally developed for medical image segmentation, the framework is extended to support image-level classification and object-level detection, enabling a general multi-task formulation for medical image analysis. We evaluate our method on a broad suite of datasets, including six segmentation benchmarks (BrainMetShare, ISLES, and BraTS for MRI; Lung MSD, LiTS, and KiTS for CT), multiple classification datasets for pulmonary disease and dementia, and detection datasets with native bounding-box annotations. Across all tasks and modalities, the proposed approach yields consistent improvements over strong dataset-specific and multi-head baselines, demonstrating enhanced robustness to distributional shifts and superior generalization. These findings highlight the potential of multi-dataset knowledge distillation as a scalable and task-agnostic approach for enhancing segmentation, classification, and object detection performance across heterogeneous medical imaging domains.

2605.01555 2026-05-05 cs.CL cs.AI cs.HC

Automated Interpretability and Feature Discovery in Language Models with Agents

在语言模型中通过代理实现自动化可解释性和特征发现

Arnau Marin-Llobet, Javier Ferrando

发表机构 * Harvard University(哈佛大学)

AI总结 本文提出一个自主多代理框架,用于自动化大型语言模型的内部特征发现与解释,通过两个耦合循环提升解释精度和可验证性。

详情
AI中文摘要

我们介绍了一个自主的多代理框架,用于机制可解释性,该框架自动化了大型语言模型的解释和内部特征发现。系统运行两个耦合循环:(1) 解释细化,其中代理提出竞争性假设,并通过针对性提示控制和多指标评估迭代测试;(2) 特征发现,其中代理生成提示集,构建激活空间中的k近邻图,并使用统计分离性和语义一致性标准检索候选特征。在Gemma-2家族模型和权重稀疏转换器的MLP神经元上,我们的代理在单次自动解释上取得改进,发现了语言特定和安全相关的特征,并生成可审计的解释轨迹,表明代理驱动的实证循环产生的解释比单次标签更精确且更具可检验性。

英文摘要

We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable explanations than one-shot labels.

2605.01552 2026-05-05 cs.CV

Robust Fundamental Matrix Estimation from Single Image Motion Blur

从单张图像运动模糊中鲁棒估计基础矩阵

Bao-Long Tran, Per-Erik Forssén, Fredrik Viksten

发表机构 * Computer Vision Laboratory, Linköping University, Sweden(林雪平大学计算机视觉实验室)

AI总结 本文提出从单张运动模糊图像中鲁棒估计基础矩阵的方法,通过利用模糊图像中的 smear 路径进行时间实例对应,解决经典方法如8点算法无法处理的时间方向模糊问题。

Comments 13 pages, 8 figures, under submission

详情
AI中文摘要

本文介绍了一个具有挑战性的任务:从单张运动模糊图像中提取基础矩阵。对于在曝光期间在3D空间中移动的相机,模糊图像中的 smear 路径包含关于该运动的线索和约束。我们展示了在相机曝光窗口内两个时间实例之间的对应关系的可行性,并证明这些对应关系可用于鲁棒地推断基础矩阵,该矩阵总结了相机在曝光时间内的运动。推断出的基础矩阵在转置意义上是唯一的,对应于时间方向的模糊性。由于这种每 smear 的模糊性,经典方法如8点算法不再适用。所提出的方法修改了估计过程,以处理时间方向模糊的对应关系。为了提高基础矩阵估计的鲁棒性,我们还提出在 smear 模式预测中结合不确定性度量,并在估计器的采样过程中使用它。在合成和现实运动模糊数据集上的实验表明,我们的方法能够从单帧中估计包含3D相机运动的基础矩阵。在运动分割的下游任务中展示了实际应用。

英文摘要

In this paper, we introduce a challenging task: extracting a fundamental matrix from a single motion blurred image. For a camera moving in 3D during exposure, the smear paths in the blurry image contain cues and constraints on this motion. We demonstrate the feasibility of establishing correspondences between two time instances within the camera exposure window, and that these can be used to robustly infer a fundamental matrix, which summarizes the motion of the camera during the exposure time. The inferred fundamental matrix is unique up to a transpose, corresponding to an ambiguity of the direction of time. Due to this per-smear ambiguity, classic methods, such as the 8-point algorithm, are no longer usable. The proposed method modifies the estimation to work on time-direction ambiguous correspondences. To improve the robustness of the fundamental matrix estimation, we also propose to incorporate an uncertainty measurement in smear pattern prediction and use it in the sampling process of the estimator. Experiments on synthetic and real-world motion-blur datasets demonstrate that our approach is able to estimate the fundamental matrix encoding the 3D camera motion, from single frames. Practical applicability is demonstrated on the downstream task of motion segmentation.
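The "unique up to a transpose" statement follows directly from the standard epipolar constraint. The short derivation below uses the conventional notation (homogeneous image points $\mathbf{x}_1$, $\mathbf{x}_2$ at the two time instances and fundamental matrix $F$), which is assumed here and not taken from the paper.

```latex
% Epipolar constraint for a correspondence (x_1 at time t_1, x_2 at time t_2):
\mathbf{x}_2^{\top} F \, \mathbf{x}_1 = 0

% The left-hand side is a scalar, so transposing it changes nothing:
\mathbf{x}_1^{\top} F^{\top} \mathbf{x}_2 = 0

% Hence swapping the roles of the two endpoints (reversing time)
% replaces F by F^{\top}.
```

Because the temporal order of the two ends of each smear is unknown, every correspondence satisfies the constraint for either $F$ or $F^{\top}$, which is why an estimator that assumes consistently ordered point pairs cannot be applied directly.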

2605.01548 2026-05-05 cs.LG cs.CV eess.SP

ECG-biometrics-bench: A Unified Framework for Reproducible Benchmarking of ECG Biometrics

ECG生物特征-基准: 一个统一的可重复基准测试框架用于ECG生物特征

Milad Parvan

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出ECG-biometrics-bench框架,用于可重复评估ECG生物特征,揭示随机分割谬误,展示模型失效并非特定于模型,且通过动态多会话模板融合缓解时间老化影响。

Comments Under review

详情
AI中文摘要

心电图(ECG)生物特征已作为一种有前景的连续、具备活体感知的认证模式出现在可穿戴系统中。然而,许多先前研究由于数据泄漏(例如同一会话内的随机分割)报告过于乐观的结果。为了解决这个问题,我们引入ECG-biometrics-bench,一个模块化、可重复的基准测试框架,标准化七个广泛使用的公共ECG数据集的预处理、分割和评估,涵盖临床、便携式和大规模队列设置。该框架支持闭合集和开放集(即本工作中主体不相交的泛化)评估,以及逐步现实的协议,包括跨会话和长期时间分离。为了促进社区的可重复研究,ECG-biometrics-bench仓库将在本手稿被接受后在GitHub上公开。通过全面的多数据集分析,我们揭示了随机分割谬误,证明了会话内评估协议人为提高了性能,同时掩盖了由时间漂移和未见身份引起的严重退化。此外,通过评估多个架构,包括DeepECG、ResNet1D和CNN-LSTM,我们显示这些失败并非特定于模型,而是可能源于当前监督特征学习范式。最后,我们证明,通过基于动态多会话模板融合的“重注册、轻认证”策略,可以部分缓解时间老化导致的性能退化。这些发现建立了ECG生物特征的更现实基准,并突显了为可靠现实部署必须解决的关键挑战。

英文摘要

Electrocardiogram (ECG) biometrics have emerged as a promising modality for continuous, liveness-aware authentication in wearable systems. However, many prior studies report overly optimistic results due to data leakage (e.g., random splits within the same session). To address this issue, we introduce ECG-biometrics-bench, a modular, reproducible benchmarking framework that standardizes preprocessing, segmentation, and evaluation across seven widely used public ECG datasets spanning clinical, ambulatory, and large-scale cohort settings. The framework supports both closed-set and open-set (i.e., subject-disjoint generalization in this work) evaluation, as well as progressively realistic protocols including cross-session and long-term temporal separation. To facilitate reproducible research in the community, the ECG-biometrics-bench repository will be made publicly accessible on GitHub upon the acceptance of this manuscript. Through a comprehensive multi-dataset analysis, we expose the Random Split Fallacy, demonstrating that intra-session evaluation protocols artificially inflate performance while masking severe degradation caused by temporal drift and unseen identities. Furthermore, by evaluating multiple architectures, including DeepECG, ResNet1D, and CNN-LSTM, we show that these failures are not model-specific but are likely inherent to current supervised feature-learning paradigms. Finally, we demonstrate that performance degradation due to temporal aging can be partially mitigated through a heavy enrollment, lightweight authentication strategy based on dynamic multi-session template fusion. These findings establish a more realistic baseline for ECG biometrics and highlight critical challenges that must be addressed for reliable real-world deployment.
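The leakage behind the Random Split Fallacy can be made concrete with a small sketch contrasting a random split with a cross-session split. This is an illustration under assumptions, not code from ECG-biometrics-bench (whose repository is not yet public): the record layout with `subject` and `session` keys and all function names are hypothetical.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle all heartbeats and split them, ignoring which session they came from."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def cross_session_split(records, test_sessions):
    """Hold out entire recording sessions so train and test never share a session."""
    train = [r for r in records if r["session"] not in test_sessions]
    test = [r for r in records if r["session"] in test_sessions]
    return train, test


def shared_sessions(train, test):
    """(subject, session) pairs present on both sides of the split, i.e. leakage."""
    train_keys = {(r["subject"], r["session"]) for r in train}
    test_keys = {(r["subject"], r["session"]) for r in test}
    return train_keys & test_keys
```

A non-empty result from `shared_sessions` means the test set contains beats recorded minutes apart from training beats of the same person, which is the situation the abstract identifies as inflating reported performance.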

2605.01544 2026-05-05 cs.RO

An Efficient Metric for Data Quality Measurement in Imitation Learning

在模仿学习中数据质量测量的高效指标

Noushad Sojib, Momotaz Begum

发表机构 * Department of Computer Science, University of New Hampshire, USA(新罕布什尔大学计算机科学系)

AI总结 本文提出基于功率谱密度的高效自动演示排序指标,用于筛选高质量演示数据,提升模仿学习性能。

详情
AI中文摘要

模仿学习(IL)虽取得显著进展,但机器人在实际部署中仍受分布外(OOD)场景限制。通过微调预训练策略以利用部署环境中的用户演示可缓解此问题。然而,用户演示质量通常较差,包含过多修正动作、振荡和突变,影响学习和微调策略性能。现有自动演示数据筛选方法需环境交互,计算成本高且不适用于实际部署。本文提出基于演示轨迹功率谱密度(PSD)的快速、高效、全自动演示排序指标,无需策略学习、环境交互或专家标注,适用于大规模现场数据筛选。较低的PSD值对应更平滑、高质量的演示,而较高的PSD值表明不规则、含噪声的轨迹。我们在两个基准模仿学习数据集上评估该指标,并通过养老机构老年人用户研究,利用收集的演示数据微调π0.5策略以完成日常任务。结果表明,PSD筛选数据产生的策略在任务成功率和轨迹平滑度上优于未筛选基线和两种竞争性数据排序方法。

英文摘要

Imitation learning (IL) has seen remarkable progress, yet field deployment of IL-powered robots remains hindered by the challenge of out-of-distribution (OOD) scenarios. Fine-tuning pre-trained policies with end-user demonstrations collected in deployment environments is a promising strategy to address this challenge. However, end-user demonstrations are frequently of poor quality, characterized by excessive corrective motions, oscillations, and abrupt adjustments that degrade both learned and fine-tuned policy performance. Existing automated approaches for curating demonstration data require policy rollouts in the environment, making them computationally expensive and impractical for real-world deployment. In this paper, we propose a fast, efficient, and fully automated demonstration ranking metric based on the power spectral density (PSD) of demonstration trajectories. The PSD metric requires no policy learning, environment interaction, or expert labeling, making it well-suited for scalable, in-the-field data curation. Lower PSD values correspond to smoother, higher-quality demonstrations, while higher PSD values indicate erratic, artifact-laden trajectories. We evaluate the proposed metric on two benchmark imitation learning datasets comprising expert and lay-user demonstrations, and through a user study with older adults at a retirement facility, where collected demonstrations are used to fine-tune $\pi_{0.5}$ for a daily living task. Results demonstrate that PSD-curated data yields policies with higher task success rates and smoother execution trajectories compared to uncurated baselines and two competitive data-ranking methods.
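One way a PSD-based smoothness score could be computed is sketched below. The abstract does not say how the PSD is reduced to a single number, so everything here is an assumption: the score is the mean periodogram power of the mean-removed velocity signal, and the names `psd_score` and `rank_demonstrations` are hypothetical.

```python
import numpy as np


def psd_score(trajectory, dt=1.0):
    """Mean non-DC periodogram power of a trajectory's velocity; lower means smoother.

    trajectory: array-like of shape (T, D) with T >= 3 time steps and D dimensions.
    """
    traj = np.asarray(trajectory, dtype=float)
    velocity = np.diff(traj, axis=0) / dt          # finite-difference velocity
    velocity = velocity - velocity.mean(axis=0)    # remove constant motion
    spectrum = np.fft.rfft(velocity, axis=0)
    power = np.abs(spectrum) ** 2 / velocity.shape[0]
    return float(power[1:].mean())                 # skip the zero-frequency bin


def rank_demonstrations(demos, dt=1.0):
    """Indices of demonstrations ordered from smoothest to most erratic."""
    scores = [psd_score(d, dt) for d in demos]
    return sorted(range(len(demos)), key=lambda i: scores[i])
```

Since the score needs only the recorded trajectory, it can be evaluated without training a policy or stepping an environment, in line with the property the abstract emphasizes.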

2605.01542 2026-05-05 cs.LG cs.AI physics.comp-ph

Mesh Based Simulations with Spatial and Temporal awareness

基于网格的模拟与空间时间意识

Paul Garnier, Vincent Lannelongue, Elie Hachem

发表机构 * CEMEF - Mines Paris PSL(CEMEF - 巴黎 Mines 工程学院)

AI总结 本文提出统一框架,结合几何深度学习与严谨数值分析,通过多节点预测、时间校正和几何归纳偏差提升物理模拟的准确性和稳定性。

Journal ref ICML 2026

详情
AI中文摘要

用于计算流体动力学(CFD)的机器学习代理模型,特别是图神经网络(GNN)和Transformer,已成为加速物理模拟的重要新方法。然而,我们发现该领域存在关键瓶颈:尽管架构已显著进步,但常见的训练范式仍局限于朴素假设,如节点级监督和显式欧拉时间推进。这些传统选择忽略了有限元法、有限差分法或有限体积法等众多偏微分方程求解方法中固有的刚性动力学和局部通量连续性。在本文中,我们提出一个统一框架,以弥合几何深度学习与严谨数值分析之间的差距。我们引入三个关键创新:(1)多节点预测,一种模板(stencil)级目标,为节点的完整局部拓扑预测场值,强制空间导数一致性;(2)时间校正,通过时间交叉注意力将不稳定显式方案替换为预测-校正方案;(3)几何归纳偏差,利用3D旋转位置嵌入(RoPE)以鲁棒地捕捉无结构网格中的旋转对称性。我们在多种物理数据集上对三种架构(MeshGraphNet、Transolver和Transformer)评估了该框架。我们的方法在准确性和稳定性方面产生了一致的改进,特别是在长时程滚动预测中,同时生成的潜在表示能够推广到未见过的子任务,如壁面剪切应力或压力预测。代码可在https://github.com/DonsetPG/graph-physics获取。

英文摘要

Machine Learning surrogates for Computational Fluid Dynamics (CFD), particularly Graph Neural Networks (GNNs) and Transformers, have become an important new approach for accelerating physics simulations. However, we identify a critical bottleneck in the field: while architectures have advanced significantly, the common underlying training paradigms remain bound to naive assumptions, such as node-wise supervision and explicit Euler time-stepping. These legacy choices ignore the stiff dynamics and local flux continuity inherent to many numerical methods for solving partial differential equations, such as finite element, finite difference, or finite volume methods. In this work, we propose a unified framework to bridge the gap between geometric deep learning and rigorous numerical analysis. We introduce three key innovations: (1) Multi Node Prediction, a stencil-level objective that predicts field values for a node's full local topology, enforcing spatial derivative consistency; (2) Temporal Correction, replacing unstable explicit schemes with a predictor-corrector via temporal Cross-Attention; and (3) Geometric Inductive Biases, leveraging 3D Rotary Positional Embeddings (RoPE) to robustly capture rotational symmetries in unstructured meshes. We evaluate this framework across three architectures (MeshGraphNet, Transolver, and a Transformer) on diverse physics datasets. Our approach yields consistent improvements in accuracy and stability, particularly in long-horizon rollouts, while producing latent representations that generalize to unseen subtasks such as Wall Shear Stress or Pressure prediction. Code is available at https://github.com/DonsetPG/graph-physics.

2605.01537 2026-05-05 cs.CL

The grip of grammar on meaning uncertainty: cross-linguistic evidence, neural correlates, and clinical relevance

语法对意义不确定性的影响:跨语言证据、神经相关性及临床相关性

Rui He, Claudio Palominos, Samuele Vallisa, Ni Yang, Han Zhang, Miguel Ángel Santos Santos, Neguine Rezaii, Sergi Valero, Yonghua Huang, Huan Li, Hong Jiang, Yongjun Peng, Maria Francisca Alonso-Sánchez, Frederike Stein, Tilo Kircher, Philipp Homan, Iris Sommer, Lena Palaniyappan, Wolfram Hinzen

发表机构 * Grammar and Cognition Lab, Department of Translation & Language Sciences, Universitat Pompeu Fabra(语法与认知实验室,翻译与语言科学系,庞培法布拉大学) School of Foreign Studies, Guangzhou University(外语学院,广州大学) Department of Neurology Sant Pau Memory Unit, Hospital de la Santa Creu i Sant Pau Biomedical Research Institute Sant Pau(圣帕乌记忆单位神经科,圣帕乌生物医学研究机构) Frontotemporal Disorders Unit, Department of Neurology, Massachusetts General Hospital, Harvard Medical School(前额叶疾病单位,神经科,马萨诸塞总医院,哈佛医学院) Ace Alzheimer Center Barcelona, Universitat Internacional de Catalunya(巴塞罗那阿尔茨海默病中心,国际加泰罗尼亚大学) Networking Research Center on Neurodegenerative Diseases (CIBERNED), Instituto de Salud Carlos III(神经退行性疾病网络研究机构(CIBERNED),卡洛斯三世健康研究所) Department of Language Science and Technology, Saarland University(语言科学与技术系,萨尔兰大学) Department of Operation and Management, Zhuhai People's Hospital (The Affiliated Hospital of Beijing Institute of Technology, Zhuhai Clinical Medical College of Jinan University)(珠海人民医院(北京理工大学附属珠海临床医学院)) Department of Statistics, Zhuhai People's Hospital (The Affiliated Hospital of Beijing Institute of Technology, Zhuhai Clinical Medical College of Jinan University)(珠海人民医院(北京理工大学附属珠海临床医学院)) Faculty of Medicine, Macau University of Science and Technology(医学学院,澳门科技大学) CIDCL, Escuela de Fonoaudiología, Universidad de Valparaíso(CIDCL,语言治疗学系,瓦尔帕莱索大学) Department of Psychiatry and Psychotherapy, University of Marburg(精神病学与心理学系,马尔堡大学) Marburg University, School of Medicine, Department of Psychiatry and Psychotherapy(马尔堡大学,医学院,精神病学与心理学系) Department of Adult Psychiatry and Psychotherapy, University of Zurich(成人精神病学与心理学系,苏黎世大学) Neuroscience Center Zurich, University of Zurich and ETH Zurich(苏黎世神经科学中心,苏黎世大学和ETH苏黎世联邦理工学院) Center for Clinical Neuroscience and Cognition and Department of Psychiatry, University of Groningen, University Medical Center Groningen(临床神经科学与认知中心及精神病学系,格罗宁根大学,格罗宁根大学医学中心) Douglas Mental Health University Institute, Department of Psychiatry, McGill University(道格拉斯大学心理健康研究所,精神病学系,麦吉尔大学) Department of Medical Biophysics, Schulich School of Medicine and Dentistry, Western University(医学生物物理系,施吕赫尔医学院及牙科学院,西部大学) Robarts Research Institute, Schulich School of Medicine and Dentistry, Western University(罗巴茨研究所,施吕赫尔医学院及牙科学院,西部大学) Institut Català de Recerca i Estudis Avançats (ICREA), Barcelona, Spain(加泰罗尼亚高级研究与研究机构(ICREA),巴塞罗那,西班牙)

AI总结 研究探讨语法如何跨语言压缩意义不确定性,通过神经机制和临床案例揭示语言处理的核心原理。

详情
AI中文摘要

孤立词的意义本质上具有不确定性,这种不确定性在结合上下文时会降低。本文提出语法跨语言压缩意义不确定性,这一现象在大脑中体现并受疾病影响。压缩被定义为非上下文性惊奇与上下文性惊奇之间的相对差异。在20种语言的叙述中,上下文性惊奇降低了基于频率的惊奇。这种减少与逆转词序的惊奇成本密切相关,并随更丰富、非冗余的词汇量变化,这些词汇由更复杂但最优的依赖结构组织。在fMRI中,惊奇及其减少解释了理解与生产中的BOLD活动,分别在重叠但不同的区域。在失语症、痴呆症和精神分裂症中,不确定性减少显著减弱,但当主要缺陷不是语言时,其保持完整。这些发现将通过语法减少不确定性定位为一个基础概念,揭示语言原则、大脑基础和中断。

英文摘要

Isolated word meanings are inherently uncertain. This uncertainty is reduced when words are combined and anchored in context. We propose that grammar compresses meaning uncertainty cross-linguistically, and that this compression is reflected in the brain and selectively disrupted in disorders. Compression was operationalized as the relative difference between non-contextual surprisal, estimated from lexical frequency, and contextual surprisal from grammar-sensitive models. In narratives from 20 languages, contextual surprisal was lower than frequency-based surprisal. This reduction closely tracked the surprisal cost of reversing word order, and scaled with richer, non-redundant lexis as organized by more complex but optimal dependency structure. During fMRI, surprisal and its reduction explained BOLD activity for comprehension and production in overlapping but distinct regions. Uncertainty reduction was significantly attenuated in aphasia, dementia, and schizophrenia, but remained intact where the primary deficit is not linguistic. These findings position uncertainty reduction via grammar as a foundational concept that illuminates the principles, brain basis, and disruptions of language.
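The compression measure described above can be illustrated with a toy calculation. This is a sketch under assumptions: the abstract says only "relative difference", so normalizing by the non-contextual surprisal is a guess, and the per-word probabilities are assumed to come from a frequency table and from a context-sensitive language model respectively.

```python
import math


def surprisal(probability):
    """Surprisal in bits: -log2 of a word's probability."""
    return -math.log2(probability)


def compression(frequency_probs, contextual_probs):
    """Relative reduction from frequency-based to contextual surprisal over a text.

    Both arguments are per-word probabilities for the same word sequence.
    Returns (S_freq - S_ctx) / S_freq, where S is the summed surprisal.
    """
    s_freq = sum(surprisal(p) for p in frequency_probs)
    s_ctx = sum(surprisal(p) for p in contextual_probs)
    return (s_freq - s_ctx) / s_freq
```

For example, two words with unigram probability 0.25 each carry 4 bits in total; if context raises each probability to 0.5, the total drops to 2 bits, giving a compression of 0.5 under this definition.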

2605.01520 2026-05-05 cs.CV cs.CL

MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

MIRL:基于互信息的强化学习用于视觉-语言模型

Yin Zhang, Jiaxuan Zhao, Zonghan Wu, Zengxiang Li, Junfeng Fang, Kun Wang, Qingsong Wen, Yilei Shao

发表机构 * School of Mathematics, Tianjin University, Tianjin, China(天津大学数学学院) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所) Shanghai Advanced Institute of Finance (SAIFS), East China Normal University, Shanghai, China(上海先进金融研究所(SAIFS),华东师范大学) ENN Group, Digital Technology Research Institute, China(ENN集团,数字技术研究院) Nanyang Technological University, Singapore(南洋理工大学) National University of Singapore, Singapore(新加坡国立大学)

AI总结 MIRL通过利用生成描述与视觉输入之间的互信息,解决视觉语言模型在复杂推理任务中的感知误差和幻觉问题,提升回答准确性。

详情
AI中文摘要

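视觉-语言模型(VLMs)常出现视觉感知错误和幻觉,影响其在复杂推理任务中的回答准确性。可验证奖励强化学习(RLVR)利用答案正确性信号优化策略,是一种有前景的解决方案。尽管有效,现有RLVR方法仍面临两个关键局限。其一,大量采样预算浪费在因早期视觉描述错误而注定失败的轨迹上。其二,稀疏奖励无法区分失败源于视觉感知阶段还是推理阶段。我们提出MIRL,一种解耦框架,利用生成描述与视觉输入之间的互信息(MI)作为低成本的预筛选信号,以同时解决上述两个局限。该信号可通过分叉将预算智能分配给高潜力轨迹,而解耦训练则为视觉感知优化提供独立的基于MI的奖励,从而解决奖励盲区问题。在六个视觉-语言推理基准上的实验表明,MIRL取得了70.22%的平均准确率,并且仅使用10个预采样加top-6选择(完整轨迹减少25%),就超过了采样16条完整轨迹的性能。代码见:https://anonymous.4open.science/r/mirl-main/。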

英文摘要

Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.

2605.01519 2026-05-05 cs.CV

Certified vs. Empirical Adversarial Robustness via Hybrid Convolutions with Attention Stochasticity

Joy Dhar, Song Xia, Manish Kumar Pandey, Maryam Haghighat, Azadeh Alavi, Ferdous Sohel, Wenyu Zhang, Nayyar Zaidi

Affiliations: Indian Institute of Technology Ropar; Nanyang Technological University; RoentGen Health; Queensland University of Technology; RMIT University; Murdoch University; Deakin University

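AI summary: HyCAS combines deterministic and randomized principles to improve model robustness under L2 and L∞ attacks; experiments show that it clearly outperforms existing methods on multiple datasets.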

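AI abstract (translated from Chinese)

We introduce HyCAS, hybrid convolutions with attention stochasticity, which aims to narrow the gap between certified robustness under L2 certificates and empirical robustness against strong L∞ attacks while maintaining strong generalization across diverse imaging benchmarks. HyCAS realizes a randomized defense by coupling 1-Lipschitz, spectrally normalized convolutions with two stochastic components: spectrally normalized random projection filters and a randomized attention-noise mechanism. Injecting smoothing randomness into the architecture makes the overall network ≤ 2-Lipschitz and yields formal certificates. Extensive experiments on diverse imaging benchmarks, including CIFAR-10/100, ImageNet-1k, NIH Chest X-ray, and HAM10000, show that HyCAS raises certified accuracy by up to 7.3% on NIH Chest X-ray and empirical robustness by up to 3.1% on HAM10000 without sacrificing clean accuracy. These results indicate that a randomized, Lipschitz-constrained architecture can improve certified L2 and empirical L∞ adversarial robustness at the same time, supporting safer deployment of deep models in high-stakes applications. Code: https://github.com/misti1203/HyCAS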

English abstract

We introduce Hybrid Convolutions with Attention Stochasticity (HyCAS), an adversarial defense that narrows the long-standing gap between provable robustness under L2 certificates and empirical robustness against strong L∞ attacks, while preserving strong generalization across diverse imaging benchmarks. HyCAS unifies deterministic and randomized principles by coupling 1-Lipschitz, spectrally normalized convolutions with two stochastic components (spectrally normalized random projection filters and a randomized attention-noise mechanism) to realize a randomized defense. Injecting smoothing randomness inside the architecture yields an overall ≤ 2-Lipschitz network with formal certificates. Extensive experiments on diverse imaging benchmarks, including CIFAR-10/100, ImageNet-1k, NIH Chest X-ray, and HAM10000, show that HyCAS surpasses prior leading certified and empirical defenses, boosting certified accuracy by up to 7.3% (on NIH Chest X-ray) and empirical robustness by up to 3.1% (on HAM10000), without sacrificing clean accuracy. These results show that a randomized, Lipschitz-constrained architecture can simultaneously improve both certified L2 and empirical L∞ adversarial robustness, thereby supporting safer deployment of deep models in high-stakes applications. Code: https://github.com/misti1203/HyCAS
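
To make the "1-Lipschitz, spectrally normalized" building block concrete: under the L2 norm, the Lipschitz constant of a linear map equals its largest singular value, so dividing the weights by that value gives a map that never stretches distances. The sketch below illustrates this generic idea on a small dense matrix in pure Python. It is an assumption-laden stand-in, not the HyCAS implementation: the paper works with convolutions and stochastic components, and the helper names here (`spectral_norm`, `spectrally_normalize`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def norm(x):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def spectral_norm(W, iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = [list(col) for col in zip(*W)]
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        n = norm(v)
        v = [vi / n for vi in v]
    return norm(matvec(W, v))


def spectrally_normalize(W):
    """Rescale W so that its largest singular value is approximately 1."""
    sigma = spectral_norm(W)
    return [[w / sigma for w in row] for row in W]


W = [[2.0, 1.0], [0.0, 1.5]]
W_hat = spectrally_normalize(W)

# Lipschitz check on one pair of inputs: the output distance should not
# exceed the input distance (up to floating-point error).
x, y = [1.0, -2.0], [0.5, 0.25]
out_dist = norm([a - b for a, b in zip(matvec(W_hat, x), matvec(W_hat, y))])
in_dist = norm([a - b for a, b in zip(x, y)])
ratio = out_dist / in_dist
```

Because Lipschitz constants multiply under composition, stacking layers that are each 1-Lipschitz keeps the whole stack 1-Lipschitz, which is what makes a per-layer constraint useful for certification.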

2605.01517 2026-05-05 cs.CV

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

Guotao Liang, Zhangcheng Wang, Chuang Wang, Juncheng Hu, Haitao Zhou, Junhua Liu, Jing Zhang, Dong Xu, Qian Yu

Affiliations: School of Software, Beihang University, Beijing, China; Department of Computer Science, The University of Hong Kong, Hong Kong, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China

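AI summary: VAnim proposes an LLM-based framework that uses sparse state updates and rendering-aware reinforcement learning to generate structure-preserving vector animation, outperforming existing methods.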

Comments: Accepted to ICML 2026. Project page: https://yukinonooo.github.io/VAnimProject

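AI abstract (translated from Chinese)

Scalable Vector Graphics (SVG) animation generation is essential to professional design because of its structural editability and resolution independence. The task remains challenging, however, because it requires linking discrete code representations with continuous visual dynamics. Existing optimization-based methods often break topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations and cannot model geometry-level non-rigid deformations. To address these limitations, we propose VAnim, the first framework for open-domain text-to-SVG animation. We treat animation not as sequence generation but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by more than 9.8x while preserving the SVG DOM structure and non-participating elements. For precise control, we propose an identification-first motion planning mechanism that grounds textual instructions in explicit visual entities. To overcome the non-differentiability of SVG rendering, we adopt rendering-aware reinforcement learning based on Group Relative Policy Optimization (GRPO). Using a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments show that VAnim significantly outperforms the best existing baselines in semantic alignment and structural validity, and additional appendix metrics further validate motion quality and identity preservation.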

English abstract

Scalable Vector Graphics (SVG) animation generation is pivotal for professional design because SVG is structurally editable and resolution independent. However, this task remains challenging as it requires bridging discrete code representations with continuous visual dynamics. Existing optimization-based methods often destroy topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations and fail to model geometry-level non-rigid deformations. To address these limitations, we present VAnim, the first LLM-based framework for open-domain text-to-SVG animation. We reconceptualize animation not as sequence generation, but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by over 9.8x while preserving the SVG DOM structure and non-participating elements by construction. To enable precise control, we propose an Identification-First Motion Planning mechanism that grounds textual instructions in explicit visual entities. Furthermore, to overcome the non-differentiable nature of SVG rendering, we employ Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization (GRPO). By leveraging a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments demonstrate that VAnim significantly outperforms state-of-the-art baselines in semantic alignment and structural validity, with additional appendix metrics further validating motion quality and identity preservation.
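
The idea of sparse state updates on a persistent DOM tree can be illustrated with a small sketch. It assumes, purely for illustration, that each frame is described by a mapping from element `id` to the attributes that change in that frame. This update format, the sample SVG, and the helper names are hypothetical and are not taken from the paper. Elements that an update does not mention are never rewritten, so the tree structure is preserved by construction.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an "ns0:" prefix

SVG = (
    f'<svg xmlns="{SVG_NS}" width="100" height="100">'
    '<rect id="bg" width="100" height="100" fill="white"/>'
    '<circle id="ball" cx="10" cy="50" r="5" fill="red"/>'
    "</svg>"
)


def apply_sparse_update(root, update):
    """Apply {element_id: {attribute: value}} in place on the persistent tree."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    for element_id, attrs in update.items():
        element = by_id[element_id]  # KeyError if the update names an unknown id
        for name, value in attrs.items():
            element.set(name, str(value))


def render_frames(svg_text, updates):
    """Yield one serialized SVG per frame, reusing a single parsed tree."""
    root = ET.fromstring(svg_text)
    for update in updates:
        apply_sparse_update(root, update)
        yield ET.tostring(root, encoding="unicode")


# Three frames: move the ball, move it again, then hold (empty update).
updates = [{"ball": {"cx": 30}}, {"ball": {"cx": 50, "cy": 40}}, {}]
frames = list(render_frames(SVG, updates))
```

Each update here is a few attribute values rather than a full re-emission of the SVG source, which is the intuition behind the shorter sequences the abstract reports.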