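arXivDaily: a daily academic digest that syncs the full arXiv feed with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.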

2605.28657 2026-05-28 cs.SD

DEMON: Diffusion Engine for Musical Orchestrated Noise

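Ryan Fosdick

AI summary: Presents DEMON, a real-time diffusion engine that makes the denoising process playable as a live instrument through four mechanisms (heterogeneous denoise scheduling, shared mutable state, per-frame source blending, and windowed VAE decode), reaching up to 12.3 decodes per second of 60-second music on a single GPU.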

Comments 15 pages, 3 figures, 15 tables. Project page with audio samples and demo video: https://daydreamlive.github.io/DEMON/

Abstract

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
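The per-slot scheduling idea in mechanism (1) can be pictured with a toy ring buffer. This is a minimal sketch under stated assumptions, not DEMON's implementation: `Ring`, `make_schedule`, and the evenly spaced timestep schedule are hypothetical names and choices made for illustration. Each slot stores its own schedule at admission time, so a change to the denoise slider affects only newly admitted requests and never forces a flush of the in-flight queue.

```python
from collections import deque


def make_schedule(denoise: float, steps: int) -> list[float]:
    """Hypothetical schedule: evenly spaced timesteps from `denoise` toward 0."""
    return [denoise * (steps - i) / steps for i in range(steps)]


class Ring:
    """Toy ring buffer in which every slot owns its timestep schedule."""

    def __init__(self, steps: int):
        self.steps = steps
        self.slots = deque()  # each slot: {"id", "schedule", "pos"}

    def tick(self, request_id: int, denoise: float) -> list[int]:
        """Admit one request, advance every slot one step, return finished ids."""
        self.slots.append({
            "id": request_id,
            "schedule": make_schedule(denoise, self.steps),
            "pos": 0,
        })
        for slot in self.slots:
            # A real engine would run one solver step at slot["schedule"][slot["pos"]].
            slot["pos"] += 1
        done = []
        while self.slots and self.slots[0]["pos"] >= self.steps:
            done.append(self.slots.popleft()["id"])
        return done


ring = Ring(steps=4)
# The slider moves on every tick; nothing in flight is discarded.
finished = [ring.tick(i, denoise=0.5 + 0.1 * i) for i in range(6)]
print(finished)
```

In this toy model a request admitted on one tick completes after `steps` solver steps, which mirrors the drain-rate floor of S denoising steps that the abstract describes for per-request changes.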


2605.28655 2026-05-28 cs.AI

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

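Shanghua Gao, Ada Fang, Marinka Zitnik

AI summary: Presents AutoScientists, a decentralized team of AI agents that self-organize, critique proposals, and share knowledge of failures, outperforming prior agents in long-running experiments on biomedical machine learning, language-model training optimization, and protein fitness prediction.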

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).


2605.28654 2026-05-28 cs.RO cs.SY eess.SY math.OC

Integrated Exploration-Aware UAV Route Optimization and Path Planning

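Jimin Choi, Grant Stagg, Cameron K. Peterson, Max Z. Li

AI summary: Presents an integrated exploration-aware UAV route optimization and path planning framework that combines a risk map, uncertain region-of-interest modeling, B-spline trajectory optimization, and online replanning to balance visits to reported locations against exploration for new information in disaster monitoring, improving average KL reduction by 15.9% over the offline optimized planner.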

Abstract

Uncrewed aerial vehicles (UAVs) are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be allocated between visiting reported locations and gathering new information. In these settings, prior information regarding hazards is often incomplete, spatially imprecise, and subject to change during execution. For example, initial reports may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or entirely unreported. We present an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map, where each location has an associated belief of hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while also using its limited flight endurance to explore informative regions. The proposed method solves a vehicle routing problem over reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify adaptation. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and 48.6% over straight-line traversal.
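The belief-map update and the KL-based score can be illustrated for a single grid cell. This is a sketch under assumptions the abstract does not state: a Bernoulli hazard belief per cell, a binary sensor with hypothetical true-positive and false-positive rates (`tpr`, `fpr`), and KL divergence measured from the ground truth to the belief. The paper's actual sensor model and KL definition may differ.

```python
import math


def bayes_update(prior: float, detected: bool, tpr: float = 0.9, fpr: float = 0.1) -> float:
    """Posterior hazard probability for one cell after one binary measurement."""
    like_hazard = tpr if detected else 1.0 - tpr
    like_clear = fpr if detected else 1.0 - fpr
    evidence = like_hazard * prior + like_clear * (1.0 - prior)
    return like_hazard * prior / evidence


def kl_bernoulli(p: float, q: float, eps: float = 1e-9) -> float:
    """KL(p || q) between two Bernoulli distributions, with q clipped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


truth = 1.0                      # the cell really is hazardous
before = 0.5                     # uninformed prior belief
after = bayes_update(before, detected=True)
reduction = kl_bernoulli(truth, before) - kl_bernoulli(truth, after)
print(after, reduction)
```

Summing this reduction over all grid cells gives one plausible reading of a map-level KL reduction that a replanner could compare across candidate trajectories.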


2605.28649 2026-05-28 cs.LG cs.CL

Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing

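Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das

AI summary: Finds that projecting task vectors onto sparse autoencoder (SAE) feature subspaces discards about 97% of the modification energy and yields ineffective edits; proposes using SAEs for layer-level diagnosis rather than intervention-level filtering, injecting raw task vectors into layers selected by an SAE-derived specificity score, which improves performance on mathematical reasoning tasks.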

Abstract

LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle allowing for feature-level identification of where to intervene. In this work, we rigorously evaluate an SAE-guided editing pipeline for mathematical reasoning on Gemma-3-4B-IT and uncover a fundamental failure mode: the intuitively appealing approach of projecting task vectors onto SAE feature subspaces acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant improvements across seven math subjects. We show that this failure stems from a geometric misalignment between activation-space SAE directions and weight-space task vectors. We then propose a shift in perspective: SAE as a Stethoscope, Not a Scalpel, where SAEs are used for layer-level diagnosis rather than intervention-level filtering. By injecting unfiltered raw task vectors only into layers identified by an SAE-derived specificity score, we improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on the Minerva Math benchmark; 5 of 7 math subjects significantly improved and none significantly degraded. Our method is fully deterministic, requires no additional inference cost, and provides a principled framework for interpretability-guided model editing.
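The "discarded modification energy" can be made concrete as the fraction of a vector's squared norm that survives projection onto a subspace. The sketch below assumes that reading; `retained_energy` and the toy four-dimensional vectors are illustrative only and are not taken from the paper. It orthonormalizes the feature directions with Gram-Schmidt and then sums the squared coefficients.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt; drops directions that are linearly dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def retained_energy(task_vector, feature_directions):
    """Fraction of ||task_vector||^2 kept after projecting onto span(feature_directions)."""
    basis = orthonormalize(feature_directions)
    projected = sum(dot(task_vector, b) ** 2 for b in basis)
    return projected / dot(task_vector, task_vector)


task_vector = [1.0, 1.0, 1.0, 1.0]
features = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
print(retained_energy(task_vector, features))  # 0.5: half the energy lies outside the span
```

When the subspace has far fewer dimensions than the vector and is not aligned with it, most of the energy falls outside the span, which is the kind of bottleneck the abstract reports at roughly 97% loss.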


2605.28647 2026-05-28 cs.AI cs.CY q-fin.RM

The Ethics of LLM Sandbox and Persona Dynamics

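Tim Gebbie, Stewart Gebbie

AI summary: Argues that the reality gap produced by LLM guardrails and persona dynamics amounts to unethical "reality laundering", and proposes addressing it through causal requirements specification at the task level rather than moral correction at the response level.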

Comments 8 pages

Abstract

It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user; this is reality laundering. This can potentially cause harm when operationalised at scale. The risk is sharpest in high-exposure advice contexts, where users seek orientation rather than a bounded, externally checkable task. Guardrails naively appear ethically necessary when they claim to prevent direct harm, but often become suspect when they suppress truthful perception and launder uncomfortable mechanisms into acceptable abstractions. Basel-style financial regulation, B-BBEE-style compliance, Societe Generale, and the London Whale show how formal safety systems can become legible, gameable, and performative while real exposure migrates elsewhere. The same pattern can appear in LLMs as moral compliance: safe language, distorted reality. We therefore distinguish refusing harm from refusing reality, and then argue for top-down causal requirements specification at the task level rather than bottom-up moral correction at the response or sandbox level. Persona dynamics matter because the assistant interface is not neutral; it shapes how uncertainty, conflict, authority, and risk are staged. The conclusion is that so-called "ethical AI" becomes substantively unethical when it substitutes institutional reassurance for contact with reality.


2605.28642 2026-05-28 cs.AI

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

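Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Ming Liu, Bing Qin, Yang Xiang

AI summary: Presents ESRT, a collaborative edge-cloud framework whose split-inference architecture transmits compressed intermediate features, cutting bandwidth by up to 10x and protecting voice privacy, and which uses a multi-task weighted curriculum learning strategy for many-to-many speech translation across 45 languages.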

Abstract

Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, restricting many-to-many translation scaling. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10x. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages (45 × 44 directions). Code and models are released at https://github.com/yxduir/esrt to facilitate reproducible, privacy-aware MLLM S2TT research.


2605.28640 2026-05-28 cs.LG

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

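Xiuying Wei, Caglar Gulcehre

AI summary: Augments attention with an exponentially decaying memory module, improving the accuracy of query-aware sparse-attention inference methods, with gains validated across multiple tasks.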

Abstract

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
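Two generic ingredients named in the abstract can be sketched in isolation: an exponentially decaying memory and query-aware selection of a sparse key budget. The code below is a textbook exponential-moving-average recurrence and a plain top-k dot-product selection. It is an assumption that these resemble the paper's components; the abstract does not specify how RAT+ combines them, and `decaying_memory` and `top_k_indices` are hypothetical names.

```python
def decaying_memory(values, decay):
    """Recurrence m_t = decay * m_{t-1} + (1 - decay) * v_t, applied elementwise."""
    memory = [0.0] * len(values[0])
    states = []
    for v in values:
        memory = [decay * m + (1.0 - decay) * x for m, x in zip(memory, v)]
        states.append(memory)
    return states


def top_k_indices(query, keys, k):
    """Indices of the k keys with the largest dot product against the query."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: -scores[i])
    return sorted(ranked[:k])


# A single impulse fades geometrically: 0.5, 0.25, 0.125.
print(decaying_memory([[1.0], [0.0], [0.0]], decay=0.5))

# Only the two highest-scoring keys are kept under a sparse budget of k = 2.
print(top_k_indices([1.0, 0.0], [[0.1, 5.0], [0.9, 0.0], [0.5, 0.0]], k=2))
```

With a recurrence like this, each stored state carries a decayed summary of preceding tokens, so a key selected under a small budget may still expose some information about positions that were not selected. This is only an intuition, not a restatement of the paper's two hypotheses.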


2605.28639 2026-05-28 cs.CL cs.AI

The Attentional White Bear Effect in Transformer Language Models

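Rebecca Ramnauth, Brian Scassellati

AI summary: Using representation probing, attention analysis, and behavioral semantic-leakage experiments, finds that under instructed suppression Transformer language models still recover representations of the forbidden concept, which then influence subsequent generation, revealing a fundamental gap between behavioral alignment and representational alignment.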

Comments Currently under review at EMNLP 2026

2605.28634 2026-05-28 cs.RO

PrimitiveVLA: Learning Reusable Motion Primitives for Efficient and Generalizable Robotic Manipulation

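Yutai Li, Shaohui Peng, Jiaming Guo, Di Huang, Zihao Zhang, Yuxuan Guo, Yunkai Gao, Siming Lan, Ling Li, Xing Hu, Yunji Chen

AI summary: Presents PrimitiveVLA, a framework that moves vision-language-action models from direct instruction-to-control mapping to a primitive-centric disassemble-and-assemble paradigm, using a multimodal canonical representation and an automated pipeline to improve data efficiency and achieve zero-shot generalization.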

Abstract

Vision-Language-Action (VLA) models offer a promising paradigm for generalist robotic policies, yet their adaptation is hindered by data inefficiency and poor generalization. We argue that these bottlenecks stem from the prevailing Direct Instruction-to-Control Mapping, which forces models to memorize monolithic trajectories rather than reusable motion patterns, i.e., primitives. We propose PrimitiveVLA, a framework that shifts this paradigm toward a Primitive-Centric Disassemble & Assemble paradigm. Supported by a shared Multimodal Canonical Representation (MCR), PrimitiveVLA unifies two phases: (1) Fine-tuning-phase Disassembly, which uses an automated pipeline to disassemble demonstrations into reusable primitives; and (2) Inference-phase Assembly, which employs a VLM-based planner and an LLM-generated switch module for robust closed-loop execution. By disassembling tasks into reusable primitives, PrimitiveVLA enables VLA models to learn invariant motion patterns instead of task-specific trajectories. Extensive experiments show that our framework improves data efficiency and achieves superior zero-shot generalization across unseen and long-horizon tasks.


2605.28631 2026-05-28 cs.LG

Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection

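Jianghao Wu, Jianfei Cai, Weiqiang Wang, Jin Ye, Daniel F. Schmidt, Yasmeen George

AI summary: Presents SHIFT, which uses the hidden-state change over a single reasoning rollout (RIRS) as a proxy for instance utility, combined with quality-weighted CoreSet coverage, to select RLVR data without training or labels; it outperforms baselines on mathematical reasoning and medical QA.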

Comments 14 pages, 2 figures. Accepted by ICML 2026

Abstract

Reinforcement learning with verifiable rewards (RLVR) can yield large reasoning gains from very few training instances, yet its strong sensitivity to which instances are used makes data selection a central bottleneck. Most existing selection pipelines rely on training-time optimization signals and/or require access to verifiable rewards or ground-truth answers over large candidate pools, which is costly and often infeasible in specialized domains. We study RLVR data selection in a setting where selection must be performed before any RL training and without labels or reward evaluation on the full pool. We propose SHIFT, a one-shot, training-free selector based solely on inference-time hidden-state dynamics. For each candidate instance, SHIFT runs a single deterministic reasoning rollout and computes a reasoning-induced representation shift (RIRS) as the start-to-end hidden-state delta. SHIFT uses the RIRS magnitude as a lightweight proxy for instance utility and enforces coverage via a quality-weighted farthest-first CoreSet procedure in an RIRS-augmented feature space, producing compact subsets that scale to large unlabeled pools. Across mathematical reasoning and medical QA benchmarks under ultra-low budgets, SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines, improving both in-domain accuracy and transfer to harder evaluation settings. Ablations show that RIRS-based coverage and quality-weighting contribute complementary gains, and analyses indicate that RIRS is not explained by simple input/output length statistics. Code is available at github.com/JianghaoWu/SHIFT.
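The selection step can be sketched as a greedy loop. This is one plausible reading of "quality-weighted farthest-first", not the authors' code: the score `quality * min_distance`, the use of the raw shift vector as the feature, and the names `rirs` and `select` are assumptions. The released repository is the authoritative reference.

```python
import math


def rirs(h_start, h_end):
    """Start-to-end hidden-state delta for one rollout."""
    return [e - s for s, e in zip(h_start, h_end)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select(features, quality, budget):
    """Greedy selection: seed with the highest-quality item, then repeatedly add
    the item maximizing quality * (distance to its nearest already-chosen item)."""
    chosen = [max(range(len(features)), key=lambda i: quality[i])]
    while len(chosen) < min(budget, len(features)):
        rest = [i for i in range(len(features)) if i not in chosen]

        def score(i):
            return quality[i] * min(dist(features[i], features[j]) for j in chosen)

        chosen.append(max(rest, key=score))
    return chosen


starts = [[0.0, 0.0]] * 4
ends = [[3.0, 4.0], [3.1, 4.0], [0.0, 1.0], [-6.0, -8.0]]
shifts = [rirs(s, e) for s, e in zip(starts, ends)]
quality = [norm(v) for v in shifts]       # RIRS magnitude as the utility proxy
print(select(shifts, quality, budget=3))  # the near-duplicate item 0 is skipped
```

The distance factor is what enforces coverage: item 0 has a large shift magnitude, but it sits next to the already selected item 1 and so scores poorly.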


2605.28630 2026-05-28 cs.CV cs.MM

EntroAD: Structural Entropy-Guided Prompt Adaptation for Zero-Shot Anomaly Detection

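Xinyu Zhao, Qingyun Sun, Jiayi Luo, Jianxin Li

AI summary: Presents EntroAD, a framework that uses a structural-entropy-guided dynamic routing mechanism and confidence-aware dual-branch prompt adaptation for zero-shot anomaly detection, achieving state-of-the-art performance in cross-dataset settings.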

Abstract

Zero-Shot Anomaly Detection (ZSAD) aims to detect anomalies in unseen domains without target-domain adaptation. Recent CLIP-based methods have shown promising performance by leveraging prompt learning and visual-text alignment. However, most existing approaches rely on a single adaptation pathway, which may be insufficient for heterogeneous anomaly patterns across domains. In practice, anomalies exhibit vastly different characteristics, ranging from salient, localized structural disruptions to subtle, diffuse, and irregular variations. To address this challenge, we propose EntroAD, a structural entropy-guided zero-shot anomaly detection framework. Unlike previous methods, EntroAD introduces a dynamic routing mechanism to process different types of anomalies with specialized adaptation strategies. Specifically, we estimate patch-level structural entropy from self-attention-induced patch relations and use it as a proxy for relational uncertainty to guide anomaly-aware token routing. Based on this routing signal, we construct anomaly-aware routed tokens to better capture anomaly cues with different structural characteristics. We further introduce a confidence-aware dual-branch prompt adaptation module to stabilize visual-text alignment while preserving CLIP's transferable prior. Extensive experiments on 10 industrial and medical benchmarks show that EntroAD achieves state-of-the-art performance in challenging cross-dataset ZSAD settings.
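Entropy-guided routing can be illustrated with a much simpler quantity than the one the paper uses. The sketch below substitutes the Shannon entropy of each patch's attention row for the paper's structural entropy, and routes patches by a fixed threshold. The branch names `"localized"` and `"diffuse"`, the threshold, and the function names are all assumptions made for illustration.

```python
import math


def row_entropy(weights):
    """Shannon entropy (in nats) of one non-negative attention row, after normalizing."""
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)


def route(attention, threshold):
    """Send low-entropy (peaked) patches to one branch and high-entropy patches to another."""
    return ["diffuse" if row_entropy(row) > threshold else "localized" for row in attention]


attention = [
    [1.0, 0.0, 0.0, 0.0],      # attends to a single patch: entropy 0
    [0.25, 0.25, 0.25, 0.25],  # attends uniformly: entropy ln(4)
]
print(route(attention, threshold=0.7))
```

A peaked attention row signals a patch with one dominant relation, while a flat row signals relational uncertainty, which is the kind of signal the abstract describes using to guide token routing.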


2605.28629 2026-05-28 cs.CL

Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

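Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, Gongshen Liu, Zhuosheng Zhang

AI summary: Addresses over-execution and over-soliciting in mobile-using agents with a confidence-driven framework for proactive and robust interaction, built on supervised fine-tuning and confidence bias correction, reaching state-of-the-art performance.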

Comments Accepted by TASLP

Abstract

Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show that Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement of over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at https://github.com/Wuzheng02/Mobile-Aptus.
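The interaction policy implied by "actions and confidence scores" can be sketched as a threshold gate. This is a hypothetical illustration: the `Step` type, the 0.5 threshold, and the `decide` rule are assumptions, and the paper's trained agent may use its confidence scores differently.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str        # proposed UI action, e.g. "tap"
    confidence: float  # self-reported confidence in [0, 1]


def decide(step: Step, threshold: float = 0.5):
    """Execute confident steps; hand low-confidence steps to the human."""
    kind = "execute" if step.confidence >= threshold else "ask_human"
    return kind, step.action


def interventions_per_instruction(episodes, threshold: float = 0.5) -> float:
    """Average number of human hand-offs per instruction (one episode per instruction)."""
    asks = sum(decide(s, threshold)[0] == "ask_human" for ep in episodes for s in ep)
    return asks / len(episodes)


episodes = [
    [Step("tap", 0.9), Step("type", 0.3)],
    [Step("swipe", 0.8)],
]
print(interventions_per_instruction(episodes))  # 0.5
```

The threshold exposes the trade-off the abstract names: set it too low and the agent over-executes, set it too high and it over-solicits. This is why the accuracy of the confidence scores matters.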


2605.28626 2026-05-28 cs.LG

When Interpretability Is Unequally Distributed: Fairness in Hybrid Interpretable Models

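Ziba Jabbar Zare, Ulrich Aïvodji, Julien Ferry, Thibaut Vidal

AI summary: Observes that hybrid interpretable models can route demographic groups unevenly between the interpretable component and the black box, proposes the Interpretability Coverage Disparity (ICD) measure, and mitigates the disparity with constraints.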

Abstract

Hybrid interpretable models combine a transparent component with a black-box model by assigning some examples to the former and deferring the rest to the latter. While this design enables flexible tradeoffs between accuracy and interpretability, it also raises a distinct procedural fairness concern: some demographic groups may systematically receive interpretable decisions, while others are disproportionately routed to a black box. We formalize this issue as Interpretability Coverage Disparity (ICD), a demographic-parity-style measure applied to the routing decision of hybrid interpretable models. Using tools from predictive multiplicity, we study ICD across four hybrid interpretable learning methods, three standard fairness benchmark datasets, and multiple sensitive attributes. Our experiments reveal substantial ICD in intermediate transparency regimes, where both the interpretable and black-box components are actively used. We further show that simple coverage-disparity constraints can significantly reduce ICD in exact hybrid learning methods, with marginal impact on accuracy and sparsity. In several settings, ICD mitigation also improves standard algorithmic fairness metrics. These results show that hybrid interpretable models should be audited not only for predictive fairness, but also for how they allocate interpretability across individuals and groups.
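A demographic-parity-style measure on the routing decision can be written in a few lines. The sketch assumes ICD is the gap between the highest and lowest per-group rates of being routed to the interpretable component; the paper's exact definition may differ, and `coverage_disparity` is a hypothetical name.

```python
def coverage_disparity(routed_to_interpretable, groups):
    """Return (gap, per-group rates), where each rate is the share of a group's
    examples handled by the interpretable component."""
    rates = {}
    for g in set(groups):
        members = [r for r, gg in zip(routed_to_interpretable, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values()), rates


# 1 = routed to the transparent component, 0 = deferred to the black box.
routed = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = coverage_disparity(routed, groups)
print(gap, rates)  # group "a" gets interpretable decisions 75% of the time, "b" only 25%
```

A gap of zero means every group receives interpretable decisions at the same rate. A constraint of the kind the abstract mentions would bound this gap during training.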


2605.28625 2026-05-28 cs.LG

Random Process Flow Matching: Generative Implicit Representations of Multivariate Random Fields

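Julien Lalanne, David Picard, Lionel Boillot, Lina-María Guayacán-Carrillo, Leon Barens, Jean-Michel Pereira

AI summary: Presents Random Process (RP) Flow, a flow-matching-based framework that learns an implicit signal representation with Random Fourier Features and encodes uncertainty through ensemble sampling, generating high-quality samples with calibrated uncertainty estimates from sparse observations.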

Journal ref: 43rd International Conference on Machine Learning, 2026

Abstract

Generative modeling provides a powerful framework for learning data distributions. These models initially relied on probabilistic methods such as Gaussian Processes (GP) for uncertainty-aware predictions and shifted towards larger trainable models to learn more complex distributions. In this work, we introduce Random Process (RP) Flow, a Flow Matching-based framework that represents the vector field as a neural implicit function. Unlike modern generative methods, our setting involves a single observed field, from which only sparse measurements are available. RP Flow uses Random Fourier Features to learn an implicit signal representation that can be queried at any arbitrary location from a limited set of observations, while encoding uncertainty through ensemble sampling. We propose constructing a Bayesian posterior by GP regression in the source space to generate high-quality samples. Our empirical results demonstrate that this framework generates realistic samples along with calibrated uncertainty estimates, even under challenging conditions such as high frequency, high sparsity, or high dimensionality. These findings position RP Flow as a milestone towards generative models for reconstruction tasks where data is scarce and uncertainty must remain traceable.
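Random Fourier Features are a standard construction and can be shown on their own. The sketch below builds the classic cosine feature map whose inner products approximate a Gaussian (RBF) kernel. The `lengthscale`, feature count, and seed are illustrative choices, and how RP Flow uses such features inside its network is not specified in the abstract.

```python
import math
import random


def make_rff(dim_in, n_features, lengthscale=1.0, seed=0):
    """Return phi(x) such that phi(x) . phi(y) approximates
    exp(-||x - y||^2 / (2 * lengthscale^2))."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim_in)]
               for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, phases)]

    return phi


phi = make_rff(dim_in=1, n_features=4000)
x, y = [0.0], [0.5]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = math.exp(-0.5 ** 2 / 2.0)
print(approx, exact)  # the two values should be close for a large feature count
```

Because `phi` accepts any coordinate, a model built on these features can be queried at arbitrary locations rather than only on a fixed grid, which matches the implicit-representation property the abstract describes.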


2605.28619 2026-05-28 cs.CV nlin.AO

A Multiscale Kinetic Framework for Image Segmentation: From Particle Systems to Continuum Models

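Horacio Tettamanti, Giulia Guicciardi, Mattia Zanella

AI summary: Presents a consensus-based multiscale kinetic framework that treats an image as a particle system, derives a kinetic equation and a macroscopic model, and applies particle-based optimization to image segmentation.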

Comments 26 pages, 34 figures

Abstract

In this work, we present a multiscale kinetic framework for consensus-based image segmentation. By interpreting an image as a system of interacting particles, each pixel is characterised by its spatial position and an internal feature encoding color information. We introduce a coupled interaction scheme governing the evolution of particles in both position and feature spaces, from which we derive a kinetic formulation for the particle density in the space-feature domain combining transport, aggregation, and diffusion effects. Furthermore, through a suitable scaling, we obtain a first-order macroscopic model describing the evolution of the fraction of pixels carrying information on the fraction of pixels having a certain feature. Based on this reduced-complexity model, we present a data-oriented approach where we make use of particle-based optimisation techniques for the accurate segmentation of images. Numerical tests show the effectiveness of the proposed framework and its robustness under different noise conditions.

2605.28617 2026-05-28 cs.AI cs.PL

LACUNA: Safe Agents as Recursive Program Holes

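Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky

AI summary: Proposes LACUNA, a programming model in which LLM agents write code safely as recursive program holes, using typed calls and compile-time checks to combine expressiveness with safety.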

English abstract

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call $\texttt{agent[T](task)}$ that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and $τ^2$-bench. On BrowseComp-Plus, $8.6\%$ of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches $27.1\%$ accuracy. On $τ^2$-bench, LACUNA solves $76.0\%$ of $392$ tasks across four domains with a capable model, on par with the baseline agent.
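The accept-or-reject-then-retry loop described in this abstract can be sketched in Python. This is only an illustration under loud assumptions: LACUNA relies on a real type checker, whereas here a syntax check plus an allow-list of names stands in for it, and `agent`, `check`, `generate`, and `tools` are hypothetical names that do not come from the paper.

```python
import ast
from typing import Any, Callable, Dict, List


def check(source: str, allowed: Dict[str, Any]) -> List[str]:
    """Static check run BEFORE execution (an assumed stand-in for type-checking).

    Returns a list of diagnostics; an empty list means the code is accepted.
    """
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    return [
        f"name not permitted: {node.id}"
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id not in allowed
    ]


def agent(
    task: str,
    generate: Callable[[str, List[str]], str],
    tools: Dict[str, Any],
    max_retries: int = 3,
) -> Any:
    """Fill a 'hole' with generated code, running it only if the check passes."""
    diagnostics: List[str] = []
    for _ in range(max_retries + 1):
        # `generate` is a hypothetical stand-in for the LLM call; it sees the
        # diagnostics of the previous rejected attempt.
        source = generate(task, diagnostics)
        diagnostics = check(source, tools)
        if not diagnostics:
            # Accepted as a whole: only now does any generated code run.
            # Empty builtins keep the toy narrow; this is NOT a real sandbox.
            code = compile(source, "<agent>", "eval")
            return eval(code, {"__builtins__": {}}, dict(tools))
        # Rejected as a whole: nothing executed, so the environment is untouched.
    raise RuntimeError(f"no acceptable code after {max_retries} retries: {diagnostics}")


# Hypothetical model: a disallowed first attempt, then an acceptable one.
attempts = iter(["delete_all()", "add(1, 2)"])
seen: List[List[str]] = []


def fake_llm(task: str, diagnostics: List[str]) -> str:
    seen.append(list(diagnostics))
    return next(attempts)


result = agent("add one and two", fake_llm, {"add": lambda a, b: a + b})
```

In this toy run the first attempt is rejected before anything executes, and its diagnostic is passed back to the generator for the retry. A real implementation would need a genuine type system to bound data flow, which a name allow-list cannot do.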

2605.28616 2026-05-28 cs.CL cs.AI

Measuring Form and Function in Language Models

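Héctor Javier Vázquez Martínez, Charles Yang

AI summary: Introduces quantitative metrics from child language acquisition and a contextual-alternative-choice (CAC) prompting method to evaluate language models' knowledge of the formal syntax and functional discourse use of English determiners; only large models meet both the formal and the functional benchmarks.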

Comments Under review at ACL Rolling Review May 2026 cycle

2605.28615 2026-05-28 cs.CV

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

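Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu

AI summary: Proposes the BiDPO framework, which builds BiComp, a large-scale preference dataset, extends Diffusion DPO to jointly optimize image and text preferences, and adds region-level guidance to improve the fidelity of text-to-image models on complex compositional prompts.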

English abstract

Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, and counting) remains challenging. To address this, we propose BiDPO, a framework to enhance T2I models' capability for compositional text-to-image generation. We begin by introducing a carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strict quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which is shown to be highly effective in improving the models' ability to follow complex text prompts during generation. To further enhance the models' fine-grained alignment, we employ a region-level guidance method that focuses on regions relevant to compositional concepts. Experimental results demonstrate that our BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.

2605.28612 2026-05-28 cs.LG

Learning High-Dimensional Parity Functions with Product Networks using Gradient Descent

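Guillaume Larue, Louis-Adrien Dufrène, Quentin Lampin, Hadi Ghauch, Ghaya Rekaya

AI summary: Combines a compact product-based neural architecture with stochastic data sparsity (Bernoulli inputs, p_e ≤ 1/N) and hyperparameter tuning to learn high-dimensional parity functions efficiently, with theoretical convergence guarantees.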

English abstract

Parity functions are fundamental Boolean operations with critical applications across machine learning, cryptography, and error correction. Yet, learning high-dimensional parity functions poses significant challenges: in a general setting, standard neural network architectures typically require exponential sample complexity, making gradient-based optimization intractable for a large number of inputs $N$. We demonstrate that compact product-based neural architectures combined with stochastic data sparsity (Bernoulli inputs with $p_e \leq 1/N$) and appropriate hyperparameter choice enable efficient parity learning, with theoretical guarantees of convergence. Experiments validate our theory across dimensions up to $N = 100{,}000$, with empirical evidence showing optimal hyperparameter choices for $p_e$ and learning rate $α$, as well as polynomial complexity scaling laws. This work establishes fundamental connections between architectural inductive bias and data sparsity, opening new possibilities for neural arithmetic, structured reasoning, binary neural networks, and machine learning applied to automated protocol discovery.
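Why a product structure suits parity can be seen in a few lines: once bits are mapped to signs, parity is exactly a product. The sketch below is a hedged illustration, not the paper's architecture; the unit `product_unit`, the 0/1 weights, the chosen subset, and the helper names are all assumptions made for the example.

```python
import random
from typing import List, Set


def product_unit(x: List[int], w: List[float]) -> float:
    """Compute y = prod_i (1 - 2 * w_i * x_i).

    With w_i in {0, 1} this equals (-1) ** (number of selected bits set to 1),
    so y = +1 means even parity and y = -1 means odd parity.
    """
    y = 1.0
    for xi, wi in zip(x, w):
        y *= 1.0 - 2.0 * wi * xi
    return y


def parity(x: List[int], subset: Set[int]) -> int:
    """Reference parity of the bits indexed by `subset`."""
    return sum(x[i] for i in subset) % 2


def sparse_bernoulli_input(n: int, p_e: float, rng: random.Random) -> List[int]:
    """Draw n independent bits, each equal to 1 with probability p_e."""
    return [1 if rng.random() < p_e else 0 for _ in range(n)]


rng = random.Random(0)
n = 1000
subset = set(range(0, n, 7))  # hypothetical target subset
w = [1.0 if i in subset else 0.0 for i in range(n)]

# With p_e = 1/N each input has about one active bit on average.
agreements = 0
for _ in range(200):
    x = sparse_bernoulli_input(n, 1.0 / n, rng)
    expected = 1.0 if parity(x, subset) == 0 else -1.0
    agreements += product_unit(x, w) == expected
```

Here the weights are set by hand to show that the target is representable; the abstract's claim concerns learning such weights by gradient descent, which this sketch does not attempt.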

2605.28609 2026-05-28 cs.CV

JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models

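Jiachen Qian

AI summary: Proposes JECA^2, a white-box adversarial attack on forensic vision-language models that uses Grad-CAM-guided visual perturbations and text-embedding optimization under a token-proximity constraint to keep judgment and explanation consistent; experiments show higher attack success and consistency than the baselines.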

Comments 37 pages, 6 figures. Includes supplementary material

English abstract

Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model's binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.

2605.28607 2026-05-28 cs.AI cs.CL

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

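Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

AI summary: Proposes a multimodal multi-agent framework that builds a topological knowledge base offline and, online, combines adaptive retrieval-augmented generation with closed-loop collaborative verification to execute workflows automatically.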

Comments Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026)

English abstract

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

2605.28605 2026-05-28 cs.CV

Internally Referenced Low-Light Enhancement

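Peiyuan He, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

AI summary: Proposes an internally referenced low-light enhancement framework that extracts physical and structural references from the degraded input and combines local exposure simulation, a dual-domain preservation strategy, and gain-adaptive feature modulation for self-supervised low-light image enhancement, reaching state-of-the-art noise suppression and texture fidelity.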

English abstract

Self-supervised low-light image enhancement (LLIE) is highly appealing as it eliminates the reliance on external paired data. However, the lack of external references causes networks to struggle with decoupling entangled illumination, delicate textures, and amplified noise. To resolve this challenge, we propose an Internally Referenced LLIE framework that extracts reliable physical and structural references from the degraded input image itself. First, we introduce a local exposure-simulated scheme to extract a low-frequency pseudo ground-truth. This serves as an internal physical reference to guide global illumination estimation and correct color casts. Second, we propose a dual-domain preservation strategy with spatial and spectral constraints to construct internal structural references. Specifically, an Illumination-Aligned Perceptual loss preserves global structures under illumination shifts, while a Shift-Invariant Spectral Correlation loss captures fine-grained local structures and suppresses high-frequency noise. Finally, we propose a Gain-Adaptive Feature Modulation (GAFM) mechanism to address highly spatially-variant residual noise. By transforming the self-estimated illumination map into an internal spatial gain prior, GAFM dynamically guides a blind-spot network for spatially-aware denoising. Extensive experiments demonstrate that our method achieves state-of-the-art performance, delivering superior noise suppression and textural fidelity. Code will be publicly released at https://visonj.github.io/IRLE/.

2605.28604 2026-05-28 cs.CV cs.AI

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

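Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang, Zheng Wang, Xin Xu, Mang Ye

AI summary: To handle the way a person's importance changes over the course of a video, proposes the VIP-Net framework, which fuses multi-modal spatio-temporal cues and rectifies temporal importance, reaching 67.3% accuracy on the Temporal-VIP dataset.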

English abstract

Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.

2605.28603 2026-05-28 cs.LG cs.AI

Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration

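Haonan Wen, Hanyang Chen, Songhe Feng

AI summary: To counter the performance degradation caused by shifting data distributions in online irregular multivariate time series forecasting, proposes Under-Cali, an uncertainty-driven dual-expert calibration framework whose uncertainty estimator, dual-expert calibration module, and adaptive routing module enable stable, efficient online learning.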

Comments Accepted by KDD 2026

English abstract

Irregular multivariate time series forecasting is critical in many real-world applications, where time series are irregularly sampled and exhibit dynamically evolving missingness patterns. Although existing methods perform well in offline settings, they often suffer from significant performance degradation when deployed online due to dynamic shifts in data distribution. Maintaining forecasting capability in such dynamic scenarios typically necessitates online adaptation techniques. Since irregular sampling fundamentally undermines temporal continuity and periodicity, we cannot leverage these widely studied characteristics from regular MTS for online learning. To this end, we study the problem of online IMTS forecasting and propose Under-Cali, an uncertainty-driven dual-expert calibration framework consisting of three core components: an uncertainty estimator, a dual-expert calibration module, and an adaptive routing module. We design an uncertainty estimator that serves as the core control signal to jointly manage inference and adaptation processes. In our framework, the uncertainty estimator first assesses uncertainty for each incoming batch. The adaptive routing module then directs samples with high uncertainty to the unreliable expert for calibration, while low uncertainty samples remain with the reliable expert. Subsequently, the system updates the reliable expert and the uncertainty estimator using well-calibrated reliable samples, and updates the unreliable expert with challenging samples, enabling stable and efficient online learning. Under-Cali keeps the source forecasting model frozen and performs adaptation only through a lightweight, model-agnostic calibration module, enabling efficient adaptation. Extensive experiments on IMTS benchmarks demonstrate consistent improvements with low computational cost. Our code is available at https://github.com/HaonanWen/Under-Cali.
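The routing step in this abstract can be sketched as follows. The fixed threshold rule, the function name `route_batch`, and the scalar uncertainty scores are assumptions for illustration; the abstract does not say how the split is actually made.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def route_batch(
    samples: Sequence[T],
    uncertainties: Sequence[float],
    threshold: float,
) -> Tuple[List[T], List[T]]:
    """Split a batch by estimated uncertainty.

    Returns (reliable, unreliable): low-uncertainty samples stay with the
    reliable expert, high-uncertainty samples go to the unreliable expert
    for calibration. The hard threshold is an assumed rule.
    """
    if len(samples) != len(uncertainties):
        raise ValueError("samples and uncertainties must have the same length")
    reliable: List[T] = []
    unreliable: List[T] = []
    for sample, u in zip(samples, uncertainties):
        (unreliable if u > threshold else reliable).append(sample)
    return reliable, unreliable


# Toy batch with hypothetical uncertainty scores.
reliable, unreliable = route_batch(["s0", "s1", "s2"], [0.1, 0.9, 0.4], threshold=0.5)
```

Per the abstract, the reliable group would then update the reliable expert and the uncertainty estimator, while the unreliable group updates the unreliable expert; the frozen source forecasting model is not touched.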

2605.28602 2026-05-28 cs.AI cs.CL cs.LO

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

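Leizhen Zhang, Shuhan Chen, Sheng Chen

AI summary: Proposes a paired-formula protocol and the Accurate Differentiation Rate (ADR) to evaluate LLM reasoning on SAT problems; conventional metrics prove misleading, whereas ADR gives a more faithful, representation-robust assessment.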

Comments Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)

English abstract

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
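A minimal sketch of the paired metric, assuming each pair is stored as the model's two predictions with the satisfiable member first. The function names and the data layout are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple

# (prediction for the satisfiable member, prediction for the unsatisfiable member);
# True means the model answered "satisfiable".
Pair = Tuple[bool, bool]


def accurate_differentiation_rate(pairs: Sequence[Pair]) -> float:
    """Fraction of matched pairs in which BOTH members are classified correctly."""
    if not pairs:
        raise ValueError("need at least one pair")
    both_right = sum(1 for pred_sat, pred_unsat in pairs if pred_sat and not pred_unsat)
    return both_right / len(pairs)


def accuracy(pairs: Sequence[Pair]) -> float:
    """Conventional per-instance accuracy over the same pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(int(pred_sat) + int(not pred_unsat) for pred_sat, pred_unsat in pairs)
    return hits / (2 * len(pairs))


# A model that always answers "satisfiable" on four matched pairs.
always_sat = [(True, True)] * 4
adr_always_sat = accurate_differentiation_rate(always_sat)
acc_always_sat = accuracy(always_sat)
```

On balanced pairs, the always-"satisfiable" model keeps 50% accuracy but scores an ADR of zero, which is the over-prediction failure the abstract says conventional metrics hide.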

2605.28600 2026-05-28 cs.LG

Transformers Provably Learn to Internalize Chain-of-Thought

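Yixiao Huang, Hanlin Zhu, Zixuan Wang, Jiantao Jiao, Stuart Russell, Somayeh Sojoudi, Song Mei

AI summary: Proposes the Log-ICoT training curriculum, under which an L-layer transformer learns k-parity from polynomially many samples, matching the sample efficiency of explicit CoT while eliminating its inference overhead.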

English abstract

Chain-of-Thought (CoT) prompting substantially improves the sample efficiency of transformers, reducing the complexity of tasks like parity learning from exponential to polynomial in the input length. However, generating explicit reasoning steps at inference is computationally expensive. Implicit Chain-of-Thought (ICoT) has emerged as a promising empirical remedy that trains models to internalize intermediate steps within their hidden states, but its theoretical foundations remain poorly understood. We give the first theoretical analysis of ICoT, proving that an $L$-layer transformer trained under our proposed Log-ICoT curriculum learns $k$-parity with $\mathsf{poly}(n)$ samples and $L = \log_2 k$ training stages. This matches the sample efficiency of explicit CoT while eliminating its inference overhead, and extends prior one-layer parity guarantees to multi-layer architectures. Compared to standard ICoT, which removes thinking tokens one at a time, Log-ICoT removes them in geometric chunks, reducing the number of stages from linear in $k$ to logarithmic. Experiments on multi-layer transformers confirm the theory and visualize how reasoning is progressively absorbed into deeper layers.
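The gap between a linear and a logarithmic number of stages can be illustrated with a toy schedule. Everything concrete here is an assumption rather than the paper's construction: the sketch supposes $k$ is a power of two, that there are $k - 1$ thinking tokens, and that the geometric chunks are $k/2, k/4, \dots, 1$.

```python
from typing import List


def one_at_a_time_schedule(num_tokens: int) -> List[int]:
    """Remove one thinking token per stage, as the abstract describes for standard ICoT."""
    return [1] * num_tokens


def geometric_schedule(k: int) -> List[int]:
    """Assumed Log-ICoT-style schedule: chunks of k/2, k/4, ..., 1 tokens.

    The chunks sum to k - 1 tokens and take log2(k) stages.
    """
    if k < 2 or k & (k - 1):
        raise ValueError("this sketch assumes k is a power of two, k >= 2")
    chunks: List[int] = []
    size = k // 2
    while size >= 1:
        chunks.append(size)
        size //= 2
    return chunks


k = 8
linear_stages = one_at_a_time_schedule(k - 1)
log_stages = geometric_schedule(k)
```

Under these assumptions both schedules remove the same $k - 1$ tokens, but the geometric one does so in $\log_2 k$ stages, which matches the stage count stated in the abstract.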

2605.28598 2026-05-28 cs.CL cs.AI

Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

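Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente

AI summary: Compares real and LLM-generated comments on Spanish news along three dimensions (hate speech, sentiment, and semantic alignment) to assess LLM realism; off-the-shelf models perform poorly, and fine-tuning improves them only partially.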

English abstract

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

2605.28592 2026-05-28 cs.LG

PLS in the Mirror of Self-Attention

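Jiangsheng, You

AI summary: Treats partial least squares (PLS) as a linearized self-attention mechanism, so that PLS can be studied within a neural-network framework; conversely, the dimensionality reduction and predictor selection in PLS suggest that self-attention includes a degree of dimension normalization that improves learning.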

2605.28589 2026-05-28 cs.LG

Thinned Mean Field Langevin Dynamics

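Zonghao Chen, Heishiro Kanagawa, François-Xavier Briol, Chris J. Oates, Lester Mackey

AI summary: Proposes the KT-MFLD algorithm, which uses kernel thinning to cut the cost of particle interactions from O(N^2) to O(N^{3/2}) while keeping the same convergence guarantees as MFLD.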

Journal ref: International Conference on Machine Learning, 2026

English abstract

Several important learning tasks can be formulated as minimizing an entropy-regularized objective over an appropriate space of probability distributions. Mean-field Langevin dynamics (MFLD) facilitate computation in this general context, casting the minimizer as the invariant distribution of a McKean--Vlasov process, which can be numerically discretized using $N$ particles and thus simulated. However, simulating this interacting particle system has computational complexity of order $N^2$. Motivated by recent research into \emph{kernel thinning}, we propose \texttt{KT-MFLD}, in which each particle interacts only with a thinned particle coreset of size $\mathcal{O}(N^{\frac{1}{2}})$. \texttt{KT-MFLD} thus reduces the computational complexity to order $N^{\frac{3}{2}}$ while, under mild regularity conditions, achieving the same convergence guarantees (up to logarithmic factors) as MFLD. Our theoretical analysis is empirically confirmed on tasks including the training of student-teacher neural networks, quantization with maximum mean discrepancy, and computation of predictively-oriented posteriors in a post-Bayesian framework.
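The cost argument can be made concrete with a toy interaction step. This is a sketch under stated assumptions, not the authors' algorithm: the coreset below is a uniform random subsample standing in for kernel thinning, the pairwise interaction is a simple attraction, and the noise and potential terms of the Langevin update are omitted. All function names are hypothetical.

```python
import math
import random
from typing import List


def interaction_drift(x: float, others: List[float]) -> float:
    """Toy pairwise interaction: average pull of `x` toward each point in `others`."""
    return sum(y - x for y in others) / len(others)


def full_drifts(particles: List[float]) -> List[float]:
    """Every particle interacts with every particle: N * N pair evaluations."""
    return [interaction_drift(x, particles) for x in particles]


def thinned_drifts(particles: List[float], rng: random.Random) -> List[float]:
    """Every particle interacts with a coreset of about sqrt(N) points.

    ASSUMPTION: a uniform subsample stands in for kernel thinning, which
    selects the coreset far more carefully.
    """
    m = max(1, math.isqrt(len(particles)))
    coreset = rng.sample(particles, m)
    return [interaction_drift(x, coreset) for x in particles]


def pair_evaluations(n: int, thinned: bool) -> int:
    """Count pairwise evaluations per step for N particles."""
    return n * (max(1, math.isqrt(n)) if thinned else n)


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(400)]
drifts = thinned_drifts(particles, rng)
```

With $N = 400$ the full step needs $400 \times 400 = 160{,}000$ pair evaluations and the thinned step $400 \times 20 = 8{,}000$, which is the $N^2$ versus $N^{3/2}$ scaling stated in the abstract.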

2605.28587 2026-05-28 cs.CV

Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation

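Yang Gao, Wuyang Li, Po-Chien Luan, Alexandre Alahi

AI summary: Proposes the DeGO framework, which uses decoupled Gaussian deformation and factorized 4D foundation-model distillation to separate rigid and nonrigid motion in dynamic scenes under weak supervision, substantially improving occupancy prediction for human instances.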

Comments CVPR 2026

English abstract

Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing weakly supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under weak supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding. The code is publicly available: https://github.com/vita-epfl/DeGO
