arXivDaily: daily arXiv digest, updated Monday through Friday
2605.23365 2026-05-25 cs.LG cs.AI

Score-Based One-step MeanFlow Policy Optimization

Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee

Affiliations: Korea University, Decision Making Lab; Gauss Labs Inc.

AI summary: This paper proposes Score-Based One-step MeanFlow Policy Optimization (SOM), which targets the high computational cost of diffusion and flow-matching policies in online reinforcement learning. The method constructs the target velocity field directly from the Q-function and a probability flow ODE, without samples from the target distribution, which substantially reduces training and inference time while preserving policy performance. Experiments show that SOM achieves leading online RL results on locomotion tasks.

Abstract

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.
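The single-evaluation sampling that the abstract attributes to MeanFlow can be sketched on a toy problem. This is a minimal sketch, not SOM itself: it assumes the convention that t = 1 is noise and t = 0 is data, uses a scalar action, and replaces the trained network with a hypothetical closed-form `mean_velocity` whose `target` value of 0.5 is made up. It does not show how SOM builds the target velocity field from the Q-function.

```python
def mean_velocity(z, r, t, target=0.5):
    # Hypothetical closed-form stand-in for a trained MeanFlow network.
    # For straight paths z_t = (1 - t) * target + t * noise, the average
    # velocity over [r, t] is noise - target, which equals (z - target) / t.
    return (z - target) / t


def one_step_action(noise):
    # Single evaluation: jump from t = 1 (noise) to r = 0 (action).
    return noise - (1.0 - 0.0) * mean_velocity(noise, 0.0, 1.0)


def multi_step_action(noise, steps=10):
    # Baseline for contrast: Euler integration of the instantaneous velocity,
    # which costs one evaluation per step.
    z, dt = noise, 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        z = z - dt * mean_velocity(z, t, t)  # r = t gives the instantaneous velocity
    return z


action_fast = one_step_action(2.0)    # one evaluation
action_slow = multi_step_action(2.0)  # ten evaluations
```

On this toy field both samplers land on the same action; the difference is the number of evaluations, which is the inference cost the abstract refers to.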

2605.23362 2026-05-25 cs.LG cs.IT math.IT math.ST stat.ML stat.TH

Instance-Optimal Estimation with Multiple LLM Judges on a Budget

Junghyun Lee, Sanghwa Kim, Yassir Jedra, Alexandre Proutière, Se-Young Yun

Affiliations: KAIST AI; ICL EEE; KTH EECS

AI summary: This paper studies how to allocate evaluation queries across multiple LLM judges with different costs and reliabilities under a limited budget so as to obtain the most accurate score estimates. The authors pose the budgeted heteroskedastic multi-judge estimation problem and design an adaptive algorithm, EST-IVWE, which stabilizes the allocation through optimistically biased variance estimates and provably performs close to the optimal allocation. They also establish a matching local minimax lower bound, proving the instance-optimality of the method, and show experimentally that it outperforms uniform allocation.

Comments: 53 pages, 4 figures; the first two authors contributed equally

Abstract

Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
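The inverse-variance weighted estimator named in the abstract can be illustrated for a single prompt-response pair. This is a minimal sketch, assuming the oracle setting in which each judge's per-query noise variance is known; the function name `ivwe` and the toy numbers are hypothetical and not taken from the paper, and the adaptive EST-IVWE allocation is not shown.

```python
def ivwe(means, variances, counts):
    """Combine per-judge sample means by inverse-variance weighting.

    means[j]     : average score returned by judge j (toy inputs)
    variances[j] : per-query noise variance of judge j, assumed known
    counts[j]    : number of queries sent to judge j
    """
    # Averaging n_j queries of variance s2_j gives variance s2_j / n_j,
    # so the precision (inverse variance) of judge j's mean is n_j / s2_j.
    used = [(m, n / s2) for m, s2, n in zip(means, variances, counts) if n > 0]
    total_precision = sum(p for _, p in used)
    estimate = sum(m * p for m, p in used) / total_precision
    return estimate, 1.0 / total_precision  # estimate and its variance


# Two judges, four queries each; the first judge is four times less noisy.
estimate, variance = ivwe(means=[0.70, 0.50], variances=[0.04, 0.16], counts=[4, 4])
```

The less noisy judge dominates the combined estimate, which is why the allocation of a fixed budget across judges with different costs and variances matters.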

2605.23355 2026-05-25 cs.CV cs.LG cs.MM

Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization

Tianyu Wang, Junjie Wu, Jingquan Gao, Shishuo Li

Affiliations: School of Economics and Management, Beihang University; Key Laboratory of Data Intelligence and Management, Beihang University, Ministry of Industry and Information Technology

AI summary: This paper studies fine-grained temporal action localization in professional badminton videos. To handle their complex spatio-temporal dynamics, it proposes a Decoupling Spatio-Temporal Adapter (DSTA) that decomposes motion representation into three parallel branches capturing temporal dynamics and vertical and horizontal spatial variations, so that subtle differences among fine-grained actions are modeled more effectively. The authors also build the Fine-Badminton dataset, with 31 matches and 29 fine-grained stroke categories, and validate the method on it and on the ShuttleSet benchmark, reaching state-of-the-art performance with a limited increase in computation and parameters.

Comments: 11 pages, 11 figures

Abstract

Temporal Action Localization (TAL) has been extensively studied in generic video understanding, while fine-grained sports scenarios, such as professional badminton, remain underexplored due to their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which consists of 31 matches with 29 fine-grained stroke categories, covering 2104 rallies and 27597 annotated actions. To effectively capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which enables efficient modeling of spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches, capturing temporal dynamics as well as vertical and horizontal spatial variations. The design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark demonstrate that the proposed method achieves state-of-the-art performance while introducing only a marginal increase in computational and parameter cost. These results validate the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.

2605.23351 2026-05-25 cs.LG cs.GT

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

Ting Hu, Luanda Cai, Emmanouil-Vasileios Vlatakis-Gkaragkounis

Affiliations: Department of Economics, University of Wisconsin–Madison; Department of Finance, University of Wisconsin–Madison; Department of Computer Sciences, University of Wisconsin–Madison

AI summary: This paper studies how to achieve minimax-optimal worst-case regret under a safe baseline in adversarial multi-armed bandits, with and without delayed feedback. Because delays can break safety guarantees, the authors propose Prudent-Banker, which combines a delay-adapted Online Mirror Descent method with a modified phased-aggression mechanism and attains the optimal safety-robustness trade-off with nearly constant regret against the safe policy. The theoretical analysis shows that the regret upper bound cannot be improved, and experiments confirm the algorithm's effectiveness under a variety of delay distributions.

Abstract

We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety-robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.

2605.23346 2026-05-25 cs.LG

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

Jaihoon Kim, Taehoon Yoon, Prin Phunyaphibarn, Seungjun Kim, Morteza Mardani, Minhyuk Sung

Affiliations: KAIST; University of Michigan; NVIDIA

AI summary: Discrete diffusion models are effective at generating structured categorical data, but sampling efficiently from reward-tilted distributions remains a core challenge. This paper proposes Contrastive Distribution Matching (CDM), a framework that amortizes the computational cost of Sequential Monte Carlo (SMC) inference by learning a parameterized twist function, which markedly improves inference efficiency. Experiments show that CDM outperforms existing methods across several application settings with very small additional computational overhead.

Comments: Project page: https://cdm-smc.github.io/

Abstract

Discrete diffusion models have emerged as powerful frameworks for generating structured categorical data. However, efficiently sampling from reward-tilted distributions remains a fundamental challenge. While Twisted Sequential Monte Carlo (SMC) offers asymptotic exactness for this task, estimating the optimal twist function in discrete state spaces necessitates costly Monte Carlo approximations, resulting in a severe computational bottleneck at inference. To overcome this limitation, we introduce Contrastive Distribution Matching (CDM), a novel framework that amortizes the cost of SMC inference by learning a parameterized twist function via positive and negative samples. For efficient training, we reformulate the gradient estimator to leverage the closed-form forward kernels of discrete diffusion models. In practice, evaluating our learned twist function incurs less than 5% additional computational overhead compared to a single forward pass of the base model. Through extensive empirical evaluations, we demonstrate that CDM consistently outperforms existing baselines under matched wall-clock time. We validate the effectiveness and versatility of our approach across a diverse range of applications, including toxic text generation, regulatory DNA sequence design, protein designability, and diffusion large language model alignment.

2605.23344 2026-05-25 cs.CV cs.AI

CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

Xiaoyi Huang, Kejia Zhang, Zhiming Luo

Affiliations: Institute of Artificial Intelligence, Xiamen University; Department of Artificial Intelligence, Xiamen University

AI summary: This paper studies object hallucination in Large Vision-Language Models (LVLMs) when language priors dominate, and proposes CHASD, a training-free contrastive decoding method. It builds a negative branch through attention-guided localized visual perturbations and applies contrastive calibration only to low-confidence tokens during generation, suppressing hallucination while preserving inference efficiency. Experiments show that CHASD significantly improves the relevant metrics on several benchmarks and outperforms existing training-free baselines.

Abstract

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum next-token probability falls below a threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
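The confidence gate described in the abstract can be sketched for one decoding step. This is a minimal sketch under stated assumptions: the combination rule `(1 + alpha) * logits - alpha * negative_logits` is the form commonly used in visual contrastive decoding and is assumed here, not confirmed as CHASD's exact rule; the names `gated_contrastive_step`, `tau`, `alpha`, and `negative_branch` are hypothetical, and the attention-guided perturbation that would produce the negative logits is replaced by a stub.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def gated_contrastive_step(logits, negative_branch, tau=0.9, alpha=1.0):
    """Return next-token probabilities and whether the negative branch ran."""
    probs = softmax(logits)
    if max(probs) >= tau:
        # High-confidence step: keep the original distribution, skip the extra pass.
        return probs, False
    negative_logits = negative_branch()  # only evaluated on low-confidence steps
    contrasted = [(1 + alpha) * l - alpha * n for l, n in zip(logits, negative_logits)]
    return softmax(contrasted), True


calls = []


def stub_negative_branch():
    # Stand-in for a forward pass on the perturbed visual input (toy logits).
    calls.append(1)
    return [1.0, 1.0, 0.0]


confident_probs, ran_confident = gated_contrastive_step([10.0, 0.0, 0.0], stub_negative_branch)
uncertain_probs, ran_uncertain = gated_contrastive_step([1.0, 0.9, 0.0], stub_negative_branch)
```

Passing the negative branch as a callable keeps the extra forward pass lazy, so it is paid only on steps where the gate opens.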

2605.23341 2026-05-25 cs.RO cs.AI

Sparse Compositional Flow Matching by geometric assembly from motion primitives

Yan Tang, Yuanbo Tang, Tingyu Cao, Shaolun Huang, Yang Li

Affiliations: Tsinghua Shenzhen Graduate School, Tsinghua University, Shenzhen, China; School of AI, Chinese University of Hong Kong (Shenzhen)

AI summary: This paper studies how to generate executable motion trajectories for embodied agents such as robots and underwater vehicles, and proposes a sparse compositional flow-matching method built on motion primitives. It composes reusable motion primitives directly in the physical trajectory space and introduces geometric constraints within a structured sparse flow-matching framework, modeling both the compositional structure and the spatio-temporal continuity of trajectories. Experiments show state-of-the-art performance on multiple datasets and a marked improvement in trajectory prediction accuracy.

Abstract

Embodied trajectories, such as the executable motion sequences of robotic manipulators, underwater vehicles, and mobile robots, are a fundamental output of embodied AI. Modern generative models often treat them as a dense, monolithic signal generated point by point, fitting an intricate high-dimensional posterior while leaving the data's latent structure unmodeled, the same sample inefficiency long identified by the structured generative model literature. We argue that a compositional latent structure is a natural choice: many embodied tasks share recurring motion fragments that can be made explicit as a finite repertoire of reusable motion primitives, and compositional units naturally align with subtask boundaries to support task decomposition. Existing compositional generators, however, compose in a latent space and rely on post-hoc decoding to relate sampled units to actual trajectory segments. We instead compose directly in the physical trajectory space through a flow-matching framework with two coupled designs. Motion-Primitive Dictionary Learning equips each atom with a learnable length mask and binary starting indicators so the atom itself is the primitive, reused verbatim wherever it is placed. Structural Sparse Flow Matching with Geometric Constraints then generates a binary placement matrix using duration-aware tokenization and a differentiable geometric loss that enforces spatial continuity and temporal contiguity where adjacent primitives meet. On Open X-Embodiment and 3DMoTraj, the framework attains state-of-the-art accuracy and reduces the FDE/ADE ratio from 1.8 to 1.07, improving ADE by 19.2% and FDE by 21.0% over the strongest baseline.
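The ADE and FDE figures quoted above are displacement errors between a predicted and a ground-truth trajectory. A minimal sketch using the usual definitions (mean per-step Euclidean error and final-step error), assuming a single predicted trajectory; the paper's exact evaluation protocol may differ, and the toy trajectories are made up.

```python
import math


def ade_fde(predicted, truth):
    """Average and final displacement error between two equal-length trajectories."""
    distances = [math.dist(p, q) for p, q in zip(predicted, truth)]
    return sum(distances) / len(distances), distances[-1]


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # error grows along the path

ade, fde = ade_fde(predicted, truth)
ratio = fde / ade  # values near 1 mean the error does not pile up at the end
```

Read this way, a lower FDE/ADE ratio indicates that the error at the final step is closer to the average error along the trajectory.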

2605.23332 2026-05-25 cs.CL

Cultural Adaptation in Large Language Models for Political Discourse

Wajdi Zaghouani

Affiliations: Northwestern University in Qatar

AI summary: This paper examines the cultural adaptation needed when deploying large language models for political discourse analysis. It notes that current models, which rely on English-dominant data and limited multilingual coverage, adapt poorly to different cultural and institutional contexts and therefore produce systematic errors. The study formalizes cultural adaptation at the translation, discourse, and ontology levels, proposes an evaluation matrix based on cultural fidelity, calibration, and democratic safety, and combines participatory dataset construction, culturally aware transfer learning, and related methods to make cultural adaptation measurable, with the aim of strengthening the democratic legitimacy of political NLP.

Abstract

The integration of large language models into political discourse analysis creates new opportunities for comparative research, policy analysis, and civic technology, while introducing material risks for democratic accountability. This paper argues that cultural adaptation is a prerequisite for trustworthy deployment of large language models in political communication across diverse linguistic and institutional contexts. Current systems remain shaped by English-dominant data, uneven multilingual coverage, and assumptions grounded in a narrow range of political institutions and discourse conventions, producing systematic errors when applied across cultures. We formalize cultural adaptation across translation, discourse, and ontology levels, identify recurring cultural failure modes in political NLP, and propose an operational evaluation matrix grounded in cultural fidelity, calibration, and democratic safety. Building on political text analysis, sociotechnical auditing, and cross-cultural pragmatics, we outline methodological pathways including participatory dataset development, culturally aware transfer learning, and benchmark design that makes cultural adaptation empirically measurable. We conclude by clarifying governance constraints and scope conditions under which culturally adaptive political NLP can support democratic legitimacy.

2605.23328 2026-05-25 cs.CL

Emotion Recognition in Sign Language Conversation

Yusong Wang, Keyu Mao, Takao Obi, Minghao Shao, Kotaro Funakoshi

Affiliations: Institute of Science Tokyo; New York University

AI summary: This paper studies emotion recognition in sign language conversation. Because existing sign language emotion datasets focus on isolated sentences and lack conversational context, it introduces the task of Emotion Recognition in Conversation (ERC) for sign language and builds the eJSL Dialog dataset of 1,920 video samples. A systematic evaluation of several models on this dataset reveals a performance gap when generic multimodal conversational emotion recognition models are applied to sign language, underscoring the need for context-aware visual extractors specific to sign language and for larger conversational datasets that support large-scale pre-training.

Abstract

Emotion Recognition in Conversation (ERC) is a core component of affective computing, but current sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances demonstrate degraded performance in real-world scenarios because they cannot utilize historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context-aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large-scale pre-training is a necessary next step for future research.

2605.23326 2026-05-25 cs.CL

ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

Wajdi Zaghouani, Md. Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, George Mikros

Affiliations: Northwestern University in Qatar; Hamad Bin Khalifa University

AI summary: This paper introduces ClimateChat-300K, a multimodal dataset of 299,329 public Facebook posts about climate change from May 2020 to May 2024. The dataset has 41 metadata features, such as content, engagement metrics, and page attributes, and supports comprehensive analysis of public discourse in climate communication. Using topic modeling and sentiment analysis, the study identifies ten main themes across five domains, shows how emotional tone, content format, and page identity affect audience engagement, and traces how online discussion evolved around major events such as international climate summits and the COVID-19 pandemic. The dataset is an open resource for interdisciplinary research on polarization, misinformation, and the dynamics of digital discourse on climate issues.

Abstract

We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper understanding of how public engagement with climate issues develops across time, geography, and institutional contexts.

2605.23325 2026-05-25 cs.CL

AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

Esra'a Sharqawi, Wajdi Zaghouani

Affiliations: Hamad Bin Khalifa University; Northwestern University in Qatar

AI summary: This paper introduces AraHopeCorpus, the first annotated dataset for studying hope speech in Arabic social media crisis discourse, comprising ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. A detailed annotation framework sorts comments into three categories: hope speech, no hope speech, and neutral or unclear content. More than sixty percent of the comments express hope, mainly as religious encouragement, collective solidarity, and optimism for justice and endurance. The dataset is a resource for studying constructive discourse, crisis communication, and resilience in Arabic social media.

Abstract

Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied, expressions that promote resilience, solidarity, and optimism remain underexplored, particularly in Arabic contexts. This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech collected from ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. Using a detailed annotation framework, comments were classified into three categories: hope speech, no hope speech, and neutral or unclear discourse. The dataset shows that hopeful language dominates, accounting for more than sixty-four percent of all comments. These expressions of hope appear mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. No hope speech, representing about thirteen percent, reflects despair and disillusionment, while the rest of the comments contain neutral or mixed content. Inter-annotator agreement reached substantial levels (Cohen's kappa = 0.71), though dialectal variation, sarcasm, and implicit meaning posed annotation challenges. A comparative analysis between human annotators and ChatGPT revealed that large language models can support annotation but remain limited in handling dialectal and culturally embedded expressions. AraHopeCorpus will be released for research purposes under an open and non-commercial license. It provides a valuable resource for studying constructive digital discourse, enabling further research on hope speech detection, crisis communication, and resilience in Arabic social media.
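The agreement statistic reported above, Cohen's kappa, compares observed agreement between two annotators with the agreement expected by chance. A minimal sketch of the standard two-annotator formula; the label names and the four toy annotations are made up for illustration and are not drawn from AraHopeCorpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined when expected == 1 (both annotators always use one identical class).
    return (observed - expected) / (1 - expected)


annotator_a = ["hope", "hope", "no_hope", "neutral"]
annotator_b = ["hope", "no_hope", "no_hope", "neutral"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Because chance agreement is subtracted, kappa is lower than raw percent agreement whenever one label dominates, which is relevant for a corpus where one class makes up most of the comments.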

2605.23324 2026-05-25 cs.CV quant-ph

Enhancing Blood Cells Classification using Hybrid Quantum Neural Networks

Guilherme Cruz, Nouhaila Innan, Alberto Marchisio, Gabriel Falcao, Muhammad Shafique

Affiliations: Center for Quantum and Topological Systems, NYUAD Research Institute, New York University Abu Dhabi, UAE; Science Division, New York University Abu Dhabi, UAE

AI summary: This paper studies how hybrid quantum-classical neural networks (HQNNs) can improve the accuracy of microscopic blood cell classification. The authors propose a modular architecture that combines a pre-trained ResNet-50 backbone, a low-dimensional latent bottleneck, and a variational quantum circuit, in order to compare quantum-enhanced and purely classical transformation mechanisms. On two public blood cell datasets, HQNNs show better or more balanced classification performance; in the harder 8-class task the F1-score rises by 0.15 percentage points, and runs on IBM quantum hardware confirm the model's robustness to noise.

Comments: 11 pages, 13 figures

Abstract

Accurate classification of microscopic blood cells is still a critical task in medical image analysis, where subtle variations and limited data can challenge conventional deep learning models. As such, we investigate in this work the potential of Hybrid Quantum-Classical Neural Networks (HQNNs) to enhance feature representation and improve classification performance in this domain. We propose a modular architecture combining a pre-trained ResNet-50 backbone with a low-dimensional latent bottleneck and a variational quantum circuit, enabling a direct comparison between quantum-enhanced and purely classical transformation mechanisms. To isolate the contribution of the quantum component, we evaluate three architectures: a HQNN model, a Classical Matched Model with an additional nonlinear transformation layer of comparable capacity, and a baseline model without an intermediate transformation stage. Experiments conducted on two publicly available blood cell datasets, namely the Blood Cell Images dataset and the PBC dataset, demonstrate that HQNNs consistently achieve superior or more balanced performance across evaluation metrics. In the Blood Cell Images Dataset, the proposed approach improves macro F1-score by up to 3.7% compared to classical baselines, while improving the F1-score from 98.54% to 98.69% in the more challenging 8-class scenario with near-saturated performance. Additional evaluation on IBM quantum hardware shows that the model remains robust under noise, with only a modest performance degradation relative to simulated results. These results indicate that quantum feature transformations can enhance discriminative representations, particularly in challenging classification scenarios, and highlight the practical potential of HQNN models for medical imaging tasks.
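The macro F1-score cited above averages the per-class F1 with equal weight per class, so rare cell types count as much as common ones. A minimal sketch of the standard definition; the class names and toy labels are made up, and the paper's evaluation code is not shown in the source.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["lymphocyte", "lymphocyte", "lymphocyte", "monocyte"]
y_pred = ["lymphocyte", "lymphocyte", "monocyte", "monocyte"]
score = macro_f1(y_true, y_pred)
```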

2605.23320 2026-05-25 cs.AI

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

Sijia Li, Xiaoyu Tan, Qixing Wang, Weiyi Zhao, Chen Zhan, Teqi Hao, Xuemin Wang, Lei Gu, Roland Eils, Xihe Qiu

Affiliations: Shanghai University of Engineering Science, Shanghai, China; Tencent Youtu Lab, Tencent, China; Department of Critical Care Medicine, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China; Department of Emergency Critical Disease, Songjiang Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China; Max Planck Institute for Heart Lung Research, Bad Nauheim, Germany; Fudan University, Shanghai, China; BIH at Charité – Universitätsmedizin Berlin, Berlin, Germany

AI summary: This study proposes a Ventilator Decision Support System (VDSS) based on a human-in-the-loop multi-agent framework to assist clinicians in adjusting ventilator parameters. The system uses a contextual bandit for online preference learning, adapting its decision policy to clinician feedback, and uses a structured feedback mechanism to improve interaction efficiency and stability. Experiments indicate that the method markedly raises recommendation acceptance and reduces interaction rounds in intensive care settings, supporting clinically deployable human-AI collaboration.

Comments: MICCAI 2026

Abstract

Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician-specific tuning styles. Rule-based approaches rarely generalize personalization, and end-to-end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that coordinates modular decision components through contract-driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician-specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human-AI collaboration.

2605.23311 2026-05-25 cs.AI

DART: Semantic Recoverability for Structured Tool Agents

DART:结构化工具代理的语义可恢复性

Ke Yang, Panpan Li, Zonghan Wu, Kejin Xu, Huaxi Huang, Xiaoshui Huang

发表机构 * MOS Intelligent Connectivity Technology Co. Ltd.(MOS智能连接技术有限公司) Sichuan Vocational College of Post and Telecom(四川邮电职业技术学院) East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学)

AI总结 当结构化工具代理在执行过程中发生故障时,系统面临一个两难问题:重新执行整个任务虽然安全但效率低,而从局部检查点恢复虽然高效,却可能导致下游工作依赖于已不存在的上游历史。为了解决这一问题,DART 提出了一种模块化的运行时机制,能够定位失败实例、验证其语义可恢复的边界、对齐检查点,并选择一个在依赖和效果约束下可接受的恢复点,从而在保证下游工作不受影响的前提下实现安全恢复。实验表明,DART 在多个基于大语言模型的领域中成功恢复了传统局部恢复方法无法处理的语义敏感场景,且安全审计未发现任何不安全的回滚操作。

详情
AI中文摘要

当结构化工具代理在执行过程中失败时,运行时面临两难:重放整个任务安全但浪费资源,而从本地检查点恢复高效但可能使已提交的下游工作与不再存在的上游历史相关联。这种紧张关系在承诺敏感场景中尤为突出,其中回滚目标是一个失败的实例,但下游消费者已经对其输出采取了行动。现有的恢复方法提供机械回滚,但没有标准来判断本地恢复在下游提交后是否保持语义有效。我们将这一差距形式化为语义可恢复性,并在DART中解决它,DART是一个模块化运行时,它定位失败实例,认证该实例的语义可恢复边界,将检查点对齐到这些边界,并选择一个可接受的恢复点,该恢复点在依赖和效果约束下保留已提交的下游工作——否则阻止恢复。在三个LLM驱动的领域以及基于LangGraph的基板上的外部验证中,DART正确恢复了所有评估的承诺敏感案例,而基线局部恢复失败,并且一个五领域安全审计未发现不安全的允许回滚。这些结果表明,控制器的合法性并不意味着语义有效性,而合理的局部恢复需要明确的可接受性检查。

英文摘要

When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks recovery otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.

2605.23304 2026-05-25 cs.CV

General Hazard Detection

通用危险检测

Stephanie Ng, CP Lim, SueJen Looi, Hendrik Zurlinden, David Nguyen, Lei Wei, Saeid Nahavandi, Hailing Zhou

发表机构 * Swinburne University of Technology(斯威本科技大学) National Transport Research Organisation(国家交通运输研究组织) Google Cloud(谷歌云) Deakin University(迪肯大学)

AI总结 本文研究了如何检测抽象概念的“危害”,并提出了一种基于语言规则而非具体图像示例的通用危害检测方法。为了解决现有系统在数据稀疏性、定义动态变化和泛化能力方面的不足,作者构建了CompliVision数据集,并设计了一个结合视觉与语言模型的框架,通过权威规范定义多领域危害概念,实现对安全合规性的有效评估。该方法引入主动学习机制,提升模型在复杂场景下的鲁棒性和适应性。

Comments 20 pages, 7 figures and 4 tables

详情
AI中文摘要

危险作为一个抽象概念,通常通过认知层面的逻辑推理而非具体示例来定义。相比之下,现有的危险检测系统依赖于预定义的危险类别,并需要在检测或分类架构中密集收集标注示例。这种方法在处理抽象安全概念时面临三个基本挑战:(1) 噪声大且稀疏的训练数据,(2) 随上下文和时间动态演变的定义,以及(3) 对未见或新颖场景的泛化能力有限。为了解决这些局限性,我们提出了CompliVision数据集,这是第一个专为基于规则的合规评估设计的通用危险数据集,同时提供了一个用于危险评估的基线框架。我们的关键创新在于通过基于语言的规则表达安全要求,从而将危险概念与基于图像的示例解耦。我们将方法建立在权威领域法规和ISO标准之上,以定义跨多个领域的多样化危险概念。CompliVision数据集包含跨越交通、建筑和仓库环境的3,006张图像,每张图像都根据特定安全规则进行了合规性标注,并附有突出显示支持性视觉证据的自然语言解释。为了实现稳健的泛化,我们开发了一个主动学习框架,以更有效地指导和优化视觉语言模型在危险合规评估中的表现。尽管最先进的VLM表现出强大的能力,但在准确安全评估所需的细粒度、上下文相关解释方面仍存在困难。我们提出了一个通用危险检测框架来解决这一局限性,该框架结合了基于LLaVA的视觉推理与人在回路反馈。

英文摘要

Hazard, as an abstract concept, is typically defined through cognitive-level logical reasoning rather than concrete examples. In contrast, existing hazard detection systems rely on predefined hazard categories and require intensive collection of labelled examples within detection or classification architectures. This approach faces three fundamental challenges when addressing abstract safety concepts: (1) noisy and sparse training data, (2) dynamically evolving definitions that change across contexts and time, and (3) limited generalisation to unseen or novel scenarios. To address these limitations, we present the CompliVision dataset, the first general-purpose hazard dataset designed for rule-based compliance assessment, along with a baseline framework for hazard evaluation. Our key innovation is decoupling the hazard concept from image-based examples by expressing safety requirements through language-based rules. We ground our approach in authoritative domain regulations and ISO standards to define diverse hazard concepts across multiple domains. The CompliVision dataset comprises 3,006 images spanning traffic, construction, and warehouse environments, with each image annotated for compliance against specific safety rules, accompanied by natural language explanations highlighting the supporting visual evidence. To achieve robust generalisation, we develop an active learning framework to more effectively guide and refine vision-language models in assessing hazard compliance. While state-of-the-art VLMs demonstrate strong capabilities, they struggle with the fine-grained, context-dependent interpretation required for accurate safety assessment. We propose a general hazard detection framework to address this limitation, combining LLaVA-based visual reasoning with human-in-the-loop feedback.

2605.23296 2026-05-25 cs.AI

Parallel Context Compaction for Long-Horizon LLM Agent Serving

长视界LLM智能体服务的并行上下文压缩

Musa Cim, Burak Topcu, Chita Das, Mahmut Taylan Kandemir

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 长期对话场景下的大语言模型代理在运行过程中会积累越来越多的对话历史,最终超出模型的上下文窗口限制。为解决这一问题,本文提出了一种并行上下文压缩方法,通过并行处理对话历史,实现了对摘要长度的精细控制和更高效的推理性能。实验表明,该方法在多个基准测试中优于传统的串行压缩方式,显著提升了处理效率和稳定性。

详情
AI中文摘要

长视界LLM智能体积累的对话历史会逐渐增长,最终超出模型的上下文窗口。基于LLM摘要的上下文压缩可以保持对话有界,但摘要本质上有损,且阻塞调用会暂停智能体推理数十秒。此外,由于提示指令基本被忽略,操作者无法细粒度控制摘要体积,随着上下文增长,模型生成的输出令牌数量及其保留的信息在不同运行间波动显著,使得智能体保留的知识在不同运行间不可预测。我们引入了面向长视界智能体流的并行压缩,并在HotpotQA多跳问答和LoCoMo长上下文对话基准上,针对8B到120B参数的四个骨干模型(混合密集和MoE架构,包括推理和非推理模型)与顺序同步基线进行对比。并行压缩使操作者能够细粒度、可预测地控制摘要体积,并支持每块更针对性的提示工程。在匹配的压缩解码体积下,它相比顺序基线减少了端到端耗时并提升了压缩吞吐量。

英文摘要

Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the number of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.
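
The abstract gives no implementation details, so the following is only a rough sketch of what block-wise parallel compaction could look like: the history is split into blocks, each block is summarized concurrently under its own token budget, and the summaries are returned in order. The function names are hypothetical, and `summarize_block` is a stand-in that truncates text where a real system would call an LLM.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(messages, n_blocks):
    """Split the history into at most n_blocks contiguous chunks."""
    size = -(-len(messages) // n_blocks)  # ceiling division
    return [messages[i:i + size] for i in range(0, len(messages), size)]


def summarize_block(block, max_tokens):
    """Placeholder for an LLM summarization call (assumption).

    It keeps the first max_tokens whitespace-separated tokens so the
    per-block output budget is enforced deterministically.
    """
    tokens = " ".join(block).split()
    return " ".join(tokens[:max_tokens])


def parallel_compact(messages, n_blocks=4, tokens_per_block=8):
    """Summarize blocks concurrently; total output <= blocks * budget."""
    if not messages:
        return []
    blocks = split_into_blocks(messages, n_blocks)
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        # pool.map preserves block order, so summaries stay chronological.
        return list(pool.map(
            lambda b: summarize_block(b, tokens_per_block), blocks))
```

Because each block has its own cap, the total summary volume in this sketch is bounded by the number of blocks times the per-block budget, which is one way to read the "predictable control over summary volume" claim.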

2605.23288 2026-05-25 cs.CV

Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition

时空相似性体积聚合用于开放词汇动作识别

Yerim So, Jiyeong Kim, Jiwon Yoon, Dongbo Min

发表机构 * Ewha Womans University(梨花女子大学)

AI总结 本文提出了一种名为SimVA的框架,用于解决开放词汇动作识别中的细粒度时空信息丢失问题。该方法通过构建局部视频块与动作类别之间的密集四维时空相似性体积,保留了局部视觉-文本对齐信息,并结合类采样、空间聚合和运动感知调制等技术,提升了模型对时空动态变化的建模能力。实验表明,SimVA能够有效将CLIP模型迁移至视频动作识别任务,在零样本、少样本及基础到新类别的多个基准测试中均取得具有竞争力的性能。

详情
AI中文摘要

最近的开放词汇动作识别(OVAR)方法通常在计算文本对齐之前将视觉特征聚合为全局表示,这一过程掩盖了局部补丁信息和细粒度的时空线索。我们提出了相似性体积聚合(SimVA)框架,该框架从补丁级别的视觉-文本相似性构建密集的4D时空相似性体积。SimVA在局部视频令牌和动作类别上构建时空相似性体积,并采用类别采样确保相似性聚合可扩展到大型词汇表。通过空间聚合对相似性体积进行细化,将局部相似性模式上下文化以提高帧内一致性。运动感知调制进一步注入帧间变化线索,突出动态变化区域。基于Mamba的时序聚合则建模类别条件相似性模式在帧间的演化。通过保持密集的视觉-文本对应关系,SimVA有效地将CLIP迁移到视频动作识别,在零样本、少样本和基类到新类基准测试中均取得了竞争性性能。

英文摘要

Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities. SimVA builds this volume over local video tokens and action classes, and employs class sampling to keep similarity aggregation scalable to large vocabularies. The similarity volume is refined by spatial aggregation, which contextualizes local similarity patterns to improve intra-frame consistency. Motion-aware modulation further injects inter-frame variation cues, highlighting dynamically changing regions. Mamba-based temporal aggregation then models the evolution of class-conditioned similarity patterns across frames. By maintaining dense visual-text correspondence, SimVA effectively transfers CLIP to video action recognition, achieving competitive performance across zero-shot, few-shot, and base-to-novel benchmarks.

2605.23287 2026-05-25 cs.CV

LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

LangFlash:基于稀疏无位姿图像的前馈式3D语言高斯泼溅

Yilong Liu, Wanhua Li, Chen Zhu-Tian, Hanspeter Pfister

发表机构 * Harvard University(哈佛大学) Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学) University of Minnesota - Twin Cities(明尼苏达大学-双城分校)

AI总结 本文提出 LangFlash,一种基于前馈网络的 3D 语言高斯溅射框架,能够从稀疏未配准的多视角图像中直接重建带有语言对齐语义特征的 3D 场景。与基于优化的 3D 方法不同,LangFlash 在一次前向传播中同时预测几何结构和语义信息,实现了低延迟的 3D 重建与语义一致的场景理解。通过引入稀疏语义编码方案和增强的语义监督数据集,LangFlash 在新型视角合成和语义一致性方面优于现有方法,为无姿态依赖、语言驱动的 3D 场景重建提供了新范式。

Comments CVPRF 2026

详情
AI中文摘要

我们提出LangFlash,一种用于3D语言高斯泼溅的前馈框架,它从稀疏无位姿多视图图像重建由高斯原语参数化的3D场景,这些原语富含语言对齐的语义特征。与基于优化的3D方法不同,LangFlash在单次前向传播中直接预测几何和语义,实现低延迟3D重建和语言一致的场景理解。为了支持大规模训练,我们为RealEstate10k数据集丰富了连贯且密集的语义信息,用于3D语义监督。此外,我们提出了一种稀疏语义编码方案,该方案将全局语义字典与局部变化的每个原语权重相结合,在保留高级语言信息的同时降低表示复杂度。实验结果表明,与先前方法相比,LangFlash在新视图合成和语义一致性方面表现更优。本研究为无位姿、语言基础的3D场景重建建立了新范式,推动了可泛化3D视觉和多模态场景理解的发展。演示地址:https://liylo.github.io/langflash.github.io/。

英文摘要

We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts the geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enriched the RealEstate10k dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information, while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. Demo is available at https://liylo.github.io/langflash.github.io/.

2605.23285 2026-05-25 cs.LG cond-mat.stat-mech cs.AI

Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

具有同配性约束的微正则图系综的强化学习

Hoyun Choi, Junghyo Jo, Deok-Sun Lee

发表机构 * School of Computational Sciences, Korea Institute for Advanced Study(韩国高等科学研究院计算科学系) Department of Physics Education, Seoul National University(首尔国立大学物理教育系) Center for Theoretical Physics and Artificial Intelligence Institute, Seoul National University(首尔国立大学理论物理与人工智能研究所) Center for AI and Natural Sciences, Korea Institute for Advanced Study(韩国高等科学研究院人工智能与自然科学中心)

AI总结 本文研究如何通过强化学习生成满足特定同配性(assortativity,即度-度相关性)约束的微正则图系综,以精确控制网络结构特性。提出了一种基于强化学习的深度微正则图生成器(DMGG),通过保度重连操作,使图的同配性精确达到目标值,克服了传统方法在参数调校和生成效率上的不足。该方法能够在不同规模、稀疏度和拓扑结构的图上生成精确的零模型,有助于定量分析网络的次级特性,如聚类系数,为研究网络结构与功能的关系提供了有力工具。

详情
AI中文摘要

网络结构如何决定功能是一个基本问题,可以通过具有精确控制结构属性的图系综来研究。正则方法(表述为指数随机图模型ERGM)仅在期望意义上施加约束,允许个体实现围绕目标波动。相反,微正则系综精确施加硬约束,但除固定度序列外的实用采样方法一直难以实现。本文介绍深度微正则图生成器(DMGG),一种强化学习(RL)框架,通过保度重连变换任意给定图,以精确达到指定的同配性(表征相邻节点的度-度相关性)。DMGG不依赖于ERGM的熵主导的Metropolis-Hastings动力学,而是采用策略引导搜索,最大程度地改变联合度矩阵。这消除了详尽的参数调优,并在保持构型多样性的同时将生成速度提高至少一个数量级。由于DMGG可推广到各种图大小、稀疏性和拓扑结构,它提供了精确的零模型,允许定量分离次级可观测量(如聚类系数)。这些结果确立了RL作为生成硬约束图的实用且强大的范式,为研究不受系综伪影影响的结构-功能关系开辟了途径。

英文摘要

How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approaches, formulated as exponential random graph models (ERGMs), enforce constraints only in expectation, allowing individual realizations to fluctuate around the target. Conversely, microcanonical ensembles impose hard constraints exactly, but practical sampling methods beyond fixing the degree sequence have remained out of reach. Here we introduce the Deep Microcanonical Graph Generator (DMGG), a reinforcement learning (RL) framework that transforms any given graph through degree-preserving rewirings to exactly reach a prescribed assortativity, which characterizes the degree-degree correlation of adjacent nodes. Instead of relying on the entropically dominated Metropolis-Hastings dynamics of the ERGM, DMGG employs a policy-guided search that maximally alters the joint-degree matrix. This eliminates exhaustive parameter tuning and accelerates generation by at least an order of magnitude while preserving configurational diversity. As DMGG generalizes across various graph sizes, sparsities, and topologies, it provides exact null models that allow for the quantitative isolation of secondary observables, such as the clustering coefficient. These results establish RL as a practical and powerful paradigm for generating hard-constrained graphs, opening avenues to investigate structure-function relationships free from ensemble artifacts.
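
To make the two ingredients named above concrete, here is a small sketch of degree assortativity and of degree-preserving rewiring (double-edge swaps). It is not DMGG: a simple greedy acceptance rule is swapped in for the learned RL policy, and the function names and parameters are hypothetical.

```python
import random


def degrees(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def assortativity(edges):
    """Newman's degree assortativity for a simple undirected graph."""
    deg = degrees(edges)
    m = len(edges)
    prod = sum(deg[u] * deg[v] for u, v in edges) / m
    mean = sum(0.5 * (deg[u] + deg[v]) for u, v in edges) / m
    sq = sum(0.5 * (deg[u] ** 2 + deg[v] ** 2) for u, v in edges) / m
    denom = sq - mean ** 2
    return (prod - mean ** 2) / denom if denom > 0 else 0.0


def rewire_toward(edges, target, steps=2000, seed=0):
    """Greedy double-edge swaps that never move away from the target.

    Swapping (a,b),(c,d) -> (a,d),(c,b) keeps every node's degree fixed.
    The greedy rule is a stand-in for a learned policy (assumption).
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    best = abs(assortativity(edges) - target)
    for _ in range(steps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # try the other orientation of the second edge
        new1, new2 = (a, d), (c, b)
        if a == d or c == b:
            continue  # would create a self-loop
        if frozenset(new1) in present or frozenset(new2) in present:
            continue  # would create a multi-edge
        trial = list(edges)
        trial[i], trial[j] = new1, new2
        gap = abs(assortativity(trial) - target)
        if gap < best:
            present -= {frozenset(edges[i]), frozenset(edges[j])}
            present |= {frozenset(new1), frozenset(new2)}
            edges, best = trial, gap
    return edges
```

Since swaps leave the degree sequence untouched, only the sum of degree products over edges changes between steps. A greedy search like this one can stall before reaching the target exactly, which is the kind of limitation a policy-guided search is meant to address.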

2605.23281 2026-05-25 cs.CV

DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection

DepthAgent: 通过样本级专家选择实现更好的通用深度估计

Jie Zhu, Girish Chandar Ganesan, Xiaoming Liu

发表机构 * Michigan State University(密歇根州立大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文提出了一种名为 DepthAgent 的视觉语言智能体,用于自适应单目深度估计。该方法通过分析场景和相机特性,选择或融合多个预训练深度模型的预测结果,从而提升在不同视角、鱼眼和全景图像等多样化相机设置下的深度估计性能。研究发现,不同模型在不同输入域上的表现存在显著差异,通过样本级专家选择与融合可以显著提升难样本的估计精度,实验表明 DepthAgent 在多个基准测试中均优于单一模型及固定融合方法。

详情
AI中文摘要

单目度量深度估计通过大规模训练和通用相机建模取得了显著进展,但在不同相机设置(如透视、鱼眼和全景图像)下的鲁棒部署仍然具有挑战性。现有方法通常依赖单一深度估计器,忽略了不同模型编码不同的相机假设并在不同输入域下表现最佳。本文中,我们展示了深度专家在样本级上具有强互补性:模型偏好与相机几何高度相关,多模型融合在单个专家不可靠的困难样本上带来最大收益。受这些观察启发,我们提出了DepthAgent,一种用于自适应单目深度估计的视觉语言智能体。DepthAgent将现有深度模型视为冻结工具,学习分析场景和相机线索,通过多轮工具调用调用合适的专家,并为每个输入选择或融合它们的预测。为了优化这种离散决策以实现密集几何质量,我们设计了一种多奖励强化学习微调方案,共同鼓励有效的工具执行、相机/场景分析、专家选择质量和推理效率。在透视、鱼眼和全景基准上的大量实验表明,DepthAgent一致优于单个专家、固定模型融合和不同选择策略,在困难样本上取得了显著改进,突显了专家选择和融合的关键作用。代码和模型将在发表后发布。

英文摘要

Monocular metric depth estimation has achieved strong progress with large-scale training and universal-camera modeling, yet robust deployment across diverse camera settings, such as perspective, fisheye, and panoramic images, remains challenging. Existing methods typically rely on a single depth estimator, overlooking that different models encode different camera assumptions and perform best under different input domains. In this paper, we show that depth experts exhibit strong sample-wise complementarity: model preference is highly correlated with camera geometry, and multi-model fusion brings the largest gains on difficult samples where individual experts are unreliable. Motivated by these observations, we propose DepthAgent, a vision-language agent for adaptive monocular depth estimation. DepthAgent treats existing depth models as frozen tools and learns to analyze scene and camera cues, invoke suitable experts through multi-turn tool utilization, and select or fuse their predictions for each input. To optimize such discrete decision-making toward dense geometric quality, we design a multi-reward reinforcement fine-tuning scheme that jointly encourages valid tool execution, camera/scene analysis, expert-selection quality, and inference efficiency. Extensive experiments across perspective, fisheye, and panoramic benchmarks show that DepthAgent consistently outperforms individual experts, fixed model fusion, and different selection strategies, with strong improvements on challenging samples, highlighting the critical role of expert selection and fusion. The code and model will be released upon publication.

2605.23275 2026-05-25 cs.LG

Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models

扩散域扩展:学习协调预训练扩散模型

Egor Lifar, Semyon Savkin, Timur Garipov, Shangyuan Tong, Tommi Jaakkola

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 本文提出了一种名为扩散域扩展(DDE)的方法,旨在高效地扩展预训练扩散模型,使其能够生成更大规模的对象并处理更复杂的条件输入。该方法通过一个紧凑的可训练网络协调预训练扩散模型的去噪输出,实现了对超出其原始训练范围的领域的泛化能力。实验表明,DDE在长音频生成和条件图像生成任务中均表现出色,优于其他协调生成方法。

Comments Accepted as poster at ICML 2024 Workshop on Structured Probabilistic Inference and Generative Modeling (SPIGM)

详情
AI中文摘要

在本文中,我们提出了扩散域扩展(DDE),一种高效扩展预训练扩散模型的方法,使其能够生成更大的对象并处理超出其原始能力的更复杂条件。我们的方法采用一个紧凑的可训练网络,旨在协调预训练扩散模型的去噪输出。我们证明协调器可以普遍简单,同时能够泛化到比训练时观察到的更大的域。我们在长音频轨道生成和条件图像生成上评估了DDE,展示了其跨域的适用性。在定性和定量评估中,DDE在扩散模型的协调生成方面优于其他方法。

英文摘要

In this paper, we propose Diffusion Domain Expansion (DDE), a method that efficiently extends pre-trained diffusion models to generate larger objects and handle more complex conditioning beyond their original capabilities. Our method employs a compact trainable network designed to coordinate the denoised outputs of pre-trained diffusion models. We demonstrate that the coordinator can be universally simple while being capable of generalizing to domains larger than those observed during its training time. We evaluate DDE on long audio track generation and conditional image generation, demonstrating its applicability across domains. DDE outperforms other approaches to coordinated generation with diffusion models in qualitative and quantitative evaluations.

2605.23274 2026-05-25 cs.CV

U-CESE: Unified Clip-based Event Search Engine for AI Challenge HCMC 2025

U-CESE:面向AI挑战赛胡志明市2025的统一基于片段的事件搜索引擎

Duc-Nhuan Le, Hoang-Phuc Nguyen, Thanh-Duy Lam, Minh-Nhut Dang, Minh-Hoang Le

发表机构 * Faculty of Information Technology, University of Science, VNU-HCM(胡志明市国家大学自然科学大学信息技术学院) Vietnam National University, Ho Chi Minh City, Vietnam(越南胡志明市国家大学)

AI总结 本文提出U-CESE,一种统一的基于片段的事件搜索引擎,用于AI Challenge HCMC 2025中的多模态事件检索任务。U-CESE整合了原有CESE的三个模块,形成统一框架,支持跨多种视频源的一致事件检索。其核心方法包括统一剪辑算法、基于JPEG文件大小变化的无训练关键帧提取方法DAKE,以及受循环神经网络启发的时序一致字幕生成框架ReCap,有效提升了大规模多模态事件检索的效率与准确性。

Comments Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)

详情
AI中文摘要

从大规模视频数据集中检索事件由于复杂的时空和多模态信息而具有挑战性。本文介绍了U-CESE,这是我们对AI挑战赛胡志明市2025的解决方案,一个统一的基于片段的事件搜索引擎,用于跨多种视频源的多模态事件检索。在CESE的基础上,U-CESE将其三个模块集成到一个统一的框架中,确保跨查询类型的一致处理和检索。核心组件是统一剪辑算法,它将单独的剪辑算法合并为一个高效的流水线。为了处理大规模数据,我们提出了DAKE,一种轻量级、无需训练的关键帧提取方法,利用JPEG文件大小变化来识别显著的场景变化。最后,我们引入了ReCap,一个受循环神经网络启发的时序一致字幕生成框架,生成详细且上下文感知的文本描述。实验表明,U-CESE在大规模多模态事件检索中提供了稳健、一致且高效的性能。

英文摘要

Retrieving events from large-scale video datasets is challenging due to complex temporal, spatial, and multimodal information. This paper presents U-CESE, our solution for the AI Challenge HCMC 2025, a Unified Clip-based Event Search Engine for multimodal event retrieval across diverse video sources. Building on CESE, U-CESE integrates its three modules into a single cohesive framework, ensuring consistent processing and retrieval across query types. A core component is the Unified Clipping Algorithm, which merges separate clipping algorithms into one efficient pipeline. To handle large-scale data, we propose DAKE, a lightweight, training-free keyframe extraction method using JPEG file size variations to identify significant scene changes. Finally, we introduce ReCap, a temporally consistent captioning framework inspired by recurrent neural networks, generating detailed and context-aware textual descriptions. Experiments show that U-CESE delivers robust, consistent, and efficient performance in large-scale multimodal event retrieval.
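
The abstract only says that DAKE uses JPEG file size variations to detect scene changes, so the sketch below illustrates that idea rather than the published algorithm. It assumes frames are already encoded as JPEGs at a fixed quality, and the relative-change rule, the threshold, and the function name are hypothetical choices.

```python
def keyframes_by_size(sizes, rel_threshold=0.2):
    """Pick keyframe indices from a sequence of JPEG file sizes in bytes.

    A frame becomes a keyframe when its size differs from the size of the
    most recent keyframe by more than rel_threshold (relative change).
    The intuition: at fixed JPEG quality, a large jump in compressed size
    suggests the visual content has changed.
    """
    if not sizes:
        return []
    keys = [0]          # always keep the first frame
    ref = sizes[0]      # size of the most recent keyframe
    for idx in range(1, len(sizes)):
        change = abs(sizes[idx] - ref) / max(ref, 1)
        if change > rel_threshold:
            keys.append(idx)
            ref = sizes[idx]
    return keys


# Example: two size jumps, at frame 3 and frame 6.
frame_sizes = [50_000, 51_000, 49_500, 80_000, 81_000, 79_000, 40_000]
print(keyframes_by_size(frame_sizes))  # [0, 3, 6]
```

In practice the sizes could be read with `os.path.getsize` on extracted frames, which is why an approach like this needs no model inference.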

2605.23272 2026-05-25 cs.LG cs.AI

When Good Equations Get Bad Scores: Improving Symbolic Regression Through Better Parameter Optimization

当好方程得到差分数:通过更好的参数优化改进符号回归

Boxiao Wang, Kai Li, Zhiwei Chen, Yang Huang, Runxiang Wang, Ziwen Zhang, Yifan Zhang, Jian Cheng

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 符号回归(SR)在科学知识发现中扮演重要角色,旨在从观测数据中提炼出数学方程。现有方法通常采用双层优化框架,但参数拟合质量直接影响结构评分,导致正确结构可能因局部最优解而被低估。为此,本文提出SAGE-Fit,一种基于符号表达式结构与语义先验的拟合框架,有效缓解了优化瓶颈,显著提升了符号回归系统的评估准确性和整体性能。

详情
AI中文摘要

符号回归(SR)通过从观测数据中提炼数学方程,在科学知识发现中发挥核心作用。大多数现有SR方法在双层优化框架内运行:外层循环搜索离散方程结构,内层循环优化该结构的连续参数。关键的是,参数拟合质量直接决定结构的得分,从而影响外层搜索。然而,非线性算子使得内层循环高度非凸,且预算驱动的快速局部求解器(如BFGS)的依赖常常导致正确的结构陷入较差的局部极小值并被低估得分。这种“好结构、差分数”现象成为关键瓶颈,降低效率并误导搜索偏离真实方程。为解决此问题,我们提出SAGE-Fit(结构感知与语义引导的符号回归评估器),一个利用符号表达式双重原生先验的SR原生拟合框架。通过利用SR特有的结构和语义先验,我们为每个属性设计定制模块,从而有效缓解这一优化瓶颈。大量实验表明,我们的方法作为即插即用模块,显著提升评估保真度,并普遍提高各种SR系统的性能。

英文摘要

Symbolic Regression (SR) plays a central role in scientific knowledge discovery by distilling mathematical equations from observational data. Most existing SR methods function within a bi-level optimization framework: an outer loop that searches for the discrete equation structure, and an inner loop that optimizes the continuous parameters of that structure. Crucially, parameter-fitting quality directly determines a structure's score and thus the outer-loop search. However, nonlinear operators make the inner loop highly non-convex, and budget-driven reliance on fast local solvers (e.g., BFGS) often yields poor local minima and underestimated scores for correct structures. This "Good Structure, Bad Score" phenomenon becomes a key bottleneck, degrading efficiency and misguiding the search away from the true equation. To resolve this, we propose SAGE-Fit (Structure-Aware and Semantics-Guided Evaluator for Symbolic Regression), an SR-native fitting framework that exploits the dual native priors of symbolic expressions. By capitalizing on the structural and semantic priors unique to SR, we design tailored modules for each property, thereby effectively mitigating this optimization bottleneck. Extensive experiments demonstrate that our approach, as a plug-and-play module, significantly enhances evaluation fidelity and universally improves the performance of various SR systems.
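
The "Good Structure, Bad Score" effect can be illustrated with a toy inner loop. In this sketch the structure `sin(w * x)` is correct for the data, yet the score it receives depends on where the local solver starts. Plain gradient descent stands in for a fast local solver such as BFGS, and multi-start is used only as a generic remedy; it is not SAGE-Fit, whose modules the abstract does not detail. All names are hypothetical.

```python
import math


def mse(w, xs, ys):
    """Score of the structure y = sin(w * x) for a given parameter w."""
    return sum((math.sin(w * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def local_fit(w0, xs, ys, lr=0.01, steps=500):
    """Deterministic gradient descent from w0 (stand-in local solver)."""
    w = w0
    for _ in range(steps):
        grad = sum(
            2 * (math.sin(w * x) - y) * math.cos(w * x) * x
            for x, y in zip(xs, ys)
        ) / len(xs)
        w -= lr * grad
    return w, mse(w, xs, ys)


def multi_start_fit(starts, xs, ys):
    """Run the local solver from several starts and keep the best score."""
    return min((local_fit(w0, xs, ys) for w0 in starts), key=lambda r: r[1])


# Data generated by the true equation y = sin(3 x).
xs = [i / 10 for i in range(-30, 31)]
ys = [math.sin(3.0 * x) for x in xs]

single = local_fit(0.5, xs, ys)
multi = multi_start_fit([0.5, 1.5, 2.5, 3.5], xs, ys)
print("single start:", single)
print("multi start: ", multi)
```

Because the multi-start set contains the single start, its score can never be worse. The outer loop of an SR system would then rank this structure by the better score instead of discarding it after one unlucky fit.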

2605.23271 2026-05-25 cs.CV cs.AI

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

EvalVerse:面向专业电影级视频生成的流水线感知与专家校准基准测试

Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯) Tsinghua University(清华大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Film Academy(北京电影学院) Stanford University(斯坦福大学) The Chinese University of Hong Kong(香港中文大学) Singapore Institute of Technology(新加坡理工学院)

AI总结 随着生成式视频基础模型的快速发展,影视级视频生成成为研究热点,但现有的评估方法多关注生成内容是否符合提示,而忽视了其艺术质量、表演和美学表现。为解决这一问题,本文提出 EvalVerse,一个流程感知且由专家校准的评估框架,通过构建专业影视制作流程的评估体系、收集大规模专家标注数据,并结合专家校准的微调策略提升视觉语言模型的推理能力,从而实现对视频生成质量的全面评估,为未来奖励模型和评估代理的研究提供了基础支撑。

详情
AI中文摘要

生成式视频基础模型的快速发展推动该领域向专业级电影合成迈进。为达到如此苛刻的质量,社区正转向强化学习和智能体工作流。然而,可靠的评估已成为关键瓶颈。现有基准主要评估“是否正确”(基本提示遵循),而从根本上忽略了“是否优良”(电影质量、表演和美学)。此外,当前的自动指标缺乏提供可信信号所需的领域特异性,在人类审美感知与机器评分之间造成了严重的可信度差距。为弥合这一差距,我们引入了EvalVerse,一个全面、流水线感知且专家校准的评估框架。我们将视频生成评估不仅视为一项工程任务,而是作为一个核心科学问题:主观电影专业知识的系统数字化。首先,我们将领域知识组织成与专业电影制作工作流(前期制作、制作和后期制作)一致的评估分类法。其次,我们将人类专家判断提炼为带有大规模人工标注的精选数据集。第三,我们通过专家校准的微调策略将这些知识注入视觉语言模型,使VLM能够执行显式的思维链推理。与先前工作相比,EvalVerse不仅保持与基础“正确性”指标的兼容性,还显著扩展了“优良性”标准,并将任务覆盖范围拓宽到复杂的多镜头序列和视听整合。因此,通过提供细粒度的诊断信号,EvalVerse超越了静态排行榜,为未来工作(如奖励模型和评估智能体)建立了基础基础设施。

英文摘要

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous works, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.

2605.23270 2026-05-25 cs.CV cs.AI cs.RO

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

ChainFlow-VLA: 基于视觉语言模型的因果流规划

Xiyang Wang, Xinlin Wang, Tingguang Zhou, Gong Chen, Xingtai Gui, Zhi Xu, Xiaolei Wu, Feiyang Tan, Hangning Zhou, Mu Yang

发表机构 * Afari Intelligent Drive(阿法瑞智能驾驶) Tianjin University(天津大学) University of Macau(澳门大学)

AI总结 当前端到端自动驾驶系统在时间因果推理与全局轨迹一致性之间存在根本性矛盾。为解决这一问题,本文提出 ChainFlow-VLA,通过统一因果生成与全局优化的联合概率框架,将因果推理与全局轨迹修正相结合。该方法利用视觉语言模型作为语义先验,在保留因果结构的基础上进行轨迹修正,实验表明其在复杂场景中表现出色,达到了与人类相当的高水平性能。

详情
AI中文摘要

当前的端到端自动驾驶系统从根本上受到时间因果推理与全局轨迹一致性之间不匹配的限制。自回归(AR)模型通过因果分解捕获交互感知的时间依赖性,但其逐步解码导致误差累积和次优的全局结构。相比之下,扩散模型全局优化轨迹但缺乏显式因果约束,使其在交互和关键安全场景中不可靠。这种二分法揭示了一个更深层次的问题:现有方法将因果建模和全局优化视为分离的范式,没有原则性的方式将它们统一在单个轨迹分布中。为了解决这个问题,我们提出了ChainFlow-VLA,它在统一的概率框架内统一了因果生成和全局细化。我们将规划公式化为AR诱导模式的混合,并学习这些模式上的视觉语言模型(VLM)条件残差分布。自回归生成器(Chain)生成一组离散的因果轨迹模式,随后基于扩散的细化器(Flow)利用VLM隐藏状态作为语义先验,在残差空间中执行模式条件校正,同时保持因果结构。这种直接的调节将高层场景理解无缝注入到细粒度的轨迹调整中。实验表明,ChainFlow-VLA在模糊和长尾场景中实现了鲁棒的规划,在NAVSIM v1排行榜上取得了94.85的最新分数,匹配人类水平(94.8)。代码将在https://github.com/AFARI-Research/ChainFlow-VLA提供。

英文摘要

Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which combines causal generation and global refinement within a single probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This direct conditioning injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA plans robustly in ambiguous and long-tail scenarios, reaching a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, which matches human-level performance (94.8). Code will be available at https://github.com/AFARI-Research/ChainFlow-VLA.

2605.23264 2026-05-25 cs.CV cs.AI

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

着色噪声:用于忠实图像超分辨率的对抗性Sobolev对齐

Hongbo Wang, Huaibo Huang, Pin Wang, Jinhua Hao, Chao Zhou, Ran He

发表机构 * MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI总结 图像超分辨率生成中,生成先验常导致还原不够忠实,本文认为这是由于各向同性目标与自然图像内在流形之间存在基本的谱不匹配。为解决这一问题,研究提出了一种基于Sobolev诱导黎曼几何的ASASR框架,通过显式地对噪声转移核进行谱色处理,使其更符合自然图像的谱衰减特性,并引入基于Riesz表示定理的参数化对抗网络,生成针对性的负样本以引导优化方向。实验表明,该方法在保持谱一致性和结构保真度方面优于现有生成方法,有效减少了伪影。

Comments Accepted to ICML 2026

详情
AI中文摘要

图像超分辨率(SR)中的生成先验常常损害忠实重建,我们将这一限制归因于各向同性目标与内在自然图像流形之间的基本光谱失配。虽然直接偏好优化提供了一条对齐路径,但其对光谱平坦高斯噪声的依赖无法区分真实高频细节与幻觉。为了弥合这一几何差距,我们提出了ASASR,一个理论基础的框架,通过显式着色噪声转移核以镜像自然光谱衰减,将生成流重铸为Sobolev诱导的黎曼几何。驱动这一几何对齐,我们集成一个基于Riesz表示定理的参数化对抗器,该对抗器合成目标负样本,等效于最坏情况下的Sobolev梯度,以沿着可能结构失效的切空间引导优化。大量评估表明,ASASR优于领先的生成基线,特别是在保持光谱一致性和结构保真度方面,提供了一种有效缓解伪影的鲁棒解决方案。

英文摘要

Generative priors in Image Super-Resolution (SR) often compromise faithful restoration; we attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. To drive this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.

2605.23263 2026-05-25 cs.RO cs.AI cs.SY eess.SP eess.SY

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

6G通信网络赋能具身智能体:架构与原型

Lipeng Dai, Luping Xiang, Kun Yang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Institute of Intelligent Networks and Communications (NINE), Nanjing University (Suzhou Campus)(南京大学智能网络与通信研究所(苏州校区))

AI总结 本文研究了6G通信网络如何支持具身智能体的通信需求,探讨了具身智能体与6G网络之间的协同关系,并提出了面向人机远程交互的分层通信架构。通过构建包含触觉设备、工业机械臂和5G O-RAN测试平台的原型系统,验证了该架构在毫秒级时延和稳定闭环控制方面的可行性,为未来6G与具身智能体的融合应用提供了重要参考。

详情
AI中文摘要

具身智能体将智能决策与物理执行相结合,对通信提出了比纯软件智能体更严格和多样化的要求。尽管6G承诺亚毫秒级延迟、超高可靠性、原生智能和集成感知,但如何利用这些能力支持具身智能体通信的系统性研究仍然有限。本文从概念和工程两个角度研究了面向具身智能体的6G通信系统。首先,我们回顾了具身智能体的概念和具身价值,并澄清了其与非具身智能体的区别。然后,我们分析了具身智能体与6G网络的共生关系,强调了关键6G使能技术如何支持人机交互的严苛需求。此外,我们展示了具身智能体通过覆盖扩展、环境感知和物理世界理解在增强通信网络中的主动作用。基于这些见解,我们提出了一种用于人机远程交互的分层通信架构,包括人类意图感知层、基于开放无线接入网(O-RAN)的传输层、智能中间层和具身层。为验证其可行性,我们实现了一个端到端原型,集成了触觉设备、工业机械臂、中间平台和5G O-RAN测试床。实验结果表明毫秒级延迟和稳定的闭环操作,证实了所提架构的实用性,并为未来6G-具身智能体研究和工业部署提供了参考。

English abstract

Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept and embodiment value of embodied agents and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks, highlighting how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.

2605.23262 2026-05-25 cs.AI

Design and Report Benchmarks for Knowledge Work

知识工作的设计与报告基准

Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

Affiliations: Harvard University (哈佛大学); University of Technology Sydney (悉尼科技大学); Stanford University (斯坦福大学); Raycaster AI

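AI summary: To address the problem of evaluating AI systems for knowledge work, this paper proposes a three-step benchmark design approach that makes explicit the correspondence between task scores and actual work outcomes. It argues that current knowledge-work evaluation still follows the logic of traditional NLP tasks and therefore does not faithfully reflect a system's capabilities in real deployments. The authors build a benchmark design framework along three dimensions (work activity, tested setting, and scoring criteria), distill 18 categories of work activities from the O*NET occupational task database, and use three case studies to show how the approach applies across different knowledge-work scenarios.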

AI Chinese abstract

LLM智能体的发展催生了越来越多关于知识工作AI的研究,包括编程、研究和医疗保健。然而,当前的知识工作评估和基准设计在很大程度上仍遵循传统NLP任务的逻辑。因此,更高的基准性能并不能可靠地表明系统能够在实际部署环境中执行知识工作。本文提出了一种三步法,用于明确基准任务如何代表其分数所附的工作主张:定义被评估的工作活动,指定测试设置,并对适当的工作产品进行评分。我们回顾了工作研究表明,知识工作是通过角色和职责、本地材料和工具以及必须在下游工作流程中保持可用的工件来组织的。然后,我们将这些关注点转化为基准设计和报告指南,涵盖任务应如何映射到工作活动、测试设置应如何指定材料、工具、角色和约束,以及评分应如何关注系统留下的工作产品。为了命名被评估的工作活动并将其与常见的基准任务区分开来,我们从O*NET职业任务数据库中导出了18个工作活动的清单。我们通过三个基准案例分析来演示该方法:GDPval(一个非代码职业交付物基准)、OfficeQA Pro(一个基于文档的分析基准,通过最终答案评分)和APEX-SWE(一个软件工程基准,具有可执行评分产品)。这些案例展示了基准设计选择如何塑造分数所能支持的最强工作主张,以及基准任务、测试设置、评分产品和更广泛工作主张之间出现的差距。

English abstract

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

2605.23259 2026-05-25 cs.LG cs.AI cs.CL

Multi-Gate Residuals

多门残差

Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou

Affiliations: Shanghai Yichuang Information Technology Co., Ltd. (上海亿创信息技术有限公司); Fudan University (复旦大学)

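AI summary: This paper proposes Multi-Gate Residuals (MGR), a method that aims to address unbounded activation growth in deep residual networks without introducing additional communication overhead. It maintains multi-stream context through a simple scoring and gating mechanism and extracts hidden states with attention pooling, keeping activation scales stable while improving model performance. Experiments indicate that MGR is practical for large-scale training and deployment and outperforms existing architectures.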

AI Chinese abstract

虽然注意力残差在解决深度残差层中普遍存在的激活值无界增长问题方面显示出一定效果,但它不可避免地引入了显著的通信开销。为了规避这一瓶颈,我们提出了多门残差(MGR),它在不增加通信负担的情况下稳定激活尺度。它利用简单的评分和门控机制来维护多流上下文,并结合注意力池化从流状态中提取隐藏状态。实证实验表明,MGR对于大规模训练和部署是实用的,相比现有架构提供了切实的性能提升。

English abstract

While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.
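
The abstract does not give the MGR equations, so the following is only a generic sketch of softmax attention pooling over a handful of stream states, not the authors' formulation. The function names, the scaled dot-product scoring, and the single `query` vector are all assumptions. It shows one reason pooling of this kind keeps scales in check: the output is a convex combination of the stream states, so each coordinate stays within the range spanned by the inputs.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    streams: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool several stream states into one hidden state.

    Hypothetical scoring rule: scaled dot product between each stream and a
    query vector. Returns (pooled_state, weights).
    """
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, s)) / math.sqrt(d) for s in streams]
    weights = softmax(scores)
    pooled = [sum(w * s[j] for w, s in zip(weights, streams)) for j in range(d)]
    return pooled, weights


if __name__ == "__main__":
    streams = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
    pooled, weights = attention_pool(streams, query=[0.5, -0.5])
    print("weights:", weights)
    print("pooled: ", pooled)
```

Because the weights are non-negative and sum to one, the pooled state cannot exceed the largest stream state in any coordinate, in contrast to a plain residual sum, whose magnitude can grow with depth.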

2605.23258 2026-05-25 cs.LG

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

一种改进基于驱逐的KV缓存压缩的简单插件

Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee

Affiliations: Michigan State University (密歇根州立大学); Hippocratic AI (希波克拉底AI)

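AI summary: Growth of the key-value (KV) cache is a major bottleneck in long-context inference for large language models. This paper proposes VECTOR, a plug-and-play method for improving eviction-based KV cache compression. It introduces three-way token routing (retention, approximation, and eviction) and combines an importance signal from a base scorer with a reconstructability signal from offline-calibrated, regression-based value estimation, improving the quality-memory trade-off under cache compression, especially under strict memory budgets.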

AI Chinese abstract

KV缓存增长是大语言模型长上下文推理的主要瓶颈。现有方法通常以二元驱逐或表示近似为主,可能未充分利用那些对精确保留不关键但仍可重构的令牌。我们提出VECTOR,一种用于基于驱逐的流水线的即插即用增强,引入了三路令牌路由:保留、近似和驱逐。VECTOR将来自基础评分器的重要性信号与来自离线校准的基于回归的值估计的可重构性信号相结合。通过利用可重构性,VECTOR恢复了在二元驱逐下本会不可逆丢失的有用值信息,同时保留关键向量以保证注意力路由稳定性。实验结果表明,VECTOR在中高压缩率下改善了质量-内存权衡,在更严格的预算方案中尤其有显著收益。

English abstract

KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical for exact retention but are still reconstructable. We present VECTOR, a plug-and-play augmentation for eviction-based pipelines that introduces three-way token routing: retention, approximation, and eviction. VECTOR combines an importance signal from the base scorer with a reconstructability signal from an offline-calibrated regression-based value estimation. By leveraging reconstructability, VECTOR recovers useful value information that would otherwise be irreversibly lost under binary eviction, while preserving key vectors for attention routing stability. Experimental results show that VECTOR improves quality-memory trade-offs under medium-to-high compression, with especially clear gains in stricter budget regimes.
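
The three-way routing described above (retention, approximation, eviction) can be sketched as a simple budgeted selection. This is an illustration under stated assumptions, not VECTOR's actual rule: the abstract does not say how the two signals are combined, so the sketch ranks by importance first and by reconstructability second. `route_tokens`, `keep_budget`, and `approx_budget` are hypothetical names.

```python
from typing import Dict, List, Sequence


def route_tokens(
    importance: Sequence[float],
    reconstructability: Sequence[float],
    keep_budget: int,
    approx_budget: int,
) -> Dict[str, List[int]]:
    """Assign each cached token index to retain / approximate / evict.

    Assumed policy (not taken from the paper):
      1. retain the `keep_budget` tokens with the highest importance;
      2. among the rest, approximate the `approx_budget` tokens whose values
         are estimated to be most reconstructable;
      3. evict everything else.
    """
    n = len(importance)
    if len(reconstructability) != n:
        raise ValueError("score lists must have the same length")

    by_importance = sorted(range(n), key=lambda i: importance[i], reverse=True)
    retained = set(by_importance[:keep_budget])

    rest = [i for i in range(n) if i not in retained]
    by_recon = sorted(rest, key=lambda i: reconstructability[i], reverse=True)
    approximated = set(by_recon[:approx_budget])

    return {
        "retain": sorted(retained),
        "approximate": sorted(approximated),
        "evict": sorted(i for i in rest if i not in approximated),
    }


if __name__ == "__main__":
    routes = route_tokens(
        importance=[0.9, 0.1, 0.5, 0.2, 0.05],
        reconstructability=[0.2, 0.8, 0.1, 0.3, 0.9],
        keep_budget=2,
        approx_budget=2,
    )
    print(routes)
```

In a binary-eviction pipeline, the tokens in the "approximate" bucket would be dropped outright. According to the abstract, the middle tier instead lets their value information be recovered through a regression-based estimate while their key vectors are preserved.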