arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.21413 2026-05-22 cs.AI

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma

Affiliations: Peking University; Advanced Institute of Big Data

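AI summary: This paper proposes teaching AI through benchmark construction and presents QuestBench as a course-based practice that helps students understand the responsibilities of knowledge work in the AI era.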

Comments 24 pages, 5 figures, 4 tables

Abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

2605.21214 2026-05-22 cs.LG cs.AI

Behavior-Consistent Deep Reinforcement Learning

Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton

Affiliations: University of Pennsylvania; Princeton University; University of Texas at Austin

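AI summary: This paper proposes a behavior-consistent deep reinforcement learning method that reduces policy divergence across training runs by controlling the distributional similarity of policies, improving stability without sacrificing performance.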

Abstract

Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
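
The link between temperature and cross-run divergence can be illustrated with a small numerical sketch. It assumes discrete actions and uses the standard bound KL(π₁‖π₂) ≤ 2·max|Q₁ − Q₂|/τ for Boltzmann policies π ∝ exp(Q/τ), which may differ in form from the paper's exact statement. The Q-values and function names are hypothetical and do not come from the paper.

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Softmax over Q-values divided by the temperature."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical Q-value estimates for one state from two training runs.
q_run_a = np.array([1.0, 2.0, 0.5, 1.5])
q_run_b = np.array([1.4, 1.7, 0.9, 1.2])
disagreement = float(np.max(np.abs(q_run_a - q_run_b)))

for temperature in (0.1, 1.0, 10.0):
    pi_a = boltzmann_policy(q_run_a, temperature)
    pi_b = boltzmann_policy(q_run_b, temperature)
    kl = kl_divergence(pi_a, pi_b)
    bound = 2.0 * disagreement / temperature
    print(f"temperature={temperature:5.1f}  KL={kl:.5f}  bound={bound:.3f}")
```

Setting the temperature proportional to the disagreement keeps the ratio in the bound constant, which is the intuition behind a disagreement-driven temperature schedule; the state-dependent QED schedule itself is not reproduced here.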

2605.21143 2026-05-22 cs.SD cs.LG

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

Alexander Gebhard, Andreas Triantafyllopoulos, Dominik Arend, Sandra Müller, Svenja Schmidt, Michael Scherer-Lorenzen, Björn W. Schuller

Affiliations: TUM University Hospital, CHI - Chair of Health Informatics, Ismaninger Str. 22, 81675 Munich, Bavaria, Germany; University of Freiburg, Faculty of Biology, Geobotany, Schaenzlestr. 1, 79104 Freiburg, Baden-Württemberg, Germany; MCML - Munich Center for Machine Learning, Munich, Bavaria, Germany; Imperial College London, GLAM - Group on Language, Audio, & Music, London, UK

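AI summary: This paper proposes CoarseSoundNet, a model that classifies biophony, geophony, and anthropophony under realistic noisy conditions, and improves generalization to passive acoustic monitoring recordings through a systematic study of model architectures, training data, and evaluation strategies.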

Comments Currently under review

Abstract

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.
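
The post-processing mentioned in the abstract (class-specific decision thresholds followed by duration-based constraints) can be sketched as follows. This is a minimal illustration under assumptions: the threshold values, minimum run lengths, and function names are hypothetical, and the paper's actual rules may differ.

```python
import numpy as np

# Hypothetical per-class thresholds and minimum run lengths (in frames).
THRESHOLDS = {"biophony": 0.5, "geophony": 0.7, "anthropophony": 0.6}
MIN_FRAMES = {"biophony": 1, "geophony": 3, "anthropophony": 2}

def apply_threshold(scores, threshold):
    """Mark frames whose score reaches the class-specific threshold."""
    return np.asarray(scores, dtype=float) >= threshold

def drop_short_runs(active, min_frames):
    """Clear runs of consecutive True frames shorter than min_frames."""
    active = np.asarray(active, dtype=bool)
    result = active.copy()
    start = None
    # A trailing False closes any run that reaches the end of the array.
    for i, flag in enumerate(np.append(active, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start < min_frames:
                result[start:i] = False
            start = None
    return result

def postprocess(scores_by_class):
    """Apply each class's threshold, then its duration constraint."""
    return {
        name: drop_short_runs(apply_threshold(scores, THRESHOLDS[name]), MIN_FRAMES[name])
        for name, scores in scores_by_class.items()
    }

# Hypothetical frame-level scores for one class.
detections = postprocess({"geophony": [0.8, 0.9, 0.1, 0.75, 0.8, 0.9, 0.2]})
print(detections["geophony"])
```

In this example the first two frames pass the threshold but form a run shorter than the three-frame minimum, so only the later three-frame run survives.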

2605.20975 2026-05-22 cs.LG cs.CR

Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning

Adda Akram Bendoukha, Heber Hwang Arcolezi, Nesrine Kaaniche, Aymen Boudguiga

Affiliations: GitHub

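AI summary: This paper proposes a proactive client selection framework that finds, before training begins, a federation of clients whose combined data meet utility and fairness requirements, improving the efficiency and fairness of federated learning.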

Abstract

Federated Learning enables collaborative model training across decentralized data sources without data transfer. Averaging-based FL is limited by the presence of non-IID data, which negatively impacts convergence speed and final model accuracy. Conventional alternatives suffer from significant inefficiency. Clients with noisy or highly heterogeneous data contribute expensive gradient computations that are either discarded or heavily down-weighted before aggregation. These reactive approaches waste computational resources, require more communication rounds and result in unnecessary privacy exposure. In this paper, we propose a proactive client selection framework that aims to find an optimal federation of clients whose combined data match utility and fairness requirements before training begins. Our method relies on mutual information computed from differentially private contingency tables to quantify the relevance of cross-feature correlations in the union dataset. We introduce a Potential Federation Loss (PFL) over the set of fixed-size federations, which balances two objectives: maximizing collective data utility while ensuring fair cross-feature correlations to prevent group unfairness. Client selection is expressed as an optimal subset search problem over the PFL objective, which we solve using simulated annealing under strong differential privacy guarantees for clients' local statistics. Experimental results on four benchmarks show faster, fairer, and more accurate models trained on optimally found federations, compared to uniform sampling, even when state-of-the-art adaptive aggregation or sampling strategies are employed.
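
Computing mutual information from a differentially private contingency table can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: it uses the Laplace mechanism with scale 1/ε (sensitivity 1 under add-or-remove-one-record neighboring datasets), clips negative noisy counts to zero, and the function names are hypothetical.

```python
import numpy as np

def privatize_table(counts, epsilon, rng):
    """Add Laplace noise to every cell, then clip negative counts to zero."""
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0.0, None)

def mutual_information(table):
    """Mutual information (in nats) of the joint distribution implied by a count table.

    Assumes the table has a positive total.
    """
    table = np.asarray(table, dtype=float)
    joint = table / table.sum()
    row = joint.sum(axis=1, keepdims=True)
    col = joint.sum(axis=0, keepdims=True)
    expected = row * col  # joint distribution under independence
    mask = joint > 0      # zero cells contribute nothing to the sum
    return float(np.sum(joint[mask] * np.log(joint[mask] / expected[mask])))

# Hypothetical 2x2 contingency table of two binary features on one client.
rng = np.random.default_rng(0)
counts = np.array([[40, 10], [10, 40]])
print("exact MI:  ", mutual_information(counts))
print("private MI:", mutual_information(privatize_table(counts, epsilon=1.0, rng=rng)))
```

Clipping is post-processing, so it does not weaken the privacy guarantee, although it does bias the estimate for sparse tables.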

2605.20578 2026-05-22 cs.SD cs.CV

A strongly annotated passive acoustic dataset for tropical bird monitoring

Daniela Ruiz, Juan Sebastián Ulloa, Zhongqi Miao, Nicolás Betancourt, Maria Paula Toro-Gómez, Andrés Hernández, Bruno Demuro, Eliana Barona-Cortés, Angela Mendoza-Henao, Andrés Sierra-Ricaurte, Sebastián Pérez-Peña, Rahul Dodhia, Pablo Arbeláez, Juan M. Lavista Ferres

Affiliations: Microsoft AI for Good Research Lab; Instituto de Investigación de Recursos Biológicos Alexander von Humboldt; Center for Research and Formation in Artificial Intelligence; Fundación Manacus; Louisiana State University; Museum of Natural Sciences

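AI summary: This paper presents PteroSet, a strongly annotated audio dataset for tropical bird monitoring released in a COCO-inspired JSON format, which serves as a machine learning benchmark and comes with a deep learning baseline for binary bird detection.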

Abstract

Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet's usability and the challenges it presents.

2605.20342 2026-05-22 cs.CV

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing

Affiliations: MiroMind; NTU; HKU; HKUST(GZ); THU; LMMs-Lab

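AI summary: This paper proposes ParaVT, an end-to-end reinforcement learning framework for parallel video tool calling, and introduces PARA-GRPO to address the Tool Prior Paradox, improving long-video understanding performance.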

Comments Project Page: https://evolvinglmms-lab.github.io/ParaVT/

Abstract

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

2605.20069 2026-05-22 cs.LG cs.GT

Smooth Partial Lotteries for Stable Randomized Selection

Alexander Goldberg, Giulia Fanti, Nihar B. Shah

Affiliations: New Zealand Health Research Council; Swiss National Science Foundation; European Research Council; Science Foundation Ireland; Volkswagen Foundation; The British Academy; Austrian Science Fund (FWF); Formas; Luebber et al.

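AI summary: This paper proposes smoothness as a design principle for partial lotteries, formalized as a Lipschitz condition on the mapping from scores to selection probabilities. It introduces the Clipped Linear Lottery, proves that it achieves a better smoothness-regret tradeoff than alternatives, and validates it experimentally on real peer review data.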

Abstract

Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates, and eventually choose a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. However, existing lottery designs are inherently unstable, as a small change to a single candidate's score can cause large shifts in their selection probabilities. This instability undermines a key goal of lotteries: reducing the influence of fine-grained score distinctions near the decision boundary. We propose smoothness as a design principle for partial lotteries, formalizing it as a Lipschitz condition on the mapping from review scores over candidates to selection probabilities. We introduce the Clipped Linear Lottery, a simple mechanism in which selection probabilities scale linearly with estimated quality between an upper threshold, above which we always accept, and a lower threshold, below which we always reject. We prove that the Clipped Linear Lottery's worst-case regret matches a lower bound for any smooth selection rule up to a factor of $(1 - k/n)$, where $k/n$ is the acceptance rate. We compare smooth selection to other stability notions like Individual Fairness and Differential Privacy, showing that the Clipped Linear Lottery achieves a better smoothness-regret tradeoff than alternatives. Experiments on real peer review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation demonstrate that existing lottery designs are highly unstable in practice even under perturbations to a single score. Our experiments also confirm the tightness of our theoretical analysis and show that our proposed Clipped Linear Lottery achieves a better smoothness-utility tradeoff than alternatives in practice.
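
The score-to-probability map described in the abstract can be sketched in a few lines. This is a simplified illustration under assumptions: the thresholds are given as inputs, candidates are drawn independently, and the function names are hypothetical. How the paper chooses the thresholds so that the lottery selects the intended number of candidates is not reproduced here.

```python
import numpy as np

def clipped_linear_probabilities(scores, lower, upper):
    """Map scores to selection probabilities: 0 below `lower`, 1 above `upper`,
    and linear in between. The map is Lipschitz with constant 1 / (upper - lower)."""
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    scores = np.asarray(scores, dtype=float)
    return np.clip((scores - lower) / (upper - lower), 0.0, 1.0)

def run_lottery(scores, lower, upper, rng):
    """Draw an independent accept/reject decision for each candidate."""
    probs = clipped_linear_probabilities(scores, lower, upper)
    return rng.random(probs.shape) < probs

# Hypothetical review scores on a 1-10 scale.
scores = [2, 4, 5, 6, 8]
print(clipped_linear_probabilities(scores, lower=4, upper=6))
print(run_lottery(scores, lower=4, upper=6, rng=np.random.default_rng(0)))
```

Widening the gap between the thresholds lowers the Lipschitz constant, so a one-point change in a single score moves that candidate's selection probability by at most 1 / (upper - lower).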

2605.19965 2026-05-22 cs.LG eess.SP

Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz

Affiliations: Gatsby Computational Neuroscience Unit, University College London, UK; Department of Computer Science, University of Oxford, UK; Brain Network Dynamics Unit, University of Oxford, UK; MRC Centre of Research Excellence in Restorative Neural Dynamics, UK

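AI summary: This paper proposes Predictive Entropy Maximization, a source separation method based on local plasticity and dendritic computation that maximizes a regularized second-order entropy over structured source domains. It remains robust under increasing source correlation and observation noise, outperforming biologically plausible algorithms and staying competitive with exact baselines.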

Abstract

Blind source separation (BSS) is a natural framework for studying how latent causes may be recovered from sensory mixtures, but deriving online and biologically plausible algorithms for structured (i.e., constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from maximization of an entropy measure, yet its online implementations involve complex and nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive performance in BSS, using only local weight updates. The method employs a close approximation of an entropy measure, yielding an objective function with easily interpretable components. Minimizing this objective leads to a predictive neural architecture in which feedforward synapses follow an error-driven rule (that can be realized through dendritic mechanisms), lateral inhibitory connections are learned with local Hebbian plasticity, and source-domain constraints are enforced through simple output nonlinearities. We derive explicit spectral bounds on the surrogate error, characterizing when the approximation is accurate. Empirically, Predictive Entropy Maximization remains robust under increasing source correlation and observation noise, outperforms biologically plausible algorithms that rely on stronger independence or decorrelation assumptions, and remains competitive with exact determinant- and correlative-information-based baselines. These results show how local plasticity and adaptive lateral inhibition can emerge from maximizing a regularized second-order entropy over structured source domains. Our implementation code is available at https://github.com/BariscanBozkurt/Predictive-Entropy-Maximization.

2605.19578 2026-05-22 cs.CV cs.AI

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

Affiliations: State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School; Department of Computer Science and Engineering, State University of New York at Buffalo

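AI summary: This paper proposes Lens Privacy Sealing (LPS), a hardware solution that physically obscures the camera lens with adjustable laminating film for low-cost pre-sensor privacy protection. It also introduces the P$^3$AR dataset for privacy-preserving action recognition and the MSPNet framework to handle the video degradation caused by LPS; experiments show that MSPNet offers advantages in both action recognition accuracy and privacy protection.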

Comments Accepted by IEEE Transactions on Image Processing (TIP), 2026

Abstract

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

2605.19329 2026-05-22 cs.CV cs.AI

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; State Key Laboratory of General Artificial Intelligence, BIGAI

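AI summary: This paper proposes RE-VLM, a dual-stream vision-language model that combines RGB images and event streams to improve scene understanding under both normal and adverse conditions. Using the high temporal resolution and wide dynamic range of event camera data, RE-VLM outperforms existing models on captioning and visual question answering.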

Comments 10 pages, 6 figures, 6 tables

Abstract

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments.

2605.18507 2026-05-22 cs.CV

Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

Jingyun Fu, Zhiyu Xiang, Na Zhao

Affiliations: Zhejiang University, Hangzhou, Zhejiang, China; Zhejiang Provincial Key Laboratory of Multi-Modal Communication Networks and Intelligent Information Processing; Singapore University of Technology and Design

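AI summary: This paper proposes a weakly supervised framework for radar scene flow learning that uses images and odometry for auxiliary supervision, combining instance-aware self-supervised losses with a rigid loss for static regions to improve scene flow estimation.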

Comments Accepted by ICML2026

Abstract

Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar's inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specifically, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar's intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.

2605.18047 2026-05-22 cs.RO

FUSE: A Framework for Unified State Estimation in Vehicular and Robotic SLAM Systems

FUSE:一种用于车辆和机器人SLAM系统统一状态估计的框架

Wei Wu, Honglin Chen, Wenhan Cao, Yao Lyu, Shaobing Xu, Kun Jiang, Jiangtao Li, Tao Zhang, Lei Guo, Shengbo Eben Li

Affiliations: State Key Lab of Intelligent Green Vehicle and Mobility, Tsinghua University; School of VM and College of AI, Tsinghua University; SunRisingAI Ltd.; China Intelligent and Connected Vehicles (Beijing) Research Institute Co., Ltd.

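AI summary: This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems that separates temporal processing, local geometric association, estimator formulation, and map-update policy, making state-estimation design more flexible and accurate.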

Abstract

Tightly coupled SLAM formulations under mixed-rate sensing often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. Such binding makes it difficult to vary one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR-IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418 m loop-corridor sequence, the instantiation reports a 1.626 m end-to-end trajectory error, corresponding to a 7.9% relative error reduction compared with Faster-LIO, the lowest-error baseline on this sequence. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.
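
The four-part interface named in the abstract (observation ingestion, propagation, update, state query) can be illustrated with a toy one-dimensional estimator. Everything below is a hypothetical sketch, not FUSE's actual API: the class, method names, and scalar Kalman-style update are assumptions chosen only to show how high-rate and low-rate observations can pass through the same interface.

```python
from dataclasses import dataclass, field

@dataclass
class ScalarEstimator:
    """Toy 1D estimator exposing ingest / propagate / update / query."""
    position: float = 0.0
    variance: float = 1.0
    velocity: float = 0.0
    time: float = 0.0
    process_noise: float = 0.01  # variance growth per second
    pending: list = field(default_factory=list)

    def ingest(self, timestamp, kind, value, noise=0.0):
        """Buffer an observation: 'velocity' (high rate) or 'position' (low rate)."""
        self.pending.append((timestamp, kind, value, noise))

    def propagate(self, timestamp):
        """Advance the state to `timestamp` using the latest velocity."""
        dt = timestamp - self.time
        self.position += self.velocity * dt
        self.variance += self.process_noise * dt
        self.time = timestamp

    def update(self, measurement, noise):
        """Scalar Kalman-style correction from a position measurement."""
        gain = self.variance / (self.variance + noise)
        self.position += gain * (measurement - self.position)
        self.variance *= 1.0 - gain

    def query(self):
        """Process buffered observations in time order and return the state."""
        for timestamp, kind, value, noise in sorted(self.pending, key=lambda obs: obs[0]):
            self.propagate(timestamp)
            if kind == "velocity":
                self.velocity = value
            else:
                self.update(value, noise)
        self.pending.clear()
        return self.position, self.variance

estimator = ScalarEstimator()
estimator.ingest(1.0, "position", 1.2, noise=0.5)  # arrives first, is processed second
estimator.ingest(0.0, "velocity", 1.0)
print(estimator.query())
```

Because callers only touch the four methods, the buffering rule or the correction step could be swapped without changing how observations are fed in, which is the kind of separation the abstract argues for.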

2605.17950 2026-05-22 cs.RO cs.SY eess.SY

Active Defense Against False Data Injection Attacks in Robotic Manipulators

对抗机器人机械臂中虚假数据注入攻击的主动防御

Gabriele Gualandi, Carl Mikael Larsson, Alessandro V. Papadopoulos

发表机构 * Mälardalen University, Västerås, Sweden

AI总结 本文提出两种防御方法,即异常感知虚拟阻尼和可操作度降低,以提高机器人机械臂抵御有限时域虚假数据注入攻击的韧性,并通过仿真验证其有效性。

Comments Extended 8-page version containing full proofs. An abridged 6-page version has been accepted for publication in the Proceedings of the 23rd IFAC World Congress (2026). v3: Minor typographical fixes and updated reference formatting

详情
AI中文摘要

机器人系统容易受到虚假数据注入攻击(FDIA)的影响,攻击者通过篡改传感器信号来实施恶意控制。反馈线性化使机器人系统暴露出积分器漏洞,使其容易遭受隐蔽攻击,这类攻击可在不触发警报的情况下使末端执行器行为产生显著偏差。本文针对机械臂抵御有限时域FDIA的韧性问题,形式化了两种防御方法,即异常感知虚拟阻尼和可操作度降低,并对名义任务执行给出概率保证。在7自由度冗余机械臂上的仿真表明,与仅使用基于阈值的ADS(如卡方检测器)相比,所提出的防御方法显著降低了FDIA的影响,同时在无攻击情况下保持了名义任务性能。

英文摘要

Robotic systems are vulnerable to False Data Injection Attacks (FDIAs), where adversaries corrupt sensor signals to gain malicious control. Feedback linearization exposes robotic systems to integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end-effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite-horizon FDIAs by formalizing two defense methods, namely anomaly-aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7-DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIAs compared with relying solely on a threshold-based ADS such as the chi-squared detector, while preserving nominal task performance in the absence of attacks.
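
To make the baseline concrete, here is a minimal sketch of a threshold-based chi-squared detector and of why a small constant bias can stay below its threshold while an integrator accumulates it. This illustrates only the generic detector and the stealthiness idea, not the paper's defenses. The diagonal-covariance simplification, the function names, and the assumption that the injected bias appears unchanged in the residual are all choices made for this example.

```python
from typing import Sequence, Tuple

# 95% quantile of the chi-squared distribution with 1 degree of freedom (table value).
CHI2_95_DOF1 = 3.841


def chi2_statistic(residual: Sequence[float], variances: Sequence[float]) -> float:
    """g = sum(r_i^2 / sigma_i^2), i.e. r^T S^-1 r for a diagonal covariance S."""
    return sum(r * r / v for r, v in zip(residual, variances))


def alarm(residual: Sequence[float], variances: Sequence[float], threshold: float) -> bool:
    """Raise an alarm when the statistic exceeds the threshold."""
    return chi2_statistic(residual, variances) > threshold


def stealthy_drift(bias: float, steps: int, dt: float,
                   variance: float = 1.0,
                   threshold: float = CHI2_95_DOF1) -> Tuple[float, int]:
    """Integrate a constant injected bias; return (accumulated drift, alarm count)."""
    drift, alarms = 0.0, 0
    for _ in range(steps):
        if alarm([bias], [variance], threshold):
            alarms += 1
        drift += bias * dt  # the integrator accumulates the undetected bias
    return drift, alarms
```

With `bias=0.5` and unit variance the statistic is 0.25 at every step, so no alarm fires, yet the integrated error keeps growing with the attack duration.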

2605.16984 2026-05-22 cs.CL

Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution

在CRAC 2026上缩小差距:面向基于LLM的多语言共指消解的两阶段适配

Antoine Bourgois, Olga Seminck, Thierry Poibeau

发表机构 * Lattice (CNRS UMR 8094 & ENS-PSL & Université Sorbonne Nouvelle)(Lattice(CNRS UMR 8094、ENS-PSL、新索邦大学))

AI总结 本文提出了一种基于LLM的多语言共指消解方法,采用两阶段适配策略,在CRAC 2026共享任务中取得74.32的平均CoNLL F1分数,在LLM赛道排名第一、总体排名第三。

详情
AI中文摘要

我们介绍提交给2026年指称、回指与共指计算模型研讨会(CRAC 2026)共享任务LLM赛道的系统。在官方测试集上,我们的系统平均CoNLL F1分数为74.32,在LLM赛道排名第一,总体排名第三。该系统基于Gemma-3-27b模型,采用两阶段策略微调:先训练多语言基础适配器,再训练数据集特定适配器。我们采用带局部重新索引的XML风格格式,以中心词表示提及跨度,并以迭代方式标注文档。这些设计选择在不同语言、文档长度和标注规范下均被证明有效。

英文摘要

We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.

2605.16545 2026-05-22 cs.LG cs.AI cs.CL

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

Symphony for Speech-to-Text: 支持实时医疗语音接口

Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

发表机构 * Corti

AI总结 本文提出Symphony for Speech-to-Text,一种医疗级实时语音识别系统,通过分解转录过程为识别、格式化和上下文校正等专业化组件,优化医学术语召回,实现实时临床结构文本生成,并在医疗场景中显著优于现有系统,同时在通用领域表现不逊。

Comments Updated with a correction and improvement to Symphony's performance in spoken punctuation evaluation (R_punct, P_punct)

详情
AI中文摘要

在经历数十年的听写应用以及近年来的环境式文档记录之后,语音正成为医疗领域与技术和AI交互的主要方式。然而,医疗语音识别仍然困难:系统必须捕捉专业术语,消解上下文歧义,并精确呈现测量值、缩写和临床简写。现有解决方案通常要么针对通用转录优化,要么针对狭窄的听写工作流优化,这限制了其在安全关键场景中的可靠性以及在更广泛临床工作流中的实用性。我们提出Symphony for Speech-to-Text,一种面向实时流式和基于文件的批量临床使用的医疗级语音识别系统。Symphony将转录过程分解为识别、格式化和上下文校正等专门组件,以优化医学术语召回率,同时实时生成临床结构化文本并适应不同使用场景。在公共基准和医疗语音数据集上的评估表明,Symphony在临床场景中显著优于最先进的系统,在通用领域场景中与之持平或更优,表明其具有稳健的泛化能力而非过拟合。我们发布了一个临床基准数据集,以支持可靠的验证并推动医疗语音识别的进一步发展。Symphony通过生产级API提供,可用于实时听写、对话转录和批量音频文件处理。

英文摘要

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.15153 2026-05-22 cs.RO cs.AI

Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Pelican-Unify 1.0:一种用于理解和推理、想象和行动的统一具身智能模型

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju

发表机构 * Beijing Innovation Center of Humanoid Robotics (X-Humanoid)(北京人形机器人创新中心(X-Humanoid))

AI总结 本文提出Pelican-Unify 1.0,一种基于统一原则训练的首个具身基础模型,通过单一视觉语言模型作为统一理解模块,将场景、指令、视觉上下文和行动历史映射到共享语义空间,并通过统一推理模块生成任务、行动和未来导向的思维链,最终将隐藏状态投影到密集潜在变量中,再通过统一未来生成器生成未来视频和行动。

详情
AI中文摘要

我们提出了Pelican-Unify 1.0,首个根据统一原则训练的具身基础模型。Pelican-Unify 1.0使用单一视觉语言模型作为统一理解模块,将场景、指令、视觉上下文和行动历史映射到共享语义空间。同一视觉语言模型也作为统一推理模块,通过单次前向传递自回归地生成任务、行动和未来导向的思维链,并将最终隐藏状态投影到密集潜在变量中。统一未来生成器(UFG)然后基于该潜在变量,在同一去噪过程中通过两个模态特定的输出头联合生成未来视频和未来行动。语言、视频和行动损失均反向传播到共享表示中,使模型在训练过程中共同优化理解和推理、想象和行动,而非训练三个独立专家系统。实验表明,统一并不意味着妥协。通过单一检查点,Pelican-Unify 1.0在所有三种能力上均取得强劲表现:在八个VLM基准测试中得分为64.7,是同类模型中最佳;在WorldArena中得分为66.03,排名第一;在RoboTwin中得分为93.5,是对比行动方法中第二好的平均值。这些结果表明,统一范式在保持专业能力的同时,将理解和推理、想象和行动整合到一个模型中。

英文摘要

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

2605.15040 2026-05-22 cs.AI cs.CL

Orchard: An Open-Source Agentic Modeling Framework

Orchard:一个开源的智能体建模框架

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

发表机构 * Microsoft Research(微软研究院) Columbia University(哥伦比亚大学) UIUC(伊利诺伊大学香槟分校)

AI总结 本文提出Orchard,一个开源的智能体建模框架,通过轻量级环境服务和三种智能体建模方案,实现了跨领域可复用的智能体数据、训练和评估。

详情
AI中文摘要

智能体建模旨在通过规划、推理、工具使用以及与环境的多轮交互,将大语言模型转化为能够解决复杂任务的自主智能体。尽管投入巨大,开放研究仍受制于基础设施和训练方面的差距。许多高性能系统依赖专有代码库、模型或服务,而大多数开源框架专注于编排和评估,而非可扩展的智能体训练。我们提出了Orchard,一个用于可扩展智能体建模的开源框架。其核心是Orchard Env,一个轻量级环境服务,为跨任务领域、智能体框架(harness)和流水线阶段的沙盒生命周期管理提供可复用的原语。在Orchard Env之上,我们构建了三种智能体建模方案。Orchard-SWE面向编码智能体:我们从MiniMax-M2.5和Qwen3.5-397B中蒸馏出107K条轨迹,引入信用分配SFT以从未解决轨迹中的有效片段学习,并在强化学习中应用平衡自适应采样(Balanced Adaptive Rollout)。从Qwen3-30B-A3B-Thinking出发,Orchard-SWE在SWE-bench Verified上经SFT后达到64.3%,经SFT+RL后达到67.5%,在同等规模的开源模型中创下新的最佳水平。Orchard-GUI仅使用0.4K条蒸馏轨迹和2.2K个开放式任务训练了一个4B视觉-语言计算机使用智能体,在WebVoyager、Online-Mind2Web和DeepShop上分别达到74.1%、67.0%和64.0%的成功率,成为最强的开源模型,同时与专有系统相比仍具竞争力。Orchard-Claw面向个人助理智能体,仅用0.2K个合成任务训练,在Claw-Eval上达到59.6%的pass@3,与更强的ZeroClaw框架配合时达到73.9%。总体而言,这些结果表明,一个轻量级、开放、与具体框架无关的环境层能够支持跨领域可复用的智能体数据、训练方案和评估。

英文摘要

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

2605.14257 2026-05-22 cs.CL

Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?

Sakura参加BEA 2026共享任务1:是什么让词汇变难?

Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka

发表机构 * RIKEN(理化学研究所) The University of Osaka(大阪大学) Nara Institute of Science and Technology(奈良先端科学技术大学院大学) National Tsing Hua University(国立清华大学) The University of Tokyo(东京大学) Tohoku University(东北大学)

AI总结 本文提出两种词汇难度预测模型:一种高精度的黑盒模型,在开放赛道取得共享任务最佳成绩;另一种可解释模型,优于微调编码器基线。黑盒模型通过软目标损失函数微调LLM,在评分任务中达到r>0.91;可解释模型在保持强相关性(r>0.77)的同时,揭示了影响每个题目难度的因素。进一步分析显示,英国文化教育协会基于知识的词汇表(KVL)中题目的难度除了受词汇本身的产出难度影响外,还常受拼写难度或测试题目构造的影响。

Comments To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

详情
AI中文摘要

我们描述了两类词汇难度预测模型:一种高精度的黑盒模型,在开放赛道取得了共享任务的最佳成绩;另一种可解释模型,优于微调编码器基线。对于黑盒模型,我们使用软目标损失函数微调LLM,使其有效适用于评分任务,达到r > 0.91。可解释模型在保持强相关性(r > 0.77)的同时,揭示了影响每个题目难度的因素。我们进一步分析结果,表明英国文化教育协会基于知识的词汇表(KVL)中题目的难度,除了词汇本身的产出难度之外,还常常受到拼写难度或测试题目构造的影响。我们的代码已公开:https://github.com/ynklab/vocabulary-difficulty 。

英文摘要

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/ynklab/vocabulary-difficulty .

2605.13989 2026-05-22 cs.CL

VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

VectraYX-Nano: 一个具有课程学习和原生工具使用的4200万参数西班牙语网络安全语言模型

Juan S. Santillana

发表机构 * Globant

AI总结 本文提出了一种基于西班牙语的网络安全语言模型VectraYX-Nano,通过课程学习和原生工具调用,展示了在网络安全领域的应用与改进。

Comments 24 pages, 5 figures, 12 tables. v3: post-Chinchilla compute ablation (v8-v15), Globant affiliation finalized, EMNLP Findings 2026 submission. Released model: VectraYX-Nano v7 (42M params, GGUF Q4 ~20 MB, native MCP)

详情
AI中文摘要

我们提出VectraYX-Nano,一个从头训练的41.95M参数仅解码器语言模型,面向西班牙语网络安全领域,侧重拉丁美洲地区,并通过模型上下文协议(MCP)原生调用工具。该工作有四项贡献。(i)语料库:VectraYX-Sec-ES,一个170M token的西班牙语语料库,由八台虚拟机组成的分布式流水线构建,云计算成本约25美元,分为三个课程阶段(对话42M、网络安全118M、攻击性工具10M)。(ii)架构:一个42M参数的Transformer解码器,采用GQA、QK-Norm、RMSNorm、SwiGLU、RoPE和z-loss,并配有领域平衡的16,384 token字节回退BPE分词器。(iii)跨三个阶段带回放的课程学习使损失单调下降(9.80 -> 3.17 -> 3.00 -> 2.16);SFT(损失1.74)之后,v2 bootstrap消融参考模型在N=4个随机种子上于B5取得0.775 +/- 0.043的对话门控得分;对第二阶段回放比例{0,5,10,25,50}%的受控扫描显示,B5在回放比例>=25%时趋于饱和。(iv)两项经验发现,均为N=4。针对v2(OpenSubs)、v4(mC4-ES)和v6(60/25/15 OpenSubs/mC4/Wiki)的受控bootstrap语料消融揭示了损失与语域的倒置:困惑度更低的bootstrap语料反而带来明显更差的对话行为(在每个配对种子上,B5均为v2 > v4 > v6)。B4(工具选择)得分下限0.000是语料密度造成的假象,而非容量瓶颈:将SFT混合数据中的工具使用比例重新平衡为1:21后得到VectraYX-Nano v7,即对外发布的主要配置,在42M参数下达到B4 = 0.230 +/- 0.052,同时保持B1 = 0.332 +/- 0.005和B5 = 0.725 +/- 0.130;在一个260M参数、从头训练的中等规模模型上进行的LoRA复现达到0.445 +/- 0.201。发布的GGUF文件在F16下为96 MB,可在llama.cpp下于普通硬件上实现亚秒级TTFT,并且据我们所知,是首个公开发表的、具备端到端MCP集成的西班牙语原生网络安全LLM。

英文摘要

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline at ~$25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74) the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is a corpus-density artifact, not a capacity gate: rebalancing the SFT mixture to tool-use ratio 1:21 yields VectraYX-Nano v7, the released headline configuration, reaching B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, runs sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.

2605.12058 2026-05-22 cs.LG cs.AI

Holder Policy Optimisation

Hölder Policy Optimisation

Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

发表机构 * University College London(伦敦大学学院) Shanghai Jiao Tong University(上海交通大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出HölderPO框架,通过Hölder均值统一token级概率聚合,解决固定聚合机制导致的训练崩溃与性能不足问题,理论证明不同p值对梯度集中度和方差的平衡作用,并通过动态退火算法实现训练周期内的p值调度,实验表明其在多个数学基准测试中取得更优的稳定性和收敛性。

详情
AI中文摘要

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO, and secures an exceptional 93.8% success rate on ALFWorld.

英文摘要

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO, and secures an exceptional 93.8% success rate on ALFWorld.
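
For readers unfamiliar with the Hölder (power) mean, the sketch below shows how a single parameter p moves the aggregate of a set of positive per-token values between the harmonic (p = -1), geometric (p -> 0), and arithmetic (p = 1) means. It illustrates the mean itself, not the HölderPO objective. The function names are hypothetical, and the linear schedule is an assumed stand-in, since the abstract does not specify the annealing rule or its direction.

```python
import math
from typing import Sequence


def holder_mean(values: Sequence[float], p: float) -> float:
    """Power mean M_p = (mean(v_i ** p)) ** (1 / p); geometric mean as p -> 0."""
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0.0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit case p -> 0: exp of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def linear_schedule(step: int, total_steps: int, p_start: float, p_end: float) -> float:
    """Assumed linear interpolation of p over training (illustrative only)."""
    return p_start + (p_end - p_start) * step / total_steps
```

For a fixed set of values, M_p is non-decreasing in p, so larger p lets the largest values dominate the aggregate while smaller p pulls it toward the smallest values.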

2605.11739 2026-05-22 cs.CL

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

学会预见:揭示在线策略蒸馏的效率来源

Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang

发表机构 * USTC(中国科学技术大学) Tencent(腾讯) NUS(新加坡国立大学) HKUST(GZ)(香港科技大学(广州)) UCAS-IIE(中国科学院大学、中国科学院信息工程研究所) SHU(上海大学)

AI总结 本文研究了在线策略蒸馏(OPD)的效率来源,提出EffOPD方法,通过自适应选择外推步长并沿当前更新方向移动来加速OPD,在保持最终性能相当的同时实现平均3倍的训练加速。

详情
AI中文摘要

在线策略蒸馏(OPD)已成为大型语言模型的一种高效的后训练范式。然而,现有研究大多将其优势归因于更密集和稳定的监督,而OPD效率背后的参数级机制仍不清晰。本文认为OPD的效率源于一种“预见”机制:它在训练早期就建立了指向最终模型的稳定更新轨迹。这种预见体现在两个方面。首先,在模块分配层面,OPD识别出边际效用低的区域,并将更新集中在对推理更关键的模块上。其次,在更新方向层面,OPD表现出更强的低秩集中,其主导子空间在训练早期就与最终更新子空间紧密对齐。基于这些发现,我们提出了EffOPD,一种即插即用的加速方法,通过自适应选择外推步长并沿当前更新方向移动来加速OPD。EffOPD不需要额外可训练模块或复杂的超参数调优,实现了平均3倍的训练加速,同时保持可比的最终性能。整体而言,我们的发现为理解OPD的效率提供了参数动态视角,并为设计更高效的大型语言模型后训练方法提供了实用见解。

英文摘要

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of 3× while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
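
The idea of extrapolating along the current update direction can be sketched on plain parameter lists as below. This is an illustration under stated assumptions, not EffOPD itself: the abstract does not say how the step size is chosen, so picking the candidate with the lowest value of a caller-supplied `loss_fn` is a hypothetical selection rule, and the candidate set is arbitrary.

```python
from typing import Callable, List, Sequence, Tuple


def extrapolate(theta_prev: Sequence[float], theta_curr: Sequence[float],
                alpha: float) -> List[float]:
    """Move from theta_curr by alpha times the latest update (theta_curr - theta_prev)."""
    return [c + alpha * (c - p) for p, c in zip(theta_prev, theta_curr)]


def pick_step(theta_prev: Sequence[float], theta_curr: Sequence[float],
              loss_fn: Callable[[Sequence[float]], float],
              candidates: Sequence[float] = (0.0, 1.0, 2.0, 4.0)
              ) -> Tuple[float, List[float]]:
    """Assumed rule: keep the candidate step whose extrapolated point has the lowest loss."""
    best = min(candidates,
               key=lambda a: loss_fn(extrapolate(theta_prev, theta_curr, a)))
    return best, extrapolate(theta_prev, theta_curr, best)
```

Including `0.0` among the candidates means the procedure can fall back to the unextrapolated parameters when no larger step helps.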

2605.11246 2026-05-22 cs.LG

Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

面向离线黑盒优化的支撑邻近增强扩散估计

Yonghan Yang, Ye Yuan, Zipeng Sun, Linfeng Du, Bowei He, Haolun Wu, Can Chen, Xue Liu

发表机构 * MBZUAI - Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) McGill University(麦吉尔大学) Mila - Quebec AI Institute(Mila - 魁北克人工智能研究所) Amazon AGI(亚马逊AGI)

AI总结 本文提出SPADE框架,通过条件生成建模重新想象前向替代建模,利用扩散模型建模前向似然p(y|x),并引入校准扩散估计模块和支撑接近正则化机制,以提高优化性能。

Comments Accepted by ICML 2026. First two authors contributed equally

详情
AI中文摘要

离线黑盒优化旨在仅使用静态数据集发现具有高属性分数的新设计,这一任务从根本上受到分布外(OOD)外推问题的挑战。现有方法通常分为两类:逆向方法难以应对从分数映射到设计这一问题的不适定性;前向方法则往往缺乏有效量化不确定性所需的分布表达能力。在本文中,我们提出SPADE(Support-Proximity Augmented Diffusion Estimation),一种从条件生成建模的视角重新审视前向代理建模的新框架。SPADE使用扩散模型对前向似然p(y|x)建模,并通过两项关键增强使其适配优化任务:(1)校准扩散估计模块,在统计矩和成对排序上强制全局一致性;(2)支撑邻近正则化机制,通过基于kNN的密度估计隐式内化数据流形约束p(x)。理论上,我们证明该正则化在一阶意义上等价于最大化带有有效设计先验的贝叶斯后验。实验上,SPADE在Design-Bench任务和一个LLM数据混合优化基准上取得了最先进的性能。

英文摘要

Offline black-box optimization aims to discover novel designs with high property scores using only a static dataset, a task fundamentally challenged by the out-of-distribution (OOD) extrapolation problem. Existing approaches typically bifurcate into inverse methods, which struggle with the ill-posed nature of mapping scores to designs, and forward methods, which often lack the distributional expressivity to quantify uncertainty effectively. In this work, we propose SPADE (Support-Proximity Augmented Diffusion Estimation), a novel framework that reimagines forward surrogate modeling through the lens of conditional generative modeling. SPADE models the forward likelihood p(y|x) using a diffusion model, but with two critical enhancements to tailor it for optimization: (1) a Calibrated Diffusion Estimation module that enforces global consistency in statistical moments and pairwise rankings, and (2) a Support-Proximity Regularization mechanism that implicitly internalizes the data manifold constraint p(x) via kNN-based density estimation. Theoretically, we prove that our regularization is first-order equivalent to maximizing a Bayesian posterior with a valid design prior. Empirically, SPADE achieves state-of-the-art performance across Design-Bench tasks and an LLM data mixture optimization benchmark.
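
The abstract mentions kNN-based density estimation as the way the data-manifold constraint p(x) enters. A minimal sketch of the classical kNN density estimate, and of one way it could penalize candidates far from the dataset, follows. The penalty form (`pred - lam * r_k`) and all function names are assumptions for illustration; the paper's actual regularizer is not given in the abstract.

```python
import math
from typing import Sequence


def kth_neighbor_distance(x: Sequence[float], data: Sequence[Sequence[float]],
                          k: int) -> float:
    """Euclidean distance from x to its k-th nearest point in data (k is 1-based)."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    return sorted(math.dist(x, d) for d in data)[k - 1]


def knn_density(x: Sequence[float], data: Sequence[Sequence[float]], k: int) -> float:
    """Classical estimate p(x) ~ k / (n * V_d * r_k^d), V_d = unit-ball volume.

    Undefined (division by zero) when r_k is 0, e.g. x duplicates k data points.
    """
    d = len(x)
    r = kth_neighbor_distance(x, data, k)
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return k / (len(data) * unit_ball * r ** d)


def penalized_score(pred: float, x: Sequence[float],
                    data: Sequence[Sequence[float]], k: int, lam: float) -> float:
    """Assumed penalty: subtract lam times the k-th neighbor distance from the prediction."""
    return pred - lam * kth_neighbor_distance(x, data, k)
```

A candidate sitting in a sparse region has a large k-th neighbor distance, hence a low estimated density and a larger penalty, which discourages out-of-distribution extrapolation.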

2605.10696 2026-05-22 cs.RO

VRA: Grounding Discrete-Time Joint Acceleration in Voltage-Constrained Actuation

VRA:将离散时间关节加速度建立在电压受限的驱动之上

Lingwei Zhang, Jiaming Wang, Tianlin Zhang, Zhitao Song, Xuanqi Zeng, Weipeng Xia, Zhongyu Li, Yun-hui Liu

发表机构 * Department of Mechanical and Automation Engineering(机械与自动化工程系) Hong Kong Embodied AI Lab(香港具身AI实验室) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出VRA方法,通过将运动学加速度与电压受限致动器物理相联系,解决在电压受限情况下不可实现的加速度问题,实验表明该方法能消除不可实现的加速度,恢复一致的近约束执行并减少约束引起的振荡。

Comments 10 pages, Accepted by RSS 2026

详情
AI中文摘要

离散时间关节加速度约束被广泛用于强制执行位置和速度限制。然而,在电压受限的电动执行器下,运动学上可行的加速度可能在物理上无法实现,这暴露出执行层面缺失的抽象。我们提出电压可实现加速度(VRA),一种关节级加速度接口,通过将指令加速度限制在电压可实现的约束范围内,使运动学加速度以电压受限的执行器物理为基础。在电动执行器和轮腿式四足机器人上的硬件实验表明,VRA消除了不可实现的加速度,恢复了一致的近约束执行,并减少了约束引起的振荡。

英文摘要

Discrete-time joint acceleration constraints are widely used to enforce position and velocity limits. However, under voltage-constrained electric actuators, kinematically admissible accelerations may be physically unrealizable, exposing a missing execution-level abstraction. We propose Voltage-Realizable Acceleration (VRA), a joint-level acceleration interface that grounds kinematic acceleration in voltage-constrained actuator physics by restricting commanded accelerations to voltage-realizable constraints. Hardware experiments on electric actuators and a wheel-legged quadruped show that VRA removes unrealizable accelerations, restores consistent near-constraint execution, and reduces constraint-induced oscillations.
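
To see why a kinematically admissible acceleration can be unrealizable under a voltage limit, consider a textbook steady-state DC motor model: V = R·i + k_e·ω, torque τ = k_t·i, and acceleration a = τ/J. The sketch below clamps a commanded acceleration to the range this model allows at the current speed. It is a simplified illustration, not the paper's VRA formulation: inductance, friction, load torque, and discrete-time effects are ignored, and every parameter value and name is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MotorParams:
    v_max: float       # supply voltage limit [V]
    resistance: float  # winding resistance [ohm]
    k_e: float         # back-EMF constant [V*s/rad]
    k_t: float         # torque constant [N*m/A]
    inertia: float     # reflected inertia [kg*m^2]


def accel_bounds(m: MotorParams, omega: float) -> Tuple[float, float]:
    """Acceleration range reachable with |V| <= v_max at joint speed omega."""
    back_emf = m.k_e * omega
    i_min = (-m.v_max - back_emf) / m.resistance
    i_max = (m.v_max - back_emf) / m.resistance
    return (m.k_t * i_min / m.inertia, m.k_t * i_max / m.inertia)


def clamp_accel(m: MotorParams, omega: float, a_cmd: float) -> float:
    """Restrict a commanded acceleration to the voltage-realizable range."""
    lo, hi = accel_bounds(m, omega)
    return min(max(a_cmd, lo), hi)
```

In this model the back-EMF shrinks the available accelerating voltage as speed rises, so the upper bound falls with ω even though a purely kinematic limit would stay constant.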

2605.08982 2026-05-22 cs.LG

PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

PMCTS:用于原理化并行推断时间扩展的粒子蒙特卡洛树搜索

Yaniv Oren, Viliam Vadocz, Joery A. de Vries, Wendelin Böhmer, Matthijs T. J. Spaan, Hendrik Baier

发表机构 * Department of Intelligent Systems, TU Delft(代尔夫特理工大学智能系统系) Department of Computer Science, ETH Zürich(苏黎世联邦理工学院计算机科学系) Trent AI Limited(Trent AI有限公司) Information Systems, TU Eindhoven(埃因霍温理工大学信息系统系) Centrum Wiskunde & Informatica, Amsterdam(阿姆斯特丹数学与信息学研究中心)

AI总结 本文提出PMCTS,一种适用于神经网络评估的原理化并行MCTS算法,通过并行计算实现推断时间扩展,并在多个领域中显著优于传统启发式基线方法。

详情
AI中文摘要

蒙特卡洛树搜索(MCTS)是一种通过搜索来改进策略的广泛使用的方法,其在现实世界应用中日益受到关注。由于其搜索过程的顺序性和确定性,利用并行计算扩展MCTS的运行时间仍然是一个主要挑战。我们引入了粒子MCTS(PMCTS),据我们所知,这是首个原理化并行MCTS算法,适用于神经网络评估,并能保持正式的策略改进保证。经验上,PMCTS在并行计算下表现良好,并在多个领域中显著优于流行的启发式基线方法。

英文摘要

Monte Carlo Tree Search (MCTS) is a widely used approach for policy improvement through search with increasing popularity for real world applications. Due to the sequential and deterministic nature of its search, runtime-scaling of MCTS with parallel compute remains a major challenge. We introduce Particle MCTS (PMCTS), to our knowledge the first principled parallel MCTS algorithm which is suited for neural network evaluations and can preserve formal policy improvement guarantees. Empirically, PMCTS scales well with parallel compute and significantly outperforms the popular heuristic-based baselines across domains.

2605.08389 2026-05-22 cs.CV cs.AI

Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

解耦端点与语义转换学习以实现零样本复合图像检索

Mingyu Liu, Sihan Huang, Yijia Fan, Yinlin Yan, Quan Zhang, Jian-Fang Hu, Jianhuang Lai

发表机构 * Sun Yat-sen University(中山大学) Guangdong Province Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing(机器智能与先进计算重点实验室)

AI总结 本文提出了一种解耦端点与语义转换学习的方法DeCIR,用于零样本复合图像检索:通过构造配对的正向/反向编辑元组,训练独立的低秩文本适配器分支,并利用低秩方向合并(LRDM)将它们合并为一个可部署的适配器,从而提升了基于投影的零样本复合图像检索性能。

详情
AI中文摘要

零样本复合图像检索(ZS-CIR)在不依赖人工标注的CIR三元组的情况下,从参考图像和文本修改中检索目标图像。基于投影的ZS-CIR方法因其不依赖LLM并在推理时保持轻量而具有吸引力,但它们在复杂语义修改上往往表现不佳。这一差距反映了基于投影的ZS-CIR中的语义转换瓶颈:端点级匹配可以让编辑文本作为目标侧的属性线索,而不是作为源条件的语义转换。我们进一步表明,将语义转换监督添加到相同的文本适配器中会创建端点对齐与语义转换对齐之间的冲突。为了解决这一冲突,DeCIR解耦端点与转换学习。它从图像-标题对中构建配对的正向/反向编辑元组,训练独立的低秩文本适配器分支用于端点对齐和语义转换对齐,并将它们通过低秩方向合并(LRDM)合并为一个可部署的适配器。在CIRR、CIRCO、FashionIQ和GeneCIS上的大量实验表明,DeCIR在不增加推理复杂性的情况下,一致提升了基于投影的ZS-CIR性能。

英文摘要

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

2605.07711 2026-05-22 cs.CL

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

SimCT:为跨分词器在线策略蒸馏恢复丢失的监督

Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Large Language Model Department, Tencent(腾讯大语言模型部门) Shanghai Innovation Institute(上海创新研究院) Zhongguancun Academy(中关村学院)

AI总结 本文提出SimCT,一种改进的在线策略蒸馏方法,通过扩展监督空间来恢复因分词差异而丢失的监督信号,从而在数学推理和代码生成任务中提升了性能。

Comments 4 figures, 6 tables, 28 pages

详情
AI中文摘要

在线策略蒸馏(OPD)是一种标准工具,用于将教师行为迁移到较小的学生模型中,但它隐含假设教师和学生的预测可以逐token比较,而只要两个模型对同一文本的分词方式不同,这一假设就会失效。在异构分词器情况下,精确的共享token匹配会悄然丢弃大量教师信号,而且恰恰发生在词表不一致的位置。我们提出了简单的跨分词器OPD(SimCT),它通过扩大监督空间来恢复这一信号:除共享token之外,SimCT还在两种分词器都能实现的短多token续写上比较教师和学生,而OPD损失形式本身保持不变。我们证明这些单元是最细粒度的可联合分词监督接口,而更粗粒度的替代方案会抹去对在线策略学习有用的教师-学生差异。在三个异构教师-学生对上,SimCT在数学推理和代码生成基准上相对共享词表OPD和代表性跨分词器基线取得了一致的提升,消融实验确认改进来自恢复精确共享token匹配所丢弃的监督。代码可在 https://github.com/sunjie279/SimCT- 获取。

英文摘要

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-.
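
One way to read "finest jointly tokenizable" units is as the pieces of text between character boundaries that both tokenizations share. The sketch below computes those pieces for two token sequences covering the same string. It reflects this reading of the abstract only, and the function names are hypothetical; the SimCT repository may define the units differently.

```python
from itertools import accumulate
from typing import List, Sequence, Set


def boundaries(tokens: Sequence[str]) -> Set[int]:
    """Character offsets at which each token of the sequence ends."""
    return set(accumulate(len(t) for t in tokens))


def joint_units(teacher_tokens: Sequence[str],
                student_tokens: Sequence[str]) -> List[str]:
    """Split the text at every boundary present in both tokenizations."""
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must cover the same text")
    units, start = [], 0
    for end in sorted(boundaries(teacher_tokens) & boundaries(student_tokens)):
        units.append(text[start:end])
        start = end
    return units
```

For teacher tokens `["un", "believ", "able"]` and student tokens `["unb", "eliev", "able"]`, the only common boundaries are after `unbeliev` and at the end, so `able` is a shared single token while `unbeliev` is a multi-token unit that exact shared-token matching would skip.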

2605.07598 2026-05-22 cs.LG

Optimal Recourse Summaries via Bi-Objective Decision Tree Learning

Ioannis Chatzis, Jason Liartis, Athanasios Voulodimos, Giorgos Stamou

Affiliations: Artificial Intelligence and Learning Systems Laboratory; National Technical University of Athens

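AI summary: This paper proposes SOGAR, which casts recourse summary learning as an optimal decision tree learning problem and finds the Pareto front, balancing recourse effectiveness against cost and producing stable, low-cost, and effective recourse summaries.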

English abstract

Actionable Recourse provides individuals with actions they can take to change an unfavorable classifier outcome. While useful at the instance level, it is ill-suited for global auditing and bias detection, since aggregating local actions is costly and often inconsistent. Recourse Summaries address this limitation by partitioning the population and assigning one shared action per subgroup, enabling comparison across subgroups. Designing summaries involves a fundamental trade-off between recourse effectiveness and recourse cost, which existing methods do not adequately address. We introduce Summaries of Optimal and Global Actionable Recourse (SOGAR), which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front -- the complete set of solutions where improving one objective necessarily worsens the other. SOGAR enables post-hoc selection of the desired trade-off without retraining. Using shallow axis-parallel decision trees and sparse leaf actions, SOGAR produces stable, low-cost, and effective recourse summaries that outperform existing approaches across effectiveness and cost metrics.
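The Pareto front and the post-hoc selection step can be made concrete with a short sketch. It does not reproduce SOGAR, which finds the front through optimal decision tree learning; it only filters an already enumerated list of candidate summaries. The candidate names, scores, and the helpers `pareto_front` and `select_under_budget` are hypothetical.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    Each candidate is (name, effectiveness, cost). Higher effectiveness
    and lower cost are better. Hypothetical helper for illustration.
    """
    front = []
    for name, eff, cost in candidates:
        dominated = any(
            e >= eff and c <= cost and (e > eff or c < cost)
            for _, e, c in candidates
        )
        if not dominated:
            front.append((name, eff, cost))
    return sorted(front, key=lambda item: item[2])  # cheapest first


def select_under_budget(front, max_cost):
    """Pick the most effective front member within a cost budget, or None."""
    affordable = [item for item in front if item[2] <= max_cost]
    return max(affordable, key=lambda item: item[1]) if affordable else None


if __name__ == "__main__":
    # Made-up (name, effectiveness, cost) triples.
    summaries = [
        ("A", 0.9, 5.0),
        ("B", 0.7, 2.0),
        ("C", 0.6, 3.0),  # dominated by B
        ("D", 0.9, 6.0),  # dominated by A
        ("E", 0.4, 1.0),
    ]
    front = pareto_front(summaries)
    print(front)
    print(select_under_budget(front, max_cost=3.0))
```

Because the front is computed once, changing `max_cost` only re-runs the selection, which mirrors the claim that the trade-off can be chosen without retraining.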

2605.07243 2026-05-22 cs.CL

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

Affiliations: Hong Kong University of Science and Technology; MetaX; Zhejiang Normal University; Soochow University

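AI summary: This work proposes SpecBlock, a block-iterative drafting method that combines path dependence with low-cost drafting. It improves LLM inference efficiency through dynamic tree drafting and path-dependence mechanisms, and it uses verifier feedback at deployment for cost-aware adaptation, outperforming existing methods in both speed and cost.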

English abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
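The valid-prefix mask is easy to misread, so a minimal sketch may help. It assumes one plausible reading of the abstract: a position contributes to the loss only if every earlier position in the block was predicted correctly, so the first wrong position is still trained on and everything after it is dropped. The function names and the per-position losses are hypothetical, not taken from the paper.

```python
def valid_prefix_mask(predicted, target):
    """Return one boolean per position: True if all earlier positions match.

    Assumed reading of the valid-prefix mask; not the authors' code.
    """
    mask = []
    prefix_ok = True
    for pred, gold in zip(predicted, target):
        mask.append(prefix_ok)
        prefix_ok = prefix_ok and pred == gold
    return mask


def masked_mean(losses, mask):
    """Average the losses at positions where the mask is True."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    predicted = [5, 7, 9, 2]  # drafter's tokens for one block of K = 4
    target = [5, 7, 1, 2]     # tokens the target model would produce
    mask = valid_prefix_mask(predicted, target)
    print(mask)  # [True, True, True, False]
    print(masked_mean([0.1, 0.2, 1.5, 0.3], mask))
```

Position 3 is masked even though its prediction matches the target, because at inference the drafter would never reach it along this path once position 2 has been rejected.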

2605.06597 2026-05-22 cs.CL cs.AI cs.LG

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar

Affiliations: Georgia Institute of Technology; University of California, Los Angeles; Carnegie Mellon University; William & Mary

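AI summary: This paper proposes the UniSD framework for systematically studying self-distillation. It integrates multiple mechanisms to improve supervision reliability, representation alignment, and training stability, validates the effectiveness of self-distillation across multiple benchmarks and models, and builds the best-performing UniSDfull pipeline.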

Comments Website: https://unifiedsd.github.io/ Code: https://github.com/Ahren09/UniSD

English abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
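Two of the listed mechanisms, EMA teacher stabilization and divergence clipping, have simple generic forms that can be sketched in a few lines. The code below shows those generic forms only and is not taken from the UniSD code base: the decay value, the KL direction, the clipping threshold, and the representation of parameters as plain lists of floats are all assumptions.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Move teacher parameters toward the student by exponential averaging.

    Parameters are dicts mapping a name to a list of floats (assumption).
    """
    return {
        name: [decay * t + (1.0 - decay) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def clipped_kl(p, q, max_value=1.0):
    """KL(p || q) for one token distribution, capped at max_value.

    Terms with p_i == 0 contribute nothing; q is assumed strictly positive.
    """
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return min(kl, max_value)


if __name__ == "__main__":
    teacher = {"w": [1.0, 0.0]}
    student = {"w": [0.0, 1.0]}
    print(ema_update(teacher, student, decay=0.9))

    # A token where the two distributions disagree sharply is capped,
    # so it cannot dominate the batch loss.
    print(clipped_kl([0.99, 0.01], [0.01, 0.99], max_value=1.0))
```

The slowly moving teacher smooths out step-to-step noise in the student, and the cap bounds the contribution of any single token; both address the training-stability concern named in the abstract.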

2605.05749 2026-05-22 cs.CV

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

Feifei Li, Qi Song, Chi Zhang, Rui Huang

Affiliations: The Chinese University of Hong Kong, Shenzhen; Tsinghua University

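AI summary: This paper proposes a ray-aware pointer memory for streaming 3D reconstruction. A unified memory representation models both spatial location and viewing direction, and an adaptive pointer update strategy retains informative pointers and discards redundant ones, improving long-term reconstruction stability and camera pose accuracy.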

English abstract

Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
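A retain-or-replace update driven by spatial distance and ray-direction discrepancy might look like the sketch below. Everything specific here is an assumption made for illustration: the thresholds, the `confidence` field used to decide which pointer is more informative, and the mapping of "close in space but seen from a very different direction" to a loop candidate. The abstract does not state these rules, and pose refinement is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class Pointer:
    position: tuple    # 3D position (x, y, z)
    ray: tuple         # unit viewing direction
    feature: tuple     # feature embedding
    confidence: float  # hypothetical informativeness score


def ray_angle(a, b):
    """Angle in radians between two unit direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


def classify(memory, obs, dist_thresh=0.1, angle_thresh=math.radians(30)):
    """Label an observation against its spatially nearest pointer."""
    if not memory:
        return "novel", None
    idx = min(range(len(memory)),
              key=lambda i: math.dist(memory[i].position, obs.position))
    if math.dist(memory[idx].position, obs.position) > dist_thresh:
        return "novel", None
    if ray_angle(memory[idx].ray, obs.ray) <= angle_thresh:
        return "redundant", idx
    return "loop_candidate", idx


def update(memory, obs):
    """Retain-or-replace: never average, keep one pointer per redundant pair."""
    label, idx = classify(memory, obs)
    if label == "redundant":
        if obs.confidence > memory[idx].confidence:
            memory[idx] = obs  # replace the less informative pointer
        # otherwise retain the stored pointer and drop the observation
    else:
        memory.append(obs)  # novel geometry or a new viewpoint
    return label


if __name__ == "__main__":
    memory = []
    print(update(memory, Pointer((0, 0, 0), (0, 0, 1), (), 0.5)))
    print(update(memory, Pointer((0.01, 0, 0), (0, 0, 1), (), 0.9)))
    print(update(memory, Pointer((0.02, 0, 0), (0, 0, -1), (), 0.7)))
    print(len(memory))
```

Memory grows only for novel or differently viewed observations, which is one way to obtain the bounded growth the abstract describes; a fuller system would trigger pose refinement when `update` returns `"loop_candidate"`.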