arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26648 2026-05-27 cs.RO

L-Learning : A Lyapunov-Based Approach Leveraging Lagrangian Mechanics for Efficient and Stable Robot Tracking

Quan Quan, Hao Li

Affiliations: School of Automation Science and Electrical Engineering, Beihang University

AI summary: Proposes the L-Learning framework, which combines Lyapunov stability theory with Lagrangian mechanics to learn the system's energy function from data, achieving efficient, stable, and sample-efficient robot trajectory tracking.

Comments 9 pages, 4 figures, 4 tables

Abstract

This paper presents L-Learning, a novel data-driven control framework for robotics that integrates Lyapunov stability theory with Lagrangian mechanics to enhance trajectory tracking performance. Traditional control methods often suffer from performance degradation in dynamic and uncertain environments, whereas data-driven approaches, though more adaptable, are frequently limited by high sample complexity and a lack of rigorous stability guarantees. L-Learning mitigates these challenges by explicitly learning the system's energy function from data, thereby optimizing performance while intrinsically ensuring closed-loop stability. Characterized by superior control accuracy, theoretical stability guarantees, and high sample efficiency, L-Learning represents a promising solution for practical robotic applications.

2605.26647 2026-05-27 cs.LG cs.AI stat.ML

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

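Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, Shu Zhong

Affiliations: Peking University

AI summary: Proposes token-adaptive Mixture of Activations (MoA), which mixes multiple activation functions through lightweight input-dependent gates, together with an input-independent counterpart, learnable activations (LA), and shows theoretically and empirically that MoA is more expressive and scales better than fixed-activation FFNs.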

Comments 31 pages

Abstract

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
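The token-adaptive mixing described in the abstract above can be illustrated with a minimal NumPy sketch. It assumes a softmax gate computed from a linear projection of each token and a three-entry activation dictionary (ReLU, GELU, SiLU); the gate form, the dictionary, and all names (`moa_ffn`, `w_gate`, `ACTIVATIONS`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relu(h):
    return np.maximum(h, 0.0)

def gelu(h):
    # tanh approximation of GELU
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def silu(h):
    return h / (1.0 + np.exp(-h))

ACTIVATIONS = (relu, gelu, silu)  # hypothetical activation dictionary

def fixed_ffn(x, w_in, w_out, act):
    """Baseline FFN with one fixed activation applied to every token."""
    return act(x @ w_in) @ w_out

def moa_ffn(x, w_in, w_out, w_gate):
    """FFN whose activation is a per-token mixture; the projections are shared."""
    h = x @ w_in                                            # (tokens, d_ff)
    gates = softmax(x @ w_gate)                             # (tokens, n_act), assumed gate form
    acts = np.stack([f(h) for f in ACTIVATIONS], axis=-1)   # (tokens, d_ff, n_act)
    mixed = (acts * gates[:, None, :]).sum(axis=-1)         # token-adaptive mixture
    return mixed @ w_out                                    # shared down-projection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_in = rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 8))
w_gate = rng.normal(size=(8, len(ACTIVATIONS)))
y = moa_ffn(x, w_in, w_out, w_gate)
```

With an all-zero `w_gate` the gates become uniform and the layer reduces to an input-independent average of the three activations, which is one way to see how an input-independent combination sits inside the token-adaptive family.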

2605.26646 2026-05-27 cs.AI cs.CL cs.MA

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

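Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

Affiliations: Renmin University of China; Xiaohongshu Inc.

AI summary: Proposes the UnityMAS-O framework, which treats the multi-agent workflow as the optimization unit and uses four core objects (logical roles, graph trajectories, user-defined rewards, and agent-model mappings) to decouple logical agents from physical model parameters, supporting flexible parameter sharing and reward assignment. On retrieval-augmented QA, iterative search, and reflective code generation, multi-agent RL improves on manually specified workflows.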

Abstract

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
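The three sharing modes named in the abstract above (full sharing, full separation, partial sharing) can be pictured as different mappings from logical roles to physical models. The sketch below is only an illustration of that idea: the role names, model names, and the helper `group_roles_by_model` are hypothetical and are not the UnityMAS-O API.

```python
from collections import defaultdict

def group_roles_by_model(agent_to_model):
    """Invert a role -> model mapping into model -> [roles that share its parameters]."""
    groups = defaultdict(list)
    for role, model in agent_to_model.items():
        groups[model].append(role)
    return dict(groups)

ROLES = ["planner", "searcher", "answerer"]  # hypothetical logical agent roles

# Full sharing: every role is backed by the same parameters.
full_sharing = {role: "model_a" for role in ROLES}

# Full separation: every role has its own parameters.
full_separation = {role: f"model_{role}" for role in ROLES}

# Partial sharing: some roles share parameters, others do not.
partial_sharing = {"planner": "model_a", "searcher": "model_b", "answerer": "model_b"}

for name, mapping in [("full sharing", full_sharing),
                      ("full separation", full_separation),
                      ("partial sharing", partial_sharing)]:
    print(name, group_roles_by_model(mapping))
```

Under this reading, each group of roles would correspond to one set of trainable parameters, so the number of groups is the number of models being updated.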

2605.26645 2026-05-27 cs.CL

Bounded Path Context: A Controlled Study of Visible Path History in LLM-Based Knowledge Graph Question Answering

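Xihang Shan, Ye Luo

Affiliations: School of Mathematical Sciences; Xiamen University; School of Informatics

AI summary: Proposes Bounded Path Context (BPC), which keeps the full path in symbolic memory and exposes only the most recent K hops to the relation-selection prompt. On the WebQSP and CWQ datasets with Qwen3.5-9B-AWQ, it matches or exceeds full-history prompting while using fewer input tokens.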

Comments 13 pages, 1 figure, submitted to EMNLP 2026

Abstract

LLM-based knowledge-graph question answering (KGQA) delegates graph traversal to language models, turning each question into a sequence of local relation-selection decisions repeated across beams and hops. A common but untested default is to serialize the complete partial path into every routing prompt, even though the controller already maintains this path as exact symbolic state. Bounded Path Context (BPC) decouples these two roles: the controller retains full paths in symbolic memory for answer extraction and audit, while the relation-selection prompt exposes only the question, the current entity, outgoing relation candidates, and at most the last K hops. A controlled sweep over K (fixing graph neighborhoods, beam budget, depth, decoding, and answer-extraction format) shows that bounded histories match or exceed full-history prompting on complete WebQSP and CWQ test sets with Qwen3.5-9B-AWQ: K=1 achieves 0.487 answer-set F1 on WebQSP versus 0.472 for full history, and K=0 reaches 0.287 on CWQ versus 0.274, with 9.7% and 12.1% fewer input tokens, respectively. At the 4B scale, K=1 remains the strongest setting on both benchmarks. Per-example analysis reveals that 71-84% of examples are unaffected by history length, while the affected cases expose when prior hops disambiguate versus distract. These results suggest that path serialization length is better treated as a tunable interface variable than as a default assumption in LLM-based graph controllers.
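The separation between the controller's full path and the bounded history shown to the model can be sketched as follows. This is a minimal illustration under assumed conventions: the path is a list of (head, relation, tail) triples, `k=None` stands for full history, and the prompt wording and function names are hypothetical rather than taken from the paper.

```python
def visible_history(path, k=None):
    """Return the hops the prompt may see; the controller keeps the full path."""
    if k is None:
        return list(path)      # full-history prompting
    if k <= 0:
        return []              # K=0: no prior hops are exposed
    return list(path[-k:])     # at most the last K hops

def build_relation_prompt(question, current_entity, candidates, path, k=None):
    """Assemble a relation-selection prompt with a bounded view of the path."""
    lines = [f"Question: {question}", f"Current entity: {current_entity}"]
    hops = visible_history(path, k)
    if hops:
        lines.append("Recent path:")
        lines.extend(f"  {head} -[{rel}]-> {tail}" for head, rel, tail in hops)
    lines.append("Candidate relations: " + ", ".join(candidates))
    return "\n".join(lines)

# Hypothetical two-hop path held in symbolic memory by the controller.
path = [("Q", "r1", "A"), ("A", "r2", "B")]
prompt_k1 = build_relation_prompt("example question", "B", ["r3", "r4"], path, k=1)
prompt_k0 = build_relation_prompt("example question", "B", ["r3", "r4"], path, k=0)
print(prompt_k1)
```

The explicit `k <= 0` branch matters in Python because `path[-0:]` returns the whole list, which would silently turn K=0 into full history.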

2605.26642 2026-05-27 cs.CV

Adaptation-Free Heterogeneous Collaborative Perception with Unseen Agent Configurations

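Hyunchul Bae, Heejin Ahn

Affiliations: School of Electrical Engineering; Korea Advanced Institute of Science and Technology (KAIST)

AI summary: Proposes the ALF framework, which lifts lightweight box-level messages into ego-compatible auxiliary features to enable zero-adaptation collaborative perception with agents of unseen configurations. In zero-shot evaluation on V2X-Real it improves relative mAP@0.7 by 35.91% over the strongest prior baseline while needing only about 9.6 Kbps of bandwidth.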

Comments 9 pages main paper, 23 pages including references and appendix, 7 figures

Abstract

Collaborative perception improves 3D object detection by enabling agents to share complementary observations, but most existing methods assume fixed or known collaborator encoder configurations, limiting deployment in practice. In this work, we consider an open-world setting in which auxiliary agents with unseen configurations may appear after deployment, such as different LiDAR beam counts or encoder architectures. To address this challenge, we propose ALF, a collaborative perception framework that enables zero-adaptation collaboration with unseen agent configurations by lifting lightweight box-level messages into ego-compatible auxiliary features. ALF converts auxiliary box-level messages into pseudo-BEV maps and synthesizes ego-compatible latent features by combining object-centric cues with scene context from the ego feature. On V2X-Real, under a zero-shot evaluation across 64 case studies, ALF outperforms the strongest prior baseline by 35.91% in relative mAP@0.7 while requiring only 120 bytes per agent per frame (approximately 9.6 Kbps bandwidth at 10 Hz).
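The bandwidth figure quoted above follows directly from the message size and frame rate. A small worked check, assuming 1 Kbps = 1000 bits per second (the helper name `bandwidth_kbps` is illustrative):

```python
def bandwidth_kbps(bytes_per_frame, frames_per_second):
    """Per-agent bandwidth in kilobits per second (1 Kbps = 1000 bit/s assumed)."""
    bits_per_second = bytes_per_frame * 8 * frames_per_second
    return bits_per_second / 1000.0

# 120 bytes per agent per frame at 10 Hz: 120 * 8 * 10 = 9600 bit/s
per_agent = bandwidth_kbps(120, 10)
print(per_agent)
```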

2605.26641 2026-05-27 cs.CV

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

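Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen

Affiliations: Memories.ai Research

AI summary: Proposes fusion-as-teacher distillation, which uses the fused embedding of each (text, video, audio) triple as a training signal for the single-modal embeddings, and builds the OmniRetriever-7B model, which outperforms existing methods on zero-shot retrieval benchmarks. Also releases the OmniRetriever-Bench benchmark.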

Comments https://yunzeliu.github.io/OmniRetriever/

Abstract

Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint (T,V,A) embedding whenever all three modalities are available, but standard pairwise InfoNCE objectives leave this signal unused during training. We close this gap with fusion-as-teacher distillation, which treats a stop-gradient copy of the fused embedding as a teacher signal for the single-modal embeddings, paired with a Tuple-InfoNCE term that supervises the fused embedding directly. We instantiate this objective as OmniRetriever-7B. Across six zero-shot retrieval benchmarks, OmniRetriever-7B surpasses the closed-source Gemini Embedding 2 by 13.3-18.0 R@1 on Clotho and SoundDescs, and reaches the contemporary zero-shot specialist band of open video-text encoders on MSR-VTT and MSVD. To stress-test joint representations, we further release OmniRetriever-Bench, a 12-direction AVT retrieval benchmark totaling 3782 triples; on it OmniRetriever-7B attains AVG-all 34.84, improving over Gemini Embedding 2 by 1.72 and over the best prior open-source AVT method by 8.03.
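One plausible reading of the fusion-as-teacher idea is sketched below in NumPy: each single-modal embedding is trained with an InfoNCE loss against the fused embedding, which is treated as a constant target. The loss form, the temperature, and the function names are assumptions for illustration; the paper's exact objective, including its Tuple-InfoNCE term, is not reproduced here. NumPy has no autograd, so the stop-gradient is represented only by copying the teacher.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE where row i of `queries` should match row i of `keys`."""
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def fusion_as_teacher_loss(text, video, audio, fused, temperature=0.07):
    """Average InfoNCE of each single-modal student against the fused teacher."""
    teacher = fused.copy()  # stands in for a stop-gradient copy: a fixed target
    students = (text, video, audio)
    return sum(info_nce(s, teacher, temperature) for s in students) / len(students)

rng = np.random.default_rng(0)
fused = rng.normal(size=(8, 16))
# Students identical to the teacher give a much smaller loss than unrelated ones.
aligned_loss = fusion_as_teacher_loss(fused, fused, fused, fused)
unrelated_loss = fusion_as_teacher_loss(rng.normal(size=(8, 16)),
                                        rng.normal(size=(8, 16)),
                                        rng.normal(size=(8, 16)),
                                        fused)
```

In an autograd framework the copy would be replaced by that framework's stop-gradient operation, so that the distillation term updates only the single-modal branches.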

2605.26638 2026-05-27 cs.RO

HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation

Junyi Dong, Haotian Luo, Ziwei Xu, Shengwei Bian, Heng Zhang, Sitong Mao, Jingyi Guo, Yang Xu, Wenhao Chen, Qiuyu Feng, Yao Mu, Ping Luo, Shunbo Zhou, Xiaodong Wu

Affiliations: CloudRobo Lab, Huawei Cloud Computing Technologies Co., Ltd.; Shanghai Jiao Tong University; The University of Hong Kong

AI summary: Proposes the HyperSim framework, which systematically narrows the sim-to-real domain gap through three pillars (high-fidelity environment synthesis, adversarial trajectory generation, and sim-and-real co-training). Across 400 real-world task executions, it achieves sim-to-real success rates of 80% with ACT and 95% with π0.

Comments 9 pages, 8 figures

Abstract

Scaling data volume and diversity is critical for generalizing embodied intelligence. While synthetic data generation offers a scalable alternative to expensive physical data acquisition, transferring robotic manipulation policies from simulation to the real world (sim-to-real) remains a formidable challenge due to the domain gap. This paper presents HyperSim, a holistic framework spanning from synthetic data generation to policy training and seamless real-world deployment. To systematically bridge the sim-to-real gap, HyperSim is realized through three core pillars: high-fidelity environment synthesis, adversarial trajectory generation, and sim-and-real co-training. Collectively, these modules address domain discrepancies by enhancing visual fidelity, expanding data coverage, and enforcing domain-invariant representations. We rigorously validate HyperSim through a large-scale empirical study involving 400 real-world task executions across two representative manipulation models. Assessed across three fine-grained metrics, our complete pipeline achieves remarkable sim-to-real success rates of 80% and 95% with ACT and $\pi_0$, respectively. Furthermore, policies trained on our adversarial trajectories exhibit significantly enhanced robustness against dynamic uncertainties, achieving a 35% higher completion rate under physical perturbations.

2605.26637 2026-05-27 cs.RO

Enabling Extensible Embodied Capabilities with Tools

Xueyang Zhou, Zijia Wang, Qianjiang Li, Yibo Hu, Guiyao Tie, Li Wan, Yidan Liu, Pan Zhou, Lichao Sun, Yongchao Chen

Affiliations: Huazhong University of Science and Technology; Hebei University of Technology; Tianjin University; Lehigh University; College of AI, Tsinghua University

AI summary: Proposes externalizing capabilities as tools that are invoked dynamically through a standardized protocol, ETP. On simulation and real-world platforms this improves embodied performance by an average of 31% on EB-ALFRED and 36% on EB-Navigation, but the gains are substantial for cognition and perception and limited for execution.

Comments 51 pages, 20 figures

Abstract

Most existing embodied intelligence methods formulate perception, reasoning, planning, and control within a unified parameterized policy. Yet these capabilities are inherently hierarchical and heterogeneous, making them difficult to reliably learn and modularize within a single model. We propose a capability externalization approach that decouples heterogeneous capabilities into independently optimized tools, dynamically invoked at inference time. To this end, we introduce Embodied Tool Protocol (ETP), a standardized protocol for embodied tool registration, discovery, invocation, and execution, and curate 100+ validated tools spanning perception, cognition, reasoning, and execution as the tool base. Building on this, we construct EmbodiedToolBench to evaluate both whether tool augmentation improves embodied performance and how well current models use tools across tool-necessity recognition, tool selection, tool execution, and tool-chain composition. Experiments across simulation and real-world platforms confirm that capability externalization consistently improves embodied performance (avg. gain 31% on EB-ALFRED and 36% on EB-Navigation), yet reveal a clear boundary: gains are substantial for cognition and perception but are limited for execution-type capabilities. Moreover, our analysis reveals that knowing when, which, and how to invoke tools remains a persistent challenge across all models, thereby highlighting embodied tool competence as a critical direction for future research.

2605.26636 2026-05-27 cs.CV cs.AI

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai

Affiliations: MIT; University of Pennsylvania; NVIDIA; Physical Intelligence

AI summary: Proposes JetViT, a family of hybrid-architecture Vision Transformers. Post-Training Attention Search converts pre-trained full-attention ViTs into efficient hybrid-attention variants, achieving higher inference efficiency on high-resolution images without loss of accuracy.

Comments Accepted to CVPR 2026 Findings

Abstract

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.

2605.26630 2026-05-27 cs.CV

Attenuation-Resilient Alternating Optimization for Laparoscopic Liver Landmark Detection

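Lanqing Liu, Ruize Cui, Jialun Pei, Diandian Guo, Tiffany Y. So, Pheng-Ann Heng, Jing Qin

Affiliations: The Hong Kong Polytechnic University, Hong Kong, China; The Chinese University of Hong Kong, Hong Kong, China

AI summary: Proposes A2ONet, which addresses illumination attenuation and structural mismatch in laparoscopic liver landmark detection through illumination field compensation, frequency-orientation selective filtering, and an alternating seg-curve optimization decoder.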

Comments This paper has been accepted by MICCAI 2026

Abstract

Liver surface landmark detection is a fundamental prerequisite for anatomical guidance in laparoscopic liver surgery. However, it remains unreliable in practice due to two pervasive challenges: illumination attenuation in underexposed regions and the structural mismatch between pixel-wise localization and continuous curvilinear geometry. To address these limitations, we propose A2ONet, an attenuation-resilient alternating optimization network for robust liver landmark detection. To mitigate illumination attenuation, A2ONet incorporates an illumination field compensation (IFC) block that adaptively enhances dark regions while preserving structural consistency. Meanwhile, we introduce a lightweight frequency-orientation selective filter (FOSF) to suppress repetitive texture interference and preserve salient curvilinear cues. Building upon these resilient representations, we design an alternating seg-curve optimization (ASCO) decoder that iteratively couples dense segmentation with explicit curve modeling, enabling mutual guidance to optimize both structural continuity and endpoint localization. Extensive evaluations on L3D-2K, L3D, and P2ILF demonstrate consistent improvements over competitive methods, establishing a more reliable foundation for intraoperative anatomy guidance. Our code will be available at https://github.com/hyperiondk115/A2ONet.

2605.26629 2026-05-27 cs.CV

DelowlightSplat: Feed-Forward Gaussian Splatting for Lowlight 3D Scene Reconstruction

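Fuzhen Jiang, Zengtian Xie, Zhuoran Li

Affiliations: Hangzhou Dianzi University; Zhuhai College of Science

AI summary: Proposes DelowlightSplat, a lowlight-aware feed-forward Gaussian splatting framework that uses a lightweight Lowlight Adapter and cost-volume-based multi-view inference to predict clean 3D Gaussians directly from sparse, noisy images, enabling high-quality novel-view synthesis.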

Abstract

Novel-view synthesis and 3D reconstruction from sparse posed images are central to robotics and AR/VR. Yet, feed-forward 3D Gaussian reconstruction fails under lowlight due to noise, color shifts, and unreliable correspondence. We propose DelowlightSplat, a lowlight-aware feed-forward Gaussian splatting framework for clean novel-view rendering. We build a controllable multi-view lowlight benchmark by degrading only context views while keeping target views clean. We introduce a lightweight Lowlight Adapter for residual enhancement to improve matchability, and couple it with cost-volume-based multi-view inference to directly predict clean 3D Gaussians. Experiments show that DelowlightSplat significantly outperforms previous feed-forward methods and two-stage pipelines under lowlight conditions.

2605.26628 2026-05-27 cs.AI

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Zhanfeng Feng, Shuai Guo, Xin Di, Long Peng, Yang Cao, Zhengjun Zha

Affiliations: University of Science and Technology of China

AI summary: Proposes Tail-Aware HiFloat4, which applies W4A4 post-training quantization to Wan2.2 under the HiFloat4 numerical format, using activation-tail-aware percentile calibration and compact PTQ-state restoration to reduce the influence of rare calibration outliers.

Abstract

This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerical format. We quantize the main linear layers in both Wan2.2 transformer modules with W4A4 HiFloat4 fake quantization, keep numerically sensitive boundary modules in high precision, and introduce an activation-tail-aware percentile calibration module for channel-mask construction. Together with compact PTQ-state restoration, this design reduces the influence of rare calibration outliers while keeping the runtime HiFloat4 arithmetic and sampling pipeline unchanged.
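Why a percentile statistic is less sensitive to rare calibration outliers than a maximum can be shown with a generic sketch. The rule below (flag a channel when its 99th-percentile magnitude exceeds four times the median across channels) is an illustrative assumption, not the authors' calibration module, and it does not model the HiFloat4 format itself.

```python
import numpy as np

def channel_outlier_mask(acts, percentile=99.0, ratio=4.0):
    """acts: (samples, channels). True marks channels with a consistently heavy tail."""
    tail = np.percentile(np.abs(acts), percentile, axis=0)   # per-channel tail magnitude
    return tail > ratio * np.median(tail)

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 8))
acts[:, 3] *= 20.0    # channel 3: heavy tail in every calibration sample
acts[0, 5] = 500.0    # channel 5: one rare outlier in a single sample

tail_mask = channel_outlier_mask(acts)

# For comparison, the same rule driven by the per-channel maximum.
max_stat = np.abs(acts).max(axis=0)
max_mask = max_stat > 4.0 * np.median(max_stat)

print("percentile-based:", tail_mask.tolist())
print("max-based:       ", max_mask.tolist())
```

In this toy setup the percentile-based mask selects only channel 3, while the max-based mask also selects channel 5 because of its single extreme value.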

2605.26621 2026-05-27 cs.CV cs.AI

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu, Zihua Wang

Affiliations: School of Biological Science and Medical Engineering, Beihang University, Beijing, China; Center for Information and Computer Science, School of Science for Open and Environmental Systems, Graduate School of Science and Technology, Keio University, Kanagawa, Japan; Bytedance Inc., China; Tsinghua University, Beijing, China

AI summary: Proposes the MedVol-R1 framework, which uses reinforcement learning to ground clinical reasoning in verifiable 2D evidence anchors and then propagates them into 3D masks for volumetric reasoning segmentation, achieving state-of-the-art performance on several benchmarks.

Abstract

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.

2605.26620 2026-05-27 cs.CL cs.HC

Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering

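Lukas Ellinger, Alexander Fichtl, Miriam Anschütz, Georg Groh

Affiliations: School for Computation, Information and Technology

AI summary: Proposes Granuscore, a reference-free granularity measure that exploits the structure of a hierarchical embedding space to reliably recover granularity orderings and explain variation in sentence specificity, and applies it to question-answering benchmarks to analyze differences in model behavior.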

Abstract

Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.

2605.26619 2026-05-27 cs.LG

PIDM-DP: Physics-Informed Diffusion with Dormand-Prince Integration for Chaotic System Identification and State Reconstruction across Multiple Dynamical Regimes

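Shailendra Dabral

Affiliations: Indian Institute of Technology Indore

AI summary: Proposes the PIDM-DP model, which embeds a 5th-order Dormand-Prince ODE integrator in the reverse sampling of a diffusion model and back-propagates physics residuals to constrain trajectories to the governing equations. It reconstructs chaotic-system states from sparse, noisy observations and substantially outperforms unconstrained diffusion and the Ensemble Kalman Filter.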

Comments extended work of my journal paper submission

Abstract

Reconstructing continuous state trajectories of chaotic dynamical systems from sparse, noisy observations remains a fundamental open problem in nonlinear science. We introduce the Physics-Informed Diffusion Model with Dormand-Prince Integration (PIDM-DP), which embeds a fully differentiable 5th-order Dormand-Prince (DP-RK45) ODE integrator directly into the reverse sampling loop of a Denoising Diffusion Probabilistic Model (DDPM). At each denoising step, physics residuals are back-propagated via automatic differentiation, constraining every generated trajectory to satisfy the system's governing equations to 5th-order accuracy. A linear-scheduled guidance mechanism that ramps the physics weight from zero at high noise levels to its full value near the clean-data limit prevents the gradient explosions that cause naive physics-informed approaches to fail on stiff systems with Jacobian eigenvalues of order $O(10^3)$. Evaluated across five benchmark systems of increasing complexity (3D Lorenz, 3D Rössler, 5D Hyperchaotic, 20D Lorenz-96, and the stiff 3D Rabinovich-Fabrikant) at 10% observation density with additive Gaussian noise ($\sigma=0.05$), PIDM-DP achieves reconstruction RMSE improvements of up to $15.4\times$ over an unconstrained diffusion baseline and decisively outperforms the Ensemble Kalman Filter on stiff systems where ensemble covariance collapses. On the Rabinovich-Fabrikant out-of-distribution benchmark, PIDM-DP attains RMSE $0.1097 \pm 0.0269$ versus $0.9443 \pm 0.5288$ (unconstrained diffusion, $8.6\times$ worse) and $0.3561 \pm 0.3040$ (EnKF, $3.2\times$ worse), with $p<0.001$ in paired Wilcoxon tests ($N = 30$). Topological validation via the Rosenstein Lyapunov estimator confirms that PIDM-DP preserves the chaotic invariant measure.
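The linear guidance ramp and the quoted RMSE ratios can be checked with a short sketch. The indexing convention is an assumption: reverse sampling is taken to run from step `num_steps - 1` (highest noise) down to step 0 (clean-data limit), and `physics_weight` and `w_max` are illustrative names, not the paper's.

```python
def physics_weight(step, num_steps, w_max=1.0):
    """Linear ramp: 0 at the noisiest step, w_max at step 0 (assumed convention)."""
    if num_steps < 2:
        return w_max
    return w_max * (1.0 - step / (num_steps - 1))

# Weights seen while sampling from high noise to the clean-data limit.
schedule = [physics_weight(t, 50) for t in range(49, -1, -1)]

# Ratios implied by the reported Rabinovich-Fabrikant RMSE means.
pidm_dp, unconstrained, enkf = 0.1097, 0.9443, 0.3561
ratio_diffusion = round(unconstrained / pidm_dp, 1)   # 8.6
ratio_enkf = round(enkf / pidm_dp, 1)                 # 3.2
print(schedule[0], schedule[-1], ratio_diffusion, ratio_enkf)
```

The two ratios reproduce the $8.6\times$ and $3.2\times$ figures stated in the abstract.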

2605.26616 2026-05-27 cs.CV

Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction

高斯-体素二重奏:用于快速准确单目表面重建的双支架混合表示

Zhenhua Du, Zhen Tan, Haoyu Zhang, Dewen Hu, Shuaifeng Zhi, Peidong Liu

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学) National University of Defense Technology(国防科技大学)

AI总结 提出一种混合高斯-体素表示,通过将锚定高斯约束在体素化SDF定义的表面窄带内,并引入隐式表面约束损失,在保持快速训练和实时渲染的同时,实现了高质量表面重建和新视图合成。

Comments 27 pages, 14 figures

详情
AI中文摘要

尽管3D高斯泼溅在逼真新视图合成方面取得了显著成功,但其追求快速高保真3D重建一直受限于几何精度与优化效率之间的权衡。专攻图像渲染的方法收敛快,但代价是由于多余基元过拟合训练视图导致几何不完美;而集成神经有符号距离场(SDF)以改善几何的方法则带来了高昂的训练成本。在本文中,我们尝试通过将支架锚定高斯与联合优化的稀疏体素支架绑定来达成更好的权衡。这种混合高斯-体素表示明确地将锚定高斯限制在体素化SDF定义的表面周围的窄带内,有效提高了表示效率并凝聚了浮动高斯,同时不牺牲几何质量。隐式表面约束损失进一步以相互正则化的方式将单个高斯基元拉近至SDF诱导的表面,从而提高重建精度。在来自ScanNet++、ScanNetv2和DeepBlending数据集的各种真实室内场景上的大量实验表明,我们的方法在保持快速训练收敛和实时渲染的同时,实现了最先进的表面重建质量以及优于领先基线的新视图合成。代码将在https://github.com/duzh11/VoxelGS提供。

英文摘要

While 3D Gaussian Splatting has achieved remarkable success in photorealistic novel view synthesis, its pursuit of fast and high-fidelity 3D reconstruction has long been constrained by a trade-off between geometric accuracy and optimization efficiency. Methods specialized in image rendering converge quickly at the cost of imperfect geometry caused by superfluous primitives overfitting training views, while methods integrating neural signed distance fields (SDFs) for better geometry incur prohibitive training costs. In this paper, we attempt to strike a better trade-off by tethering scaffold-anchored Gaussians to a jointly optimized sparse voxel scaffold. This hybrid Gaussian-Voxel representation explicitly confines anchored Gaussians to a narrow band around surfaces defined by voxelized SDFs, which effectively improves representation efficiency and condenses floating Gaussians without sacrificing geometry quality. An implicit surface tethering loss further pulls individual Gaussian primitives closer to SDF-induced surfaces in a mutually regularized manner for improved reconstruction accuracy. Extensive experiments on diverse real-world indoor scenes from the ScanNet++, ScanNetv2, and DeepBlending datasets demonstrate that our method achieves state-of-the-art surface reconstruction quality as well as superior novel view synthesis against leading baselines, while maintaining fast training convergence and real-time rendering. Code will be available at https://github.com/duzh11/VoxelGS.

2605.26615 2026-05-27 cs.AI

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

FAST-GOAL: 快速高效的全局-局部对象对齐学习

Hyungyu Choi, Young Kyun Jang, Chanho Eom

发表机构 * Department of Virtual Convergence, Graduate School of Advanced Imaging Science, Multimedia & Films (GSAIM), Chung-Ang University(中央大学高级影像科学、多媒体与电影研究生院(GSAIM)虚拟融合系)

AI总结 提出FAST-GOAL微调方法,通过全局-局部语义对齐增强CLIP处理长文本的能力,包括快速局部图像-句子匹配和基于token相似性的学习,并在GLIT100k数据集上训练,在长/短描述数据集上均取得显著提升。

Comments 21 pages, 8 figures, IEEE/TIP 2026 accepted

详情
AI中文摘要

视觉-语言模型如CLIP在图像和文本对齐方面表现出色,但由于在简短标题上预训练,它们通常难以处理冗长详细的文本描述。我们提出FAST-GOAL(快速高效的全局-局部对象对齐学习),一种高效的微调方法,通过全局-局部语义对齐增强CLIP处理长文本的能力。我们的方法包含两个关键组件。首先,快速局部图像-句子匹配(FLISM)通过目标检测和空间划分高效提取局部图像区域,然后将其与对应句子匹配。其次,基于token相似性的学习(TSL)最大化图像中特定区域的patch token与其对应区域嵌入之间的相似性,并将相同原理应用于文本,从而增强模型捕获细节对应关系的能力。此外,我们引入了GLIT100k数据集,该数据集提供全局图像-长描述对和上下文派生的局部对,其中局部描述从全局描述中提取以保持语义连贯性。通过在长描述数据集(DOCCI, DCI)和短描述数据集(MSCOCO, Flickr30k)上的大量实验,我们证明FAST-GOAL相比基线取得了显著改进,使CLIP能够有效适应详细文本描述,同时保持计算效率。

英文摘要

Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because they are pre-trained on short, concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances the ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, and applies the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global pairs of images and lengthy captions and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long-caption datasets (DOCCI, DCI) and short-caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.

2605.26612 2026-05-27 cs.CL

LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation

LATTE: 预测同伴锚定偏好轨迹以实现个性化LLM生成

Jinze Li, Xiaoyan Yang, Shuo Yang, Jinfeng Xu, Yue Shen, Jian Wang, Jinjie Gu, Edith Cheuk-Han Ngai

发表机构 * The University of Hong Kong(香港大学) Ant Healthcare, Ant Group(蚂蚁集团医疗科技部)

AI总结 提出LATTE框架,通过预测同伴锚定的相对偏好状态来个性化冻结的大语言模型,在Amazon Reviews 2023和MemoryCD上优于检索、摘要记忆和静态潜在轮廓等方法。

Comments Under review

详情
AI中文摘要

使用冻结的大语言模型进行个性化生成需要既紧凑又即时的条件信号。现有的个性化方法通常检索或总结用户历史文本,或将其压缩为静态潜在轮廓和软提示。这些方法高效,但将用户过去行为视为聚合轮廓,因此将稳定身份、近期漂移和物品内容混合在同一表示中。我们提出潜在轨迹跟踪与外推(LATTE),一个将个性化表示为预测同伴锚定相对偏好状态的框架。对于每个历史会话,LATTE减去由对同一物品做出反应的可比用户形成的时间掩蔽基线,产生一个衡量目标用户在共享物品背景下与同伴差异的状态。然后,一个轻量级序列预测器预测该轨迹中的下一个状态,状态到令牌桥通过单个锚定软令牌将预测注入冻结的指令调优LLM。我们提供潜在因子分析,展示同伴锚定何时抵消共享物品变化,以及为什么时间预测在陈旧平均值和噪声近期状态之间权衡。在Amazon Reviews 2023和MemoryCD上的实验表明,LATTE始终优于检索、摘要记忆、静态潜在轮廓、差异感知潜在轮廓和软提示压缩基线。在Amazon Reviews 2023上,LATTE将平均ROUGE-L从静态潜在轮廓的0.219和最强附加潜在压缩基线的0.245提高到0.259。额外的成对比较和诊断分析表明,改进主要归因于预测用户特定轨迹信息,而不仅仅是添加软提示接口。

英文摘要

Personalized generation with frozen large language models requires a conditioning signal that is both compact and current. Existing personalization methods typically retrieve or summarize user histories in text, or compress them into static latent profiles and soft prompts. These approaches are efficient, but they treat a user's past behavior as an aggregate profile and therefore mix stable identity, recent drift, and item content in the same representation. We propose LAtent Trajectory Tracking and Extrapolation (LATTE), a framework that represents personalization as forecasting a peer anchored relative preference state. For each historical session, LATTE subtracts a time masked baseline formed from comparable users who responded to the same item, producing a state that measures how the target user differs from peers under a shared item context. A lightweight sequence predictor then forecasts the next state in this trajectory, and a State to Token Bridge injects the forecast into a frozen instruction tuned LLM through a single anchored soft token. We provide a latent factor analysis showing when peer anchoring cancels shared item variation and why temporal forecasting trades off stale averages against noisy recent states. Experiments on Amazon Reviews 2023 and MemoryCD show that LATTE consistently outperforms retrieval, summary memory, static latent profiles, difference aware latent profiles, and soft prompt compression baselines. On Amazon Reviews 2023, LATTE improves average ROUGE-L from 0.219 for a static latent profile and 0.245 for the strongest added latent compression baseline to 0.259. Additional pairwise comparisons and diagnostic analyses suggest that the improvement is mainly due to forecasting user-specific trajectory information, rather than merely adding a soft prompt interface.
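The peer-anchoring step described in the abstract (subtracting a time masked peer baseline from the target user's session representation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `peer_anchored_state`, the use of plain embedding vectors, the mean as the baseline, the reading of "time masked" as "only peers who responded before the session time", and the fallback when no such peer exists are all hypothetical.

```python
from typing import List, Sequence


def peer_anchored_state(
    user_vec: Sequence[float],
    peer_vecs: Sequence[Sequence[float]],
    peer_times: Sequence[float],
    session_time: float,
) -> List[float]:
    """Return the user's state relative to peers who responded to the same item.

    Assumption: the time mask keeps only peer responses strictly earlier than
    the target session, so no future information enters the baseline.
    """
    earlier = [vec for vec, t in zip(peer_vecs, peer_times) if t < session_time]
    if not earlier:
        # Assumption: with no earlier peers, fall back to the raw user vector.
        return list(user_vec)
    dim = len(user_vec)
    baseline = [sum(vec[i] for vec in earlier) / len(earlier) for i in range(dim)]
    return [u - b for u, b in zip(user_vec, baseline)]


if __name__ == "__main__":
    state = peer_anchored_state(
        [1.0, 2.0],
        [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]],
        [1, 2, 5],
        session_time=3,
    )
    print(state)
```

Because every peer in the baseline reacted to the same item, whatever part of the embedding reflects the item itself is shared by user and baseline and largely cancels in the difference; a sequence of such states is what the lightweight predictor would then extrapolate.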

2605.26606 2026-05-27 cs.LG cs.AI

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

将你的展开用在关键处:基于组强化学习后训练的展开分配

Woojeong Kim, Ziyi Yang, Jing Nathan Yan, Jialu Liu

发表机构 * Cornell University(康奈尔大学)

AI总结 提出 Pilot-Commit 框架,通过预算感知的展开分配策略,优先将计算资源分配给高信息量的提示,从而在组策略优化中减少采样成本并加速收敛。

详情
AI中文摘要

强化学习(RL)是后训练大型语言模型的主要范式。然而,在在线、在策略设置中,展开生成主导了训练的计算成本。基于组的策略优化方法对每个提示计算多个展开的优势,但它们不加区分地将预算分配给奖励分布崩溃的提示,将昂贵的展开浪费在可忽略的学习信号上。我们证明,基于组的更新在高奖励方差区域最为有效。由于策略在整个训练过程中演变,提示的信息量必须在线估计而非预先计算,但穷举评估每个提示在计算上不可行。我们引入了 Pilot-Commit,一个用于基于组 RL 后训练的预算感知展开分配框架。Pilot-Commit 将提示评估与利用解耦:一个试点阶段使用预算的一部分估计每个提示的信息量,然后将剩余的展开分配给高杠杆提示,同时跳过低信号提示。在多个数学推理基准和从 1.5B 到 14B 参数的模型规模上,Pilot-Commit 以显著更低的采样成本匹配基线准确率,在累积展开中达到目标准确率的速度比 GRPO 快高达 $1.9\times$,比 DAPO 快高达 $4.0\times$。

英文摘要

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to $1.9\times$ faster than GRPO and $4.0\times$ faster than DAPO in cumulative rollouts.
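The two-stage idea in the abstract (a pilot stage that estimates per-prompt informativeness, then a commit stage that spends the remaining rollouts on informative prompts and skips the rest) can be sketched in Python. Everything specific here is an assumption for illustration: the function name `commit_allocation`, the use of population reward variance as the informativeness score, the `min_variance` threshold, and the even split of the remaining budget are not taken from the paper.

```python
from statistics import pvariance
from typing import Dict, List


def commit_allocation(
    pilot_rewards: Dict[str, List[float]],
    total_budget: int,
    min_variance: float = 1e-8,
) -> Dict[str, int]:
    """Assign the post-pilot rollout budget to prompts with non-collapsed rewards."""
    pilot_cost = sum(len(rewards) for rewards in pilot_rewards.values())
    remaining = total_budget - pilot_cost
    if remaining < 0:
        raise ValueError("pilot stage already exceeds the total budget")

    # Informativeness proxy (assumption): variance of the pilot rewards.
    variances = {p: pvariance(r) for p, r in pilot_rewards.items()}
    ranked = sorted(
        (p for p, v in variances.items() if v > min_variance),
        key=lambda p: variances[p],
        reverse=True,
    )

    allocation = {p: 0 for p in pilot_rewards}  # low-signal prompts stay at 0
    if not ranked:
        return allocation
    base, extra = divmod(remaining, len(ranked))
    for rank, prompt in enumerate(ranked):
        # Leftover rollouts go to the highest-variance prompts first.
        allocation[prompt] = base + (1 if rank < extra else 0)
    return allocation


if __name__ == "__main__":
    pilot = {
        "always_solved": [1.0, 1.0, 1.0, 1.0],
        "coin_flip": [0.0, 1.0, 0.0, 1.0],
        "rarely_solved": [0.0, 0.0, 0.0, 1.0],
    }
    print(commit_allocation(pilot, total_budget=19))
```

A prompt whose pilot rewards are all identical yields zero group-relative advantage, so further rollouts on it carry no learning signal; that is the case the threshold filters out.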

2605.26600 2026-05-27 cs.LG cs.AI

Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition

几何感知对比学习用于少样本自动调制识别

Guanqun Zhao, Yitong Liu, Jiaxuan Fang, Yufei Mao, Hongwen Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出动态一致性对比学习框架,通过虚拟对抗增强和语义一致性损失解决自监督学习中的各向同性增强、频谱不稳定和语义漂移问题,在少样本设置下提升自动调制识别准确率。

详情
AI中文摘要

标准的自动调制识别自监督学习面临无效的各向同性增强、频谱不稳定性和语义漂移等挑战。为解决这些问题,我们提出了动态一致性对比学习,一种几何感知框架,将虚拟对抗增强与语义一致性损失相结合。我们提供的理论分析表明,该策略作为编码器的隐式频谱正则化器,能够实现稳定的流形探索。此外,我们的信号自适应Swin骨干网络采用固定窗口注意力,通过限制注意力局部性提高了结构稳定性,而混合知识融合模块则利用物理先验锚定表示。在RML基准上的实验表明,DyCo-CL在1-shot设置下相比先前方法获得了6.27%的准确率提升。

英文摘要

Standard Self-Supervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spectral instability, and semantic drift. To address these challenges, we propose Dynamic-Consistency Contrastive Learning (DyCo-CL), a geometry-aware framework that couples Virtual Adversarial Augmentation (VAA) with a semantic consistency loss. We provide a theoretical analysis indicating that this strategy acts as an implicit spectral regularizer for the encoder, enabling stable manifold exploration. Complementing this, our Signal-Adaptive Swin Backbone with fixed-window attention improves structural stability by constraining attention locality, while a Hybrid Knowledge Fusion module anchors representations with physical priors. Experiments on RML benchmarks show that DyCo-CL achieves a 6.27% accuracy gain in 1-shot settings over prior methods.

2605.26596 2026-05-27 cs.AI

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

AGORA: 基于适配器接地观察-动作保留的LLM智能体无推理提示压缩

Haoran Zhang, Zhaohua Sun

发表机构 * AI Agent Technologies (Hong Kong) Limited(人工智能代理技术(香港)有限公司) Department of Mechanical Engineering, The University of Hong Kong(香港大学机械工程系)

AI总结 针对LLM智能体,提出AGORA无推理步骤级压缩器,通过结构提示解析器、格式关键内容保留和125M参数相关性评分器,在9个测试单元中8个保持≥75%的无压缩性能。

Comments 10 pages, 2 figures. Code and data: https://github.com/ranranrannervous/agoracompression

详情
AI中文摘要

广泛用于通用LM上下文的token级抽取式压缩器在结构上不适合LLM智能体:在跨越两个独立token级方法家族的17个(环境、骨干、方法)单元中,尽管实现了1.3-13.3倍的压缩,每个单元的均值奖励≤0.05。我们将这种失败模式命名为动作语法破坏——携带动作语义的token(标识符、括号、动作动词)正是那些自信息排名最低的token,因此通用压缩器可靠地移除它们,环境拒绝剩余部分。诊断指向步骤粒度压缩。我们引入AGORA,一种无推理的步骤级压缩器,结合结构提示解析器、格式和时效关键内容的始终保留底线,以及一个在反事实下一步动作变化标签上训练的125M参数相关性评分器(约2ms/步,零每步LLM开销)。在比较的无推理和基于LLM的方法中,AGORA是唯一在9个单元中的8个中保持≥75%无压缩性能的方法(唯一的例外为73%);四路组件消融将结构底线隔离为主要的性能杠杆,而学习到的评分器是单一固定保留比率下实现1.0-11.5倍自适应端到端压缩的来源。

英文摘要

The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward <= 0.05 despite 1.3-13.3x realized compression. We name and characterize this failure mode as action-grammar destruction -- the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those self-information ranks lowest, so a general-purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step-granularity compression. We introduce AGORA, an inference-free step-level compressor combining a structural prompt parser, an always-keep floor for format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels (~2ms/step, zero per-step LLM toll). Across the compared inference-free and LLM-based methods, AGORA is the only one retaining >= 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.

2605.26589 2026-05-27 cs.LG cs.AI stat.ML

Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift

分布漂移下儿童贫血预测的表格机器学习与基础模型的少样本跨国家泛化

Yusuf Brima, Marcellin Atemkeng, Lansana Hassim Kallon, David Niyukuri, Antoine Vacavant, Samuel Saidu, Ding-Geng Chen

发表机构 * Department of Mathematics, Rhodes University, South Africa(南非罗德斯大学数学系) National Institute for Theoretical and Computational Sciences (NITheCS), Stellenbosch, 7600, South Africa(南非国家理论与计算科学研究所(NITheCS),斯泰伦博斯,7600) Interdisciplinary Research Program in Public Health, University of Burundi, Burundi(布隆迪大学公共卫生跨学科研究项目,布隆迪) Universite Clermont Auvergne, Clermont Auvergne INP, CNRS, Institut Pascal, Clermont–Ferrand, France(克莱蒙奥弗涅大学,克莱蒙奥弗涅INP,CNRS,帕斯卡研究所,克莱蒙费朗,法国) Department of International Public Health, Liverpool School of Tropical Medicine, Liverpool, UK(利物浦热带医学院国际公共卫生系,利物浦,英国) College of Health Solutions, Arizona State University, Phoenix, USA(亚利桑那州立大学健康解决方案学院,凤凰城,美国) Department of Statistics, University of Pretoria, Pretoria, South Africa(比勒陀利亚大学统计系,比勒陀利亚,南非)

AI总结 本研究评估了基于Transformer的表格基础模型TabPFN在跨国家、数据稀缺环境下预测儿童贫血的性能,发现其优于经典监督方法,尤其在低数据场景下表现出更好的区分度和校准能力。

详情
AI中文摘要

儿童贫血影响全球约40%的6-59个月儿童,且由异质性因素引起,限制了模型的泛化能力。我们在跨国家和数据稀缺环境下,评估了基于Transformer的表格基础模型与经典监督方法。我们使用了来自非洲、亚洲、拉丁美洲、高加索和中东16个国家的DHS数据(n=68,856)。比较了逻辑回归、XGBoost、LightGBM和TabPFN v2.6。性能通过AUC-ROC、Brier评分和ECE评估。泛化性通过留一国家法(LOCO)、反向LOCO和少样本设置评估。亚组分析包括性别、年龄、居住地、母亲教育和财富。特征重要性通过SHAP估计。TabPFN在低数据场景(<200样本)中优于经典模型,显示出更高的区分度和更好的校准。在各国中,它实现了最低的Brier评分(0.042)和ECE(0.203)。在全数据设置下,AUC-ROC范围为0.59-0.76,模型间差异较小(≤0.05)。LOCO性能稳定(0.58-0.69),受国家背景驱动。反向LOCO显示出不对称的可转移性。亚组性能一致,无系统性人口统计偏差。SHAP识别出儿童年龄、海拔和年龄别身高Z分数为主要预测因子,其次是财富和母亲教育。儿童贫血预测的性能更多由人群变异驱动而非模型选择。TabPFN在低资源环境中通过改进的区分度和校准提供了优势,突显了基础模型作为数据稀缺全球健康预测的有前景工具。

英文摘要

Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizability. We evaluate a transformer-based tabular foundation model against classical supervised methods under cross-country and data-scarce settings. We used DHS data from 16 countries across Africa, Asia, Latin America, the Caucasus, and the Middle East (n=68,856). We compared Logistic Regression, XGBoost, LightGBM, and TabPFN v2.6. Performance was assessed using AUC-ROC, Brier score, and ECE. Generalization was evaluated using leave-one-country-out (LOCO), reverse-LOCO, and few-shot settings. Subgroup analyses included sex, age, residence, maternal education, and wealth. Feature importance was estimated using SHAP. TabPFN outperformed classical models in low-data regimes (<200 samples), showing higher discrimination and better calibration. Across countries, it achieved the lowest Brier score (0.042) and ECE (0.203). Under full-data settings, AUC-ROC ranged from 0.59-0.76 with small between-model differences ($\leq 0.05$). LOCO performance was stable (0.58-0.69), driven by country context. Reverse-LOCO showed asymmetric transferability. Subgroup performance was consistent with no systematic demographic bias. SHAP identified child age, altitude, and height-for-age z-score as dominant predictors, followed by wealth and maternal education. Performance in childhood anemia prediction is driven more by population variation than model choice. TabPFN provides advantages in low-resource settings through improved discrimination and calibration, highlighting foundation models as promising tools for data-scarce global health prediction.
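For readers unfamiliar with the two calibration metrics reported above, the following Python sketch computes the Brier score and one common equal-width-bin variant of expected calibration error (ECE) for binary predictions. The function names, the choice of 10 bins, and the binary "predicted positive probability versus observed positive rate" form of ECE are assumptions for illustration; the abstract does not specify which ECE variant the study used.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average predicted probability and observed rate per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(observed_rate - mean_prob)
    return ece


if __name__ == "__main__":
    probabilities = [0.1, 0.1, 0.9, 0.9]
    outcomes = [0, 0, 1, 1]
    print(brier_score(probabilities, outcomes))
    print(expected_calibration_error(probabilities, outcomes))
```

Both metrics are lower-is-better. The Brier score mixes discrimination and calibration, whereas ECE isolates calibration, which is why the two are usually reported together.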

2605.26585 2026-05-27 cs.LG

Near-Optimal Regret in Adversarial Kernel Bandits

对抗性核赌博中的近最优遗憾

Yu-Jie Zhang, Hao Qiu, Jonathan Scarlett, Kevin Jamieson

发表机构 * University of Washington(华盛顿大学) National University of Singapore(新加坡国立大学)

AI总结 针对对抗性核赌博问题,提出基于正则化重要性加权损失估计的指数权重算法,通过显式修正项消除偏差,实现与随机核赌博已知最优率匹配的遗憾界。

详情
AI中文摘要

我们研究对抗性核赌博问题,其中每轮的损失由再生核希尔伯特空间(RKHS)中的任意有界元素诱导。我们提出了一种基于正则化重要性加权损失估计的指数权重算法,并带有一个显式修正项,用于抵消正则化引入的偏差。我们的主要结果给出遗憾上界 $\widetilde{O}\big(\sqrt{T\, d_*(λ)\,\log|{X}|}\big)$,其中 $d_*(λ)$ 是广泛采用的有效维度概念,用于捕捉核的复杂度。忽略对数因子,这匹配了相关随机核赌博问题中已知的速率。一个显著的应用是 $\mathbb{R}^d$ 上具有平滑参数 $ν$ 的 Matérn$(ν,d)$ 核,此时我们的界特化为 $\widetilde{O}\big(T^{(ν+d)/(2ν+d)}\big)$,改进了 Chatterji 等人 [2019] 先前已知的最佳速率,同时去除了他们分析所需的秩一对手假设。此外,该速率与随机核赌博的已知最优速率相同,并且与并发工作中的下界仅相差一个 $\log T$ 因子。

英文摘要

We study the adversarial kernel bandit problem, in which the loss at each round is induced by an arbitrary bounded element of a reproducing kernel Hilbert space (RKHS). We propose an exponential-weights algorithm built on a regularized importance-weighted loss estimator, together with an explicit correction term that cancels the bias introduced by the regularization. Our main result bounds the regret by $\widetilde{O}\big(\sqrt{T\, d_*(λ)\,\log|{X}|}\big)$, where $d_*(λ)$ is a widely-adopted notion of effective dimension that captures the complexity of the kernel. Up to logarithmic factors, this matches the known rate achieved in the related stochastic kernel bandit problem. A notable application is the Matérn$(ν,d)$ kernel with smoothness parameter $ν$ on $\mathbb{R}^d$, for which our bound specializes to $\widetilde{O}\big(T^{(ν+d)/(2ν+d)}\big)$, improving over the best-known prior rate of Chatterji et al. [2019] while simultaneously removing the rank-one adversary assumption required by their analysis. Moreover, this rate is the same as the known optimal rate for stochastic kernel bandits, and also matches a lower bound from concurrent work up to a $\log T$ factor.
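To make the Matérn rate above concrete, here is a short worked evaluation of the exponent. The specific values $ν = 5/2$ and $d = 2$ are chosen only for illustration and do not come from the paper.

```latex
\[
\left.\frac{\nu + d}{2\nu + d}\right|_{\nu = 5/2,\; d = 2}
  = \frac{9/2}{7} = \frac{9}{14} \approx 0.643,
\qquad
\lim_{\nu \to \infty} \frac{\nu + d}{2\nu + d} = \frac{1}{2}.
\]
```

So for this kernel the stated bound grows like $T^{0.643}$ up to logarithmic factors, and as the kernel becomes smoother the exponent approaches $1/2$, the familiar $\sqrt{T}$ rate; for any fixed $ν$, a larger dimension $d$ pushes the exponent toward 1.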

2605.26584 2026-05-27 cs.CV

O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

O-MARC: 全记忆增强压缩蒸馏用于高效视频理解

Peiran Wu, Yunze Liu, Chi-Hao Wu, Chen Chen, Junxiao Shen

发表机构 * University of Bristol(布里斯托大学) Memories.ai Research(Memories.ai研究院) University of Central Florida(中佛罗里达大学)

AI总结 提出O-MARC框架,通过无训练压缩方法OMAC保留视觉记忆和音频锚点,并利用压缩蒸馏使紧凑模型鲁棒,在多个基准上提升性能并降低推理成本。

详情
AI中文摘要

全模态大语言模型实现了统一的音频视频理解,但长联合令牌序列导致推理成本高昂,且现有基准未能完全隔离噪声用户生成视频中的音视频关联。我们引入了UGC-AVQA,一个公开的UGC基准,包含1000个视频和4816个问答对,其中音频移除测试确保基准问题需要声学和视觉证据。为了降低推理成本,我们提出了OMAC,一种无需训练的即插即用压缩方法,保留显著的视觉记忆和时域锚定的音频锚点。为了进一步使紧凑模型对压缩输入鲁棒,我们引入了O-MARC,一种用于学习记忆压缩多模态上下文的压缩蒸馏框架。在Qwen2.5-Omni-3B上,O-MARC在四个基准上的平均得分提升至45.8,优于全令牌推理的44.1和OmniZip的41.0。与全令牌推理相比,OMAC还保持了推理效率,延迟降低34.6%(1.53倍加速),内存降低34.7%。

英文摘要

Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full-token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6% (1.53$\times$ speedup) and memory by 34.7% compared with full-token inference.
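The two efficiency figures quoted for latency describe the same measurement; the conversion from a relative latency reduction to a speedup factor is a one-line calculation, shown here as a check on the reported numbers.

```latex
\[
\text{speedup}
  = \frac{t_{\text{full}}}{t_{\text{compressed}}}
  = \frac{1}{1 - 0.346}
  = \frac{1}{0.654}
  \approx 1.53 .
\]
```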

2605.26582 2026-05-27 cs.LG cs.AI

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

离散扩散中随机性的纠错效应

William Yuan, Sungwon Jeong, Amirali Aghazadeh

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文系统研究离散扩散模型中马尔可夫转移随机性程度对采样效率与质量的权衡,提出离散搅动与重启采样(DCRS)算法,通过交替正向和反向扩散过程注入受控随机性,在低函数评估次数下改善速度-质量权衡。

详情
AI中文摘要

离散扩散模型在文本和图像生成中取得了强劲性能,但其推理仍然缓慢,且必须内在平衡采样效率与样本质量。在这项工作中,我们系统研究了马尔可夫转移中随机性程度如何主导采样权衡。我们表明,高度确定性的转移收敛迅速但遭受误差累积,而更随机的转移收敛更慢但能达到更高的最终样本质量。通过信息论分析,我们识别出潜在机制为一种由对称地在状态间交换质量的冗余转移诱导的纠错效应,并表明这些转移可证明地收缩采样误差。受此分析启发,我们提出离散搅动与重启采样(DCRS),一种新颖的推理算法,通过交替正向和反向扩散过程注入受控随机性。在合成数据集和大规模基准上的实验表明,DCRS在低函数评估次数下改善了速度-质量权衡。在图像数据集上,与标准采样器相比,DCRS在保持竞争性样本质量的同时,实现了高达10倍的采样步数减少;而在语言基准上,我们观察到更细微的行为,取决于损坏过程和采样程序。

英文摘要

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the degree of stochasticity in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by redundant transitions that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose Discrete Churn and Restart Sampling (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the regime of few function evaluations. On image datasets, DCRS achieves up to a $10\times$ reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.

2605.26579 2026-05-27 cs.LG

Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards

Focal Reward: 基于评分标准的强化学习中的平衡奖励

Yu Huang, Zihua Zhao, Zhaoxin Huan, Wanli Gu, Feng Hong, Xinmu Ge, Lin Yuan, Weichang Wu, Qiang Hu, Xiaolu Zhang, Jun Zhou, Jiangchao Yao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团)

AI总结 针对大语言模型在基于多维评分标准的强化学习中奖励失衡的问题,提出Focal Reward方法,通过逆奖励投影机制估计各维度饱和程度并自动重加权,实现细粒度平衡,在18个模型-基准对比中均优于最强静态聚合基线。

Comments Preprint

详情
AI中文摘要

大语言模型中的开放式生成通常需要多维评分标准来充分评估质量并指导强化学习的改进。然而,这种训练范式固有的一个关键困境是不同评分标准维度上的奖励极化不平衡。在此瓶颈下,即使大语言模型在训练后获得相对较高的奖励,它们仍可能在某些维度上表现出严重缺陷,直接导致用户体验下降。为了解决这个问题,我们提出了Focal Reward,一种新颖的目标函数,用于自动平衡基于评分标准的强化学习训练。具体来说,我们首先利用逆奖励投影机制来估计评分标准中每个准则的饱和程度,这构成了校准奖励方向的基础。然后,最终目标函数为每个准则设计了一个自动重新加权的系数,以实现细粒度平衡。跨三个模型规模和六个基准的大量实验表明,我们的Focal Reward方法在所有18个模型-基准比较中均优于最强的静态聚合基线。展开、机制和消融分析进一步表明,这些增益来自于向仍有改进空间的评分标准进行在线、饱和感知的重新分配。

英文摘要

Open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide improvement through reinforcement learning. However, a critical dilemma inherent in this training paradigm is imbalanced reward polarization across rubric dimensions. Under this bottleneck, even if LLMs achieve relatively high rewards after training, they may still exhibit severe deficiencies in certain dimensions, which directly degrades the user experience. To address this problem, we propose Focal Reward, a novel objective that automatically balances reinforcement learning under rubric-based rewards. Specifically, we first leverage an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis for calibrating the reward direction. Then, the final objective assigns an automatically reweighted coefficient to each criterion to achieve fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that our Focal Reward method outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.

2605.26576 2026-05-27 cs.CV cs.LG

TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

TrackRef3D: 面向3D高斯泼溅中开放世界引用分割的多视角一致“先跟踪后标注”方法

Yuyang Tan, Renhe Zhang, Hang Zhang, Ao Li, Xin Tan

发表机构 * East China Normal University, Shanghai, China(华东师范大学,上海,中国) Shanghai AI Laboratory(上海人工智能实验室) University of Electronic Science and Technology of China, Chengdu, China(电子科技大学,成都,中国)

AI总结 提出TrackRef3D全自动流水线,通过多视角一致跟踪-标注范式解耦目标发现与语义定位,无需人工标注实现开放世界3D高斯泼溅分割。

详情
AI中文摘要

引用3D高斯泼溅(R3DGS)利用自然语言进行3D目标分割,已成为具身AI的关键能力。然而,现有方法通常依赖昂贵的每场景人工标注和每视图伪掩码生成,存在多视角不一致以及对不同查询特异性的泛化能力差的问题。为此,我们提出TrackRef3D,一种全自动流水线,通过引入多视角一致的跟踪-标注范式,从根本上将目标发现与语义定位解耦,无需人工标注即可实现3D高斯泼溅(3DGS)中的开放世界引用分割。具体而言,我们提出轨迹感知语义共识模块(TSCM),通过同义词聚类和轨迹感知投票聚合跨视图预测,建立规范语义身份,从而确保多视角一致性。此外,我们采用可见性感知描述生成策略以缓解歧义,并提出混合训练策略(HTS),利用多正例对比目标联合优化粗粒度类别语义和细粒度引用线索,确保在不同查询特异性下的鲁棒性。在基准上的大量实验表明,TrackRef3D达到了最先进的性能。

英文摘要

Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.

2605.26575 2026-05-27 cs.CL

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

中心性而非各向异性驱动多语言嵌入模型中的跨语言检索不对称性

Adib Sakhawat, Fardeen Sadab, Atik Shahriar

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文通过实验证明,在多语言嵌入模型中,中心性(hubness)是导致跨语言检索不对称性的主要几何病理因素,而非各向异性、质心漂移或向量幅度,并推荐使用CSLS替代余弦相似度作为默认检索度量。

Comments 17 pages, 5 figures

详情
AI中文摘要

多语言嵌入模型在部署时假设跨语言检索是对称的:如果语言A的查询检索到语言B中的翻译,反之亦然。但实际上并非如此。我们使用包含英语、孟加拉语、印地语和阿拉伯语的6,518个习语和谚语表达的平行语料库,通过五个生产级编码器(Gemini、Mistral、OpenAI-L、OpenAI-S、Qwen)进行嵌入,将这种失败形式化为互近邻互惠性的缺陷,并测试一个单一的机制性主张:在多语言空间的几何病理中,中心性(hubness),而非各向异性、质心漂移或幅度,是主要的因果驱动因素。在五个预先注册的实验中,预先指定了证伪条件,中心质量在互惠性的联合回归中占主导地位(49.5%的主导份额,是下一个预测因子的1.68倍;偏R²=0.302,而各向异性为0.003),而中心性感知的分数校正(CSLS)缩小了最差到最佳互惠性差距的63.5%,并产生平均模型内效应量,是外科中心向量消融的130倍。后一对比指出了机制:中心性是相似度度量的病理,而非单个中心向量的病理。我们解决了著名的各向异性-中心性悖论,证明两者在统计上是可分离的,并建议用CSLS替换余弦相似度作为多语言嵌入管道的默认检索度量。

英文摘要

Multilingual embedding models are deployed under the assumption that cross-lingual retrieval is symmetric: if a query in language A retrieves its translation in language B, the reverse should also hold. In practice it does not. Using a parallel corpus of 6,518 idiomatic and proverbial expressions in English, Bangla, Hindi, and Arabic, embedded by five production-grade encoders (Gemini, Mistral, OpenAI-L, OpenAI-S, Qwen), we formalise this failure as a deficit in mutual nearest-neighbour reciprocity and test a single mechanistic claim: among the geometric pathologies of multilingual spaces, hubness, not anisotropy, centroid drift, or magnitude, is the dominant causal driver. Across five pre-registered experiments with falsification conditions specified in advance, hub mass dominates a joint regression on reciprocity (49.5% dominance share, 1.68x the next predictor; partial R^2 = 0.302 versus 0.003 for anisotropy), while a hub-aware score correction (CSLS) closes 63.5% of the worst-to-best reciprocity gap and yields a mean within-model effect size 130x larger than surgical hub-vector ablation. The latter contrast pinpoints the mechanism: hubness is a pathology of the similarity metric, not of individual hub vectors. We resolve the well-known anisotropy-hubness paradox by showing the two are statistically dissociable, and we recommend replacing cosine similarity with CSLS as the default retrieval metric for multilingual embedding pipelines.
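The hub-aware correction recommended above, CSLS (cross-domain similarity local scaling), has a standard definition: CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where r_T(x) is the mean cosine similarity of x to its k nearest neighbours on the target side and r_S(y) is the analogous quantity for y on the source side. The sketch below is a dependency-free Python illustration of that formula; the function names and the default `k` are chosen here for illustration, and nothing about the paper's specific pipeline or hyperparameters is implied.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def csls_scores(
    queries: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    k: int = 10,
) -> List[List[float]]:
    """Return the CSLS score matrix (queries x targets)."""
    sim = [[cosine(q, t) for t in targets] for q in queries]

    def mean_top_k(values: Sequence[float]) -> float:
        k_eff = min(k, len(values))
        return sum(sorted(values, reverse=True)[:k_eff]) / k_eff

    # r_T(x): how close each query is, on average, to its nearest targets.
    r_query = [mean_top_k(row) for row in sim]
    # r_S(y): how close each target is, on average, to its nearest queries.
    # A hub target is close to many queries, so its penalty is large.
    r_target = [mean_top_k(col) for col in zip(*sim)]

    return [
        [2 * sim[i][j] - r_query[i] - r_target[j] for j in range(len(targets))]
        for i in range(len(queries))
    ]


if __name__ == "__main__":
    scores = csls_scores([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (0.0, 1.0)], k=1)
    for row in scores:
        print(row)
```

Retrieval then takes the argmax over each row instead of over raw cosine similarity. Because the penalty term for a target depends on all queries, the correction acts on the scoring function rather than on any single vector, which matches the abstract's point that hubness is a property of the similarity metric.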

2605.26571 2026-05-27 cs.LG

Separate Aggregation of Split Network for Personalized Federated Learning

拆分网络的分离聚合用于个性化联邦学习

Yunseok Kang, Jaeyoung Song

发表机构 * Department of Electronics Engineering, Pusan National University(全州国立大学电子工程系)

AI总结 提出PGFedSplit框架,采用分离架构和自适应聚合调度,结合本地与服务器生成的表示,解决客户端数据异构下的个性化与全局泛化权衡问题。

详情
AI中文摘要

联邦学习能够在不共享原始数据的情况下进行协作模型训练,但在客户端数据分布异构时性能会大幅下降。单一的全局模型往往无法满足不同客户端的需求,因此个性化联邦学习被探索用于在保持全局泛化的同时提升客户端特定性能。现有的PFL方法通常面临一个基本权衡:更强的全局共享可能削弱本地专业化,而更强的本地适应则可能导致在数据有限、标签不平衡和缺失类别场景下的过拟合。在这项工作中,我们提出了PGFedSplit,一个在严重客户端异构下同时提升个性化和全局泛化的个性化联邦学习框架。PGFedSplit采用分离架构,并根据不同模型组件的角色执行自适应聚合调度,在保持客户端特定适应的同时实现稳定的知识共享。每个客户端进一步利用本地提取的表示和从服务器端高斯统计生成的合成表示的混合,提升了在标签不平衡和缺失类别条件下的鲁棒性。在Fashion MNIST、CIFAR-10、CIFAR-100和Tiny ImageNet上的大量实验表明,与最先进的PFL方法相比,PGFedSplit在高度异构设置下实现了持续改进,具有稳定的收敛和优越的个性化性能。

英文摘要

Federated learning enables collaborative model training without sharing raw data, but its performance can degrade substantially under heterogeneous client data distributions. A single global model often cannot satisfy diverse client requirements, so personalized federated learning (PFL) has been explored to improve client-specific performance while preserving global generalization. Existing PFL methods often face a fundamental tradeoff in which stronger global sharing can undermine local specialization, whereas stronger local adaptation can lead to overfitting under limited-data, label-imbalance, and missing-class scenarios. In this work, we propose PGFedSplit, a personalized federated learning framework that improves both personalization and global generalization under severe client heterogeneity. PGFedSplit adopts a split architecture and performs adaptive aggregation scheduling tailored to the roles of different model components, enabling stable knowledge sharing while maintaining client-specific adaptation. Each client further leverages a mixture of locally extracted representations and synthetic representations generated from server-side Gaussian statistics, improving robustness under label-imbalance and missing-class conditions. Extensive experiments on Fashion-MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate consistent improvements over state-of-the-art PFL methods, with stable convergence and superior personalization in highly heterogeneous settings.

2605.26569 2026-05-27 cs.LG

Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series

Distribution-Aware Conformal Prediction: A Framework for Generating Efficient Prediction Intervals for Time Series

Daniel Schweizer, Peter Kuhn, Jayant Sharma, Shivali Dubey, Malte von Ramin, Christoph Brockt-Haßauer

Affiliation * Fraunhofer Institute for Highspeed Dynamics, Ernst-Mach-Institut, EMI Freiburg

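AI summary Proposes the Distribution-aware Conformal Prediction (DCP) framework, which integrates probabilistic predictors with score-agnostic conformal calibration to generate valid and efficient prediction intervals for time series.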

Comments submitted to Journal of Machine Learning Research (JMLR)

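Details
AI Chinese abstract (translated)

We propose Distribution-aware Conformal Prediction (DCP), a unified framework that combines probabilistic predictors such as Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to generate valid and efficient prediction intervals. By constructing interval bounds with a numerical inversion method, DCP can accommodate arbitrary combinations of distribution-generating predictors and nonconformity scores. Benchmark analyses on synthetic and real-world time series data show that DCP adaptively calibrates prediction intervals under different uncertainty regimes. Crucially, DCP's modular design makes plug-and-play experimentation with different predictor-score pairings straightforward, and this is supported quantitatively by a newly introduced modified Winkler score that balances validity and efficiency by explicitly penalizing undercoverage. While DCP generalizes and extends existing approaches such as Conformalized Quantile Regression and Conformalized Monte Carlo, its modular design allows further extensions, laying a foundation for advancing uncertainty quantification in dynamic environments and high-risk applications.

English abstract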

We present Distribution-aware Conformal Prediction (DCP), a unified framework that integrates probabilistic predictors such as Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to produce valid and efficient prediction intervals. Leveraging a numerical inversion approach to construct interval bounds, DCP accommodates arbitrary combinations of distribution-generating predictors and nonconformity scores. Benchmark analyses on synthetic and real-world time series data demonstrate DCP's ability to adaptively calibrate prediction intervals under varying uncertainty regimes. Crucially, DCP's modular design facilitates plug-and-play experimentation with different predictor-score pairings, quantitatively supported by a newly introduced modified Winkler score that balances validity and efficiency by explicitly penalizing undercoverage. While DCP generalizes and extends existing approaches such as Conformalized Quantile Regression and Conformalized Monte Carlo, its modular design allows further extensions, setting a foundation for advancing uncertainty quantification in dynamic environments and high-risk applications.
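For readers unfamiliar with the baselines named in the abstract, here is a minimal sketch of split-conformal calibration in the style of Conformalized Quantile Regression, together with the standard Winkler interval score. Neither is DCP itself: the abstract does not describe the numerical inversion step or the modified Winkler score, so the code shows only the well-known baseline forms. The function names `cqr_offset` and `winkler_score` and the toy calibration data are assumptions made for illustration.

```python
import math


def cqr_offset(cal_lo, cal_hi, cal_y, alpha):
    """Conformal correction for quantile intervals on a calibration set.

    The nonconformity score is max(lo - y, y - hi): negative when y lies
    inside the predicted interval, positive when it falls outside.
    """
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return scores[k - 1] if k <= n else math.inf


def winkler_score(lo, hi, y, alpha):
    """Standard Winkler score: interval width plus a penalty for misses."""
    width = hi - lo
    if y < lo:
        return width + (2 / alpha) * (lo - y)
    if y > hi:
        return width + (2 / alpha) * (y - hi)
    return width


alpha = 0.25
# Toy calibration set: every predicted interval is [0, 1].
cal_y = [0.5, 0.25, 1.25, 1.5, -0.5, 0.75, 2.0]
q = cqr_offset([0.0] * 7, [1.0] * 7, cal_y, alpha)

# Widen a new raw interval [2, 3] by the calibrated offset.
lo, hi = 2.0 - q, 3.0 + q
print(q, (lo, hi), winkler_score(lo, hi, 4.0, alpha))
```

Lower Winkler scores are better: a narrow interval is rewarded, and a miss is penalized in proportion to its distance from the interval, scaled by 2/alpha.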