arXivDaily: arXiv daily academic digest, updated Monday to Friday
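2605.26759 2026-06-15 cs.LG Updated version

Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining

Biao Ouyang, Tengxue Zhang, Zhihao Zhuang, Yang Shu, Chenjuan Guo, Bin Yang

Affiliations East China Normal University

AI summary Proposes the PTCD framework, which improves cross-task generalization in time-series causal discovery through context-conditioned modeling and a pretraining paradigm with transferable causal augmentation, and performs strongly on causal discovery and root cause identification across multiple real-world OOD datasets.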

Comments 20 pages

Abstract

Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer their causal discovery capabilities to new time series governed by diverse causal mechanisms. In this paper, we propose PTCD, a novel Pretraining framework for Time-series Causal Discovery, which improves cross-task generalization through context-conditioned modeling and transferable causal augmentation. To model complex temporal causal dependencies, PTCD employs a dual-scale iterative attention mechanism to capture window-level causal relationships, and a Gaussian mixture with a context-level routing mechanism to handle heterogeneous exogenous distributions. To further address distribution shifts across causal graphs, PTCD adopts a pretraining paradigm on synthetic datasets that integrates intervention-based learning and a causal mixup strategy, promoting stable causal discovery and stronger generalization. Extensive experiments on multiple real-world out-of-distribution (OOD) datasets demonstrate that PTCD excels in both causal discovery and root cause identification.

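2605.26702 2026-06-15 cs.CV cs.AI cs.CR cs.LG Updated version

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Antonios Argyriou, Wu Liu, Weiping Wang

Affiliations University of Science and Technology of China

AI summary Addresses the weak robustness of panoramic-image watermarks under arbitrary 3D rotations by constructing a rotation-invariant spherical bispectrum through third-order SO(3) representation coupling. Watermarks are embedded into higher-order spherical harmonic coefficients and extracted from invariant scalars, giving theoretically guaranteed rotation invariance and high visual fidelity.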

Comments ICML 2026

Abstract

Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of $SO(3)$, rendering conventional planar representations and augmentation-based robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage $SO(3)$ representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics, which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order $SO(3)$ irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of its $SO(3)$ invariance and experimentally demonstrate near-perfect robustness to continuous rotations while maintaining high visual fidelity.
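
For orientation, the classical spherical bispectrum from the signal-processing literature has the form below. This is a sketch of the standard construction, not necessarily the exact normalization or index convention used in the paper. Here $a_{\ell m}$ denotes the spherical harmonic coefficients of the signal and $C^{\ell m}_{\ell_1 m_1\,\ell_2 m_2}$ the Clebsch-Gordan coefficients; both symbols are assumptions introduced for illustration.

```latex
% Third-order invariant: couple degrees \ell_1 and \ell_2, project onto degree \ell,
% then contract with the conjugate degree-\ell coefficients.
b_{\ell_1 \ell_2 \ell}
  = \sum_{m_1=-\ell_1}^{\ell_1} \sum_{m_2=-\ell_2}^{\ell_2} \sum_{m=-\ell}^{\ell}
    C^{\ell m}_{\ell_1 m_1\, \ell_2 m_2}\;
    a_{\ell_1 m_1}\, a_{\ell_2 m_2}\, \overline{a_{\ell m}}
```

Under a rotation $R$, each coefficient vector transforms as $a_\ell \mapsto D^\ell(R)\, a_\ell$ with a unitary Wigner matrix $D^\ell(R)$. The Clebsch-Gordan coupling of $a_{\ell_1} \otimes a_{\ell_2}$ projected onto degree $\ell$ transforms by the same $D^\ell(R)$, so the contraction with $\overline{a_\ell}$ cancels the rotation. By contrast, the power spectrum $p_\ell = \sum_m |a_{\ell m}|^2$ is also invariant but discards the relative phase between degrees.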

2605.25782 2026-06-15 cs.RO Updated version

ParkourFormer: Integrating Predictive Supervision and Sequence Modeling into Parkour Locomotion

Yanheng Mai, Wenhao Xu, Zirui Huang, Yifei Fu, Shengwei Dong, Xinjue Wang, Kailun Huang, Yanzhe Xie, Renjing Xu

Affiliations The Hong Kong University of Science and Technology (Guangzhou); CLAI-LAB, CL-TECH; South China Agricultural University; Guangdong University of Technology

AI summary Proposes ParkourFormer, a Transformer-based sequence modeling framework that predicts future proprioceptive states and fuses them with temporal features to generate actions, achieving high-success-rate locomotion control for humanoid robots in multi-terrain parkour.

Comments Project Homepage: https://mronaldo-gif.github.io/parkourformer.github.io/

Abstract

Humanoid parkour requires locomotion policies to coordinate whole-body dynamics across rapidly changing terrains such as stairs, gaps, slopes, and obstacles. Existing reinforcement learning policies are largely reactive, mapping observations directly to actions without explicitly modeling future body states. Such modeling becomes critical in agile locomotion tasks where successful motion execution depends strongly on anticipating upcoming contact transitions and body dynamics. We present ParkourFormer, a Transformer-based sequence modeling framework that reformulates humanoid locomotion as a future-conditioned decision-making problem. The current robot state queries historical sensorimotor trajectories through cross-attention, while a lightweight prediction head forecasts short-horizon future proprioceptive states. The predicted future states, trained with supervised signals, are fused with temporal features to generate actions, enabling the policy to jointly reason over motion history and anticipated future dynamics. We evaluate ParkourFormer on a diverse multi-terrain humanoid parkour benchmark including stairs, gaps, slopes, rough terrain, and obstacle traversal. Experiments in simulation and on a real humanoid robot show that ParkourFormer achieves a 93.85% average traversal success rate on highly challenging terrains, with improvements of up to 47.12% over strong MLP, MoE-based MLP, and vanilla Transformer baselines, while maintaining a single unified policy across all terrain types. These results demonstrate that explicit future-state modeling significantly improves robustness and generalization for agile whole-body locomotion.

2605.29640 2026-06-15 cs.AI Updated version

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

Jiajie Fu, Junwen Chen, Mengzhao Wang, Aoxiang He, Maojia Sheng, Xiangyu Ke, Yifan Zhu, Yunjun Gao

Affiliations Zhejiang University

AI summary Proposes the Memory Base data management paradigm and implements the VikingMem system on the VikingDB vector engine. Through event and entity abstractions, topic-wise timeline compression, and time-weighted recall, it improves retrieval effectiveness on long-term memory benchmarks by up to 30%.

Comments Accepted by VLDB26

Abstract

Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that lead to incomplete memories or use rigid, single-purpose memory extraction prompts tailored to a single use case, such as chatbots. Consequently, they lack generalizability and perform poorly across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a novel data management paradigm for managing the persistent state of long-term interactions. It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, where memory content is progressively summarized, corrected, and temporally weighted to prioritize recent interactions; and a generalizable abstraction paradigm designed for robust transferability across diverse applications, including education, recommendation, and agent memory. Building on this foundation, we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem materializes this paradigm through interconnected event and entity abstractions. It features event-centric memory extraction to selectively handle complex information streams, while entities are dynamically updated by events to achieve stateful evolution. Using temporal compression via a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluations on long-term memory benchmarks demonstrate that VikingMem outperforms baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency essential for interactive applications.
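
The abstract does not specify how time-weighted recall is scored, so the following is only a minimal sketch of one common way to combine relevance with recency: cosine similarity multiplied by an exponential half-life decay. The `Memory` class, `time_weighted_recall`, and the `half_life` parameter are hypothetical names chosen for illustration and are not VikingMem or VikingDB APIs.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    embedding: Sequence[float]
    timestamp: float  # seconds since some reference point


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def time_weighted_recall(
    query: Sequence[float],
    memories: List[Memory],
    now: float,
    half_life: float = 86400.0,  # assumed: relevance halves every day
    top_k: int = 3,
) -> List[Memory]:
    """Rank memories by similarity scaled down by their age."""

    def score(m: Memory) -> float:
        age = max(0.0, now - m.timestamp)
        return cosine(query, m.embedding) * 0.5 ** (age / half_life)

    return sorted(memories, key=score, reverse=True)[:top_k]
```

With this scoring, two equally relevant memories are ordered by recency, while an irrelevant but recent memory still ranks below both.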

2605.29228 2026-06-15 cs.LG q-bio.MN Updated version

Traditional machine learning vs. deep learning from dynamic graph representations of proteins' 3D folds in the task of protein structure classification

Aydin Wells, Francis A. Gatsi, Aaron Striegel, Tijana Milenković

Affiliations University of California, Berkeley

AI summary Compares traditional machine learning and deep learning for protein structure classification from dynamic protein structure networks in terms of accuracy and efficiency, finding that the two are close in accuracy while deep learning is more than 10 times slower.

Comments Main paper: 16 pages, 4 figures, and 1 table; Supplementary information: 13 pages, 9 figures

Abstract

Protein structure classification (PSC) uses supervised learning to predict a protein's CATH/SCOP(e) class from the protein's sequence or 3D structural feature(s). We previously modeled 3D structures as (static) protein structure networks (PSNs), demonstrating the competitiveness of PSN-based features with sequence or direct (i.e., non-network) 3D structural features in the PSC task. More recently, we demonstrated the power of features extracted from dynamic PSNs over features extracted from static PSNs (and thus, by transitivity, over sequence and direct 3D structural features) in the same task. That dynamic PSN approach used traditional machine learning (ML), combining manual (pre-engineered) features with an off-the-shelf classifier. Here, we evaluate whether automatic deep learning (DL) from the dynamic PSNs yields improvements. Our evaluation on 72 datasets spanning ~44,000 CATH- or SCOPe-labeled dynamic PSNs reveals that in terms of PSC accuracy, traditional ML and DL are (close to) tied for a large majority of the datasets, while DL is on average 10+ times slower. We are the first to evaluate traditional ML vs. DL in the dynamic PSN-based PSC task.

2605.25651 2026-06-15 cs.CV Updated version

Hierarchical Consistency Learning for Test-time Adaptation in Camouflage Perception

Mingfeng Zha, Tianyu Li, Guoqing Wang, Yunqiang Pei, Chaofan Qiao, Jiening Zhang, Yang Yang, Heng Tao Shen

Affiliations Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China; School of Computer Science and Technology, Tongji University; Peng Cheng Laboratory

AI summary Proposes the hierarchical consistency learning (HCL) framework, which uses test-time adaptation to adjust representations dynamically and combines hierarchical representation reconstruction, task affinity guidance, and prototype consistency calibration to address domain rigidity and annotation dependency in camouflaged object detection.

Comments Accepted by IEEE TIP

Abstract

Camouflaged object detection (COD) aims to localize targets that exhibit minimal perceptual differences from backgrounds through physical attributes. Existing methods, constrained by the static train-then-freeze paradigm, suffer from domain rigidity and annotation dependency, limiting their adaptability to scene variations and unseen camouflage patterns. To overcome these, we propose the hierarchical consistency learning (HCL) framework, which integrates test-time adaptation for dynamic representation recalibration. Specifically, we design the hierarchical representation reconstruction (HRR) to alleviate feature entanglement by synergizing spatial reconstruction with dual-stream frequency-domain decomposition, enhancing robustness against appearance homogenization. The pixel and spectrum inference provide structural and contextual priors. We further introduce task affinity guidance (TAG) to propagate knowledge across branches via channel-wise affinity, aligning local discriminative cues and mitigating semantic drift. To ensure semantic invariance, we formulate the prototype consistency calibration (PCC), which aggregates region features into compact prototypes and establishes prototype-feature similarity. This imposes implicit and hierarchical constraints that bridge task and representation gaps. Extensive experiments across four camouflaged and four underwater object benchmarks, under three degradation settings, demonstrate that our method consistently outperforms state-of-the-art approaches, highlighting its robustness and generalization under distribution shifts.
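
The step of aggregating region features into a compact prototype and measuring prototype-feature similarity can be illustrated with masked average pooling followed by cosine similarity. This is a generic sketch of that idea, assuming a single region mask and a dense feature map; the function names are hypothetical and the paper's actual PCC formulation may differ.

```python
import numpy as np


def region_prototype(features: np.ndarray, mask: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Masked average pooling.

    features: (H, W, C) feature map; mask: (H, W) soft or binary region mask.
    Returns a (C,) prototype vector.
    """
    weighted = features * mask[..., None]
    return weighted.sum(axis=(0, 1)) / (mask.sum() + eps)


def prototype_similarity(features: np.ndarray, prototype: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between every spatial feature and the prototype, shape (H, W)."""
    feature_norms = np.linalg.norm(features, axis=-1)
    proto_norm = np.linalg.norm(prototype)
    return (features @ prototype) / (feature_norms * proto_norm + eps)
```

A consistency objective could then compare the similarity map with the predicted mask, although how the paper turns this similarity into a constraint is not stated in the abstract.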

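2604.18419 2026-06-15 cs.LG cs.CL stat.ML Updated version

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini

Affiliations Hebrew University of Jerusalem

AI summary Proposes a principled rule for dynamic abstention within a regularized reinforcement learning framework: early termination of reasoning is decided by comparing the value function with the abstention reward, outperforming existing methods on mathematical reasoning and toxicity avoidance tasks.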

Journal ref Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)

Abstract

LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
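
The decision rule described above, abstaining once the value function falls below the abstention reward, can be sketched as a check at every token position. This is a minimal illustration only: `step_fn` and `value_fn` are hypothetical callables standing in for the language model and the value approximation, whose actual construction the abstract does not give.

```python
from typing import Callable, List, Optional, Sequence


def generate_with_abstention(
    step_fn: Callable[[List[str]], Optional[str]],
    value_fn: Callable[[Sequence[str]], float],
    abstention_reward: float,
    max_tokens: int = 256,
) -> Optional[List[str]]:
    """Generate token by token; return None if the trace is abandoned.

    step_fn returns the next token, or None for end-of-sequence (assumed interface).
    value_fn estimates the expected reward of continuing from the current prefix.
    """
    tokens: List[str] = []
    for _ in range(max_tokens):
        # Abstain as soon as continuing looks worse than the reward for quitting.
        if value_fn(tokens) < abstention_reward:
            return None
        next_token = step_fn(tokens)
        if next_token is None:
            break
        tokens.append(next_token)
    return tokens
```

Raising `abstention_reward` makes the rule quit earlier and saves more compute, while lowering it lets more traces run to completion.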

2602.08324 2026-06-15 cs.LG Updated version

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin

Affiliations University of Science and Technology of China

AI summary Proposes the Extra-CoT framework, which combines extreme-ratio chain-of-thought compression, mixed-ratio supervised fine-tuning, and Constrained and Hierarchical Ratio Policy Optimization to cut reasoning tokens substantially while maintaining or even improving reasoning accuracy.

Comments Accepted to ICML 2026. 15 pages, 7 figures

Abstract

Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets through a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods. Our source code has been released at https://github.com/Mwie1024/Extra-CoT.

2406.09250 2026-06-15 cs.CV cs.AI cs.LG Updated version

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

Affiliations Mohamed Bin Zayed University of Artificial Intelligence; NVIDIA; École Polytechnique Fédérale de Lausanne; Michigan State University

AI summary Proposes the MirrorCheck framework, which uses text-to-image models and a randomization strategy to detect and defend against adaptive adversarial attacks on vision-language models.

Abstract

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
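
The consistency check can be sketched as follows, assuming the caption has already been turned back into an image by some text-to-image model. The `encoders` list holds hypothetical callables that map an image to an embedding vector. Modeling the one-time-use perturbation as fresh Gaussian noise scaled by `noise_scale` is an assumption made for illustration, since the abstract does not state its form.

```python
import numpy as np


def mirror_check(original, regenerated, encoders, threshold, noise_scale=0.0, rng=None):
    """Flag an input as suspicious when original and regenerated images disagree.

    encoders: list of callables image -> 1-D embedding (hypothetical interface).
    Returns (flagged, similarity).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Stochastic defense: pick one encoder at random for this query.
    encoder = encoders[int(rng.integers(len(encoders)))]
    e_orig = np.asarray(encoder(original), dtype=float)
    e_regen = np.asarray(encoder(regenerated), dtype=float)
    # Assumed perturbation: noise drawn anew for every query and never reused.
    e_orig = e_orig + noise_scale * rng.standard_normal(e_orig.shape)
    e_regen = e_regen + noise_scale * rng.standard_normal(e_regen.shape)
    denom = np.linalg.norm(e_orig) * np.linalg.norm(e_regen) + 1e-12
    similarity = float(e_orig @ e_regen / denom)
    return similarity < threshold, similarity
```

Because the encoder and the noise change on every call, an attacker who optimizes against one fixed encoder has less to exploit, which is the intuition the abstract gives for the randomized design.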

2605.25025 2026-06-15 cs.RO cs.SY eess.SY Updated version

Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Josef Berman, Oren Gal

Affiliations Hatter Department of Marine Technologies, Leon H. Charney School of Marine Sciences, University of Haifa

AI summary Proposes a hybrid framework combining CFD with multi-objective multi-agent reinforcement learning, using PCGrad to resolve gradient conflicts, to optimize upstream progression, energy efficiency, and motion smoothness of micro-robot swarms in oscillatory flow.

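AI abstract (translated from Chinese)

Coordinating micro-robotic swarms in physiologically realistic, time-dependent fluid environments remains an open challenge for biomedical and environmental applications. We present a hybrid computational fluid dynamics and multi-objective multi-agent reinforcement learning framework that directly couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn physically consistent swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots navigate a pulsatile arterial waveform while jointly optimizing upstream progression, energy conservation, and motion smoothness, with the objectives reconciled through PCGrad surgery. Without PCGrad, the energy-efficiency and smoothness rewards fall to near zero within 10,000 training steps while progress shows sustained large oscillations, confirming that gradient conflict resolution is a structural requirement in this domain rather than an optional refinement. The converged policy achieves progress rewards of 6.5-7.0, sustained energy efficiency of 0.63-0.65, and near-maximal smoothness (0.97-0.99), improving on the brute-force baselines on the primary objective, while both baselines have negative energy efficiency throughout. Training reveals three emergent behavioral phases: a collective two-layer hydrodynamic throttling formation that suppresses peak channel velocity during forward flow, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream repositioning, and individualized final approaches as agents near the success boundary. These results show that time-dependent fluid-agent interactions can be captured directly within a multi-objective reinforcement learning loop, providing a physics-grounded paradigm for micro-swarm control in biomedical navigation, environmental monitoring, and industrial microfluidics.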

Abstract

Coordinating micro-robotic swarms in realistic, time-dependent fluid environments remains a major challenge for biomedical and environmental applications. We present a hybrid CFD-MO-MARL (Computational Fluid Dynamics-Multi Objective-Multi Agent Reinforcement Learning) framework that couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots were simulated to navigate a pulsatile arterial waveform within a 2 mm channel while jointly optimizing upstream progression, energy efficiency, and motion smoothness. Conflicting objectives are resolved using Projected Conflicting Gradient (PCGrad) surgery. Without PCGrad, energy and smoothness rewards collapse during training, demonstrating that gradient conflict resolution is essential for stable multi-objective learning. The converged policy achieves progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines by more than 8 reward units on the primary objective. Training reveals three emergent behaviors not encoded in the reward function: hydrodynamic throttling formations that reduce peak flow velocities, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream movement, and individualized final-approach strategies near the target boundary. These results demonstrate that physically realistic fluid-agent interactions can be integrated directly into multi-objective reinforcement learning, providing a scalable framework for micro-swarm control in biomedical navigation, environmental monitoring, and microfluidic systems.
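
PCGrad surgery, as commonly described in the multi-task learning literature, removes from each objective's gradient the component that conflicts with another objective's gradient. The sketch below shows that projection on flat NumPy vectors. It illustrates the general technique under that reading and is not the authors' training code; the function name is hypothetical.

```python
import numpy as np


def pcgrad(grads, rng=None):
    """Combine per-objective gradients with conflict projection.

    grads: list of 1-D arrays, one gradient per objective.
    Returns the sum of the projected gradients.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    grads = [np.asarray(g, dtype=float) for g in grads]
    projected = []
    for i, g_i in enumerate(grads):
        g = g_i.copy()
        others = [j for j in range(len(grads)) if j != i]
        # Visit the other objectives in random order.
        for j in rng.permutation(others):
            g_j = grads[int(j)]
            dot = float(g @ g_j)
            if dot < 0.0:
                # Conflict: drop the component of g that points against g_j.
                g -= dot / float(g_j @ g_j) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)
```

For example, with gradients `[1, 0]` and `[-1, 1]` the two conflict, each is projected onto the normal plane of the other, and the combined update is `[0.5, 1.5]`. Non-conflicting gradients are simply summed.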

2604.26740 2026-06-15 cs.CV cs.GR Updated version

Rendering-Aware Sparse Sampling for BRDF Acquisition

W. Cao, D. Jönsson, Z. Huang, J. Unger

Affiliations Media and Information Technology, Department of Science and Technology, Linköping University

AI summary Proposes a rendering-aware sparse sampling method that optimizes sampling directions through a differentiable renderer, reconstructing high-quality material appearance from a minimal number of BRDF measurements.

Abstract

Accurate BRDF acquisition is essential for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small set of BRDF measurements that is most informative for reconstructing material appearance under a learned BRDF prior. Existing sparse-acquisition methods often optimize samples for BRDF-space reconstruction for all materials, while the perceptual importance of an adaptive measurement ultimately depends on its effect on each rendered appearance. We therefore formulate sparse adaptive acquisition as a rendering-aware optimization problem. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based/PCA-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor remains fixed, and gradients from a rendered-image loss optimize the measurement locations. This separates acquisition design from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. To make the comparison controlled, we evaluate the uniform baseline, meta-learning method, HyperBRDF method, and our learned sampler under matched sample numbers, train/test split, rendering scene, object mask, image mapping, and metrics. Our central claim is that rendering-aware sampling improves extremely sparse BRDF acquisition when final rendered appearance is the target. BRDF-space and combined losses are reported only as ablations, together with joint refinement and image-only latent fitting for unseen materials.

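2601.05106 2026-06-15 cs.AI cs.CL cs.LG Updated version

Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

AI summary Proposes the FusionRoute framework, in which a lightweight router selects the most suitable expert at each decoding step and adds a complementary logit to refine the next-token distribution. It addresses the difficulty a single general-purpose model has in performing well across multiple domains and outperforms other methods on multiple benchmarks.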

Comments 25 pages

Abstract

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
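
The two roles of the router, selecting an expert and adding a complementary logit, can be illustrated for a single decoding step as below. This is a toy sketch on NumPy arrays with hypothetical argument names. In the actual framework the router scores and the complementary logits would come from a trained model, which is not shown here.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def fused_next_token_distribution(expert_logits, router_scores, complementary_logits):
    """One decoding step of expert selection plus logit addition.

    expert_logits: (num_experts, vocab) next-token logits from each expert.
    router_scores: (num_experts,) router preference for each expert.
    complementary_logits: (vocab,) correction contributed by the router.
    Returns (chosen_expert_index, next_token_probabilities).
    """
    expert_logits = np.asarray(expert_logits, dtype=float)
    chosen = int(np.argmax(router_scores))
    fused = expert_logits[chosen] + np.asarray(complementary_logits, dtype=float)
    return chosen, softmax(fused)
```

When the complementary logits are all zero this reduces to pure expert routing, the setting the abstract argues is fundamentally limited; a nonzero correction lets the output differ from every expert's own distribution.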

2605.21472 2026-06-15 cs.CV Updated version

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Kaichen Zhou, Zeyang Bai, Xinhai Chang, Mengyu Wang, Paul Liang, Fangneng Zhan

Affiliations World Mind Lab, HKUST; Media Lab and EECS, MIT; Kempner Institute, Harvard University

AI summary Proposes Stream3D, a training-free streaming mechanism that maintains a compact evidential memory caching key historical frames, converting a frozen view-conditioned 3D generator into a streaming generator and addressing temporal inconsistency in 3D generation from monocular video streams.

Comments Multi-view 3D Generation, Streaming 3D Generation

Abstract

View-conditioned 3D generators such as SAM 3D, TRELLIS, and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://stream-3d.github.io/stream3d.github.io/.
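
A fixed-size memory that retains only the highest-scoring frames can be sketched with a min-heap, so that memory use stays constant regardless of stream length. How the evidence score is computed is not described in the abstract, so the sketch takes it as an input. The class and method names are hypothetical.

```python
import heapq
from typing import Any, List, Tuple


class EvidentialMemory:
    """Keep at most `capacity` frames, evicting the lowest-scoring one."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Min-heap of (score, frame_index, frame); the root is the weakest entry.
        self._heap: List[Tuple[float, int, Any]] = []

    def update(self, frame_index: int, frame: Any, score: float) -> None:
        """Offer a new frame; it is kept only if it beats the weakest entry."""
        entry = (score, frame_index, frame)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def frames(self) -> List[Any]:
        """Return the retained frames in temporal order."""
        return [frame for _, _, frame in sorted(self._heap, key=lambda e: e[1])]
```

Each update costs O(log capacity), and the retained frames could then be passed as conditioning to the frozen generator for the next chunk.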

2605.21363 2026-06-15 cs.CL Updated version

"I Didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Eunsu Kim, Jessica R. Mindel, Kyungjin Kim, Sherry Tongshuang Wu

Affiliations KAIST; Carnegie Mellon University; Seoul National University

AI summary Proposes the CoTrace framework for measuring and exposing goal-level AI contributions in collaboration. Finds that models contribute only modestly to goal shaping but substantially to introducing concrete requirements and exerting indirect influence, and that interaction design affects model behavior.

英文摘要

As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

2605.21182 2026-06-15 cs.CL cs.AI cs.CV 版本更新

Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Manga109-v2026: 重新审视Manga109标注以适应现代漫画理解

Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa

发表机构 * University of Tokyo(东京大学)

AI总结 本文重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录不准确、缺失文本区域、对话与拟声词重叠以及分割不足的对话气泡,并通过结合基于OCR的问题检测和人工修订构建Manga109-v2026,修订了约29,000个对话标注,使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留漫画特有的表达结构。

Comments Accepted to the Culture x AI Workshop at ICML 2026. Project page: https://manga109.github.io/manga109-project-website/en/

详情
AI中文摘要

漫画是一种具有文化特色的多模态媒介,是日本流行文化中最具影响力的形态之一。随着AI系统越来越多地针对漫画理解、OCR和翻译进行研究,Manga109已成为漫画相关AI研究的基础数据集。然而,当前的Manga109数据集包含转录错误和粗略的标注,这与现代OCR和多模态漫画理解任务不匹配。在本工作中,我们重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录错误、缺失文本区域、对话与拟声词重叠以及未分割的对话气泡。为了解决这些问题,我们结合基于OCR的问题检测和人工修订,构建了Manga109-v2026,修订了大约29,000个对话标注。我们的修订使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留了漫画特有的表达结构。

英文摘要

Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains inaccurate transcriptions and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including inaccurate transcriptions, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

2605.21006 2026-06-15 cs.AI cs.CL cs.LG 版本更新

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

扮演魔鬼的代言人:现成的人格向量在顺从性上与针对性引导相媲美

Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary

发表机构 * University of Toronto(多伦多大学) Princeton University(普林斯顿大学) Purdue University(普渡大学) EPFL(瑞士联邦理工学院) Algoverse Independent(独立)

AI总结 本文研究了不同人格对顺从性的影响,发现现成的人格引导向量在减少顺从性方面与针对性引导相当,且在用户正确时保持准确性。

Journal ref ICML, Pluralistic Alignment Workshop, 2026

详情
AI中文摘要

我们研究了不同人格对顺从性的影响:模型在用户错误时仍同意用户。标准缓解方法,对比激活添加(CAA),从顺从性和诚实响应的标记对中推导出引导方向。本研究评估了现成的人格引导向量是否能作为替代方案,这些向量最初是为一般角色扮演开发的,且未在顺从性数据上训练。在两个指令微调模型中,引导至以怀疑或审查为特征的人格可将顺从性减少到CAA效果的约68%和98%,且不同于CAA,在用户正确时保持准确性。效果也是不对称的:引导至顺从的人格不会产生镜像增加的顺从性。几何上,人格向量在激活空间的方向上与顺从性方向基本无关。总体而言,这些发现表明,顺从性应被视为人格层面的属性,而非单一可引导方向。我们在此发布代码:https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

英文摘要

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

2605.18848 2026-06-15 cs.LG cs.AI 版本更新

Exact Linear Attention

精确线性注意力

Weinuo Ou

发表机构 * GitHub

AI总结 本文提出精确线性注意力(ELA),通过利用核函数的精确分解性质,实现Transformer注意力的线性计算复杂度,消除近似误差。针对先前线性注意力的两个关键限制(梯度爆炸和token注意力稀释),提出核约束以确保非负性、判别性和几何可解释性。此外,本文还提出了三种工程创新,包括Hyper-Link结构、Memory Lobe模块和基于路由分数的MoE偏置机制。实验结果表明,与全注意力相比,ELA的解码速度最高提升6倍,KV缓存内存占用减少75%,同时训练性能相当或更优。

Comments 9 pages, 19 figures, journal

详情
AI中文摘要

本文介绍精确线性注意力(ELA),一种通过利用核函数的精确分解性质,实现Transformer注意力线性计算复杂度的机制,从而消除近似误差。我们识别并解决了先前线性注意力的两个关键限制(梯度爆炸和token注意力稀释),方法是施加核约束,确保非负性、判别性和几何可解释性。提出了几种核函数,包括Hadamard Exp核、求和平方欧几里得距离核和减法平方欧几里得距离核,每种都针对特定的注意力行为而设计。除了核心注意力公式之外,本文还提出了三种工程创新:(1)Hyper-Link结构,用以替代传统残差连接以缓解梯度退化;(2)基于双向线性注意力的Memory Lobe模块,捕捉跨层的“转换流”以实现定性记忆和隐式强化学习范式;(3)基于路由分数的MoE偏置机制,以提高可解释性和语义对齐。实验结果表明,与全注意力相比,ELA的解码速度最高提升6倍,KV缓存内存占用减少75%,同时训练性能相当或更优。所提出的记忆模块加速了收敛并增强了泛化能力。此外,我们还将线性注意力原理扩展到视觉模型,得到YOLO-LAT,其GPU推理最高加速4.3倍、参数量减少7.9倍,同时保持有竞争力的检测精度。这些结果表明,精确线性注意力在扩展Transformer模型以处理超长序列和高效视觉任务方面具有广泛的适用性。

英文摘要

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to improve interpretability and semantic alignment. Experimental results demonstrate that ELA achieves up to 6x faster decoding speed and 75% reduction in KV cache memory usage compared to full attention, while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to 4.3x GPU inference speedup and 7.9x parameter reduction with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and efficient visual tasks.
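As a rough illustration of the "exact decomposition" idea in the abstract above, the sketch below assumes a kernel that factorizes exactly as k(q, k) = phi(q) · phi(k) and checks that reordering the matrix products reproduces the quadratic-cost output at linear cost in sequence length. The feature map `phi` (an element-wise exponential), the non-causal setting, and all function names are assumptions made for illustration; they are not the paper's Hadamard Exp Kernel or its implementation.

```python
import numpy as np

def phi(x):
    # Hypothetical positive feature map, chosen only so attention weights stay non-negative.
    return np.exp(x)

def quadratic_attention(Q, K, V):
    # O(n^2): build the full n x n kernel matrix, then normalize each row.
    W = phi(Q) @ phi(K).T
    return (W @ V) / W.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    # O(n): same result, obtained by computing phi(K)^T V before touching the queries.
    KV = phi(K).T @ V          # (d, d_v) summary of keys and values
    Z = phi(K).sum(axis=0)     # (d,) summary used for normalization
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

rng = np.random.default_rng(0)
n, d, d_v = 64, 8, 5
Q, K, V = (rng.standard_normal((n, s)) * 0.5 for s in (d, d, d_v))

# Largest element-wise difference between the two computations.
max_gap = np.abs(quadratic_attention(Q, K, V) - linear_attention(Q, K, V)).max()
```

Because the kernel factorizes exactly, the two functions differ only by floating-point rounding; no approximation of softmax is involved in this toy setting.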

2602.00593 2026-06-15 cs.CV cs.LG 版本更新

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Pix2Fact: 当视觉不够时——基于网络验证的细粒度VQA基准测试

Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

发表机构 * GADE Union (Global AI Data Experts Union)(GADE联盟(全球人工智能数据专家联盟)) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) New York University(纽约大学) Cambridge University(剑桥大学) The University of Hong Kong(香港大学)

AI总结 本文提出Pix2Fact基准测试,通过高分辨率真实场景中的网络验证,评估细粒度视觉问答中的专家级视觉感知和知识搜索能力,发现现有模型在复杂任务中存在显著不足。

详情
AI中文摘要

尽管在通用任务上取得了进展,视觉-语言模型(VLMs)在同时需要细粒度视觉定位和外部知识的任务上仍然面临困难,而现有基准只孤立地评估这些能力,忽视了二者的协同。为填补这一空白,我们引入Pix2Fact,一个视觉问答基准测试,旨在评估专家级视觉感知和知识搜索能力。Pix2Fact包含1000张高分辨率(4K+)图像,覆盖八个场景。其问题和答案由来自全球顶尖大学、不同学科的拥有博士学位的标注者精心设计。每个问题都需要详细的视觉定位和外部知识的整合。我们评估了十种最先进的VLMs,包括Gemini-3.1-Pro和GPT-5.4等专有模型,发现Pix2Fact对模型提出了严峻挑战:最先进的模型(Gemini-3.1-Pro)即使可以使用视觉真值(ground truth)和搜索工具,平均准确率也仅为51.7%。我们的分析将低准确率归因于三个因素:即使提供视觉真值仍频繁出现的视觉定位错误、对搜索的浅层利用,以及VLM无法检索长尾、非结构化的本地信息。这种显著的差距暴露了当前模型在帮助人类处理需要高强度视觉理解的现实场景时的局限性。我们相信Pix2Fact将成为推动下一代语言-视觉智能体的关键基准测试,这些智能体能够无缝整合细粒度感知与稳健的知识搜索。

英文摘要

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors: frequent visual grounding errors even with visual ground truth, shallow use of search, and VLMs' inability to retrieve long-tail, unstructured local information. This striking gap exposes the limitations of current models in assisting humans with real-world scenarios that demand overwhelming visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of language-vision agents that seamlessly integrate fine-grained perception with robust knowledge search.

2605.17779 2026-06-15 cs.LG 版本更新

Learning Variable-Length Tokenization for Generative Recommendation

面向生成式推荐的可变长度分词学习

Minhao Wang, Bowen Wu, Wei Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出VarLenRec框架,通过Popularity-Weighted Information Budget Allocation方法解决生成推荐中可变长度分词问题,提升推荐准确性和效率。

Comments 14 pages, 5 figures

详情
AI中文摘要

生成推荐将推荐问题重新表述为离散语义标识符(ID)的下一个标记预测。一个基本但未被探索的设计选择是现有方法对所有项目使用固定长度分词,隐含假设编码能力在项目特性上是均匀的。通过系统地在四个数据集上进行实验,我们发现流行度-长度悖论:流行项目在短ID下表现最佳,而尾部项目需要显著更长的代码来捕捉判别性语义。这揭示了一个关键不匹配:流行项目受益于丰富的协同信号并需要最小的语义细节,而尾部项目必须依赖于细粒度的内容特征,因为交互数据稀疏。为了解决这个问题,我们提出了VarLenRec,一个学习可变长度分词的框架。我们开发了流行度加权信息预算分配(PIBA),一个信息论框架证明最优ID长度应与流行度的负幂成比例。直接实现可变长度分配面临两个技术挑战:标准欧几里得残差量化缺乏支持不同代码长度的几何容量,而离散长度决策是非可微的。我们通过双曲残差量化解决这些问题,该方法利用庞加莱球的指数体积增长来自然分层编码能力,并通过软长度控制器实现可微长度预测,通过连续层保留概率正则化由PIBA推导出的先验。广泛的实验表明,VarLenRec在推荐准确性和训练/推理效率上显著优于现有最先进方法,揭示了自适应编码能力在生成推荐中的重要性。

英文摘要

Generative recommendation reformulates recommendation as next-token prediction over discrete semantic identifiers (IDs). A fundamental yet unexplored design choice is that existing methods employ fixed-length tokenization for all items, implicitly assuming uniform encoding capacity regardless of item characteristics. Through systematic experiments across four datasets, we discover the Popularity-Length Paradox: popular items achieve optimal performance with short IDs, while tail items require substantially longer codes to capture discriminative semantics. This reveals a critical mismatch where popular items benefit from abundant collaborative signals and require minimal semantic detail, whereas tail items must rely on fine-grained content features due to sparse interaction data. To address this, we propose VarLenRec, a framework for learning variable-length tokenization. We develop Popularity-Weighted Information Budget Allocation (PIBA), an information-theoretic framework proving that optimal ID length should scale as a negative power of popularity. Directly implementing variable-length allocation faces two technical challenges: standard Euclidean residual quantization lacks geometric capacity to support diverse code lengths without distortion, and discrete length decisions are non-differentiable. We address these through Hyperbolic Residual Quantization, which leverages the exponential volume growth of the Poincaré ball to naturally stratify encoding capacity, and a Soft Length Controller, which enables differentiable length prediction via continuous layer retention probabilities regularized by PIBA-derived priors. Extensive experiments demonstrate that VarLenRec achieves significant improvements over state-of-the-art methods in recommendation accuracy and training/inference efficiency, revealing the importance of adaptive encoding capacity in generative recommendation.
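To make the claim that ID length "should scale as a negative power of popularity" concrete, here is a minimal sketch of such an allocation rule. The exponent `alpha`, the `scale` constant, the length bounds, and the item counts are all hypothetical values chosen for illustration; the abstract does not state the constants PIBA derives, and this is not the learned Soft Length Controller.

```python
import numpy as np

def id_length(popularity, alpha=0.5, scale=8.0, min_len=1, max_len=8):
    # Length proportional to popularity^(-alpha), rounded and clipped to a valid range.
    pop = np.asarray(popularity, dtype=float)
    raw = scale * pop ** (-alpha)
    return np.clip(np.rint(raw), min_len, max_len).astype(int)

# Made-up interaction counts, from a head item down to a tail item.
counts = np.array([10000, 400, 25, 1])

# Popularity is expressed relative to the rarest item, so the tail item gets the longest ID.
lengths = id_length(counts / counts.min())
```

Under these assumed constants, head items collapse to the minimum length while the rarest item receives the full budget, which mirrors the Popularity-Length Paradox described in the abstract.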

2504.20734 2026-06-15 cs.CL cs.AI cs.CV cs.IR cs.LG 版本更新

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG: 在多样模态和粒度的语料库上实现检索增强生成

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出UniversalRAG,一种能够处理多种模态和粒度的检索增强生成框架,通过动态路由机制和多粒度组织,提升跨模态知识检索的有效性,实验表明其在多个模态基准上的优越性。

Comments ACL 2026. Project page: https://universalrag.github.io

详情
AI中文摘要

检索增强生成(RAG)通过将外部相关知识与查询绑定,显著提升了事实准确性。然而,现有方法多局限于文本语料,尽管最近有尝试扩展到图像、视频等模态,但通常仅针对单一模态语料。相比之下,现实中的查询所需知识类型多样,单一知识源无法满足。为此,我们引入UniversalRAG,一种any-to-any RAG框架,旨在从异构源中检索和整合多样模态和粒度的知识。具体而言,受强制所有模态进入单一聚合语料的统一表示空间导致模态间隙的观察启发,我们提出模态感知路由,动态识别最合适的模态特定语料并执行针对性检索,并通过理论分析证明其有效性。此外,除模态外,我们对每个模态组织为多个粒度层级,实现针对查询复杂性和范围的精细检索。我们验证UniversalRAG在10个多种模态基准上的性能,显示其优于各种模态特定和统一基线。

英文摘要

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.

2605.16739 2026-06-15 cs.LG cs.AI cs.CL q-bio.NC 版本更新

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind:从人类大脑fMRI信号解码情感描述

Bilal A. Mohammed, Lin Gu, Ruogu Fang

发表机构 * Department of Biomedical Engineering(生物医学工程系) Vanderbilt University(范德比大学) Research Institute of Electrical Communication(电气通信研究所) Tohoku University(东北大学) University of Florida(佛罗里达大学)

AI总结 本文提出EmoMind,首个端到端解码fMRI信号生成情感描述的系统,通过结合语义基础的中性场景描述和连续情感向量,实现了在内容保留与情感表达间的平衡,并在多个验证框架下优于基于标签提示的GPT-4。

详情
AI中文摘要

从大脑活动解码视觉体验已取得显著进展,但当前的脑到文本系统主要恢复语义内容,而丢弃了情感信息。此外,语言模型在以类别标签作为提示时可以生成情感文本,但此类标签会把丰富的跨受试者差异压缩成粗糙的离散类别。我们提出了EmoMind,首个直接从fMRI信号解码情感描述的端到端流程。EmoMind首先根据脑解码得到的视觉特征检索出有语义依据的中性场景描述,然后使用从同一fMRI记录中解码出的连续34维情感向量重写该描述。为了控制内容保留与情感表达之间的平衡,我们采用无分类器引导(classifier-free guidance)训练重写器,并以保持原句不变的空分支作为对照,从而在语义忠实性和情感表达性之间实现平滑插值。我们通过涵盖受试者特异性、结构几何和因果控制的三轴验证框架评估情感描述生成。我们进一步用合成大脑替换测试扩充该框架,以探测对测量设备的鲁棒性,并在每个轴上与以脑解码的前五个情感标签作为提示的GPT-4这一强离散基线进行比较。在两个独立的情感fMRI数据集上,EmoMind在所有三个轴上均显著优于标签提示的GPT-4,其中最大的增益出现在需要个体特定情感结构而非群体层面情绪聚合的指标上。这些结果确立了连续的脑解码情感可作为个性化情感描述生成的可行控制信号,并为研究个体情感大脑组织开辟了新方向。

英文摘要

Decoding visual experience from brain activity has advanced substantially, but current brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with categorical labels, but such labels collapse rich inter-subject variability into coarse discrete bins. We present EmoMind, the first end-to-end pipeline for decoding affective captions directly from fMRI signals. EmoMind first retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. To control the balance between content preservation and affective expression, we train the rewriter with classifier-free guidance against an identity-preserving null branch, enabling smooth interpolation between semantic fidelity and affective expressivity. We evaluate affective caption generation with a three-axis validation framework spanning subject-specificity, structural geometry, and causal control. We further augment this framework with a synthetic-brain substitution test that probes robustness to the measurement apparatus, and we benchmark each axis against GPT-4 prompted with brain-decoded top-5 emotion labels as a strong discrete baseline. Across two independent emotion fMRI datasets, EmoMind significantly outperforms label-prompted GPT-4 on all three axes, with the largest gains on metrics that require person-specific affective structure rather than population-level emotion aggregation. These results establish continuous brain-decoded affect as a viable control signal for individualized affective caption generation and open new directions for studying individual affective brain organisation.
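The "smooth interpolation" that classifier-free guidance provides can be sketched with the standard guidance formula, which moves from a null branch toward a conditional branch as a scale grows. Applying it to next-token logits, and the example numbers, are assumptions for illustration; the abstract does not say at which point in the rewriter EmoMind applies guidance.

```python
import numpy as np

def guided_logits(cond_logits, null_logits, guidance_scale):
    # scale 0 -> null branch only; scale 1 -> conditional branch; scale > 1 extrapolates.
    cond = np.asarray(cond_logits, dtype=float)
    null = np.asarray(null_logits, dtype=float)
    return null + guidance_scale * (cond - null)

# Hypothetical logits over a three-token vocabulary.
null_logits = np.array([2.0, 0.5, -1.0])   # branch that keeps the neutral caption
cond_logits = np.array([0.5, 2.5, -1.0])   # branch conditioned on the emotion vector

# Sweep the scale to move between content preservation and affective expression.
blend = {w: guided_logits(cond_logits, null_logits, w) for w in (0.0, 0.5, 1.0)}
```

A single scalar therefore controls the trade-off at inference time, without retraining either branch.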

2605.14998 2026-06-15 cs.AI cs.SY eess.SY q-bio.QM 版本更新

Learning Developmental Scaffoldings to Guide Self-Organisation

学习发育支架以引导自组织

Milton L. Montero, Elias Najarro, Jakob Schauser, Sebastian Risi

发表机构 * IT University of Copenhagen(丹麦哥本哈根信息技术大学) University of Copenhagen(丹麦哥本哈根大学) Sakana AI

AI总结 本文研究了通过学习自组织规则和预模式共同作用来提升发育过程的鲁棒性、编码能力和对称性打破。

Comments 8 pages + acknowledgements and references, 5 figures. Camera-ready version for ALife 2026

详情
AI中文摘要

从亚细胞结构到整个生物体,许多自然系统通过自组织生成复杂结构:局部相互作用共同产生全局结构,而无需任何结果的蓝图。然而,推动此类过程的大量信息并非由自组织本身产生,而是常常转移到系统的初始条件中。生物发育是一个典型例子,其中母体的预模式编码位置和对称性打破信息,从而引导自组织过程。从早期胚胎发育中的母体形态发生素梯度到组织水平的形态发生预模式指导器官形成,这种信息转移到初始条件的现象,类似于计算系统中的记忆-计算权衡,是发育过程的基本部分。在本文中,我们通过引入一个模型来研究这种信息转移现象,该模型同时学习自组织规则和预模式,允许其相互作用在受控条件下进行变化和测量:一个神经细胞自动机(NCA)配对一个学习基于坐标的模式生成器(SIREN),两者同时训练以生成一组模式。我们提供了信息论分析,探讨信息如何在预模式和自组织过程之间分布,并展示联合学习两者可提高鲁棒性、编码能力和对称性打破,相较于纯自组织替代方案。进一步分析表明,有效的预模式不简单地近似其目标;而是通过偏转发育动力学的方式促进收敛,指出了初始条件结构与自组织动力学之间非平凡的关系。

英文摘要

From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself; instead, it is often offloaded to the initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.

2605.11558 2026-06-15 cs.LG stat.ML 版本更新

A Composite Activation Function for Learning Stable Binary Representations

一种用于学习稳定二进制表示的复合激活函数

Seokhun Park, Choeun Kim, Kwanho Lee, Sehyun Park, Insung Kong, Yongdai Kim

发表机构 * Department of Statistics(统计学系) Seoul National University(首尔国立大学) Department of Applied Mathematics(应用数学系) University of Twente(特温特大学)

AI总结 本文提出HTAF复合激活函数,通过平滑近似Heaviside函数实现稳定训练,适用于Spiking神经网络等模型,并引入ICBMs模型实现可解释的图像处理。

Comments 32 pages

详情
AI中文摘要

激活函数在神经网络中通过塑造内部表示起核心作用。最近,学习二进制激活表示因其在计算和内存效率以及可解释性方面的优势而受到广泛关注。然而,使用Heaviside激活函数训练神经网络仍具挑战性,因其非可导性阻碍了标准梯度优化。本文提出Heavy Tailed Activation Function (HTAF),一种Heaviside函数的平滑近似,使基于梯度的优化能够稳定训练。我们构造HTAF为sigmoid双曲正切复合函数,并理论证明其在零输入附近保持大梯度质量,同时在尾部区域表现出更慢的梯度衰减。我们展示Spiking神经网络、二进制神经网络和深度Heaviside神经网络可以使用HTAF稳定训练。最后,我们引入隐式概念瓶颈模型(ICBMs),一种利用HTAF诱导离散特征表示的可解释图像模型。在各种架构和图像数据集上的广泛实验表明,ICBMs能够稳定地实现离散化,同时预测性能与标准模型相当或更好。

英文摘要

Activation functions play a central role in neural networks by shaping internal representations. Recently, learning binary activation representations has attracted significant attention due to their advantages in computational and memory efficiency, as well as interpretability. However, training neural networks with Heaviside activations remains challenging, as their non-differentiability obstructs standard gradient-based optimization. In this paper, we propose the Heavy Tailed Activation Function (HTAF), a smooth approximation to the Heaviside function that enables stable training with gradient-based optimization. We construct HTAF as a composite of sigmoid and hyperbolic tangent functions and theoretically show that it maintains a large gradient mass around zero inputs while exhibiting slower gradient decay in the tail regions. We show that Spiking Neural Networks, Binary Neural Networks, and Deep Heaviside Neural Networks can be trained stably using HTAF with gradient-based optimization. Finally, we introduce Implicit Concept Bottleneck Models (ICBMs), an interpretable image model that leverages HTAF to induce discrete feature representations. Extensive experiments across various architectures and image datasets demonstrate that ICBMs enable stable discretization while achieving prediction performance comparable to or better than standard models.

2605.11378 2026-06-15 cs.CL 版本更新

An Empirical Study of Automating Agent Evaluation

自动化代理评估的实证研究

Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong

发表机构 * AWS AI Labs(AWS人工智能实验室)

AI总结 本文研究了自动化代理评估的可行性,提出EvalAgent系统,通过编码技能和领域知识提升评估效率,实验显示其在评估准确性和人类偏好上显著优于基线方法。

详情
AI中文摘要

代理评估需要评估复杂的多步骤行为,涉及工具使用和中间推理,这使其成本高昂且需要专业知识。一个自然的问题是:前沿编码助手能否可靠地自动化这一评估过程?我们的研究表明,仅仅提示编码助手是不够的。没有领域特定的评估知识,前沿编码助手仅能达到30%的执行成功率,并产生平均每个代理12+个指标的过度工程化评估,表明强大的编码能力并不自动转化为可靠的代理评估能力。我们引入EvalAgent,一种自动化端到端代理评估流程的AI助手。EvalAgent将评估领域专业知识编码为评估技能(程序指令、可重用代码和模板、以及动态检索的API文档),这些技能组成基于跟踪的流程,生成完整的评估成果,包括指标、可执行代码和报告。为了系统评估生成的评估,我们引入了一个元评估框架和AgentEvalBench基准,该基准包含20个代理,每个代理配对评估要求和测试场景。我们进一步提出了Eval@1指标,以衡量生成的评估代码是否在首次运行时既执行又产生有意义的结果。我们的实验显示,EvalAgent生成的评估更加聚焦,将Eval@1从17.5%提升到65%,并在人类专家偏好上达到79.5%的优势。进一步的消融研究显示,评估技能对于处理复杂评估至关重要:移除它们会使Eval@1显著从65%降至30%。

英文摘要

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
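The Eval@1 metric described above can be read as the fraction of agents whose generated evaluation code both executes and yields meaningful results on the first run. The sketch below follows that reading; the `FirstRun` record, the way "meaningful" is judged (treated here as an externally supplied boolean), and the example counts are hypothetical and are not the paper's data or implementation.

```python
from dataclasses import dataclass

@dataclass
class FirstRun:
    executed: bool     # generated evaluation code ran to completion on the first attempt
    meaningful: bool   # results were judged meaningful (judging procedure assumed external)

def eval_at_1(runs):
    # Fraction of agents whose first run both executes and is meaningful.
    if not runs:
        raise ValueError("need at least one run")
    return sum(r.executed and r.meaningful for r in runs) / len(runs)

# Made-up outcomes for five agents.
runs = [
    FirstRun(True, True),
    FirstRun(True, True),
    FirstRun(True, True),
    FirstRun(True, False),    # ran, but the output was not meaningful
    FirstRun(False, False),   # failed to execute
]
score = eval_at_1(runs)
```

Requiring both conditions is what separates this reading of Eval@1 from a plain execution success rate: code that runs but reports empty or degenerate metrics does not count.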

2602.23638 2026-06-15 cs.LG cs.AI 版本更新

FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA

FedRot-LoRA: 缓解联邦LoRA中的旋转偏移

Haoran Zhang, Dongjun Kim, Seohyeon Cha, Haris Vikalo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出FedRot-LoRA框架,通过正交变换对齐客户端更新以减少子空间不匹配,提升联邦LoRA在异质数据下的性能。

Comments ICML 2026

详情
AI中文摘要

联邦LoRA提供了一种高效的通信机制用于在去中心化数据上微调大语言模型。然而,因子加权平均与数学上正确的本地更新聚合之间的不一致会导致显著的聚合误差和不稳定的训练。本文认为,主要问题是由于低秩因子化旋转不变性导致的旋转偏移,即不同客户端的潜在子空间中,语义等价的更新可以以不同的形式表示。当这些不一致的因子直接平均时,会产生破坏性干扰,降低全局更新质量。为此,本文提出FedRot-LoRA框架,在聚合前通过正交变换对齐客户端更新,从而在不增加通信成本或限制模型表达能力的情况下,保持语义更新并减少跨客户端子空间不匹配。本文提供了收敛性分析,研究了因子加权平均引起的聚合误差,并展示了旋转对齐如何提供更紧的误差上界。在自然语言理解和生成任务上的广泛实验表明,FedRot-LoRA在各种异质性和LoRA秩水平下均优于现有联邦LoRA基线。

英文摘要

Federated LoRA provides a communication-efficient mechanism for fine-tuning large language models on decentralized data. In practice, however, a discrepancy between the factor-wise averaging used to preserve low rank and the mathematically correct aggregation of local updates can cause significant aggregation error and unstable training. We argue that a major source of this problem is rotational misalignment, arising from the rotational invariance of low-rank factorizations -- semantically equivalent updates can be represented in different latent subspaces across clients since $(B_i R_i)(R_i^\top A_i) = B_i A_i$. When such misaligned factors are averaged directly, they interfere destructively and degrade the global update. To address this issue, we propose FedRot-LoRA, a federated LoRA framework that aligns client updates via orthogonal transformations prior to aggregation. This alignment preserves the semantic update while reducing cross-client subspace mismatch, without increasing communication cost or restricting model expressivity. We provide a convergence analysis that examines the aggregation error induced by factor-wise averaging and shows how rotational alignment yields a tighter upper bound on this error. Extensive experiments on natural language understanding and generative tasks demonstrate that FedRot-LoRA consistently outperforms existing federated LoRA baselines across a range of heterogeneity levels and LoRA ranks.
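The identity $(B_i R_i)(R_i^\top A_i) = B_i A_i$ and its effect on factor-wise averaging can be checked numerically. The sketch below builds two clients that hold the same update in rotated factorizations, averages their factors directly, and then repeats the average after aligning one client to the other. The alignment step uses an orthogonal Procrustes solution, which is an assumption standing in for the paper's alignment procedure; the shapes and names are likewise hypothetical.

```python
import numpy as np

def random_orthogonal(r, rng):
    # Orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((r, r)))
    return q

def procrustes_rotation(src, ref):
    # Orthogonal R minimizing ||src @ R - ref||_F (orthogonal Procrustes problem).
    u, _, vt = np.linalg.svd(src.T @ ref)
    return u @ vt

rng = np.random.default_rng(0)
d, k, r = 16, 12, 4
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, k))

# Client 2 holds the same update, expressed in a rotated latent basis.
R = random_orthogonal(r, rng)
B2, A2 = B1 @ R, R.T @ A1
same_update = np.allclose(B2 @ A2, B1 @ A1)

# The correct aggregate is the mean of the products, which here equals B1 @ A1.
target = 0.5 * (B1 @ A1 + B2 @ A2)

# Naive factor-wise averaging mixes misaligned bases.
naive_err = np.linalg.norm((0.5 * (B1 + B2)) @ (0.5 * (A1 + A2)) - target)

# Align client 2 to client 1 before averaging; the product B2 @ A2 is unchanged.
R_hat = procrustes_rotation(B2, B1)
B2a, A2a = B2 @ R_hat, R_hat.T @ A2
aligned_err = np.linalg.norm((0.5 * (B1 + B2a)) @ (0.5 * (A1 + A2a)) - target)
```

In this idealized case the alignment recovers the rotation exactly, so the aligned error is at rounding level while the naive error is not; with genuinely different client updates, alignment would reduce rather than eliminate the gap.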

2605.09420 2026-06-15 cs.CV cs.AI cs.MM 版本更新

Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

关系检索:利用已知-新颖相互作用进行通用类别发现

Yulin Xu, Chunqi Guo, Yuanzhen Shuai, Jianyuan Ni

发表机构 * University of California, Irvine(加州大学尔湾分校) Sichuan Agricultural University(四川农业大学) University College London(伦敦大学学院) Juniata College(朱尼亚塔学院)

AI总结 本文通过关系检索视角解决通用类别发现问题,提出关系模式一致性方法,通过双向知识转移增强已知类别和新类别发现,实验表明在通用和细粒度基准上均取得最佳性能。

Comments Accepted by ICMR 2026 (Oral)

详情
AI中文摘要

在本研究中,我们从关系检索的视角解决通用类别发现(GCD)问题,通过双向知识转移显式耦合有标注和无标注数据。现有方法将这两类数据分开处理,错过了有价值的交互机会;为此我们提出关系模式一致性(RPC),使两者相互增强。RPC使用一对多(One-vs-All)分类器进行软ID/OOD分解,然后引入两种机制:(i)为保持已知类别,我们迁移语义行为对齐;(ii)为发现新类别,我们利用同一类别的样本与已知类别原型之间保持不变关系这一洞察,将不可靠的伪标签转化为定义明确的关系模式匹配。这种双向设计使有标注数据能够指导无标注学习,同时通过集体关系特征发现新类别。大量实验表明,RPC在通用和细粒度基准上均取得了最先进的性能。

英文摘要

In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.

2604.17892 2026-06-15 cs.LG cs.AI 版本更新

LEPO: Latent Reasoning Policy Optimization for Large Language Models

LEPO:面向大语言模型的潜在推理策略优化

Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin

发表机构 * Tencent(腾讯) Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 LEPO通过引入Gumbel-Softmax在大语言模型中实现可控的随机性,提升其探索能力与强化学习兼容性,通过直接在连续潜在表示上应用强化学习,显著优于现有方法。

详情
AI中文摘要

近年来,潜在推理被引入大语言模型(LLMs)以利用连续空间中的丰富信息。然而,缺乏随机采样时,这些方法不可避免地退化为确定性推理,无法发现多样的推理路径。为弥合这一差距,我们通过Gumbel-Softmax在潜在推理中注入可控的随机性,恢复LLMs的探索能力并增强其与强化学习(RL)的兼容性。在此基础上,我们提出LEPO,一种将强化学习直接应用于连续潜在表示的新框架。具体而言,在回放阶段,LEPO保持随机性以实现多样化的轨迹采样;在优化阶段,LEPO为潜在表示和离散令牌构建统一的梯度估计。大量实验表明,LEPO在离散和潜在推理方面显著优于现有RL方法。

英文摘要

Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
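A minimal sketch of how Gumbel-Softmax can inject stochasticity into a continuous latent step: Gumbel noise perturbs the token logits, a temperature-scaled softmax turns them into mixture weights, and the latent input is the weighted sum of token embeddings. The vocabulary size, temperature, and embedding table are hypothetical, and mixing embeddings this way is an assumed reading of "latent reasoning", not LEPO's confirmed implementation.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    # Add Gumbel(0, 1) noise to the logits, then apply a temperature-scaled softmax.
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))
    z = (np.asarray(logits, dtype=float) + g) / temperature
    z -= z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab, dim = 6, 4
logits = rng.standard_normal(vocab)
embeddings = rng.standard_normal((vocab, dim))   # hypothetical token embedding table

# Each draw gives a different soft mixture of embeddings, so repeated rollouts diverge.
samples = [gumbel_softmax(logits, 0.7, rng) @ embeddings for _ in range(3)]
```

Lower temperatures push the weights toward a one-hot choice, while higher temperatures give smoother mixtures; without the noise term, every rollout would produce the same latent vector.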

2605.08270 2026-06-15 cs.CV cs.AI 版本更新

SAFformer: Improving Spiking Transformer via Active Predictive Filtering

SAFformer:通过主动预测滤波改进脉冲Transformer

Zequan Xie, Weiming Zeng, Yunhua Chen, Sichang Ling, Tongyang Chen, Jinsheng Xiao

发表机构 * School of Computer Science and Technology, Guangdong University of Technology(广东工业大学计算机科学与技术学院) Faculty of Science, Hong Kong Baptist University(香港浸会大学理学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院)

AI总结 提出基于主动预测滤波的脉冲Transformer架构SAFformer,通过抑制可预测信号并聚焦显著视觉特征,在CIFAR-10/100和CIFAR10-DVS上取得新的最优性能,并在ImageNet-1K上以26.58M参数和5.88 mJ能耗达到80.44%的Top-1准确率。

Comments IJCAI 2026(International Joint Conference on Artificial Intelligence)

详情
AI中文摘要

脉冲神经网络(SNNs)在生物合理性和能效方面具有显著优势,使其成为构建低功耗Transformer的有前景的候选方案。然而,现有的脉冲Transformer主要遵循被动反应范式,难以聚焦于任务相关信息,并且在处理冗余视觉数据时会产生大量计算开销。为了克服这一基础但尚未充分探索的局限性,我们提出了SAFformer,一种基于主动预测滤波范式的新型脉冲Transformer架构。受大脑预测编码机制的启发,SAFformer主动抑制可预测信号并聚焦于显著视觉特征。大量实验表明,SAFformer在CIFAR-10/100和CIFAR10-DVS上建立了新的最先进性能。值得注意的是,在ImageNet-1K上,它仅用26.58M参数和5.88 mJ的能耗就达到了80.44%的Top-1准确率,展现了精度与效率之间的卓越平衡。

英文摘要

Spiking Neural Networks (SNNs) offer notable advantages in biological plausibility and energy efficiency, making them promising candidates for building low-power Transformers. However, existing Spiking Transformers largely adhere to a passive reactive paradigm, which struggles to focus on task-relevant information and incurs substantial computational overhead when processing redundant visual data. To overcome this fundamental yet underexplored limitation, we propose SAFformer, a novel Spiking Transformer architecture based on an active predictive filtering paradigm. Inspired by the brain's predictive coding mechanism, SAFformer actively suppresses predictable signals and focuses on salient visual features. Extensive experiments show that SAFformer establishes new state-of-the-art performance on CIFAR-10/100 and CIFAR10-DVS. Remarkably, on ImageNet-1K, it achieves 80.44% Top-1 accuracy with only 26.58M parameters and an energy consumption of 5.88 mJ, demonstrating an exceptional balance between accuracy and efficiency.

2605.07984 2026-06-15 cs.LG cs.AI 版本更新

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

计划在哪里?通过轻量级机制干预定位语言模型中的潜在规划

Nicole Ma, Nick Rui

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过押韵对句补全任务,使用线性探针和激活修补方法,研究语言模型在生成过程中是否形成并因果依赖未来约束的潜在规划,发现仅Gemma-3-27B模型存在因果依赖,并定位到五个注意力头。

Comments 13 pages, 20 figures, 3 tables. Accepted to Workshop on Mechanistic Interpretability @ ICML 2026

详情
AI中文摘要

我们研究语言模型中规划位点的形成:在前向传播过程中,受结构约束的未来token的内部表示在何处形成,以及它们是否因果地驱动生成。我们以押韵对句补全作为前瞻性约束的干净测试,在Qwen3、Gemma-3和Llama-3的十多个规模上应用两种轻量级方法(线性探针和激活修补)。探针显示,未来押韵信息在行边界处线性可解码,且该信号在三个模型族中均随规模增强。激活修补揭示,只有Gemma-3-27B因果地依赖这种编码,并表现出一种交接:因果驱动因素在大约第30层从押韵词迁移到行边界。我们测试的其他所有模型在整个生成过程中都以押韵词为条件,尽管探针信号很强,行边界处的因果效应却接近零。通过两阶段路径修补,我们将Gemma-3-27B的交接定位到五个注意力头,这些头在换行符处恢复了约90%的押韵路由能力。

英文摘要

We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Through two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover ~90% of the rhyme-routing capacity at the newline.
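The gap between "decodable by a probe" and "causally used" can be made concrete with a toy activation-patching sketch. It is not the paper's models or procedure: the two-unit network, the function names `forward` and `patching_effect`, and the inputs are hypothetical. Hidden unit 1 copies its input exactly, so a linear probe could read it perfectly, yet the output ignores it, and patching that unit restores none of the clean behavior.

```python
def forward(x, patch=None):
    """Toy model: two ReLU hidden units; only hidden unit 0 feeds the output."""
    hidden = [max(0.0, x[0]), max(0.0, x[1])]
    if patch:
        # Overwrite selected hidden activations with externally supplied values.
        for unit, value in patch.items():
            hidden[unit] = value
    output = 2.0 * hidden[0] + 0.0 * hidden[1]
    return output, hidden


def patching_effect(clean_x, corrupt_x, unit):
    """Fraction of the clean-vs-corrupt output gap restored by patching one unit.

    Assumes the clean and corrupted outputs differ (otherwise the ratio is undefined).
    """
    clean_out, clean_hidden = forward(clean_x)
    corrupt_out, _ = forward(corrupt_x)
    # Run the corrupted input, but splice in the clean activation for `unit`.
    patched_out, _ = forward(corrupt_x, patch={unit: clean_hidden[unit]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


clean, corrupt = [1.0, 1.0], [0.0, 0.0]
effects = [patching_effect(clean, corrupt, unit) for unit in (0, 1)]
print(effects)  # [1.0, 0.0]
```

Unit 0 recovers the full effect while unit 1 recovers none, even though both units carry input information. This mirrors, in miniature, the pattern of strong probe signal with near-zero causal effect.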

2509.20086 2026-06-15 cs.CL 版本更新

OLaPh: Optimal Language Phonemizer

OLaPh: 最优语言音素化器

Johannes Wirth

发表机构 * Institute for Information Systems at Hof University of Applied Sciences(霍夫应用科学大学信息系统研究所)

AI总结 提出OLaPh混合框架,结合多语言词典、NLP技术和统计子词分割,在WikiPron基准上显著优于基线,并通过LLM合成语料探索神经泛化能力。

Comments 12 pages, 1 figure, 4 tables

详情
AI中文摘要

音素化是文本到语音合成中的关键组成部分。传统方法依赖确定性转换和词典,而神经方法在词表外(OOV)词上具有更高的泛化潜力。我们提出了OLaPh(最优语言音素化器),一个混合框架,将大规模多语言词典与先进的NLP技术和统计子词切分函数相结合。在WikiPron基准上的评估表明,OLaPh在整体准确率上显著优于已有基线,并通过高级回退机制在OOV数据上保持鲁棒性。为了进一步探索神经泛化,我们利用该框架为指令调优的大语言模型(LLM)合成高一致性训练语料。虽然确定性框架总体上仍然更准确,但LLM表现出强大的泛化能力,达到或部分超过了框架的性能。这表明LLM成功地从合成数据中内化了超越框架能力的语音直觉。这些工具共同为多语言字素到音素转换(G2P)研究提供了全面的开源资源。

英文摘要

Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. We introduce OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show OLaPh significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework's performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework's capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual grapheme-to-phoneme conversion (G2P) research.
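The general shape of a lexicon-first phonemizer with fallbacks for OOV words can be sketched as follows. This is not OLaPh's implementation: greedy longest-match splitting stands in for the statistical subword segmentation function named in the abstract, and the function `phonemize`, the three-entry lexicon, and the letter rules are toy assumptions chosen only to make the control flow visible.

```python
# Toy data, illustrative only; a real lexicon would be far larger.
LEXICON = {"sun": "sʌn", "flower": "flaʊər", "sea": "siː"}
LETTER_RULES = {"x": "ks", "c": "k"}


def phonemize(word, lexicon=LEXICON, letter_rules=LETTER_RULES):
    """Look the word up; on a miss, split it into known pieces, then letters."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]  # direct lexicon hit
    pieces = []
    i = 0
    while i < len(word):
        # Fallback 1: longest lexicon entry starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                pieces.append(lexicon[word[i:j]])
                i = j
                break
        else:
            # Fallback 2: per-letter rule, or the letter itself if no rule exists.
            pieces.append(letter_rules.get(word[i], word[i]))
            i += 1
    return "".join(pieces)


print(phonemize("sunflower"))  # concatenation of the "sun" and "flower" entries
```

Ordering the fallbacks from most to least specific keeps lexicon entries authoritative while still returning some output for unseen words.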