arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.07121 2026-06-15 cs.AI cs.LG updated version

AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn

Affiliations LG AI Research

AI summary Proposes AdaTKG, which maintains an adaptive memory for each entity and updates it with a learnable exponential moving average, addressing the problem of static entity representations in temporal knowledge graphs and improving reasoning performance.

Comments KDD Workshop on Frontiers in Graph Machine Learning for the Large Model Era 2026 (Oral Presentation)

Abstract

Temporal knowledge graphs (TKGs) represent time-stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per-entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is available at: https://github.com/seunghan96/AdaTKG
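The memory update described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `EntityMemory`, the zero initialization for unseen entities, and the sigmoid parameterization of the shared scalar are all assumptions made here for clarity, and the scalar is fixed rather than learned.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class EntityMemory:
    """Per-entity memory updated by an EMA with one shared decay scalar."""

    def __init__(self, dim, decay_logit=0.0):
        self.dim = dim
        # Assumption: a single scalar shared by all entities. In training it
        # would be learned; here it is a fixed number for illustration.
        self.decay_logit = decay_logit
        self.memory = {}

    def read(self, entity):
        # Unseen entities start from a zero memory, so no per-entity
        # parameter has to exist before the first interaction.
        return self.memory.get(entity, [0.0] * self.dim)

    def update(self, entity, interaction):
        alpha = sigmoid(self.decay_logit)  # keeps the EMA weight in (0, 1)
        old = self.read(entity)
        new = [alpha * m + (1.0 - alpha) * x for m, x in zip(old, interaction)]
        self.memory[entity] = new
        return new


mem = EntityMemory(dim=2)          # decay_logit=0.0 gives alpha = 0.5
mem.update("e1", [1.0, 0.0])       # memory for e1 becomes [0.5, 0.0]
mem.update("e1", [1.0, 2.0])       # memory for e1 becomes [0.75, 1.0]
```

Because the only trainable quantity in this sketch is the shared scalar, an entity that never appeared in training is handled by the same `update` call as any other.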

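2605.05983 2026-06-15 cs.LG updated version

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

Yuntai Bao, Qinfeng Li, Xinyan Yu, Ge Su, Wenqi Zhang, Liu Yan, Haiqin Weng, Jianwei Yin, Xuhong Zhang

Affiliations University of Science and Technology of China

AI summary Proposes jointly training steering factors and directions, which removes post-hoc factor selection; introduces the prompt-only steering vector (PrOSV), which intervenes on only a few prompt tokens, outperforms traditional full-sequence steering vectors on AxBench, and achieves a better trade-off between general model utility and adversarial robustness.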

Comments 63 pages, 50 figures; accepted by ICML 2026

Abstract

Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
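The difference between full-sequence and prompt-only steering can be illustrated with a small sketch. The function name `apply_steering` and the list-of-vectors representation of hidden states are hypothetical; in the joint training the abstract describes, `factor` and `direction` would both be trainable, whereas here they are plain numbers.

```python
def apply_steering(hidden, direction, factor, positions=None):
    """Add factor * direction to selected token positions of one layer.

    hidden: list of per-token vectors. positions=None steers every token
    (full-sequence); otherwise only the listed indices are steered
    (prompt-only, when the indices are prompt tokens).
    """
    targets = set(range(len(hidden)) if positions is None else positions)
    return [
        [h + factor * d for h, d in zip(vec, direction)] if i in targets else list(vec)
        for i, vec in enumerate(hidden)
    ]


# Toy hidden states: three prompt tokens followed by one generated token.
hidden = [[0.0, 0.0] for _ in range(4)]
direction = [1.0, 0.0]

full = apply_steering(hidden, direction, factor=2.0)
prompt_only = apply_steering(hidden, direction, factor=2.0, positions=[0, 1])
```

In the prompt-only call the generated token at index 3 is left untouched, which is the sense in which the intervention on generation is reduced.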

2512.03883 2026-06-15 cs.CV updated version

Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy

Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Julio Garcia-Aguilar, Joshua Jesse Smith, Harini Veeraraghavan

AI summary Proposes a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) that combines restaging and follow-up endoscopic images to distinguish clinical complete response from local regrowth, reaching 81.76% balanced accuracy on a held-out set of 62 patients.

Comments Accepted to ISBI 2026 conference (6 pages, 5 figures, 1 table)

Abstract

Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin Transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention is implemented to emphasize features from the paired scans without requiring any spatial alignment to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% $\pm$ 0.04), sensitivity (90.07% $\pm$ 0.08), and specificity (72.86% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning. Code and weights are available at: https://github.com/Jotanator/SSDCA

2605.05407 2026-06-15 cs.AI updated version

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

Mohamed Salim Aissi, Clemence Grislain, Clement Romac, Laure Soulier, Mohamed Chetouani, Olivier Sigaud, Nicolas Thome

Affiliations Institut National de la Recherche Scientifique (INRS)

AI summary Proposes the PRISM framework, which tightly couples a vision-language model (VLM) and a language model (LLM) through a dynamic question-answer pipeline to achieve task-driven perception, significantly outperforming existing image-based models on the ALFWorld and R2R benchmarks.

Abstract

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.

2510.17633 2026-06-15 cs.SD cs.CR updated version

SARSteer: Safeguarding Large Audio-Language Models via Safe-Ablated Refusal Steering

Weilin Lin, Jianze Li, Hui Xiong, Li Liu

Affiliations University of Science and Technology of China

AI summary Proposes SARSteer, the first inference-time defense framework for large audio-language models, which uses text-derived refusal steering and decomposed safe-space ablation to raise the refusal rate on harmful queries while reducing over-refusal on benign ones.

Abstract

Large Audio-Language Models (LALMs) are becoming essential as a powerful multimodal backbone for real-world applications. However, recent studies show that audio inputs can more easily elicit harmful responses than text, exposing new risks toward deployment. While safety alignment has made initial advances in LLMs and Large Vision-Language Models (LVLMs), we find that vanilla adaptation of these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input due to the large distributional gap between activations, and 2) prompt-based defenses induce over-refusals on benign-speech queries. To address these challenges, we propose Safe-Ablated Refusal Steering (SARSteer), the first inference-time defense framework for LALMs. Specifically, SARSteer leverages text-derived refusal steering to enforce rejection without manipulating audio inputs and introduces decomposed safe-space ablation to mitigate over-refusal. Extensive experiments demonstrate that SARSteer significantly improves harmful-query refusal while preserving benign responses, establishing a principled step toward safety alignment in LALMs. The codes and constructed datasets are released at https://github.com/linweiii/SARSteer.

2604.24942 2026-06-15 cs.CL q-bio.NC updated version

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

Kamya Hari, Taha Binhuraib, Jin Li, Cory Shain, Anna A. Ivanova

Affiliations School of Electrical and Computer Engineering, Georgia Institute of Technology; School of Psychological and Brain Sciences, Georgia Institute of Technology; Department of Linguistics, Stanford University

AI summary Proposes an independent-component-based encoding framework that separates stimulus-driven from noise-driven signals in fMRI data, uses language model representations to predict independent-component time series, identifies auditory and language networks, and confirms that the components are interpretable and consistent across subjects.

Comments Accepted to CCN 2026 (Proceedings Track)

Abstract

Encoding models provide a powerful framework for linking continuous stimulus features to neural activity; however, traditional voxelwise approaches are limited by measurement noise, inter-subject variability, and redundancy arising from spatially correlated voxels encoding overlapping neural signals. Here, we propose an independent component (IC)-based encoding framework that dissociates stimulus-driven and noise-driven signals in fMRI data. We decompose continuous fMRI data from naturalistic story listening into ICs using one subset of the data, and train encoding models on independent data to predict IC time series from large language model representations of linguistic input. Across subjects, a subset of ICs exhibited consistently high predictivity. These ICs were spatially and temporally consistent across subjects and included cognitive networks known to respond during story listening (auditory and language). Auditory component time series were strongly correlated with acoustic stimulus features, highlighting the interpretability of identified component time series. Components identified as noise or motion-related artifacts by ICA-AROMA showed uniformly poor predictive performance, confirming that highly predicted components reflect genuine stimulus-related neural signals rather than confounds. Overall, IC-based encoding models enable analyses at the level of functional networks, accommodating the variability in network locations across individuals and providing interpretable results that are easy to compare across subjects. Code provided at: https://github.com/kamyahari/IC-Encoding-Models.git

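2604.23841 2026-06-15 cs.LG cs.AI updated version

Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

Jonathan Hoss, Moritz Link, Noah Klarmann

Affiliations Faculty of Management and Engineering, Rosenheim Technical University of Applied Sciences

AI summary Proposes a unified homogeneous graph framework that maps distinct node roles into a shared latent space via feature homogenization, solves the job shop scheduling problem with linear complexity using a homogeneous Graph Isomorphism Network, achieves zero-shot generalization, and finds that the job-to-machine ratio is the primary driver of policy effectiveness.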

Comments This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

Abstract

Efficiently solving the Job Shop Scheduling Problem in real-world industrial applications requires policies that are both computationally lean and topologically robust. While Reinforcement Learning has shown potential in automating dispatching rules, existing models often struggle with a scalability bottleneck caused by quadratic graph complexity or the architectural overhead of heterogeneous layers. We introduce a unified graph framework that employs feature-based homogenization to project distinct node roles into a shared latent space. This allows a standard homogeneous Graph Isomorphism Network to capture complex resource contention with linear complexity, ensuring low-latency inference for large-scale industrial applications. Our empirical results demonstrate that our framework achieves state-of-the-art performance while exhibiting consistent zero-shot generalization. We identify the job-to-machine ratio as the primary driver of policy effectiveness, rather than absolute problem size. Based on this, we propose a hypothesis of structural saturation, demonstrating that policies trained on critically congested instances ($\mathcal{J} \approx \mathcal{M}$) learn scale-invariant resolution strategies. Agents trained at this saturation point internalize invariant conflict-resolution logic, allowing them to treat massive rectangular instances as a sequential concatenation of saturated sub-problems. This approach eliminates the need for expensive scale-specific retraining and prevents overfitting to statistical shortcuts, providing a robust and efficient pathway for deploying RL solutions in dynamic production environments.

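2604.21335 2026-06-15 cs.LG cs.CL updated version

Sub-Token Routing for KV Cache Compression

Wei Jiang, Wei Wang

Affiliations Futurewei Technologies

AI summary Proposes sub-token routing, which groups the value vector inside each retained token and selectively keeps groups; it complements token-level compression and improves compression performance for both LLMs and VLMs.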

Comments 17 pages, 8 tables, 2 figures

Abstract

Transformer inference often requires a large KV cache, especially for long-context language modeling and multimodal generation. Existing compression methods usually reduce cache cost by selecting, evicting, quantizing, or compressing cached tokens, or by reducing the visual-token sequence before language-model inference. We introduce sub-token routing, a KV-compression method that adds a finer control axis inside retained tokens. It splits each retained value vector into groups and keeps only selected groups, while leaving query and key states unchanged. The method is designed to work after token-level reduction. First, a token-reduction method determines which tokens are retained. Then, sub-token routing compresses the value states inside those retained tokens. Experiments under matched KV budgets show that adding sub-token routing improves token-level reduction performance in both LLM and VLM settings, including Quest on LLaMA-2-7B and Qwen2.5-7B, and FastV/VisionZip across LLaVA and Qwen-VL models. The gains are larger at smaller KV budgets, suggesting that value-group routing is especially useful when further token removal becomes costly. Overall, token-level reduction and sub-token routing provide complementary ways to reduce KV cost.
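The step in which each retained value vector is split into groups and only selected groups are kept might look like the sketch below. The abstract does not say how groups are scored, so the squared L2 norm used here is purely an assumption, and `route_value_groups` is a hypothetical name.

```python
def route_value_groups(value, num_groups, keep):
    """Split one retained value vector into equal groups and keep the
    `keep` highest-scoring groups.

    Returns (kept_group_indices, kept_groups). Query and key states are
    not involved; only the value vector is compressed.
    """
    if len(value) % num_groups != 0:
        raise ValueError("vector length must be divisible by num_groups")
    size = len(value) // num_groups
    groups = [value[g * size:(g + 1) * size] for g in range(num_groups)]
    # Assumption: score each group by its squared L2 norm. The actual
    # routing criterion in the paper may be different (e.g. learned).
    scores = [sum(x * x for x in grp) for grp in groups]
    ranked = sorted(range(num_groups), key=lambda g: scores[g], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [groups[g] for g in kept]


value = [0.1, 0.2, 3.0, -2.0, 0.0, 0.5, 1.0, 1.0]
kept_idx, kept_groups = route_value_groups(value, num_groups=4, keep=2)
# kept_idx is [1, 3]; kept_groups is [[3.0, -2.0], [1.0, 1.0]]
```

Storing the kept indices alongside the kept groups is what would let a later attention step place the surviving values back into their original positions.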

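2604.17402 2026-06-15 cs.LG cs.NE updated version

On the Generalization Bounds of Symbolic Regression with Genetic Programming

Masahiro Nomura, Ryoki Hamano, Isao Ono

Affiliations Institute of Science Tokyo, Yokohama, Japan; CyberAgent, Shibuya, Japan

AI summary Presents a learning-theoretic analysis of symbolic regression with genetic programming, deriving a generalization bound under constraints on tree size, depth, and learnable constants; the bound decomposes the generalization gap into two interpretable components, structure selection and constant fitting, giving theoretical grounding for common practices in genetic programming.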

Comments Accepted for PPSN2026

Abstract

Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.

2512.00336 2026-06-15 cs.CV updated version

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

Mengxue Hu, Yunfeng Diao, Changtao Miao, Tairui Ge, Taize Ge, Zhiqing Guo, Jianshu Li, Zhe Li, Zhongjie Ba, Joey Tianyi Zhou

Affiliations University of Science and Technology of China

AI summary To address the lack of genuine multimodality in existing datasets, proposes the MVAD dataset, which covers three forgery patterns, high-quality samples, and diverse visual styles and content categories, for detecting AI-generated video-audio content.

Comments 10 pages, 2 figures

Abstract

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

2604.16522 2026-06-15 cs.CV updated version

Efficient Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation

Linh Van Ma, Tran Thien Dat Nguyen, Juhua Hu, Wei Cheng, Moongu Jeon

Affiliations School of Electrical Engineering and Computer Science; School of Electrical Engineering, Computing and Mathematical Sciences; School of Engineering and Technology

AI summary Proposes a fast online method for 3D multi-object tracking and pose estimation with multiple monocular cameras that needs only 2D detections and no 3D training data, achieves efficient and accurate tracking through Bayes-optimal filtering, and handles camera disconnections.

Abstract

This paper proposes a fast and online method for jointly performing 3D multi-object tracking and pose estimation using multiple monocular cameras. Our algorithm requires only 2D bounding box and pose detections, eliminating the need for costly 3D training data or computationally expensive deep learning models. Our solution is an efficient implementation of a Bayes-optimal multi-object tracking filter, enhancing computational efficiency while maintaining accuracy. We demonstrate that our algorithm is significantly faster than state-of-the-art methods without compromising accuracy, using only publicly available pre-trained 2D detection models. We also illustrate the robust performance of our algorithm in scenarios where multiple cameras are intermittently disconnected or reconnected during operation.

2604.14892 2026-06-15 cs.LG cs.AI updated version

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

Amy Rouillard, Sitwala Mundia, Linda Camara, Ziyaad Dangor, Michael Cameron Gramanie, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Affiliations Wits MIND Institute, University of the Witwatersrand, Johannesburg, South Africa; Grai Labs, Cape Town, South Africa; South African Medical Research Council Vaccines and Infectious Diseases Analytics Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Internal Medicine, Charlotte Maxeke Johannesburg Academic Hospital, and Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Paediatrics and Child Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

AI summary Uses an LLM Jury to score 3334 diagnoses on 300 hospital cases from low- and middle-income countries and finds that calibrated LLM scores agree closely with expert scores and carry a lower risk of severe errors, so the jury can serve as a reliable evaluation proxy.

Abstract

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than for the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.
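The post hoc calibration mentioned in this abstract relies on isotonic regression, which fits a non-decreasing map from raw scores to target scores. A minimal pool-adjacent-violators sketch is given below; the function name `isotonic_fit` and the toy numbers are illustrative only, the inputs are assumed to have distinct x values, and nothing here reproduces the paper's data or pipeline.

```python
def isotonic_fit(xs, ys):
    """Pool-adjacent-violators: least-squares non-decreasing fit of ys
    against xs. Returns the sorted xs and their fitted values.

    Assumption: xs are distinct (ties in x are not given special handling).
    """
    pairs = sorted(zip(xs, ys))
    blocks = []  # each block is [sum_of_y, count]
    for _, y in pairs:
        blocks.append([y, 1])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [x for x, _ in pairs], fitted


# Made-up example: x plays the role of a raw jury score, y of an expert score.
raw = [1, 2, 3, 4, 5]
expert = [2, 1, 3, 5, 4]
xs_sorted, calibrated = isotonic_fit(raw, expert)
# calibrated is [1.5, 1.5, 3.0, 4.5, 4.5]
```

Because the fitted map is monotone, it can shift systematically low scores upward without changing their ordering, which matches the abstract's observation that uncalibrated scores keep ordinal agreement but sit lower than expert scores.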

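2604.15173 2026-06-15 cs.CV updated version

Boundary-Centric Clip-Budgeted Active Learning for Temporal Action Segmentation

Halil Ismail Helvaci, Sen-ching Samson Cheung

Affiliations University of California, Berkeley

AI summary Proposes the B-ACT framework, whose boundary-centric supervision strategy prioritizes labeling action-boundary frames under a limited annotation budget, markedly improving label efficiency for temporal action segmentation and surpassing existing methods on several datasets.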

Abstract

Temporal action segmentation (TAS) in untrimmed videos requires dense temporal supervision. However, most of the annotation cost is spent identifying action transitions where segmentation errors concentrate and small temporal shifts can disproportionately degrade segment-level metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these error-prone boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score. The boundary score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics to reveal the underlying importance of each frame. Importantly, our annotation protocol requests labels only at the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets. Gains are largest on datasets where performance is highly sensitive to boundary placement, as measured by edit and overlap-based F1 metrics.
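The second stage of the loop described here (detect candidate transitions from current predictions, then pick the top-$K$ by a boundary score) can be sketched as below. The abstract's score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics; this sketch substitutes a much simpler stand-in, the mean entropy of the two frames around a transition, and `select_boundaries` is a hypothetical name.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one frame's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_boundaries(frame_probs, k):
    """Return up to k frame indices where the predicted class changes,
    ranked by a stand-in boundary score (highest first)."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    candidates = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

    # Assumption: score = mean entropy of the frames on either side of the
    # transition. The paper's boundary score combines more signals.
    def score(t):
        return 0.5 * (entropy(frame_probs[t - 1]) + entropy(frame_probs[t]))

    return sorted(candidates, key=score, reverse=True)[:k]


# Toy two-class predictions over six frames; the class flips at frames 2 and 4.
frame_probs = [
    [0.9, 0.1], [0.6, 0.4], [0.4, 0.6],
    [0.1, 0.9], [0.95, 0.05], [0.9, 0.1],
]
top = select_boundaries(frame_probs, k=1)  # the uncertain flip at frame 2 ranks first
```

Only the returned frame indices would be sent for annotation, while training could still use a clip centered on each one.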

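2604.14193 2026-06-15 cs.CV eess.IV q-bio.NC updated version

QualiaNet: An Experience-Before-Inference Network

Paul Linton

Affiliations Columbia University

AI summary Proposes QualiaNet, a two-stage architecture modeled on human stereo vision: disparity maps first simulate the experience, then a CNN estimates distance from disparity gradients, confirming that distance can be recovered from disparity gradients.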

Journal ref Extended abstract presented at the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026

Abstract

Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although stereo vision does not provide us with absolute distance information, it nonetheless affects our inferences about distance. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
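The scene statistic this abstract appeals to follows from basic viewing geometry: under the small-angle approximation, the vergence angle for a point at distance d is about I/d for interocular separation I, so the same depth step produces far more disparity up close than far away. The sketch below is a textbook approximation added for illustration, not the paper's model; the function name and the 0.065 m interocular distance are assumptions.

```python
def relative_disparity(interocular, distance, depth_step):
    """Small-angle relative disparity (radians) between a point at
    `distance` and a point at `distance + depth_step`, both in meters."""
    return interocular / distance - interocular / (distance + depth_step)


INTEROCULAR_M = 0.065  # assumed typical adult value, for illustration only

near = relative_disparity(INTEROCULAR_M, distance=0.5, depth_step=0.1)
far = relative_disparity(INTEROCULAR_M, distance=5.0, depth_step=0.1)
# The same 10 cm depth step yields a much larger disparity at 0.5 m than at 5 m.
```

Since the disparity for a fixed depth step falls off roughly with the square of distance, the steepness of disparity gradients carries information about how far away a scene is, which is the cue the Inference Module is proposed to exploit.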

2602.22822 2026-06-15 cs.AI cs.LG updated version

FlexMS: A Unified Public Benchmark for Molecule Tandem Mass Spectrum Prediction

Yunhua Zhong, Yixuan Tang, Yifan Li, Pan Liu, Zhiwen Yang, Jie Yang, Jun Xia

Affiliations The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; The University of Hong Kong; Yangzhou University; Fudan University

AI summary Proposes the FlexMS benchmark framework, which enables fair comparison across public resources by standardizing preprocessing, metadata conditioning, and evaluation protocols, and introduces difficulty-aware diagnostics to guide model selection.

Comments preprint version v3

详情
AI中文摘要

串联质谱(MS/MS)在小分子鉴定中至关重要,但当前的深度学习谱预测系统在实际评估和部署中仍存在困难。尽管新架构不断声称达到最先进性能,但不一致的元数据条件和纠缠的预处理流程阻碍了公平的架构比较。此外,现有评估通常局限于精心策划的数据集,未能捕捉真实代谢组学的异质性和跨领域偏移。而且,当前基准缺乏难度感知诊断,对模型在特定计算或数据约束下的行为视而不见。为解决这些问题,我们提出了FlexMS,一个模块化的公共数据基准框架,它在统一协议下标准化跨公共资源的MS/MS预测,同时保留分子编码器、元数据条件、预测头以及下游检索。FlexMS建立了一个公平的评估平台,显著降低了集成新预测工具的门槛。FlexMS不仅优化平均分数,还通过难度感知诊断增强聚合准确性,为不同计算约束、数据规模和下游检索目标下的模型选择提供可操作指导。最终,FlexMS为社区提供了一个可复现的标准,以识别哪些算法结论是稳定的,以及哪些操作点在实践中最为可行。

English abstract

Tandem mass spectrometry (MS/MS) is central to small molecule identification, but current deep learning systems for spectrum prediction remain difficult to evaluate and deploy in practice. While novel architectures constantly claim state-of-the-art performance, inconsistent metadata conditioning and entangled preprocessing pipelines hinder fair architectural comparisons. In addition, existing evaluations are often restricted to curated datasets, failing to capture the heterogeneity and cross-domain shifts of real-world metabolomics. Furthermore, current benchmarks lack difficulty-aware diagnostics and are blind to how models behave under specific compute or data constraints. To address these issues, we present FlexMS, a modular public-data benchmark framework that standardizes MS/MS prediction across public resources while keeping molecular encoders, metadata conditioning, predictor heads, and downstream retrieval under one protocol. FlexMS establishes a fair evaluation playground that significantly lowers the barrier for integrating new predictive tools. Rather than solely optimizing for average scores, FlexMS augments aggregate accuracy with difficulty-aware diagnostics, providing actionable guidance on model selection across different compute constraints, data scales, and downstream retrieval objectives. Ultimately, FlexMS provides the community with a reproducible standard to identify which algorithmic conclusions are stable and which operating points are most viable in practice.

2604.09737 2026-06-15 cs.LG cs.AI Version update

STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

STaR-DRO: 面向群体鲁棒结构化预测的状态化Tsallis重加权

Samah Fodeh, Ganesh Puthiaraju, Elyas Irankhah, Afshan Khan, Sreeraj Ramachandran, Linhai Ma, Srivani Talakokkul, Sarah Schellhorn

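Affiliations * Yale University; Yale School of Medicine

AI summary Proposes the STaR-DRO framework, which combines Tsallis mirror ascent with a sparse entmax mapping to upweight only persistently hard groups, improving label accuracy and robustness in structured prediction; on the EPPC Miner task it raises label F1 by 1.08 and 2.20 points over SFT and standard DRO, respectively.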

Details
AI Chinese abstract

使用大型语言模型进行结构化预测需要输出在标签不平衡和异质群体难度下具有标签准确性、本体约束、结构有效性和证据基础。我们提出了一个统一框架用于本体约束生成。首先,我们引入了一个模块化的提示工程架构,结合了XML风格结构、专家消歧规则、思维链推理、元数据感知决策逻辑、模式契约和自我验证门。它针对反复出现的上下文失败,包括格式漂移、标签歧义、证据幻觉和元数据条件混淆。其次,我们提出了STaR-DRO,结合了Tsallis镜像上升、稀疏entmax风格原始映射、EMA平滑群体损失跟踪、重新缩放上升信号和有界超额乘数。与依赖密集香农熵指数梯度更新、可能引入高方差随机重加权、将正对抗质量分配给非持续困难群体、并通过单纯形竞争产生成本的常规DRO不同,STaR-DRO仅对持续困难群体上权重,而不抑制较容易的群体。我们在EPPC Miner上评估该框架,这是一个临床基础的高风险结构化预测任务,需要从患者-提供者安全消息中进行层次标签预测和证据跨度提取。在1B-70B Llama模型上,提示工程改进了零样本提取,平均标签F1增益为+14.46,跨度F1增益为+17.40。在监督微调的基础上,STaR-DRO进一步提高了准确性和鲁棒性,平均标签F1分别提高了+1.08和+2.20,同时相对于SFT和标准DRO,平均群体验证交叉熵分别降低了21.3%和14.8%。这些结果推进了以患者为中心的临床护理分析的可靠自动化通信挖掘。

English abstract

Structured prediction with large language models requires outputs that are label-accurate, ontology-constrained, structurally valid, and evidence-grounded under label imbalance and heterogeneous group difficulty. We present a unified framework for ontology-constrained generation. First, we introduce a modular prompt-engineering architecture combining XML-style structure, expert disambiguation rules, chain-of-thought reasoning, metadata-aware decision logic, schema contracts, and a self-validation gate. It targets recurrent in-context failures, including format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion. Second, we propose STaR-DRO, combining Tsallis mirror ascent, sparse entmax-style primal mapback, EMA-smoothed group-loss tracking, rescaled ascent signals, and bounded excess-only multipliers. Unlike conventional DRO, which relies on dense Shannon-entropy exponentiated-gradient updates, can introduce high-variance stochastic reweighting, assigns positive adversarial mass to groups that are not persistently hard, and incurs costs through simplex competition, STaR-DRO upweights only persistently hard groups without suppressing easier ones. We evaluate the framework on EPPC Miner, a clinically grounded high-stakes structured-prediction task requiring hierarchical label prediction and evidence-span extraction from patient-provider secure messages. Across 1B-70B Llama models, prompt engineering improves zero-shot extraction, yielding an average label F1 gain of +14.46 and a Span F1 gain of +17.40. Building on supervised fine-tuning, STaR-DRO further improves accuracy and robustness, increasing average label F1 by +1.08 and +2.20 while reducing mean groupwise validation cross-entropy by 21.3% and 14.8% relative to SFT and standard DRO, respectively. These results advance reliable automated communication mining for patient-centered clinical care analysis.
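Two of the ingredients named in the abstract, EMA-smoothed group-loss tracking and a sparse entmax-style mapping, can be illustrated in isolation. The sketch below is not the authors' algorithm: it uses sparsemax (the alpha = 2 member of the entmax family) applied directly to smoothed losses, and the function names, the smoothing factor `beta`, and the example losses are all assumptions made for illustration. It omits the Tsallis mirror ascent step, the rescaled ascent signals, and the bounded excess-only multipliers.

```python
def ema_update(tracked, observed, beta=0.9):
    """Exponential moving average of per-group losses (beta is an assumed smoothing factor)."""
    return [beta * t + (1.0 - beta) * o for t, o in zip(tracked, observed)]


def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex (entmax with alpha = 2)."""
    ordered = sorted(scores, reverse=True)
    running_sum, tau = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        if 1.0 + k * value > running_sum:  # coordinate k is still inside the support
            tau = (running_sum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical smoothed losses for three groups; the third is persistently hard.
tracked = [0.2, 0.25, 1.5]
weights = sparsemax(tracked)  # [0.0, 0.0, 1.0]: only the hard group gets weight
```

Unlike a softmax over the same scores, which gives every group some positive mass, the sparse mapping can assign exactly zero to groups that are not hard, which is the qualitative behaviour the abstract contrasts with dense exponentiated-gradient updates.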

2512.10966 2026-06-15 cs.LG cs.AI cs.CV eess.IV Version update

Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts

可解释的阿尔茨海默病诊断:基于区域脑专家的多模态融合

Farica Zhuang, Shu Yang, Dinara Aliyeva, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen

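Affiliations * University of Pennsylvania; Massachusetts Institute of Technology

AI summary Proposes MREF-AD, a multimodal regional expert fusion model that uses a Mixture-of-Experts framework to treat brain regions within each modality as independent experts and a gating network to learn subject-specific fusion weights, yielding interpretable AD diagnosis.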

Comments Published at IEEE ICHI 2026

Details
AI Chinese abstract

准确早期诊断阿尔茨海默病(AD)对有效干预至关重要,需要整合多模态神经影像数据的互补信息。然而,传统融合方法通常依赖特征的简单拼接,无法自适应平衡淀粉样蛋白PET和MRI等生物标志物在不同脑区的贡献。本文提出MREF-AD,一种用于AD诊断的多模态区域专家融合模型。它是一个混合专家(MoE)框架,将每个模态内的介观脑区域建模为独立专家,并采用门控网络学习个体特定的融合权重。利用阿尔茨海默病神经影像学倡议(ADNI)的表格神经影像和人口统计学信息,MREF-AD在强经典和深度学习基线上取得了有竞争力的性能,同时提供了可解释的、模态和区域层面的洞察,揭示了结构和分子影像如何共同促进AD诊断。源代码见:此 https URL。

English abstract

Accurate and early diagnosis of Alzheimer's disease (AD) is critical for effective intervention and requires integrating complementary information from multimodal neuroimaging data. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models mesoscopic brain regions within each modality as independent experts and employs a gating network to learn subject-specific fusion weights. Utilizing tabular neuroimaging and demographic information from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves competitive performance over strong classic and deep baselines while providing interpretable, modality- and region-level insight into how structural and molecular imaging jointly contribute to AD diagnosis. The source code is available at https://github.com/PennShenLab/mref-ad.
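The gating idea in the abstract, where each regional expert produces a prediction and a gate assigns subject-specific weights, can be sketched in a few lines. This is a generic Mixture-of-Experts fusion under stated assumptions, not the MREF-AD implementation: the gate scores are passed in directly rather than produced by a trained gating network, and `fuse`, the two hypothetical experts, and their logits are illustrative only.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_logits, gate_scores):
    """Weighted sum of per-expert class logits using softmax-normalized gate scores.

    expert_logits: one class-logit vector per expert (e.g. one per brain region and modality).
    gate_scores:   one raw score per expert for this subject (assumed to come from a gate).
    """
    weights = softmax(gate_scores)
    n_classes = len(expert_logits[0])
    fused = [sum(w * logits[c] for w, logits in zip(weights, expert_logits))
             for c in range(n_classes)]
    return fused, weights


# Two hypothetical experts that disagree; equal gate scores give an even blend.
fused, weights = fuse([[1.0, 0.0], [0.0, 1.0]], gate_scores=[0.0, 0.0])
```

The returned `weights` are what makes such a model inspectable: for a given subject they show how much each region or modality contributed to the fused prediction.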

2604.04611 2026-06-15 cs.LG cs.CR Version update

Dynamic Free-Rider Detection in Federated Learning via Simulated Attack Patterns

联邦学习中基于模拟攻击模式的动态搭便车者检测

Motoki Nakamura

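Affiliations * Fujitsu Limited

AI summary Targets the difficulty of detecting dynamic free-riders in federated learning with S2-WEF, which simulates the weight evolving frequency (WEF) patterns of global-model-based attacks and combines the resulting similarity score with a deviation score in two-dimensional clustering, giving robust detection without a proxy dataset or pre-training.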

Comments 23 pages, 1 figure, 8 tables

Details
AI Chinese abstract

联邦学习(FL)使多个客户端能够通过聚合本地更新来协作训练全局模型,而无需共享私有数据。然而,FL经常面临搭便车者的挑战,即提交虚假模型参数而不执行实际训练以获取全局模型的客户端。Chen等人提出了一种基于模型参数权重演化频率(WEF)的搭便车者检测方法。该检测方法是实际搭便车者检测方法的主要候选,因为它既不需要代理数据集也不需要预训练。然而,它难以检测“动态”搭便车者,这些客户端在早期轮次中行为诚实,后来转而搭便车,特别是在全局模型模仿攻击(如delta权重攻击和我们新提出的自适应WEF伪装攻击)下。在本文中,我们提出了一种新颖的检测方法S2-WEF,该方法在服务器端使用先前广播的全局模型模拟潜在基于全局模型的攻击的WEF模式,并识别提交的WEF模式与模拟模式相似的客户端。为了处理各种搭便车攻击策略,S2-WEF进一步将该基于模拟的相似度分数与通过提交的WEF之间的相互比较计算的偏差分数相结合,并通过二维聚类和每分数分类来区分良性客户端和搭便车客户端。该方法能够在无需代理数据集或预训练的情况下,动态检测训练过程中转变为搭便车者的客户端。我们在三个数据集和五种攻击类型上进行了大量实验,证明S2-WEF比现有方法具有更高的鲁棒性。

English abstract

Federated learning (FL) enables multiple clients to collaboratively train a global model by aggregating local updates without sharing private data. However, FL often faces the challenge of free-riders, clients who submit fake model parameters without performing actual training to obtain the global model without contributing. Chen et al. proposed a free-rider detection method based on the weight evolving frequency (WEF) of model parameters. This detection approach is a leading candidate for practical free-rider detection methods, as it requires neither a proxy dataset nor pre-training. Nevertheless, it struggles to detect "dynamic" free-riders who behave honestly in early rounds and later switch to free-riding, particularly under global-model-mimicking attacks such as the delta weight attack and our newly proposed adaptive WEF-camouflage attack. In this paper, we propose a novel detection method S2-WEF that simulates the WEF patterns of potential global-model-based attacks on the server side using previously broadcast global models, and identifies clients whose submitted WEF patterns resemble the simulated ones. To handle a variety of free-rider attack strategies, S2-WEF further combines this simulation-based similarity score with a deviation score computed from mutual comparisons among submitted WEFs, and separates benign and free-rider clients by two-dimensional clustering and per-score classification. This method enables dynamic detection of clients that transition into free-riders during training without proxy datasets or pre-training. We conduct extensive experiments across three datasets and five attack types, demonstrating that S2-WEF achieves higher robustness than existing approaches.

2604.01463 2026-06-15 cs.RO cs.AI cs.HC Version update

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

基于低负担LLM的偏好学习:通过自然语言反馈为瘫痪用户个性化辅助机器人

Keshav Shankar, Dan Ding, Wei Gao

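Affiliations * Electrical and Computer Engineering; Rehabilitation Science and Technology

AI summary For users with severe motor impairments, proposes a low-burden offline framework that uses large language models to turn unstructured natural language feedback into deterministic robot control policies, decoding user needs through an occupational therapy framework and significantly reducing user workload.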

Comments Accepted to IEEE RO-MAN 2026

Details
AI Chinese abstract

物理辅助机器人需要个性化行为以确保用户安全和舒适。然而,传统的偏好学习方法(如详尽的成对比较)会给严重运动障碍用户带来巨大的身体和认知疲劳。为解决这一问题,我们提出了一种低负担的离线框架,将非结构化自然语言反馈直接转化为确定性的机器人控制策略。为了安全地弥合模糊的人类语言与机器人代码之间的差距,我们的流程使用基于职业治疗实践框架的大语言模型(LLMs)。这种临床推理将主观用户反应解码为明确的生理和心理需求,然后映射到透明的决策树中。在部署前,自动化的“LLM-as-a-Judge”验证代码的结构安全性。我们在一个模拟的餐食准备研究中,对10名瘫痪成年人进行了系统验证。结果表明,与传统的基线方法相比,我们的自然语言方法显著降低了用户的工作负担。此外,职业治疗师确认生成的策略是安全的,并且准确反映了用户偏好。

English abstract

Physically Assistive Robots require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause substantial physical and cognitive fatigue for users with severe motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework. This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, occupational therapists confirmed the generated policies are safe and accurately reflect user preferences.

2604.00887 2026-06-15 cs.CV cs.CR Version update

Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection

面向SAR目标检测的物理可实现对抗衰减补丁

Yiming Zhang, Weibo Qin, Feng Wang

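Affiliations * Key Laboratory for Information Science of Electromagnetic Waves (MoE), School of Information Science and Technology, Fudan University

AI summary Proposes the Adversarial Attenuation Patch (AAP) method, which balances attack effectiveness and stealthiness through energy-constrained optimization and an attenuation-based deployment framework, and aligns with signal-level electronic jamming mechanisms for physical realizability.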

Comments 5 pages, 4 figures. Source code is available at https://github.com/boremycin/SAAP. Accepted and published in IEEE CAIT 2026. DOI: 10.1109/CAIT70489.2026.11553874

Journal ref Proc. 2026 China Aerospace Information Technology Conference (CAIT), Tongxiang, China, May 2026

Details
AI Chinese abstract

深度神经网络在SAR目标检测任务中表现出色,但仍易受对抗攻击影响。现有的SAR特定攻击方法能有效欺骗检测器,但往往引入明显扰动,且主要局限于数字域,忽略了攻击SAR系统的物理实现约束。本文提出一种新颖的对抗衰减补丁(AAP)方法,采用能量约束优化策略结合基于衰减的部署框架,在攻击有效性和隐蔽性之间实现无缝平衡。更重要的是,AAP通过对齐信号级电子干扰机制,展现出强大的物理实现潜力。实验结果表明,AAP在保持高隐蔽性的同时有效降低检测性能,并在不同模型间表现出良好的可迁移性。本研究为SAR目标检测系统的对抗攻击提供了物理基础视角,并促进了更隐蔽且实际可部署的攻击策略设计。源代码已在此https URL公开。

English abstract

Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting the physical implementation constraints of attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.

2603.23530 2026-06-15 cs.CL cs.AI cs.LG Version update

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

你忘记我问什么了吗?大型语言模型中的前瞻记忆失败

Avni Mittal

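Affiliations * University of Washington

AI summary Viewed through the lens of prospective memory from cognitive psychology, the study finds that compliance with formatting instructions drops by 2-21% when large language models perform demanding tasks, and proposes a salience-enhanced format to restore compliance.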

Details
AI Chinese abstract

大型语言模型在必须同时执行要求较高的任务时,常常无法满足格式化指令。我们通过认知心理学中的前瞻记忆视角,使用一个受控范式来研究这种行为,该范式将可验证的格式化约束与复杂度递增的基准任务相结合。在三个模型家族和超过8000个提示中,在并发任务负载下,遵从性下降了2-21%。脆弱性高度依赖于类型:终端约束(需要在响应边界采取行动)下降最多,高达50%,而避免约束相对稳健。显著性增强格式(显式指令框架加上尾部提醒)恢复了大量丢失的遵从性,在许多设置中将性能恢复到90-100%。干扰是双向的:格式化约束也可能降低任务准确性,其中一个模型的GSM8K准确率从93%下降到27%。在额外的堆叠实验中,随着约束的累积,联合遵从性急剧下降。所有结果均使用确定性程序化检查器,无需LLM作为评判组件,并在公开可用的数据集上进行。

English abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
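The abstract stresses that compliance is scored by deterministic programmatic checkers rather than an LLM judge. A minimal sketch of what such checkers might look like follows; the specific constraints (ending with a `[END]` marker, avoiding the word "very") and all function names are hypothetical examples chosen here, not the constraints used in the paper.

```python
import re


def check_terminal(response, marker="[END]"):
    """Terminal constraint: the response must end with the marker (trailing whitespace ignored)."""
    return response.rstrip().endswith(marker)


def check_avoidance(response, banned="very"):
    """Avoidance constraint: the banned word must not appear as a whole word (case-insensitive)."""
    return re.search(rf"\b{re.escape(banned)}\b", response, flags=re.IGNORECASE) is None


def compliance_rate(responses, checker):
    """Fraction of responses that satisfy the given checker."""
    return sum(1 for r in responses if checker(r)) / len(responses)


# Hypothetical model outputs: the second forgets the closing marker.
outputs = ["The answer is 42. [END]\n", "The answer is 42."]
rate = compliance_rate(outputs, check_terminal)  # 0.5
```

Checks of this kind are cheap and reproducible, which makes it practical to compare compliance with and without a concurrent task across thousands of prompts.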

2512.21201 2026-06-15 cs.RO cs.AI cs.CV Version update

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

薛定谔的导航者:为零样本目标导航设想未来轨迹集合

Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu, Yu-Gang Jiang

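Affiliations * Fudan University; Shanghai Jiao Tong University; Shanghai University of International Business and Economics; Shanghai Innovation Institute

AI summary Proposes a belief-aware framework that, at inference time, imagines multiple future scenes with a trajectory-conditioned 3D world model and combines adaptive occluder-aware sampling with a Future-Aware Value Map, improving hidden-target discovery and risk-aware path selection for zero-shot object navigation in occlusion-heavy environments.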

Details
AI Chinese abstract

零样本目标导航(ZSON)要求机器人在未见环境中找到目标物体,无需任务特定的微调或预建地图,这是通用服务机器人的关键能力。然而,在模拟中表现良好的方法在杂乱的真实世界场景中往往会退化,这些场景存在严重遮挡和潜在危险,大面积的未观察区域使得单场景推理脆弱且不安全。我们提出薛定谔的导航者,一个信念感知框架,在推理时对多个轨迹条件化的设想3D未来进行推理。给定候选路径,轨迹条件化的3D世界模型预测假设的观察结果,并保持多个合理场景实现的叠加,而不是承诺于单一地图。自适应遮挡物感知采样器将想象引导至不确定性关键区域,而未来感知价值图(FAVM)聚合设想的未来,以实现鲁棒、主动的动作选择。在模拟和物理Go2四足机器人上的实验表明,薛定谔的导航者优于强ZSON基线,在遮挡严重的导航场景中提高了隐蔽目标发现和风险感知路径点选择。这些结果突显了设想3D未来作为在不确定真实世界环境中进行零样本导航的可扩展和通用策略。

English abstract

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

2603.18464 2026-06-15 cs.LG Version update

AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

AcceRL: 面向视觉-语言-动作模型的分布式异步强化学习与世界模型框架

Chengxuan Lu, Shukuan Wang, Yanjie Li, Yingying Fang, Huoyan Wang, Tian Zhang, Wei Liu, Shiji Jin, Fuyuan Qian, Peiming Li, Chao Xu, Baigui Sun, Yang Liu

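Affiliations * IROOTECH TECHNOLOGY Wolf 1069 b Lab, Sany Group

AI summary Proposes the AcceRL framework, which achieves distributed asynchronous reinforcement learning by physically isolating environment interaction, model inference, and gradient updates, removing the idle bubbles of synchronous systems, raising hardware utilization, and supporting plug-and-play world-model integration; on LIBERO tasks it reports a 2.4x throughput speedup and up to a 200x gain in sample efficiency.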

Details
AI Chinese abstract

大规模视觉-语言-动作(VLA)模型的强化学习(RL)严重受限于同步障碍和环境数据获取的高成本。为克服这些挑战,我们提出AcceRL,一种分布式异步RL框架,物理隔离环境交互、模型推理和梯度更新。通过消除同步系统中固有的级联长尾空闲气泡,AcceRL最大化硬件利用率并确保可扩展吞吐量。此外,AcceRL采用模块化设计,支持将多种即插即用的世界模型集成到其分布式流水线中。大量实验表明,基础框架在所有四个LIBERO任务套件上均取得极具竞争力的性能。系统层面,异步架构相比领先的同步基线实现了2.4倍的吞吐加速。算法层面,通过利用在1000条离线轨迹上预训练的世界模型,AcceRL在LIBERO-Spatial上实现了高达200倍的在线样本效率提升,为具身AI建立了一个既样本高效又时间高效的稳健框架。代码包含在补充材料中。代码见此网址。

English abstract

Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models is severely bottlenecked by synchronization barriers and the high cost of environment data acquisition. To overcome these challenges, we propose AcceRL, a distributed asynchronous RL framework that physically isolates environment rollouts, model inference, and gradient updates. By eliminating the cascading long-tail idle bubbles inherent in synchronous systems, AcceRL maximizes hardware utilization and ensures scalable throughput. Furthermore, AcceRL features a modular design that supports the integration of diverse, plug-and-play world models into its distributed pipeline. Extensive experiments demonstrate that the base framework achieves highly competitive performance across all four LIBERO task suites. At the system level, the asynchronous architecture delivers a $2.4\times$ throughput speedup over leading synchronous baselines. Algorithmically, by leveraging a world model pre-trained on 1,000 offline trajectories, AcceRL achieves up to a $200\times$ improvement in online sample efficiency on LIBERO-Spatial, establishing a robust framework that is both sample-efficient and time-efficient for embodied AI. Code is included in the supplementary material and is available at https://github.com/distanceLu/AcceRL.

2603.16073 2026-06-15 cs.CL Version update

ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

ClaimFlow: 追踪自然语言处理中科学主张的演变

Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych

Affiliations * Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technische Universität Darmstadt; IT:U Interdisciplinary Transformation University Austria; National Research Council Canada

AI summary Proposes the ClaimFlow framework, which annotates 5,689 claims and 4,871 cross-paper relations in 1,617 papers, defines the claim relation classification task, and analyzes how claims evolve in NLP.

Details
AI Chinese abstract

科学论文提出$\textit{主张}$,后续工作支持、扩展或有时反驳这些主张。然而,现有的引文和主张分析方法仅捕捉到这一对话的片段。在这项工作中,我们在单个科学主张的层面上使这些交互变得明确。我们引入$\texttt{ClaimFlow}$,一种以主张为中心的NLP文献视图,基于$1{,}617$篇ACL文集论文$(1979-2025)$构建,这些论文被手动标注了$5{,}689$个主张和$4{,}871$个跨论文主张关系,指示引用论文是$\texttt{支持}$、$\texttt{扩展}$、$\texttt{限定}$、$\texttt{反驳}$还是将引用的主张作为$\texttt{背景}$引用。基于$\texttt{ClaimFlow}$,我们定义了一个新任务——$\textit{主张关系分类}$——要求模型从文本和引文上下文中推断对引用的主张的科学立场。评估神经网络模型和大语言模型在该任务上的表现,我们报告了$0.81$宏F1的基线性能,表明该任务是可处理的,同时仍有改进空间。然后,我们将该框架扩展到约$13$k篇NLP论文,以研究跨越数十年NLP研究的主张演变。我们显示$63.5\%$的主张从未被重复使用;只有$11.1\%$曾受到挑战。广泛传播的主张更常通过限定和扩展$\textit{重塑}$,而非被支持或反驳。总体而言,$\texttt{ClaimFlow}$提供了一个审视NLP中思想如何转变和成熟的视角。

English abstract

Scientific papers advance $\textit{claims}$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $1{,}617$ ACL Anthology papers ($1979$--$2025$) that are manually annotated with $5{,}689$ claims and $4{,}871$ cross-paper claim relations, indicating whether a citing paper $\texttt{supports}$, $\texttt{extends}$, $\texttt{qualifies}$, $\texttt{refutes}$, or references a cited claim as $\texttt{background}$. Building on $\texttt{ClaimFlow}$, we define a new task -- $\textit{Claim Relation Classification}$ -- which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating neural models and large language models on this task, we report baseline performance of $0.81$ macro-F1, suggesting that the task is tractable while leaving room for improvement. We then scale this framework to $\sim 13$k NLP papers to study claim evolution across decades of NLP research. We show that $63.5\%$ of claims are never reused; only $11.1\%$ are ever challenged. Widely propagated claims are more often $\textit{reshaped}$ through qualification and extension than supported or refuted. Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP.

2603.15481 2026-06-15 cs.LG cs.AI Version update

TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins

TabKD: 通过学习特征箱的交互多样性实现表格知识蒸馏

Shovon Niverd Pereira, Krishna Khadka, Yu Lei

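Affiliations * Department of Computer Science and Engineering, The University of Texas at Arlington

AI summary Proposes TabKD, which learns adaptive feature bins aligned with the teacher's decision boundaries and generates synthetic queries that maximize pairwise interaction coverage, markedly improving student-teacher agreement in knowledge distillation for tabular data.

Comments Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)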

Details
AI Chinese abstract

无数据知识蒸馏可以在没有原始训练数据的情况下实现模型压缩,这对于隐私敏感的表格领域至关重要。然而,现有方法在表格数据上表现不佳,因为它们没有明确处理特征交互,而特征交互是表格模型编码预测知识的基本方式。我们识别出交互多样性,即特征组合的系统覆盖,是有效表格蒸馏的基本要求。为了实施这一见解,我们提出了TabKD,它学习与教师决策边界对齐的自适应特征箱,然后生成最大化成对交互覆盖的合成查询。在4个基准数据集和4种教师架构上,TabKD在16个配置中的14个中实现了最高的学生-教师一致性,优于5个最先进的基线。我们进一步表明,交互覆盖与蒸馏质量强相关,验证了我们的核心假设。我们的工作建立了以交互为中心的探索作为表格模型提取的原则性框架。

English abstract

Data-free knowledge distillation enables model compression without original training data, critical for privacy-sensitive tabular domains. However, existing methods do not perform well on tabular data because they do not explicitly address feature interactions, the fundamental way tabular models encode predictive knowledge. We identify interaction diversity, systematic coverage of feature combinations, as an essential requirement for effective tabular distillation. To operationalize this insight, we propose TabKD, which learns adaptive feature bins aligned with teacher decision boundaries, then generates synthetic queries that maximize pairwise interaction coverage. Across 4 benchmark datasets and 4 teacher architectures, TabKD achieves the highest student-teacher agreement in 14 out of 16 configurations, outperforming 5 state-of-the-art baselines. We further show that interaction coverage strongly correlates with distillation quality, validating our core hypothesis. Our work establishes interaction-focused exploration as a principled framework for tabular model extraction.
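Pairwise interaction coverage, the quantity the abstract says the synthetic queries maximize, can be made concrete with a small counting example. The sketch below assumes each query has already been mapped to one bin index per feature; the function name, the input layout, and the exact definition of coverage (observed bin pairs divided by all possible bin pairs over all feature pairs) are assumptions for illustration and may differ from the paper's metric.

```python
from itertools import combinations


def pairwise_coverage(binned_rows, bins_per_feature):
    """Fraction of all (feature pair, bin pair) combinations hit by at least one row.

    binned_rows:      rows of bin indices, one index per feature.
    bins_per_feature: number of bins available for each feature.
    """
    feature_pairs = list(combinations(range(len(bins_per_feature)), 2))
    seen = set()
    for row in binned_rows:
        for i, j in feature_pairs:
            seen.add((i, j, row[i], row[j]))
    total = sum(bins_per_feature[i] * bins_per_feature[j] for i, j in feature_pairs)
    return len(seen) / total


# Two features with two bins each: two queries on the diagonal cover half the grid.
coverage = pairwise_coverage([(0, 0), (1, 1)], bins_per_feature=[2, 2])  # 0.5
```

A query generator could use a score like this as its objective, preferring new queries that land in bin pairs not yet visited.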

2603.12231 2026-06-15 cs.LG Version update

Temporal Straightening for Latent Planning

时间拉直用于隐式规划

Ying Wang, Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim G. J. Rudner, Yann LeCun, Mengye Ren

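Affiliations * New York University; Brown University; University of Toronto

AI summary Inspired by the perceptual straightening hypothesis in human visual processing, proposes temporal straightening, which uses a curvature regularizer to jointly learn the encoder and predictor of a JEPA world model, improving representation learning for latent planning, stabilizing gradient-based planning, and raising success rates on goal-reaching tasks.

Comments ICML 2026 camera-ready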

Details
AI Chinese abstract

学习良好的表示对于基于世界模型的隐式规划至关重要。虽然预训练的视觉编码器能产生强大的语义视觉特征,但它们并非为规划定制,且包含与规划无关甚至有害的信息。受人类视觉处理中感知拉直假说的启发,我们引入时间拉直来改进隐式规划的表示学习。通过使用鼓励局部拉直隐式轨迹的曲率正则化器,我们联合学习联合嵌入预测架构(JEPA)世界模型的编码器和预测器。我们表明,以这种方式降低曲率使得隐空间中的欧氏距离更好地近似测地距离,并改善了规划目标的条件。我们通过实验证明,时间拉直使得基于梯度的规划更稳定,并在一系列目标到达任务中显著提高了成功率。我们的代码可在该 https URL 获取。

English abstract

Learning good representations is essential for latent planning with world models. While pretrained visual encoders produce strong semantic visual features, they are not tailored to planning and contain information irrelevant -- or even detrimental -- to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. Using a curvature regularizer that encourages locally straightened latent trajectories, we jointly learn an encoder and a predictor of a Joint-Embedding Predictive Architecture (JEPA) world model. We show that reducing curvature this way makes the Euclidean distance in latent space a better proxy for the geodesic distance and improves the conditioning of the planning objective. We demonstrate empirically that temporal straightening makes gradient-based planning more stable and yields significantly higher success rates across a suite of goal-reaching tasks. Our code is available at https://agenticlearning.ai/temporal-straightening.
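One common way to measure how straight a latent trajectory is compares the directions of consecutive displacement vectors. The sketch below follows that convention and penalizes one minus their cosine similarity; the paper's actual regularizer is not given in the abstract, so the formula, the function name, and the small epsilon are assumptions. The function needs at least three latent states.

```python
import math


def curvature_penalty(latents, eps=1e-12):
    """Mean of (1 - cosine similarity) between consecutive displacement vectors.

    latents: a sequence of latent vectors z_0, z_1, ..., z_T with T >= 2.
    Returns a value near 0 for a straight trajectory and larger values for sharper turns.
    """
    steps = [[b - a for a, b in zip(z0, z1)] for z0, z1 in zip(latents, latents[1:])]
    penalties = []
    for v, w in zip(steps, steps[1:]):
        dot = sum(x * y for x, y in zip(v, w))
        norms = math.sqrt(sum(x * x for x in v)) * math.sqrt(sum(y * y for y in w))
        penalties.append(1.0 - dot / (norms + eps))
    return sum(penalties) / len(penalties)


straight = curvature_penalty([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])    # close to 0
right_angle = curvature_penalty([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])  # close to 1
```

On a perfectly straight path, the Euclidean distance between the endpoints equals the length travelled along the path, which gives some intuition for the abstract's claim that lower curvature makes Euclidean distance a better proxy for geodesic distance.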

2603.10444 2026-06-15 cs.LG cs.AI Version update

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

FP4量化LLM训练中均值偏差的诅咒与祝福

Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fang Dong, Anrui Chen, Ruijun Huang, Xin Zhang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Yixuan Chen, Li Shang

Affiliations * Fudan University; University of Bath; Shanghai Innovation Institute; University of Oxford; Oxford Suzhou Centre for Advanced Research; University of Colorado Boulder; University of Michigan; Shenzhen Loop Area Institute

AI summary Finds that FP4 training failures stem from activation outliers dominated by a rank-one mean bias, and proposes Averis, a mean-residual splitting quantization method that enables robust W4A4G4 training on Qwen3 models with smaller loss gaps than NVIDIA's Hadamard-based method.

Details
AI Chinese abstract

FP4训练有望为大型语言模型节省大量内存和计算,但由于分块量化受极端激活幅度支配,导致动态范围膨胀并压缩长尾信号,因此仍然脆弱。我们发现了这一失败的一个反直觉来源:主导激活异常值不仅仅是任意的稀疏事件,而主要是由一致的秩一均值偏差引起的,其方向与主导各向异性谱分量对齐。该均值分量在训练过程中增强,被注意力和FFN算子放大和重塑,并日益主导顶部激活幅度。至关重要的是,这一发现揭示了一个看似复杂的异常值抑制问题实际上有一个非常简单的解决方案:在量化之前隔离一致的均值。因此,我们提出了Averis,一种均值残差分割量化方法,该方法在FP4量化之前仅使用归约和逐元素减法来分离均值分量。在100B token上训练的Qwen3 0.6B密集模型和50B token上训练的Qwen3 7B A1.5B MoE模型上,Averis实现了鲁棒的W4A4G4 FP4训练,将BF16损失差距降低至1.19%/0.81%,而NVIDIA最近发布的基于Hadamard的异常值平滑方法为2.05%/1.10%,同时将下游差距限制在0.89/0.71点。Averis在vanilla NVFP4上的端到端开销仅为2.20%,约为NVIDIA基于Hadamard设计的30%,为稳定的低位LLM训练提供了一条硬件高效的路径。与Hadamard互补,Averis在结合使用时进一步将Qwen3-0.6B的损失和下游差距降低至0.94%和0.73点。代码可在以下网址获取:this https URL。

English abstract

FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA's recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA's Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.
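The effect described in the abstract, where a shared mean inflates the block scale and crushes small signals, can be reproduced with a toy fake-quantizer. This is an illustrative sketch under several assumptions, not the Averis implementation: each row is treated as one quantization block, the block scale is a plain float rather than a low-precision scale, the mean is taken per channel over rows and kept in full precision, and all names and the example activations are hypothetical. The grid lists the magnitudes representable in the E2M1 FP4 format.

```python
import math

# Magnitudes representable in the E2M1 FP4 format.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block: scale so the largest magnitude maps to 6, then round to the grid."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_GRID[-1]
    return [math.copysign(min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g)) * scale, x)
            for x in block]


def quantize_direct(rows):
    """Quantize every row as its own block."""
    return [quantize_block(row) for row in rows]


def quantize_mean_residual(rows):
    """Subtract the per-channel mean over rows, quantize the residual, then add the mean back."""
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    residual = [[x - m for x, m in zip(row, mean)] for row in rows]
    return [[q + m for q, m in zip(qrow, mean)] for qrow in quantize_direct(residual)]


def max_error(rows, approx):
    """Largest absolute reconstruction error over all entries."""
    return max(abs(a - b) for r, q in zip(rows, approx) for a, b in zip(r, q))


# Two hypothetical token rows that share a large offset in channel 0.
acts = [[10.1, 0.2, -0.1, 0.05], [9.9, -0.2, 0.1, -0.05]]
err_direct = max_error(acts, quantize_direct(acts))
err_split = max_error(acts, quantize_mean_residual(acts))
```

In this toy case direct quantization rounds every small entry to zero because the block scale is set by the shared offset, while the residual after mean removal fits the grid almost exactly, so `err_split` is far smaller than `err_direct`. The split uses only a reduction and an elementwise subtraction, matching the operations the abstract mentions.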

2603.09377 2026-06-15 cs.CV Version update

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

SinGeo: 解锁单一模型在鲁棒跨视角地理定位中的潜力

Yang Chen, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu

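Affiliations * College of Intelligence Science and Technology, National University of Defense Technology

AI summary Proposes the SinGeo framework, which uses a dual discriminative learning architecture and a curriculum learning strategy so that a single model achieves robust cross-view geo-localization under unknown fields of view and orientations, reaching state-of-the-art performance on four benchmark datasets.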

Comments v2

Details
AI Chinese abstract

尽管近期取得了进展,鲁棒的跨视角地理定位(CVGL)仍然具有挑战性。现有方法仍依赖于视场角(FoV)特定的训练范式,其中模型在固定FoV下优化,但在未见过的FoV和未知方向上测试时性能崩溃。这种局限性需要部署多个模型来覆盖各种变化。尽管有研究通过简单随机化FoV探索了动态FoV训练,但它们未能实现跨不同条件的鲁棒性——隐含地假设所有FoV难度相同。为解决这一差距,我们提出了SinGeo,一个简单而强大的框架,使单一模型能够实现鲁棒的跨视角地理定位,无需额外模块或显式变换。SinGeo采用双判别学习架构,增强地面和卫星分支内的视角内判别性,并且是第一个引入课程学习策略以实现鲁棒CVGL的方法。在四个基准数据集上的广泛评估表明,SinGeo在多种条件下取得了最先进(SOTA)结果,并且显著优于专门为极端FoV训练的方法。除了卓越的性能,SinGeo还展示了跨架构的可迁移性。此外,我们提出了一种一致性评估方法,以定量评估模型在不同视角下的稳定性,为理解和推进未来CVGL研究的鲁棒性提供了可解释的视角。代码将在接收后公开。

English abstract

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions -- implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Code will be made available upon acceptance.

2603.05556 2026-06-15 cs.LG Version update

IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings

IntSeqBERT: 通过模谱嵌入学习OEIS中的算术结构

Kazuhisa Nakasho

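Affiliations * Iwate Prefectural University

AI summary Proposes IntSeqBERT, a dual-stream Transformer encoder that fuses a continuous log-magnitude embedding with sin/cos modulo embeddings for 100 moduli and jointly trains three prediction heads on OEIS sequences, markedly improving sequence prediction accuracy.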

详情
AI中文摘要

OEIS中的整数序列涵盖从个位数常数到天文阶乘和指数,使得标准分词模型难以处理,因为它们无法处理词汇表外的值或利用周期性算术结构。我们提出IntSeqBERT,一种用于OEIS掩码整数序列建模的双流Transformer编码器。每个序列元素沿两个互补轴编码:连续对数尺度幅度嵌入和100个残差(模数$2$--$101$)的正弦/余弦模嵌入,通过FiLM融合。三个预测头(幅度回归、符号分类和100个模数的模预测)在274,705个OEIS序列上联合训练。在Large规模(9150万参数)下,IntSeqBERT在测试集上达到95.85%的幅度准确率和50.38%的平均模准确率(MMA),分别比标准分词Transformer基线高出$+8.9$和$+4.5$个百分点。去除模流的消融实验证实,模流贡献了$+15.2$个百分点的MMA增益,并额外贡献了$+6.2$个百分点的幅度准确率。基于概率中国剩余定理(CRT)的解算器将模型预测转化为具体整数,使得下一项预测比分词Transformer基线提升7.4倍(Top-1: 19.09% vs. 2.59%)。模谱分析显示,归一化信息增益(NIG)与欧拉函数比值$\varphi(m)/m$之间存在强负相关($r = -0.851$, $p < 10^{-28}$),为复合模数通过CRT聚合更有效地捕获OEIS算术结构提供了经验证据。

英文摘要

Integer sequences in the OEIS span values from single-digit constants to astronomical factorials and exponentials, making prediction challenging for standard tokenised models that cannot handle out-of-vocabulary values or exploit periodic arithmetic structure. We present IntSeqBERT, a dual-stream Transformer encoder for masked integer-sequence modelling on OEIS. Each sequence element is encoded along two complementary axes: a continuous log-scale magnitude embedding and sin/cos modulo embeddings for 100 residues (moduli $2$--$101$), fused via FiLM. Three prediction heads (magnitude regression, sign classification, and modulo prediction for 100 moduli) are trained jointly on 274,705 OEIS sequences. At the Large scale (91.5M parameters), IntSeqBERT achieves 95.85% magnitude accuracy and 50.38% Mean Modulo Accuracy (MMA) on the test set, outperforming a standard tokenised Transformer baseline by $+8.9$ pt and $+4.5$ pt, respectively. An ablation removing the modulo stream confirms it accounts for $+15.2$ pt of the MMA gain and contributes an additional $+6.2$ pt to magnitude accuracy. A probabilistic Chinese Remainder Theorem (CRT)-based Solver converts the model's predictions into concrete integers, yielding a 7.4-fold improvement in next-term prediction over the tokenised-Transformer baseline (Top-1: 19.09% vs. 2.59%). Modulo spectrum analysis reveals a strong negative correlation between Normalised Information Gain (NIG) and Euler's totient ratio $\varphi(m)/m$ ($r = -0.851$, $p < 10^{-28}$), providing empirical evidence that composite moduli capture OEIS arithmetic structure more efficiently via CRT aggregation.
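To make the two input axes in the abstract concrete, here is a minimal Python sketch of the raw per-element features: a log-scale magnitude and sin/cos phases of the residues for moduli 2 to 101, plus the totient ratio $\varphi(m)/m$ used in the spectrum analysis. The function names and the exact `log(|n| + 1)` form are assumptions made for illustration, not taken from the paper, and the learned embedding layers, FiLM fusion, and probabilistic CRT solver are not reproduced.

```python
import math

# Moduli 2..101 inclusive: the 100 moduli named in the abstract.
MODULI = range(2, 102)

def log_magnitude(n: int) -> float:
    """Log-scale magnitude of n. The +1 keeps n = 0 finite (assumed form).
    math.log accepts arbitrarily large Python ints, so huge terms are fine."""
    return math.log(abs(n) + 1)

def modulo_features(n: int) -> list[float]:
    """For each modulus m, map the residue n mod m to a point on the unit
    circle, so residues 0 and m-1 end up adjacent. Returns 200 floats."""
    feats = []
    for m in MODULI:
        phase = 2.0 * math.pi * (n % m) / m
        feats.extend((math.sin(phase), math.cos(phase)))
    return feats

def totient_ratio(m: int) -> float:
    """Euler's totient ratio phi(m)/m, computed as the product of
    (1 - 1/p) over the distinct prime factors p of m."""
    ratio, p, rest = 1.0, 2, m
    while p * p <= rest:
        if rest % p == 0:
            ratio *= 1.0 - 1.0 / p
            while rest % p == 0:
                rest //= p
        p += 1
    if rest > 1:  # leftover prime factor
        ratio *= 1.0 - 1.0 / rest
    return ratio
```

Primes have a ratio close to 1 (for example 6/7 for m = 7), while composites with several small prime factors have a lower ratio (1/3 for m = 6), which is the quantity the abstract reports as negatively correlated with NIG.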

2603.05230 2026-06-15 cs.CV cs.RO 版本更新

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

数字孪生驱动的自动化分拣系统中的纺织品分类与异物识别

Serkan Ergun, Tobias Mitterer, Hubert Zangl

Affiliations: Institute of Smart Systems Technologies; University of Klagenfurt; AAU SAL USE Laboratory; Silicon Austria Labs

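AI summary: Proposes a digital-twin-driven robotic sorting system that combines grasp prediction, multimodal perception, and vision-language models for textile classification and foreign-object detection; the Qwen model family reaches up to 87.9% overall accuracy.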

Comments 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)

详情
AI中文摘要

对可持续纺织品回收日益增长的需求要求强大的自动化解决方案,能够处理可变形服装并在杂乱环境中检测异物。本文提出了一种数字孪生驱动的机器人分拣系统,集成了抓取预测、多模态感知和语义推理,用于现实世界中的纺织品分类。一个配备RGBD传感、电容式触觉反馈和碰撞感知运动规划的双臂机器人单元,自主地将服装从未分类的篮子中分离,将其转移到检查区域,并使用最先进的视觉语言模型(VLM)进行分类。我们在一个包含223个检查场景的数据集上对来自五个模型家族的九个VLM进行了基准测试,这些场景包括衬衫、袜子、裤子、内衣、异物(包括上述类别之外的服装)和空场景。评估考察了每类准确率、幻觉行为以及在实际硬件约束下的计算性能。结果表明,Qwen模型家族实现了最高的总体准确率(高达87.9%),具有强大的异物检测性能,而较轻的模型如Gemma3为边缘部署提供了有竞争力的速度-准确率权衡。数字孪生结合MoveIt实现了碰撞感知路径规划,并将分割后的检查服装3D点云集成到虚拟环境中,以提高操作可靠性。所提出的系统证明了将语义VLM推理与常规抓取检测和数字孪生技术相结合,在现实工业环境中实现可扩展的自主纺织品分拣的可行性。

英文摘要

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign-object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.