arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.15751 2026-06-16 cs.SD cs.LG cs.MM eess.AS New submission

Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models

Hyebin Cho, Jaehyuk Jang, Changick Kim, Joon Son Chung

Affiliations: Korea Advanced Institute of Science and Technology

AI summary: Introduces trainable prompts into the audio encoder to capture task-specific acoustic features and combines them with text prompts to improve few-shot adaptation, validated on 11 datasets.

Comments Accepted to INTERSPEECH 2026

Abstract

Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, these approaches adapt only the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at https://github.com/hyebin-c/aspl.

2606.15749 2026-06-16 cs.CV cs.AI cs.SY eess.SY New submission

OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

Maonan Wang, Zhengyan Huang, Kemou Jiang, Yuhang Fu, Jiayue Zhu, Yuxin Cai, Xingchen Zou, Qiaosheng Zhang, Yi Yu, Ding Wang, Xi Chen, Ben M. Chen, Yuxuan Liang, Zhiyong Cui, Man On Pun, Yirong Chen

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai AI Lab; Beihang University; Nanyang Technological University; The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong

AI summary: Presents OmniTraffic, a controllable generation pipeline and benchmark built on 3D reconstructions of 12 real intersections. With 8M VQA samples and a 3K human-verified test set, it evaluates 11 frontier MLLMs, reveals a large human-model gap in topology and spatio-temporal reasoning, and shows that fine-tuning on simulated data improves performance on real scenes.

Comments 34 pages, 28 figures

Abstract

Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view-BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human-model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.

2606.15743 2026-06-16 cs.LG New submission

Unsupervised Learning for Missing Modalities in Multimodal Learning

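Hassan Ismkhan, Hamid Bouchahcia

Affiliations: Bournemouth University

AI summary: Proposes the UL4M4 framework, which handles arbitrary missing modalities through unsupervised clustering and iterative imputation, preserves cross-modal structure and scale invariance, and consistently reaches F1-Micro above 0.7 even when more than 50% of modality slots are missing.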

Abstract

This paper addresses the missing-modality challenge in multi-modal learning by introducing Unsupervised Learning for Missing Modalities in Multi-Modal Learning (UL4M4), a flexible framework that imputes missing feature embeddings in a task-independent manner before supervised prediction. We propose modality-specific normalization and a novel partial-modality distance metric to enable fair clustering of incomplete observations, capturing cross-modal structures while preserving scale-invariance across varying dimensionalities and modality counts. Cluster centers from this unsupervised stage guide an iterative greedy imputation process for any missing modalities during training or inference, supporting arbitrary numbers of modalities and arbitrary missing patterns per sample. The imputation module is lightweight, uses frozen encoders, and decouples from the downstream task, allowing easy integration with any fusion/prediction architecture. Extensive experiments under diverse and highly incomplete regimes demonstrate UL4M4's robustness, achieving, to the best of our knowledge, the first consistent F1-Micro scores above 0.7 on challenging missing configurations even when more than 50% of modality slots are missing. Results are also stable across cluster sizes and significantly outperform state-of-the-art baselines. Code is available here: https://github.com/h-ismkhan/Multimodal-Learning-with-Missing-Modalities-via-Unsupervised-Learning.
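
The idea of a partial-modality distance followed by center-guided imputation can be sketched roughly as follows. This is only an illustration under stated assumptions, not the UL4M4 implementation: the per-dimension normalisation, the function names `partial_distance` and `impute`, and the single-pass fill (the abstract describes an iterative greedy process) are all hypothetical simplifications.

```python
import math

def partial_distance(sample, center):
    """Distance computed only over the modalities present in `sample`.

    Each modality's squared Euclidean distance is divided by its
    dimensionality so wide embeddings do not dominate, then averaged over
    the observed modalities (an assumed normalisation, not the paper's).
    """
    terms = []
    for name, vec in sample.items():
        if vec is None:  # missing modality: contributes nothing
            continue
        sq = sum((a - b) ** 2 for a, b in zip(vec, center[name]))
        terms.append(sq / len(vec))
    if not terms:
        raise ValueError("sample has no observed modality")
    return math.sqrt(sum(terms) / len(terms))

def impute(sample, centers):
    """Fill each missing modality from the nearest cluster center."""
    nearest = min(centers, key=lambda c: partial_distance(sample, c))
    return {name: (vec if vec is not None else list(nearest[name]))
            for name, vec in sample.items()}

# Two hypothetical cluster centers over a 2-d audio and a 3-d text embedding.
centers = [
    {"audio": [0.0, 0.0], "text": [1.0, 1.0, 1.0]},
    {"audio": [5.0, 5.0], "text": [9.0, 9.0, 9.0]},
]
sample = {"audio": [4.8, 5.1], "text": None}  # text modality is missing
filled = impute(sample, centers)
```

Because the distance skips absent modalities, the same routine works for any missing pattern per sample, which is the property the abstract emphasises.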

2606.15741 2026-06-16 cs.CL cs.AI New submission

A Self Consistency Based Reranking for Narrative Question Answering

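Molham Mohamed, Ali Hamdi

Affiliations: GitHub

AI summary: Proposes a self-consistency reranking framework that generates multiple candidate answers and selects the final one by semantic agreement, improving the robustness and accuracy of narrative question answering.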

Abstract

Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers. To address this limitation, we propose a self-ensemble, self-consistency-based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection, without requiring modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34 percentage points) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57 percentage points), highlighting the effectiveness of the proposed strategy.
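
Consensus-based selection of this kind can be sketched in a few lines. The example below is a minimal illustration, not the authors' pipeline: token-level Jaccard overlap stands in for whatever semantic similarity the paper uses, the candidate strings are invented, and `rerank_by_agreement` is a hypothetical name.

```python
def jaccard(a, b):
    """Token-set overlap, used here as a stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def rerank_by_agreement(candidates, similarity=jaccard):
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def score(i):
        others = [similarity(candidates[i], c)
                  for j, c in enumerate(candidates) if j != i]
        return sum(others) / len(others)

    best = max(range(len(candidates)), key=score)
    return candidates[best]

# Invented sampled answers for one story-question pair.
candidates = [
    "he hid the letter in the attic",
    "the letter was hidden in the attic",
    "he sold the house",
    "he hid the letter in the attic",
]
answer = rerank_by_agreement(candidates)
```

The outlier answer receives the lowest agreement score, so it cannot be selected even if a single decoding pass would have produced it.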

2606.15734 2026-06-16 cs.CL cs.AI cs.IR cs.LG New submission

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

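Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes the ReGrad paradigm, which treats gradients as retrievable units of knowledge and uses meta-learning to reshape document gradients into generalizable adaptation signals, enabling scalable parametric knowledge injection without weight drift.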

Abstract

Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that ReGrad outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.
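
The retrieve-then-adapt loop described above might look roughly like the toy sketch below. Everything here is an assumption for illustration: the `GradientBank` class, cosine-similarity retrieval over key embeddings, the averaging of retrieved gradients, and the single step of size `lr` are hypothetical, and the bi-level meta-learning objective is not modelled at all.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class GradientBank:
    """Toy index mapping a document key embedding to a stored gradient."""

    def __init__(self):
        self.entries = []  # list of (key embedding, gradient)

    def add(self, key, grad):
        self.entries.append((list(key), list(grad)))

    def retrieve(self, query, k=1):
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query, e[0]), reverse=True)
        return [grad for _, grad in ranked[:k]]

def adapted_weights(weights, grads, lr=0.1):
    """Return a temporary copy stepped along the mean retrieved gradient.

    The base weights are never modified, so nothing accumulates across queries.
    """
    if not grads:
        return list(weights)
    mean = [sum(col) / len(grads) for col in zip(*grads)]
    return [w - lr * g for w, g in zip(weights, mean)]

base = [1.0, 1.0]
bank = GradientBank()
bank.add(key=[1.0, 0.0], grad=[2.0, 0.0])  # gradient for hypothetical doc A
bank.add(key=[0.0, 1.0], grad=[0.0, 4.0])  # gradient for hypothetical doc B
temp = adapted_weights(base, bank.retrieve(query=[0.9, 0.1], k=1))
```

Discarding `temp` after the query restores the original model exactly, which is the sense in which the injection is reversible.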

2606.15733 2026-06-16 cs.CL cs.AI New submission

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

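Zhenyu Yu

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University

AI summary: Using a paired-view weight update and activation patching, finds that the answer differences caused by variable-name substitution in causal reasoning stem from representational misalignment rather than information loss, and validates the aligning effect of counterfactual augmentation on Qwen and Llama models.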

Abstract

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

2606.15730 2026-06-16 cs.LG cs.AI New submission

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

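Zhenyu Yu

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University

AI summary: Proposes InstantForget, which performs backdoor unlearning without parameter updates through an inference-time feature reset: a Mahalanobis distance detects anomalous features, which are reset to a neutral representation, reducing average ASR on CIFAR-10 to 0.071.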

Abstract

Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common projection assumption under oracle paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at 0.683, 0.888, and 0.941 ASR on CIFAR-10 ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. It is predicted by a logit-triplet gap involving the target margin, target-logit drop, and non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers without triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, ModelNet10 point blend, two backbone geometries, and adaptive feature-compactness attacks define the method's scope.
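
A gated reset of this flavour can be illustrated with a small sketch. It is not the InstantForget code: the diagonal covariance, the threshold value, the choice of the clean mean as the "neutral" representation, and the names `fit_clean_stats` and `gated_reset` are all assumptions made for the example.

```python
import math

def fit_clean_stats(features):
    """Per-dimension mean and variance from clean calibration features."""
    n, dims = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-6
           for d in range(dims)]
    return mean, var

def mahalanobis(x, mean, var):
    """Mahalanobis score under an assumed diagonal covariance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var)))

def gated_reset(x, mean, var, neutral, threshold, alpha=1.0):
    """Leave unflagged features untouched; move flagged ones toward `neutral`."""
    if mahalanobis(x, mean, var) <= threshold:
        return list(x), False
    return [(1 - alpha) * a + alpha * n for a, n in zip(x, neutral)], True

clean = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mean, var = fit_clean_stats(clean)
neutral = mean  # assumption: the clean mean plays the neutral role
kept, flagged_clean = gated_reset([0.4, 0.6], mean, var, neutral, threshold=3.0)
reset, flagged_trig = gated_reset([6.0, 6.0], mean, var, neutral, threshold=3.0)
```

Only the model's intermediate feature is altered at inference time, so no parameter is updated, which matches the update-free setting the abstract studies.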

2606.15716 2026-06-16 cs.LG New submission

How to Score Experts for One-Shot MoE Expert Pruning: A Unified Formulation and Selection Principle

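Zongfang Liu, Jinghui Zhang, Zijian Ma, Guangyi Chen, Xin Yuan

Affiliations: Zhejiang University; Westlake University; Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

AI summary: Proposes a unified formulation for one-shot MoE expert pruning based on three factors (routing frequency, gate weighting, and activation strength), derives that task-agnostic pruning should use activation-based criteria while task-specific pruning can retain routing-frequency and gate information, and introduces two new criteria, MAN and MSAN, that achieve the best performance across several models and benchmarks.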

Abstract

Mixture-of-Experts (MoE) language models reduce per-token computation through sparse expert activation, yet deployment still requires storing the full expert pool, making one-shot expert pruning a practical approach for reducing memory usage. Although effective, existing criteria are largely heuristic, and no single criterion is universally optimal. Thus, establishing a principle for selecting pruning criteria suited to different deployment objectives remains an important yet largely underexplored problem in one-shot expert pruning. To this end, we introduce a unified formulation for one-shot MoE expert pruning organized around three factors: routing frequency, gate weighting, and activation strength. The formulation yields a criteria selection principle: task-agnostic pruning should favor routed-token-averaged, gate-free activation-based criteria, whereas task-specific pruning can benefit from retaining routing-frequency and gate-weight information. Beyond this principle, the formulation also provides a systematic view of existing heuristic criteria and gives rise to two new task-agnostic criteria, Mean Activation Norm (MAN) and Mean Squared Activation Norm (MSAN). Across four representative MoE models and 16 diverse benchmarks, MAN and MSAN are consistently strong in the task-agnostic setting, obtain the top-two average ranks, and improve average performance by up to 8.8 points over the strongest baseline.
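
One plausible reading of the two criteria is sketched below. The abstract names them but does not give formulas, so the definitions here are assumptions: MAN is taken as the mean L2 norm of an expert's output over the tokens routed to it, MSAN as the mean squared norm, both without gate weights. The function names and toy activations are hypothetical.

```python
import math

def expert_scores(routed_outputs):
    """routed_outputs maps expert id -> output vectors of tokens routed to it."""
    scores = {}
    for expert, outs in routed_outputs.items():
        if not outs:  # an expert that never fired gets the lowest score
            scores[expert] = {"MAN": 0.0, "MSAN": 0.0}
            continue
        sq_norms = [sum(x * x for x in v) for v in outs]
        scores[expert] = {
            # averaged over routed tokens, so routing frequency cancels out
            "MAN": sum(math.sqrt(s) for s in sq_norms) / len(outs),
            "MSAN": sum(sq_norms) / len(outs),
        }
    return scores

def experts_to_prune(scores, criterion, n_prune):
    """Pick the n_prune experts with the lowest score under `criterion`."""
    ranked = sorted(scores, key=lambda e: scores[e][criterion])
    return ranked[:n_prune]

routed = {
    "e0": [[3.0, 4.0], [0.0, 5.0]],  # rarely routed, strong activations
    "e1": [[0.1, 0.0]] * 50,         # frequently routed, weak activations
    "e2": [[1.0, 0.0], [0.0, 2.0]],
}
scores = expert_scores(routed)
pruned = experts_to_prune(scores, "MAN", n_prune=1)
```

In this toy case the most frequently routed expert is the one pruned, which shows how a routed-token-averaged, gate-free criterion differs from a frequency-based one.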

2606.15714 2026-06-16 cs.CL cs.RO New submission

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

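Hanyang Chen, Hongliang Li, Jiarui Cao, Yang Li, Yang Jiang, Haonan Wen, Kaiyu Huang, Shengnan Guo, Huaiyu Wan

Affiliations: Beijing Jiaotong University

AI summary: Presents the first systematic study of multilingual instruction following in VLA models, finds that English-trained models degrade substantially in other languages, and proposes Multilingual Principal Component Alignment to narrow the gap.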

Abstract

Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these multilingual capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that multilingual instruction-caused representation shifts may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which leverages Principal Component Analysis to get the principal component subspace and align projected multilingual representations, effectively reducing the multilingual performance gap.

2606.15709 2026-06-16 cs.AI cs.MA New submission

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

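Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

Affiliations: Jordan

AI summary: Proposes a framework integrating EPANET hydraulic modeling, digital twins, SCADA, and LLM agents, combining real-time data with physics-based simulation for anomaly detection and adaptive decision-making. A proof of concept on a 1,164-junction network in Jordan generates health reports automatically in under 2 minutes and accurately detects and localizes a burst.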

Journal ref 2026 2nd International Conference on Computational Intelligence Approaches and Applications (ICCIAA)

Abstract

Jordan faces severe water scarcity, with 50% of the water produced lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst, which confirms alignment with DZ monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.
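
The local flow-anomaly step could be sketched as a comparison between baseline (simulated) and observed pipe flows. This is a hedged illustration only: it does not call EPANET or EPYT, and the relative threshold, the `min_flow` floor, the pipe and junction identifiers, and the flow values are all invented for the example.

```python
def flag_flow_anomalies(baseline, observed, pipes,
                        rel_threshold=0.2, min_flow=0.5):
    """Flag pipes whose flow deviates from baseline, and their end junctions.

    baseline/observed map pipe id -> flow in L/s; pipes maps pipe id ->
    (start junction, end junction). min_flow avoids dividing by near-zero
    baseline flows.
    """
    flagged_pipes = []
    for pipe_id, base in baseline.items():
        change = abs(observed[pipe_id] - base)
        scale = max(abs(base), min_flow)
        if change / scale > rel_threshold:
            flagged_pipes.append(pipe_id)
    junctions = sorted({j for p in flagged_pipes for j in pipes[p]})
    return flagged_pipes, junctions

# Hypothetical three-pipe chain with a leak somewhere near J3.
pipes = {"P1": ("J1", "J2"), "P2": ("J2", "J3"), "P3": ("J3", "J4")}
baseline = {"P1": 10.0, "P2": 8.0, "P3": 6.0}
observed = {"P1": 10.5, "P2": 12.0, "P3": 3.5}
flagged, cluster = flag_flow_anomalies(baseline, observed, pipes)
```

The junctions shared by the flagged pipes form the candidate cluster, analogous to the 15-junction cluster the abstract reports for the simulated leak.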

2606.15701 2026-06-16 cs.LG q-fin.ST New submission

Robust Transformer-Based One-Step Stock Index Forecasting via Shifted Data Augmentation

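Tien Thanh Thach

Affiliations: Faculty of Mathematics and Statistics, Ton Duc Thang University

AI summary: Proposes a modified Transformer architecture combined with cosine-annealing learning-rate scheduling and Shifted Data Augmentation (SDA), which reduces forecasting error and variability on the VN30 and S&P 500 indices and outperforms approaches that increase model complexity.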

Abstract

Transformers have shown remarkable success in sequence modeling, yet their direct application to financial time series remains challenging due to noisy signals, short-memory dynamics, and distributional shifts. This paper proposes a modified Transformer architecture for one-step stock index forecasting, combined with advanced learning-rate scheduling and a novel Shifted Data Augmentation (SDA) technique. We evaluate the proposed framework on two benchmark stock index datasets, VN30 and S&P 500. Experimental results demonstrate that cosine annealing with warmup consistently improves forecasting accuracy over the generalized inverse-power scheduler. Furthermore, SDA substantially reduces forecasting errors and run-to-run variability while improving robustness to hyperparameter selection. The combination of cosine annealing scheduling and SDA achieved the best performance on both datasets, indicating that data augmentation can play a more important role than increasing model complexity in Transformer-based financial forecasting. These findings provide a practical and computationally efficient approach for robust stock index forecasting in noisy financial environments.
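
For readers unfamiliar with the schedule, cosine annealing with linear warmup can be written as a small function. This is the common textbook form, not necessarily the exact variant used in the paper; the step counts and peak rate below are arbitrary illustrative values, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # ramp linearly so the last warmup step reaches peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative run: 100 training steps, 10 of them warmup.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
            for s in range(101)]
```

The rate rises to its peak during warmup, then falls smoothly, reaching half the peak midway through the decay phase and `min_lr` at the final step.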

2606.15696 2026-06-16 cs.AI cs.CL cs.LG New submission

Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

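Jason M Pittman, Yesenia Medina-Santos, Anton Phillips, Brielle C. Stark

Affiliations: Indiana University Bloomington

AI summary: Evaluates instruction-tuned large language models on token-level correct information unit classification of aphasic discourse under zero-shot and few-shot prompting, finding that few-shot prompting improves results but agreement remains insufficient.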

Comments 5 tables, 4 figures

Abstract

Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.
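
The evaluation metrics named above follow standard definitions, which the sketch below computes for binary token labels (1 = CIU, 0 = not a CIU). The label sequences are invented to mimic the reported pattern of high recall with lower precision; they are not data from the study, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(gold, pred):
    """Precision, recall, F1, and Cohen's kappa for binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    observed = (tp + tn) / n
    # agreement expected by chance from the two raters' label marginals
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}

gold = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # hypothetical human consensus labels
pred = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]  # hypothetical over-inclusive model
metrics = binary_metrics(gold, pred)
```

Here every true CIU is recovered but two non-CIU tokens are also accepted, so F1 stays fairly high while kappa, which corrects for chance agreement, is noticeably lower.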

2606.15695 2026-06-16 cs.LG cs.AI New submission

When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

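Thinh T. H. Nguyen, Khoa D. Doan, Binh T. Nguyen, Danh Le-Phuoc, Kok-Seng Wong

Affiliations: VinUniversity; VNU-HCM, University of Science; Technische Universität Berlin

AI summary: To address forgetting of old knowledge in heterogeneous federated class-incremental learning, where clients hold different label subsets and sit at different task stages, proposes the projected rehearsal orchestration framework PRO and its enhanced variant PRO-MAX. The server maintains compact class-level projected memories and clients perform balanced pseudo multi-task training, improving retention and final utility under heterogeneous streams on image, text, and graph benchmarks.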

Comments 46 pages

Abstract

Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts. Existing FCIL methods often preserve old knowledge through input-space synthesis, but they can be fragile under heterogeneous task streams and difficult to transfer across modalities. To alleviate such issues, we propose PRO, a framework that replaces synthetic input replay with projected rehearsal orchestration. To remove external pretraining, we evaluate all methods under the same warmup. After this, PRO maintains compact class-level projected memories on the server and allows clients to perform balanced pseudo multi-task training over current examples and old projected memories. To handle stronger representation drift, we further introduce PRO-MAX, which augments PRO with neighborhood-weighted memory alignment while preserving the same server-light principle that the server only aggregates model updates and memory statistics. Across image, text, and graph benchmarks, PRO and PRO-MAX improve retention and final utility under heterogeneous streams while remaining competitive in homogeneous FCIL. Even when baselines are given expanded replay budgets, they degrade under supervision imbalance and stage misalignment, indicating that replay quantity alone does not resolve replay-quality failures. Additional weak-task diagnostics further show that larger replay mismatch is associated with larger downstream degradation, while our method keeps projected memories better aligned with the evolving representation.

2606.15691 2026-06-16 cs.RO New submission

Can Causal Models Enhance Robot Navigation? Online Causal Adaptation for Real-Robot Navigation

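Zhitao Liang, Alex Mitrevski, Emmanuel Dean, Karinne Ramirez-Amaro

Affiliations: Chalmers University of Technology

AI summary: Studies the transfer of causal models to real-robot navigation, proposing two uses (offline evaluation and online adaptation); experiments show that causal models substantially improve navigation performance in complex scenarios.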

Comments Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

Abstract

Causality in robotics aims to produce more interpretable and flexible robot behaviours by enabling robots to predict the consequences of their actions; however, deploying causal models with existing systems (e.g., navigation) operating in real environments remains understudied. This paper addresses the challenging problem of transferring causal models to real-robot experiments in a navigation scenario. We study this problem in two ways: (i) using the causal model as an offline evaluation module that predicts the competence of recorded real-robot navigation trajectories and relates it to quantitative navigation performance, and (ii) using the causal model as an online adaptation module that intervenes when the predicted competence of the default navigation is low. We validate our approach on a physical service robot that patrols corridors. We show that the predicted competence correlates positively with path efficiency and negatively with path irregularities (suboptimal behaviour). The model predictions also show strong agreement with human annotations (Cohen's kappa value of 0.88). In online experiments, the proposed method improves navigation performance in complex scenarios such as cornering and obstacle avoidance, yielding higher predicted competence and better navigation metrics than the default navigation baseline. In simpler scenarios, where the baseline already performs near-optimally, the causal adaptation provides limited benefit. These results indicate that causal models are particularly effective in enhancing navigation under increased task complexity. Overall, our results demonstrate that causal models developed for behavioural interpretation can be successfully integrated into real-robot navigation systems.

2606.15690 2026-06-16 cs.LG math.DS 新提交

Multi-Fidelity SINDy: Sparse Discovery of Nonlinear Dynamical Systems with Fidelity-Weighted Measurements

多保真度SINDy:基于保真度加权测量的非线性动力系统稀疏发现

Filippo Zacchei, Ana Larrañaga, Attilio Frangi, Andrea Manzoni, Steven L. Brunton

发表机构 * Politecnico di Milano(米兰理工大学) University of Washington(华盛顿大学)

AI总结 针对异质噪声数据,提出多保真度SINDy方法,通过加权回归融合集成SINDy和弱SINDy,从不同保真度测量中稀疏识别非线性动力系统,理论证明加权策略的统计合理性,在常微分和偏微分方程基准系统及双摆预测中验证了其抑制异方差噪声、利用低成本低质量数据提升模型恢复的效果。

Comments 27 pages, 6 figures, 2 tables

详情
AI中文摘要

来自模拟和实验的数据很少是无噪声的,并且常常表现出异质保真度水平。测量不确定性可能在重复观测、传感设备甚至单个实验中变化。本文解决了从这种非均匀数据中发现非线性动力系统的问题。我们通过将集成SINDy和弱SINDy结合在由广义最小二乘法导出的加权回归公式中,扩展了稀疏识别非线性动力系统(SINDy)框架以考虑可变噪声水平。还提供了加权策略的统计证明。该方法在几个基准系统上得到验证,包括常微分和偏微分方程。此外,我们展示了多保真度集成在预测双摆系统动力学中的优势。结果证实,所提出的方法减轻了异方差噪声的不利影响,并且重复、低成本、低质量的测量可以改善模型恢复,在某些情况下匹配或优于仅使用高保真度数据获得的重建结果。

英文摘要

Data from simulations and experiments are rarely noise-free and often exhibit heterogeneous levels of fidelity. Measurement uncertainty may vary across repeated observations, sensing devices, or even within a single experiment. This work addresses the problem of discovering nonlinear dynamical systems from such inhomogeneous data. We extend the Sparse Identification of Nonlinear Dynamical Systems (SINDy) framework to account for variable noise levels by combining Ensemble SINDy and Weak SINDy within a weighted regression formulation derived from generalized least squares. A statistical justification for the weighting strategy is also provided. The methodology is validated on several benchmark systems, including ordinary and partial differential equations. In addition, we show the benefit of multi-fidelity integration for forecasting the dynamics of a double pendulum system. The results confirm that the proposed approach mitigates the adverse effects of heteroscedastic noise and that repeated, low-cost, low-quality measurements can improve model recovery, in some cases matching or outperforming reconstructions obtained using only high-fidelity data.
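The weighted-regression idea can be illustrated with a small sketch: sequentially thresholded least squares in which each sample is scaled by the inverse of its noise standard deviation (generalized least squares with a diagonal covariance). This is a simplified illustration only. It omits the Ensemble and Weak SINDy components, assumes derivatives are observed directly, and the function name `weighted_stlsq`, the toy system, and the noise levels are all assumptions rather than the authors' implementation.

```python
import numpy as np


def weighted_stlsq(theta, dxdt, sigma, threshold=0.1, n_iter=10):
    """Sparse regression with per-sample weights 1/sigma^2 (diagonal GLS)."""
    w = 1.0 / np.asarray(sigma, dtype=float)  # row scaling = sqrt of the weights
    a = theta * w[:, None]
    b = dxdt * w
    coef = np.linalg.lstsq(a, b, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0                      # prune small library terms
        if small.all():
            break
        # Refit only the surviving terms on the weighted system.
        coef[~small] = np.linalg.lstsq(a[:, ~small], b, rcond=None)[0]
    return coef


# Toy system dx/dt = -2 x + 0.5 x^3 with interleaved fidelities (assumed values).
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 200)
sigma = np.where(np.arange(x.size) % 2 == 0, 0.01, 1.0)  # high / low fidelity
dxdt = -2.0 * x + 0.5 * x**3 + rng.normal(0.0, sigma)
theta = np.column_stack([np.ones_like(x), x, x**2, x**3])  # library [1, x, x^2, x^3]
coef_hat = weighted_stlsq(theta, dxdt, sigma)
print(np.round(coef_hat, 3))
```

Because the noisy samples receive weights four orders of magnitude smaller than the clean ones, they refine the fit slightly without corrupting it.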

2606.15686 2026-06-16 cs.AI cs.LG 新提交

Recurrent Reasoning on Symbolic Puzzles with Sequence Models

基于序列模型的符号谜题循环推理

Gowrav Mannem, Chowdhury Marzia Mahjabin, Jason Chen, Shivank Garg, Kevin Zhu

发表机构 * Algoverse AI Research Cornell University(康奈尔大学)

AI总结 提出 RecurrReason 基准,包含四个递归逻辑谜题,通过控制难度参数 N 评估序列模型,发现架构比规模更重要,预训练仅对局部结构转移函数的谜题有效。

详情
AI中文摘要

大型语言模型在符号和算法任务上通常表现强劲,但当问题变长、变难或略微超出分布时,这种表面优势可能隐藏脆弱行为。当前推理基准的一个主要限制是,许多主要测试模型是否能产生有效答案,而较少关注解决方案在可控难度缩放下是否最小、稳健和稳定。我们引入了 RecurrReason,一个难度可控的基准,包含四个递归逻辑谜题(汉诺塔、过河问题、积木世界和跳棋),具有 BFS 最优轨迹和单一可解释难度参数 $N \in \{1,\dots,10\}$,总计 10,817 个独特谜题和 285,933 步动作。我们在一致的数据划分和评估标准下,对两个 Transformer 家族(编码器-解码器模型(T5 风格)和仅解码器模型(GPT-2 风格))进行基准测试,在 $N=1$ 到 $7$ 上训练,并在 $N=8$ 到 $10$ 的保留分布内实例和更难的分布外实例上评估。微调后的预训练 T5 在积木世界上达到 97.27% 的验证准确率和 81.00% 的 OOD 准确率;所有模型在过河问题上的所有条件下得分为 0.00%。失败模式分析表明,架构比规模更能决定成功。预训练仅能迁移到具有局部结构转移函数的谜题。我们的代码和数据集将在接收后开源。

英文摘要

Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current reasoning benchmarks is that many primarily test whether a model can produce a valid answer, while paying less attention to whether the solution is minimal, robust, and stable under controlled difficulty scaling. We introduce RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World, and Checkers Jumping) with BFS-optimal trajectories and a single interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling 10,817 unique puzzles and 285,933 moves. We benchmark two Transformer families, an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style), under consistent data splits and evaluation criteria, training on $N=1$ to $7$ and evaluating on both held-out in-distribution instances and harder out-of-distribution instances at $N=8$ to $10$. Fine-tuned pre-trained T5 achieves 97.27% validation and 81.00% OOD accuracy on Block World; all models score 0.00% on River Crossing under all conditions. Failure mode analysis reveals that architecture is a stronger determinant of success than scale. Pre-training transfers only to puzzles with locally structured transition functions. Our code and dataset will be open-sourced upon acceptance.
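To make "BFS-optimal trajectories" concrete, the sketch below runs breadth-first search over Tower of Hanoi states and returns a shortest move sequence. The state encoding (three tuples of disk sizes) and the function name `hanoi_bfs` are assumptions for illustration; the benchmark's actual representation is not described in the abstract.

```python
from collections import deque


def hanoi_bfs(n_disks):
    """Return a shortest list of moves [(src_peg, dst_peg), ...] found by BFS."""
    start = (tuple(range(n_disks, 0, -1)), (), ())  # largest disk at the bottom
    goal = ((), (), tuple(range(n_disks, 0, -1)))
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for src in range(3):
            if not state[src]:
                continue
            disk = state[src][-1]
            for dst in range(3):
                # Skip no-op moves and moves onto a smaller disk.
                if dst == src or (state[dst] and state[dst][-1] < disk):
                    continue
                pegs = list(state)
                pegs[src] = state[src][:-1]
                pegs[dst] = state[dst] + (disk,)
                nxt = tuple(pegs)
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)
    # Walk parent pointers back from the goal to recover the move sequence.
    moves, state = [], goal
    while parent[state] is not None:
        state, move = parent[state]
        moves.append(move)
    return moves[::-1]


print([len(hanoi_bfs(n)) for n in range(1, 5)])  # [1, 3, 7, 15]
```

The trajectory length grows as 2^N - 1, which is one reason a single parameter N gives a clean difficulty axis for this puzzle.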

2606.15685 2026-06-16 cs.RO cs.CV 新提交

Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

通过可复用技能学习新任务:面向具身持续学习的技能组合专家

Shuaike Zhang, Shaokun Wang, Haoyu Tang, Jianlong Wu, Liqiang Nie

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shandong University(山东大学) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 提出技能组合专家(SCE)框架,通过组合技能基础(CSG)分解演示为可复用技能,并利用双执行-转换专家(DETE)实现新任务学习,有效缓解具身持续学习中的灾难性遗忘。

Comments 13 pages, 5 figures

详情
AI中文摘要

具身持续学习(ECL)旨在使机器人能够在闭环控制下持续获取新的操作任务,同时保留先前学习的行为。与传统的持续学习相比,ECL遭受更严重的灾难性遗忘。在闭环控制下累积的特征漂移通过顺序决策逐步传播,导致先前学习的行为退化。ECL中的一个关键挑战在于如何在不断演变的任务中进行结构化的技能复用,因为现有方法主要关注技能学习,而没有明确组织它们以执行连贯的任务。为了解决这个问题,我们提出了SCE,一个用于ECL的技能组合专家框架。SCE通过组合技能基础(CSG)构建技能库,将任务演示分解为可复用的技能。在此基础上,双执行-转换专家(DETE)通过技能组合实现新任务学习,其中一个分支确保技能执行,另一个支持技能之间的转换以实现连贯行为。在LIBERO基准测试和真实世界操作任务上的实验表明,SCE持续提高了保留率和整体任务性能。进一步的特征漂移分析和消融研究验证了我们方法的有效性。项目网站:https://eqcy.github.io/sce/。

英文摘要

Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.

2606.15684 2026-06-16 cs.AI 新提交

Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

Minecraft中时间敏感互补协作的多智能体框架

Juheon Yi, Jinglu Wang, Xiaoyi Zhang, Yan Lu

发表机构 * Microsoft Research Asia(微软亚洲研究院)

AI总结 提出TickingCollabBench基准和TickingCollab框架,用于评估LLM在动态、实时、异构智能体强制协作任务中的表现,发现LLM因延迟和协调困难而频繁失败。

详情
AI中文摘要

我们提出了TickingCollabBench,这是一个基于Minecraft的多智能体基准,用于一类新颖的时间敏感互补协作任务。我们的基准反映了现实世界协作的四个核心特征:智能体异构性、强制协作、动态环境以及具有失败风险的严格实时约束。为此,我们开发了TickingCollab框架,该框架支持生成多样化的动态环境,并抽象了Minecraft的原始API,以便通过声明式YAML任务规范来组合这些事件。在此基础上,我们设计了一个可行性感知的自动基准生成流水线,其中LLM起草结构多样的任务配置,可行性验证器使用近似约束过滤掉无效配置。评估表明,语言延迟以及在部分可观测性和智能体异构性下协调的固有困难,导致LLM在动态环境中频繁失败,并且远不及全局知识oracle的表现。

英文摘要

We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic environments and abstracts Minecraft's primitive APIs to enable declarative YAML task specifications for composing these events. Building on this, we design a feasibility-aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and a feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that lang latency and the inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to frequently fail in dynamic environments and fall significantly short of a global-knowledge oracle.

2606.15682 2026-06-16 cs.LG 新提交

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

ReQAT: 实现全精度推理精度的4位浮点量化感知训练

Janghwan Lee, Sihwa Lee, Jinseok Kim, Yongjik Kim, Jieun Lim, Jinwook Oh, Jungwook Choi

发表机构 * Hanyang University(汉阳大学) Samsung Advanced Institute of Technology(三星综合技术院)

AI总结 针对大推理模型在低比特量化(W4A4KV4)下推理精度严重下降的问题,提出ReQAT框架,通过迹对齐QAT、选择性熵最小化和量化友好初始化,恢复并超越BF16微调精度,实现最高3.9倍吞吐加速。

Comments ICML 2026

详情
AI中文摘要

大型推理模型(LRMs)通过长思维链实现了强大的问题解决能力,但其部署受到全精度推理的高成本和不断增长的KV缓存占用限制。微尺度FP4格式支持高效的FP4部署;然而,完全量化权重、激活和KV缓存(W4A4KV4)会导致严重的推理退化,现有的PTQ和QAT无法恢复。我们发现FP4失败集中在低熵token上——精确的符号承诺,如数字和运算符——量化噪声放大了采样误差,这些误差在推理轨迹中级联。基于这一洞察,我们提出了ReQAT,一个以推理为中心的FP4训练框架,包含三个组件:(i)迹对齐QAT(TAQ),重新审视相同的推理轨迹,将更新集中在关键的低熵决策上;(ii)选择性熵最小化(SEM),在低熵位置增强置信度;(iii)Q-FIT,一种量化友好的初始化,联合校准RoPE一致的KV缓存变换以稳定QAT。在相同的训练预算下,ReQAT不仅恢复而且超越了BF16微调精度,同时在NVIDIA DGX Spark上实现了高达3.9倍的吞吐加速,在B200上实现了3.1倍。

英文摘要

Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats enable efficient FP4 deployment; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation that existing PTQ and QAT fail to recover. We identify that FP4 failures concentrate on low-entropy tokens (precise symbolic commitments such as digits and operators), where quantization noise inflates sampling errors that cascade through reasoning traces. Based on this insight, we propose ReQAT, a reasoning-centric FP4 training framework with three components: (i) Trace-Aligned QAT (TAQ), which revisits identical reasoning traces to focus updates on critical low-entropy decisions; (ii) Selective Entropy Minimization (SEM), which reinforces confidence at low-entropy positions; and (iii) Q-FIT, a quantization-friendly initialization that jointly calibrates RoPE-consistent KV cache transformations to stabilize QAT. Under the same training budget, ReQAT not only recovers but surpasses BF16 fine-tuning accuracy, while delivering up to 3.9x throughput speedup on NVIDIA DGX Spark and 3.1x on B200.
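The notion of a "low-entropy position" can be sketched as follows: compute the entropy of each next-token distribution, keep the positions whose entropy falls below a threshold, and average the entropy over those positions as an auxiliary loss. The abstract does not give the exact SEM objective, so the threshold `tau`, the function names, and the loss form here are assumptions for illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_entropy_loss(logits_per_position, tau=0.5):
    """Mean entropy over positions whose entropy is already below tau."""
    ents = [entropy(softmax(row)) for row in logits_per_position]
    selected = [h for h in ents if h < tau]
    loss = sum(selected) / len(selected) if selected else 0.0
    return loss, ents


# Position 0 is a confident (digit-like) decision; position 1 is uniform.
logits = [[8.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
loss, ents = selective_entropy_loss(logits)
print(loss, ents)
```

Only the confident position contributes to the loss, so minimizing it sharpens decisions that are already nearly deterministic and leaves genuinely uncertain positions untouched.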

2606.15681 2026-06-16 cs.CV 新提交

3D Consistency Optimization for Self-Supervised Monocular Video Depth Estimation

自监督单目视频深度估计的3D一致性优化

Yuanye Liu, Ke Zhang, Junzhe Jiang, Li Zhang, Vishal Patel, Xiahai Zhuang

发表机构 * Fudan University(复旦大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出一种将视频深度估计转化为多视图3D重建的框架,通过光度渲染、世界坐标对齐和多尺度时间梯度一致性约束,实现全局3D结构一致性,在自监督和零样本临床场景中达到最先进的空间精度。

详情
AI中文摘要

可靠的单目视频深度估计对于内窥镜导航中的下游3D推理和具身AI至关重要。然而,现有的自监督方法通常独立处理视频帧或依赖弱时间正则化。这些方法缺乏对底层3D场景的整体感知,不可避免地遭受几何不一致的预测和严重的跨帧漂移。为了解决这些限制,我们引入了一种新范式,将顺序视频深度估计重新表述为无约束的多视图3D重建问题,从而能够充分利用嵌入在最近3D基础模型中的强大几何先验。我们方法的核心是一个由三个约束驱动的3D一致性优化框架:图像级光度渲染、显式世界坐标几何对齐和多尺度时间梯度一致性。这种统一优化优雅地将孤立帧锚定到全局一致的3D结构上。我们的方法在自监督训练场景和具有挑战性的零样本临床环境中都得到了验证。结果表明,所提出的方法实现了最先进的空间精度,优于基于帧、基于视频的深度估计器和多视图3D重建基线。

英文摘要

Reliable monocular video depth estimation is crucial for downstream 3D reasoning and embodied AI in endoscopic navigation. However, existing self-supervised approaches typically treat video frames independently or rely on weak temporal regularization. These methods, lacking a holistic perception of the underlying 3D scene, inevitably suffer from geometrically inconsistent predictions and severe cross-frame drift. To address these limitations, we introduce a new paradigm that recasts sequential video depth estimation as an unconstrained multi-view 3D reconstruction problem, enabling full exploitation of the powerful geometric priors embedded in recent 3D foundation models. The core of our approach is a 3D consistency optimization framework driven by three constraints: image-level photometric rendering, explicit world-coordinate geometric alignment, and multi-scale temporal gradient consistency. Such unified optimization elegantly anchors isolated frames to a globally coherent 3D structure. Our method has been validated in both self-supervised training scenarios and challenging zero-shot clinical environments. Results show that the proposed approach achieves state-of-the-art spatial accuracy, outperforming frame-based and video-based depth estimators as well as multi-view 3D reconstruction baselines.

2606.15678 2026-06-16 cs.LG cs.AI 新提交

The Reservoir Attention Network: Cross-Pass State in Pretrained Transformers via Content-Addressable Reservoir Injection

储层注意力网络:通过内容可寻址储层注入在预训练Transformer中的跨前向传播状态

Emma Leonhart

发表机构 * Emma Leonhart

AI总结 提出储层注意力网络(RAN),通过在预训练Transformer中间层注入固定随机储层来携带跨前向传播状态,实验表明未训练的循环动态足以传递可用状态。

Comments 29 pages, 14 figures

详情
AI中文摘要

本文对储层注意力网络(RAN)进行了可行性和动力学研究,该架构将一个固定的、随机初始化的储层注入到预训练Transformer的中间层注意力中,以在跨前向传播时携带状态。实验涵盖从GPT-2(124M、355M)到Qwen2.5(0.5B、1.5B)的模型,均在单个消费级GPU上运行。任务被选为最小探针,以隔离单个机制;更广泛的“始终活跃的智能体”愿景在整个过程中被视为受计算限制的未来工作,而非本文的主张。储层被设计为未训练的(固定随机):这隔离了未训练的循环动态本身是否足以携带可用的跨前向传播状态,而将训练的循环作为互补的、更昂贵的方向。

英文摘要

A feasibility and dynamics study of the Reservoir Attention Network (RAN), an architecture that injects a fixed, randomly-initialized reservoir into the mid-layer attention of a pretrained transformer to carry state across forward passes. Experiments span GPT-2 (124M, 355M) to Qwen2.5 (0.5B, 1.5B) on a single consumer GPU. The tasks are minimal probes chosen to isolate individual mechanisms; the broader always-alive agent vision is treated throughout as compute-limited future work, not a claim of this paper. The reservoir is left untrained (fixed random) by design: this isolates whether untrained recurrent dynamics alone suffice to carry usable cross-pass state, leaving trained recurrence as a complementary, more expensive direction.
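A fixed, randomly initialized reservoir that carries state between calls can be sketched in the style of an echo state network. This shows only the recurrent state update; how RAN injects the state into mid-layer attention is not specified in the abstract, and the class name, dimensions, and spectral-radius scaling are assumptions.

```python
import numpy as np


class Reservoir:
    """Untrained recurrent state: weights are drawn once and never updated."""

    def __init__(self, input_dim, state_dim, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.standard_normal((state_dim, state_dim))
        # Rescale so the largest eigenvalue magnitude equals spectral_radius.
        w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
        self.w = w
        self.w_in = rng.standard_normal((state_dim, input_dim)) / np.sqrt(input_dim)
        self.state = np.zeros(state_dim)

    def step(self, x):
        # The new state depends on the input and on the state left by the last call.
        self.state = np.tanh(self.w @ self.state + self.w_in @ x)
        return self.state


res = Reservoir(input_dim=4, state_dim=16)
w_before = res.w.copy()
x = np.ones(4)
s1 = res.step(x)  # first "forward pass"
s2 = res.step(x)  # same input, different result because state persisted
print(np.allclose(s1, s2))
```

Feeding the same input twice yields different states, which is the sense in which an untrained reservoir can carry information across passes.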

2606.15673 2026-06-16 cs.AI cs.LG 新提交

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

哪里出错了?基于语义状态追踪的Web智能体过程级评估

Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim

发表机构 * Yonsei University(延世大学) Microsoft Research(微软研究院)

AI总结 提出WebStep基准,通过语义MDP追踪过程状态,揭示隐藏于终端成功率下的智能体差异,并定位具体改进方向。

详情
AI中文摘要

Web智能体通过长交互序列执行任务,然而现有基准仅评估终端成功,丢弃所有过程信息,对改进提供的指导有限。在这项工作中,我们对Web智能体进行了过程级分析。我们引入了WebStep,一个包含1800个任务实例的基准,具有可控难度和自动语义状态追踪。每个网站除了GUI外还暴露一个确定性的语义MDP:智能体在界面上操作,而环境在后台记录高级状态和转换,从而实现无需人工标注的细粒度分析。基于语义轨迹,我们首先表明过程度量揭示了结果评估无法察觉的差异:三个成功率集中在31-33%的智能体在探索范围与执行准确性上存在分歧。然后,按技能分解刻画了这些差异的本质,揭示了同一网站内隐藏的相反技能排名:例如,在Housing上,OpenAI CUA在提交动作上优于Qwen3.5 23.7%,但在过滤上却落后15.6%,精确指出了即使在单个领域内也需要改进的具体技能。分叉分析进一步定位了导致任务失败的决定性错误,并表明该错误是智能体特定的而非共享的。最后,随着任务难度增加,这些差异扩大:在简单任务上成功率相似,但随着探索要求提高而急剧分化。我们的过程级分析为Web智能体评估开辟了新途径,提供了关于每个智能体应在何处以及如何改进的细粒度且可操作的见解。

英文摘要

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.

2606.15669 2026-06-16 cs.LG cs.AI 新提交

Z-Plane Neural Networks: Bounded Geometric Activation Replaces ReLU and LayerNorm

Z平面神经网络:有界几何激活替代ReLU和LayerNorm

Sungwoo Goo, Hwi-yeol Yun, Sangkeun Jung

发表机构 * College of Pharmacy, Chungnam National University(忠南大学药学院) Department of Computer Science & Engineering, Chungnam National University(忠南大学计算机科学与工程系)

AI总结 提出Z平面神经网络,通过有界几何激活函数Radial Bounding将隐藏状态映射到超球面上的2D相量束,在保持方向信息的同时限制能量幅度,理论证明其保持1-Lipschitz连续性并防止梯度消失,实验表明100层无ReLU和LayerNorm的MLP在MNIST上稳定收敛。

详情
AI中文摘要

现代深度神经网络依赖欧几里得标量激活(如ReLU)和全局归一化技术(如LayerNorm)来防止深层架构中的梯度不稳定。然而,这些机制固有地导致神经元死亡、丢弃关键方向信息并破坏特征表示的正交性。受生物轴突频率调制传输的启发,我们提出了Z平面神经网络,将隐藏状态映射到超球面上的2D相量束。我们引入了一种新颖的几何激活函数Radial Bounding($\mathbf{x} / \max(1, \|\mathbf{x}\|_2)$),它在保持相位(方向)的同时限制能量幅度。我们从数学上证明,这种各向同性激活保持了1-Lipschitz连续性,并通过保留切向梯度防止梯度消失。实验上,一个完全不含ReLU和LayerNorm的100层Z平面多层感知机(MLP)在MNIST数据集上成功收敛,准确率达到98.34%,且具有绝对数值稳定性,证明仅靠有界几何激活就足以实现稳定的深度学习。

英文摘要

Modern deep neural networks rely on Euclidean scalar activations (e.g., ReLU) and global normalization techniques (e.g., LayerNorm) to prevent gradient instability in deep architectures. However, these mechanisms inherently cause dead neurons, discard critical directional information, and destroy the orthogonality of feature representations. Inspired by the frequency-modulation transmission of biological axons, we propose the Z-Plane Neural Network, which maps hidden states into 2D phasor bundles on a hypersphere. We introduce a novel geometric activation function, Radial Bounding ($\mathbf{x} / \max(1, \|\mathbf{x}\|_2)$), which limits the energy magnitude while preserving the phase (direction). We demonstrate mathematically that this isotropic activation maintains 1-Lipschitz continuity and prevents gradient vanishing by preserving tangential gradients. Empirically, a 100-layer Z-Plane Multi-Layer Perceptron (MLP), entirely devoid of ReLU and LayerNorm, successfully converges on the MNIST dataset with 98.34% accuracy and absolute numerical stability, proving that bounded geometric activation alone is sufficient for stable deep learning.
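The Radial Bounding formula quoted in the abstract, $\mathbf{x} / \max(1, \|\mathbf{x}\|_2)$, is the projection of a vector onto the closed unit ball: vectors inside the ball pass through unchanged, and longer vectors are rescaled to unit length with their direction preserved. A minimal sketch on plain Python lists follows; applying it to a single flat vector (rather than to the paper's 2D phasor bundles) is a simplifying assumption.

```python
import math


def radial_bounding(x):
    """Return x / max(1, ||x||_2): identity inside the unit ball, rescale outside."""
    norm = math.sqrt(sum(v * v for v in x))
    scale = max(1.0, norm)
    return [v / scale for v in x]


print(radial_bounding([3.0, 4.0]))  # norm 5, rescaled onto the unit sphere
print(radial_bounding([0.3, 0.4]))  # norm 0.5, returned unchanged
```

Since projection onto a convex set never increases distances, this view also explains the 1-Lipschitz property claimed in the abstract.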

2606.15667 2026-06-16 cs.CV 新提交

CEVAR: Centerline Embedding Extraction for Endovascular Aneurysm Repair

CEVAR:用于血管内动脉瘤修复的中心线嵌入提取

Roman Naeem, Timo Niiniskorpi, Charlotte Sandström, Naman Desai, Anders Jeppsson, Ida Häggström, Fredrik Kahl, Håkan Roos, Jennifer Alvén

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) The University of Gothenburg(哥德堡大学) Sahlgrenska University Hospital(萨尔格伦斯卡大学医院)

AI总结 针对EVAR术后密封区失效导致的破裂问题,提出一种结合3D中心线追踪与嵌入几何预测的Transformer框架,实现自动密封区评估,在常规及无对比剂CT上均优于半自动方法。

Comments Submitted Version. Accepted at MICCAI 2026

详情
AI中文摘要

由于支架移植物密封区密封失效导致EVAR术后破裂,血管内动脉瘤修复(EVAR)后的长期死亡率仍然很高。使用中心线测量的结构化CT审查可改善检测,但当前工作流程需要手动中心线编辑和专家操作。我们提出了一种用于自动化、协议驱动的密封区评估的Transformer框架,该框架将3D中心线追踪与基于嵌入的几何预测相结合。评估了两种最先进的图像到图模型,用于从随访CT中提取主动脉-髂动脉中心线,并根据EVAR4C协议测量支架位置、血管直径和密封长度。在整个测试集和具有挑战性的无对比剂子集上,所提出的全自动方法优于商业半自动工作流程。

英文摘要

Long-term mortality rates after endovascular aneurysm repair (EVAR) remain elevated due to post-EVAR rupture caused by loss of seal in stent graft sealing zones. Structured CT review using centerline measurements improves detection, but current workflows require manual centerline editing and expert operators. We propose a transformer framework for automated, protocol-driven sealing zone assessment that combines 3D centerline tracking with embedding-based geometric prediction. Two state-of-the-art image-to-graph models are evaluated for aorto-iliac centerline extraction from follow-up CT and for measurement of stent position, vessel diameters, and seal lengths according to the EVAR4C protocol. Across the full test set and a challenging no-contrast subset, the proposed fully automatic method outperforms the commercial semi-automatic workflow.

2606.15659 2026-06-16 cs.CV 新提交

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

SpatialAvatar-0: 高质量4D头部虚拟形象的多阶段重建

Yiran Wang, Zeyu Zhang, Yuanming Li, Ziming Wang, Yang Zhao

发表机构 * USYD(悉尼大学) SpatialReal ZJU(浙江大学) La Trobe(拉筹伯大学)

AI总结 提出基于FLAME网格绑定高斯表示的多阶段框架,通过前馈生成器和10K迭代布局保持微调,实现跨域零样本和单目基准领先性能。

详情
AI中文摘要

高质量4D头部虚拟形象(来自一张或少量源肖像)是远程呈现、AR/VR和数字人交互的核心。3D高斯泼溅(3DGS)已成为主导表示,两个互补范式(可泛化的前馈预测器和逐主体精炼器)并行成熟。然而,现有前馈预测器在单一数据集族上训练,具有硬编码的源数量,继承了相应的领域偏差。逐主体精炼器需要30万至60万次迭代,并依赖自适应致密化,这会破坏上游高斯布局,导致两个范式无法端到端共享表示。为桥接两个范式,我们提出SpatialAvatar-0,基于共享的FLAME网格绑定高斯表示:一个前馈生成器,具有无参数的K源均值池化,以及一个从单目时序到多视角空间的两阶段调度,防止身份先验在小多视角集上坍缩。我们进一步引入一个10K迭代的布局保持逐主体精炼循环,冻结FLAME绑定和高斯数量,并用三分量抗尖峰正则化替代致密化。在VFHQ/HDTF跨域零样本上,我们超越域内领先者GAGAvatar +1.5 dB PSNR,尽管从未在任一测试域上训练;在SplattingAvatar单目基准上,我们领先所有报告指标,超越30万次迭代的GeoAvatar +1.3 dB PSNR,且逐主体调度比常见SOTA基线短至60倍。网站:https://spatialwalk.github.io/SpatialAvatar-0。

英文摘要

High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K to 600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes, we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. In cross-domain zero-shot evaluation on VFHQ/HDTF, we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR with a per-subject schedule up to 60x shorter than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.

2606.15656 2026-06-16 cs.AI 新提交

Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs

克服阻抗不匹配:融合基础模型与知识图谱的理论路线图

Sahil Rajesh Dhayalkar

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 本文提出“阻抗不匹配”概念,形式化分析基础模型与知识图谱的结构与几何摩擦,通过三级层次分类揭示现有方法的局限,并给出理论路线图实现真正的语义融合。

Comments 12 pages. Accepted at the ACL 2026 4th Workshop on Towards Knowledgeable Foundation Models (https://openreview.net/forum?id=hXDYsNAq8m)

详情
AI中文摘要

现代人工智能仍然从根本上分裂于基础模型的连续概率空间和知识图谱的离散确定性结构之间。虽然检索增强生成(RAG)试图通过将图数据序列化为文本来连接它们,但我们认为这种词汇桥接仅仅是表面的补丁。在本文中,我们将底层的结构和几何摩擦形式化为“阻抗不匹配”。通过将当前的神经符号集成策略分类为三级层次,我们证明无论是表面级别的提示注入还是连续表示对齐,都无法保留可靠的多跳推理所需的严格逻辑模式。我们定义了具体的数学极限,如词汇瓶颈和拓扑坍缩,表明当前架构最终会产生幻觉或混淆语义节点。为了实现真正的语义融合,我们提出了一个严格的理论路线图。我们主张通过结构化残差流原生内化离散符号结构,利用向量符号架构进行潜在子图注入,并通过正交子空间编辑执行模型更新。这个可操作的框架为无缝融合符号逻辑的精确性和参数化记忆的表达能力的模型铺平了道路。

英文摘要

Modern artificial intelligence remains fundamentally divided between the continuous, probabilistic spaces of Foundation Models and the discrete, deterministic structures of Knowledge Graphs. While Retrieval-Augmented Generation (RAG) attempts to connect them by serializing graph data into text, we argue this lexical bridging is merely a superficial patch. In this paper, we formalize the underlying structural and geometric friction as the Impedance Mismatch. By categorizing current neuro-symbolic integration strategies into a three-tiered hierarchy, we demonstrate that neither surface-level prompt injection nor continuous representation alignment can preserve the strict logical motifs required for reliable multi-hop reasoning. We define the specific mathematical limits, such as the Lexical Bottleneck and Topological Collapse, that show current architectures will eventually hallucinate or conflate semantic nodes. To achieve true semantic fusion, we propose a rigorous theoretical roadmap. We advocate for natively internalizing discrete symbolic structures through Structured Residual Streams, utilizing Vector Symbolic Architectures for latent sub-graph injection, and performing model updates via Orthogonal Subspace Editing. This actionable framework paves the way for models that seamlessly fuse the precision of symbolic logic with the expressivity of parametric memory.

2606.15655 2026-06-16 cs.AI 新提交

Advanced Machine Learning and Deep Learning Techniques for Enhanced Cattle Identification and Detection: A Comprehensive Review

用于增强牛只识别与检测的先进机器学习和深度学习技术:一项全面综述

Fayazunnesa Chowdhury, Syed Md. Galib, Md Nasim Adnan, Md. Moradul Siddique, Md Robiul Karim, K M Tanvir Anjum

发表机构 * Jashore University of Science and Technology(贾沙雷科学与技术大学) University of Information Technology & Sciences (UITS)(信息科技与科学大学) Gazipur Agricultural University(加济布尔农业大学) Shanto Mariam University of Creative Technology(沙托·马里姆创意技术大学)

AI总结 本文系统综述了利用机器学习和深度学习技术进行牛只识别的研究,比较了传统方法(如K近邻、支持向量机)与深度学习方法(如CNN、ResNet、YOLO)的效果,指出深度学习方法在识别和检测任务中更优,并讨论了数据集有限、数据质量问题和实时处理需求等挑战。

Comments Published in Annals of Emerging Technologies in Computing (AETiC), 34 pages, 5 figures. The article is available at http://aetic.theiaer.org/archive/v10/v10n2/p1.html

Journal ref Annals of Emerging Technologies in Computing (AETiC), Vol. 10, No. 2, 2026

详情
AI中文摘要

在畜牧管理中,维护生物安全、食品安全和供应链效率的需求使得有效的牛只识别技术比以往任何时候都更加迫切。本文对使用机器学习和深度学习技术的牛只识别研究进行了系统综述。本系统综述通过主要学术数据库的研究评估了传统和现代牛只识别技术的有效性,并对文章进行了全文审查。在这些技术中,经典机器学习技术如K近邻和支持向量机在牛只识别中表现出良好效果;然而,深度学习技术如卷积神经网络、残差网络和You Only Look Once在认知、检测和识别任务中表现更优。特征提取依赖于常见技术如局部二值模式(LBP)、加速稳健特征(SURF)和尺度不变特征变换(SIFT),而这些研究中常用的关键特征包括鼻纹和皮毛图案。综述强调了牛只识别中的主要障碍,例如公开可用的数据集数量有限、易受环境变化和动物移动影响的数据质量问题,以及对实时处理能力的高需求。本文旨在为研究人员、政策制定者和利益相关者提供关于实施可扩展、人道且有效的牛只识别系统以实现可持续畜牧管理的信息。

英文摘要

The need for effective cattle identification technology is now more acutely felt than ever in maintaining biosecurity, food safety, and supply chain efficacy in livestock management. This paper presents a systematic review of recent research in cattle identification using machine learning and deep learning techniques. The present systematic review measures the effectiveness of traditional and modern cattle identification techniques using studies from major academic databases, where articles were subjected to full-text review. Among these techniques, classical Machine Learning Techniques such as K-Nearest Neighbors and Support Vector Machines have demonstrated good results in cattle identification; however, Deep Learning Techniques, such as Convolutional Neural Networks, Residual Networks, and You Only Look Once, are better in cognition, detection, and identification tasks. Feature extraction relies on common techniques like Local Binary Pattern (LBP), Speeded-Up Robust Features (SURF), and Scale-Invariant Feature Transform (SIFT), while key features commonly used in these studies include muzzle prints and coat patterns. The review highlights key hurdles involving cattle identification, such as the limited number of publicly accessible datasets, issues with data quality susceptible to environmental changes and animal mobility, and high demand for real-time processing ability. The paper aims to inform researchers, policymakers, and stakeholders about implementing scalable, humane, and effective cattle identification systems to achieve sustainable livestock management.

2606.15654 2026-06-16 cs.RO cs.AI 新提交

PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty

PO-PDDL: 从视觉演示中学习符号化POMDP以实现不确定性下的机器人规划

Wenjing Tang, Xuanjin Jin, Yuan Liu, Renming Huang, Cewu Lu, Panpan Cai

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出PO-PDDL符号化POMDP框架,通过从机器人执行视频中重建潜在状态轨迹、识别部分可观测性并学习随机转移与观测模型,实现不确定性下的鲁棒任务规划。

详情
AI中文摘要

现实世界的机器人任务规划必须在随机动作执行和部分可观测性下进行,然而为真实机器人领域构建部分可观测马尔可夫决策过程(POMDP)模型仍然困难且劳动密集。我们引入了PO-PDDL,一种POMDP的符号化表述,它保留了规划领域定义语言(PDDL)的关系结构和LLM友好的语法,同时显式建模了部分可观测性、随机性和信念。基于此表述,我们提出了一种用于学习PO-PDDL模型的演示驱动流程。该方法从真实机器人执行视频中重建潜在符号状态轨迹,通过推断状态与视觉观测之间的不一致性识别部分可观测性,并相应地学习随机转移和观测模型。得到的PO-PDDL领域可跨任务重用,并在感知和执行不确定性下实现在线信念空间规划。在真实世界长时域操作任务上的实验表明,我们的方法持续优于现有的PDDL和POMDP模型学习方法,以显著更低的规划成本实现了不确定性下的鲁棒任务规划。

英文摘要

Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.
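Belief-space planning over a symbolic POMDP rests on the standard Bayes-filter belief update: predict with the stochastic transition model, then reweight by the observation likelihood. The sketch below uses a tiny two-state domain. The state, action, and observation names and all probabilities are hypothetical, and the observation model is simplified to depend only on the resulting state; none of this is PO-PDDL syntax.

```python
def belief_update(belief, transition, observation_model, action, observation):
    """b'(s') is proportional to O(o | s') * sum_s T(s' | s, a) * b(s)."""
    # Prediction step: push the belief through the stochastic transition model.
    predicted = {}
    for s, p in belief.items():
        for s_next, t in transition[(s, action)].items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * t
    # Correction step: weight each state by the likelihood of the observation.
    unnormalized = {
        s: observation_model[s].get(observation, 0.0) * p
        for s, p in predicted.items()
    }
    z = sum(unnormalized.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / z for s, p in unnormalized.items()}


# Hypothetical domain: is the cup in the drawer or on the table?
transition = {
    ("in_drawer", "fetch"): {"on_table": 0.8, "in_drawer": 0.2},  # fetch may fail
    ("on_table", "fetch"): {"on_table": 1.0},
}
observation_model = {
    "in_drawer": {"see_cup": 0.1, "no_cup": 0.9},
    "on_table": {"see_cup": 0.9, "no_cup": 0.1},
}
prior = {"in_drawer": 0.5, "on_table": 0.5}
posterior = belief_update(prior, transition, observation_model, "fetch", "see_cup")
print(posterior)
```

After the prediction step the belief in `on_table` is 0.9, and seeing the cup raises it to 0.81 / 0.82.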

2606.15652 2026-06-16 cs.LG cs.CL 新提交

MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization

MosaicQuant: 基于内点-离点分离的统一4位LLM量化

Yangjia Hu, Haodong Wang, Zicong Hong, Qianli Liu, Quanxin Shou, Jian Lin, Song Guo, Xiaowei Shen, Xiangjun Huang, Dian Wang, Jian Yang

发表机构 * HKUST(香港科技大学) EPFL(瑞士联邦理工学院洛桑) MetaX Integrated Circuits Co., Ltd(MetaX集成电路有限公司)

AI总结 提出MosaicQuant,通过将权重矩阵量化为密集4位基分量和稀疏4位残差分量,结合ZipperEngine融合稀疏块计算,实现统一4位推理,在LLaMA3和Qwen3上保持近FP16精度并加速1.24倍。

Comments 17 pages

详情
AI中文摘要

4位量化显著减少了内存占用并加速了大语言模型(LLM)的推理。然而,其有限的位宽表示难以忠实捕捉密集的常见值(内点)和罕见的大幅度值(离点),导致显著的精度下降。现有的混合精度方法通过保留离点的高精度来缓解这一问题,但代价是破坏了低比特执行的统一性,引入了精度转换和额外的数据移动,削弱了实际加速效果。我们提出MosaicQuant,一种基于内点-离点分离新原理的统一4位LLM量化范式。MosaicQuant不提升离点精度,而是将整个权重矩阵量化为密集的4位基分量,其中内点被忠实捕捉,而离点不可避免地量化。然后引入一个稀疏的4位残差分量来补偿这些量化误差,选择性地针对输出失真最严重的误差关键权重块。然而,仅统一表示是不够的,因为将稀疏残差作为单独内核执行仍然会破坏统一的低比特推理流水线。为弥补这一差距,我们引入ZipperEngine,通过重叠流水线将稀疏块计算融合到密集4位GEMM内核中,不仅统一了表示,而且将执行统一为单个连贯的低比特推理流水线。在LLaMA3和Qwen3上的大量实验表明,MosaicQuant在保持接近FP16精度的同时,相比W16A16基线实现了高达1.24倍的加速。

英文摘要

4-bit quantization significantly reduces the memory footprint and accelerates the inference of large language models (LLMs). However, its limited bit-width representation struggles to faithfully capture both dense common values (inliers) and rare large-magnitude values (outliers), causing substantial accuracy degradation. Existing mixed-precision methods mitigate this by retaining outliers in high precision, but at the cost of breaking the uniformity of low-bit execution, introducing precision conversion and extra data movement that undermine practical speedup. We propose MosaicQuant, a unified 4-bit LLM quantization paradigm built on a novel principle of inlier-outlier disaggregation. Rather than elevating outlier precision, MosaicQuant quantizes the full weight matrix into a dense 4-bit base component, where inliers are captured faithfully while outliers are inevitably quantized. A sparse 4-bit residual component is then introduced to compensate for these quantization errors, selectively targeting the most error-critical weight blocks where output distortion is shown to be concentrated. However, a unified representation alone is insufficient, as naïvely executing the sparse residual as a separate kernel still breaks the unified low-bit inference pipeline. To bridge this gap, we introduce ZipperEngine, which fuses sparse block computation into the dense 4-bit GEMM kernel via an overlapped pipeline, unifying not only the representation but also the execution into a single coherent low-bit inference pipeline. Extensive experiments on LLaMA3 and Qwen3 demonstrate that MosaicQuant preserves near-FP16 accuracy while achieving up to 1.24x speedup over the W16A16 baseline.
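The dense-base-plus-sparse-residual idea can be sketched with NumPy: quantize the whole matrix to signed 4-bit integers with a scale fitted to the inliers, then quantize the residual of only the few blocks with the largest error, again to 4 bits. This is an illustrative simplification. It uses integer 4-bit levels, ranks blocks by weight-error energy instead of output distortion, and the function names, block size, and quantile are assumptions, not the paper's method or kernel.

```python
import numpy as np


def quantize_int4(x, scale):
    """Round to the signed 4-bit grid [-8, 7] at the given scale."""
    return np.clip(np.round(x / scale), -8, 7)


def mosaic_like_quantize(w, block=4, keep_blocks=2, inlier_quantile=0.9):
    # Dense base: the scale follows the inlier range, so outliers saturate.
    base_scale = np.quantile(np.abs(w), inlier_quantile) / 7.0
    base = quantize_int4(w, base_scale) * base_scale
    residual = w - base
    rows, cols = w.shape
    blocks = [(i, j) for i in range(0, rows, block) for j in range(0, cols, block)]
    # Rank blocks by residual energy and keep only the worst ones.
    blocks.sort(
        key=lambda ij: np.sum(residual[ij[0]:ij[0] + block, ij[1]:ij[1] + block] ** 2),
        reverse=True,
    )
    correction = np.zeros_like(w)
    for i, j in blocks[:keep_blocks]:
        r = residual[i:i + block, j:j + block]
        scale = np.abs(r).max() / 7.0
        if scale > 0:
            correction[i:i + block, j:j + block] = quantize_int4(r, scale) * scale
    return base, correction


rng = np.random.default_rng(0)
w = 0.02 * rng.standard_normal((8, 8))
w[1, 2], w[6, 5] = 1.5, -1.2  # two synthetic outliers
base, correction = mosaic_like_quantize(w)
err_base = np.linalg.norm(w - base)
err_full = np.linalg.norm(w - base - correction)
print(err_base, err_full)
```

Both components stay on a 4-bit grid, and the sparse correction removes most of the error that the saturated outliers leave in the dense base.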

2606.15651 2026-06-16 cs.CV New submission

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

自问式视觉语言模型:用于组合视觉推理的强化学习

Saraswathy Amjith

Affiliation * Massachusetts Institute of Technology

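AI summary Proposes a self-questioning framework that uses GRPO reinforcement learning to train a VLM to decompose questions and answer the resulting sub-questions on its own, improving compositional visual reasoning; effectiveness is validated on CLEVR and A-OKVQA.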

Details
AI abstract

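Vision-language models (VLMs) are AI systems that process images and text, but they often struggle with compositional visual reasoning questions that require chaining several steps, such as identifying objects, counting them, and comparing the results. Existing methods improve reasoning by training models on human-written step-by-step explanations, but these annotations are expensive to create and hard to scale. We propose a self-questioning framework that uses a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO) to train a VLM to decompose a visual question into smaller sub-questions and answer each one before generating the final answer. The model never sees examples of how to decompose a question; it discovers this behavior on its own through a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply the framework to a 3-billion-parameter model, training on synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% respectively, versus 46.8%). We introduce the first self-questioning VLM, which rewards not only the final answer, as standard RL does, but also the generation of intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, especially when the difficulty of a question calls for explicit step-by-step decomposition.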

English abstract

Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions; it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% vs. 46.8%). We introduce the first self-questioning VLM, rewarding not only the final answer, as standard RL does, but also the generation of intermediate sub-questions, which enables the model to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
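
The reward described in this abstract (credit for emitting sub-questions plus credit for a correct final answer) and the group-relative scoring that GRPO applies to a batch of sampled outputs can be sketched roughly as follows. The output markers `Sub-question` and `Final answer:`, the `format_weight` value, and the exact-match answer check are assumptions for illustration; the abstract does not specify the paper's actual format or reward weights.

```python
import re
from statistics import mean, pstdev
from typing import List

# Hypothetical output markers; the paper's real format is not given in the abstract.
SUBQ = re.compile(r"^Sub-question\s*\d*:", re.MULTILINE)
FINAL = re.compile(r"^Final answer:\s*(.+)$", re.MULTILINE)


def reward(output: str, gold: str, format_weight: float = 0.5) -> float:
    """Score one sampled output: sub-question presence plus answer correctness."""
    has_subq = 1.0 if SUBQ.search(output) else 0.0
    match = FINAL.search(output)
    correct = 1.0 if match and match.group(1).strip().lower() == gold.strip().lower() else 0.0
    return format_weight * has_subq + correct


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    samples = [
        "Sub-question 1: How many cubes are there?\nAnswer: 2\nFinal answer: 2",
        "Final answer: 2",
        "Final answer: 3",
    ]
    rewards = [reward(s, gold="2") for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the other samples for the same question, an output that both decomposes the question and answers correctly is pushed up relative to one that only answers correctly, which is how a decomposition habit could emerge without any demonstration data.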