arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.02607 2026-06-03 cs.LG cs.AI cs.CR

Geometry-Aware Tabular Diffusion

David Turtora Zagardo

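AI summary: Proposes Geometry-Aware Tabular Diffusion (GATD), which feeds pairwise angles and lengths of column value differences to the diffusion denoiser as both inputs and auxiliary targets in order to model inter-column relationships explicitly, reaching state-of-the-art performance on 10 datasets with fewer parameters.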

Comments Accepted to the ICML 2026 main track. 24 pages, 10 figures, 22 tables

Abstract

Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences and used as inputs and auxiliary targets. Our MLP instantiation achieves state-of-the-art benchmark performance while using 3.5x fewer parameters on average (up to 25x for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.

2606.02606 2026-06-03 cs.LG cs.AI

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

Yang Xu, Zihuai Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Xitong Fu

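Affiliations: School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China

AI summary: Addresses the problem that frequent base-model updates invalidate existing LoRA adapters by proposing ReLoRA, a framework that combines Bayesian-optimization-based initialization with fine-tuning under scheduled regularization to reuse knowledge and re-adapt quickly, cutting compute overhead while improving performance.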

Abstract

Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service providers managing numerous downstream model services, retraining each LoRA adapter from scratch for every updated base model is computationally prohibitive and delays service rollout. Meanwhile, the simpler alternative, i.e., naively applying the original LoRA adapter to the updated base model, often leads to degraded service quality due to adapter-backbone incompatibility. To address this problem, we propose ReLoRA, a knowledge-reusing re-adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services while preserving or improving task performance. Specifically, ReLoRA comprises two key optimization steps: 1) Adaptive LoRA initialization leverages Bayesian optimization to construct a compatibility-aware starting point by fusing information from both the previously deployed task adapter and the base model's evolution; 2) Fine-tuning with scheduled regularization first rapidly steers the adapter to a high-quality region via strong regularization, followed by relaxed regularization for task-specific refinement. This design enables rapid service-quality recovery with reduced re-adaptation overhead. Extensive experiments demonstrate that ReLoRA reduces time-to-readiness by up to 8.9x and improves accuracy by up to 4.6% compared to baselines.

2606.02605 2026-06-03 cs.LG cs.AI eess.IV

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer, Marcus Brugger, Daniel Rueckert, Philip Müller

Affiliations: Chair for AI in Healthcare and Medicine, Technical University of Munich and TUM University Hospital; Department of Computing, Imperial College London; Munich Center for Machine Learning (MCML), Munich, Germany; Department of Internal Medicine, TUM University Hospital

AI summary: Proposes StenCE, a pretraining framework that uses cross-modal contrastive learning to enable coronary stenosis risk stratification from ECG features, and is the first to achieve high performance in severe stenosis classification.

Abstract

Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive, time- and resource-intensive, and therefore only performed on patients with a high probability of disease based on symptoms and prior clinical tests. However, a subset of patients, especially those without symptoms, may remain undiagnosed. Detecting indications of stenosis from ECGs, which are fast, cheap, non-invasive, and thus routinely acquired even in asymptomatic patients, would support early diagnosis. However, as no reliable stenosis-specific signal has been identified in ECGs, they cannot currently be used for stenosis risk stratification. To address this, we introduce StenCE, a pretraining framework that allows stratification of patients based on features derived directly from ECGs. Evaluations across varying stenosis severity thresholds and additional ECG disease classification tasks demonstrate consistent performance improvements across different ECG encoders, outperforming previous work. The obtained models successfully detect signals for stenosis diagnosis in ECGs and are the first to achieve high performance in severe stenosis classification. The source code is available at https://github.com/NikolaCenic/ecg-stenosis-cls.

2606.02604 2026-06-03 cs.LG cs.AI

Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation

Karan Sehgal, Khawar Naveed Bhatti

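Affiliations: Kent Business School, University of Kent

AI summary: Addresses fragmented ESG data and the lack of auditability in conventional validation by proposing a framework that combines deterministic orchestration, temporal anomaly detection, imbalance-aware ensemble learning, and explainable governance, and builds a synthetic benchmark for reproducible validation.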

Comments 22 pages, 7 figures. Preprint

Abstract

ESG and climate risk data remain fragmented across heterogeneous Scope 1, Scope 2, and Scope 3 reporting environments, while conventional validation pipelines lack provenance-aware auditability, hidden drift detection, and reproducibility-oriented governance. This paper proposes a deterministic climate risk intelligence framework integrating single-source-of-truth orchestration, temporal anomaly detection, imbalance-aware ensemble learning, and explainability-oriented governance for auditable ESG validation. To support open reproducibility, we construct and release a synthetic ESG validation benchmark calibrated against publicly reported characteristics of the GHG Protocol, PCAF, and ISSB standards. The methodology incorporates temporal drift analysis, SMOTE-based rare-event optimization, ensemble learning, provenance-aware orchestration, and TreeSHAP-based interpretability for governance inspection and audit reconstruction. We evaluate the framework against statistical classifiers, anomaly detection methods, temporal forecasting baselines, and a threshold-based system using classification metrics (recall, F1, ROC AUC), calibration metrics (ECE, Brier score), and a governance-oriented audit trace completeness metric measuring the fraction of flagged anomalies for which a deterministic source-to-escalation provenance chain can be reconstructed. Results are reported as mean and standard deviation across stratified five-fold cross-validation with paired significance testing. The framework reframes ESG reporting toward deterministic climate risk governance infrastructure supporting reproducibility, explainability, and operational auditability.
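
The two calibration metrics named in the abstract have standard textbook definitions, sketched here for a binary classifier. This is an illustration only: the function names, the equal-width binning, and the default of 10 bins are assumptions, not details taken from the paper.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins (assumed binning scheme).

    Each bin contributes |mean predicted probability - observed positive rate|,
    weighted by the share of samples that fall into the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(mean_conf - frac_pos)
    return ece
```

A perfectly confident and correct classifier scores 0 on both metrics, while one that predicts 0.9 for samples that are all negative has a Brier score of 0.81 and an ECE of 0.9.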

2606.02603 2026-06-03 cs.CV cs.LG

COD10K-C: Benchmarking Robustness of Camouflaged Object Detection Under Natural Image Corruptions

Arafat Hossain Sayem

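Affiliations: CSE, Bangladesh University of Engineering and Technology

AI summary: Proposes the COD10K-C benchmark, with 8 corruption types and 5 severity levels, to evaluate camouflaged object detection models on corrupted images, and introduces RobustCODLite, a lightweight model that uses corruption augmentation, a frequency-prior branch, and an uncertainty-consistency loss to retain a high Dice score under corruption.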

Comments 7 pages, 1 figure

Abstract

Camouflaged object detection has improved substantially, but most standard benchmarks evaluate models only on clean images. This is not realistic because real cameras often capture blur, sensor noise, weather effects, and compression artifacts. We present COD10K-C, a corruption robustness benchmark based on COD10K. It includes 8 corruption types and 5 severity levels, giving 40 conditions and 81,040 evaluation pairs in total. We evaluate three popular camouflaged object detection models, SINet-v2, PFNet, and ZoomNet, as well as a lightweight model called RobustCODLite. All models show clear performance drops on corrupted images. Motion blur and Gaussian blur cause the largest drops, with SINet-v2 losing 18.5 Dice points under motion blur. Brightness and fog are less harmful. RobustCODLite uses corruption augmentation, a frequency-prior branch, and an uncertainty-consistency loss. It retains 92.3% of its clean Dice score under corruption, compared with 87.7% for SINet-v2, 84.8% for ZoomNet, and 84.1% for PFNet. On the hardest corruptions, RobustCODLite matches or outperforms models that perform better on clean data. We will release the COD10K-C GitHub repository to support future research in robust camouflaged object detection.
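
The benchmark arithmetic and the retention figures can be made concrete with a short sketch. The Dice formula is the standard one. The `retention` helper assumes "retains 92.3% of its clean Dice score" means corrupted Dice divided by clean Dice, and the per-condition count assumes the 81,040 pairs are split evenly across conditions; neither is spelled out in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A and B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def retention(clean_dice, corrupted_dice):
    """Percentage of the clean Dice score kept under corruption (assumed definition)."""
    return 100.0 * corrupted_dice / clean_dice


n_conditions = 8 * 5                             # corruption types x severity levels
pairs_per_condition = 81040 // n_conditions      # 2026, if pairs are split evenly
```

Under the even-split assumption, each of the 40 conditions covers 2,026 image-mask pairs.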

2606.02602 2026-06-03 cs.LG cs.CV

Graph Mamba Survival Analysis Based on Topology-Aware ordering

Yuanfang Chen, Peiqiang Yan, Yuntao Shou, Qian Zhao, Xiangyong Cao

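Affiliations: School of Mathematics and Statistics; West China Science and Technology Innovation Harbor; School of Computer Science and Technology

AI summary: Addresses Mamba's sensitivity to input order, and the limits its unidirectional architecture places on exploiting spatial structure, in WSI survival analysis by proposing TopoMamSurv, a Graph Mamba framework based on topology-aware ordering that combines the TAO strategy, a bidirectional Mamba module, and GCN integration for efficient long-range dependency modeling and bidirectional spatial context modeling.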

Abstract

In computational pathology, survival analysis on Whole Slide Images (WSIs) is crucial for patient prognosis assessment, but it faces multiple technical challenges. Although the Transformer captures long-range dependencies through its self-attention mechanism, its $O(N^2)$ time complexity causes a severe computational bottleneck on large-scale WSI graph structures. The Mamba model overcomes this bottleneck with linear complexity. However, owing to Mamba's high sensitivity to the order of input data, traditional node sorting methods in Graph Mamba, such as those based on node degree or subgraph size, fail to adequately account for the topological connectivity of graph data, which restricts the performance of Mamba's sequential modeling. Moreover, its unidirectional architecture cannot leverage the bidirectional spatial structure of images. To address these challenges, this paper proposes a novel Graph Mamba survival analysis framework based on topology-aware ordering (TopoMamSurv) to adapt to the sequential sensitivity of Mamba. Our visualization experiments further confirm that the nodes extracted through the topology-aware ordering (TAO) strategy indeed exhibit higher similarity. Furthermore, we design a bidirectional Mamba module and integrate a Graph Convolutional Network (GCN) to achieve bidirectional spatial context modeling of images, forming a hierarchical "local aggregation, global capture" feature learning architecture. Through its systematic design of TAO, bidirectional semantic modeling, and hierarchical feature fusion, the framework reconciles the competing demands of long-range dependency modeling, computational efficiency, and spatial structure utilization in WSI analysis. Its comprehensive performance advantage is validated on five TCGA datasets.

2606.02601 2026-06-03 cs.LG

Testing the Test: Score-Direction Instability in Class-Split Anomaly Detection

Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado

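AI summary: Introduces neighborhood class leakage, a diagnostic showing that class-split anomaly detection protocols have an unstable score direction when the anomaly class overlaps the normal mixture, and recommends treating such benchmarks as geometry-dependent stress tests.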

Comments 4+1 pages, 1 figure, accepted at ICML 2026 Workshop on Hypothesis Testing

Abstract

Within-dataset class-split evaluation is widely used as a proxy for fully unconditional out-of-distribution anomaly detection. We show that this protocol can become ill-posed when the held-out anomaly class overlaps the normal mixture in representation space. In this regime, anomaly scores may collapse toward chance or even invert, and the preferred score direction can depend on the unknown anomaly class. We introduce a simple training-free diagnostic, neighborhood class leakage, and show that it predicts score-direction instability across Fashion-MNIST, CIFAR-10, and Imagenette, in both pixel and VAE latent spaces. Our results suggest that class-split AD benchmarks should be treated as geometry-dependent stress tests rather than unconditional evidence of anomaly-detection ability.

2606.02598 2026-06-03 cs.LG cs.HC

Assessing Region-Level EEG Contributions to Cognitive Workload Prediction

Jacob Wong, Sohan Singh, Prannaya Gupta, Jin Xing Ang, Kritika Johari, U-Xuan Tan

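Affiliations: School of InfoComm Technology, Ngee Ann Polytechnic; Engineering Product Development Pillar, Singapore University of Technology and Design; NUS High School of Math and Science

AI summary: Proposes a region-level evaluation framework and, through a large-scale analysis across four public datasets, finds that frontal electrode groups outperform the full-scalp baseline under both mixed-subject and subject-independent evaluation, with fronto-central regions the most stable predictors, supporting the design of efficient and generalizable EEG workload monitoring systems.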

Comments Accepted to EMBC 2026

Abstract

Accurate and generalizable estimation of cognitive workload from electroencephalography (EEG) is critical for human-centered and safety-critical systems. Although EEG is widely used for workload assessment, the consistency of region-level EEG contributions across tasks, datasets, and subjects remains unclear. This paper presents a region-level evaluation framework for EEG-based workload prediction in which models are trained and evaluated using features extracted exclusively from electrodes belonging to anatomically defined scalp regions. We perform a large-scale analysis across four publicly available EEG workload datasets spanning diverse task demands, recording hardware, and electrode montages. Region importance is quantified using a model-agnostic, performance-based approach under both mixed-subject and subject-independent evaluation protocols, with results aggregated using a rank-based strategy to ensure robustness across experimental configurations. Across all datasets and subject-independent evaluations, frontal electrode groups outperform the full-scalp baseline by approximately 15-20% in relative rank position while using substantially fewer electrodes. Fronto-central regions exhibit the most stable predictive utility, whereas posterior and occipital regions contribute less consistently across experimental conditions. These findings indicate that workload-relevant EEG information is most consistently retained within frontal and fronto-central electrode groups, supporting the design of efficient and generalizable EEG-based workload monitoring systems.

2606.02597 2026-06-03 cs.LG cs.CR

Making Brain-Computer Interfaces More Secure

Md Fahimul Kabir Chowdhury, Gahangir Hossain

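Affiliations: University of California, Berkeley; Stanford University

AI summary: Addresses the vulnerability of EEG-based brain-computer interfaces (BCIs) to adversarial attacks by proposing a lightweight convolutional neural network (CNN) architecture that classifies more robustly under gradient-based attacks than EEGNet, DeepConvNet, and SleepEEGNet.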

Comments Accepted and presented at IEEE World AI IoT Congress 2026

Abstract

The development of brain-computer interfaces (BCIs) based on electroencephalograms (EEGs) has advanced significantly, mainly owing to machine learning. Although most earlier research has focused on increasing classification accuracy, relatively little attention has been paid to security and robustness. According to recent research, EEG-based BCIs are susceptible to adversarial attacks, in which minute, well-crafted perturbations can cause misdiagnosis. Evaluating model robustness against such perturbations is therefore critical for ensuring reliable deployment. In this study, we propose a lightweight custom Convolutional Neural Network (CNN) architecture to investigate adversarial robustness in EEG-based BCIs. The proposed method is assessed using two EEG datasets and compared with three novel CNN models tailored to EEG, namely EEGNet, DeepConvNet, and SleepEEGNet, under gradient-based adversarial attack scenarios. According to experimental findings, the proposed model consistently achieves better classification performance under adversarial perturbations than the baseline models, indicating improved robustness. These findings highlight the potential of lightweight architectures for enhancing the reliability of EEG-based BCI systems under adversarial conditions.
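
The abstract does not name the gradient-based attacks it uses. As a hedged illustration of the general idea, the sketch below applies the fast gradient sign method (FGSM), a common gradient-based attack, to a toy logistic-regression classifier rather than an EEG CNN. The weights, input, and `eps` are hypothetical values chosen only for the demonstration.

```python
import math


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w, b, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(dL/dx) for cross-entropy loss L.

    For logistic regression, dL/dx = (p - y) * w, so no autodiff is needed.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]


w, b = [2.0, -1.0], 0.0            # hypothetical trained weights
x, y = [1.0, 0.5], 1               # one input whose true label is 1
x_adv = fgsm(w, b, x, y, eps=0.3)  # each feature moves by at most eps
```

Each feature shifts by at most `eps`, yet the model's confidence in the true class drops. Robustness evaluations of this kind measure how much accuracy survives such bounded perturbations.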

2606.02596 2026-06-03 cs.LG

Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent

Anherutowa Calvo

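Affiliations: 9D Labs

AI summary: Proves via the Spectral Alignment Decomposition that the curvature exponent satisfies α = 2 + d log Φ_k / d log σ_k, giving a geometric explanation for why the exponent varies across layer types, and derives the spectral transfer identity s = αγ, which predicts the Hessian decay exponent to about 2% median error with no free parameters.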

Comments 13 pages, 6 figures, 3 tables. Code and data: https://github.com/9D-Labs/9d-spectral-alignment-decomposition

Abstract

The curvature exponent $\alpha$ in $h_k \propto \sigma_k^{\alpha}$ -- governing how Hessian eigenvalues scale with gradient singular values -- varies systematically across layer types ($\alpha \approx 2$ for convolutions, $\approx 1$ for transformer attention, $< 1$ for MLP up-projections). Why? We prove the Spectral Alignment Decomposition: $\alpha = 2 + d\log\Phi_k / d\log\sigma_k$, where $\Phi_k$ measures alignment between Kronecker factor eigenbases and gradient singular directions. This reduces "why does $\alpha$ vary?" to a geometric question we answer for LayerNorm, residual connections, and softmax heads. The decomposition implies a spectral transfer identity $s = \alpha\gamma$ linking curvature exponent, effective gradient rank-decay $\gamma$, and Hessian decay exponent $s$. The identity is algebraic; its empirical content is that $\alpha$ and $\gamma$, fit on independent data (HVPs vs. SVD), recover $s$ to ~2% median error across 93 layers, five architectures, and three datasets -- with no free parameters. A zeta-function bound on participation ratio shows curvature concentrates onto effectively one direction per layer. As a proof of concept, we derive the architecture-adaptive preconditioner $T(\sigma;\alpha)$ and show that Spectral Newton -- implementing $T$ in the gradient singular basis -- outperforms AdamW on vision benchmarks where $\alpha \approx 2$.
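
The algebra behind the two identities can be sketched as follows. This rests on assumptions the abstract does not state explicitly: that $\gamma$ and $s$ are power-law exponents in the index $k$, and that the decomposition arises from a factorisation of the form $h_k = \Phi_k \sigma_k^2$.

```latex
% Assumption: h_k = \Phi_k \, \sigma_k^{2} (one reading consistent with the abstract).
\alpha \;:=\; \frac{d\log h_k}{d\log\sigma_k}
       \;=\; \frac{d\log\!\left(\Phi_k\,\sigma_k^{2}\right)}{d\log\sigma_k}
       \;=\; 2 + \frac{d\log\Phi_k}{d\log\sigma_k}

% Assumption: \sigma_k \propto k^{-\gamma} and h_k \propto k^{-s} (power laws in k).
h_k \;\propto\; \sigma_k^{\alpha}
    \;\propto\; \left(k^{-\gamma}\right)^{\alpha}
    \;=\; k^{-\alpha\gamma}
\quad\Longrightarrow\quad
s = \alpha\gamma
```

Under these assumptions the identity $s = \alpha\gamma$ is purely algebraic, which matches the abstract's remark that the empirical content lies in fitting $\alpha$ and $\gamma$ on independent data.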

2606.02595 2026-06-03 cs.LG

Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning

Oleg Miroshnichenko

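Affiliations: Oleg Miroshnichenko

AI summary: Addresses sparse feedback and high decision risk in short-term rental dynamic pricing by proposing a human-in-the-loop gated bandit framework, showing that historical pricing data is structurally equivalent to on-policy warm-up data, and designing a regularised ridge-regression warm-up that compresses the cold-start period from about 150 episodes to about 30.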

Abstract

Dynamic pricing in short-term rental (STR) markets presents a distinctive challenge for online learning algorithms: pricing decisions carry significant financial risk, operators require explainability, and market feedback is sparse (one booking outcome per listed night). We introduce the Human-in-the-Loop Gated Bandit (HITL-GB) framework, in which a contextual bandit algorithm generates price recommendations but a human agent retains authority to accept, modify, or reject each recommendation before it is applied. We show that under this approval constraint, historical pricing data -- collected under a prior deterministic policy -- is structurally equivalent to on-policy warm-up data for initialising the bandit's posterior, bypassing the weeks-to-months cold-start period that renders pure online bandit learning impractical in sparse-feedback markets. We formalise the approval-gated reward signal, derive a regularised ridge-regression warm-up procedure from historical episodes, and validate the approach on real STR production data (anonymised urban market, 2 rooms, April 2022 -- April 2026, 1,461 nightly pricing episodes). Our warm-up procedure compresses effective cold-start from ~150 episodes to ~30 episodes when initialising agents from the Hierarchical Factored Thompson Sampling (HF-TS) family. We further argue that the structural equivalence result is domain-agnostic: any high-stakes domain where human approval is legally or operationally required -- including clinical drug dosing, credit origination, content moderation, and radiological diagnosis -- satisfies the same conditions and benefits from the same warm-up strategy. In regulated industries, mandatory human oversight is thus a statistical asset rather than a deployment constraint.
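
A ridge-regression warm-up of a linear bandit posterior can be sketched with the standard construction used by linear Thompson sampling: precision matrix $A = \lambda I + X^\top X$, vector $b = X^\top y$, and posterior mean $\mu = A^{-1} b$. This is a generic illustration, not the paper's HF-TS procedure, whose details the abstract does not give. The function names, the feature layout, and `lam` are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (on copies)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_warm_start(contexts, rewards, lam=1.0):
    """Initialise a linear bandit posterior from historical (context, reward) episodes.

    Returns (A, b, mu): A = lam*I + X^T X, b = X^T y, mu = A^{-1} b.
    """
    d = len(contexts[0])
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, r in zip(contexts, rewards):
        for i in range(d):
            b[i] += r * x[i]
            for j in range(d):
                A[i][j] += x[i] * x[j]
    return A, b, solve(A, b)


# Two hypothetical historical episodes with 2-dimensional context features.
A, b, mu = ridge_warm_start([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0], lam=1.0)
```

After the warm-up, live episodes would keep updating `A` and `b` with the same rank-one rule, so historical and approved live data enter the posterior in the same way.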

2606.02482 2026-06-03 cs.CV

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue

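Affiliations: MMLab, Chinese University of Hong Kong; Huawei Inc.

AI summary: To fill the gap in evaluating multi-stream video understanding, introduces X-Stream, the first such benchmark, with 4,220 QA pairs over 932 videos covering multi-window, multi-view, and multi-device scenarios; evaluates MLLMs as multiplexers through the lens of signal multiplexing theory and finds that existing models score only about 50% on concurrent streams.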

Comments Project Page: https://peiwensun2000.github.io/xstream/

Abstract

While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, achieving only about 50% score and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-off of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.

2606.02461 2026-06-03 cs.AI cs.CL

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su

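Affiliations: The Ohio State University; Johns Hopkins University; Intuit AI Research

AI summary: Proposes AgentCL, an evaluation framework that uses controlled task streams and transfer-gain metrics to rigorously evaluate continual learning in language agents, and develops MemProbe, a probing method for diagnosing the effect of memory design.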

Comments 10 pages in the main text, 26 pages in total

Abstract

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.

2606.02332 2026-06-03 cs.AI cs.CL cs.LG

Forget Attention: Importance-Aware Attention Is All You Need

Suhyeong Shin, Yeongwook Yang

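Affiliations: Department of Computer Engineering

AI summary: Proposes SISA, which folds the importance signal of a state space model directly into the attention score computation (score-level fusion), combining global retrieval with importance ranking in language modeling.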

Comments 20 pages, 6 figures, 25 tables

Abstract

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
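
One way an additive importance term can be folded into a single stock SDPA call is to append one extra dimension to the query and key vectors. The sketch below checks that construction numerically in plain Python. It is an assumption about how such an augmentation could work, not the paper's actual formulation: the per-key `importance` values, the rescaling factor, and all function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sdpa_weights(q, keys):
    """Attention weights of plain SDPA: softmax(q . k_j / sqrt(dim))."""
    dim = len(q)
    return softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys])


def biased_weights(q, keys, importance):
    """Reference: softmax(q . k_j / sqrt(d) + b_j) with an explicit additive bias."""
    d = len(q)
    return softmax([sum(a * c for a, c in zip(q, k)) / math.sqrt(d) + bj
                    for k, bj in zip(keys, importance)])


def augment(q, keys, importance):
    """Append one dimension so plain SDPA on (d+1)-dim vectors reproduces the bias.

    q' = [q * sqrt((d+1)/d), sqrt(d+1)] and k'_j = [k_j, b_j], so that
    q' . k'_j / sqrt(d+1) = q . k_j / sqrt(d) + b_j.
    """
    d = len(q)
    rescale = math.sqrt((d + 1) / d)
    q_aug = [x * rescale for x in q] + [math.sqrt(d + 1)]
    keys_aug = [k + [bj] for k, bj in zip(keys, importance)]
    return q_aug, keys_aug


q = [1.0, 2.0]
keys = [[0.5, -1.0], [1.0, 0.0], [0.0, 1.0]]
importance = [0.3, -0.2, 1.0]      # hypothetical SSM-derived importance per key
q_aug, keys_aug = augment(q, keys, importance)
```

Calling `sdpa_weights(q_aug, keys_aug)` gives the same weights as `biased_weights(q, keys, importance)`, which shows how a score-level bias can ride along inside an unmodified attention kernel.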

2606.02132 2026-06-03 cs.AI

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

学习何时不行动:缓解智能体强化学习中的工具滥用

Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang

发表机构 * NLPR, Institute of Automation, Chinese Academy of Sciences(NLPR,自动化研究所,中国科学院) ByteDance(字节跳动) Zhejiang University(浙江大学)

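AI summary: Proposes the EAPO framework, which introduces tool-free trajectories, difficulty-aware reward shaping, and confidence-aware token reweighting to reduce tool abuse on mathematical and knowledge-intensive reasoning tasks while improving the accuracy-efficiency trade-off.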

Comments Under review

详情
AI中文摘要

智能体强化学习可能引发工具滥用,即模型过度使用外部工具,即使对于内部推理可解的查询也是如此。现有方法通过统一的工具使用惩罚或硬限制来缓解此问题,这降低了工具使用频率,但可能抑制有用的工具辅助探索。我们提出EAPO,一种高效的智能体策略优化框架,学习选择性工具使用。EAPO在每个rollout组中引入无工具轨迹,应用难度感知奖励塑造以主要对较简单查询上的冗余工具调用进行惩罚,并使用置信度感知令牌重加权来改进策略学习。在九个数学和知识密集型推理基准上,EAPO在Qwen2.5-3B、Qwen2.5-7B和Llama3.1-8B上持续改善了准确率-效率权衡。与GRPO相比,EAPO的平均性能分别提高了10.45%、7.27%和9.69%,同时平均工具调用次数分别减少了18.33%、18.33%和24.59%。这些结果表明,智能体可以在不损害工具集成推理的情况下学习何时不使用工具。

英文摘要

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
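
The abstract does not give EAPO's formulas, so the following is only a rough sketch of what "difficulty-aware reward shaping" inside a rollout group could look like. It assumes, hypothetically, that the group success rate serves as an easiness proxy, that the tool penalty is linear in the number of calls, and that advantages are normalized within the group in the GRPO style; the names `shaped_rewards`, `group_advantages`, and `penalty` are illustrative.

```python
from statistics import mean, pstdev

def shaped_rewards(correct, tool_calls, penalty=0.1):
    """Correctness reward minus a tool penalty that grows with query easiness.

    correct:    list of bools, one per rollout in the group
    tool_calls: list of ints, tool calls made by each rollout
    Easiness is approximated by the group success rate (an assumption),
    so hard queries are penalized far less for using tools.
    """
    easiness = mean(1.0 if c else 0.0 for c in correct)
    return [(1.0 if c else 0.0) - penalty * easiness * n
            for c, n in zip(correct, tool_calls)]

def group_advantages(rewards):
    """Group-relative advantages: (r - mean) / std, or zeros if all equal."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

With this shaping, a tool-free correct rollout on an easy query receives the largest advantage in its group, while on a hard query a correct tool-assisted rollout keeps almost all of its reward.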

2606.02091 2026-06-03 cs.CL

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

DFlare: 扩展块扩散推测解码的草稿容量

Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li

发表机构 * School of Computer Science, Peking University(北京大学计算机科学系) Tencent(腾讯)

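AI summary: Proposes DFlare, which scales draft-model capacity through a lightweight layer-wise fusion mechanism and reaches average speedups of up to 5.52x (on Qwen3-4B) across six benchmarks.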

Comments 12 pages, 3 figures

详情
AI中文摘要

块扩散推测解码通过同时预测一个块内的所有令牌,供目标模型并行验证,从而加速LLM推理。一次性预测整个块需要足够强大的草稿模型和有效利用目标模型的内部知识。然而,最先进的方法DFlash限制所有草稿层共享仅从少数目标层导出的单个融合表示,限制了逐层表达能力并阻碍了草稿容量的进一步扩展。在本文中,我们提出DFlare,通过轻量级逐层融合机制扩展DFlash的狭窄条件瓶颈:每个草稿层关注其自身可学习的广泛目标层组合,开销可忽略,同时注入更丰富的目标知识并为每个草稿层提供不同的输入。这种增强的逐层表达能力使得草稿模型能够扩展到更深的架构并获得一致的性能提升。我们进一步将训练数据从80万样本扩展到240万样本,以充分利用扩大的容量。在涵盖数学推理、代码生成和对话的六个基准上,DFlare在Qwen3-4B上平均加速5.52倍,在Qwen3-8B上平均加速5.46倍,在GPT-OSS-20B上平均加速3.91倍,分别比DFlash提高约11%、8%和5%。我们的代码可在https://github.com/Tencent/AngelSlim获取。

英文摘要

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
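
To make the contrast between a single shared fused representation and per-layer fusion concrete, the sketch below gives each draft layer its own row of mixing logits over the target layers. It assumes, purely for illustration, that the "learnable combination" is a softmax-weighted sum of target hidden states for one token; the actual DFlare mechanism may differ, and `layerwise_fusion` and `fusion_logits` are hypothetical names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layerwise_fusion(target_hidden, fusion_logits):
    """Build one conditioning vector per draft layer.

    target_hidden: list of vectors, one per target layer (single token)
    fusion_logits: one row of logits per draft layer, each row having
                   one logit per target layer (the learnable part)
    """
    dim = len(target_hidden[0])
    fused = []
    for logits in fusion_logits:
        weights = softmax(logits)
        fused.append([sum(w * h[j] for w, h in zip(weights, target_hidden))
                      for j in range(dim)])
    return fused
```

Giving every row the same logits reproduces the shared-representation case; distinct rows give each draft layer a distinct input at the cost of only one small weight vector per layer.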

2606.02090 2026-06-03 cs.CV

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

FocusDiT: 扩散Transformer中的查询掩码用于细粒度图像生成

Xueji Fang, Liyuan Ma, Jianhao Zeng, Jinjin Cao, Mingyuan Zhou, Guo-Jun Qi

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学)

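AI summary: Proposes FocusDiT, which masks critical query tokens so that they are fed only into the FFN layers, strengthening fine-grained visual generation; experiments support its effectiveness.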

详情
AI中文摘要

扩散Transformer(DiT)已被广泛应用于生成扩散领域,通过注意力和前馈(FFN)层推进查询令牌的去噪。FFN实际上充当解码视觉内容的键值词汇表,其中值嵌入视觉语义知识。我们提出,关注对应于更复杂细节的关键查询令牌,并鼓励模型改进这些令牌,对于细粒度视觉生成至关重要。为此,我们提出FocusDiT,它应用掩码方案来关注仅输入FFN的关键查询令牌。掩码查询可以从FFN词汇表中检索视觉令牌,并使用它们解码其视觉细节。大量的文本到图像实验验证了令牌掩码在增强生成性能方面的有效性。

英文摘要

Diffusion transformer (DiT) has been widely adopted in the generative diffusion field, advancing the denoising of query tokens through attention and Feed-Forward (FFN) layers. The FFN effectively acts as the key-value vocabulary for decoding visual content, where the value embeds the visual semantic knowledge. We argue that focusing on critical query tokens corresponding to more complex details and encouraging the model to improve these tokens is essential for fine-grained visual generation. To this end, we propose FocusDiT, which applies a masking scheme to focus on critical query tokens that are exclusively fed into the FFN. The masked queries can retrieve visual tokens from the FFN vocabularies and use them to decode their visual details. Extensive text-to-image experiments validate the effectiveness of token masking in enhancing generative performance.

2606.02060 2026-06-03 cs.AI

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

深度研究代理在何处出错?代理轨迹中的跨度级错误定位

Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

发表机构 * NJU-LINK Team, Nanjing University(南京大学NJU-LINK团队) JIUTIAN Research(JIUTIAN研究院)

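AI summary: To address the difficulty of locating errors in the long trajectories of deep-research agents, the paper builds the TELBench benchmark and proposes the DRIFT auditing framework for span-level error localization, improving span-level error localization and first-error accuracy by up to 30 percentage points.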

Comments 28 pages, 11 figures, 4 tables

详情
AI中文摘要

深度研究代理通过搜索、工具使用、证据检查和答案合成的长轨迹来完成任务。基于最终答案的评估可以显示代理是否成功,但无法显示轨迹的哪些部分导致答案不可靠。我们研究了深度研究代理的跨度级错误定位。我们从两个代理框架、三个骨干模型和三个基准中收集了2,790条真实轨迹,将原始日志转换为语义跨度,并通过LLM辅助的专家评审标注了有害错误跨度。基于这些标注,我们构建了TELBench,一个包含1,000个实例的基准,用于在正常探索、失败搜索、暂定假设和无害噪声中识别错误跨度。我们进一步提出了DRIFT,一个以声明为中心的审计框架,该框架跟踪代理声明,检查其在轨迹证据中的支持,并标记那些无支持或冲突声明影响答案路径的跨度。跨模型系列和审计框架的实验表明,DRIFT将跨度级错误定位和首次错误准确率提高了高达30个百分点。我们的工作提供了深度研究代理可靠性的过程级视角。

英文摘要

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

2606.02004 2026-06-03 cs.CL cs.LG

Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

将零售产品名称编码为消费者价格类别的机器学习:基于规则加词袋的流水线,结合可靠性加权的人工参与标注

Vladimir Beskorovainyi

发表机构 * Besk Tech(Besk科技) Moscow Institute of Physics and Technology (MIPT)(莫斯科物理技术学院)

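AI summary: Presents a rule-plus-bag-of-words pipeline with a reliability-weighted human-in-the-loop labeling protocol for mapping retail product names to consumer-price categories (e.g., UN COICOP). Experiments show that bag-of-words models nearly saturate the task (F1 about 0.99) and that reliability-weighted voting only slightly beats plain majority voting.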

Comments 11 pages, 3 tables. Methodology paper; illustrative experiments only, no proprietary data

详情
AI中文摘要

消费者价格测量越来越多地依赖替代数据源——扫描仪、网络抓取和交易/收据数据。一个反复出现的障碍是,这些来源中的产品描述简短、嘈杂且缩写,没有标准产品代码,因此每个项目必须首先映射到消费分类(例如,联合国COICOP方案),然后才能比较价格。本文将该映射作为一种通用的、可重复的方法进行研究。流水线包括:(i) 对嘈杂项目名称进行文本归一化和分词;(ii) 基于每类关键词和停用词的前缀树(trie)规则预分类器;(iii) 每个类别的二元确认模型,决定一个项目是否属于暂定分配的类别。对于大规模标注,我们使用人工参与协议,其中标注者给出二元有效/拒绝判断,通过动态更新的可靠性权重进行聚合;模型加入相同的规则,实现持续微调。我们的实证发现是通货紧缩的:在一个受控、无泄漏的研究中(一个类别,真实正例与困难负例,五个随机种子),词袋模型基本上饱和了任务(F1约0.99)——线性分类器匹配多层感知器,显式词序(n-gram)特征没有增加任何价值,约67个标注样本已经足够。标注协议的蒙特卡洛研究表明,可靠性加权投票勉强超过简单多数投票(其加性权重饱和),而Dawid-Skene方法明显更好地恢复标签。我们还讨论了价格层面的质量控制和统计办公室考虑交易数据时的设计经验。所有数字均为示意性;未复制任何机密数据、代码或文档。

英文摘要

Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
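
The labeling protocol described above (binary valid/reject votes aggregated by a dynamically updated reliability weight) can be sketched as follows. The additive update, its step size, and the clipping bounds are assumptions chosen for illustration; the abstract only states that the weights are additive and that they saturate.

```python
def weighted_vote(votes, weights):
    """Return True (valid) if the weighted 'valid' mass exceeds the 'reject' mass.

    votes:   dict mapping annotator id -> bool (True = valid, False = reject)
    weights: dict mapping annotator id -> current reliability weight
    """
    score = sum(weights[a] if v else -weights[a] for a, v in votes.items())
    return score > 0

def majority_vote(votes):
    """Plain majority baseline over the same binary votes."""
    valid = sum(1 for v in votes.values() if v)
    return valid > len(votes) - valid

def update_weights(votes, weights, decision, step=0.1, lo=0.1, hi=1.0):
    """Additive update: reward agreement with the decision, punish disagreement.

    Weights are clipped to [lo, hi]; once an annotator sits at a bound,
    further evidence no longer changes the weight (the saturation effect).
    """
    updated = dict(weights)
    for a, v in votes.items():
        delta = step if v == decision else -step
        updated[a] = min(hi, max(lo, updated[a] + delta))
    return updated
```

The clipping step illustrates why such a scheme can end up close to plain majority voting: after enough rounds most weights sit at a bound and stop discriminating between annotators.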

2606.01962 2026-06-03 cs.CV

Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection

基于领域增强的对比增强Transformer用于鲁棒的多场景金属表面缺陷检测

Yiyao Liu, Wenxiao He, Liyuan Ren, Huan Wang

发表机构 * Glasgow College, University of Electronic Science and Technology of China(格拉斯哥学院,电子科技大学)

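AI summary: Proposes the Contrastive Augmented Transformer (CAT) framework, which combines a Swin Transformer backbone, a feature pyramid network, a domain-specific droplet augmentation algorithm, and hard negative mining to address limited annotated data, hard-to-detect multi-scale defects, and poor cross-scenario generalization in metal surface defect detection; it reaches 99.54% pixel-level AUROC on KolektorSDD2.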

详情
AI中文摘要

金属表面缺陷检测对于维持工业制造中的产品质量至关重要。然而,它面临着重大挑战,包括有限的标注数据、难以识别细微的多尺度缺陷以及跨不同场景的泛化能力差。为了解决这些问题,本文提出了一种新颖的对比增强Transformer(CAT)框架,用于鲁棒的缺陷检测。CAT采用分层Swin Transformer骨干,并重新设计了特征金字塔网络,以有效融合低级纹理与高级语义,从而实现对细微和多尺度缺陷模式的精确建模。为了增强在真实噪声条件下的鲁棒性,我们提出了一种领域特定的液滴增强算法。此外,我们将难负样本挖掘策略纳入对比损失中,以增强模型在模糊缺陷区域的判别能力。在KolektorSDD2数据集上的实验结果表明,CAT实现了99.54%的像素级AUROC,优于现有方法。此外,CAT在三个未见过的数据集(包括KSDD1、用于瓷砖缺陷的MTD和用于轨道表面缺陷的MSDD)上表现出优越的泛化能力和鲁棒性,展示了其在大规模工业部署中的潜力。

英文摘要

Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.

2606.01904 2026-06-03 cs.CL cs.AI

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

KliniskVestBERT: 针对挪威临床文本特化的BERT模型

Christian Autenried, Cosimo Persia

发表机构 * helse-vest-ikt(赫尔塞-维斯特信息科技)

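AI summary: Builds the KliniskVestBERT models by continuing the pretraining of existing BERT models on Norwegian clinical text; they consistently outperform their baseline models on clinical NLP tasks.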

详情
AI中文摘要

自然语言处理(NLP)在医疗保健中的应用日益增长,这要求语言模型特别适应临床语言的复杂性。本文介绍了KliniskVestBERT,这是一套基于BERT的编码器模型套件,在来自Helse Vest的大量真实、去标识化的挪威临床文本语料库上进行预训练。我们在专门的临床数据集上继续预训练现有的语言模型Nb-BERT-large、NorBERT3-large和ModernBERT。该数据集基于Helse Vest患者的代表性人群。包含的文档类型经过精心策划,涵盖了bokmål和nynorsk的广泛临床范围,包括出院小结、手术报告、护理记录等,确保全面代表挪威医疗环境中的语言景观。在三个合成挪威临床基准数据集和两个真实世界问题上的评估表明,每个临床特化模型都持续优于其基线对应模型,突显了领域特定预训练对临床领域NLP任务的显著益处。该项目由所有Helse Vest实体(Helse Bergen、Helse Fonna、Helse Førde和Helse Stavanger)与DIPS在Helse Vest ICT的项目领导下共同完成。

英文摘要

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk, including discharge summaries, surgical reports, nursing notes, and others, ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms its baseline counterpart, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.

2606.01851 2026-06-03 cs.RO

PHASOR: Phase-Anchored Universal Action Representations for Humanoid Embodiments

PHASOR: 面向人形本体的相位锚定通用动作表示

Kihyun Kim, Chaeyun Kim, Jongho Shin, Taeyoun Kwon, Junghyun Kim, Mijin Koo, Haon Park

发表机构 * AIM Intelligence Seoul National University(首尔国立大学) LG Electronics(LG电子) MAUM AI OpenMind

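AI summary: Proposes PHASOR, which factorizes the action embedding space into a phase manifold and a pose branch and combines this with motion-semantic distillation to build a cross-embodiment universal action representation, enabling cross-embodiment retrieval and gains on downstream tasks for humanoid robots.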

Comments * Equal contribution

详情
AI中文摘要

学习一个好的动作嵌入空间对于可扩展的机器人策略学习至关重要,但现有方法将动作潜在变量视为任务特定的中间产物,而非第一类表示。由此产生的潜在变量是非结构化的、本体特定的,且与运动语义关联较弱,限制了可解释性、可控性和跨机器人的迁移性。我们将动作嵌入空间本身定位为第一类设计目标,下游策略质量源于表示质量。利用运动的内在周期性,我们将其分解为一个相位流形(通过FFT参数系数捕获循环结构)和一个姿态分支(将流形条件化为非周期配置细节)。结合运动语义蒸馏,这种分解结构产生了一个跨本体的运动流形,该流形在设计上是可解释且与本体无关的。将多个人形机器人锚定到一个共享的预训练流形上,则在不同平台上产生统一的动作嵌入空间,实现了强大的跨本体检索和下游机器人任务的一致性能提升。

英文摘要

Learning a good action embedding space is fundamental to scalable robot policy learning, yet existing methods treat action latents as task-specific intermediates rather than first-class representations. The resulting latents are unstructured, embodiment-specific, and weakly tied to motion semantics, limiting interpretability, controllability, and transferability across robots. We position the action embedding space itself as a first-class design target, with downstream policy quality emerging from representation quality. Exploiting motion's intrinsic periodicity, we factorize it into a phase manifold that captures cyclic structure via FFT-parametric coefficients, together with a pose branch that conditions the manifold on non-periodic configuration detail. Combined with motion-semantic distillation, this factorized structure yields a cross-embodiment motion manifold that is interpretable and embodiment-agnostic by design. Anchoring multiple humanoid robots to a shared human-pretrained manifold then produces a unified action embedding space across diverse platforms, achieving strong cross-embodiment retrieval and consistent gains on downstream robot tasks.

2606.01849 2026-06-03 cs.LG cs.CL cs.CR

ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?

ContinuousBench: 差分隐私合成文本能否提升能力?

Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie

发表机构 * Columbia University(哥伦比亚大学) NYU(纽约大学) Google Research(谷歌研究) University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) Google(谷歌)

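AI summary: Introduces ContinuousBench, a continuously updated benchmark that tests whether differentially private synthetic text can transmit new knowledge from the original corpus. Non-private synthesis transfers knowledge effectively, while state-of-the-art DP synthesis methods largely fail even at a high privacy budget.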

Comments For datasets, see https://huggingface.co/ContinuousBench; for the evaluation harness, see https://github.com/plau666/ContinuousBenchEval; for an accompanying blog post, see https://peihanliu.com/posts/continuousbench.html

详情
AI中文摘要

差分隐私(DP)文本合成有望为模型训练解锁敏感语料库,但目前尚不清楚DP合成数据是否能传递仅存在于这些语料库中的真正新知识和能力。这是因为现有评估依赖于无需训练即可几乎解决的任务,因此强大的基准性能并不能证明DP合成可以替代原始数据访问。为此,我们引入了ContinuousBench,一个持续自动更新的基准,用于衡量DP合成文本带来的能力提升。每季度发布一次新版本,配对一个从未见过的训练语料库和一个衍生的问答集,其构建满足:(1)无语料库无法解决;(2)在DP下可学习,因为测试的知识由数百条独立记录支持。研究人员从训练语料库生成DP合成数据,并在其合成数据上运行我们的标准化训练和评估框架以衡量增益。我们实例化两个轨道:Geminon,一个关于虚构生物的程序生成数据集;以及News,一个新爬取的公共新闻文章流。尽管标准基准几乎饱和,但在ContinuousBench上,我们发现非隐私合成从原始语料库转移了大量知识,而最先进的DP合成方法即使在高隐私预算(ε=100)下也基本无法做到。

英文摘要

Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute original data access. Thus, we introduce ContinuousBench, a continuously and automatically-regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never-before-seen training corpus with a derived QA set, constructed to be: (1) unsolvable sans-corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally-generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non-private synthesis transfers substantial knowledge from the original corpus, while state-of-the-art DP synthesis methods generally fail to do so, even at $\varepsilon=100$.

2606.01767 2026-06-03 cs.AI

EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

EvoBrain: 面向异构BCI任务的EEG基础模型的持续学习

Yangxuan Zhou, Sha Zhao, Jiquan Wang, Shijian Li, Gang Pan

发表机构 * State Key Laboratory of Brain-machine Intelligence and College of Computer Science and Technology, Zhejiang University(脑机智能国家重点实验室和浙江大学计算机科学与技术学院) MOE Frontier Science Center for Brain Science and Brain-Machine Integration, Zhejiang University(教育部脑科学与脑机集成前沿科学中心,浙江大学)

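AI summary: Proposes the EvoBrain framework, which uses Neuro-Spectral Task Normalization and Response-Affinity Distillation to address continual learning of EEG foundation models across heterogeneous BCI tasks, enabling cross-task knowledge transfer and mitigating forgetting.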

Comments 18 pages, 12 figures

详情
AI中文摘要

脑电图(EEG)是非侵入式脑机接口(BCI)的基石,然而传统的解码依赖于碎片化的、特定任务的架构,严重限制了跨任务的可扩展性。虽然在大规模语料库上预训练的EEG基础模型有望实现通用脑解码,但当前的后训练依赖于任务隔离的微调。这种静态范式限制了跨异构任务的知识迁移,阻碍了模型的可扩展性,并导致计算和存储开销随任务数量线性增长。为了克服这些瓶颈,我们将下游适应形式化为跨任务的持续学习问题,并提出了EvoBrain,一个动态的、任务感知的持续学习框架,用于统一的EEG解码。EvoBrain通过两个互补组件解决可塑性-稳定性权衡:(1)神经频谱任务归一化(NSN)将传入任务与历史统计对齐,同时重新校准频谱响应以处理分布和神经频谱偏移;(2)响应亲和蒸馏(RAD)结合时间依赖的重放,保留旧任务的响应几何结构,并促进频谱兼容任务之间的选择性知识迁移,有效缓解遗忘。在六个不同BCI任务上的广泛评估表明,EvoBrain在各种基础骨干网络上始终优于最先进的方法,最佳地平衡了可塑性和稳定性。据我们所知,这项工作开创了EEG领域的跨任务持续学习,推进了统一的、一劳永逸的脑解码系统的实现。

英文摘要

Electroencephalography (EEG) is the cornerstone of non-invasive brain-computer interfaces (BCIs), yet conventional decoding relies on fragmented, task-specific architectures that severely limit cross-task scalability. While EEG foundation models pre-trained on massive corpora promise universal brain decoding, current post-training depends on task-isolated fine-tuning. This static paradigm restricts knowledge transfer across heterogeneous tasks, hinders model scalability, and incurs computational and storage overheads that scale linearly with task count. To overcome these bottlenecks, we formulate downstream adaptation as a cross-task continual learning problem and propose EvoBrain, a dynamic, task-aware continual learning framework for unified EEG decoding. EvoBrain addresses the plasticity-stability trade-off via two complementary components: (1) Neuro-Spectral Task Normalization (NSN) aligns incoming tasks with historical statistics while recalibrating spectral responses to handle distributional and neuro-spectral shifts; and (2) Response-Affinity Distillation (RAD), combined with time-dependent replay, preserves old-task response geometry and promotes selective knowledge transfer between spectrally compatible tasks, effectively mitigating forgetting. Extensive evaluations across six distinct BCI tasks demonstrate that EvoBrain consistently surpasses state-of-the-art methods across diverse foundation backbones, optimally balancing plasticity and stability. To our knowledge, this work pioneers cross-task continual learning in the EEG domain, advancing the realization of a unified, one-for-all brain decoding system.

2606.01629 2026-06-03 cs.CL

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

基准测试LLM作为裁判在长文本输出评估中的应用

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) University of Science and Technology Beijing(北京科技大学)

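AI summary: Introduces the LongJudgeBench benchmark, which systematically evaluates the reliability of LLM judges on long-form outputs and finds a substantial reliability gap in current models.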

详情
AI中文摘要

随着大语言模型(LLM)越来越多地用于长文本生成,可靠地评估长文本输出已成为一个关键挑战。LLM作为裁判提供了一种可扩展的替代人工评估的方法,但其在长文本输出评估中的可靠性仍未得到充分检验:现有的元评估基准主要关注短文本输出。与短文本评估相比,长文本评估不仅仅是输出长度的问题;它通常要求裁判处理更复杂的文档级需求。在这项工作中,我们引入了LongJudgeBench,这是一个全面的基准,用于评估LLM裁判在跨多种真实场景和评判协议下的长文本输出表现。我们系统地评估了广泛的LLM裁判,涵盖了多个基础模型和评判设置。我们的结果揭示了显著的可靠性差距:当前的LLM裁判在不同场景下仍然不稳定,而评分标准或参考虽然有所帮助,但并非总是足够。我们希望LongJudgeBench能够支持未来在更稳健、上下文感知且与人类对齐的LLM-as-a-judge方法上的研究。我们的代码可在https://anonymous.4open.science/r/LongJudgeBench-F782获取。

英文摘要

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to make more complex document-level assessments of overall organization, task-relevant coverage and depth, cross-section consistency, and scenario-specific quality criteria. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://github.com/cjj826/LongJudgeBench.

2606.01532 2026-06-03 cs.LG cs.CC

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

重新思考位置编码的作用:无PE的滑动窗口Transformer仍具图灵完备性

Qian Li, Xinyu Mao, Shang-Hua Teng

发表机构 * Shenzhen Research Institute of Big Data(深圳大数据研究院) University of Southern California(南加州大学)

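AI summary: Proves that, under a sliding-window mechanism, a Transformer without positional encoding can still simulate Turing-complete Post machines through the evolution of the window, and is therefore capable of universal computation.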

详情
AI中文摘要

位置编码(PE)被广泛认为是Transformer处理有序序列所必需的:没有位置编码,下一个token映射在其上下文token中似乎是置换不变的。这一直觉支撑了所有先前的普适性结果,这些结果依赖位置信息来证明具有思维链的Transformer可以执行任意计算,即它们是图灵完备的。我们在与长程推理最相关的机制下重新审视这一信念,其中生成通过有限的滑动上下文窗口进行。我们的初步认识是,窗口机制本身(轻微地)打破了置换对称性。为了提炼并精确捕捉这种额外表达能力的大小,我们引入了一个抽象的自回归模型——HIST模型,其中每次更新仅依赖于恒定大小的内部状态和当前窗口内的token计数直方图。我们证明这个HIST模型是图灵完备的,通过展示窗口的演化可以揭示刚刚离开窗口的token,这足以模拟图灵完备的Post机器。然后,我们构建了一个在恒定大小token字母表上的滑动窗口Transformer,没有位置编码,并证明它可以模拟HIST模型。我们的结果表明,位置编码对于Transformer执行通用计算并非不可或缺:窗口滑动本身已经打破了置换对称性并捕获了足够的位置信息。

英文摘要

Positional encoding (PE) is widely viewed as necessary for transformers to process ordered sequences: without it, the next-token map appears permutation-invariant in its context tokens. This intuition underlies all prior universality results, which rely on positional information to prove that transformers with chain-of-thought can perform arbitrary computation, i.e., they are Turing complete. We revisit this belief in the regime most relevant to long-form reasoning, where generation proceeds through a finite sliding context window. Our starting observation is that the window mechanism itself (mildly) breaks the permutation symmetry. To distill and precisely capture the degree of this added expressiveness, we introduce an abstract autoregressive model, the HIST model, in which each update depends only on constant-size internal state and the token-count histogram within the current window. We prove that this HIST model is Turing complete by showing that the evolution of the window can reveal the token that has just left the window, which suffices to simulate Turing-complete Post machines. We then construct a sliding-window transformer over a constant-size token alphabet, without PE, and show that it can simulate the HIST model. Our result demonstrates that positional encodings are not indispensable for transformers to perform universal computation: the window sliding itself already breaks permutation symmetry and captures sufficient positional information.
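
The claim that the evolution of the window can reveal the token that has just left it rests on simple histogram arithmetic: once the window is full, the new histogram equals the old one plus the appended token minus the departed token. The sketch below illustrates only that arithmetic; it is not the paper's construction, and the helper names are hypothetical.

```python
from collections import Counter

def window_hist(tokens, end, size):
    """Token-count histogram of the window covering tokens[end - size:end]."""
    return Counter(tokens[max(0, end - size):end])

def departed_token(prev_hist, new_hist, appended):
    """Recover the token that left a full window from two histograms.

    If nothing had left, the new histogram would be prev_hist plus the
    appended token; the single surplus count is the departed token.
    """
    expected = prev_hist + Counter([appended])
    surplus = expected - new_hist  # Counter subtraction keeps positive counts only
    (token, _count), = surplus.items()
    return token
```

For example, with a window of size 3 over the stream `a b c a`, the histograms before and after appending the second `a` are identical as multisets, yet the surplus count still identifies `a` as the token that left.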

2606.01348 2026-06-03 cs.CV

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

ChartArena: 跨语言、场景和格式的图表解析基准测试

Shangpin Peng, Gengluo Li, Xingyu Wan, Chengquan Zhang, Hao Feng, Binghong Wu, Huawen Shen, Weinong Wang, Ziyi Cai, Zhuotao Tian, Han Hu, Can Ma, Yu Zhou

发表机构 * Large Language Model Department, Tencent(腾讯大语言模型部门) Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) Shenzhen Loop Area Institute(深圳环湖区研究所) Nankai University(南开大学)

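AI summary: Introduces ChartArena, a bilingual benchmark covering eight chart families and three visual scenarios. Using human-agent collaborative annotation and a format-agnostic evaluation protocol, it systematically evaluates the chart parsing ability of 26 multimodal large language models and exposes the gaps and open challenges among frontier models.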

详情
AI中文摘要

图表是传达定量和关系信息的主要媒介,但系统地评估图表解析模型仍然困难。现有基准专注于狭窄的图表类型,而流程图和思维导图等图表结构在很大程度上未被涉及,同时模型输出格式不兼容,数据集也极少包含实践中遇到的打印或手绘图像。为解决这些问题,我们引入了ChartArena,一个全面的双语基准,涵盖8种图表族,包括数值图表和图表结构,每种图表在三种视觉场景中评估:数字渲染、打印照片和手绘照片。该数据集通过人机协作标注流程构建,并经过多阶段人工验证以确保标注可靠性。为实现公平的跨模型比较,我们进一步设计了一种格式无关的评估协议,将异构输出映射到两个规范语义空间:归一化三元组视图和有向图视图,并使用结构感知指标进行评分。通过对26个领先的多模态大语言模型的广泛评估,我们观察到三个一致的发现:(i) 前沿专有模型(如Gemini 3.1 Pro)总体领先,但最强的开源系统正在迅速缩小差距;(ii) 文档解析模型能较好地处理数值图表,但在图表结构上表现大幅落后;(iii) 专家图表解析器仍局限于狭窄的图表族。在所有模型中,雷达图和手绘场景尤其具有挑战性。这些发现表明,ChartArena揭示了清晰的能力差距,并为未来的进展提供了统一基础。ChartArena公开在 https://github.com/pspdada/ChartArena。

英文摘要

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on narrow chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed, while models produce outputs in incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families spanning both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios stay especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
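
As a rough illustration of scoring in a normalized triple view, the sketch below canonicalizes (subject, predicate, value) triples and computes a set-level F1. The normalization rules (lowercasing, whitespace collapsing, rounding numbers to two decimals) and the choice of exact-match F1 are assumptions; ChartArena's actual structure-aware metrics are not specified in the abstract.

```python
def normalize(triple, ndigits=2):
    """Canonicalize one (subject, predicate, value) triple for comparison."""
    def norm(x):
        if isinstance(x, (int, float)):
            return round(float(x), ndigits)
        text = str(x).strip().lower()
        try:
            # Numeric strings such as "10.50" compare equal to 10.5.
            return round(float(text), ndigits)
        except ValueError:
            return " ".join(text.split())
    s, p, o = triple
    return (norm(s), norm(p), norm(o))

def triple_f1(predicted, gold):
    """Exact-match F1 between two collections of triples after normalization."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Mapping every model's output into the same triple set before scoring is what makes the comparison independent of whether a model emits CSV, JSON, or Markdown tables.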

2606.01340 2026-06-03 cs.LG stat.ML

Sample Complexity and Decision-Theoretic Guarantees for Bayesian Model Averaging over Decision Trees with Catalan-Exponential Priors

基于Catalan指数先验的决策树贝叶斯模型平均的样本复杂度和决策理论保证

Livija Jakaite, Vitaly Schetinin

发表机构 * School of Computing and Engineering, University of Bedfordshire, Luton, UK(贝德福德郡大学计算与工程学院,卢顿,英国)

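AI summary: For Bayesian decision trees with Dirichlet-Multinomial leaf models and a Catalan-exponential tree-size prior, establishes a complete non-asymptotic theory of rational commitment thresholds, answering when Bayesian model averaging weights carry enough epistemic information to justify committed exploitation of the averaging distribution.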

Comments 22 pages, 3 figures, Submitted to the Journal of Machine Learning Research

详情
AI中文摘要

我们提出一个问题:当决策树上的贝叶斯模型平均(BMA)权重携带足够的认知信息时,何时可以证明对平均分布的承诺利用是合理的?对于具有Dirichlet-Multinomial叶模型和Catalan指数树大小先验(Schetinin & Jakaite, 2025)的贝叶斯决策树(BDTs),我们以闭式回答了这个问题,建立了理性承诺阈值的完整非渐近理论。

英文摘要

We ask: when do Bayesian model averaging (BMA) weights over decision trees carry sufficient epistemic information to justify committed exploitation of the averaging distribution? We answer this question in closed form for Bayesian decision trees (BDTs) with Dirichlet-Multinomial leaf models and a Catalan-exponential tree-size prior (Schetinin & Jakaite, 2025), establishing a complete non-asymptotic theory of rational commitment thresholds.

2606.01269 2026-06-03 cs.AI

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

局部比较训练下Transformer中涌现的序数几何

Nishit Singh

发表机构 * Birla Institute of Technology and Science, Pilani(比拉理工学院和科学学院,皮兰)

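AI summary: Trained only on adjacent comparisons, Transformer models generalize to unseen distant pairs and form a one-dimensional ordinal geometry; decision confidence scales monotonically with rank distance, mirroring the symbolic distance effect.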

Comments 11 pages, 12 figures

详情
AI中文摘要

传递性推理是指仅从已知的相邻关系(A < B, B < C)推断出A < C的挑战。人类和动物解决这一问题并非通过逻辑链,而是借助一个模拟的心理数字线,其标志是符号距离效应:远距离比较比近距离比较更容易。我们探究Transformer是否获得相同的原始能力,仅使用隐藏全序中的相邻比较训练小型模型,并评估对未见远距离对的泛化。我们发现,分布外泛化伴随着惊人的几何重组:实体嵌入坍缩到一维流形上,其主轴以近乎完美的保真度恢复隐藏的秩序,并且这种结构对优化方式敏感,产生类似grokking的瞬态动力学。关键的是,即使准确率达到上限,决策置信度和几何分离度都随秩距离单调变化,直接反映了在人类、灵长类和啮齿类动物数十年的行为实验中观察到的符号距离效应。这些结果将50年来的行为规律建立在学习表示的几何基础上,为连接认知科学和现代神经网络的传递性推理提供了机制性解释。

英文摘要

Transitive inference is the challenge of inferring that A < C from knowing only adjacent relations (A < B, B < C). It is solved by humans and animals not through logical chaining but via an analogue mental number line, whose signature is the symbolic distance effect: distant comparisons are easier than nearby ones. We ask whether Transformers acquire the same primitive, training small models exclusively on adjacent comparisons from a hidden total order and evaluating generalization to unseen distant pairs. We find that out-of-distribution generalization emerges alongside a striking geometric reorganization: entity embeddings collapse onto a one-dimensional manifold whose principal axis recovers the hidden rank order with near-perfect fidelity, and this structure is sensitive to optimization in ways that produce grokking-like transient dynamics. Critically, even when accuracy is at ceiling, decision confidence and geometric separation both scale monotonically with rank distance, directly mirroring the symbolic distance effect observed across decades of behavioural experiments on humans, primates, and rodents. We further show the same rank-aligned geometry in a pretrained large language model, where it tracks the topology of each ordinal relation: linear for sizes and digits, cyclic for months. These results ground a 50-year-old behavioural regularity in the geometry of learned representations, offering a mechanistic account of transitive inference that bridges cognitive science and modern neural networks.

2606.01241 2026-06-03 cs.RO

OneVLA: A Unified Framework for Embodied Tasks

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chenhao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, Wenbo Ding

Affiliations Tsinghua University; Pengcheng Laboratory; Xiaomi EV; Institute of Automation, Chinese Academy of Sciences; Peking University; HKUST(GZ)

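AI summary Proposes OneVLA, a unified architecture that uses a unified action head and a progressive training strategy (including data construction and Chain-of-Thought fine-tuning) to achieve positive cross-task transfer on navigation and manipulation tasks and reach state-of-the-art performance.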

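AI abstract (translated from Chinese)

Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their environment. However, current Vision-Language-Action (VLA) models remain limited by task-specific architectures that handle either navigation or manipulation, which hinders the development of general-purpose robotic agents. To close this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single coherent framework. Specifically, we design a unified action head that generates both navigation and manipulation actions without task-specific variants. In addition, we propose a multi-stage progressive training strategy, combining carefully constructed data with Chain-of-Thought (CoT) fine-tuning, that promotes strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in simulated and real-world environments show that OneVLA achieves state-of-the-art performance, significantly outperforming specialized single-task models and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.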

English abstract

Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic agents. To bridge this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single, cohesive framework. Specifically, we design a unified action head capable of generating both navigation and manipulation actions without requiring task-specific variants. Furthermore, we propose a multi-stage progressive training strategy, incorporating curated data construction and Chain-of-Thought (CoT) fine-tuning, that facilitates strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in both simulated and real-world environments demonstrate that OneVLA achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.