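arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.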

2606.02607 2026-06-03 cs.LG cs.AI cs.CR

Geometry-Aware Tabular Diffusion

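David Turtora Zagardo

Affiliation: arXiv

AI summary: Proposes Geometry-Aware Tabular Diffusion (GATD), which feeds pairwise angles and lengths of column value differences into the diffusion denoiser as both inputs and auxiliary targets to model inter-column relationships explicitly, reaching state-of-the-art performance on 10 datasets with fewer parameters.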

Comments Accepted to the ICML 2026 main track. 24 pages, 10 figures, 22 tables

Abstract

Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture inter-column relationships. We introduce Geometry-Aware Tabular Diffusion (GATD), which augments tabular diffusion denoisers with pairwise angles and lengths computed from column value differences and used as inputs and auxiliary targets. Our MLP instantiation achieves state-of-the-art benchmark performance while using 3.5x fewer parameters on average (up to 25x for classification tasks): on ten datasets, it wins 8/10 Shape, 7/10 Trend, and 9/10 downstream utility (F1/RMSE), reducing Shape and Trend error by 27% and 20%. Default loss weights transfer to GNN and Transformer denoisers, improving Shape on 27/30 and Trend on 25/30 architecture-dataset cells. A matched ablation shows supervision (not extra inputs or capacity) drives the gain. This shows explicit relational supervision is a portable inductive bias for tabular diffusion.

2606.02606 2026-06-03 cs.LG cs.AI

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

Yang Xu, Zihuai Xu, Hongli Xu, Yunming Liao, Zhiwei Yao, Xitong Fu

Affiliations: School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China

AI summary: Addresses the problem that frequent base-model updates invalidate existing LoRA adapters by proposing the ReLoRA framework, which combines Bayesian-optimization-based initialization with scheduled-regularization fine-tuning to reuse knowledge and re-adapt quickly, lowering compute overhead and improving performance.

Abstract

Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate previously deployed task-specific Low-Rank Adaptation (LoRA) adapters. For service providers managing numerous downstream model services, retraining each LoRA adapter from scratch for every updated base model is computationally prohibitive and delays service rollout. Meanwhile, the simpler alternative, i.e., naively applying the original LoRA adapter to the updated base model, often leads to degraded service quality due to adapter-backbone incompatibility. To address this problem, we propose ReLoRA, a knowledge-reusing re-adaptation framework that efficiently restores service-ready LoRA adapters for evolving LLM services while preserving or improving task performance. Specifically, ReLoRA comprises two key optimization steps: 1) Adaptive LoRA initialization leverages Bayesian optimization to construct a compatibility-aware starting point by fusing information from both the previously deployed task adapter and the base model's evolution; 2) Fine-tuning with scheduled regularization first rapidly steers the adapter to a high-quality region via strong regularization, followed by relaxed regularization for task-specific refinement. This design enables rapid service-quality recovery with reduced re-adaptation overhead. Extensive experiments demonstrate that ReLoRA reduces time-to-readiness by up to 8.9x and improves accuracy by up to 4.6% compared to baselines.
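
The "strong first, relaxed later" regularization in step 2 can be pictured with a small schedule function. This is a minimal sketch and not the authors' method: the abstract does not specify the regularizer or the shape of the schedule, so the hold-then-linear-decay shape, the names `reg_weight` and `regularized_loss`, the parameters `lam_start`, `lam_end`, and `warm_frac`, and the L2 pull toward the initial adapter are all assumptions made for illustration.

```python
import numpy as np

def reg_weight(step, total_steps, lam_start=1.0, lam_end=0.01, warm_frac=0.3):
    """Hypothetical schedule: hold a strong weight, then decay it linearly."""
    hold = int(warm_frac * total_steps)
    if step < hold:
        return lam_start
    # Fraction of the decay phase completed, clipped to [0, 1].
    frac = min(1.0, (step - hold) / max(1, total_steps - hold))
    return lam_start + frac * (lam_end - lam_start)

def regularized_loss(task_loss, adapter, adapter_init, step, total_steps):
    """Task loss plus a scheduled L2 pull toward the initial adapter (assumed form)."""
    penalty = float(np.sum((adapter - adapter_init) ** 2))
    return task_loss + reg_weight(step, total_steps) * penalty
```

Early steps keep the adapter close to its starting point; once the weight has decayed, the task loss dominates and the adapter is free to specialize.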

2606.02605 2026-06-03 cs.LG cs.AI eess.IV

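Graph Mamba Survival Analysis Based on Topology-Aware ordering

Yuanfang Chen, Peiqiang Yan, Yuntao Shou, Qian Zhao, Xiangyong Cao

Affiliations: School of Mathematics and Statistics; West China Science and Technology Innovation Harbor; School of Computer Science and Technology

AI summary: To address the sensitivity of Mamba models to input order and the way a unidirectional architecture limits the use of spatial structure in WSI (whole-slide image) survival analysis, proposes TopoMamSurv, a graph Mamba framework based on topology-aware ordering, which combines a TAO strategy, a bidirectional Mamba module, and GCN integration to model long-range dependencies efficiently and capture bidirectional spatial context.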

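FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

Xueji Fang, Liyuan Ma, Jianhao Zeng, Jinjin Cao, Mingyuan Zhou, Guo-Jun Qi

Affiliations: Zhejiang University; Westlake University

AI summary: Proposes FocusDiT, which masks key query tokens so that they are fed only to the FFN layers, strengthening fine-grained visual generation; experiments confirm its effectiveness.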

2606.02060 2026-06-03 cs.AI

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu

Affiliations: NJU-LINK Team, Nanjing University; JIUTIAN Research

AI summary: Targets the difficulty of locating errors in the long trajectories of deep-research agents by building the TELBench benchmark and proposing the DRIFT auditing framework, achieving span-level error localization and raising first-error localization accuracy by up to 30 percentage points.

Comments 28 pages, 11 figures, 4 tables

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

2606.02004 2026-06-03 cs.CL cs.LG

Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

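Vladimir Beskorovainyi

Affiliations: Besk Tech; Moscow Institute of Physics and Technology (MIPT)

AI summary: Proposes a pipeline that combines rules with bag-of-words models, plus a reliability-weighted human-in-the-loop labeling protocol, to map retail product names to consumer-price categories (e.g., UN COICOP). Experiments show that bag-of-words models nearly saturate the task (F1 about 0.99), while reliability-weighted voting in the labeling protocol is only slightly better than plain majority voting.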

Comments 11 pages, 3 tables. Methodology paper; illustrative experiments only, no proprietary data

DOI: 10.5281/zenodo.20503355

Abstract

Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
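
Step (ii) of the pipeline, the prefix-tree pre-classifier, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the class `PhraseTrie`, the function `pre_classify`, whitespace tokenization, and the rule that a stop-phrase vetoes its category are all hypothetical choices, since the abstract does not give these details.

```python
_END = object()  # sentinel key marking "a phrase ends here"; never equals a token


class PhraseTrie:
    """Token-level prefix tree mapping phrases to category labels."""

    def __init__(self):
        self.root = {}

    def add(self, phrase, label):
        node = self.root
        for tok in phrase.lower().split():
            node = node.setdefault(tok, {})
        node.setdefault(_END, set()).add(label)

    def matches(self, tokens):
        """Return the labels of every phrase that occurs contiguously in tokens."""
        found = set()
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:]:
                if tok not in node:
                    break
                node = node[tok]
                found |= node.get(_END, set())
        return found


def pre_classify(name, key_trie, stop_trie):
    """Tentative categories: key-phrase hits minus categories vetoed by stop-phrases."""
    tokens = name.lower().split()
    return key_trie.matches(tokens) - stop_trie.matches(tokens)
```

With key-phrases `milk` (dairy) and `chocolate` (confectionery) and the stop-phrase `milk chocolate` for dairy, an item named "milk chocolate bar" keeps only the confectionery candidate, which the per-category confirmation model in step (iii) would then accept or reject.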

2606.01962 2026-06-03 cs.CV

Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection

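Yiyao Liu, Wenxiao He, Liyuan Ren, Huan Wang

Affiliation: Glasgow College, University of Electronic Science and Technology of China

AI summary: Proposes the Contrastive Augmented Transformer (CAT) framework, which combines a Swin Transformer backbone, a feature pyramid network, a domain-specific droplet augmentation algorithm, and a hard negative mining strategy to tackle limited annotated data, hard-to-identify multi-scale defects, and poor cross-scenario generalization in metal surface defect detection, reaching 99.54% pixel-level AUROC on the KolektorSDD2 dataset.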

Abstract

Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.
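
The headline metric, pixel-level AUROC, treats every pixel as one binary classification example. The sketch below computes it with NumPy through the rank-sum (Mann-Whitney U) identity, using average ranks so that tied scores count as half a win. It is not the authors' evaluation code; the function name `pixel_auroc` and the convention that a nonzero mask value marks a defect pixel are assumptions.

```python
import numpy as np

def pixel_auroc(scores, masks):
    """AUROC over all pixels via the rank-sum (Mann-Whitney U) identity."""
    s = np.asarray(scores, dtype=float).ravel()
    y = np.asarray(masks).ravel().astype(bool)
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one defect and one normal pixel")
    # Average 1-based ranks, so tied scores share the mean rank of their group.
    _, inverse, counts = np.unique(s, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank in each tie group
    avg_rank = last_rank - (counts - 1) / 2.0   # mean rank of each tie group
    ranks = avg_rank[inverse]
    u = ranks[y].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

Because anomaly maps are flattened before ranking, the same function accepts a single image or a whole stack of score maps with matching ground-truth masks.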

2606.01904 2026-06-03 cs.CL cs.AI

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai

Affiliations: Department of Computer Science and Technology, Tsinghua University; University of Science and Technology Beijing

AI summary: Proposes the LongJudgeBench benchmark to systematically evaluate how reliable LLM judges are when evaluating long-form outputs, and finds a substantial reliability gap in current models.

Abstract

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to make more complex document-level assessments of overall organization, task-relevant coverage and depth, cross-section consistency, and scenario-specific quality criteria. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://github.com/cjj826/LongJudgeBench.

2606.01532 2026-06-03 cs.LG cs.CC

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

Qian Li, Xinyu Mao, Shang-Hua Teng

Affiliations: Shenzhen Research Institute of Big Data; University of Southern California

AI summary: Proves that, in the sliding-window regime, a transformer without positional encoding can still simulate Turing-complete Post machines through the evolution of the window, and is therefore capable of universal computation.

Abstract

Positional encoding (PE) is widely viewed as necessary for transformers to process ordered sequences: without them, the next-token map appears permutation-invariant in its context tokens. This intuition underlies all prior universality results, which rely on positional information to prove that transformers with chain-of-thought can perform arbitrary computation, i.e., they are Turing complete. We revisit this belief in the regime most relevant to long-form reasoning, where generation proceeds through a finite sliding context window. Our opening perception is that the window mechanism itself (mildly) breaks the permutation symmetry. To distill and precisely capture the degree of this added expressiveness, we introduce an abstract autoregressive model, the HIST model, in which each update depends only on constant-size internal state and the token-count histogram within the current window. We prove that this HIST model is Turing complete by showing that the evolution of the window can reveal the token that has just left the window, which suffices to simulate Turing-complete Post machines. We then construct a sliding-window transformer over a constant-size token alphabet, without PE, and show that it can simulate the HIST model. Our result demonstrates that positional encodings are not indispensable for transformers to perform universal computation: The window sliding itself already breaks permutation symmetry and captures sufficient positional information.
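
The counting argument behind "the evolution of the window can reveal the token that has just left" can be illustrated directly: the histogram before the slide, plus the newly appended token, minus the histogram after the slide, leaves exactly one token. The sketch below shows only this bookkeeping, assuming a full window and that the appended token is known (it is the token just generated). The names `departed_token` and `replay` are hypothetical, and this is neither the paper's HIST construction nor its transformer.

```python
from collections import Counter, deque

def departed_token(hist_before, new_token, hist_after):
    """Recover the token that slid out of a full fixed-size window.

    hist_before and hist_after are token-count histograms of the window
    before and after new_token is appended.
    """
    diff = Counter(hist_before)
    diff[new_token] += 1
    diff.subtract(hist_after)
    left = [tok for tok, count in diff.items() if count != 0]
    # Exactly one token leaves when a full window slides by one position.
    assert len(left) == 1 and diff[left[0]] == 1
    return left[0]

def replay(tokens, width):
    """Slide a window over tokens, yielding each departed token from histograms only."""
    window = deque(tokens[:width])
    for tok in tokens[width:]:
        before = Counter(window)
        window.popleft()
        window.append(tok)
        yield departed_token(before, tok, Counter(window))
```

Replaying a sequence this way returns its tokens in their original order, which is the sense in which order information survives even though each histogram on its own is permutation-invariant.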

2606.01348 2026-06-03 cs.CV

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

Shangpin Peng, Gengluo Li, Xingyu Wan, Chengquan Zhang, Hao Feng, Binghong Wu, Huawen Shen, Weinong Wang, Ziyi Cai, Zhuotao Tian, Han Hu, Can Ma, Yu Zhou

Affiliations: Large Language Model Department, Tencent; Institute of Information Engineering, Chinese Academy of Sciences; Shenzhen Loop Area Institute; Nankai University

AI summary: Proposes ChartArena, a bilingual benchmark covering 8 chart families and 3 visual scenarios; using human-agent collaborative annotation and a format-agnostic evaluation protocol, it systematically evaluates the chart parsing ability of 26 multimodal large language models and exposes the gaps and challenges facing frontier models.

Abstract

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on narrow chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed, while models produce outputs in incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families spanning both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios stay especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.

2606.01340 2026-06-03 cs.LG stat.ML

Sample Complexity and Decision-Theoretic Guarantees for Bayesian Model Averaging over Decision Trees with Catalan-Exponential Priors

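Livija Jakaite, Vitaly Schetinin

Affiliation: School of Computing and Engineering, University of Bedfordshire, Luton, UK

AI summary: For Bayesian decision trees with Dirichlet-Multinomial leaf models and a Catalan-exponential prior on tree size, establishes a complete non-asymptotic theory of the rational commitment threshold, answering when Bayesian model averaging weights carry enough epistemic information to justify committing to exploit the averaged distribution.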

Comments 22 pages, 3 figures, Submitted to the Journal of Machine Learning Research

2606.01269 2026-06-03 cs.AI

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

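Nishit Singh

Affiliation: Birla Institute of Technology and Science, Pilani

AI summary: Trained only on adjacent comparisons, Transformer models generalize to unseen distant pairs and form a one-dimensional ordinal geometry in which decision confidence varies monotonically with rank distance, resembling the symbolic distance effect.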

Comments 11 pages, 12 figures

Abstract

Transitive inference is the challenge of inferring that A < C from knowing only adjacent relations (A < B, B < C). It is solved by humans and animals not through logical chaining but via an analogue mental number line, whose signature is the symbolic distance effect: distant comparisons are easier than nearby ones. We ask whether Transformers acquire the same primitive, training small models exclusively on adjacent comparisons from a hidden total order and evaluating generalization to unseen distant pairs. We find that out-of-distribution generalization emerges alongside a striking geometric reorganization: entity embeddings collapse onto a one-dimensional manifold whose principal axis recovers the hidden rank order with near-perfect fidelity, and this structure is sensitive to optimization in ways that produce grokking-like transient dynamics. Critically, even when accuracy is at ceiling, decision confidence and geometric separation both scale monotonically with rank distance, directly mirroring the symbolic distance effect observed across decades of behavioural experiments on humans, primates, and rodents. We further show the same rank-aligned geometry in a pretrained large language model, where it tracks the topology of each ordinal relation: linear for sizes and digits, cyclic for months. These results ground a 50-year-old behavioural regularity in the geometry of learned representations, offering a mechanistic account of transitive inference that bridges cognitive science and modern neural networks.
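
The training and evaluation split described above (adjacent comparisons for training, unseen distant pairs for testing) can be sketched as a small data generator. This is an assumption-laden illustration rather than the paper's setup: the function name `make_splits`, the use of both orderings of each pair, and the binary label encoding are hypothetical.

```python
import random
from collections import defaultdict

def make_splits(n_items, seed=0):
    """Adjacent pairs for training; non-adjacent pairs grouped by rank distance for testing."""
    rng = random.Random(seed)
    items = list(range(n_items))
    rng.shuffle(items)  # hidden total order: items[k] has rank k
    rank = {item: k for k, item in enumerate(items)}

    train, test_by_distance = [], defaultdict(list)
    for a in range(n_items):
        for b in range(n_items):
            if a == b:
                continue
            dist = abs(rank[a] - rank[b])
            example = (a, b, int(rank[a] < rank[b]))  # label 1 means "a precedes b"
            if dist == 1:
                train.append(example)
            else:
                test_by_distance[dist].append(example)
    return rank, train, dict(test_by_distance)
```

Grouping the held-out pairs by rank distance makes it straightforward to plot accuracy or confidence against distance, which is how a symbolic distance effect would show up.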

2606.01241 2026-06-03 cs.RO

OneVLA: A Unified Framework for Embodied Tasks

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chenhao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, Wenbo Ding

Affiliations: Tsinghua University; Pengcheng Laboratory; Xiaomi EV; Institute of Automation, Chinese Academy of Sciences; Peking University; HKUST(GZ)

AI summary: Proposes OneVLA, a unified architecture with a unified action head and a progressive training strategy (including data construction and chain-of-thought fine-tuning) that achieves positive cross-task transfer between navigation and manipulation and reaches state-of-the-art performance.

Abstract

Navigation and manipulation are fundamental capabilities of embodied intelligence, enabling robots to interpret natural language commands and interact physically with their surroundings. However, current Vision-Language-Action (VLA) models remain constrained by task-specific architectures, specializing in either navigation or manipulation, which hinders the development of general-purpose robotic agents. To bridge this gap, we introduce OneVLA, a unified architecture that integrates these distinct tasks into a single, cohesive framework. Specifically, we design a unified action head capable of generating both navigation and manipulation actions without requiring task-specific variants. Furthermore, we propose a multi-stage progressive training strategy, incorporating curated data construction and Chain-of-Thought (CoT) fine-tuning, that facilitates strong positive transfer and mutual reinforcement between the two domains. Extensive experiments in both simulated and real-world environments demonstrate that OneVLA achieves state-of-the-art performance, significantly outperforming both specialized single-task and existing cross-task models. By unifying these core capabilities, OneVLA paves the way for truly general-purpose robotic systems. The model and source code will be publicly released.

AI and large models

Vision and robotics

Science and medicine

Geometry-Aware Tabular Diffusion

ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation

COD10K-C: Benchmarking Robustness of Camouflaged Object Detection Under Natural Image Corruptions

Graph Mamba Survival Analysis Based on Topology-Aware ordering

Testing the Test: Score-Direction Instability in Class-Split Anomaly Detection

Assessing Region-Level EEG Contributions to Cognitive Workload Prediction

Making Brain-Computer Interfaces More Secure

Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent

Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Forget Attention: Importance-Aware Attention Is All You Need

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

PHASOR: Phase-Anchored Universal Action Representations for Humanoid Embodiments

ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?

EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

Sample Complexity and Decision-Theoretic Guarantees for Bayesian Model Averaging over Decision Trees with Catalan-Exponential Priors

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

OneVLA: A Unified Framework for Embodied Tasks