arXivDaily arXiv daily academic digest, updated Monday to Friday
2605.14110 2026-05-15 cs.CV cs.RO

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

Sandro Papais, Lezhou Feng, Charles Cossette, Lingting Ge

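Affiliations University of Toronto; Zoox Inc

AI summary This paper proposes SToRe3D, a sparsity framework for efficient multi-view 3D object detection that targets the heavy computation and high inference latency of Vision Transformers (ViTs) when they process multiple views and large 3D regions. The method jointly selects 2D image tokens and 3D object queries and pairs this with a mechanism that stores and reactivates filtered features, so that compute is allocated to the most important content. Experiments show that SToRe3D substantially speeds up inference while preserving detection accuracy, making real-time large-scale 3D detection practical.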

Comments Accepted to CVPR 2026

Abstract

Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

2605.14108 2026-05-15 cs.CV cs.AI cs.LG

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

Nishi Doshi, Shrey Shah

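Affiliations University of Southern California

AI summary To address the shortage of diabetic retinopathy (DR) screening resources in rural areas, this study proposes a cascaded edge-cloud architecture that improves screening efficiency and reduces cloud compute load. The architecture has two tiers: Tier 1 runs a lightweight MobileNetV3-small model on a local device for binary triage, deciding whether a referral is needed; Tier 2 runs a RETFound-DINOv2 model in the cloud to grade severity on the referred images. Experiments on the APTOS dataset show that the method substantially reduces the number of cloud calls while maintaining high screening accuracy.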

Abstract

Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade and evaluate it on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance.

Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
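The routing logic of the two-tier cascade in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `run_cascade`, `edge_score`, `cloud_grade`, and the 0.30 threshold are hypothetical names and values, and the "images" are stand-in numbers rather than fundus photographs.

```python
from typing import Callable, Sequence


def run_cascade(
    images: Sequence[object],
    edge_score: Callable[[object], float],
    cloud_grade: Callable[[object], int],
    threshold: float,
) -> tuple[list[str], int]:
    """Return one label per image plus the number of cloud calls made."""
    labels: list[str] = []
    cloud_calls = 0
    for image in images:
        if edge_score(image) >= threshold:
            # Flagged as referable by the edge model: only now call the cloud model.
            cloud_calls += 1
            labels.append(f"Class {cloud_grade(image)}")
        else:
            # Non-referable: resolved locally as the merged Class 0-1 output.
            labels.append("Class 0-1")
    return labels, cloud_calls


# Toy stand-ins (assumptions): each "image" is a pre-computed edge probability.
toy_images = [0.05, 0.40, 0.10, 0.90]
labels, calls = run_cascade(
    toy_images,
    edge_score=lambda p: p,
    cloud_grade=lambda p: 4 if p > 0.8 else 2,
    threshold=0.30,
)
cloud_reduction = 1 - calls / len(toy_images)
```

Lowering the threshold raises Tier 1 sensitivity at the cost of more cloud calls, which is the trade-off the abstract reports as 49.52% of images forwarded at the validation-tuned operating point.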

2605.14106 2026-05-15 cs.RO

Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision

Anthony Bilic, Chen Chen, Ladislau Bölöni

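Affiliations Department of Computer Science, University of Central Florida

AI summary This paper studies whether behavior cloning can produce active perception in a structured object-finding task. Using a low-cost robot arm with a wrist-mounted low-resolution RGB camera, the model predicts joint commands directly from low-resolution images and, under closed-loop control, locates and grasps a partially visible plant. Experiments show that low-resolution egocentric vision is sufficient for the task and that predicting relative joint deltas outperforms predicting absolute positions, demonstrating that vision-based behavior cloning can yield reproducible active perception.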

Abstract

We investigate whether behavior cloning is sufficient to produce active perception in a structured object-finding task. A low-cost robot arm equipped with a wrist-mounted egocentric RGB camera must reposition to center a partially visible plant before triggering a grasp signal, requiring actions that improve future observations. The model predicts joint commands directly from low-resolution RGB images under closed-loop control. We show that low-resolution egocentric vision is sufficient for reliable task completion and that predicting relative joint deltas substantially outperforms absolute joint position prediction in our setting. These results demonstrate that visually grounded active perception can emerge from behavior cloning in a reproducible setting.

2605.14104 2026-05-15 cs.CV

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

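Affiliations Vanderbilt University, Tennessee, USA; Weill Cornell Medicine, New York, USA; Vanderbilt University Medical Center, Tennessee, USA

AI summary This study proposes DUET, a new dual-paradigm framework for predicting spatial transcriptomics data from tissue section images. DUET combines parametric prediction with memory-based retrieval, guided by cellular inductive priors, to infer gene expression more accurately. By introducing large-scale single-cell data as a molecular constraint and designing a lightweight adapter that dynamically adjusts the model's branch preference across spatial regions, DUET achieves state-of-the-art prediction performance on multiple public datasets.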

Abstract

Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at https://github.com/Junchao-Zhu/DUET

2605.14089 2026-05-15 cs.AI

SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

Mingda Zhang, Tiesunlong Shen, Haoran Luo, Wenjin Liu, Zikai Xiao, Erik Cambria, Xiaoying Tang

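Affiliations The Chinese University of Hong Kong, Shenzhen; National University of Singapore; Nanyang Technological University; Zhejiang University

AI summary SkillFlow is a flow-based framework that addresses key challenges in agentic orchestration, such as strategy collapse, opaque credit assignment, and unguided skill evolution. A trainable Supervisor interacts with a structured environment over multiple turns; a Tempered Trajectory Balance loss preserves diverse strategies and provides transparent credit assignment; and a recursive skill evolution mechanism autonomously decides when to create, prune, and improve skills. Experiments show that SkillFlow significantly outperforms existing methods on multiple tasks.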

Comments 49 pages, 5 figures, 6 tables

Abstract

In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework that takes a trainable Supervisor as the agent and a structured environment with dynamic skill library and frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss that samples trajectories proportional to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines when to evolve, what skills to create or prune, and where decision gaps lie -- closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision making tasks. Our code is available at https://anonymous.4open.science/r/SkillFlow-E850.

2605.14075 2026-05-15 cs.LG cs.CL

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

Cristian Hinostroza, Rodrigo Toro Icarte, Christ Devia, Andres Carvallo De Ferari, Eugenio Herrera-Berg, Denis Parra, Jorge F Silva

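Affiliations Pontificia Universidad Católica de Chile; National Center for Artificial Intelligence (CENIA); Universidad de Chile

AI summary This paper argues that assessing layer relevance in large language models should go beyond the conventional cosine similarity approach. It shows that cosine similarity does not accurately reflect the actual effect of removing a layer on model performance; theoretical analysis shows that a layer can be crucial to performance even when its cosine similarity score is extremely low. The authors therefore propose using the actual drop in model accuracy after removing a layer as a more reliable metric. Although computationally expensive, it guides pruning and lightweight model design more accurately and matters for building interpretable LLMs.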

Comments Published at ICLR 2026

Journal ref Proceedings of the International Conference on Learning Representations (ICLR), 2026

Abstract

Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. In addition, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.
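The gap between a cosine-based score and the accuracy drop from removing a layer can be illustrated with a toy example. The sketch below is not the paper's construction: it assumes the score is taken as 1 minus the cosine similarity between a layer's input and output (one common convention), and `toy_layer`, `readout`, and the four samples are made up for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_layer(h):
    # Hypothetical residual layer: copies the sign of a weak feature (dim 2)
    # into the dimension that the output head reads (dim 1).
    return [h[0], h[1] + (1.0 if h[2] > 0 else -1.0), h[2]]


def readout(h):
    # The downstream head looks only at dimension 1.
    return 1 if h[1] > 0 else 0


def accuracy(samples, use_layer):
    hits = 0
    for h, label in samples:
        out = toy_layer(h) if use_layer else h
        hits += readout(out) == label
    return hits / len(samples)


samples = [
    ([100.0, 0.0, 0.5], 1),
    ([100.0, 0.0, -0.5], 0),
    ([80.0, 0.0, 0.5], 1),
    ([120.0, 0.0, -0.5], 0),
]

# Cosine-based relevance: tiny, because dimension 0 dominates the norm.
cosine_scores = [1.0 - cosine(h, toy_layer(h)) for h, _ in samples]
# Ablation-based relevance: accuracy with the layer minus accuracy without it.
accuracy_drop = accuracy(samples, True) - accuracy(samples, False)
```

Here the layer barely rotates the hidden state, so the cosine score is close to zero, yet skipping the layer cuts accuracy from 1.0 to chance on this balanced toy set.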

2605.14074 2026-05-15 cs.LG

Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Mokshit Surana

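Affiliations University of Illinois Chicago

AI summary This study examines fairness in toxicity detection along three axes (ranking, calibration, and abstention) and compares several training methods and post-hoc mechanisms on them. It finds that conventional approaches such as Empirical Risk Minimization (ERM) are well calibrated in aggregate but show significant calibration disparity across identity subgroups, while training interventions can improve ranking performance yet widen the calibration-fairness gap. Post-hoc methods such as temperature scaling and confidence-based abstention inherit the failure modes of the training stage and can even introduce new unfairness. The paper stresses that genuine fairness requires a multi-axis evaluation; optimizing a single axis is not enough to ensure fair behavior in real-world use.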

Abstract

Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n = 1000$). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration ($0.013$) but is significantly miscalibrated across all identity subgroups ($+0.029$ to $+0.134$). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC $+0.06$ to $+0.12$) but worsens the calibration-fairness gap by up to $+0.232$. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE $0.118$). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.
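Finding (1) above, that good aggregate calibration can hide subgroup miscalibration, can be reproduced on toy data with a standard binned Expected Calibration Error. This is a minimal sketch under stated assumptions: the equal-width 10-bin ECE on the confidence of the predicted class is a common definition but not necessarily the paper's exact one, and the records are fabricated solely to show the effect.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            avg_acc = sum(a for _, a in bucket) / len(bucket)
            ece += len(bucket) / total * abs(avg_acc - avg_conf)
    return ece


# Toy records (assumption): (predicted toxicity probability, true label, subgroup).
records = (
    [(0.15, 0, "background")] * 10   # under-confident: always right at 0.85 confidence
    + [(0.85, 1, "identity")] * 7    # over-confident: right 70% of the time
    + [(0.85, 0, "identity")] * 3
)


def ece_for(group=None):
    rows = [r for r in records if group is None or r[2] == group]
    return expected_calibration_error([r[0] for r in rows], [r[1] for r in rows])


aggregate_ece = ece_for()
identity_ece = ece_for("identity")
background_ece = ece_for("background")
```

The two subgroups err in opposite directions, so their errors cancel in the aggregate: the pooled ECE is essentially zero while each subgroup sits at 0.15.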

2605.14073 2026-05-15 cs.LG cs.AI

AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification

Rayhaneh Shabani Nia, Ali Karkehabadi

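Affiliations University of California, Davis

AI summary This paper proposes AttnGen, an attention-guided training framework that aims to improve the interpretability of genomic sequence classification models. The method uses an attention mechanism to compute nucleotide-level importance scores and progressively suppresses low-contribution positions during training, so that the model focuses on informative regions and relies less on noisy sequence elements. Experiments show that AttnGen achieves better classification performance than a conventional convolutional neural network on a standard benchmark dataset, and perturbation analysis validates its importance scores, showing that the model depends heavily on a small subset of key positions.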

Comments Accepted at IEEE CCGE 2026

Abstract

Deep neural networks have achieved strong performance in genomic sequence classification; however, relating their predictions to biologically meaningful sequence patterns remains challenging. In this work, we present AttnGen, an attention-guided training framework that embeds interpretability directly into the optimization process. AttnGen computes nucleotide-level importance scores using an attention mechanism and progressively suppresses low-contribution positions during training. This encourages the model to focus its predictions on a compact set of informative regions while reducing reliance on noisy sequence elements. We evaluate AttnGen on the standardized demo_human_or_worm benchmark, a binary classification task over 200-nucleotide sequences. With moderate masking, AttnGen achieves a validation accuracy of 96.73%, outperforming a conventional CNN baseline with 95.83% accuracy, while also exhibiting faster convergence and improved training stability. To assess whether the learned importance scores reflect functionally relevant signal, we conduct perturbation-based analysis by removing high-saliency nucleotides. This causes accuracy to drop from 96.9% to near chance level on a 3,000-sequence evaluation set, indicating that the model relies on a relatively small subset of informative positions. Our analysis shows that masking 10-20% of positions provides the most favorable trade-off between predictive performance and interpretability. These results suggest that attention-guided masking not only improves classification performance but also reshapes how models distribute importance across sequence positions. Although this study focuses on short genomic sequences, the proposed approach may extend to more complex interpretable sequence modeling settings.
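Both masking operations mentioned in the abstract, suppressing low-contribution positions during training and removing high-saliency nucleotides for the perturbation test, reduce to ranking positions by score. The following is a simplified sketch, assuming hard masking with an "N" placeholder and hand-written scores; the function name is hypothetical, and AttnGen's actual suppression may be soft or scheduled differently.

```python
def mask_positions(sequence, scores, mask_fraction, highest=False, mask_token="N"):
    """Replace a fraction of positions with a mask token.

    highest=False masks the lowest-scoring positions (training-time suppression);
    highest=True masks the highest-scoring ones (perturbation analysis).
    """
    n_mask = int(len(sequence) * mask_fraction)
    # Stable sort: ties keep their left-to-right order.
    ranked = sorted(range(len(sequence)), key=lambda i: scores[i], reverse=highest)
    masked = list(sequence)
    for i in ranked[:n_mask]:
        masked[i] = mask_token
    return "".join(masked)


# Toy input (assumption): a 10-nucleotide sequence with made-up importance scores.
sequence = "ACGTACGTAC"
scores = [0.9, 0.1, 0.8, 0.05, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

suppressed = mask_positions(sequence, scores, mask_fraction=0.2)
perturbed = mask_positions(sequence, scores, mask_fraction=0.2, highest=True)
```

With a 20% budget, `suppressed` hides the two least important positions, while `perturbed` hides the two most important ones, which is the setting in which the abstract reports accuracy falling to near chance.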

2605.14071 2026-05-15 cs.CL

Distribution Corrected Offline Data Distillation for Large Language Models

Yumeng Zhang, Zhengbang Yang, Yevin Nikhel Goonatilake, Zhuangdi Zhu

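Affiliations George Mason University

AI summary This paper studies how to distill reasoning ability from large language models into smaller models effectively, especially in resource-constrained settings. To address the distribution shift that existing offline distillation methods face, the authors propose a distribution-corrected offline data distillation framework that adaptively emphasizes teacher supervision better aligned with the student model's own inference-time distribution, retaining the efficiency and supervision quality of offline data while reducing error accumulation during reasoning. Experiments show that the method significantly improves reasoning accuracy and stability on multiple mathematical reasoning benchmarks.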

Abstract

Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

2605.14069 2026-05-15 cs.LG

SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

Mohammad R. Rezaei, Tejas Balaji, Rahul G. Krishnan

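Affiliations Department of Computer Science; University of Toronto; Vector Institute

AI summary This paper proposes SurF, a generative model for forecasting multivariate irregular time series. Based on the Time Rescaling Theorem, the model establishes a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, allowing heterogeneous event-stream data to be modeled in a unified way. The study also introduces three efficient parameterizations of the cumulative intensity and a Transformer-based encoder for multi-dataset pretraining. Experiments show that SurF achieves better forecasting performance than existing methods on several real-world datasets, an initial step toward foundation models for asynchronous event streams.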

详情
英文摘要

Irregularly sampled multivariate event streams remain a stubbornly difficult modality for generative modeling: tokenization-based approaches break down when inter-event intervals vary by orders of magnitude, and neural temporal point processes are bottlenecked by window-level numerical quadrature. We propose (i) SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, enabling a single model to be trained across heterogeneous event-stream datasets; (ii) three efficient parameterizations of the cumulative intensity that scale to long sequences; and (iii) a Transformer-based encoder for multi-dataset pretraining. On six real-world benchmarks, SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest specialist on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets and beats every baseline on Amazon and Earthquake, an initial step toward foundation models over asynchronous event streams.
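The Time Rescaling Theorem behind this bijection states that if events follow a point process with cumulative intensity Λ(t), the gaps Λ(t_i) - Λ(t_{i-1}) are i.i.d. unit-rate exponential. The sketch below shows the round trip for a fixed, hand-picked intensity λ(t) = A + B·t whose cumulative intensity inverts in closed form. The constants and function names are assumptions for illustration; SurF learns its cumulative intensity rather than fixing it.

```python
import math
import random

A, B = 0.5, 0.2  # hypothetical intensity lambda(t) = A + B * t, with A > 0 and B > 0


def cumulative_intensity(t):
    # Lambda(t) = integral of lambda from 0 to t.
    return A * t + 0.5 * B * t * t


def inverse_cumulative_intensity(y):
    # Positive root of 0.5 * B * t^2 + A * t - y = 0.
    return (-A + math.sqrt(A * A + 2.0 * B * y)) / B


def noise_to_events(noise):
    """Generative direction: unit-rate exponential gaps -> event times."""
    events, total = [], 0.0
    for gap in noise:
        total += gap
        events.append(inverse_cumulative_intensity(total))
    return events


def events_to_noise(events):
    """Time rescaling: event times -> gaps of the cumulative intensity."""
    noise, previous = [], 0.0
    for t in events:
        current = cumulative_intensity(t)
        noise.append(current - previous)
        previous = current
    return noise


rng = random.Random(0)
noise = [rng.expovariate(1.0) for _ in range(5)]
events = noise_to_events(noise)
recovered = events_to_noise(events)
```

Because the map is invertible, `recovered` matches `noise` up to floating-point error, which is what lets a model be trained by likelihood on the noise side and sampled on the event side.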

2605.14067 2026-05-15 cs.LG

Comparative Evaluation of Machine Learning Approaches for Minority-Class Financial Distress Prediction Under Class Imbalance Constraints

Karan Sehgal, Khawar Naveed Bhatti

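Affiliations Kent Business School; University of Kent

AI summary This paper studies how to predict minority-class financially distressed firms effectively under class imbalance, comparing classical statistical methods, ensemble learning, and neural network models. Using structured preprocessing and SMOTE oversampling, the experiments find that gradient-boosting methods are more sensitive to the minority class under severe imbalance. The study emphasizes reproducible, interpretable, and governance-oriented machine learning evaluation, offering a practical engineering approach to financial risk analysis.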

Comments 16 pages, 4 figures, preprint under review. Applied machine learning evaluation involving imbalance-aware financial distress prediction, ensemble learning, SMOTE, and SHAP explainability analysis

Abstract

Financial distress prediction remains a significant challenge in enterprise risk analysis due to the highly imbalanced nature of real-world financial datasets, where bankrupt or distressed firms typically constitute only a small minority of observations. This paper presents a comparative evaluation of classical statistical methods, ensemble learning approaches, and exploratory neural models for minority-class financial distress prediction under class imbalance constraints. The study incorporates structured preprocessing; imbalance mitigation using the Synthetic Minority Oversampling Technique (SMOTE); comparative evaluation across ensemble learning architectures including XGBoost, CatBoost, LightGBM, and Random Forest; and explainability analysis using SHAP-based feature attribution methods. Experimental evaluation demonstrates that gradient-boosting approaches achieved improved minority-class sensitivity relative to baseline statistical classifiers under severe imbalance conditions. The workflow additionally emphasises reproducibility, interpretability, auditability, and governance-oriented machine learning evaluation within enterprise financial risk environments. The work is positioned as an applied engineering evaluation intended to support reproducible and interpretable machine learning workflows for financial distress prediction under severe class imbalance constraints.
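SMOTE, the oversampling step named in the abstract, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below implements only that core interpolation in plain Python. The function name, the two-feature toy data, and the choice of k are assumptions; the study presumably relies on a library implementation rather than code like this.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class (assumption): two made-up financial ratios per distressed firm.
distressed = [(0.10, 1.2), (0.15, 1.0), (0.30, 0.8), (0.05, 1.5)]
new_points = smote_like_oversample(distressed, n_new=6)
```

Each synthetic point stays inside the region spanned by existing minority samples. Oversampling is normally applied to the training split only, so that evaluation still reflects the original class imbalance.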

2605.14062 2026-05-15 cs.AI cs.CL

Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

Anjir Ahmed Chowdhury, Syed Zawad, Feng Yan

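Affiliations Department of Computer Science, University of Houston; IBM Research

AI summary This paper proposes MSIFR, a lightweight framework that detects and terminates low-quality generation trajectories while they are still being generated, reducing wasted computation in LLM synthetic data generation. By generating in stages and applying fast rule-based validation, the method identifies arithmetic errors, hallucinations, and formatting problems early and rejects invalid samples before completion. Experiments show that MSIFR substantially reduces token consumption without additional training or architectural changes, while preserving or improving the quality of the generated data.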

Comments 17 pages, 4 figures, 7 tables

Abstract

While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
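The control flow of staged generation with in-flight rejection can be sketched as a loop that stops at the first failed validator. This is an illustrative outline only: the stage functions return canned text with made-up token counts in place of real LLM calls, and every name here (`generate_with_in_flight_rejection`, `arithmetic_ok`, `make_stages`) is hypothetical rather than taken from MSIFR.

```python
def generate_with_in_flight_rejection(stages, validators):
    """Run stages in order and stop at the first stage whose validator fails.

    stages: zero-argument callables, each returning (text, tokens_used).
    validators: callables text -> bool, one per stage.
    Returns (accepted, texts_so_far, tokens_spent).
    """
    texts, tokens_spent = [], 0
    for produce, is_valid in zip(stages, validators):
        text, used = produce()
        texts.append(text)
        tokens_spent += used
        if not is_valid(text):
            return False, texts, tokens_spent  # reject before later stages run
    return True, texts, tokens_spent


def arithmetic_ok(text):
    # Rule-based check for lines of the form "a + b = c".
    left, right = text.split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(right)


def make_stages(first_line, answer):
    # Canned stand-ins for three generation stages (token counts are made up).
    return [
        lambda: (first_line, 20),
        lambda: ("...long intermediate reasoning...", 200),
        lambda: (f"Answer: {answer}", 80),
    ]


validators = [arithmetic_ok, lambda text: True, lambda text: text.startswith("Answer:")]

bad = generate_with_in_flight_rejection(make_stages("12 + 30 = 45", 45), validators)
good = generate_with_in_flight_rejection(make_stages("12 + 30 = 42", 42), validators)
```

The faulty sample is dropped after 20 tokens instead of the 300 a generate-then-filter pipeline would spend, which mirrors the abstract's point that savings grow when rejection happens earlier.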

2605.14061 2026-05-15 cs.AI cs.LG

MathAtlas: A Benchmark for Autoformalization in the Wild

Nilay Patel, Noah Arias, Davit Babayan, Victoria Cochran, Timothy Libman, Hafsah Mahmood, Liam McCarty, Soli Munoz, Laurel Willey, Jeffrey Flanigan

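Affiliations University of California, Santa Cruz

AI summary Current autoformalization benchmarks focus mainly on competition or undergraduate mathematics, while graduate and research-level mathematics still lacks such resources. This paper introduces MathAtlas, the first large-scale autoformalization benchmark for graduate-level mathematics, containing about 52,000 theorems, definitions, exercises, examples, and proofs extracted from 103 textbooks, together with a mathematical dependency graph of about 178,000 relations. Experiments show that the benchmark is high quality but extremely challenging: state-of-the-art models perform poorly on formalizing both theorems and definitions, and performance degrades substantially as dependency depth increases.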

Comments In submission at NeurIPS 2026

Abstract

Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in-the-wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find that the performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

2605.14055 2026-05-15 cs.CL cs.AI

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

Anjir Ahmed Chowdhury, Syed Zawad, Xiaolong Ma, Xu Dong, Feng Yan

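Affiliations IBM Research; Argonne National Laboratory

AI summary This paper proposes PEML, a parameter-efficient multi-task learning method that improves fine-tuning efficiency for large language models in multi-task settings by jointly optimizing continuous prompts and model weights. Unlike existing methods such as LoRA and Prefix Tuning, PEML uses neural architecture engineering to optimize the prompt structure and low-rank adaptation to adjust model parameters, adapting more fully to multi-task requirements. Experiments show notable gains on several benchmarks, with average accuracy improving by up to 6.67% and peak gains on individual tasks of up to 10.75%.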

Comments 26 pages, 8 figures, 18 Tables

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) to various tasks. Recently, there has been an increasing demand for fine-tuning a single LLM for multiple tasks, because it requires less fine-tuning data overall thanks to the common features shared among tasks. More importantly, LLMs are resource demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly fewer resources than deploying an individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variants focus on aligning the model itself for tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning only adopts a simple architecture to optimize prompts, which limits its adaptation capabilities for multi-task settings. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompt optimization and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation of model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against the state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results show an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.

2605.14053 2026-05-15 cs.CL cs.AI

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

Ignacio Sastre, Guillermo Moncecchi, Aiala Rosá

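Affiliations Instituto de Computación, Facultad de Ingeniería, Universidad de la República

AI summary To address hallucinations and erroneous reasoning by large language models in question answering, this study proposes Derivation Prompting, a new prompting method based on logical derivation that improves the generation step of the Retrieval-Augmented Generation (RAG) framework. The method systematically applies predefined rules to derive conclusions from initial hypotheses, building an interpretable derivation tree that adds control over the generation process. In a specific case study, it significantly reduced unacceptable answers compared with traditional RAG and long-context methods.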

Journal ref Advances in Artificial Intelligence - IBERAMIA 2024, LNCS 15277, pp. 412-423, Springer (2025)

Abstract

The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.

2605.14051 2026-05-15 cs.AI

SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

Yusuke Ozaki, Dhaval Patel

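Affiliations University at Albany; IBM; Kwansei Gakuin University

AI summary This paper proposes SPIN, a planning wrapper that addresses the structurally invalid or unnecessarily long plans that LLM planners often produce for industrial tasks. SPIN combines validated directed acyclic graph (DAG) planning with prefix-based execution control: a strict DAG contract and repair prompting yield executable plans before execution, and DAG prefixes are then evaluated incrementally so that a run can stop early. Experiments show that SPIN reduces the number of executed tasks and tool calls on multiple benchmarks while improving task accomplishment and related metrics.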

Comments 31 pages, 10 figures

Abstract

Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to brittle failures and avoidable tool and API cost. We propose SPIN, a planning wrapper that combines validated Directed Acyclic Graph (DAG) planning with prefix-based execution control. SPIN enforces a strict DAG contract through _validate_plan_text and repair prompting, producing executable plans before downstream execution, and then evaluates DAG prefixes incrementally to stop when the current prefix is sufficient to answer the query. On AssetOpsBench, across 261 scenarios, SPIN reduces executed tasks from 1061 to 623 and improves Accomplished from 0.638 to 0.706, while reducing tool calls from 11.81 to 6.82 per run. On MCP Bench, the same wrapper improves planning, grounding, and dependency-related scores for both GPT OSS1 and Llama 4 Maverick.
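The two ideas in the abstract above, validating a plan as a DAG and executing it prefix by prefix until the result suffices, can be sketched with Python's standard `graphlib` module. This is a rough illustration rather than SPIN's implementation: `validate_plan`, `execute_prefixes`, the dictionary plan format, and the sufficiency check are all hypothetical, and the real system validates plan text and uses repair prompting, which is not modelled here.

```python
from graphlib import CycleError, TopologicalSorter


def validate_plan(plan):
    """plan maps each task to the list of tasks it depends on.

    Returns a valid execution order, or raises ValueError if the plan
    references unknown tasks or contains a cycle.
    """
    for task, deps in plan.items():
        for dep in deps:
            if dep not in plan:
                raise ValueError(f"task {task!r} depends on unknown task {dep!r}")
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError as exc:
        raise ValueError(f"plan is not acyclic: {exc}") from exc


def execute_prefixes(order, run_task, is_sufficient):
    """Run tasks in order and stop once the executed prefix answers the query."""
    results = {}
    for task in order:
        results[task] = run_task(task)
        if is_sufficient(results):
            break
    return results


# Toy plan (assumption): a four-step chain where the second step already suffices.
plan = {"fetch": [], "filter": ["fetch"], "summarize": ["filter"], "report": ["summarize"]}
order = validate_plan(plan)
results = execute_prefixes(
    order,
    run_task=lambda task: f"{task} done",
    is_sufficient=lambda done: "filter" in done,
)
```

In this toy run only two of the four planned tasks execute, the same kind of saving the abstract reports as a drop from 1061 to 623 executed tasks on AssetOpsBench.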

2605.14049 2026-05-15 cs.AI cs.CL cs.CY

Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning

Olivia Peiyu Wang, Leilani H. Gilpin

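Affiliations University of California, Santa Cruz

AI summary As large language models are increasingly adopted in legal practice, they bring both promise and risk. This paper discusses a systematic problem in current AI legal reasoning: models often reach conclusions based on assumptions that go beyond the source text, and so lack logical rigor. It proposes a neuro-symbolic approach that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning more reliable and trustworthy and to reduce the burden of manual verification while meeting the accountability that legal practice demands.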

Comments 2 pages abstract accepted by Bloomberg LSLLAI 2026 Symposium

Abstract

The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

2605.14047 2026-05-15 cs.CV cs.AR

Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

Kieran Carrigg, Sigur de Vries, Amirhossein Sadough, Marcel van Gerven

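Affiliations Department of Machine Learning and Neural Computing; Donders Institute for Brain, Cognition, and Behaviour

AI summary This paper studies how to deploy Vision Transformers (ViTs) efficiently on edge devices. To address the computational complexity and global reduction bottleneck caused by layer normalization, it proposes a hardware-aware framework based on genetic programming. The method evolves layer-specific scalar functions that replace conventional normalization layers, adapting the model without retraining from scratch. Experiments show that the method preserves image classification accuracy while substantially reducing compute and memory overhead, offering an effective route to deploying ViTs on edge accelerators.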

Comments 18 pages, 7 figures

Abstract

Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not optimally fit to all layers' behaviour and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach eliminates the need to retrain models from scratch entirely. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.

2605.14045 2026-05-15 cs.CV

PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

Wei Dong, Han Zhou, Terry Ji, Guanhua Zhao, Shahab Asoodeh, Yulun Zhang, Guangtao Zhai, Jun Chen, Xiaohong Liu

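Affiliations * McMaster University; Shanghai Jiao Tong University

AI summary This paper proposes PVRF, a unified framework for removing complex and varied adverse weather in real-world scenes. The method combines a soft weather perception module based on frozen vision-language models with velocity-constrained rectified-flow refinement: attribute-modulated normalization and weather-weighted adapters produce an initial restoration estimate, and a terminal-consistent residual rectified flow then improves restoration quality and stability. Experiments show that PVRF outperforms existing methods in fidelity and perceptual quality and generalizes well across datasets.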

Comments 10 pages, 9 figures, and 4 tables

Abstract

Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision-language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at https://github.com/dongw22/PVRF.

2605.14040 2026-05-15 cs.CL

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Shan Yang

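Affiliations * Independent Researcher

AI summary This study examines the multimodal physics reasoning evaluation pipeline and exposes three previously undetected problems: train-eval contamination, translation drift, and multiple-choice saturation. It builds a rigorously audited multimodal corpus and evaluation set, and reports substantial gains on physics reasoning tasks for a model trained on them. By releasing four artifacts, including a reference recipe based on Qwen3-VL, the work provides more reliable benchmarks and training resources for visual physics reasoning.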

Comments 10 pages, 3 tables. Project page: https://shanyang.me/physics-r1-page/

Abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
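
A rough sketch of the first audit stage mentioned above, 5-gram Jaccard overlap, helps show why a paraphrased problem can pass with zero hits and only surface in the later embedding and LLM-judge stages. Whitespace tokenization, lowercasing, and the function names are assumptions made here for illustration; the abstract does not specify the paper's tokenization or thresholds.

```python
def word_ngrams(text: str, n: int = 5) -> set:
    """Set of word n-grams using naive whitespace tokenization (an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ngram_jaccard(x: str, y: str, n: int = 5) -> float:
    """Jaccard overlap between the word n-gram sets of two texts."""
    return jaccard(word_ngrams(x, n), word_ngrams(y, n))
```

An exact copy of a problem statement scores 1.0, while a reworded version that shares no five consecutive words scores 0.0 even if it poses the same physics question, so a lexical stage alone cannot catch it.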

2605.14037 2026-05-15 cs.LG cs.CL

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

Gergely Szilvasy, Manuel Faysse, Maria Lomeli, Matthijs Douze, Pierre-Emmanuel Mazaré, Loïc Cabannes, Wen-tau Yih, Hervé Jégou

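Affiliations * Meta FAIR

AI summary As modern language models process ever-longer sequences, the memory footprint and bandwidth of the key-value (KV) cache become a bottleneck for efficient generation. This paper proposes Self-Pruned Key-Value Attention (SP-KV), which predicts the future utility of each key-value pair and dynamically decides which pairs to keep in the global cache, thereby shrinking the cache. The method substantially improves memory efficiency and decoding speed without significantly hurting model performance, and it reveals layer- and head-level sparsity patterns that offer guidance for designing hybrid local-global attention architectures.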

Comments 28 pages, 8 figures, 8 tables

Abstract

Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written in the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3$ to $10\times$, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
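
The gating rule described above (recent keys always visible through a local window, older keys visible only if their predicted utility exceeds a threshold) can be written as a boolean attention mask. This is only an illustrative sketch: the utility scores here are given as plain numbers rather than produced by a learned predictor, and the function name, window size, and threshold are hypothetical.

```python
import numpy as np


def visible_keys(utility, window: int, threshold: float) -> np.ndarray:
    """Boolean mask[q, k]: may query position q attend to key position k?

    A key is visible if it is causal (k <= q) and either lies inside the
    local window or has utility above the threshold (kept in the global cache).
    """
    utility = np.asarray(utility, dtype=float)
    n = utility.shape[0]
    q = np.arange(n)[:, None]  # query positions as a column
    k = np.arange(n)[None, :]  # key positions as a row
    causal = k <= q
    local = (q - k) < window
    kept = (utility > threshold)[None, :]
    return causal & (local | kept)
```

For utilities [0.9, 0.1, 0.2, 0.8, 0.3] with a window of 2 and a threshold of 0.5, the last query sees keys 3 and 4 through the local window and key 0 through the global cache, while keys 1 and 2 are never written to the long-term cache.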

2605.14036 2026-05-15 cs.AI cs.CC cs.CL cs.LG

Enhanced and Efficient Reasoning in Large Learning Models

Leslie G. Valiant

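Affiliations * John A. Paulson School of Engineering and Applied Sciences

AI summary This paper proposes an efficient, principled reasoning method aimed at making the content generated by large language models more trustworthy. A preprocessing stage recodes the data into a "Unary Relational Integracode" that is more explicit about the relationships among objects, and a standard machine learning pipeline is then trained on it, yielding more reliable reasoning while retaining the existing software and hardware base. The method applies beyond natural language processing to vision and actions and, building on the theory of Robust Logic, supports sounder logical reasoning within a single call of the model as well as across multiple calls.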

Abstract

In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.

2605.14034 2026-05-15 cs.AI cs.CL cs.CY

From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents

Jinxian Qu, Qingqing Gu, Teng Chen, Luo Ji

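Affiliations * Geely AI Lab

AI summary This paper studies how LLM-based agents can be better aligned with human social values. It proposes a value-based framework that uses GraphRAG to convert ethical principles into value-based instructions, steering the agent to behave as expected in a given conversation context. Expected behaviors are defined using Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. Experiments show that the method significantly outperforms prompt-based baselines on the DAILYDILEMMAS benchmark and provides a basis for the emergence of self-emotion in AI systems.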

Comments Accepted by CogSci 2026

Abstract

Wide application of LLM-based agents requires strong alignment with human social values. However, current work still exhibits deficiencies in self-cognition, dilemma decision-making, and self-emotion. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steers the agent to behave as expected by retrieving the suitable instruction for a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors using two well-known theories, Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. On the DAILYDILEMMAS benchmark, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.

2605.14033 2026-05-15 cs.AI cs.LG

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

David N. Olivieri, Roque J. Hernández

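Affiliations * Universidade de Vigo, Department of Computer Science (LSI), Spain

AI summary This paper studies how AI agents facing a scientific theory shift can detect whether an existing representational framework still applies to a new regime or must be extended. The authors propose a finite sheaf-theoretic framework that identifies theory-shift candidates through transport and obstruction, where obstruction is measured by residual fit, overlap incompatibility, constraint violation, and related terms. The method is validated on a controlled benchmark, where it separates deformation of a theory from extension, and it gives AI agents a finite diagnostic for deciding whether extension is needed when representational transport fails.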

Abstract

Scientific theory shift in AI agents requires more than fitting equations to data. An artificial scientific agent must detect whether an existing representational framework remains transportable into a new regime, or whether its language has become locally-to-globally obstructed and must be extended. This paper develops a finite sheaf-theoretic framework for detecting theory-shift candidates through transport and obstruction. Contexts are organized as a local-to-global structure in which source, overlap, target, and validation charts are fitted, restricted, and tested for gluing. Obstruction measures failure of coherence through residual fit, overlap incompatibility, constraint violation, limiting-relation failure, and representational cost. We evaluate the framework on a controlled transition-card benchmark designed to separate deformation within a source language from extension of that language. The main result is direct obstruction ranking: the intended deformation or extension is usually the lowest-obstruction candidate, and transition type is separated in the benchmark. A constellation kernel over the same signatures is included only as a secondary representational-similarity probe. The aim is not to reconstruct historical paradigm shifts or solve open-ended autonomous theory invention, but to isolate a finite diagnostic subproblem for AI agents: detecting when representational transport fails and extension becomes the coherent next move.

2605.14031 2026-05-15 cs.SD cs.CV cs.LG

Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn

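Affiliations * University of Massachusetts Amherst

AI summary This paper studies how effective masked autoencoders (MAE) are for fine-grained bioacoustic species classification when data is limited. Through systematic experiments on the iNatSounds dataset, it analyzes the effects of pretraining data scale, domain specificity, data curation, and transfer strategy. Models pretrained on diverse general audio perform best on bioacoustic tasks, whereas additional domain-specific pretraining and data filtering bring limited benefit at small scale and can even degrade performance. The results indicate that in moderate-sized fine-grained bioacoustic settings, pretraining data scale matters more for performance than objective design.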

Comments Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures

Abstract

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

2605.14026 2026-05-15 cs.LG cs.AI

R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

Sanghyeob Song, Donghyeok Lee, Jinsik Kim, Sungroh Yoon

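Affiliations * Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Electrical and Computer Engineering, Seoul National University

AI summary In data-scarce reinforcement learning settings such as real-world robotics, intensive data reuse improves efficiency but tends to cause overfitting. To address representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) ratios, this paper proposes R2R2, a method that obtains robust representations through redundancy reduction. A theoretical analysis shows that standard zero-centering conflicts with SPL's spectral properties, which motivates a non-centered regularization objective. Experiments show that R2R2 effectively mitigates overfitting and substantially improves performance on several continuous control tasks.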

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026). This is the camera-ready version

Abstract

For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: https://github.com/songsang7/R2R2

2605.14004 2026-05-15 cs.AI

Conditional Attribute Estimation with Autoregressive Sequence Models

Erica Stutz, Giacomo Marino, Daniella Meeker, Qiao Liu, Andrew J. Loza

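Affiliations * Department of Biomedical Informatics and Data Science; Yale University; Department of Biostatistics; Department of Pediatrics

AI summary This paper proposes Conditional Attribute Transformers, a new method for jointly estimating the next-token probability and the value of an attribute conditional on each candidate next token. In a single forward pass and without modifying the input sequence, it enables three capabilities: per-token credit assignment, counterfactual analysis, and steerable generation. The method performs strongly on sparse-reward tasks, improves next-token prediction at sufficiently large model sizes, estimates attribute probabilities far faster than sampling, and can guide generation on a range of language tasks.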

Abstract

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local patterns during training, underfitting of global structure, and requires significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next-token probability and the value of an attribute conditional on each potential next token selection. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: (1) per-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute's value; (2) counterfactual analysis, by quantifying attribute differences conditional on alternative next token choices; (3) steerable generation, by decoding sequences based on a combination of next-token and attribute likelihoods. Our approach achieves state of the art performance on sparse reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks.
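
To make capabilities (2) and (3) above concrete, here is a minimal sketch of one decoding step that combines a next-token likelihood with a per-token attribute likelihood. The weighted sum of log-probabilities, the weight parameter, and the function names are assumptions chosen for illustration; the abstract states only that decoding uses "a combination" of the two likelihoods, and the probabilities here are hand-written rather than model outputs.

```python
import math
from typing import Dict

EPS = 1e-12  # guards against log(0); an implementation convenience


def steer_step(
    token_probs: Dict[str, float],
    attr_probs: Dict[str, float],
    weight: float = 1.0,
) -> str:
    """Pick the token maximizing log p(token) + weight * log p(attribute | token)."""
    scores = {
        tok: math.log(max(p, EPS)) + weight * math.log(max(attr_probs[tok], EPS))
        for tok, p in token_probs.items()
    }
    return max(scores, key=scores.get)


def counterfactual_gap(attr_probs: Dict[str, float], chosen: str, alternative: str) -> float:
    """Attribute difference had `alternative` been selected instead of `chosen`."""
    return attr_probs[alternative] - attr_probs[chosen]
```

With weight 0 the step reduces to greedy next-token decoding; raising the weight shifts the choice toward tokens whose continuation is more likely to carry the target attribute.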

2605.14002 2026-05-15 cs.AI

PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts

Yifei Zhu

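Affiliations * The University of Hong Kong

AI summary This paper presents PolitNuggets, a multilingual benchmark for evaluating how well agents discover and synthesize long-tail political facts in open environments. The benchmark constructs biographies of 400 global political elites covering more than 10,000 political facts, and it introduces an optimized multi-agent system and the FactNet protocol to standardize evaluation along three dimensions: discovery, accuracy, and efficiency. The study finds that current models struggle with fine-grained details and vary considerably in efficiency, and it relates agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.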

Comments 24 pages, 7 figures, accepted at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract

Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long-context question answering into open-ended exploration. Yet real-world use requires models to discover and synthesize "long-tail" facts from dispersed sources, a capability that remains under-evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis via constructing political biographies for 400 global elites, covering over 10,000 political facts. We standardize evaluation with an optimized multi-agent system and propose FactNet, an evidence-conditional protocol that scores discovery, fine-grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine-grained details and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.

2605.13999 2026-05-15 cs.LG

Support Before Frequency in Discrete Diffusion

Adrian Müller, Antoine Gonon, Zebang Shen, Ya-Ping Hsieh, Niao He

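Affiliations * ETH Zürich; EPFL; Academia Sinica

AI summary This paper studies the learning mechanism of discrete diffusion models for language modeling and finds that their denoising objectives induce a hierarchy from coarse support information to finer frequency information. Through theoretical analysis and experiments, the authors show that in the small-noise regime each single-token reverse edit decomposes into a leading scale, determined by whether the edit moves toward the data support (for example, grammatically valid sentences), and a finer coefficient, which determines relative probabilities within the same scale. The model learns the support structure of the data first and the data frequencies later, and this separation behaves differently under uniform and absorbing diffusion.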

Abstract

Discrete diffusion models are increasingly competitive for language modeling, yet it remains unclear how their denoising objectives organize learning. Although these objectives target the full data distribution, we show that the exact reverse process induces a hierarchy between coarse support information and finer frequency information. For uniform and absorbing (a.k.a. masking) diffusion, we prove that, in the small-noise regime of the final denoising steps, each single-token reverse edit decomposes into a leading scale, determined by whether it moves toward the data support (e.g., grammatically valid sentences), and a finer coefficient, determining relative probabilities within the same scale. Thus, recovering validity structure only requires learning the correct order of magnitude of reverse probabilities, whereas recovering data frequencies requires coefficient-level estimation. The separation is mechanism-dependent: uniform diffusion exhibits a trichotomy into validity-improving, validity-preserving, and validity-worsening edits, while absorbing diffusion places its leading-order mass on validity-improving moves. Experiments on a masked language diffusion model and synthetic regular-language tasks support these predictions: support-localization emerges earlier than within-support frequency ranking, and the contrast between uniform and absorbing diffusion matches the predicted rate separation. Together, our results suggest that discrete diffusion models learn data support before data frequencies.

2605.13997 2026-05-15 cs.LG cs.AI cs.CL

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

Tao Zhong, Dongzhe Zheng, Christine Allen-Blanchette

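Affiliations * Princeton University

AI summary This paper studies learning-free compression of sparse Mixture-of-Experts (MoE) layers and points out a structural blind spot in existing expert-merging methods: three pairwise-compatible experts can form a cycle that cannot be merged jointly. The authors formalize this obstruction as the harmonic kernel of a simplicial Laplacian and propose HodgeCover, which selects experts by covering critical edges and triangles, along with a hybrid variant that adds weight pruning for further compression. Experiments show that under aggressive expert reduction HodgeCover matches state-of-the-art learning-free baselines on expert reduction, leads under aggressive hybrid compression, and balances compression efficiency and quality.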

Comments 34 pages, 8 figures

Abstract

Sparse Mixture-of-Experts (MoE) layers route tokens through a handful of experts, and learning-free compression of these layers reduces inference cost without retraining. A subtle obstruction blocks every existing compressor in this family: three experts can each be pairwise compatible yet form an irreducible cycle when merged together, so any score that ranks experts on pairwise signals is structurally blind to which triples are jointly mergeable. We show the obstruction is a precise mathematical object, the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers, and whose faces carry triplet barriers; Hodge-decomposing the edge-barrier signal isolates the kernel exactly. We turn the diagnostic into a selection objective: HodgeCover greedily covers the harmonic-critical edges and triplet-critical triangles, and a hybrid variant of HodgeCover pairs it with off-the-shelf weight pruning on survivors. On three open-weight Sparse MoE backbones under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on the expert-reduction axis, leads on the aggressive-compression frontier of the hybrid axis, and uniquely balances retained mass across all four Hodge components. These results show that exposing the harmonic kernel of a learned MoE structure changes which compressor wins at the regime that matters most.
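
The obstruction described above can be illustrated on the smallest possible case. The sketch below builds an unweighted simplicial (Hodge) 1-Laplacian from boundary matrices and counts its zero eigenvalues, which gives the dimension of the harmonic kernel. Treating experts as vertices, pairwise-compatible pairs as edges, and jointly mergeable triples as filled triangles is a simplification made here for illustration; the paper's Laplacian carries KL and triplet barrier weights, which this sketch omits, and all function names are hypothetical.

```python
import numpy as np


def boundary_matrices(n_vertices, edges, triangles):
    """Oriented boundary matrices B1 (vertices x edges) and B2 (edges x triangles).

    Edges are tuples (u, v) with u < v; triangles are tuples (a, b, c) with a < b < c.
    """
    edge_index = {edge: i for i, edge in enumerate(edges)}
    b1 = np.zeros((n_vertices, len(edges)))
    for j, (u, v) in enumerate(edges):
        b1[u, j] = -1.0  # edge leaves u
        b1[v, j] = 1.0   # edge enters v
    b2 = np.zeros((len(edges), len(triangles)))
    for j, (a, b, c) in enumerate(triangles):
        # Boundary of (a, b, c) is (a, b) + (b, c) - (a, c).
        b2[edge_index[(a, b)], j] = 1.0
        b2[edge_index[(b, c)], j] = 1.0
        b2[edge_index[(a, c)], j] = -1.0
    return b1, b2


def harmonic_dimension(n_vertices, edges, triangles, tol=1e-9):
    """Dimension of the kernel of L1 = B1^T B1 + B2 B2^T (number of unfilled cycles)."""
    b1, b2 = boundary_matrices(n_vertices, edges, triangles)
    laplacian = b1.T @ b1 + b2 @ b2.T
    eigenvalues = np.linalg.eigvalsh(laplacian)
    return int(np.sum(eigenvalues < tol))
```

Three mutually connected vertices with no filled triangle give a harmonic dimension of 1, while filling the triangle drops it to 0. A score built only from the three edges cannot tell these two cases apart, which is the blind spot the abstract attributes to pairwise rankings.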