arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10500 2026-05-12 cs.AI

SkillEvolver: Skill Learning as a Meta-Skill

Genrui Zhang, Erle Zhu, Jinfeng Zhou, Caiyan Jia, Hongning Wang

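Affiliations: Tsinghua University; Beijing Jiaotong University

AI summary: Most agent skills today are generated statically and, once created, cannot improve from real use. This paper proposes SkillEvolver, a lightweight online skill-learning method in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills so that they keep evolving. The method learns the skill's prose and code rather than model weights, so the resulting skills drop into any agent without retraining. On 83 SkillsBench tasks, SkillEvolver reaches 56.8% accuracy, versus 43.6% for curated human skills and 29.9% for the no-skill baseline.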

Abstract

Agent skills today are static artifacts: authored once -- by human curation or one-shot generation from parametric knowledge -- and then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill's prose and code, not model weights, so the resulting artifact drops into any agent without retraining; the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI agent. Unlike trace distillation, the meta-skill refines only after deploying the learnt skill, so that the learning signal comes from failures another agent encounters while using it -- not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On 83 SkillsBench tasks spanning 15+ domains, SkillEvolver reaches 56.8% accuracy versus 43.6% for curated human skills and 29.9% for the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also raises mean speedup from 1.16 to 1.51.

2605.10498 2026-05-12 cs.CV cs.AI stat.ML

Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data

Heegeon Yoon, Heeyoung Kim

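Affiliations: Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: This work targets highly imbalanced multi-modal data and proposes a framework that handles long-tailed recognition and multi-modal fusion simultaneously. It extends multi-expert architectures to the multi-modal setting, uses modality-specific networks to estimate how informative each modality is, and applies the resulting confidence-guided weights to modulate the fusion dynamically. Experiments on benchmark and real-world datasets show that it outperforms existing methods on long-tailed, class-imbalanced classification.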

Abstract

Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.

2605.10494 2026-05-12 cs.SD cs.AI

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

Marius Miron, David Robinson, Masato Hagiwara, Titouan Parcollet, Jules Cauzinille, Gagan Narula, Milad Alizadeh, Ellen Gilsenan-McMahon, Sara Keen, Emmanuel Chemla, Benjamin Hoffman, Maddie Cusimano, Diane Kim, Felix Effenberger, Jane K. Lawton, Aza Raskin, Olivier Pietquin, Matthieu Geist

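Affiliations: Earth Species Project

AI summary: This paper studies how the probing strategy affects the transfer of audio representations to bioacoustic tasks, and finds that multi-layer attentive probes make better use of temporal information and improve downstream performance. It compares linear and attention probes, in last-layer and multi-layer configurations, on two bioacoustic benchmarks (BEANs and BirdSet). Multi-layer probing outperforms the conventional last-layer probe across all tested models, and attention probes outperform linear probes for transformer models. The work offers new methods and insights for evaluating and improving the transferability of audio representations.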

Abstract

Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.
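
The probe design contrasted in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact probe architecture, so everything below is an assumption made for illustration: a single learned query per layer for attention pooling over time, and softmax weights for mixing layers. The names `attentive_pool` and `multi_layer_probe` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(frames, query):
    """Pool a sequence of frame embeddings using one query vector (assumed design)."""
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]

def multi_layer_probe(layers, queries, layer_logits):
    """Attention-pool each encoder layer over time, then mix layers with softmax weights."""
    pooled = [attentive_pool(frames, q) for frames, q in zip(layers, queries)]
    mix = softmax(layer_logits)
    dim = len(pooled[0])
    return [sum(m * p[d] for m, p in zip(mix, pooled)) for d in range(dim)]

# Two hypothetical encoder layers, three time frames each, 2-dimensional features.
layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
]
queries = [[0.0, 0.0], [0.0, 0.0]]  # zero queries give uniform attention over time
layer_logits = [0.0, 0.0]           # equal logits give an equal mix of layers
features = multi_layer_probe(layers, queries, layer_logits)
```

In a real probe the queries and layer logits would be trained together with a classifier on top of `features`; a last-layer linear probe corresponds to mean-pooling only the final layer.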

2605.10488 2026-05-12 cs.CL cs.AI

DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Haoyu Huang, Jiaxin Bai, Shujie Liu, Yang Wei, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Yangqiu Song

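Affiliations: HKUST; HKBU; Microsoft Research Asia Hong Kong

AI summary: DeepRefine is an LLM-based reasoning model that improves the quality of agent-compiled knowledge bases so that they better support open-ended downstream tasks. It interacts with the knowledge base over multiple turns, performs abductive diagnosis over the interaction history, localizes likely defects, and executes targeted refinement actions that update the knowledge base incrementally. To optimize the refinement policy without gold references, DeepRefine introduces a Gain-Beyond-Draft (GBD) reward and is trained end-to-end with reinforcement learning; experiments show consistent downstream gains over strong baselines.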

Abstract

Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by incompleteness, incorrectness, and redundancy, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguity or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present DeepRefine, a general LLM-based reasoning model for agent-compiled knowledge refinement that uses user queries to improve the quality of any pre-constructed knowledge base and make it more suitable for downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base, conducts abductive diagnosis over the interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize the refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.

2605.10485 2026-05-12 cs.RO

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

Hao Wang, Xiaobao Wei, Jingyang He, Chengyu Bai, Chun-Kai Fan, Jiajun Cao, Jintao Chen, Ying Li, Shanyu Rong, Ming Lu, Xiaozhu Ju, Jian Tang, Shanghang Zhang

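Affiliations: Peking University; Beijing Innovation Center of Humanoid Robotics

AI summary: VEGA is a framework for improving the spatial awareness of vision-language-action (VLA) models, whose visual backbones lack explicit 3D geometric supervision. It aligns the output of the VLA's visual encoder with features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision, giving a more accurate and interpretable alignment target. Because alignment happens at the visual encoder output, it avoids entanglement with linguistic semantics; the alignment module is discarded at inference time and adds no computational overhead. Experiments on simulation and real-world tasks show that VEGA outperforms existing implicit spatial grounding methods.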

Abstract

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and the projector is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmarks and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state of the art among implicit spatial grounding methods for VLA models.
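
A cosine similarity alignment loss of the kind the abstract mentions can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the linear form of the projector, the per-token averaging, and the `align_weight` hyperparameter are assumptions, and all function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(feature, weight):
    """Assumed linear projector; weight is a list of rows (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]

def alignment_loss(encoder_tokens, teacher_tokens, weight):
    """Mean of (1 - cosine similarity) between projected encoder tokens and teacher tokens."""
    losses = [
        1.0 - cosine_similarity(project(e, weight), t)
        for e, t in zip(encoder_tokens, teacher_tokens)
    ]
    return sum(losses) / len(losses)

def total_loss(action_loss, align_loss, align_weight=0.5):
    """Action objective plus weighted alignment term; align_weight is a hypothetical value."""
    return action_loss + align_weight * align_loss

identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_tokens = [[1.0, 0.0], [0.0, 1.0]]
aligned_teacher = [[2.0, 0.0], [0.0, 3.0]]     # same directions, different norms
misaligned_teacher = [[0.0, 1.0], [0.0, 1.0]]  # first token is orthogonal

loss_aligned = alignment_loss(encoder_tokens, aligned_teacher, identity)
loss_misaligned = alignment_loss(encoder_tokens, misaligned_teacher, identity)
```

Because the loss depends only on direction, rescaling the teacher features leaves it unchanged. At inference time only the encoder is kept, which matches the abstract's statement that the projector is discarded.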

2605.10484 2026-05-12 cs.CV cs.RO

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora

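Affiliations: Autonomous Multi-Robots Lab, Department of Cognitive Robotics, School of Mechanical Engineering, Delft University of Technology, 2628 CD, Delft, Netherlands; Mobile Robotics Lab, School of Computation, Information and Technology, Technical University of Munich

AI summary: This paper proposes OpenSGA, an efficient 3D scene graph alignment framework for object-level relocalization and map fusion when robots revisit scenes in the open world. It fuses vision-language, textual, and geometric features with spatial context, and aligns scene graphs accurately even under large coordinate discrepancies. The authors also build ScanNet-SG, a large-scale dataset with over 700k samples and a wide range of object categories, which supports systematic training and evaluation of scene graph alignment. Experiments show that the method achieves the best overall performance on both the frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks.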

Comments 13 figures

Abstract

Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.

2605.10480 2026-05-12 cs.AI

ASIA: an Autonomous System Identification Agent

Dario Piga, Marco Forgione

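Affiliations: Dalle Molle Institute for Artificial Intelligence (IDSIA), SUPSI

AI summary: This paper presents ASIA, an autonomous system identification agent that automates the tedious choices of model class, training algorithm, and hyperparameters. A large language model acts as an autonomous coding agent and, given only a natural-language description of the problem, closes the loop from hypothesis to implementation to evaluation without human intervention. The authors evaluate ASIA on two system identification benchmarks, analyse its search behaviour and the model structures it discovers, and discuss the potential of the approach along with its current limitations: implicit test leakage, reduced methodological transparency, and reproducibility concerns.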

Abstract

Over the years, research in system identification has provided a rich set of methods for learning dynamical models, together with well-established theoretical guarantees. In practice, however, the choice of model class, training algorithm, and hyperparameter tuning is still largely left to empirical trial-and-error, requiring substantial expert time and domain experience. Motivated by recent advances in agentic artificial intelligence, we present ASIA, a framework that delegates this iterative search to a large language model acting as an autonomous coding agent. Building on existing agentic platforms, ASIA closes the loop between hypothesis, implementation, and evaluation without human intervention, requiring only a plain-English description of the identification problem. We conduct an empirical study of ASIA on two system identification benchmarks and analyse the agent's search behaviour, the architectures and training strategies it discovers, and the quality of the resulting models. We also discuss the potential of the approach and its current limitations, including implicit test leakage, reduced methodological transparency, and reproducibility concerns.

2605.10474 2026-05-12 cs.LG cs.AI

Formally Verifying Analog Neural Networks Under Process Variations Using Polynomial Zonotopes

Yasmine Abu-Haeyeh, Tobias Ladner, Matthias Althoff, Lars Hedrich

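Affiliations: Goethe University Frankfurt, Germany; Technical University of Munich, Germany

AI summary: This paper addresses the verification of analog neural networks under manufacturing process variations. It models the variation-dependent behavior of the neuron circuit with a polynomial-based model and applies reachability analysis with polynomial zonotopes to formally verify the circuit-level model. The approach avoids conventional, time-consuming Monte Carlo simulation; in the experiments it reduces verification time from days to seconds while enclosing 99% of the variation samples.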

Abstract

Analog neural networks are gaining attention due to their efficiency in terms of power consumption and processing speed. However, since analog neural networks are implemented as physical circuits, they are highly sensitive to manufacturing process variations, which can cause large deviations from the nominal model. We present a polynomial-based model that approximates the performance of the neuron circuit under process variations. Then, we formally verify the behavior of the circuit-level model using reachability analysis with polynomial zonotopes, thus avoiding conventional, time-consuming Monte Carlo simulations. We evaluate our proposed verification approach on three different datasets, verifying both fully-connected and convolutional analog neural networks. Our experimental results confirm the effectiveness of our verification approach, which reduces the verification time from days to seconds while enclosing 99% of the variation samples.

2605.10470 2026-05-12 cs.CV

Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

Jinyi Luo, Minghao Liu, Yifan Li, Zejia Fan, Jiaying Liu

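Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity. This paper gives the first theoretical modeling of multi-modal SR, showing that existing methods use modalities sub-optimally, and proposes the Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR), which is built on dynamic modality fusion. A spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism provide more precise risk control and optimize each modality's contribution. Experiments show marked gains in generalization and semantic consistency.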

Abstract

Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight inspires us to propose the novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR), which employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a novel spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that our M$^3$ESR significantly boosts generalization and semantic consistency, which confirms the superiority of our approach.

2605.10468 2026-05-12 cs.LG

Can Muon Fine-tune Adam-Pretrained Models?

Xingyu Qu, Peigeng Huang, Samuel Horvath

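Affiliations: Nanjing University

AI summary: This paper studies the performance degradation that occurs when an Adam-pretrained model is fine-tuned with the Muon optimizer instead. Through controlled experiments, the authors relate the degradation to a mismatch between the implicit biases of the two optimizers and show that constraining update strength, for example with LoRA, mitigates it. The results shed light on how optimizer mismatch affects fine-tuning and how adjusting the update strategy can reduce its negative effects.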

Abstract

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
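
For readers unfamiliar with how LoRA constrains updates, the standard LoRA parameterization is sketched below: the pretrained weight W stays frozen and only a low-rank product is trained, so the effective weight is W + (alpha / r) * B A. This shows generic LoRA only, not the paper's experimental setup; the matrices and the `alpha` value are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, with B of shape out x r and A of shape r x in."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # the update has rank at most r, whatever the optimizer does
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy example)
b = [[1.0], [2.0]]            # trainable, out x r with r = 1
a = [[0.5, -0.5]]             # trainable, r x in
w_eff = lora_effective_weight(w, b, a, alpha=2.0)
```

Whichever optimizer updates `a` and `b`, the change to the pretrained weight is confined to a rank-r subspace, which is one way to read the abstract's notion of a constrained update.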

2605.10466 2026-05-12 cs.LG

Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

Haoren Xu, Guanhua Fang

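Affiliations: Fudan University

AI summary: This paper examines in-context learning (ICL) and repetitive generation in large language models and identifies a single mechanism behind both. When the inputs satisfy certain statistical conditions, the self-attention output is approximately a linear readout of the input covariance matrix, which explains why models extract statistical information and forget token-level detail in long contexts. The mechanism can implement one step of population gradient descent and also gives a structural account of repetition, unifying two seemingly unrelated phenomena under a covariance-readout principle.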

Abstract

Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V \Sigma \Theta_K^{\top} \Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
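
The link between attention and gradient descent in consequence (i) can be made concrete with a simpler, well-known identity. The sketch below uses linear (softmax-free) attention on a finite sample, which is an assumption made to keep the algebra exact; it does not reproduce the paper's softmax, long-context result. One gradient step from zero weights on the squared loss yields the same prediction as an attention-style sum with keys x_i, values y_i, and the query point as the query.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with w1 = w0 - lr * grad, where w0 = 0 and loss = (1/2n) * sum (w.x_i - y_i)^2."""
    n = len(xs)
    dim = len(xs[0])
    # At w0 = 0 the gradient is -(1/n) * sum_i y_i x_i, so w1 = (lr/n) * sum_i y_i x_i.
    w1 = [lr / n * sum(y * x[d] for x, y in zip(xs, ys)) for d in range(dim)]
    return dot(w1, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Linear attention readout: sum over context of value y_i times score x_i . x_query."""
    n = len(xs)
    return lr / n * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

xs = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]  # context inputs (toy data)
ys = [1.0, 2.0, -1.0]                       # context targets
x_query = [2.0, 1.0]
gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

Both functions compute (lr/n) * sum_i y_i (x_i . x_query), just in a different order of summation, so the two predictions agree up to floating-point rounding.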

2605.10464 2026-05-12 cs.CV

Automated Detection of Abnormalities in Zebrafish Development

Sarath Sivaprasad, Hui-Po Wang, Anna-Lisa Jäckel, Jonas Baumann, Carole Baumann, Jennifer Herrmann, Mario Fritz

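Affiliations: CISPA Helmholtz Center for Information Security; Helmholtz Institute for Pharmaceutical Research Saarland

AI summary: This paper proposes a method for automatically detecting developmental abnormalities in zebrafish embryos, replacing costly manual inspection. The authors build a large-scale dataset of high-resolution microscopic image sequences covering both control conditions and compound exposure, with fine-grained temporal annotations. They also introduce a transformer-based model that integrates spatiotemporal features to predict abnormalities at early stages, reaching 98% accuracy in fertility classification (egg viability) and 92% in toxicity assessment.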

Abstract

Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets. To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to a compound (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images). Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results demonstrate the model's effectiveness: it achieves 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis.

2605.10462 2026-05-12 cs.CL cs.LO

Coherency through formalisations of Structured Natural Language, A case study on FRETish

Joost J. Joosten, Marina López Chamosa, Sofía Santiago Fernández

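Affiliations: Universitat de Barcelona; Centre de Recerca Matemàtica; Formal Vindications S.L.

AI summary: This paper proposes a new formalisation guideline, "Coherency through Formalisations", which holds that the different levels of description used when turning natural-language requirements into a formal language should follow roughly the same logical structure. Taking NASA's FRET tool and its controlled natural language FRETish as a case study, the authors propose a new automated translation of FRETish into MTL and prove its equivalence to the original translation using model checking. Their statistics seem to favor the new translation, and the process revealed inconsistencies that the authors present and discuss.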

Abstract

Formalisation is the process of writing system requirements in a formal language. These requirements mostly originate in Natural Language. In the field of Formal Methods, formalisation is often identified as one of the most delicate and complicated steps in the verification process. Not infrequently, formalisation tools and environments use several levels of requirement description: Natural Language, Technical Language, Diagram Representations and Formal Language, to mention a few. In the literature, there are various maxims and principles of good practice to guide the process of requirement formalisation. In this paper we propose a new guideline: Coherency through Formalisations. The guideline states that the different levels of formalisation mentioned above should roughly follow the same logical structure. The principle seems particularly relevant in the setting where LLMs are prompted to perform reasoning tasks that can be checked by formal tools, using Structured Natural Language as an intermediate layer bridging both paradigms. In the light of coherency, we analyze NASA's Formal Requirement Elicitation Tool FRET and propose an alternative automated translation of the Controlled Natural Language FRETish to the formal language of MTL. We compare our translation to the original translation and prove equivalence using model checking. We also compute some statistics, which seem to favor the new translation. As expected, the translation process yielded interesting reflections and revealed inconsistencies, which we present and discuss.

2605.10458 2026-05-12 cs.LG cond-mat.mtrl-sci physics.chem-ph

QT-Net: Rethinking Evaluation of AI Models in Atomic Chemical Space

Pablo Martínez Crespo, Stefano Ribes, Martin Rahm, Richard Beckmann, Robert S. Jordan, Marisa Gliege, Santiago Miret, Vijay Kris Narasimhan, Rocío Mercado

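Affiliations: Department of Computer Science and Engineering; Chalmers University of Technology; University of Gothenburg; Department of Chemistry and Chemical Engineering; Technology Research; Intel Corporation; Chief Technology Office; EMD Electronics; Lila Sciences; M Ventures

AI summary: This study addresses how AI models should be evaluated at the atomic scale. It proposes a held-out evaluation protocol based on SOAP descriptors to measure more accurately how well machine learning models generalize when predicting atomic properties such as partial charges and multipoles. Using rigorous cross-validation and statistical testing, the authors compare E(3)-equivariant and non-equivariant models and, building on the results, introduce QT-Net, a rotationally augmented, non-equivariant graph neural network. QT-Net can infer atomic properties for QM9 molecules outside its training set, and these inferred properties can improve downstream molecular property prediction, offering new inductive biases for atomic-scale molecular machine learning.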

Abstract

Atomic properties such as partial charges or multipoles encode chemically meaningful information that can inform downstream molecular property prediction, but their evaluation as machine learning targets has been complicated by the absence of a principled out-of-distribution evaluation protocol at the atomic level. In this work, we propose a held-out evaluation protocol that clusters atomic environments by SOAP descriptors and computes metrics accounting only for cluster labels unseen during training. Following this procedure, we use 5$\times$5 cross-validation and Tukey's HSD to run a statistically rigorous comparison of E(3)-equivariant against non-equivariant, rotationally augmented models for predicting electron populations and multipoles of H, C, N, and O atoms. Building on our results, we introduce the Quantum Topological Neural Network (QT-Net), a rotationally augmented, non-equivariant graph neural network. We show that QT-Net can be used to infer properties of atoms in molecules from QM9 outside our training set, and that these inferred properties can yield improvement when used as input features for downstream molecular property prediction. To further validate the framework, molecular dipole moments computed from QT-Net's per-atom outputs recover the ground-truth values reported in QM9. We release all code and data, including a JAX implementation of QT-Net, to support the broader use of learned QTA properties as inductive biases for atomic-scale molecular machine learning.
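
The dipole validation mentioned in this abstract rests on a standard relation: for a neutral molecule partitioned into atoms, the molecular dipole is the sum of a charge-transfer term and the atomic dipoles, mu = sum_i (q_i r_i + mu_i). The abstract does not state which formula the authors use, so the sketch below shows only this textbook relation; the function names, the toy geometry, and the choice of atomic units are assumptions.

```python
import math

def dipole_from_atoms(charges, positions, atomic_dipoles=None):
    """Return mu = sum_i (q_i * r_i + mu_i) as a 3-vector (origin-independent if neutral)."""
    mu = [0.0, 0.0, 0.0]
    for i, (q, r) in enumerate(zip(charges, positions)):
        for d in range(3):
            mu[d] += q * r[d]                 # charge-transfer contribution
            if atomic_dipoles is not None:
                mu[d] += atomic_dipoles[i][d]  # intra-atomic polarization contribution
    return mu

def magnitude(vector):
    return math.sqrt(sum(c * c for c in vector))

# Toy diatomic: net charges of +0.5 and -0.5 separated by 2.0 length units along z.
charges = [0.5, -0.5]
positions = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
mu_point_charges = dipole_from_atoms(charges, positions)

# Hypothetical atomic dipoles that partly cancel the charge-transfer term.
atomic_dipoles = [[0.0, 0.0, 0.25], [0.0, 0.0, 0.25]]
mu_with_atomic = dipole_from_atoms(charges, positions, atomic_dipoles)
```

Comparing the magnitude of such a vector with a reference dipole moment is one way per-atom predictions can be checked against molecule-level data.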

2605.10456 2026-05-12 cs.RO

Learning Point Cloud Geometry as a Statistical Manifold: Theory and Practice

Jinwoo Lee, Jiwoo Kim, Woojae Shin, Giseop Kim, Hyondong Oh

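Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Daegu Gyeongbuk Institute of Science and Technology (DGIST)

AI summary: This paper studies how to learn geometry from sparse, non-uniform LiDAR point clouds and proposes a mathematical formulation based on a statistical manifold. The core idea is to model the local geometry of each point as a Gaussian distribution, which yields a statistical manifold representation. On this basis the authors design Point-to-Ellipsoid (POLI), a self-supervised estimator that predicts per-point Gaussian geometry from the point cloud without labeled data, giving robust geometry estimates and consistent improvements across several robotic perception tasks.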

Abstract

Point clouds are a fundamental representation for robotic perception tasks such as localization, mapping, and object pose estimation. However, LiDAR-acquired point clouds are inherently sparse and non-uniform, providing incomplete observations of the underlying scene geometry. This makes reliable geometric reasoning challenging and degrades downstream perception performance. Existing approaches attempt to compensate for these limitations by estimating local geometry, but often rely on hand-crafted statistics or end-to-end supervised learning, which can suffer from limited scalability or require large amounts of accurately labeled data. To address these challenges, we explicitly model point cloud geometry under a principled mathematical formulation. We represent local geometry as a statistical manifold induced by a family of Gaussian distributions, where each point is associated with a Gaussian capturing its local geometric structure. Based on this formulation, we introduce Point-to-Ellipsoid (POLI), a deep neural estimator that predicts per-point Gaussian geometry. POLI learns a mapping from point cloud observations to their underlying geometry in a self-supervised manner, removing the need for labeled data while preserving strong geometric inductive biases. The resulting representation integrates seamlessly into existing robotic perception pipelines without architectural modifications. Extensive experiments show that POLI enables accurate and robust geometry estimation and consistently improves performance across diverse robotic perception tasks.

2605.10455 2026-05-12 cs.LG

AxiomOcean: Forecasting the Three-Dimensional Structure of the Upper Ocean

Sensen Wu, Yifan Chen, Guantao Pu, Xiaoyao Sun, Yijun Chen, Jin Qi, Ming Kong, Keyi Yang, Lichen Xu, Wenguan Wang, Xiaofeng Li, Zhenhong Du

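Affiliations: School of Earth Sciences & Zhejiang Key Laboratory of Geographic Information Science, Zhejiang University; State Key Laboratory of Ocean Sensing, Zhejiang University; The State Key Lab of Brain-Machine Intelligence, Zhejiang University; Key Laboratory of Ocean Circulation and Waves, Institute of Oceanology, Chinese Academy of Sciences

AI summary: AxiomOcean is a global AI ocean forecasting model designed to improve forecasts of the three-dimensional structure of the upper ocean. It uses a fully three-dimensional encoder-backbone-decoder architecture that explicitly represents vertical hierarchy and cross-layer dependence within the water column and, combined with surface atmospheric forcing, jointly predicts temperature, salinity, and three-dimensional currents. In 10-day forecasts it outperforms an advanced AI comparison model, reducing day-1 RMSE by roughly 20 to 35% while maintaining higher anomaly correlation. It also better preserves eddy kinetic energy and temperature and salinity variance, improving both the accuracy and the physical consistency of the forecasts.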

Abstract

Short-term ocean forecast skill depends strongly on the three-dimensional structure of the upper ocean, which governs stratification, subsurface heat storage, and the response of the ocean to atmospheric forcing. However, AI ocean forecasting models often fail to preserve this vertical structure, resulting in over-smoothed subsurface features and weak physical consistency under strong forcing. Here, we present AxiomOcean, a global AI ocean forecasting model that explicitly represents vertical hierarchy and cross-layer dependence within the water column. By combining a fully three-dimensional encoder-backbone-decoder architecture with surface atmospheric forcing, AxiomOcean jointly predicts upper-ocean temperature, salinity, and three-dimensional currents at global 1/12° resolution down to 643 m depth. In 10-day forecasts, AxiomOcean outperforms an advanced AI comparison model across variables and lead times, reducing day-1 RMSE by approximately 20 to 35% while maintaining higher anomaly correlation. The gain is not achieved through excessive smoothing: AxiomOcean better preserves eddy kinetic energy and temperature and salinity variance. Its advantage also extends through the water column and remains evident across the equatorial Pacific, Kuroshio Extension, and Southern Ocean, yielding a more realistic reconstruction of upper-ocean heat content. These results show that explicitly preserving upper-ocean three-dimensional structure can improve both forecast accuracy and physical fidelity in AI ocean prediction.
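
The two skill metrics named in this abstract, RMSE and anomaly correlation, have standard definitions that can be sketched in a few lines. The abstract does not say which variant of the anomaly correlation coefficient is used (for example centered or uncentered, area-weighted or not), so the uncentered, unweighted form below is an assumption, and the sample values are invented for illustration.

```python
import math

def rmse(forecast, truth):
    """Root-mean-square error over paired grid values."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth))

def anomaly_correlation(forecast, truth, climatology):
    """Uncentered anomaly correlation: correlate departures from climatology."""
    fa = [f - c for f, c in zip(forecast, climatology)]
    ta = [t - c for t, c in zip(truth, climatology)]
    numerator = sum(a * b for a, b in zip(fa, ta))
    denominator = math.sqrt(sum(a * a for a in fa) * sum(b * b for b in ta))
    return numerator / denominator

def relative_reduction(baseline_error, model_error):
    """Fractional error reduction of a model relative to a baseline."""
    return (baseline_error - model_error) / baseline_error

truth = [1.0, 2.0, 3.0, 4.0]        # hypothetical verifying values
forecast = [1.0, 2.0, 3.0, 5.0]     # hypothetical forecast values
climatology = [2.0, 2.0, 2.0, 2.0]  # hypothetical climatological mean

error = rmse(forecast, truth)
acc = anomaly_correlation(forecast, truth, climatology)
```

A statement such as "reducing RMSE by approximately 20 to 35%" corresponds to `relative_reduction` values between 0.20 and 0.35 against the comparison model.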

2605.10453 2026-05-12 cs.LG cs.CL

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin

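Affiliations: Nebius

AI summary: This paper proposes SlimSpec, a low-rank parameterization of the draft model's LM-head for accelerating speculative decoding. It compresses the drafter's inner representation rather than the output vocabulary, which eases the computational bottleneck while preserving full vocabulary support. Across several target models and benchmarks, SlimSpec achieves 4-5× acceleration over the standard LM-head architecture and surpasses existing methods by up to 8-9% in end-to-end speedup. It needs only minimal changes to training and inference pipelines and applies to a variety of draft LM-head architectures.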

Abstract

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-head still performs projection to a large vocabulary, becoming one of the major computational bottlenecks. In prior work this issue has been predominantly addressed via static or dynamic vocabulary truncation. While they mitigate the bottleneck, these methods introduce extra complexity, such as special vocabulary curation, sophisticated inference-time logic, or modifications of the training setup. In this paper, we propose SlimSpec, a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than the output, preserving full vocabulary support. We evaluate our method with the EAGLE-3 drafter across three target models and diverse benchmarks in both latency- and throughput-bound inference regimes. SlimSpec achieves 4-5× acceleration over the standard LM-head architecture while maintaining a competitive acceptance length, surpassing existing methods by up to 8-9% in end-to-end speedup. Our method requires minimal adjustments to training and inference pipelines. Combined with the aforementioned speedup improvements, this makes SlimSpec a strong alternative across a wide variety of draft LM-head architectures.
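
To see why a low-rank LM-head can reduce cost without truncating the vocabulary, consider a generic two-factor head: the hidden state is first projected to a small rank r and then expanded to the full vocabulary. The sketch below counts multiply-accumulate operations per token for both designs. The factorization shape and the sizes used (hidden 4096, vocabulary 128000, rank 256) are assumptions for illustration, not figures from the paper, and operation counts do not translate directly into wall-clock speedup.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def full_head_macs(hidden, vocab):
    """Multiply-accumulates for a dense hidden -> vocab projection."""
    return hidden * vocab

def low_rank_head_macs(hidden, vocab, rank):
    """Multiply-accumulates for hidden -> rank followed by rank -> vocab."""
    return rank * hidden + vocab * rank

def low_rank_logits(down, up, h):
    """down: rank x hidden, up: vocab x rank; every vocabulary entry still gets a logit."""
    return matvec(up, matvec(down, h))

full_cost = full_head_macs(4096, 128000)               # 524288000
low_rank_cost = low_rank_head_macs(4096, 128000, 256)  # 33816576

# Tiny example: hidden size 3, rank 2, vocabulary of 4 tokens.
down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
logits = low_rank_logits(down, up, [2.0, 3.0, 5.0])
```

The output of `low_rank_logits` has one entry per vocabulary token, which is the sense in which this kind of head keeps full vocabulary support, in contrast to vocabulary truncation.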

2605.10451 2026-05-12 cs.LG cs.NA math.FA math.NA

Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs

Xuxiang Zhao, Angelica I. Aviles-Rivero

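Affiliations * Qiuzhen College, Tsinghua University; YMSC, Tsinghua University

AI summary Conventional spectral neural operators for partial differential equation (PDE) learning rely on fixed basis functions and struggle to capture spatial heterogeneity and multiscale dynamics. To address this, the study proposes ABLE, an adaptive basis learning framework. ABLE learns a data-dependent spectral representation and constructs a spatially adaptive Parseval frame, so that the operator acts in a lifted spectral space while preserving invertibility and $O(N\log N)$ complexity. Experiments show that ABLE improves accuracy on several benchmarks, most notably in settings with sharp gradients and multiscale behavior, and that it can serve as a modular component that strengthens existing neural operator architectures.

Comments 26 pages, 4 figures

Details
Abstract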

Spectral neural operators achieve strong performance for PDE learning, but rely on fixed global bases that limit their ability to represent spatially heterogeneous and multiscale dynamics. We propose Adaptive Basis Learning (ABLE), a framework that learns data-dependent spectral representations instead of relying on predefined bases. ABLE constructs a spatially adaptive Parseval frame via a learned ancillary density, enabling the operator to act in a lifted spectral space while preserving invertibility and maintaining $O(N\log N)$ complexity through FFT-based implementation. This shifts the source of expressivity from spectral coefficients to the representation itself, allowing the model to capture localized structures and non-translation-invariant interactions more efficiently. ABLE integrates seamlessly into existing neural operator architectures as a drop-in replacement for spectral layers. Across a range of benchmarks ABLE improves accuracy over strong baselines, with the largest gains in regimes characterized by sharp gradients and multiscale behavior. Moreover, augmenting existing models (e.g., U-FNO, HPM) with ABLE further enhances their performance, demonstrating its role as a general and complementary spectral refinement. Our results highlight that the data-driven choice of representation, rather than operator complexity alone, is a key bottleneck in neural operator design. By learning the basis itself, ABLE provides a principled and efficient framework for improving spectral methods in PDE learning.

2605.10449 2026-05-12 cs.CV

Automated high-frequency quantification of fish communities and biomass using computer vision

Kota Ishikawa, Takuma Masui, Keita Koeda, Rickdane Gomez, Lucas Yutaka Kimura, Michio Kondoh

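Affiliations * Graduate School of Life Sciences, Tohoku University; Advanced Institute for Marine Ecosystem Change (WPI-AIMEC), Tohoku University; Graduate School of Science and Engineering, University of the Ryukyus; Faculty of Science, University of the Ryukyus

AI summary This study proposes an automated computer-vision method for high-frequency quantification of underwater fish community structure and biomass. It combines deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species, abundance, and biomass from video captured with a stereo camera system. Applied to a reef fish community over a 20-day monitoring period, the method captures dynamic changes in species richness, abundance, and biomass, and the study validates its usefulness for non-invasive, continuous monitoring.

Comments 21 pages, 3 figures, supplementary information under Ancillary files

Details
Abstract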

Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.

2605.10448 2026-05-12 cs.AI

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Shanshan Gao, Liyi Zhou

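Affiliations * The University of Sydney

AI summary This paper examines how reliable scores are in interactive-agent benchmarks, noting that current benchmarks often rely on surface-level signals rather than the agent's actual action path, which can make scores inaccurate. The authors propose an outcome evidence reporting layer that requires no changes to tasks, agents, or evaluators: it specifies the evidence required for verification, labels the evidence status of each run, and reports evidence-supported score bounds that reflect uncertainty. Experiments on several public benchmarks show that the approach separates distinct failure modes and makes evaluation more transparent and trustworthy.

Details
Abstract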

Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface-level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence-supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
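The three-label scheme suggests a simple way to turn labels into score bounds. The sketch below assumes the lower bound counts only Evidence Pass runs and the upper bound also counts every Unknown run as a possible success. The abstract does not spell out the exact formula, so this rule, the label strings, and the function name are illustrative assumptions.

```python
from collections import Counter

# Hypothetical label strings standing in for Evidence Pass / Evidence Fail / Unknown.
VALID_LABELS = {"pass", "fail", "unknown"}


def evidence_bounds(labels):
    """Return (lower, upper) success-rate bounds from per-run evidence labels.

    lower: fraction of runs with verified success.
    upper: fraction of runs that are verified successes or still unresolved.
    """
    if not labels:
        raise ValueError("need at least one run")
    unexpected = set(labels) - VALID_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    counts = Counter(labels)
    total = len(labels)
    lower = counts["pass"] / total
    upper = (counts["pass"] + counts["unknown"]) / total
    return lower, upper


# Ten runs: six verified successes, three verified failures, one unresolved.
runs = ["pass"] * 6 + ["fail"] * 3 + ["unknown"]
bounds = evidence_bounds(runs)
```

The width of the interval equals the share of Unknown runs, so a benchmark with weak outcome checks reports a visibly wide range instead of a single point estimate.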

2605.10445 2026-05-12 cs.CV

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Zijun Shen, Sihan Yang, Ruichuan An, Ziyu Guo, Hao Liang, Ming Lu, Renrui Zhang, Wentao Zhang

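Affiliations * Peking University; Nanjing University; CUHK; Zhongguancun Academy

AI summary This paper proposes Sync-R1, an end-to-end reinforcement learning framework that bridges personalized understanding and generation through joint optimization. It introduces Sync-GRPO and Dynamic Group Scaling (DGS) to strengthen cross-task synergy and improve training efficiency, and builds UnifyBench++, a dataset closer to real-world scenarios. Experiments show that Sync-R1 performs strongly in cross-task reasoning and personalized generation without requiring complex cold-start procedures.

Details
Abstract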

Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.

2605.10439 2026-05-12 cs.CV

Filtering Memorization from Parameter-Space in Diffusion Models

Yu Zhe, Yang Jiayan, Wei Junhao, Yu-Lin Tsai, Wang Chen

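Affiliations * RIKEN AIP; Institute of Science Tokyo; University of California, Berkeley; Zhejiang University

AI summary This paper studies how Low-Rank Adaptation (LoRA) modules in diffusion models can memorize training images, so that generated content leaks copyrighted or sensitive information. The authors propose Base-Anchored Filtering (BAF), a training-free and data-free post-hoc method that decomposes LoRA updates into spectral channels, measures their alignment with the principal subspace of the pretrained backbone, and filters out channels that may carry memorized content. Experiments on multiple datasets and diffusion backbones show that BAF reduces memorization while preserving or improving generation quality.

Details
Abstract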

Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose \textbf{Base-Anchored Filtering (BAF)}, a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.

2605.10438 2026-05-12 cs.LG cs.CV

Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure

Xiang Chen, Alexander Binder

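Affiliations * DSC ScaDS.AI, Leipzig University; Institute for Cancer Genetics and Informatics (ICGI), Oslo, Norway; ICT Cluster, Singapore Institute of Technology, Singapore

AI summary Most current 3D tokenizers treat representation as spatial compression: they reconstruct surface geometry but leave component ownership and attachment validity implicit. This paper proposes interface-centric generative states, which treat tokenization as constructing an operational state rather than a passive compressed code, so that local geometry, component ownership, and attachment validity can be queried, constrained, and repaired during decoding. With Component-Conditioned Canonical Local Tokens (C2LT-3D), the method improves structural robustness in open-world multi-component settings and shows that its latent state remains usable for assembly-level structural reasoning.

Details
Abstract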

Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.

2605.10434 2026-05-12 cs.CV

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Keming Wu, Yijing Cui, Wenhan Xue, Qijie Wang, Xuan Luo, Zhiyuan Feng, Zuhao Yang, Sudong Wang, Sicong Jiang, Haowei Zhu, Zihan Wang, Ping Nie, Wenhu Chen, Bin Wang

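Affiliations * Tsinghua University; Nanyang Technological University; University of Waterloo; Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper introduces WorldReasonBench, which evaluates video generation models as predictors of future world states, focusing on whether their outputs stay physically, socially, logically, and informationally consistent. The benchmark contains 436 structured test cases and uses a human-aligned two-part evaluation that separately verifies the reasoning process and assesses video quality. The study reveals a significant gap between visual plausibility and world reasoning in current video generators, and provides WorldRewardBench for reward-model evaluation to support research on genuinely world-aware video generation.

Comments Project Page: https://unix-ai-lab.github.io/WorldReasonBench/

Details
Abstract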

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.

2605.10419 2026-05-12 cs.CL cs.AI

Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets

Andreas Xenofontos, Pavlos Fafalios

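Affiliations * School of Production Engineering and Management, Technical University of Crete; Institute of Computer Science, Foundation for Research and Technology - Hellas

AI summary This paper studies how effectively large language models answer questions over datasets, in two scenarios: answering directly from a dataset file and generating SQL queries from a database schema. It also evaluates how different prompting strategies affect performance, with experiments on two datasets containing questions of varying difficulty. The results show that large models perform strongly while smaller, more resource-efficient models have clear limitations, which helps clarify the potential and limits of LLMs in data analytics tasks.

Comments Accepted for publication in CARMA 2026 proceedings

Details
Abstract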

This paper investigates the effectiveness of large language models (LLMs) in answering questions over datasets. We examine their performance in two scenarios: (a) directly answering questions given a dataset file as input, and (b) generating SQL queries to answer questions given the schema of a relational database. We also evaluate the impact of different prompting strategies on model performance. The study includes both state-of-the-art LLMs and smaller language models that require fewer resources and operate at lower computational and financial cost. Experiments are conducted on two datasets containing questions of varying difficulty. The results demonstrate the strong performance of large LLMs, while highlighting the limitations of smaller, more cost-efficient models. These findings contribute to a better understanding of how LLMs can be utilized in data analytics tasks and their associated limitations.

2605.10414 2026-05-12 cs.LG

Remember to Forget: Gated Adaptive Positional Encoding

Riccardo Ali, Alessio Borgi, Christopher Irwin, Mario Severino, Pietro Liò

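Affiliations * Department of Computer Science and Technology, University of Cambridge; Department of Computer, Control and Management Engineering, Sapienza University; Department of Information Engineering, University of Padova

AI summary Rotary Positional Encoding (RoPE) in modern large language models suffers from spurious attention alignments and degraded retrieval on sequences longer than those seen in training. This study proposes Gated Adaptive Positional Encoding (GAPE), which adds a content-aware bias to the attention logits while preserving the rotary geometry; a query gate suppresses irrelevant context and a key gate preserves important distant tokens, sharpening attention and improving robustness in long contexts. Experiments show that GAPE outperforms rotary baselines on synthetic retrieval and long-context benchmarks.

Details
Abstract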

Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious long-range alignments, diffuse attention, and degraded retrieval. Existing remedies only partially address these failures, as they often trade local positional resolution for long-context stability. We propose GAPE (Gated Adaptive Positional Encoding), a drop-in augmentation for positional encodings that introduces a content-aware bias directly into the attention logits while preserving the rotary geometry. GAPE decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. We prove that protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. We further show that GAPE can be implemented within standard scaled dot-product attention. We validate these properties empirically, finding that GAPE consistently yields sharper attention and improved long-context robustness over rotary baselines across both synthetic retrieval and long-context benchmarks.

2605.10410 2026-05-12 cs.LG

Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

Wenhua Nie, Binhan Luo, Zijie Meng, Jyh-Shing Roger Jang, Ching-Wen Ma

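Affiliations * National Taiwan University

AI summary This study examines the strategic reasoning of large language models in matrix games and finds that performance drops sharply once semantic cues are removed. Using procedurally generated zero-sum matrix games, it identifies three regimes of reasoning across game sizes and shows that training on the exploitability residual can transfer under payoff shifts even when formatting instability limits performance. Supervised fine-tuning and residual-reward training both substantially raise success rates on unseen larger games, showing that strategic reasoning is format-dependent but improvable.

Details
Abstract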

Large language models can score well on named game-theory benchmarks while failing on the same strategic computation once semantic cues are removed. We show this gap with procedurally generated zero-sum matrix games: a model that recognizes familiar games drops to 34%, 18%, and 2% success on anonymous $2{\times}2$, $3{\times}3$, and $5{\times}5$ payoff matrices. The benchmark separates semantic recall, learned approximate Nash computation, and an output-interface bottleneck that limits scale. Training only on $2{\times}2$ and $3{\times}3$ games, supervised fine-tuning raises unseen $5{\times}5$--$7{\times}7$ success from 2% to 61%, while exploitability-reward training averages 37% with high seed variance. We prove that the exploitability residual is $2$-Lipschitz in payoff perturbations, unlike discontinuous vertex-returning LP equilibrium selectors, explaining why residual training can transfer under payoff shifts even when formatting instability limits mean performance. A dominated-action padding experiment provides causal evidence: trained models solve $3{\times}3$ games embedded in much larger matrices, while random-padded controls fail and dense $12{\times}12$ games remain near failure. Procedural evaluation is therefore necessary for measuring strategic reasoning, and residual rewards expose a real but format-limited route to approximate equilibrium computation.
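For readers unfamiliar with the term, the sketch below computes a standard exploitability residual for a zero-sum matrix game: the row player's best-response payoff against y minus the column player's best-response payoff against x. It assumes the row player maximizes x^T A y, and the paper's exact definition may differ. Under this definition, shifting every payoff by at most eps moves each of the two terms by at most eps, which is the intuition behind a Lipschitz constant of 2.

```python
def exploitability(A, x, y):
    """Exploitability residual of the mixed-strategy pair (x, y) in a zero-sum game.

    A[i][j] is the row player's payoff; x and y are probability vectors over
    rows and columns. The value is non-negative and equals 0 at a Nash equilibrium.
    """
    # Expected payoff of each pure row against the column mix y.
    row_payoffs = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(A))]
    # Expected payoff of the row mix x against each pure column.
    col_payoffs = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]
    return max(row_payoffs) - min(col_payoffs)


# Matching pennies: the uniform mix is the unique equilibrium.
pennies = [[1.0, -1.0], [-1.0, 1.0]]
uniform = [0.5, 0.5]
at_equilibrium = exploitability(pennies, uniform, uniform)
pure_row = exploitability(pennies, [1.0, 0.0], uniform)
```

Because the residual varies continuously with the payoffs, it can serve as a dense reward signal, whereas an exact equilibrium returned by an LP solver can jump between vertices under small payoff changes.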

2605.10409 2026-05-12 cs.CV

Progressive Photorealistic Simplification

Adi Rosenthal, Dana Berman, Yedid Hoshen, Ariel Shamir

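Affiliations * Reichman University and Google; Google Israel; Hebrew University and Google; Google

AI summary This paper proposes a progressive photorealistic simplification method that reduces visual complexity while keeping images realistic. Combining semantic understanding with generative editing, it uses vision-language models to identify and prioritize elements for removal and a learned verifier to maintain photorealism and coherence throughout. The process is further distilled into an image-to-video generation model that produces a coherent simplification sequence directly from a single image, supporting tasks such as content-aware decluttering and semantic layer decomposition.

Details
Abstract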

Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.

2605.10407 2026-05-12 cs.LG

Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access

Wenhua Nie, ZiCheng Zhu, Jianan Wu, Binhan Luo, Haoran Zheng, Jyh-Shing Roger Jang

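Affiliations * National Taiwan University

AI summary This paper studies the limits of recovering a language model's distribution under API access that exposes only top-$K$ logit scores. For a censoring threshold $\tau$, the authors characterize the identified set of compatible teacher distributions and give an exact expression for its total-variation diameter. Experiments show that top-$K$ censoring limits per-position distribution recovery but does not prevent capability extraction, revealing a separation between distribution recovery and capability transfer.

Details
Abstract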

Modern LLM APIs often reveal only top-$K$ logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold $\tau$, the compatible teacher distributions form an identified set whose total-variation diameter is exactly $U_K=(V-K)\exp(\tau)/(Z_A+(V-K)\exp(\tau))$, where $Z_A$ is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-$K$ distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-$K$ censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.
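The diameter formula can be evaluated directly from what the API returns. The sketch below assumes that $Z_A$ is the sum of exponentiated observed top-$K$ logits and that every censored logit is at most the threshold; both are readings of the abstract rather than statements confirmed by it, and the function and argument names are hypothetical.

```python
import math


def tv_diameter(top_logits, vocab_size, tau):
    """Evaluate U_K = (V - K) e^tau / (Z_A + (V - K) e^tau).

    top_logits: the K observed logit scores.
    vocab_size: total vocabulary size V.
    tau: censoring threshold, assumed to upper-bound every hidden logit.
    """
    k = len(top_logits)
    if vocab_size < k:
        raise ValueError("vocab_size must be at least the number of observed logits")
    z_a = sum(math.exp(s) for s in top_logits)  # observed partition function Z_A
    hidden_mass = (vocab_size - k) * math.exp(tau)  # largest mass the hidden tokens can carry
    return hidden_mass / (z_a + hidden_mass)


# Toy case: two observed logits of 0, two hidden tokens, threshold 0.
toy = tv_diameter([0.0, 0.0], vocab_size=4, tau=0.0)
```

The value shrinks toward 0 as the threshold falls far below the observed logits or as K approaches V, matching the intuition that ambiguity comes only from the mass the censored tokens could still hold.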

2605.10405 2026-05-12 cs.LG

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

Elad Tolochinsky, Yaniv Tenzer, Yaniv Romano

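Affiliations * Department of Computer Science, Technion – Israel Institute of Technology; Department of Electrical and Computer Engineering, Technion – Israel Institute of Technology

AI summary This paper studies how to identify the best-performing large language model (LLM) efficiently under limited resources, proposing a framework that combines multi-armed bandit (MAB) algorithms with low-rank factorization predictions. Predicted scores from the low-rank factorization reduce the number of evaluations spent on weak models, while doubly robust estimators preserve statistical validity, yielding valid confidence intervals under adaptive model selection and sampling without replacement. Experiments on real-world benchmarks show that the method substantially reduces the number of evaluations, and thus compute and cost, while still identifying the best model accurately.

Details
Abstract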

Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
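The idea of correcting cheap predictions with a few real evaluations can be illustrated with a textbook difference-style estimator. The sketch below assumes evaluated examples are drawn uniformly at random. It is not the paper's estimator, which additionally handles adaptive selection and sampling without replacement, and all names are hypothetical.

```python
def corrected_mean(predicted, observed):
    """Estimate a model's mean benchmark score from predictions plus a few true scores.

    predicted: predicted score for every example (e.g. from a low-rank fit).
    observed: dict mapping example index -> true score for the evaluated subset.
    The estimate is the mean prediction plus the mean residual on evaluated examples,
    so a systematic bias in the predictions is cancelled in expectation.
    """
    if not predicted or not observed:
        raise ValueError("need predictions and at least one observed score")
    base = sum(predicted) / len(predicted)
    correction = sum(observed[i] - predicted[i] for i in observed) / len(observed)
    return base + correction


# Predictions are uniformly too low by 0.75; two real evaluations expose the bias.
biased = corrected_mean([0.25, 0.25, 0.25, 0.25], {0: 1.0, 1: 1.0})
# Predictions whose errors cancel on the evaluated subset leave the estimate unchanged.
balanced = corrected_mean([0.5, 0.5, 0.5, 0.5], {0: 1.0, 2: 0.0})
```

When the predictions are accurate the residuals are small, so the correction term adds little variance; when they are biased, the residuals pull the estimate back toward the truth.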