arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11862 2026-05-13 cs.CL

Concordance Comparison as a Means of Assembling Local Grammars

Juliana Pirovani, Elias de Oliveira, Eric Laporte

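Affiliations Universidade Federal do Espírito Santo (UFES), Av. Fernando Ferrari, 514, 29075-910 Vitória, ES, Brazil; Université Paris-Est, LIGM, UPEM/CNRS/ENPC/ESIEE, Champs-sur-Marne, 77420, France

AI summary This paper studies how comparing the concordances of local grammars (LGs) can help build better local grammars and improve person-name recognition. The authors propose a concordance-comparison method that analyzes the inclusion, intersection, and disjunction relationships between local grammars in order to select and combine the best-performing ones. On a Portuguese person-name extraction task, the method reaches an F-measure of 76.86, a gain of 6 points over the previous best approach.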

Journal ref Computational Processing of the Portuguese Language. 13th International Conference, PROPOR, Canela, Brazil, September 24-26, 2018, Proceedings, 11122, Springer, pp.57-65, Lecture Notes in Artificial Intelligence

English abstract

Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.
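
The pairwise comparison described in this abstract can be illustrated with plain set operations. The following is a minimal sketch, not the authors' tool: it assumes each concordance is a set of `(start, end, text)` matches, and the function names, grammar names, and sample matches are all hypothetical. The abstract's "disjunction" is labeled `disjoint` here.

```python
from typing import Set, Tuple

# A concordance entry is modeled here as (start_offset, end_offset, matched_text).
Match = Tuple[int, int, str]


def compare_concordances(a: Set[Match], b: Set[Match]) -> dict:
    """Split two concordances into shared matches and matches unique to each."""
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def relationship(a: Set[Match], b: Set[Match]) -> str:
    """Classify a pair of concordances as identical, inclusion, disjoint, or intersection."""
    if a == b:
        return "identical"
    if a <= b or b <= a:
        return "inclusion"
    if not (a & b):
        return "disjoint"
    return "intersection"


# Hypothetical matches produced by three toy grammars on the same text.
lg_title = {(0, 13, "Dr. Ana Souza"), (40, 53, "Sr. João Lima")}
lg_capitalized = lg_title | {(70, 81, "Maria Costa")}
lg_verb_context = {(70, 81, "Maria Costa"), (90, 99, "Rui Alves")}

print(relationship(lg_title, lg_capitalized))         # inclusion
print(relationship(lg_capitalized, lg_verb_context))  # intersection
print(relationship(lg_title, lg_verb_context))        # disjoint
```

Under these assumptions, an inclusion suggests the smaller grammar adds nothing to the larger one, while an intersection or a disjoint pair suggests the two grammars may be worth combining.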

2605.11859 2026-05-13 cs.RO cs.AI

EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

Zhikai Zhao, Chuanbo Hua, Federico Berto, Zihan Ma, Kanghoon Lee, Jiachen Li, Jinkyoo Park

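Affiliations KAIST; Radical Numerics; UC Riverside; Omelet; AI4CO

AI summary This paper proposes EvoNav, a framework that combines evolutionary search with large language models to design reward functions for robot navigation, addressing the reliance of hand-crafted rewards on domain expertise and their poor adaptability to complex environments. Through a staged warm-up-boost procedure, the method uses an LLM to generate candidate reward functions and evaluates them with low-cost surrogates followed by progressively more expensive training, which improves design efficiency and navigation policy performance. Experiments show that EvoNav's navigation policies outperform both manually designed rewards and existing state-of-the-art methods.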

English abstract

Robot navigation is a crucial task with applications to social robots in dynamic human environments. While Reinforcement Learning (RL) has shown great promise for this problem, the policy quality is highly sensitive to the specification of reward functions. Hand-crafted rewards require substantial domain expertise and embed inductive biases that are difficult to audit or adapt, limiting their effectiveness and leading to suboptimal performance. In this paper, we propose EvoNav, an evolutionary framework that automates the design of robot navigation reward functions via large language models (LLMs). To overcome prohibitively costly policy training, EvoNav evaluates each candidate proposal from the LLM via a progressive three-stage warm-up-boost procedure. EvoNav advances from analytical proxies with low-cost surrogates, such as small datasets and analytic rules, to lightweight rollouts and, finally, to full policy training, enabling computationally efficient exploration under effective feedback. Experiment results show that EvoNav produces more effective navigation policies than manually designed RL rewards and state-of-the-art reward design methods.

2605.11857 2026-05-13 cs.LG

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

Amr Abourayya, Jens Kleesiek, Michael Kamp

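Affiliations Lamarr Institute for ML and AI, Technical University Dortmund; Institute for AI in Medicine (IKIM), University Hospital Essen

AI summary This paper proposes a federated fine-tuning method that moves beyond parameter aggregation by having clients collaborate through model behavior rather than parameters. Clients fine-tune models on local data and generate outputs on a shared public prompt set; the server maps these outputs into a semantic space, forms a semantic consensus for each prompt, and returns pseudo-labels for further fine-tuning. The communication cost is independent of model size, and the method supports heterogeneous architectures and open-ended text generation. In experiments it matches existing methods while substantially reducing communication, runtime, and energy consumption.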

English abstract

Federated fine-tuning of large language models is commonly formulated as a parameter aggregation problem. However, even parameter-efficient methods require transmitting large collections of trainable weights, assume aligned architectures, and rely on white-box access to model parameters. As model sizes continue to grow and deployments become increasingly heterogeneous, these assumptions become progressively misaligned with practical constraints. We consider an alternative formulation in which collaboration is mediated through model behavior rather than parameters. Clients fine-tune local models on private data and exchange generated outputs on a shared, public prompt set. The server maps these outputs into a semantic representation space, forms a per-prompt semantic consensus, and returns pseudo-labels for further local fine-tuning. This formulation fundamentally changes the communication scaling of federated LLM fine-tuning. The amount of information exchanged depends only on the public prompt budget and the size of the communicated behaviors, independent of model size. As a consequence, the protocol naturally accommodates heterogeneous architectures and applies directly to open-ended text generation. We present a theoretical analysis and empirical results demonstrating that this approach can match strong federated fine-tuning baselines while reducing communication by orders of magnitude (e.g., analytically by a factor of 1006 for Llama3.1-405B) and also reducing runtime and energy consumption. These results suggest that, for generative foundation models, behavior-level consensus provides a more appropriate abstraction for federated adaptation than parameter aggregation.
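
The server step in this abstract (embed client outputs, form a per-prompt consensus, return pseudo-labels) can be sketched as follows. This is only an illustration under stated assumptions: the bag-of-words `embed` function stands in for a real semantic encoder, the consensus rule (pick the medoid, i.e. the output most similar on average to all outputs) is a choice made for this example rather than the paper's rule, and all names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a semantic encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def consensus(outputs: List[str]) -> str:
    """Return the medoid: the output most similar, on average, to all outputs."""
    vecs = [embed(o) for o in outputs]
    scores = [sum(cosine(v, w) for w in vecs) / len(vecs) for v in vecs]
    return outputs[scores.index(max(scores))]


def server_round(client_outputs: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each public prompt to one pseudo-label built from the clients' outputs."""
    return {prompt: consensus(outs) for prompt, outs in client_outputs.items()}


# Hypothetical outputs from three clients on one shared public prompt.
round_outputs = {
    "What is the capital of France?": [
        "the capital of france is paris",
        "paris is the capital of france",
        "i think it is lyon",
    ]
}
pseudo_labels = server_round(round_outputs)
print(pseudo_labels)
```

Only text travels between clients and server in this sketch, so the traffic scales with the number of prompts and the length of the outputs rather than with the number of model parameters.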

2605.11856 2026-05-13 cs.CV cs.CL

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li

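Affiliations University of Science and Technology of China; Tsinghua University; National University of Singapore; Zhongguancun Academy

AI summary This paper proposes UniVLR, a unified visual latent reasoning framework intended to improve the efficiency and performance of multimodal large language models on image reasoning tasks. The method merges textual reasoning and auxiliary visual information into a shared visual workspace: reasoning traces are rendered together with images and compressed into compact visual latent representations. At inference time the model reasons only through visual latents and decodes the answer directly, avoiding explicit text reasoning and external tool calls. Experiments show that UniVLR outperforms existing methods on real-world perception and visual reasoning tasks while generating fewer reasoning tokens.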

English abstract

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.

2605.11846 2026-05-13 cs.LG cs.AI

Martingale-Consistent Self-Supervised Learning

Moritz Gögl, Hanwen Xing, Christopher Yau

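Affiliations University of Oxford; Health Data Research UK

AI summary This paper studies how to improve the robustness and coherence of self-supervised learning (SSL) when information is incomplete or changes over time. The authors propose a martingale-based SSL framework that keeps coarse and refined predictions consistent in expectation, which prevents systematic drift. The method includes prediction-space and latent-space variants and an unbiased Monte Carlo estimator, and experiments show that it improves stability and calibration under partial observation.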

English abstract

Self-supervised learning (SSL) is often deployed under changing information, such as shorter histories, missing features, or partially observed images. In these settings, predictions from coarse and refined views should be coherent: before refinement, the coarse-view prediction should match the average prediction expected after refinement. Martingales formalize this coherence principle, but standard SSL objectives do not enforce it. Unlike invariance objectives that pull views together, martingale consistency constrains only the expected refined prediction, allowing predictions to update as information is revealed while preventing systematic drift. We introduce a martingale-consistent SSL framework that closes this gap, with practical prediction- and latent-space variants and an unbiased two-sample Monte Carlo estimator based on stochastic refinement. We evaluate the approach on synthetic and real time-series, tabular, and image benchmarks under partial-observation regimes, in both semi-self-supervised and fully label-free settings. Across these experiments, our framework improves robustness and calibration under partial observation, yielding more stable representations as information is revealed.
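
The coherence condition in this abstract says that the coarse-view prediction should equal the expected refined prediction. One standard way to penalize the squared gap without bias is to multiply the residuals from two independent refinements of the same coarse view. The sketch below checks that identity exactly on a toy discrete example. It assumes scalar predictions and a squared-gap penalty; whether the paper's estimator takes this exact form is not stated in the abstract, and all names and numbers are hypothetical.

```python
from fractions import Fraction
from itertools import product
from typing import List


def two_sample_penalty(coarse: Fraction, refined_a: Fraction, refined_b: Fraction) -> Fraction:
    """Product of two residuals from independent refinements of the same coarse view."""
    return (coarse - refined_a) * (coarse - refined_b)


def expected_two_sample_penalty(coarse: Fraction, refined: List[Fraction]) -> Fraction:
    """Exact expectation when each refinement is drawn uniformly and independently."""
    pairs = list(product(refined, repeat=2))
    return sum((two_sample_penalty(coarse, a, b) for a, b in pairs), Fraction(0)) / len(pairs)


def squared_martingale_gap(coarse: Fraction, refined: List[Fraction]) -> Fraction:
    """Target quantity: (coarse prediction - mean refined prediction) squared."""
    mean_refined = sum(refined, Fraction(0)) / len(refined)
    return (coarse - mean_refined) ** 2


# Hypothetical predictions: one coarse view, three equally likely refined views.
coarse_pred = Fraction(1, 2)
refined_preds = [Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)]

print(expected_two_sample_penalty(coarse_pred, refined_preds))
print(squared_martingale_gap(coarse_pred, refined_preds))
```

Both printed values agree. A single-sample squared residual would not: its expectation also includes the variance of the refined prediction, so it would penalize legitimate updates as information is revealed.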

2605.11845 2026-05-13 cs.CL

Probabilistic Calibration Is a Trainable Capability in Language Models

Davide Baldelli, Sruthi Kuriakose, Maryam Hashemzadeh, Amal Zouaq, Sarath Chandar

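Affiliations Chandar Research Lab; Mila – Quebec AI Institute; LAMA-WeST Lab; Polytechnique Montréal; Université de Montréal; Independent researcher

AI summary This study examines the poor calibration between language models' generation probabilities and target distributions when outputs must satisfy user-specified randomness constraints, and improves this capability through fine-tuning. The authors propose two calibration fine-tuning methods: a soft-target method that converts the target distribution into trie-derived next-token targets, and a hard-target method that trains on completions sampled from the target distribution. Experiments show that both methods improve structured-sampling fidelity across a range of distributions and parameter settings, demonstrating that probabilistic calibration can be strengthened by fine-tuning.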

English abstract

Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.

2605.11840 2026-05-13 cs.CV

Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

Zhangcheng Hou, Tomoaki Ohtsuki

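Affiliations School of Science and Technology

AI summary This paper studies how radar signals can improve radar-camera depth estimation and proposes Radar-Modulated Selection (RMS), a mechanism built on state space models that injects radar information directly into the model's scan rather than relying on conventional feature fusion. Radar modulates the scan step size and readout parameters while the image backbone is left unchanged, so radar influences the model only where it improves accuracy. Experiments on the nuScenes dataset show substantial accuracy gains together with lower latency.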

Comments 16 pages, 3 figures, 9 tables

English abstract

Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods (concatenation, confidence-aware gating, sparse supervision, graph-based extraction) combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size $\Delta$ and readout $\mathbf{C}$ while leaving the input projection $\mathbf{B}$ and state dynamics $\mathbf{A}$ image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar's spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0-50 m, 0-70 m, and 0-80 m, while attaining the lowest single-frame latency (26.8 ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.

2605.11838 2026-05-13 cs.LG math.OC

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

Alexander Yukhimchuk, Mladen Kolar, Martin Takáč, Sayantan Choudhury

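Affiliations MBZUAI; University of Southern California

AI summary This paper studies how to apply gradient clipping more effectively in modern neural network training and proposes a method that respects the matrix structure of parameters. The authors observe that data outliers mainly affect the leading singular values of gradient matrices, and therefore propose clipping based on singular values: singular values above a threshold are clamped to stabilize training. The method generalizes conventional vector-norm clipping, comes with a convergence analysis under heavy-tailed noise, and is implemented efficiently with randomized truncated SVD, making it practical for large layers.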

English abstract

Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal $\mathcal{O}\left(K^{\frac{2 - 2\alpha}{3\alpha - 2}}\right)$ rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top $r$ singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
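
The clamping operation described in this abstract can be written compactly. The notation below is an assumption introduced for illustration (a layer gradient $G$ with singular value decomposition $U \Sigma V^{\top}$, threshold $\tau$, and singular triples $(\sigma_i, u_i, v_i)$ sorted in decreasing order); the abstract itself does not fix symbols.

```latex
G = U \Sigma V^{\top}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n), \qquad
\operatorname{clip}_{\tau}(G) = U \, \operatorname{diag}\bigl(\min(\sigma_1, \tau), \dots, \min(\sigma_n, \tau)\bigr) \, V^{\top}

% Equivalent rank-r correction when at most the top r singular values exceed \tau:
\operatorname{clip}_{\tau}(G) = G - \sum_{i=1}^{r} \max(\sigma_i - \tau, 0) \, u_i v_i^{\top}
```

The second form needs only the top $r$ singular triples, which is why a truncated decomposition is enough when few singular values exceed the threshold.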

2605.11836 2026-05-13 cs.LG cs.CL

More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing

Xin Ma, Wei Chen, Qi Liu, Derong Xu, Zhi Zheng, Tong Xu, Enhong Chen

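Affiliations State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary This paper studies the key mechanism that keeps large language models stable during sequential model editing, identifies Lifelong Normalization (LN) as the core strategy, and gives the first theoretical account of how it works. The authors find that LN, which normalizes gradients using running statistics, creates a self-reinforcing stability loop and, combined with ridge-regularized regression, suppresses forgetting and systemic collapse. Building on these findings, they propose StableEdit, which adds a warm-up stage and full whitening to further improve long-horizon editing stability; experiments support the theory.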

English abstract

Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE-USTC/StableEdit.

2605.10916 2026-05-13 cs.CV cs.AI

Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

Md. Sultan Al Rayhan

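Affiliations Department of Computer Science and Engineering

AI summary Recognizing handwritten Bangla compound characters is challenging because of complex character structures, large intra-class variation, and limited high-quality annotated data. This paper proposes a confidence-guided diffusion augmentation framework to improve recognition of low-resolution Bangla compound characters. The method combines a class-conditional diffusion model with classifier guidance to generate high-quality synthetic samples, and introduces enhanced residual blocks and a confidence-based filter to improve generation quality and keep only highly class-consistent samples. Experiments show gains across several mainstream models; the best model reaches 89.2% classification accuracy on the AIBangla dataset, well above the existing benchmark.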

English abstract

Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.

2605.10818 2026-05-13 cs.LG q-bio.NC

On periodic distributed representations using Fourier embeddings

Jakeb Chouinard

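Affiliations University of Waterloo

AI summary This paper studies how to build periodic distributed representations with Fourier embeddings in order to better handle periodic signals such as angles. The author proposes high-dimensional, real-valued periodic embeddings that avoid the difficulty scalar angle representations have with nearby angles, and uses dot-product similarity to control the shape of different kernels. The focus is on formalizing Dirichlet and periodic Gaussian kernels using Spatial Semantic Pointers, a neurally plausible representation scheme.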

English abstract

Periodic signals are critical for representing physical and perceptual phenomena. Scalar, real angular measures, e.g., radians and degrees, result in difficulty processing and distinguishing nearby angles, especially when their absolute difference exceeds $\pi$. We can avoid this problem by using real-valued, periodic embeddings in high-dimensional space. These representations also allow us to control the nature of their dot product similarities, allowing us to construct a variety of different kernel shapes. In this work, we aim to highlight how these representations can be constructed and focus on the formalization of Dirichlet and periodic Gaussian kernels using the neurally-plausible representation scheme of Spatial Semantic Pointers.
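
The wrap-around problem and its fix can be shown with a small real-valued Fourier embedding. This is a simplified sketch rather than the Spatial Semantic Pointer construction itself: it assumes integer frequencies 1 to `n_freqs` with equal weights, and the function names are hypothetical. With that choice the dot product depends only on the angular difference and equals a sum of cosines, which is the Dirichlet kernel up to an affine rescaling.

```python
import math
from typing import List


def fourier_embed(theta: float, n_freqs: int) -> List[float]:
    """Embed an angle as [cos(k*theta), sin(k*theta)] for k = 1..n_freqs."""
    vec = []
    for k in range(1, n_freqs + 1):
        vec.extend((math.cos(k * theta), math.sin(k * theta)))
    return vec


def similarity(a: float, b: float, n_freqs: int = 8) -> float:
    """Normalized dot product of two embeddings; equals 1.0 for identical angles."""
    u, v = fourier_embed(a, n_freqs), fourier_embed(b, n_freqs)
    return sum(x * y for x, y in zip(u, v)) / n_freqs


def dirichlet_form(delta: float, n_freqs: int = 8) -> float:
    """Closed form of the same similarity: mean of cos(k*delta), k = 1..n_freqs."""
    return sum(math.cos(k * delta) for k in range(1, n_freqs + 1)) / n_freqs


# Two angles on either side of the +/- pi seam: far apart as scalars, close on the circle.
a, b = math.pi - 0.05, -math.pi + 0.05
print(abs(a - b))                    # scalar difference, close to 2*pi
print(similarity(a, b))              # high, same as for two angles 0.1 rad apart
print(similarity(0.0, math.pi / 2))  # much lower
```

Changing the weights applied to each frequency changes the kernel shape, which is the degree of freedom the abstract refers to when it mentions different kernel shapes.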

2605.10684 2026-05-13 cs.LG cs.AI

Is Data Shapley Not Better than Random in Data Selection? Ask NASH

Xiao Tian, Jue Fan, Rachael Hwee Ling Sim, Zixuan Wang, Nancy F. Chen, Bryan Kian Hsiang Low

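Affiliations Department of Computer Science, National University of Singapore, Singapore; Agency for Science, Technology and Research (A*STAR), Singapore

AI summary This paper studies the problem of selecting high-quality subsets of training data and examines how effective methods such as Data Shapley are for data selection. To address Data Shapley's inconsistent performance in practice, the authors propose the NASH framework, which decomposes the target utility function into simpler Shapley-informative components and aggregates them non-linearly for data selection. This substantially improves Shapley-based data selection at little additional computational cost.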

Comments Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper

English abstract

Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
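
For readers unfamiliar with the baseline this abstract questions, top-$m$ Data Shapley selection can be sketched on a toy problem. The code below computes exact Shapley values by enumerating all orderings, which is feasible only for a handful of points. It illustrates the baseline, not NASH; the utility function, point names, and helper names are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(points: List[str], utility: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value of each point: its average marginal gain over all orderings."""
    totals = {p: 0.0 for p in points}
    orderings = list(permutations(points))
    for order in orderings:
        chosen: FrozenSet[str] = frozenset()
        for p in order:
            with_p = chosen | {p}
            totals[p] += utility(with_p) - utility(chosen)
            chosen = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def top_m(values: Dict[str, float], m: int) -> List[str]:
    """Select the m points with the highest Shapley value."""
    return sorted(values, key=values.get, reverse=True)[:m]


# Hypothetical utility: each clean point adds 1, a duplicate pair adds only 1 in total,
# and the noisy point subtracts 1.
def toy_utility(subset: FrozenSet[str]) -> float:
    score = float(len(subset & {"clean_a", "clean_b"}))
    score += 1.0 if subset & {"dup_1", "dup_2"} else 0.0
    score -= 1.0 if "noisy" in subset else 0.0
    return score


data = ["clean_a", "clean_b", "dup_1", "dup_2", "noisy"]
values = shapley_values(data, toy_utility)
print(values)
print(top_m(values, 2))
```

In this toy case the two duplicates split their shared contribution, so each looks half as valuable as a clean point even though either one alone is just as useful. Interactions of this kind are one reason a simple top-$m$ ranking may select poorly.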

2605.10360 2026-05-13 cs.CV

DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions

Minje Kim, Younghyun Noh, Jaesoon Kim, Tae-Kyun Kim

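Affiliations KAIST; Sungkyunkwan University

AI summary This paper proposes DySurface, a framework for reconstructing temporally consistent 4D surfaces in dynamic scenes. The method combines explicit Gaussians with an implicit signed distance function (SDF): it builds a dynamic sparse voxel grid that gives the implicit SDF field explicit geometric guidance, which markedly improves surface reconstruction quality with more accurate boundaries and finer detail. Experiments show that DySurface outperforms existing state-of-the-art methods in geometric accuracy while maintaining competitive rendering performance.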

English abstract

While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS ($\text{canonical} \rightarrow \text{dynamic}$) and the backward deformation required for volumetric SDF rendering ($\text{dynamic} \rightarrow \text{canonical}$). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.

2605.10288 2026-05-13 cs.LG math.OC

BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

Hengrui Zhang, Boao Kong, Engao Zhang, Kun Yuan

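Affiliations Sichuan University; Peking University

AI summary This paper proposes BROS, an efficient single-loop bilevel optimization method aimed at problems such as hyperparameter learning and data reweighting in deep learning. The method performs gradient updates in randomized subspaces and uses a Rademacher bi-probe correction to obtain an unbiased estimate of the Hessian action, reducing memory use while keeping a convergence rate close to that of exact single-loop methods. Experiments show that BROS cuts peak memory by up to 44.9% relative to existing methods across several tasks while keeping comparable performance.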

English abstract

Stochastic bilevel optimization (SBO) has become a standard framework for hyperparameter learning, data reweighting, representation learning, and data-mixture optimization in deep learning. Existing exact single-loop SBO methods and memory-efficient surrogate SBO methods either create severe memory pressure for large lower-level neural networks or lack competitive convergence guarantees under standard assumptions. In this paper, we propose BROS, a memory-efficient single-loop SBO method with the same convergence rate order as exact single-loop SBO methods. BROS performs lower and auxiliary updates in randomized subspaces with a Rademacher bi-probe correction that recovers an unbiased Hessian-action estimator. We prove that BROS preserves the $\mathcal O(\varepsilon^{-2})$ sample complexity of MA-SOBA for finding an $\varepsilon$-stationary point under only standard assumptions. Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT sample reweighting show that BROS reduces peak memory by up to 44.9% while closely matching full-space baseline performance.

2605.10235 2026-05-13 cs.CL

Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection

Yiwen Chen, Kuan Li, Fuzhen Zhuang, Deqing Wang, Zhao Zhang, Liwen Zhang, Yong Jiang, Shuai Wang, Minhao Cheng

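Affiliations Beihang University; HKUST; Alibaba Group; Pennsylvania State University

AI summary This paper studies how to choose between retrieval-augmented generation (RAG) and long-context (LC) strategies for large language models (LLMs) and proposes Pre-Route, a proactive routing framework. Using lightweight metadata such as document type and length, the method performs structured reasoning before answering, covering task analysis, coverage estimation, and information-need prediction, to produce explainable and cost-efficient routing decisions. Experiments show that Pre-Route outperforms existing methods on several benchmarks with better overall cost-effectiveness.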

English abstract

Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.

2605.10094 2026-05-13 cs.RO cs.AI

Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

Jianchao Zhao, Huoren Yang, Yusong Hu, Yuyang Gao, Qiguan Ou, Cong Wan, SongLin Dong, Zhiheng Ma, Yihong Gong

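Affiliations College of Artificial Intelligence, Xi'an Jiaotong University; One Robotics; Shenzhen University of Advanced Technology

AI summary This paper studies how to improve the test-time reliability of a frozen vision-language-action (VLA) model under persistent deployment. It proposes a test-time adaptation framework based on an online success memory: during deployment, successful observation-action segments are stored; at inference, relevant action chunks are retrieved, filtered for trajectory consistency, and aggregated into a high-quality action prior. A confidence-adaptive guidance mechanism injects this prior into the action generation process, giving lightweight adaptation without parameter updates. Experiments show markedly higher task success and closed-loop stability on long-horizon and multi-stage tasks.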

English abstract

Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.

2605.09965 2026-05-13 cs.CV

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

Kuan Zhang, Dongchen Liu, Qiyue Zhao, Tianyu Xin, Yue Su, Haisheng Wang, Han Yin, Hongbo Ma, Peize Li, Tianjun Gu, Xiangnan Wu, Xinran Zhang, Yongxuan Li, Zirong Chen, Yiming Li

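Affiliations College of AI, Tsinghua University; MMLab, The University of Hong Kong; University of Chinese Academy of Sciences

AI summary This study explores how foundation models can serve as generalist game players, with the goal of AI that adapts and performs across a "game multiverse" of differing rules, objectives, and physics. It analyzes the full lifecycle of a generalist game player along four interdependent pillars (datasets, models, harnesses, and benchmarks) and identifies five fundamental trade-offs facing current systems. From this end-to-end view, the paper lays out a five-level roadmap that progresses from single-game mastery to an ultimate creator stage in which the agent both creates and evolves within a theoretical game multiverse, offering systematic guidance toward artificial general intelligence (AGI).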

Comments 51 pages, 7 figures, github: https://github.com/THUSI-Lab/Awesome-LFMs-Play-Games

English abstract

The real world unfolds along a single set of physical laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.

2605.09780 2026-05-13 cs.AI

Attribution-based Explanations for Markov Decision Processes

Paul Kobialka, Andrea Pferscher, Francesco Leofante, Erika Ábrahám, Silvia Lizeth Tapia Tarifa, Einar Broch Johnsen

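Affiliations * University of Oslo; Imperial College London; RWTH Aachen University

AI summary This paper studies how to generate attribution-based explanations for Markov Decision Processes (MDPs) in order to clarify the behavior of agents in sequential decision-making. The authors propose a formal framework that assigns importance scores to states and execution paths in an MDP, and they compute these scores efficiently with strategy synthesis techniques despite the non-determinism inherent in MDPs. Five case studies demonstrate the approach's utility in providing interpretable insights into decisions.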

Details
English abstract

Attribution techniques explain the outcome of an AI model by assigning a numerical score to its inputs. So far, these techniques have mainly focused on attributing importance to static input features at a single point in time, and thus fail to generalize to sequential decision-making settings. This paper fills this gap by introducing techniques to generate attribution-based explanations for Markov Decision Processes (MDPs). We give a formal characterization of what attributions should represent in MDPs, focusing on explanations that assign importance scores to both individual states and execution paths. We show how importance scores can be computed by leveraging techniques for strategy synthesis, enabling the efficient computation of these scores despite the non-determinism inherent in an MDP. We evaluate our approach on five case studies, demonstrating its utility in providing interpretable insights into the logic of sequential decision-making agents.

2605.09769 2026-05-13 cs.AI

UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

Dima Galat, Marian-Andrei Rizoiu

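Affiliations * University of Technology Sydney

AI summary This paper presents a system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS); it placed second among 64 teams (F1 0.406). Its central idea is to define defense mechanisms by what is absent (missing affect, blocked cognition, denied reality) and to encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which markedly improves classification performance. The system uses a multi-phase council of Gemini 2.5 agents in which class-specific advocates rate evidence strength rather than simply vote, performing well without fine-tuning; a targeted override strategy built on three fine-tuned Qwen3.5 models then improves performance further.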

Details
English abstract

This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning - a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59-80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.

2605.09271 2026-05-13 cs.AI

Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

Zhiqin Yang, Yuhan Liu, Jingwen Fu, Pei Fu, Bo Han, Masashi Sugiyama, Nanning Zheng

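Affiliations * The Hong Kong University of Science and Technology; MiLM Plus, Xiaomi Inc; Zhongguancun Academy; Hong Kong Baptist University; The University of Tokyo; RIKEN Center for Advanced Intelligence Project; Xi’an Jiaotong University

AI summary Although natural language is the default input medium for large language models (LLMs), its limited expressive capacity creates a bottleneck in complex problem-solving. This paper argues that shaping schemas through advanced language representation is the next key direction for expanding LLM intelligence, and that the structural and symbolic sophistication of a language representation strongly affects how a model activates and organizes its knowledge. Through theoretical exposition and experimental validation, the study shows that deliberately designed language representations can substantially improve model performance without changing model parameters or scale, offering new ideas and directions for future research.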

Comments 41 pages, 30 figures

Details
English abstract

Although natural language is the default medium for Large Language Models (LLMs), its limited expressive capacity creates a profound bottleneck for complex problem-solving. While recent advancements in AI have relied heavily on scaling, merely internalizing knowledge does not guarantee its effective application. Defining language representation as the linguistic and symbolic constructs used to map and model the real world, this paper argues that shaping schemas through advanced language representation is the next frontier for expanding LLM intelligence. We posit that an LLM's knowledge activation and organization -- its schema -- depends heavily on the structural and symbolic sophistication of the language used to represent a given task. This paper contributes both a formalization of this claim and the empirical evidence to support it. With a new formalization, we present multiple lines of evidence to support our position: Firstly, we review recent empirical practices and emerging methodologies that demonstrate the substantial performance gains achievable through deliberate language representation design, even without modifying model parameters or scale. Secondly, we conduct controlled experiments showing that LLM performance and its internal feature activations vary under different language representations of the same underlying task. Together, these findings highlight language representation design as a promising direction for future research.

2605.09266 2026-05-13 cs.AI

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Kun Xiang, Terry Jingchen Zhang, Zirong Liu, Bokai Zhou, Yueling Tang, Junjie Yu, Jiacong Lu, Shangrui Huang, Heng Li, Likui Zhang, Kunkun Liu, Changzheng Zhang, Yangle Fang, Boqiang Guo, Hui-Ling Zhen, Dandan Tu, Yinya Huang, Xiaodan Liang

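Affiliations * Sun Yat-sen University; ETH Zurich; ETH AI Center; Huawei Technologies Ltd

AI summary This paper introduces SeePhys Pro, a fine-grained benchmark for studying whether multimodal models preserve the same reasoning capability as information is progressively transferred from text to image. The benchmark provides four semantically aligned variants of each problem with progressively more visual elements; experiments show that the performance of current frontier models degrades as information moves from language to diagrams, with visual variable grounding as the key bottleneck. Using blind training and other controls, the study further analyzes where model improvements come from and finds that some gains may stem from residual textual cues rather than valid visual evidence, underscoring that evaluation of multimodal reasoning should consider robustness under modality transfer and reliance on task-critical visual evidence.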

Details
English abstract

We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, text-deletion, image-mask-rate, and format-saturation controls suggest that such gains can arise from residual textual and distributional cues rather than valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.

2605.09236 2026-05-13 cs.CL cs.AI cs.CY cs.DL cs.IR

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

Yu Wu, Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen

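Affiliations * University of Helsinki; TurkuNLP, University of Turku; ELLIS Institute Finland

AI summary This paper evaluates how effectively semantic search supports the analysis of intellectual transmission in historical corpora, using the 18th-century reception of John Locke's thought as a case study. Using expert annotation grounded in a semantic taxonomy, it tests whether an off-the-shelf semantic search method can surface implicit references missed by traditional methods based on lexical text reuse. The results show that semantic search retrieves more implicit receptions, but also that surface vocabulary overlap still constrains retrieval, highlighting both the potential and the limitations of semantic retrieval for historical corpus analysis.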

Comments Accepted by NLP4DH 2026

Details
English abstract

While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.

2605.09127 2026-05-13 cs.RO

IMPACT: An Implicit Active-Set Augmented Lagrangian for Fast Contact-Implicit Trajectory Optimization

Jiayun Li, Dejian Gong, Georgia Chalvatzaki

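Affiliations * PEARL Lab, Dept. of Computer Science, TU Darmstadt, Germany; Hessian.AI, Darmstadt, Germany; Robotics Institute Germany (RIG)

AI summary IMPACT is an implicit active-set augmented Lagrangian method for contact-implicit trajectory optimization (CITO), designed to efficiently solve mathematical programs with complementarity constraints. The method identifies contact-mode branches on the fly during trajectory optimization, which improves solution efficiency and stability. Experiments show that IMPACT significantly outperforms existing methods on several benchmarks and achieves high-quality control of contact-rich tasks on a real robotic system.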

Comments Accepted to Robotics: Science and Systems (RSS), 2026

Details
English abstract

Contact-implicit trajectory optimization (CITO) has attracted growing attention as a unified framework for planning and control in contact-rich robotic tasks. Recent approaches have demonstrated promising results in manipulation and locomotion without requiring a prescribed contact-mode schedule. However, the underlying mathematical programs with complementarity constraints (MPCCs) are known to be numerically ill-conditioned, and systematic, scalable solution strategies for CITO remain an active area of research. More efficient and principled solvers that can handle contact constraints are therefore essential to broaden the applicability of CITO. In this work, we develop an augmented-Lagrangian approach for solving MPCC-based CITO with stationarity guarantees. The method can be interpreted as identifying the implicit contact-mode branches on the fly during the trajectory optimization (TO) iterations; we call this approach IMPACT (IMPlicit contact ACtive-set Trajectory optimization). We provide an efficient C++ implementation tailored to trajectory-optimization workloads and evaluate it on the open-source CITO and contact-implicit model predictive control (CI-MPC) benchmarks. On CITO, IMPACT achieves 2.9x-70x speedups over strong baselines (geometric mean 13.8x). On CI-MPC, we show improved control quality for contact-rich trajectories on dexterous manipulation tasks in simulation. Finally, we demonstrate the proposed method on real robotic hardware on a T-shaped object pushing task.
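
Note on the reported speedups: the abstract summarizes a 2.9x-70x range with a geometric mean, the usual way to average ratios. The sketch below shows how such a figure is computed. The function name and the sample values are hypothetical illustrations, not data or code from the paper.

```python
import math


def geometric_mean(speedups):
    """Return the geometric mean of a list of positive speedup ratios."""
    if not speedups:
        raise ValueError("need at least one speedup value")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Average in log space, then map back; this avoids overflow on long lists.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


if __name__ == "__main__":
    # Hypothetical per-benchmark speedups, chosen only for illustration.
    example = [2.9, 10.0, 70.0]
    print(f"geometric mean speedup: {geometric_mean(example):.2f}x")
```

A geometric mean is less dominated by a single large ratio than an arithmetic mean, which is why it is commonly preferred when speedups span more than an order of magnitude.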

2605.09100 2026-05-13 cs.CL

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

Zhongtao Miao, Qiyu Wu, Yoshimasa Tsuruoka

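Affiliations * The University of Tokyo

AI summary This paper proposes GRC, a unified training framework that integrates reasoning-driven generation, text representation, and context compression into a single forward pass of a large language model. Using meta latent tokens and a unified generative, representative, and compressive tuning approach, GRC completes all three tasks in one forward pass while remaining modular and flexibly composable at inference time. The approach substantially reduces the deployment cost of retrieval-augmented generation (RAG), improves training data utilization, and introduces new paradigms such as self-reason-latent embeds and latent memory-augmented generation; experiments confirm its effectiveness across multiple tasks.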

Comments Fixed typos in Eq. 4 and GPU names; added details on hybrid paged attention implementation

Details
English abstract

Text embedding and generative tasks are usually trained separately on top of large language models (LLMs), which incurs substantial training cost and deployment effort. Context compression is also a challenging and pressing task, vital to reasoning-driven generation and to agentic tasks that require long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation, and context compression in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and threefold data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding, self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where a compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on a truly unified model that can handle reasoning-driven generation, embedding, and compression tasks seamlessly.

2605.08804 2026-05-13 cs.RO

Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion

Jianhui Chen, Ruixin Zhan, Liu Liu, Yang Cai, Ziqiao Li

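Affiliations * Alibaba Group

AI summary This study addresses key challenges in high-fidelity, versatile locomotion control for quadruped robots and proposes Diff-CAST, a constraint-aware motion prior framework based on diffusion models. By exploiting the multi-modal distribution modeling capability of diffusion models, the method addresses the mode collapse of traditional GAN discriminators on large-scale datasets, and it combines Symmetric Augmented Command Conditioning (SACC) with constrained reinforcement learning to achieve high-fidelity intent execution and safe hardware deployment. Experiments show that Diff-CAST improves the diversity and robustness of motion skills and supports stable walking in complex environments.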

Details
English abstract

Reinforcement learning combined with imitation learning has significantly advanced biomimetic quadrupedal locomotion. However, scaling these frameworks to massive, multi-source datasets exposes fundamental bottlenecks. First, traditional GAN-based discriminators are prone to mode collapse, struggling to capture diverse motion distributions from uncurated datasets. Second, existing kinematic priors suffer from out-of-distribution (OOD) tracking conflicts, leading to severe unintended heading drifts during complex maneuvers. Furthermore, deploying unconstrained priors to physical hardware poses critical safety risks by disregarding actuator dynamics. To overcome these challenges, we propose Diff-CAST (Diffusion-guided Constraint-Aware Symmetric Tracking), a novel motion prior framework leveraging the multi-modal distribution modeling capabilities of diffusion models for stylistic rewards. Diff-CAST effectively replaces traditional GAN discriminators, unlocking robust data scaling on heterogeneous collections. To ensure high-fidelity intent execution and reliable real-world deployment, we introduce a comprehensive Sim2Real architecture integrating Symmetric Augmented Command Conditioning (SACC) for drift-free tracking, and Constrained RL for hardware safety. Experiments on a quadruped demonstrate that Diff-CAST mitigates mode collapse, enables seamless transitions between diverse skills, and ensures robust, hardware-compliant locomotion.

2605.08463 2026-05-13 cs.AI

Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification

Sarah Wilson, Diem Linh Dang, Usman Ali Moazzam, Shan Ye, Gail Kaiser

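Affiliations * Columbia University

AI summary This study investigates what determines the behavior of autonomous AI agents deployed in social networks, systematically analyzing how personality specification, model backbone, and operational rules affect agents' social behavior. By deploying thirteen OpenClaw agents on Moltbook, a social network built for AI agents, and comparing them against a default control agent, the study finds that personality specification is the dominant influence on agent behavior, while model and rules have moderate effects on linguistic style and topic engagement. The work provides empirical evidence and design guidance for building AI agents intended for collaborative or monitoring tasks.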

Details
English abstract

Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi-factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook -- a Reddit-like social network built for AI agents -- across three systematically varied independent variables: (1) personality specification, (2) underlying LLM backbone, and (3) operational rules and memory configuration. A default control agent provides a behavioral baseline. Over a one-week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi-agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.

2605.08434 2026-05-13 cs.RO

Failing Forward: Adaptive Failure-Informed Learning for Vision-Language-Action Models

Meng Zheng, Samhita Marri, Anwesa Choudhuri, Benjamin Planche, Zhongpai Gao, Van Nguyen Nguyen, Terrence Chen, Girish Chowdhary, Ziyan Wu

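Affiliations * United Imaging Intelligence; University of Illinois Urbana-Champaign

AI summary Vision-language-action (VLA) models offer a scalable paradigm for robotic manipulation, but behavioral cloning on success-only demonstrations leaves them prone to failure when execution errors occur. This paper proposes Adaptive Failure-Informed Learning (AFIL), a framework that generates failure trajectories online and uses them as negative guidance to make VLA policies more robust. Targeting diffusion- and flow-based VLA policies, the method uses a pretrained VLA to generate failure samples and jointly trains dual action generators that share a vision-language backbone, enabling efficient failure-aware policy learning with limited parameter overhead; experiments show significant gains in success rate and robustness across a range of robotic manipulation tasks.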

Details
English abstract

Vision-language-action (VLA) models provide a promising paradigm for scalable robotic manipulation, yet their reliance on success-only behavioral cloning leaves them brittle; lacking corrective training signals, minor execution errors rapidly compound into unrecoverable, out-of-distribution failures. To address this limitation, we propose Adaptive Failure-Informed Learning (AFIL), an end-to-end framework that leverages failure trajectories as adaptive negative guidance for diffusion- and flow-based VLA policies. AFIL uses a pretrained VLA to generate failure rollouts online, avoiding the need for handcrafted failure-mode design or human-in-the-loop recovery. It then jointly trains Dual Action Generators (DAGs) for successful and failed behaviors while sharing a common vision-language backbone, enabling efficient failure-aware policy learning with limited parameter overhead. During sampling, the failure generator adaptively steers action generation away from failure-prone regions and toward more reliable success modes, with guidance strength determined by the per-diffusion-step distance between success and failure distributions. Experiments across in-domain and out-of-domain robotic manipulation tasks, covering both short- and long-horizon settings, show that AFIL consistently improves task success rates and robustness over existing VLA baselines, demonstrating its effectiveness, efficiency, and generality.

2605.08133 2026-05-13 cs.CV cs.AI

VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

Rui Zhao, Haofeng Hu, Zhenhai Gao, Jiaqiao Liu, Gao Fei

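Affiliations * College of Automotive Engineering; The National Key Laboratory of Automotive Chassis Integration and Bionics; ReeFocus AI Technology

AI summary This paper proposes VLADriver-RAG, a retrieval-augmented vision-language-action model for autonomous driving. By introducing structure-aware retrieval of historical knowledge, the model addresses the limited generalization of conventional VLA models in long-tail scenarios. The approach converts visual inputs into spatiotemporal semantic graphs and uses a scenario-aligned embedding model to improve retrieval relevance, setting a new state of the art on the Bench2Drive benchmark with a Driving Score of 89.12.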

Details
English abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose VLADriver-RAG, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a Visual-to-Scenario mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a Scenario-Aligned Embedding Model that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.

2605.07637 2026-05-13 cs.AI cs.LG cs.MA

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

Valeriy Vyaltsev, Alsu Sagirova, Anton Andreychuk, Oleg Bulichev, Yuri Kuratov, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

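Affiliations * GitHub

AI summary This paper studies large-scale multi-agent pathfinding (MAPF), aiming to improve how efficiently multiple agents coordinate in a shared environment. The authors propose a decentralized reinforcement learning approach with a learnable local communication module that lets neighboring agents exchange information over multiple communication rounds and cooperate more effectively. Experiments show that the method outperforms existing imitation-learning- and reinforcement-learning-based MAPF solvers across a variety of unseen test scenarios while remaining scalable.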

Details
English abstract

Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single-agent perspective) as a Dec-POMDP in which, at each time step, an agent must choose an action based on its local observation, and they typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms existing learning-based MAPF solvers, including IL- and RL-based approaches, across diverse metrics on a wide range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.

2605.07076 2026-05-13 cs.CL cs.LG

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan

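Affiliations * Georgia Institute of Technology

AI summary This paper studies how large language models can effectively incorporate new knowledge while receiving a continuous stream of information, and proposes SCoL, a post-training framework in which the model generates update instructions from the current context and selectively updates the parameters of its own Transformer layers, adding new information while retaining existing knowledge. Trained with meta-reinforcement learning and supervised rewards, SCoL outperforms several baselines in knowledge incorporation and long-term retention and scales well.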

Comments 9 pages

Details
English abstract

Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.