arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09975 2026-05-12 cs.LG math.OC

Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs

Hoyeol Yoon, Seoungbin Bae, Nam Ho-Nguyen, Dabeen Lee

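Affiliations: Department of Industrial & Systems Engineering, KAIST; Discipline of Business Analytics, The University of Sydney; Department of Mathematical Sciences, Seoul National University

AI summary: This paper studies update-direction selection in the multi-objective optimization that arises when training physics-informed neural networks (PINNs), and proposes a Chebyshev-center-based method for choosing the update direction. By formulating direction selection as a Chebyshev-center problem in the dual cone, the method can be solved efficiently in a low-dimensional space and comes with a convergence guarantee in the nonconvex setting. It unifies key properties targeted by existing methods, offers an interpretable geometric criterion, and performs strongly on several benchmarks.

Abstract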

Physics-informed neural networks (PINNs) are a promising approach for solving partial differential equations (PDEs). Their training, however, is often difficult because multiple loss terms induced by PDE residuals and boundary or initial conditions must be optimized simultaneously. To address this difficulty, existing approaches often construct update directions by explicitly enforcing particular desirable properties, such as scale robustness and simultaneous descent. While effective in many cases, such property-by-property designs can make it unclear which conditions are essential, what geometric principle determines the selected update direction, and how different methods are structurally related. In this work, we formulate update-direction selection for PINN training as a Chebyshev-center problem in the dual cone. The proposed formulation selects a normalized direction that maximizes the minimum distance to the cone facets. The resulting formulation admits an efficient dual problem in a much lower-dimensional space and yields a convergence guarantee in the nonconvex setting. It also recovers the key desirable properties targeted by existing approaches without imposing them separately; rather, they follow from the single geometric criterion underlying the formulation. This makes the selected direction interpretable through a single geometric rule and provides a unified basis for systematically comparing related direction-selection methods. Experiments on several PINN benchmarks further demonstrate strong empirical performance of the proposed method.
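One way to read the geometric criterion in this abstract, offered as a minimal sketch rather than the authors' algorithm: assume each loss gradient g_i defines one facet of the dual cone, so the distance from a unit direction d to facet i is <g_i, d> / ||g_i||. Maximizing the smallest such distance over the unit ball is then dual to finding the minimum-norm point in the convex hull of the normalized gradients, a problem whose dimension equals the number of loss terms. The function name `chebyshev_direction`, the Frank-Wolfe solver, and the toy gradients are all illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / n for x in v]

def combine(weights, vectors):
    """Weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]

def chebyshev_direction(grads, iters=2000):
    """Approximate argmax over ||d|| <= 1 of min_i <g_i / ||g_i||, d>.

    Assumed dual: minimize ||sum_i lam_i * u_i|| over the simplex, where u_i are
    the normalized gradients; solved here with plain Frank-Wolfe steps.
    """
    units = [normalize(g) for g in grads]
    k = len(units)
    lam = [1.0 / k] * k
    for t in range(iters):
        w = combine(lam, units)
        scores = [dot(u, w) for u in units]        # gradient of 0.5 * ||w||^2 in lam
        j = min(range(k), key=scores.__getitem__)  # best simplex vertex
        step = 2.0 / (t + 2.0)
        lam = [(1.0 - step) * l for l in lam]
        lam[j] += step
    return normalize(combine(lam, units)), lam

# Two toy loss gradients whose scales differ by a factor of 10.
grads = [[1.0, 0.0], [0.0, 10.0]]
d, lam = chebyshev_direction(grads)
margins = [dot(normalize(g), d) for g in grads]  # distance of d to each facet
```

In this toy case both margins come out close to 1/sqrt(2): the direction is unaffected by the difference in gradient scale and has a positive inner product with every gradient, which matches the scale-robustness and simultaneous-descent properties the abstract mentions. A parameter update would step along -d.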

2605.09973 2026-05-12 cs.CL cs.AI

GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

Urchade Zaratiana, Ash Lewis, George Hurn-Maloney

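Affiliations: OpenAI Zaratiana et al.

AI summary: This paper presents GLiNER2-PII, a small model for multilingual extraction of personally identifiable information (PII) that recognizes 42 PII entity types. To address scarce training data and privacy risks, the authors build a multilingual synthetic corpus of 4,910 annotated texts, using a constraint-driven generation pipeline to produce diverse, realistic examples. Experiments show that GLiNER2-PII achieves the highest span-level F1 on the SPY benchmark, outperforming the other compared systems, including OpenAI Privacy Filter.

Comments: Under submission

Abstract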

Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.

2605.09972 2026-05-12 cs.RO cs.CV

HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving

Zhongyu Xia, Guanyu Zhu, Guo Tang, Wenhao Chen, Yongtao Wang

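Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: HiDrive is a new closed-loop autonomous driving benchmark designed to address the limits of existing benchmarks in scenario diversity, object variety, and the driving capabilities they evaluate. It emphasizes long-tail scenarios, introduces a range of rare objects and complex traffic situations, and extends evaluation to advanced capabilities such as rule compliance, moral reasoning, and emergency decision-making. Built on a more advanced physics engine, HiDrive provides realistic lighting and high-fidelity rendering, giving a more challenging testbed for how autonomous driving systems perform in complex real-world environments.

Abstract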

End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at https://github.com/VDIGPKU/HiDrive.

2605.09969 2026-05-12 cs.LG cs.CL

The Truth Lies Somewhere in the Middle (of the Generated Tokens)

Sophie L. Wang, Phillip Isola, Brian Cheung

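Affiliations: MIT

AI summary: This paper studies how to collapse autoregressively generated hidden states into a representation that reflects a language model's internal state. The authors find that mean pooling over the generated hidden states yields more semantically informative representations than any single token, and validate this with kernel alignment in language, vision, and protein domains. They also show that representations from generated tokens outperform those from prompt tokens, and that alignment over the course of generation reveals interpretable dynamics in model behavior.

Abstract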

How should hidden states generated autoregressively be collapsed into a representation that reflects a language model's internal state? Despite tokens being generated under causal masking, we find that mean pooling across their hidden states yields more semantic representations than any individual token alone. We quantify this through kernel alignment to reference spaces in language, vision, and protein domains. The improvement through mean pooling is consistent with information being distributed across generated tokens rather than localized to a single position. Furthermore, representations derived from generated tokens outperform those from prompt tokens, and alignment across generation reveals interpretable dynamics in model behavior.
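The pooling step described here can be sketched in a few lines. This is only an illustration under stated assumptions: hidden states are given as one vector per token, `prompt_len` marks where generation begins, and the names `mean_pool` and `pooled_generation_state` are hypothetical. The kernel-alignment metric is not specified in the abstract and is not reproduced.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pooled_generation_state(hidden_states, prompt_len):
    """Mean-pool the hidden states of generated tokens only (prompt excluded)."""
    generated = hidden_states[prompt_len:]
    if not generated:
        raise ValueError("no generated tokens to pool")
    return mean_pool(generated)

# Toy example: one prompt token followed by two generated tokens.
hidden_states = [[1.0, 1.0], [3.0, 5.0], [5.0, 7.0]]
pooled = pooled_generation_state(hidden_states, prompt_len=1)  # [4.0, 6.0]
last_token = hidden_states[-1]  # single-token baseline for comparison
```

The single-token baseline keeps only the final position, whereas the pooled vector draws on every generated position, which is the contrast the abstract evaluates.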

2605.09967 2026-05-12 cs.LG

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

Andrew Lee, Fernanda Viégas, Martin Wattenberg

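Affiliations: Harvard University; Google DeepMind

AI summary: This study examines how concepts represented as linear directions in language models relate to relational structure. Working with a model trained in a highly structured domain, the board game Othello, it finds that although the model's internal board state is linearly decodable, the representation also contains tensor product representation (TPR) structure. By training TPR probes, the study shows that the structure captured by linear probes is a projection of a richer structure, and that the linear directions can be recovered directly from the TPR probe's parameters. The findings suggest that directional representations may be projections of more structured underlying representations.

Abstract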

While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain -- the board game Othello. While the model's internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model's board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board, but perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.

2605.09963 2026-05-12 cs.CV

Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang

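Affiliations: Nanyang Technological University, Singapore; Warsaw University of Technology, Poland

AI summary: Existing self-supervised learning methods mainly learn object-invariant representations and often neglect the spatial structure and relationships among object parts. To address this, the paper proposes Spatial Prediction (SP), a spatially aware pretext task that learns fine-grained spatial dependencies by predicting the relative position and scale between two disentangled local views of the same image. Experiments show consistent gains on image recognition, fine-grained classification, semantic segmentation, and depth estimation, along with improved robustness in out-of-distribution settings.

Abstract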

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.

2605.09959 2026-05-12 cs.LG cs.AI cs.CL cs.ET

G-Zero: Self-Play for Open-Ended Generation from Zero Data

Chengsong Huang, Haolin Liu, Tong Zheng, Runpeng Dai, Langlin Huang, Jinyuan Li, Zongxia Li, Zhepei Wei, Yu Meng, Jiaxin Huang

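Affiliations: Washington University in St. Louis; University of Virginia; University of Maryland; University of North Carolina at Chapel Hill

AI summary: This paper proposes G-Zero, a self-evolving framework that lets large language models improve autonomously and continuously without external evaluation, aimed at open-ended generation tasks. Its core is Hint-δ, an intrinsic reward that guides optimization through the shift in the Generator's own predictions, combined with co-evolutionary training of a Proposer model and a Generator model. Because it does not rely on an external judge, the method avoids reward hacking and capability bottlenecks, offering a scalable and robust route to model self-evolution in unverifiable domains.

Abstract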

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$\delta$, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtration keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.

2605.09956 2026-05-12 cs.CV cs.AI

SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis

Peng Jia, Zhen Xiao, Jia Li, Xueliang Liu, Zhenzhen Hu, Lingyun Yu

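Affiliations: Hefei University of Technology; University of Science and Technology of China

AI summary: This paper proposes SDTalk, a one-shot 3D Gaussian Splatting (3DGS) framework for high-quality, real-time talking head synthesis that generalizes to unseen identities without personalized training. By introducing structured facial priors and a dual-branch motion field, it improves the completeness of head reconstruction and the detail of facial dynamics, respectively, and outperforms existing methods in visual quality and inference efficiency.

Comments: 5 pages, 4 figures, 4 tables

Abstract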

High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.

2605.09955 2026-05-12 cs.CL

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks

Tadesse Destaw Belay, Ibrahim Said Ahmad, Idris Abdulmumin, Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Olga Kolesnikova, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam

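Affiliations: Instituto Politécnico Nacional; University of Wisconsin–Stevens Point; University of Pretoria; Bahir Dar University; Imperial College London; University of Hamburg

AI summary: This paper studies how to model annotator disagreement effectively in subjective NLP tasks. The authors propose an agreement-based clustering method that captures and models differences in annotator perspectives to improve label aggregation. Experiments on 40 datasets in 18 languages show that the method makes fuller use of annotator perspectives than majority voting or individual annotator modeling and significantly improves classification performance. The study also compares several aggregation strategies and finds that multi-label and multitask approaches work better for clustered annotators.

Comments: Pre-MIT Press publication version

Abstract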

Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments on 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than the ensemble and majority-vote approaches.

2605.09954 2026-05-12 cs.RO cs.CV

JODA: Composable Joint Dynamics for Articulated Objects

Tianhong Gao, Cheng Yu, Yinghao Xu, Mengyu Chu

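Affiliations: Peking University; Ant Group, Robbyant

AI summary: This paper proposes JODA, a composable framework for generating joint-level dynamics that captures fine-grained mechanical behaviors such as frictional holding, snap latching, and soft closing. JODA describes conservative forces, dry friction, and damping as a structured three-channel field over the joint degree of freedom and, using shape-constrained piecewise cubic interpolation, yields a dynamics model that is expressive and compatible with differentiable simulation. The method supports inferring and refining joint dynamics from multimodal inputs, providing a unified interface for modeling, editing, and optimization.

Abstract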

Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.

2605.09951 2026-05-12 cs.LG

Generating synthetic electronic health record data using agent-based models to evaluate machine learning robustness under mass casualty incidents

Roben Delos Reyes, Daniel Capurro, Nicholas Geard

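Affiliations: School of Computing and Information Systems, The University of Melbourne; Department of Medicine, The University of Melbourne

AI summary: This study proposes an agent-based modelling approach for generating synthetic electronic health record (EHR) data to evaluate the robustness of machine learning models under extreme conditions such as mass casualty incidents (MCIs). Using real-world EHR data, it builds an agent-based model of an emergency department that simulates patient arrivals, resource capacity, and clinical workflow, and generates synthetic data reflecting MCI scenarios by changing system conditions. Experiments show consistent declines in recall under MCI conditions, highlighting how system changes affect model performance and patient outcomes and offering a new way to assess the reliability of healthcare AI in complex environments.

Comments: 14 pages, 1 figure; accepted at CHIL 2026

Abstract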

ML models in healthcare are typically evaluated using curated real-world EHR data. A key limitation of such evaluations is that they may fail to assess the robustness of ML models to changes in the data at deployment, which is a common issue because EHR data used for ML model development cannot capture all such changes. Mass casualty incidents (MCIs) caused by disasters are critical instances where this will be an issue, as they induce rare, uncertain, and novel changes to routine system conditions. Because real-world EHR data from MCIs are often limited or unavailable, assessing ML robustness under such conditions before deployment remains challenging. Here, we propose an agent-based modelling approach for generating synthetic EHR data to evaluate the robustness of ML models under MCI scenarios. We use real-world EHR data to develop and calibrate an agent-based model (ABM) of an emergency department (ED) that explicitly models patient arrivals, resource capacity, and clinical workflow. By changing these system conditions to reflect plausible MCI scenarios, the ED model generates synthetic versions of the real-world EHR data that exhibit shifts in system behaviour. Using these synthetic data, we test ML models for predicting length of stay. We observed consistent declines in recall under MCI conditions relative to baseline system conditions, resulting in an increase in the number of patients with prolonged length of stay that were missed by the ML models. These results highlight the impact of changes in system conditions on patient outcomes, EHR data, and ML model performance. Our work establishes ABM-based synthetic EHR data generation as a proactive and systematic approach for evaluating the robustness of ML models under MCI or other system conditions not captured in real-world EHR data, supporting the safer and more effective deployment of ML models in healthcare systems.

2605.09950 2026-05-12 cs.LG cs.AI

Novel GPU Boruta algorithms for feature selection from high-dimensional data

Xurui Li, Zhiguo Gan, Jiaming Zhang, Zheng Liu, Diannan Lu

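Affiliations: Department of Chemical Engineering, Tsinghua University

AI summary: To address the inefficiency of conventional feature selection algorithms on CPUs for high-dimensional data, this paper proposes two GPU-accelerated Boruta feature selection methods, Boruta-Permut and Boruta-TreeImp, based on permutation feature importance and impurity-reduction importance, respectively. Experiments show that both methods greatly improve computational efficiency while keeping selection accuracy comparable to the original Boruta algorithm, offering an efficient and cost-effective option for feature selection on large-scale data.

Comments: This paper has been submitted to the journal Data Mining and Knowledge Discovery, and a preprint is available for the authors' records

Abstract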

Most feature selection algorithms, especially wrapper methods, run inefficiently on CPU-based platforms because of their high computational complexity. This inefficiency makes them unsuitable for processing large-scale datasets. To address this challenge, the present study proposes two GPU-accelerated versions of the Boruta feature selection procedure: Boruta-Permut relies on permutation-based feature importance, and Boruta-TreeImp employs importance based on impurity reduction. To evaluate these methods, we conducted experiments on both a self-constructed dataset and several publicly available datasets. The experimental results show that the proposed GPU-accelerated algorithms greatly improve computational efficiency while preserving feature selection accuracy comparable to the original Boruta algorithm. In our analysis, we also observe that the impurity-reduction-based version can overestimate the importance of some features. Overall, these findings suggest that performing Boruta feature selection on GPUs offers an effective and cost-efficient solution for large-scale data analysis.
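For readers unfamiliar with Boruta, the core test it repeats can be sketched as follows: each feature is compared against "shadow" copies whose values have been shuffled, and a feature scores a hit only when its importance exceeds the best shadow importance. This is a CPU-only illustration of a single round and not the paper's GPU implementation. The importance function `abs_covariance` is a simple stand-in assumed here in place of the random-forest permutation or impurity importances, and all names are hypothetical.

```python
import random

def shadow_columns(columns, rng):
    """Return a shuffled copy of every feature column (the shadow features)."""
    shadows = []
    for column in columns:
        shuffled = list(column)
        rng.shuffle(shuffled)
        shadows.append(shuffled)
    return shadows

def abs_covariance(xs, ys):
    """Stand-in importance score: absolute covariance between feature and target."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return abs(sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))) / len(xs)

def boruta_hits(columns, target, importance, rng):
    """One Boruta round: a feature is a hit if it beats the best shadow feature."""
    shadows = shadow_columns(columns, rng)
    threshold = max(importance(shadow, target) for shadow in shadows)
    return [importance(column, target) > threshold for column in columns]

target = [float(i) for i in range(20)]
informative = list(target)   # perfectly aligned with the target
constant = [1.0] * 20        # carries no information about the target
hits = boruta_hits([informative, constant], target, abs_covariance, random.Random(0))
```

The full procedure runs many such rounds and applies a statistical test to the accumulated hit counts before confirming or rejecting a feature. Repeating model fits over real and shadow features is the expensive part, which is the cost the abstract targets with GPU acceleration.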

2605.09949 2026-05-12 cs.LG

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

Zehao Li, Yasuhiro Yoshikai, Shumpei Nemoto, Hiroyuki Kusuhara, Tadahaya Mizuno

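Affiliations: Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo; The Institute of Statistical Mathematics (ISM), Research Organization of Information and Systems

AI summary: This study examines how chemical language models learn chemical meaning from molecular string representations rather than relying only on surface-level string patterns, focusing on chirality as a demanding test case. It presents Pan-CORE, a Transformer-based SMILES translation model, and uses high-temporal-resolution analysis of training to reveal how chiral information is learned. The study finds an abrupt jump in chiral learning after a long plateau, suggesting that the difficulty stems not only from model capacity but also from the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry further point to the central role of the encoder in learning chiral information.

Abstract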

Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, seen as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.

2605.09948 2026-05-12 cs.AI cs.CV cs.RO

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie, Qiang Li, Zhiwei Wang

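Affiliations: Huazhong University of Science and Technology; Wuhan United Imaging Surgical Co., Ltd. (UIS)

AI summary: Current vision-language-action (VLA) models usually treat the deepest representation of the vision-language backbone as the optimal input for action prediction, but robotic manipulation requires frequent closed-loop spatial adjustments, and excessive abstraction can waste computation and weaken the low-level geometric cues needed for precise control. The paper proposes LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation: a shared Transformer block iteratively refines multimodal features and, at each step, produces a candidate action and a sufficiency score that determines whether further refinement is needed. Experiments show that LoopVLA improves efficiency while maintaining task success, reducing parameters by 45% and raising inference throughput by up to 1.7 times.

Abstract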

Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
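The control flow this abstract describes, a shared block applied repeatedly with a candidate action and a sufficiency score at every iteration, can be sketched as a simple loop. The exit rule below (stop once the score reaches a threshold) is an assumption about how the score would be used at inference time, and the toy `block`, `action_head`, and `sufficiency_head` functions are hypothetical stand-ins for the learned modules.

```python
def refine_until_sufficient(tokens, block, action_head, sufficiency_head,
                            threshold=0.9, max_iters=8):
    """Apply one shared block repeatedly; stop when the score says 'sufficient'."""
    action = None
    steps = 0
    for steps in range(1, max_iters + 1):
        tokens = block(tokens)                    # same parameters every iteration
        action = action_head(tokens)              # candidate action at this depth
        if sufficiency_head(tokens) >= threshold:
            break                                 # further refinement judged unnecessary
    return action, steps

# Toy stand-ins: each pass moves every token halfway toward 1.0.
def block(tokens):
    return [x + 0.5 * (1.0 - x) for x in tokens]

def action_head(tokens):
    return 2.0 * tokens[0]

def sufficiency_head(tokens):
    return min(tokens)

action, steps = refine_until_sufficient([0.0], block, action_head, sufficiency_head)
```

Because the block is shared, the number of iterations is not tied to a fixed layer index, which is the decoupling the abstract points to. In the toy run the loop stops after four passes, when the token reaches 0.9375.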

2605.09945 2026-05-12 cs.LG

Selection of the Best Policy under Fairness Constraints for Subpopulations

Tingyu Zhu, Yuhang Wu, Zeyu Zheng

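Affiliations: Department of Industrial Engineering and Operations Research; Berkeley Artificial Intelligence Research Lab; University of California, Berkeley

AI summary: This paper studies how to select the best policy under fairness constraints across subpopulations, requiring that the chosen policy perform no worse than a given threshold in every pre-specified subpopulation. The authors propose an algorithm, T-a-S-CS, that efficiently identifies the policy with the best average performance while guaranteeing fairness, and they give a lower bound on the sample complexity of the problem. Experiments show substantial efficiency gains over existing policy-level allocation methods.

Abstract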

Many high-stakes decisions in health care, public policy, and clinical development require committing to a single policy that will be applied uniformly across a heterogeneous population. Regulatory and fairness standards sometimes require that the chosen policy perform adequately in every pre-specified subpopulation, not only on average. We formalize this as a Selection of the Best with Fairness Constraints (SBFC) problem, in which the goal is to identify the policy with the highest average performance among those policies that meet a minimum per-subpopulation threshold. We establish an instance-specific lower bound on the sample complexity of the SBFC problem. We then develop a Track-and-Stop with Constraints on Subpopulation (T-a-S-CS) algorithm that achieves the lower bound asymptotically. We extend the framework to general closed-set and penalty-based fairness specifications with matching guarantees. Numerical experiments and a case study using the International Stroke Trial demonstrate substantial efficiency gains over policy-level allocation baselines.
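The selection target defined in this abstract can be illustrated for the idealized case in which the true mean performance of each policy in each subpopulation is already known. The sketch below shows only that oracle rule; it is not the T-a-S-CS sampling algorithm, and the function name, the population weights, and the numbers are hypothetical.

```python
def select_best_fair_policy(means, weights, threshold):
    """Return the feasible policy with the highest weighted-average performance.

    means:     policy name -> list of mean performance per subpopulation
    weights:   population share of each subpopulation (assumed to sum to 1)
    threshold: minimum acceptable performance in every subpopulation
    """
    best_name, best_average = None, float("-inf")
    for name, per_group in means.items():
        if min(per_group) < threshold:
            continue  # violates the per-subpopulation fairness constraint
        average = sum(w * m for w, m in zip(weights, per_group))
        if average > best_average:
            best_name, best_average = name, average
    return best_name

means = {
    "A": [0.9, 0.3],    # best on average, but fails the second subpopulation
    "B": [0.6, 0.5],
    "C": [0.5, 0.55],
}
chosen = select_best_fair_policy(means, weights=[0.5, 0.5], threshold=0.4)
```

Here policy A has the highest average but is excluded by the threshold, so B is selected. In the setting of the abstract the means are unknown and must be estimated from samples, which is where the sample-complexity question arises.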

2605.09944 2026-05-12 cs.RO

Explicit Stair Geometry Conditioning for Robust Humanoid Locomotion

Jianguo Zhang, Wentai Xu, Shusheng Ye, Yuxiang He, Weimin Qi, Qinbo Sun, Ning Ding, Liguang Zhou

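Affiliations: Shenzhen Institute of Artificial Intelligence and Robotics for Society; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: To improve the robustness of humanoid robots walking on challenging stairs, this paper proposes a control framework based on explicit stair geometry conditioning. The method extracts interpretable geometric parameters such as step height, step depth, and yaw angle and feeds them directly to the policy network, enabling proactive adjustment of gait parameters. Experiments show strong generalization and stability in both simulation and the real world, including an outdoor test of 33 consecutive steps that demonstrates practical applicability.

Comments: 8 pages, 7 figures, 4 tables

Abstract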

Robust humanoid stair climbing remains challenging due to geometric discontinuities, sensitivity to step height variations, and perception uncertainty in real-world environments. Existing learning-based locomotion policies often rely on implicit terrain representations or blind proprioceptive feedback, limiting their ability to generalize across varying stair geometries and to anticipate required gait adjustments. This paper proposes an explicit stair geometry conditioning framework for robust humanoid stair climbing. Instead of encoding terrain as high-dimensional latent features, we extract a compact set of interpretable geometric parameters, including step height, step depth, and current yaw angle relative to the robot heading. These explicit stair parameters directly condition a Proximal Policy Optimization (PPO)-based locomotion policy, enabling proactive modulation of swing-foot clearance and stride characteristics according to stair structure. Simulation experiments demonstrate improved generalization across unseen stair heights beyond the training distribution. Real-world experiments on the Unitree G1 humanoid validate reliable indoor and outdoor stair traversal. In challenging outdoor scenarios, the robot successfully ascends 33 consecutive steps without failure, demonstrating robustness and practical deployability.

2605.09942 2026-05-12 cs.AI

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

Dongming Jiang, Yi Li, Guanpeng Li, Qiannan Li, Bingzhe Li

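Affiliations: Department of Computer Science, The University of Texas at Dallas; Department of Electrical and Computer Engineering, University of Florida; University of California, Davis

AI summary: This paper proposes HAGE, a reinforcement-learning-based weighted multi-relational memory framework for memory retrieval in agentic large language model systems. HAGE recasts retrieval as sequential, query-conditioned graph traversal, organizes memory as relation-specific graph views over shared memory nodes, and encodes multiple relational signals in trainable relation feature vectors. A routing network dynamically modulates the dimensions of the edge embedding, and traversal scores combine semantic similarity with query-conditioned edge representations so that high-utility relational paths are prioritized. Experiments show higher accuracy on long-horizon reasoning tasks and a better accuracy-efficiency trade-off.

Abstract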

Memory retrieval in agentic large language model (LLM) systems is often treated as a static lookup problem, relying on flat vector search or fixed binary relational graphs. However, fixed graph structures cannot capture the varying strength, confidence, and query-dependent relevance of relationships between events. In this paper, we propose HAGE, a weighted multi-relational memory framework that reconceptualizes retrieval as sequential, query-conditioned traversal over a unified relational memory graph. Memory is organized as relation-specific graph views over shared memory nodes, where each edge is associated with a trainable relation feature vector encoding multiple relational signals. Given a query, an LLM-based classifier identifies the relational intent, and a routing network dynamically modulates the corresponding dimensions of the edge embedding. Traversal scores are computed via a learned combination of semantic similarity and these query-conditioned edge representations. This allows memory traversal to prioritize high-utility relational paths while softly suppressing noisy or weakly relevant connections. Beyond adaptive traversal, HAGE further introduces a reinforcement learning-based training framework that jointly optimizes routing behavior and edge representations using downstream tasks. Finally, empirical results demonstrate improved long-horizon reasoning accuracy and a favorable accuracy-efficiency trade-off compared to state-of-the-art agentic memory systems. Our code is available at https://github.com/FredJiang0324/HAGE_MVPReview.

2605.09939 2026-05-12 cs.RO

Neural Distance-Guided Path Integral Control for Tractor-Trailer Navigation

Peng Wei, Chen Peng, Stavros Vougioukas

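Affiliations: Department of Biological and Agricultural Engineering, University of California Davis; ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University

AI summary: This paper studies autonomous, safe navigation of tractor-trailer systems in complex agricultural environments. To handle their complex geometry and nonlinear dynamics, it proposes a real-time obstacle avoidance method built on a geometric neural encoder. The neural network quickly and accurately estimates distances between the tractor-trailer and the LiDAR-perceived environment, enabling dynamic geometric reasoning without a prior map, and the learned distances are incorporated into a Model Predictive Path Integral (MPPI) controller to improve navigation safety and responsiveness in complex environments. Simulation results confirm that the method generates dynamically feasible and safe trajectories.

Details
English abstract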

Autonomous and safe navigation of tractor-trailer systems requires accurate, real-time collision avoidance and dynamically feasible control, particularly in cluttered and complex agricultural environments. This is challenging due to their articulated, deformable geometries and nonlinear dynamics. Traditional methods oversimplify vehicle geometry or rely on precomputed distance fields that assume a known map, limiting their applicability in dynamic, partially unknown environments. To address these limitations, we propose a geometric neural encoder that provides fast and accurate distance estimates between the full tractor-trailer body and raw LiDAR perception, enabling real-time, map-free geometric reasoning. These learned distances are integrated into a Model Predictive Path Integral (MPPI) controller, allowing the system to incorporate true articulated geometry directly into its cost evaluation and enabling more responsive navigation in challenging agricultural settings. Simulation results demonstrate that the proposed framework generates dynamically feasible and safe trajectories for navigating tractor-trailer systems in cluttered and complex environments.
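For readers unfamiliar with MPPI, the sketch below shows the standard importance-weighted control update and one way a distance estimate could enter the cost. It is a generic illustration, not the authors' controller: `clearance_cost`, `safe_margin`, and `penalty` are hypothetical, and the distances are assumed to come from some learned estimator such as the encoder the abstract describes.

```python
import numpy as np

def mppi_update(nominal_controls, noise, costs, temperature=1.0):
    """One MPPI step: weight sampled perturbations by exp(-cost / temperature)."""
    # Subtracting the minimum cost avoids underflow and leaves the weights unchanged.
    shifted = costs - np.min(costs)
    weights = np.exp(-shifted / temperature)
    weights /= np.sum(weights)
    # noise has shape (samples, horizon, control_dim); average it with the weights.
    return nominal_controls + np.tensordot(weights, noise, axes=1)

def clearance_cost(distances, safe_margin=0.5, penalty=100.0):
    """Hypothetical collision cost from predicted body-to-obstacle distances."""
    # distances has shape (samples, horizon); penalize steps inside the margin.
    violation = np.clip(safe_margin - distances, 0.0, None)
    return penalty * violation.sum(axis=1)
```

Rollouts that pass too close to an obstacle receive a high cost and therefore a negligible weight, so the updated control sequence is dominated by the safer samples.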

2605.09936 2026-05-12 cs.CV cs.IR cs.LG

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini

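Affiliations: University of Auckland; University of Pennsylvania; Stanford University; Harbin Institute of Technology, Shenzhen

AI summary: This paper presents Urban-ImageNet, a large-scale multi-modal dataset and evaluation framework for perceiving urban space from social media images. The dataset contains 2 million public images with paired text from Weibo, covering 61 urban sites in 24 Chinese cities, and supports training and evaluation at scales from 1K to 2M. Built on a hierarchical taxonomy grounded in urban theory, Urban-ImageNet supports three tasks (urban scene semantic classification, cross-modal image-text retrieval, and instance segmentation) and is intended to evaluate how well AI models understand the social, functional, and spatial characteristics of urban space.

Details
English abstract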

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo at 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

2605.09934 2026-05-12 cs.CL

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang, He Bai, Yuchen Liu, Jingxuan Wei, Junnan Zhu

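Affiliations: Shenyang Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences)

AI summary: TRACER is a verifiable generative provenance framework for multimodal tool-using agents. It targets the "provenance gap" in current tool use, where generated claims lack explicit dependencies on supporting evidence. While generating each answer, TRACER also produces a structured provenance record that identifies the supporting tool call, evidence unit, and semantic relation, verifies the record along several dimensions, and then uses it for traceability constraints and local credit assignment in reinforcement learning. Experiments show that TRACER performs strongly on the TRACE-Bench benchmark and clearly outperforms existing methods, indicating that reliable multimodal tool reasoning depends on provenance-aware use of observations rather than on more tool calls alone.

Details
English abstract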

Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.

2605.09932 2026-05-12 cs.CL

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

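Affiliations: The Chinese University of Hong Kong; Huawei Technologies Co., Ltd

AI summary: Current large language models still struggle to use information spread across long contexts. This paper proposes FocuSFT, a fine-tuning method based on bilevel optimization that optimizes attention allocation during training, reducing the extent to which positional bias and attention sinks weaken attention to content-relevant tokens. The inner loop introduces lightweight fast-weight parameters that form a parametric memory and steer the model toward semantically relevant content, and the outer loop performs supervised fine-tuning conditioned on this optimization, improving performance on long-context tasks. Experiments show significant gains for FocuSFT on multiple benchmarks.

Details
English abstract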

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K-32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT

2605.09931 2026-05-12 cs.CL cs.AI

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang, Yuhang Tian, Huipeng Ma, Chenhao Li, Changzhi Zhou, Xudong Li, Shuhao Zhang

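Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, China; School of Computer Science and Technology, Huazhong University of Science and Technology, China; Independent, China

AI summary: PruneTIR is an inference-time method that improves both the effectiveness and efficiency of tool-integrated reasoning (TIR) for large language models that can already use tools. By pruning erroneous tool-call trajectories, resampling tool calls, and suspending tool use when necessary, it reduces the negative impact of erroneous calls on reasoning and keeps the model from getting stuck in repeated failures. Experiments show that PruneTIR significantly improves reasoning accuracy and efficiency while shortening the context length needed for inference.

Details
English abstract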

Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and can help LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved successfully within a few subsequent turns. If not, LLMs often struggle to resolve such errors even with many additional turns. Building on these observations, we propose PruneTIR, an effective yet efficient framework that enhances tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. These three components enable PruneTIR to mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.

2605.09929 2026-05-12 cs.LG cs.SE

TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications

Pranshav Gajjar, Emmanuel Ojo, Vijay K Shah

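Affiliations: NextG Wireless Lab, North Carolina State University

AI summary: This paper introduces TeleResilienceBench, which evaluates how well large language models in the telecommunications domain recover from incomplete or flawed reasoning, a capability termed "reasoning resilience". The benchmark collects failure cases from a weak generator model, truncates the flawed reasoning, and asks a target model to continue and correct it, thereby quantifying recovery. Even the strongest model recovers only 29.1% of the time, and larger scale does not always improve resilience; Nemotron-3-nano 4b offers the best resilience-to-cost ratio. The study also finds that difficulty labels in current telecom benchmarks reflect knowledge coverage more than reasoning depth.

Details
English abstract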

Deploying large language models in telecommunications requires more than task accuracy. In realistic workflows, a model may inherit partially completed reasoning from a prior step, an upstream agent, or its own earlier generation, and must continue that reasoning even when it is already going wrong. We introduce TeleResilienceBench, a benchmark that quantifies this capability, which we term reasoning resilience, across seven telecom sub-domains drawn from the GSMA Open-Telco LLM suite. Instances are constructed by collecting failures from a weak generator model, truncating the flawed reasoning trace at its midpoint, and asking a target model to continue and correct it. We propose the Correct Flip Rate (CFR) as a direct measure of successful recovery and evaluate eight models spanning the Qwen3.5, Gemma4, and Nemotron-3 families. Our results show that even the strongest model achieves a macro-average CFR of only 29.1%, and scale does not reliably improve resilience within families. Nemotron-3-nano 4b outperforms all Qwen3.5 variants including the 27b model and leads the auxiliary TeleMath numerical evaluation at 23.4% CR%, offering the best resilience-to-cost ratio in the set. A difficulty-stratified analysis further reveals that existing telecom benchmark difficulty labels reflect factual specificity rather than reasoning depth, suggesting that current evaluations measure knowledge coverage more than reasoning ability.
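The abstract does not spell out the formula for the Correct Flip Rate, so the sketch below assumes the simplest reading: every benchmark instance starts from a flawed, truncated trace, and CFR is the fraction of instances where the target model's continuation ends in a correct answer. The function names and the per-domain dictionary are hypothetical.

```python
def correct_flip_rate(continuation_correct):
    """Assumed CFR: share of initially failed instances the target model recovers."""
    # One boolean per instance; True means the wrong trace was flipped to a correct answer.
    if not continuation_correct:
        raise ValueError("need at least one instance")
    return sum(continuation_correct) / len(continuation_correct)

def macro_average(per_domain_rates):
    """Unweighted mean of per-domain rates, so small sub-domains count equally."""
    return sum(per_domain_rates.values()) / len(per_domain_rates)
```

Under this reading, the reported macro-average of 29.1% would be the unweighted mean of the seven per-domain rates rather than a pooled rate over all instances.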

2605.09925 2026-05-12 cs.CV

Frequency Adapter with SAM for Generalized Medical Image Segmentation

Phuoc-Nguyen Bui, Van-Nguyen Pham, Duc-Tai Le, Junghyun Bum, Hyunseung Choo

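Affiliations: Sungkyunkwan University, Korea

AI summary: Medical image segmentation matters for computer-aided diagnosis and treatment planning, but deep learning models often fail to generalize across datasets because of differences in imaging protocols, scanners, and patient populations. This paper proposes FSAM, a generalized medical image segmentation method based on frequency-domain adaptation that combines Low-Rank Adaptation (LoRA) with a frequency adapter module to extract domain-invariant high-frequency features and improve generalization from a single source domain. Experiments show that the method outperforms traditional domain generalization methods and SAM-based domain generalization methods on fundus and prostate datasets.

Comments Under review, 10 pages, 1 figure, 2 tables

Details
English abstract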

Medical image segmentation is a critical task in computer-aided diagnosis and treatment planning. However, deep learning models often struggle to generalize across datasets due to domain shifts arising from variations in imaging protocols, scanner types, and patient populations. Traditional domain generalization (DG) methods utilize causal feature learning, adversarial consistency, and style augmentation to improve segmentation robustness. While effective, these approaches rely on explicit feature alignment, adversarial objectives, or handcrafted augmentations, which may not fully exploit the capabilities of foundation models. Recently, the Segment Anything Model (SAM) has demonstrated strong generalization capabilities in segmentation tasks. SAM-based DG methods attempt to improve medical image segmentation. However, these approaches primarily operate in the spatial domain and overlook frequency-based discrepancies that significantly affect model robustness. In this work, we propose Frequency-based Domain Generalization with SAM (FSAM), a novel framework that integrates Low-Rank Adaptation (LoRA) for efficient fine-tuning and a frequency adapter to incorporate frequency-domain representations for single-source domain generalization. FSAM enhances SAM's segmentation robustness by extracting domain-invariant high-frequency features, mitigating frequency-related domain shifts. Experimental results on fundus and prostate datasets demonstrate that FSAM outperforms existing traditional DG and SAM-based DG approaches in domain generalization. Codes and pre-trained models will be made available on GitHub.

2605.09924 2026-05-12 cs.CL

Evolving Knowledge Distillation for Lightweight Neural Machine Translation

Xuewen Zhang, Haixiao Zhang, Xinlong Huang

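Affiliations: Department of Content Generation; Li Auto Beijing, China

AI summary: This paper proposes Evolving Knowledge Distillation (EKD), a progressive knowledge distillation framework that addresses the difficulty of deploying large neural machine translation models on resource-constrained devices. By having the student model learn step by step from a sequence of teacher models of gradually increasing capacity, EKD narrows the performance gap between teacher and student. Experiments show significant improvements on multiple benchmark datasets, with the final student performing very close to the large teacher model.

Details
English abstract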

Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar trends are observed on other datasets. These results demonstrate that EKD effectively bridges the capacity gap, enabling compact models to achieve performance close to that of much larger teacher models. Code and models are available at https://github.com/agi-content-generation/EKD.

2605.09922 2026-05-12 cs.CL cs.AI

Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

Wu Li, Yigeng Zhou, Zesheng Shi, Yequan Wang, Min Zhang, Jing Li

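Affiliations: Harbin Institute of Technology, Shenzhen, China; Beijing Academy of Artificial Intelligence, Beijing, China

AI summary: This paper proposes TPAW, a team-based self-play algorithm for improving the alignment of large language models in a fully self-supervised setting. The current policy model collaborates with and competes against historical checkpoints, which improves training stability and efficiency, and two adaptive weighting mechanisms adjust the importance of target responses and the contribution of each team member during training. Experiments show that TPAW outperforms existing methods across various base models and LLM benchmarks.

Comments Accepted by ACL 2026 Main

Details
English abstract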

While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; and (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member's contribution during training. Initialized from an SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.

2605.09920 2026-05-12 cs.LG cs.AI

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

Xuexiang Wen, Hang Yu, Linchao Zhu, Gaoang Wang

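Affiliations: Zhejiang University; Ant Group

AI summary: This paper proposes VIGOR, a verifier-free reinforcement learning method for post-training large language models. It uses the gradient norm of the policy model on its own generated text as an intrinsic reward signal, steering the model toward outputs that better match the current policy. By correcting the length bias of the gradients and applying group-wise rank shaping, VIGOR makes the reward signal more stable and effective, and it outperforms existing methods on mathematical reasoning and code generation tasks.

Comments Accepted to Findings of ACL 2026

Details
English abstract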

While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependency on the gold label or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller $\ell_2$ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, lower gradient norms suggest the completion aligns better with the current policy, serving as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a $\sqrt{T}$ scaling, and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over this baseline, while exhibiting more stable training dynamics. The code is available at https://github.com/ZJUSCL/VIGOR.
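The reward shaping described above can be sketched as follows. This is not the released VIGOR code: the function name is hypothetical, the gradient norms are taken as precomputed inputs, the direction of the $\sqrt{T}$ correction (multiplying the token-averaged gradient norm by $\sqrt{T}$) is an assumption, and the linear rank-to-reward map onto $[-1, 1]$ is one possible form of rank shaping.

```python
import numpy as np

def vigor_style_rewards(grad_norms, lengths):
    """Within-group rewards from gradient norms (illustrative sketch only)."""
    grad_norms = np.asarray(grad_norms, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Assumed length correction: rescale each token-averaged gradient norm by sqrt(T)
    # so that longer completions are not favored merely for being long.
    corrected = grad_norms * np.sqrt(lengths)
    # Rank 0 goes to the smallest corrected norm; ties are broken by position.
    ranks = np.argsort(np.argsort(corrected))
    n = len(corrected)
    if n == 1:
        return np.zeros(1)
    # Assumed rank shaping: map ranks linearly onto [-1, 1], best rank gets +1.
    return 1.0 - 2.0 * ranks / (n - 1)
```

Because only ranks within the group matter, the reward scale is the same for every prompt regardless of how large the raw gradient norms are.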

2605.09918 2026-05-12 cs.LG cs.AI cs.CY

NaiAD: Initiate Data-Driven Research for LLM Advertising

Yihang Zhang, Zimeng Huang, Ren Zhai, Yipeng Kang, Tonghan Wang

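Affiliations: Tsinghua University; College of AI; Department of Literature, Arts and Communication; Anhui International Studies University; State Key Laboratory of General Artificial Intelligence, BIGAI

AI summary: This paper presents NaiAD, the first comprehensive dataset designed for large language model (LLM) advertising, containing 58,999 carefully constructed ad-embedded responses with their corresponding user queries. The dataset is built around theoretically grounded evaluation metrics that separately and comprehensively capture user and commercial utility, and a decoupled generation pipeline mitigates the dimensional collinearity of aligned LLMs to produce structurally diverse samples. The study also introduces a scoring framework based on Variance-Calibrated Prediction-Powered Inference that aligns automated scores with human annotations, and shows that successful ad integration relies on four semantic strategies, providing a foundation for future LLM-native advertising systems.

Comments 37 pages, 11 figures

Details
English abstract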

Reconciling platform revenue with user experience in LLM advertising motivates a data-centric foundation. We introduce NaiAD, the first comprehensive dataset for LLM-native advertising comprising 58,999 carefully constructed ad-embedded responses paired with user queries. NaiAD is organized around theoretically grounded evaluation metrics that separately and comprehensively capture user and commercial utility. To mitigate the dimensional collinearity of aligned LLMs, we propose a decoupled generation pipeline that produces structurally diverse samples, ranging from responses that explicitly disentangle stakeholder utilities to responses that are uniformly strong or weak across dimensions. We further provide score labels calibrated by a Variance-Calibrated Prediction-Powered Inference (VC-PPI) framework, aligning automated scoring with human annotations. Mechanistic analyses reveal that successful ad integration relies on reasoning paths that cluster into four distinct semantic strategies. Models leveraging NaiAD internalize these strategies to simultaneously improve user and commercial utility, while enabling independent control over these distinct objectives via in-context learning. Together, these results position NaiAD as a foundational infrastructure for developing future LLM-native ad systems.

2605.09915 2026-05-12 cs.CL cs.AI cs.CY

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

Rong Shan, Te Gao, Hang Zheng, Yunjia Xi, Jiachen Zhu, Zeyu Zheng, Yong Yu, Weinan Zhang, Jianghao Lin

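Affiliations: Shanghai Jiao Tong University; Central South University; Carnegie Mellon University

AI summary: This paper argues that top AI conferences, by keeping acceptance rates relatively stable, may face a new threat of "denominator gaming" driven by fully automated scientific agents. A malicious actor can deploy AI agents to submit large numbers of superficially plausible but low-quality papers, diluting review resources and raising the acceptance probability of specific high-quality papers. The paper analyzes the feasibility and impact of this threat and argues that addressing it requires system-level policy and incentive reforms rather than technical detection alone.

Comments Accepted by ICML'26 Position Track

Details
English abstract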

The implicit policy of maintaining relatively stable acceptance rates at top AI conferences, despite exponentially growing submissions, introduces a critical structural vulnerability. This position paper characterizes a new systemic threat we term Agentic Denominator Gaming, in which a malicious actor deploys AI agents to generate and submit a large volume of superficially plausible but low-quality papers. Crucially, their objective is not the acceptance of low-quality papers, but rather to inflate the submission denominator and overwhelm reviewing capacity. Under a relatively stable acceptance rate, this dilution can systematically increase the publication probability of a small, targeted set of legitimate papers. We analyze the practical feasibility of this threat and its broader consequences, including intensified reviewer burnout, degraded review quality, and the emergence of industrialized automated agent mills. Finally, we propose and evaluate a range of mitigation strategies, and argue that durable protection will require system-level policy and incentive reforms, rather than relying primarily on technical detection alone.
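The dilution effect can be made concrete with a small back-of-the-envelope model. It is a deliberate simplification and not taken from the paper: it assumes the acceptance rate is held exactly fixed, every injected paper is rejected, and all legitimate papers are equally likely to fill the accepted slots. The function name and the numbers in the usage note are hypothetical.

```python
def acceptance_probability(legit, injected, acceptance_rate):
    """Chance that a legitimate paper is accepted under a fixed acceptance rate."""
    # The number of accepted slots scales with the total submission count.
    slots = acceptance_rate * (legit + injected)
    # Injected papers are assumed to be rejected, so legitimate papers share all slots.
    return min(1.0, slots / legit)
```

For example, with 10,000 legitimate submissions and a 25% acceptance rate, injecting 2,000 low-quality papers raises the per-paper chance under these assumptions from 0.25 to 0.30.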

2605.09908 2026-05-12 cs.LG cs.AI cs.SD

Voice Biomarkers for Depression and Anxiety

Oleksii Abramenko, Noah D. Stein, Colin Vaz

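Affiliations: Kintsugi Mindful Wellness, Inc.

AI summary: This paper studies how to detect depression and anxiety from speech and proposes a deep learning approach that models raw speech signals directly, avoiding the reliance of traditional methods on hand-engineered features. The model is trained on a large-scale dataset of about 65,000 utterances from 23,000 subjects representative of the United States population. It extracts content-agnostic biomarker information that, combined with lexical features from speech, improves predictive performance in practice. The model achieves 71% sensitivity and specificity on about 5,000 independent test subjects and has been released publicly to support further research.

Details
English abstract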

Current approaches to detecting depression and anxiety from speech primarily rely on machine learning techniques that utilize hand-engineered paralinguistic features and related acoustic descriptors derived from time- and frequency-domain representations of speech signals. Applying deep learning methods directly to raw speech signals has the potential to produce biomarker representations with substantially greater predictive power. However, these approaches typically require large volumes of carefully annotated data to learn robust and clinically meaningful representations of the underlying biomarkers. In this paper, we describe our efforts toward developing a deep learning model trained on a large-scale proprietary dataset comprising ~65,000 utterances collected from more than 23,000 subjects representative of relevant United States demographics. We present the techniques employed and analyze their impact on model performance. Our results demonstrate that the proposed models can extract content-agnostic biomarker information, which, when combined with lexical features extracted from audio, yields improved predictive performance in production settings. Our models are evaluated on ~5000 unique subjects and achieve performance of 71% in terms of sensitivity and specificity. To foster further research in mental health assessment from speech, we release the best-performing model described in this paper on HuggingFace.