arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08802 2026-05-13 cs.CV

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Ziyang Ding, Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu, Zhen Zhao

Affiliations: Shandong University; Shanghai AI Laboratory; Nanjing University; The University of Hong Kong

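AI summary: CoLVR is a method that uses contrastive optimization to strengthen the exploratory capability of latent visual reasoning, addressing the way existing models' reliance on hard alignment objectives limits the flexibility of latent reasoning. It introduces a latent contrastive training framework based on angle-based perturbation to learn more diverse and exploratory representations, and combines it with a latent trajectory contrastive reward for reinforcement learning post-training to further optimize the latent reasoning process. Experiments show that CoLVR significantly improves the exploratory capability of latent representations on multiple benchmarks and performs well on out-of-domain tasks.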

Abstract

Because Latent Visual Reasoning has the potential for exploratory reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thereby fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.

2605.08754 2026-05-13 cs.AI

Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations

Shizhong Zhou, Haifeng Liu, Zheng Zhang, Shiyu Zhang, Bo Yang, Yi Lin

Affiliations: National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University; College of Computer Science, Sichuan University

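AI summary: This paper proposes CaTR, a reinforcement learning framework for real-time multi-aircraft taxiway routing on the airport surface. Using a hierarchical conflict-aware observation mechanism together with grid-based environment modeling and action masking, the framework captures current and downstream traffic conflict information, and it adopts a value-decomposition strategy to balance the multi-objective trade-off between safety and efficiency. Experiments show that, across multiple traffic densities, CaTR achieves better combined safety and efficiency than traditional planning and reinforcement learning methods.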

Abstract

Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety-efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.

2605.08693 2026-05-13 cs.AI

SkillMaster: Toward Autonomous Skill Mastery in LLM Agents

Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, Yong Li

Affiliations: Shandong University; Zhongguancun Academy; Tsinghua University; Southeast University; University of Science and Technology of China

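AI summary: SkillMaster is a training framework intended to give LLM agents autonomous skill mastery. Through trajectory-informed skill review, counterfactual utility evaluation, and a dual advantage estimation mechanism, agents learn to create, refine, and select skills on their own during task solving, improving their ability to handle complex tasks. Experiments show that SkillMaster significantly outperforms existing methods on multiple benchmark tasks, demonstrating a shift from passively using skills to actively learning and improving them.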

Abstract

Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.

2605.08600 2026-05-13 cs.CL

100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Rustem Yeshpanov

Affiliations: Independent Researcher, Astana, Kazakhstan

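AI summary: This paper introduces a multilingual corpus of 100,502 movie reviews from Kazakhstan, covering Russian, Kazakh, and code-switched texts and spanning 2001 to 2025. Reviews are manually annotated for language and sentiment polarity, and some include user ratings. Comparing classical text-feature methods with multilingual Transformer models on sentiment classification, the study finds that the latter have a clear advantage on polarity classification, while score classification remains challenging because of class imbalance and subtle differences between ratings.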

Comments 10 pages, 1 figure, 8 tables, to appear in Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)

Abstract

We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks (three-way polarity classification and five-class score classification) and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.

2605.08571 2026-05-13 cs.RO

BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

Antong Zhang, Han Qi, Heng Yang

Affiliations: Department of Computer Science, Brown University; School of Engineering and Applied Sciences, Harvard University

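AI summary: This paper proposes BEACON, a framework for cross-domain co-training via best-effort adaptation, used to train generative robot policies when source-domain demonstrations are abundant and target-domain demonstrations are limited. The method models cross-domain co-training as a discrepancy-aware importance-reweighting problem, jointly learning a diffusion-based visuomotor policy and per-sample source weights to minimize an objective informed by target-domain generalization guarantees. With scalable instance-level discrepancy estimators, stochastic alternating updates of policy and weights, and a multi-source extension, BEACON improves policy robustness and data efficiency across several cross-domain settings and implicitly achieves feature alignment.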

Abstract

We introduce BEACON (Best-Effort Adaptation for Cross-Domain Co-Training), a theory-driven framework for training generative robot policies with abundant source demonstrations and limited target demonstrations. BEACON casts cross-domain co-training as a discrepancy-aware importance-reweighting problem, jointly learning a diffusion-based visuomotor policy and per-sample source weights that minimize an objective informed by target-domain generalization guarantees. To make best-effort adaptation practical for high-dimensional sequence policies, we develop scalable instance-level discrepancy estimators, stochastic alternating updates for policy and weights, and a multi-source extension that balances heterogeneous source domains. Across sim-to-sim, sim-to-real, and multi-source manipulation settings, BEACON improves robustness and data efficiency over target-only, fixed-ratio co-training, and feature-alignment baselines. Importantly, even without an explicit alignment objective, BEACON achieves feature alignment as an implicit result of discrepancy-aware cross-domain co-training.

2605.08446 2026-05-13 cs.LG

Direct Bethe Free Energy Minimization for Bayesian Neural Network

Pavel Prochazka

Affiliations: Cisco Inc.

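AI summary: This paper proposes training Bayesian neural networks by directly minimizing the Bethe free energy, replacing the traditional strategy of maximizing a variational lower bound. The Bethe free energy can be computed exactly on tree-structured factor graphs, the method supports mixtures of probabilistic and deterministic layers, and restricting the weight posterior to a last-layer Gaussian yields analytically tractable loss functions. Experiments show predictive performance comparable to standard methods, while avoiding the Jensen gap introduced by the choice of variational family and enabling hyperparameter optimization in a single gradient pass.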

Comments Submitted to conference; fixes typo in title and name

Abstract

We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ($L_{\text{MAP}} \leq L_{\text{Bethe}} \leq L_{\text{ELBO}}$), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes (joint optimization of weights, covariance, and hyperparameters) in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.

2605.08328 2026-05-13 cs.LG cs.CV

P-Flow: Proxy-gradient Flows for Linear Inverse Problems

Zehua Jiang, Fenghao Zhu, Xinquan Wang, Chongwen Huang, Zhaoyang Zhang

Affiliations: Zhejiang University; University of Notre Dame

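AI summary: This paper proposes P-Flow, a new framework for linear inverse problems that updates the source point with a proxy gradient, avoiding the numerical instability and computational overhead that long-chain differentiation causes in traditional methods. Drawing on the concentration of measure phenomenon in high-dimensional spaces, the method uses a Gaussian spherical projection to maintain consistency with the prior distribution, and provides a theoretical analysis based on Bayesian theory and Lipschitz continuity. Experiments show that P-Flow performs well on a range of image restoration tasks, with a clear advantage under extreme degradations.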

Abstract

Generative models based on flow matching have emerged as a powerful paradigm for inverse problems, offering straighter trajectories and faster sampling compared to diffusion models. However, existing approaches often necessitate differentiating through unrolled paths, leading to numerical instability and prohibitive computational overhead. To address this, we propose P-Flow, a framework that stabilizes the reconstruction process by leveraging a proxy gradient to update the source point. This approach effectively circumvents the numerical instability and memory overhead of long-chain differentiation. To ensure consistency with the prior distribution, we employ a Gaussian spherical projection motivated by the concentration of measure phenomenon in high-dimensional spaces. We further provide a theoretical analysis for P-Flow based on Bayesian theory and Lipschitz continuity. Experiments across diverse restoration tasks demonstrate that P-Flow delivers competitive performance, especially under extreme degradations such as severely ill-posed conditions and high measurement noise.
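
The concentration-of-measure motivation can be illustrated numerically: a standard Gaussian vector in $d$ dimensions has norm close to $\sqrt{d}$. The sketch below assumes that the spherical projection amounts to rescaling a point to radius $\sqrt{d}$; the abstract does not give the exact operator, so the function name `project_to_gaussian_shell` and this form are assumptions.

```python
import math
import random


def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def project_to_gaussian_shell(x):
    """Rescale x to radius sqrt(d), where d = len(x).

    Assumed form of a 'Gaussian spherical projection'; the paper's
    exact operator may differ.
    """
    scale = math.sqrt(len(x)) / norm(x)
    return [v * scale for v in x]


rng = random.Random(0)
d = 4096

# A standard Gaussian sample: its norm should land near sqrt(d) = 64.
sample = [rng.gauss(0.0, 1.0) for _ in range(d)]
sample_norm = norm(sample)

# A source point that has drifted far off the shell during optimization.
drifted = [3.0 * v for v in sample]
projected = project_to_gaussian_shell(drifted)

print(f"sqrt(d)         = {math.sqrt(d):.2f}")
print(f"sample norm     = {sample_norm:.2f}")
print(f"drifted norm    = {norm(drifted):.2f}")
print(f"projected norm  = {norm(projected):.2f}")
```

Rescaling keeps the direction of the updated source point and only restores its norm, which is one simple way to keep the point in the region where a Gaussian prior places almost all of its mass.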

2605.08322 2026-05-13 cs.LG cs.AI

SDG-MoE: Signed Debate Graph Mixture-of-Experts

Stepan Kulibaba, Kirill Labzin, Artem Dzhalilov, Roman Pakhomov, Oleg Svidchenko, Alexander Gasnikov, Aleksei Shpilman

Affiliations: Innopolis University; Sirius University; HSE University

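AI summary: This paper proposes SDG-MoE, a novel sparse Mixture-of-Experts (MoE) architecture that aims to improve performance through structured communication among experts. After routing, it adds a lightweight iterative deliberation step comprising two interaction matrices (a support graph and a critique graph) and a disagreement-based anchoring mechanism, strengthening information exchange and coordination among experts. Experiments show that SDG-MoE significantly outperforms vanilla MoE and an unsigned graph communication baseline on multiple benchmark datasets.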

Abstract

Sparse MoE models achieve a good balance between capacity and compute by routing each token to a small subset of experts. However, in most MoE architectures, once a token is routed, the selected experts process it independently and their outputs are combined via a weighted sum. This leaves open whether enabling communication among them could improve performance. While prior work has raised this question, direct interaction among the active routed experts remains underexplored. In this paper, we propose SDG-MoE (Signed Debate Graph Mixture-of-Experts), a novel architecture that adds a lightweight, iterative deliberation step before final aggregation. SDG-MoE introduces three components: (i) two learned interaction matrices over the active experts, a support graph $A^+$ and a critique graph $A^-$, capturing reinforcing and corrective influences; (ii) a signed message-passing step that updates expert representations before aggregation; and (iii) a disagreement-gated Friedkin-Johnsen-style anchoring that controls deliberation strength while preventing expert drift. Together, these enable a structured deliberation process where interaction strength scales with disagreement and specialization is preserved. We also provide a theoretical analysis establishing stability conditions on expert states and showing that deliberation adds only low-order overhead over the active set. In controlled three-seed pretraining experiments, SDG-MoE improves validation perplexity over both an unsigned graph communication baseline and vanilla MoE, outperforming the strongest baseline by 19.8%, and gives the best external perplexity on WikiText-103, C4, and Paloma among the compared systems.

2605.08083 2026-05-13 cs.CL

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang

Affiliations: UMD (University of Maryland); UVA (University of Virginia); WUSTL (Washington University in St. Louis); UNC (University of North Carolina); Google; Meta

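AI summary: This paper studies how to improve large language model performance by allocating additional computation at inference time, and proposes AutoTTS, an environment-driven framework for automatically discovering efficient test-time scaling strategies. The method constructs a controllable environment that makes strategy search more efficient, and introduces a parameterization and feedback mechanism to improve discovery efficiency. Experiments show that the discovered strategies outperform manually designed baselines on mathematical reasoning tasks, offer a better cost-accuracy trade-off, and are cheap to discover.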

Comments 25 pages

Abstract

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.

2605.07782 2026-05-13 cs.CL cs.PL

CktFormalizer: Autoformalization of Natural Language into Circuit Representations

Jing Xiong, Qi Han, Chenchen Ding, He Xiao, Zunhai Su, Chaofan Tao, Ngai Wong

Affiliations: The University of Hong Kong

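AI summary: CktFormalizer is a framework that automatically converts natural language into circuit representations, addressing the defects that LLM-generated Verilog often exhibits during synthesis and implementation. Using a dependently-typed hardware description language embedded in Lean 4, it provides type checking, correctness guarantees, and formal proofs, improving the correctness and realizability of generated circuits. Experiments show that CktFormalizer maintains simulation pass rates while significantly raising the backend implementation success rate, and that it can achieve performance optimization backed by automated theorem proving.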

Abstract

LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker: dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall: compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant: the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95-100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proving ensuring that each optimized variant remains functionally equivalent to its formal specification.

2605.07744 2026-05-13 cs.AI

Alternating Target-Path Planning for Scalable Multi-Agent Coordination

Yu Kumagai, Keisuke Okumura

Affiliations: Hitotsubashi University, Japan; National Institute of Advanced Industrial Science and Technology (AIST), Japan

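AI summary: This paper studies multi-agent coordination in concurrent target assignment and pathfinding (TAPF), and proposes an iterative refinement framework that decouples target assignment from pathfinding. Built on efficient suboptimal multi-agent pathfinding solvers, the method repeatedly plans paths and uses the feedback to refine the target assignment, improving scalability. Experiments show that the framework maintains good solution quality while significantly outperforming traditional Conflict-Based Search methods, offering a practical approach to large-scale TAPF problems.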

Abstract

The concurrent target assignment and pathfinding (TAPF) problem extends multi-agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision-free paths to agents. Prior work on TAPF has relied exclusively on Conflict-Based Search (CBS), which tightly couples target assignment and pathfinding, resulting in compute-intensive, non-scalable solutions. In contrast, we propose an iterative refinement framework that decouples target assignment from pathfinding. Our framework builds on modern, fast, suboptimal MAPF solvers, such as LaCAM. Specifically, within a given time budget, it repeatedly solves MAPF for the current target assignment, identifies bottleneck agents via MAPF feedback, and refines the assignment. Empirical results show that the feedback-driven reassignment loop is effective, enabling our framework to scale well beyond the reach of the state-of-the-art CBS-based solver while maintaining decent solution quality. This represents a solid step toward practical, large-scale TAPF suitable for real-world setups.

2605.07552 2026-05-13 cs.CV

VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

Zepeng Yang, Junxuan Bai, Hao Li, Ju Dai, Junjun Pan, Yongfeng Yin, Bin Li

Affiliations: Beihang University; Peng Cheng Laboratory; Capital University of Physical Education and Sports; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

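AI summary: This paper proposes VIMCAN, a hybrid architecture for visual-inertial 3D human pose estimation. It combines Mamba's efficient sequence modeling with the spatial awareness of Cross-Attention, addressing the high computational complexity that makes traditional Transformers impractical for real-time processing of long sequences. Experiments show that VIMCAN achieves better accuracy than existing methods on multiple datasets and supports real-time inference at over 60 frames per second on ordinary consumer-grade hardware.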

Comments Accepted in CVPR 2026

Abstract

The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available at https://github.com/Eddieyzp/VIMCAN.

2605.06940 2026-05-13 cs.CL cs.AI cs.LG

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

Souvik Pramanik, S. M. Riaz Rahman Antu, Shak Mohammad Abyad, Md. Ibrahim Khalil, Md. Shahriar Hussain

Affiliations: North South University

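AI summary: This study presents MultiSoc-4D, a Bengali social media dataset for diagnosing annotation bias in large language models (LLMs) under closed-set instructions, containing more than 58,000 social media comments from six sources annotated along four dimensions. Through a structured pipeline of multi-model annotation with a shared validation set, the study systematically exposes a widespread "instruction-induced label collapse" in LLM annotation, in which models favor default labels and severely under-detect minority categories. The study also statistically validates the resulting "label agreement illusion" and evaluates how annotation bias propagates through the training pipeline across more than 40 LLMs, providing a benchmark for annotation research in low-resource-language NLP.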

Comments 21 pages, 14 figures, 13 tables

Abstract

Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media benchmark dataset, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content, respectively, compared to a human-calibrated reference. Furthermore, we prove that it represents a "label agreement illusion", statistically validated via a near-zero Fleiss' Kappa ($\kappa \approx -0.001$) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.
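
The "label agreement illusion" can be reproduced with a small calculation. The sketch below computes Fleiss' kappa from its standard definition and applies it to a made-up table (hypothetical counts, not taken from MultiSoc-4D) in which four annotators almost always choose the fallback label. Raw agreement is about 96%, yet kappa sits near zero because nearly all of that agreement is expected by chance.

```python
def fleiss_kappa(counts):
    """Return (observed agreement, Fleiss' kappa).

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Fraction of agreeing rater pairs for each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_item) / n_items          # mean observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return p_obs, (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical sarcasm labels from 4 annotators, columns = [No, Yes].
# 92 items: everyone says "No"; 8 items: a single annotator says "Yes".
toy_counts = [[4, 0]] * 92 + [[3, 1]] * 8

p_obs, kappa = fleiss_kappa(toy_counts)
print(f"observed agreement = {p_obs:.3f}")   # about 0.960
print(f"Fleiss' kappa      = {kappa:.3f}")   # about -0.020
```

Because kappa subtracts chance agreement, a label distribution dominated by one fallback class leaves almost nothing to explain, which is why high raw agreement and near-zero kappa can coexist.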

2605.06870 2026-05-13 cs.LG

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

Xinyu Zhao, Nikita Karagodin, Hamed Hassani, Sinan Hersek, Paul Pu Liang, Yury Polyanskiy

Affiliations: MIT; University of Pennsylvania; Google

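AI summary: This paper studies dimensional collapse during VQ-VAE training, in which encoded representations degenerate into an extremely low-dimensional subspace, and shows that it leads to a loss lower bound that is hard to surpass. The authors propose a simple and effective remedy, AE Warm-Up: pretraining as an autoencoder before introducing VQ, which restores the dimension of the encoded representations. Experiments show that the method significantly improves reconstruction quality and perceptual performance on image and audio tasks, while raising the effective dimension of the codebook.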

Abstract

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in \{2^{10}, 2^{14}, 2^{16}\}$, AE Warm-Up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in \{2^{13}, 2^{14}\}$, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-Up to VQ-VAE training.

2605.06732 2026-05-13 cs.LG

On Training in Imagination

Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel

Affiliations: Weizmann Institute of Science; New York University; Columbia University; New York University AMI Labs

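AI summary: This paper studies model-based reinforcement learning in imagination, examining how model errors affect policy optimization and returns when policies are trained with learned dynamics and reward models. The authors extend existing analysis to derive, under power-law assumptions, the optimal sample allocation ratio that minimizes an upper bound on return error, and note that lowering the Lipschitz constants of the dynamics, reward, and policy tightens this bound. They also analyze REINFORCE under noisy rewards, finding that zero-mean noise leaves the gradient estimate unbiased but increases variance, and formulate the optimization problem of trading off the number of rollouts against reward noise under a fixed budget.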

Abstract

State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation: the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
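
The unbiasedness claim can be checked numerically on a toy problem. The sketch below is not the paper's setup: it assumes a hypothetical two-armed bandit with a sigmoid policy, a true reward equal to the chosen action, and Gaussian reward noise, and it compares single-sample REINFORCE estimates with and without noise.

```python
import math
import random


def reinforce_samples(theta, noise_std, n, rng):
    """Draw n single-rollout REINFORCE gradient estimates.

    Policy: P(action = 1) = sigmoid(theta). True reward equals the action.
    Observed reward = true reward + zero-mean Gaussian noise.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    grads = []
    for _ in range(n):
        action = 1 if rng.random() < p else 0
        reward = float(action) + rng.gauss(0.0, noise_std)
        # d/dtheta log pi(action) for a Bernoulli(sigmoid(theta)) policy.
        score = (1.0 - p) if action == 1 else -p
        grads.append(reward * score)
    return grads


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
theta = 0.0
p = 1.0 / (1.0 + math.exp(-theta))
true_grad = p * (1.0 - p)  # d/dtheta E[reward] = d/dtheta sigmoid(theta)

clean_mean, clean_var = mean_and_variance(
    reinforce_samples(theta, 0.0, 200_000, rng))
noisy_mean, noisy_var = mean_and_variance(
    reinforce_samples(theta, 1.0, 200_000, rng))

print(f"true gradient       = {true_grad:.4f}")
print(f"clean: mean, var    = {clean_mean:.4f}, {clean_var:.4f}")
print(f"noisy: mean, var    = {noisy_mean:.4f}, {noisy_var:.4f}")
```

Both sample means should land close to the true gradient, while the per-rollout variance is larger under noise. Averaging over $N$ independent rollouts divides that variance by $N$, which is the source of the rollouts-versus-noise tradeoff described in the abstract.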

2605.06440 2026-05-13 cs.LG cs.CV

Hyperbolic Concept Bottleneck Models

Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes

Affiliations: Informatics Institute, University of Amsterdam

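AI summary: This paper proposes Hyperbolic Concept Bottleneck Models (HypCBM), a new interpretable neural network framework. Unlike conventional methods that embed concepts in Euclidean space, HypCBM organizes concepts in a semantic hierarchy and exploits the geometry of hyperbolic space, representing concept activation as asymmetric geometric containment so that hierarchical relations among concepts are captured more naturally. The method yields sparse, hierarchy-aware activations without additional supervision or learned modules, and shows stronger hierarchical consistency and robustness to input noise while remaining human-interpretable.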

Comments 24 pages, 14 figures

Abstract

Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.

2605.06314 2026-05-13 cs.LG

When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias

Ye Su, Jian Li, Yong Liu

Affiliations: Shenzhen Institutes of Advanced Technology; School of Artificial Intelligence; Chinese Academy of Sciences; Beijing Normal University; Gaoling School of Artificial Intelligence; Renmin University of China

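AI summary: This paper studies the benign overfitting behavior of $\ell_2$-Boosting under its $\ell_1$ implicit bias in the high-dimensional setting. Combining the Convex Gaussian Minimax Theorem with asymptotic expansions of truncated Gaussian moments, the authors analyze the risk of continuous-time $\ell_2$-Boosting, showing that under a pure-noise model the excess variance decays only at a logarithmic rate, and noting that the mechanism should persist in the presence of signal although the signal-noise decomposition remains an open problem. They also propose a tuning-free early stopping rule that attains the optimal prediction rate under an $\ell_1$ constraint.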

Abstract

Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging to characterize. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time $\ell_2$-Boosting over $p$ features and $n$ samples. By coupling the Convex Gaussian Minimax Theorem with delicate asymptotic expansions of double-sided truncated Gaussian moments, we analytically resolve the non-smooth $\ell_1$ interpolant. Under an isotropic pure-noise model, we prove that benign overfitting fails at the linear rate: greedy selection localizes noise into sparse active sets, and the excess variance decays at a logarithmic rate $\Theta(\sigma^2/\log(p/n))$ for noise variance $\sigma^2$. We remark that while this localization mechanism should persist in the presence of signals, the exact signal-noise decomposition remains an open problem. For spiked-isotropic designs with $k^*$ head eigenvalues and $r_2 = p - k^*$ tail dimensions, the risk converges to zero when $r_2 \gg n$, but only at a logarithmic rate $\Theta(\sigma^2/\log(r_2/n))$, which is slower than the linear decay observed in $\ell_2$ geometries. To avoid this slow convergence, we analyze the non-smooth subdifferential dynamics of the boosting flow. This yields a tuning-free early stopping rule that, under a bounded $\ell_1$-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for $\ell_1$-bounded signals.

2605.06218 2026-05-13 cs.LG

AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks

Yi Wei, Xuan Qi, Furao Shen, Jian Zhao, Vittorio Murino, Cigdem Beyan

Affiliations * State Key Laboratory of Novel Software Technology, School of Intelligence Science and Technology, Nanjing University, Jiangsu, China; AI for Good, Istituto Italiano di Tecnologia, Genoa, Italy; DITEN, University of Genoa, Genoa, Italy; State Key Laboratory of Novel Software Technology, School of Artificial Intelligence, Nanjing University, Jiangsu, China; Department of Computer Science, University of Verona, Verona, Italy

AI summary AffineLens is a unified framework for analyzing the piecewise affine structure of neural networks, aiming to capture accurately the continuous piecewise affine nature of a network's input-output map. It computes the neuron-induced hyperplane arrangements and polyhedral structures, then enumerates and visualizes the network's affine regions layer by layer, giving both an intuitive view and a quantitative measure of expressivity. AffineLens supports modern components including batch normalization, pooling, and residual connections, and an empirical study shows how different design choices affect the geometry of the learned function.

English abstract

Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.

2605.05971 2026-05-13 cs.LG

Training Transformers for KV Cache Compressibility

Yoav Gelberg, Yam Eitan, Michael Bronstein, Yarin Gal, Haggai Maron

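Affiliations * University of Oxford; Technion – Israel Institute of Technology; AITHYRA; NVIDIA

AI summary As long-context language models advance, the memory and decode-time access costs of the Key-Value (KV) cache have become a key bottleneck. This paper proposes KV-Compression Aware Training (KV-CAT), which guides a Transformer toward compressible representations during training: sparsifying the KV cache at train time pushes the model to produce internal representations that are easier to compress afterwards. Experiments show that the method improves downstream compression methods on retrieval, long-context question answering, and compressed-prefix continuation.

Comments 32 pages, 4 figures

English abstract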

Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
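The train-time masking of KV slots described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the abstract does not specify the sparsification policy, so `sample_keep_mask` (uniform random selection under a fixed budget) and the single-query `masked_attention` helper are hypothetical names, not the authors' implementation.

```python
import math
import random


def sample_keep_mask(n_slots, budget, rng):
    """Hypothetical policy: keep `budget` KV slots chosen uniformly at random."""
    kept = set(rng.sample(range(n_slots), budget))
    return [i in kept for i in range(n_slots)]


def masked_attention(query, keys, values, keep):
    """Single-query softmax attention restricted to the kept KV slots."""
    idx = [i for i, flag in enumerate(keep) if flag]
    if not idx:
        raise ValueError("at least one KV slot must be kept")
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in idx]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][j] for w, i in zip(weights, idx)) / total
        for j in range(dim)
    ]


# Toy cache with three slots; the mask hides slots 0 and 2 from the query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
output = masked_attention([1.0, 0.0], keys, values, [False, True, False])
```

With a mask in place during training, the loss is computed from outputs that can only read the surviving slots, which is the pressure toward compressible representations that the abstract describes.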

2605.05922 2026-05-13 cs.CV

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wan, Kuien Liu, Xiang Wang

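Affiliations * University of Science and Technology of China; Kling Team, Kuaishou Technology; Institute of Software, Chinese Academy of Sciences

AI summary The paper proposes DeScore, a video reward model that addresses the optimization bottleneck arising when reasoning and scoring are coupled. Its core idea is to decouple the two: a multimodal large language model first generates a detailed reasoning trace, and a separate scoring module then predicts the final reward. The approach aims to retain interpretability and generalization while improving training stability and efficiency.

English abstract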

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.

2605.04946 2026-05-13 cs.LG stat.ML

Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks

Xuan Qi, Yi Wei, Fanqi Yu, Furao Shen, Vittorio Murino, Cigdem Beyan

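Affiliations * AI for Good, Istituto Italiano di Tecnologia; DITEN, University of Genoa; State Key Laboratory of Novel Software Technology, School of Intelligence Science and Technology; School of Artificial Intelligence, Nanjing University; Department of Computer Science, University of Verona

AI summary This paper studies the geometric effect of training-time batch normalization (BN) in piecewise-affine networks, showing how BN changes the local region partition by shifting each neuron's reference hyperplane. Conditioned on a mini-batch, BN defines for each neuron a hyperplane through the batch centroid; the offsets of the switching hyperplanes are expressed in standardized coordinates and do not depend on the raw bias. Under stated conditions this mechanism refines the local partition and transfers locally through depth, giving a function-level geometric account of BN during training.

English abstract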

Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local $\ell_\infty$ window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.
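The bias-independence claim above can be checked numerically for a single ReLU neuron. The sketch below is an illustration, not the paper's code, and rests on these assumptions: BN uses the population standard deviation of the mini-batch pre-activations with no epsilon term, the scale `gamma` is nonzero, and the function names are hypothetical. Under those assumptions the switching set $\gamma (z - \mu)/\sigma + \beta = 0$ with $z = w \cdot x + b$ rewrites as $w \cdot (x - \bar{x}) = -\beta\sigma/\gamma$, and a hyperplane $w \cdot (x - \bar{x}) = t$ meets an $\ell_\infty$ window of radius $r$ around $c$ exactly when $|w \cdot (c - \bar{x}) - t| \le r \|w\|_1$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bn_switching_hyperplane(w, b, gamma, beta, batch):
    """Return (centroid, offset) such that the ReLU switches on w.(x - centroid) = offset."""
    z = [dot(w, x) + b for x in batch]
    mu = sum(z) / len(z)
    sigma = math.sqrt(sum((zi - mu) ** 2 for zi in z) / len(z))
    if sigma == 0.0 or gamma == 0.0:
        raise ValueError("degenerate batch statistics or zero BN scale")
    centroid = [sum(x[j] for x in batch) / len(batch) for j in range(len(w))]
    # The raw bias b cancels: it shifts every z and mu by the same amount.
    offset = -beta * sigma / gamma
    return centroid, offset


def intersects_linf_window(w, centroid, offset, center, radius):
    """Exact test: does the hyperplane cross the box |x - center|_inf <= radius?"""
    signed = dot(w, [c - m for c, m in zip(center, centroid)]) - offset
    return abs(signed) <= radius * sum(abs(wi) for wi in w)


batch = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
w = [1.0, 0.0]
plane_a = bn_switching_hyperplane(w, 0.0, 1.0, -0.5, batch)
plane_b = bn_switching_hyperplane(w, 10.0, 1.0, -0.5, batch)  # different raw bias
```

In this toy batch both calls return the centroid `[1.0, 1.0]` and the offset `0.5`, so the switching hyperplane is `x[0] = 1.5` regardless of the raw bias.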

2605.04647 2026-05-13 cs.RO

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, Kun Zhan

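Affiliations * LiAuto

AI summary This paper presents ReflectDrive-2, a reinforcement-learning-aligned discrete diffusion planner for autonomous driving. A separate action expert produces discrete trajectory tokens, trajectories are generated by parallel masked decoding, and they can be edited in place. A two-stage training strategy that combines structure-aware perturbations with reinforcement learning improves both trajectory generation and editing. On NAVSIM, ReflectDrive-2 achieves a high PDMS score with efficient inference.

English abstract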

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision--draft--reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most $0.3$, whereas RL increases its gain to $1.9$. We also co-design an efficient reflective decoding stack for the decision--draft--reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves $91.0$ PDMS with camera-only input and $94.8$ PDMS in a best-of-6 oracle setting, while running at $31.8$ ms average latency on NVIDIA Thor.

2605.04539 2026-05-13 cs.CL cs.AI

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

Qiming Bao, Juho Leinonen, Paul Denny, Michael J. Witbrock

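Affiliations * University of Auckland; Aalto University

AI summary The paper proposes RLearner-LLM, a method for balancing logical correctness and fluency in large language models on knowledge-intensive generation. It introduces Hybrid Direct Preference Optimization (Hybrid-DPO), which combines a DeBERTa-v3 natural language inference signal with a verifier LLM score to improve logical alignment without human annotation. Experiments across several academic domains and multiple base models show improved logical grounding with fluency preserved.

English abstract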

Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.

2605.02906 2026-05-13 cs.LG

An End-to-End Framework for Building Large Language Models for Software Operations

Jingkai He, Pengfei Chen, Chenghui Wu, Shuang Liang, Ye Li, Gou Tan, Xiadao Wen, Chuanfu Zhang

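Affiliations * School of Systems Science and Engineering, Sun Yat-sen University; School of Computer Science and Engineering, Sun Yat-sen University; Alibaba Cloud Computing

AI summary This paper proposes OpsLLM, an end-to-end framework for building large language models for software operations, motivated by the low data quality, fragmented knowledge, and insufficient learning that keep current models from effective intelligent operations. The framework introduces a human-in-the-loop data curation mechanism and a domain process reward model, improving accuracy and reliability on operations question answering and root cause analysis. In experiments OpsLLM outperforms existing open-source and closed-source models across tasks of varying difficulty. The authors plan to open-source three versions of different parameter sizes along with the fine-tuning dataset.

English abstract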

In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate high-quality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to obtain a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on tasks of diverse difficulty demonstrate that OpsLLM effectively learns and aligns with the operational domain knowledge infused, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2% to 5.7% on QA tasks and 2.7% to 70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B and 32B parameters, along with a 15K fine-tuning dataset.

2605.00939 2026-05-13 cs.LG cs.AI

From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

Yee Zhing Liew, Andrew Huey Ping Tan, Anwar P. P. Abdul Majeed

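Affiliations * School of Intelligent Manufacturing Ecosystem, Xi’an Jiaotong-Liverpool University, People’s Republic of China; Department of Computer Science, University of Liverpool, United Kingdom; Faculty of Engineering and Technology, Sunway University, Malaysia; School of Robotics, Xi’an Jiaotong-Liverpool University, People’s Republic of China

AI summary This paper studies "stubborn hallucinations", errors that traditional hallucination detection misses because the language model is highly confident in wrong information. The authors propose a geometric detector based on gradient sensitivity, Embedding-Perturbed Gradient Sensitivity (EPGS), which adds Gaussian noise to the input embeddings and measures the change in gradient magnitude to separate stable knowledge from brittle memorization. Experiments show that it outperforms entropy-based and representation-based baselines at detecting high-confidence factual errors.

Comments Accepted to ICML 2026. Camera-ready version

English abstract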

Traditional hallucination detection fails on "Stubborn Hallucinations" - errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.
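The perturb-and-measure idea above can be sketched on a toy loss. Everything here is an assumption made for illustration: the abstract does not give the exact score, so `gradient_sensitivity` (mean gradient norm under Gaussian input noise minus the unperturbed gradient norm) is a hypothetical stand-in, and the quadratic losses with analytic gradients stand in for an LLM's loss with respect to its input embeddings.

```python
import math
import random


def grad_norm(grad_fn, point):
    """Euclidean norm of the gradient at `point`."""
    return math.sqrt(sum(g * g for g in grad_fn(point)))


def gradient_sensitivity(grad_fn, embedding, noise_std=0.1, n_samples=32, seed=0):
    """Average rise in gradient norm when Gaussian noise is added to the input."""
    rng = random.Random(seed)
    base = grad_norm(grad_fn, embedding)
    perturbed = 0.0
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, noise_std) for x in embedding]
        perturbed += grad_norm(grad_fn, noisy)
    return perturbed / n_samples - base


def quadratic_grad(curvature):
    """Gradient of the toy loss 0.5 * curvature * ||e||^2, minimized at the origin."""
    return lambda e: [curvature * x for x in e]


# Both inputs sit at a minimum; only the curvature around it differs.
flat_score = gradient_sensitivity(quadratic_grad(0.1), [0.0, 0.0, 0.0])
sharp_score = gradient_sensitivity(quadratic_grad(10.0), [0.0, 0.0, 0.0])
```

For a quadratic loss the gradient norm after perturbation is the curvature times the noise norm, so the sharp minimum scores 100 times higher than the flat one here. That scaling is why a first-order measurement of this kind can act as a cheap curvature proxy.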

2604.24801 2026-05-13 cs.LG cs.AI

Architecture Determines Observability of Transformers

Thomas Carmichael

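Affiliations * Independent Researcher

AI summary This study examines how architecture affects the observability of Transformer models, noting that autoregressive Transformers make errors that output-confidence monitoring cannot detect. Whether activations carry decision-quality information beyond output confidence is determined mainly by the model's architecture and training. Experiments show that controlling for output confidence removes most of the activation-probe signal, and that the observability of the remaining signal depends on architecture and training, which offers a new perspective on model monitoring and training design.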

Comments 31 pages, 8 figures, 14 tables. v3 of arXiv:2604.24801. Code v5.1.0: https://github.com/tmcarmichael/nn-observability/tree/v5.1.0 Changelog: https://github.com/tmcarmichael/nn-observability/blob/v5.1.0/CHANGELOG.md Croissant: https://github.com/tmcarmichael/nn-observability/blob/v5.1.0/croissant.json

English abstract

Autoregressive transformers make confident errors that output-confidence monitoring cannot catch. Activation monitors catch them only when training leaves a decision-quality signal beyond what the output already exposes. This signal is an architectural property of the trained model, fixed upstream of any monitor. Controlling for output confidence removes 60.3% of the raw activation-probe signal on average across 14 models. Raw probe signal is mostly output confidence, and output-side readouts cannot recover the residual. What remains depends on architecture and training. In Pythia's controlled training, both matched-width configurations form the signal early. One preserves it through convergence while another erases it as perplexity continues to improve. Capability and observability are not inherently in tension. Across independently trained families this pattern persists, even as the collapse point shifts. Where the signal survives, monitoring catches what confidence cannot. On downstream QA, a WikiText-trained probe with no task-specific tuning catches about one in eight confident errors that output-confidence monitoring misses, at a 20% flag rate. These results establish signal engineering as a training-time design axis alongside loss and capability. Architecture sets the conditions for observability, and training determines what remains readable.

2604.22099 2026-05-13 cs.LG

Assessing the impact of dimensionality reduction on clustering performance -- a systematic study

Ousmane Assani-Amate, Mohammadreza Bakhtyari, Émilie Roy, Vladimir Makarenkov

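Affiliations * Université du Québec à Montréal; Mila - Quebec AI Institute

AI summary This study systematically evaluates how five dimensionality reduction techniques affect the performance of four clustering algorithms on high-dimensional data. Varying the reduced dimensionality and comparing results with the Adjusted Rand Index (ARI), it finds that the choice of reduction method and reduction level matters for clustering quality and should be matched to the data structure and the clustering algorithm.

English abstract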

Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results without and with dimensionality reduction at different reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of a careful selection of the dimensionality reduction technique and the dimensionality reduction level that should be tailored to intrinsic data geometry and clustering algorithms under consideration.
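For readers unfamiliar with the evaluation setup above, the following sketch computes the Adjusted Rand Index from its standard pair-counting definition and lists the three reduction levels named in the abstract. The helper names and the rounding of the 25% and 50% levels to whole dimensions are assumptions for illustration, not details taken from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings (1.0 means identical partitions)."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both partitions are trivial (one cluster, or all singletons)
    return (sum_cells - expected) / (max_index - expected)


def reduction_levels(n_features, n_clusters):
    """Target dimensionalities: k-1, 25% and 50% of the original dimensions."""
    return {
        "k-1": max(1, n_clusters - 1),
        "25%": max(1, round(0.25 * n_features)),
        "50%": max(1, round(0.50 * n_features)),
    }


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
levels = reduction_levels(n_features=100, n_clusters=5)
```

ARI ignores the label names themselves, so the first call scores a relabeled but identical partition as a perfect match, while the second scores below zero because the two partitions agree less than chance would predict.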

2604.22026 2026-05-13 cs.AI cs.CY cs.DL

Rethinking Publication: A Certification Framework for AI-Enabled Research

Yang Lu, Rabimba Karanjai, Lei Xu, Weidong Shi

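Affiliations * Department of Computer Science, University of Houston, Houston, Texas

AI summary This paper proposes a two-layer certification framework for evaluating AI-generated research, responding to the publication system's built-in assumption of human authorship. The framework separates the assessment of knowledge validity from the assessment of human contribution: the first layer checks that the knowledge claim is sound, and the second identifies the degree of human involvement. The paper also proposes dedicated benchmark slots as a transparent publication path for fully automated research, and argues that contributions should be judged by epistemic value rather than by who authored them.

Comments correct references

English abstract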

AI research pipelines can now generate academic work that may satisfy existing peer review standards for quality, novelty, and methodological rigor. However, the publication system was built around the assumption that research is produced by human authors. It therefore lacks a clear way to evaluate work when the knowledge claim may be valid but the producer is partly or fully automated. This paper proposes a two-layer certification framework for AI-generated research. The first layer evaluates whether the knowledge claim is sound. The second layer evaluates the level of human contribution. This separation allows journals and conferences to assess pipeline-generated work more consistently without creating new institutions. The framework uses normative analysis, conceptual design, and dry-run validation against representative submission cases. It classifies human contribution into three categories: Category A, where the work is reachable by an automated pipeline; Category B, where human direction is required at identifiable stages; and Category C, where the work goes beyond current pipeline capability, especially at the problem-formulation stage. The paper also proposes dedicated benchmark slots for fully disclosed automated research. These slots would provide a transparent publication path and help reviewers calibrate judgments over time. The key argument is that publication has historically certified two things at once: that the knowledge is valid and that a human produced it. AI research pipelines separate these two claims. By decoupling knowledge certification from authorship attribution, the proposed framework responds to a structural change already underway. It can be implemented within existing editorial systems, works even when attribution is uncertain, and recognizes human frontier contribution based on epistemic value rather than human origin alone.

2604.21052 2026-05-13 cs.CV cs.AI

StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu, Yang Xu, Hanyu Xing

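Affiliations * Duke University; University of Southern California; Xidian University

AI summary StyleVAR is a controllable image style transfer method built on the Visual Autoregressive Modeling (VAR) framework. Images are decomposed into multi-scale representations and encoded as discrete codes, and a transformer performs conditional discrete sequence modeling to blend style and content controllably. The method introduces a blended cross-attention mechanism and a scale-dependent blending coefficient to combine style and content information while preserving autoregressive continuity. Across several benchmarks StyleVAR outperforms an AdaIN baseline on perceptual-similarity and structure-preservation metrics, with its best qualitative results on landscapes and architectural scenes.

English abstract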

We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.

2604.16684 2026-05-13 cs.LG stat.ML

DARLING: Detection Augmented Reinforcement Learning with Non-Stationary Guarantees

Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli

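Affiliations * ECE and CSL, The Grainger College of Engineering, University of Illinois at Urbana-Champaign

AI summary This paper studies model-free reinforcement learning in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. For the piecewise stationary (PS) setting, where rewards and transition dynamics change at unknown times, it proposes DARLING, a modular method that applies to both tabular and linear MDPs and needs no advance knowledge of the change points. In the theoretical analysis DARLING improves the best known dynamic regret bounds, and it outperforms existing methods on a range of non-stationary benchmarks.

Comments 50 pages, 8 figures

English abstract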

We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise stationary (PS) setting, where both rewards and transition dynamics can change at unknown times. We first revisit existing state-of-the-art approaches and identify theoretical and practical limitations that change the current landscape of performance guarantees. To characterize the difficulty of the problem, we establish the first minimax lower bounds for PS-RL in tabular and linear MDPs. We then introduce Detection Augmented Reinforcement Learning (DARLING), a modular wrapper for PS-RL that applies to both tabular and linear MDPs, without knowledge of the changes. In tabular MDPs, under change-point separability and reachability conditions, DARLING improves the best known dynamic regret bounds and matches our minimax lower bound. In linear MDPs, DARLING matches the minimax lower bound when the relevant reachability parameters are known, and our analysis clarifies the structural obstacles that distinguish this setting from the tabular case. Finally, through extensive experimentation across diverse non-stationary benchmarks, we show that DARLING consistently surpasses the state-of-the-art methods.