arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11235 2026-05-13 cs.LG cs.AI

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

Han Zheng, Yining Ma, Karthick Gunasekaran, Bharathan Balaji, Zheng Du, Shiv Vitaladevuni, Cathy Wu

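Affiliations: MIT; Amazon AGI

AI summary: In reinforcement fine-tuning of large language models, curriculum learning improves training efficiency and performance, but existing methods rely on handcrafted heuristics or auxiliary models for curriculum judgment, which may be misaligned with the policy's training dynamics. This paper proposes the METIS framework, which internalizes curriculum judgment as a native capability of the model: it gauges prompt informativeness through within-prompt reward variance, predicts that variance via lightweight in-context learning on recent training outcomes, and adjusts training allocation dynamically. By jointly optimizing the standard reward and a self-judgment reward, METIS gives the policy a form of metacognitive learning and shows higher performance and faster convergence on multiple benchmark tasks.

Abstract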

In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.
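
The observation that within-prompt reward variance gauges prompt informativeness can be illustrated with a minimal sketch. It computes the variance directly from rollout rewards, whereas METIS is described as predicting it. The proportional allocation rule, the function names, and the toy rewards are assumptions made for illustration and are not taken from the paper.

```python
from statistics import pvariance

def prompt_informativeness(rewards_by_prompt):
    """Map each prompt id to the population variance of its rollout rewards."""
    return {pid: pvariance(rs) for pid, rs in rewards_by_prompt.items()}

def allocate_rollouts(scores, budget):
    """Split a rollout budget across prompts in proportion to their scores.

    Hypothetical allocation rule, used here only for illustration.
    """
    total = sum(scores.values())
    if total == 0:
        # No prompt is informative: fall back to a uniform split.
        return {pid: budget // len(scores) for pid in scores}
    return {pid: int(budget * s / total) for pid, s in scores.items()}

# Toy binary rewards from four rollouts per prompt (made-up data).
rewards = {
    "always_solved": [1, 1, 1, 1],  # variance 0: no learning signal
    "never_solved": [0, 0, 0, 0],   # variance 0: no learning signal
    "borderline": [1, 0, 1, 0],     # variance 0.25: most informative
}
scores = prompt_informativeness(rewards)
allocation = allocate_rollouts(scores, budget=16)
```

With binary rewards the variance is p(1 - p), so it peaks for prompts the policy solves about half the time and vanishes for prompts that are always or never solved.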

2605.11234 2026-05-13 cs.AI

The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

Grama Chethan

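Affiliations: Siemens Digital Industries Software

AI summary: This paper identifies and addresses the "semantic training gap" in industrial AI agent systems: large language models master domain terminology but lack a deep understanding of the semantic structure of manufacturing operations. To close the gap, the study designs a tool architecture grounded in a manufacturing ontology that embeds domain knowledge directly in the AI tool layer and replaces reliance on training with runtime semantic constraints, effectively reducing erroneous generation of domain identifiers. Experiments show that the method achieves cross-domain configurability and zero tool-call hallucination without modifying application code.

Comments 29 pages, 2 figures

Abstract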

Large language model (LLM)-based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics -- the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi-agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three-operation interface contract -- resolve, contextualize, annotate -- with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology-grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.

2605.11233 2026-05-13 cs.LG

A Comparative Study of Model Selection Criteria for Symbolic Regression

Ali Soltani, Gabriel Kronberger, Fabricio Olivetti de Franca, Mattia Billa, Alessandro Lucantonio

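Affiliations: Aarhus University; Heuristic and Evolutionary Algorithms Laboratory (HEAL); Federal University of ABC; University of Modena and Reggio Emilia (Department of Physics, Informatics)

AI summary: This paper compares commonly used model selection criteria in symbolic regression, with the goal of choosing, from the generated candidate expressions, models that balance accuracy and complexity and generalize well. It systematically evaluates AIC, AICc, BIC, MDL, and Efron's bootstrap on seven synthetic datasets with Gaussian noise, and finds that on most datasets MDL most effectively identifies the models with the lowest test error and the shortest expressions, while BIC also has a high probability of selecting the ground-truth expression.

Abstract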

Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron's bootstrap estimate for the in-sample prediction error on seven synthetic datasets with Gaussian noise. We rank candidate expressions generated by perturbing ground-truth functions to assess generalization error and selection probability of the ground-truth expression. Our findings reveal that MDL consistently identifies models with the lowest test error and the shortest length across most datasets. While no single criterion dominates all results, MDL and BIC produced the highest probability of selecting the ground-truth expressions.
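
Three of the criteria named above have closed forms under i.i.d. Gaussian residuals, which the sketch below evaluates. It uses the textbook definitions with the noise variance estimated as RSS / n. The parameter count k, the two candidate models, and their residual sums of squares are hypothetical and do not come from the study.

```python
import math

def gaussian_neg2_loglik(rss, n):
    """-2 * maximized log-likelihood for i.i.d. Gaussian residuals,
    with the noise variance estimated as rss / n."""
    return n * (math.log(2 * math.pi * rss / n) + 1)

def aic(rss, n, k):
    """Akaike information criterion with k fitted parameters."""
    return 2 * k + gaussian_neg2_loglik(rss, n)

def aicc(rss, n, k):
    """AIC with the small-sample correction; requires n > k + 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(rss, n, k):
    """Bayesian information criterion with k fitted parameters."""
    return k * math.log(n) + gaussian_neg2_loglik(rss, n)

# Two hypothetical candidate expressions fitted to n = 50 points.
n = 50
simple = {"rss": 5.2, "k": 2}
complex_ = {"rss": 5.0, "k": 6}
bic_prefers_simple = (
    bic(simple["rss"], n, simple["k"]) < bic(complex_["rss"], n, complex_["k"])
)
```

Lower values are better for all three criteria. In this toy case the small drop in residual error does not pay for four extra parameters, so BIC prefers the simpler expression.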

2605.11232 2026-05-13 cs.AI cs.LG

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

Prathamesh Vasudeo Naik, Naresh Dintakurthi, Yue Wang

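Affiliations: GitHub

AI summary: This paper studies how to build an efficient large language model (LLM) serving architecture for compliance scenarios such as fraud detection and anti-money laundering (AML). Targeting the prefix-heavy, strongly schema-constrained, and evidence-rich inputs typical of these tasks, the authors propose a workload-aware LLMOps system that combines runtime tuning, prefix caching, multi-adapter serving, batching optimization, and other techniques, substantially improving serving throughput and response time. Experiments on public synthetic datasets show large performance gains, indicating that compliance-grade LLM serving must be improved jointly through workload design, serving optimization, and quality control.

Abstract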

Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.

2605.11224 2026-05-13 cs.CV cs.AI

ABRA: Agent Benchmark for Radiology Applications

Bulat Maksudov, Vladislav Kurenkov, Kathleen M. Curran, Alessandra Mileo

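Affiliations: School of Computing, Dublin City University; School of Medicine, University College Dublin

AI summary: ABRA is an agent benchmark for radiology applications that evaluates medical agents on practical image-handling tasks. Through 21 function-calling tools, the agent operates a medical image viewer and a DICOM server to complete tasks including slice navigation, windowing, annotation, and structured reporting. ABRA contains 655 automatically generated tasks spanning several difficulty tiers and task types, scores agents on planning, execution, and outcome with automatic scorers, and reveals a major bottleneck of current models at the perception level.

Abstract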

Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA

2605.11222 2026-05-13 cs.LG

ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models

Ryan Lucas, Mehdi Makni, Xiang Meng, Adam Deng, Rahul Mazumder

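Affiliations: MIT Operations Research Center; MIT Sloan School of Management; MIT Center for Statistics

AI summary: This paper proposes ADMM-Q, an improved Hessian-based weight quantization method for post-training quantization of large language models. The method builds on a modified Alternating Direction Method of Multipliers (ADMM), using a layer-wise optimization strategy that progressively minimizes the layer-wise reconstruction error while satisfying the quantization constraints, and adds enhancements such as penalty scheduling, preconditioning, and local search to improve efficiency. Experiments show that ADMM-Q markedly lowers perplexity across several quantization settings and outperforms mainstream quantizers such as GPTQ.

Abstract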

Quantization is an effective strategy to reduce the storage and computation footprint of large language models (LLMs). Post-training quantization (PTQ) is a leading approach for compressing LLMs. Popular weight quantization procedures, including GPTQ and RTN, suffer in model utility, especially at aggressive quantization levels (sub-4-bit). We propose ADMM-Q, a novel weight quantization algorithm that considers the layer-wise quantization problem. Our algorithm is based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Our operator-splitting procedure updates weights continuously to minimize the layer-wise reconstruction error, while gradually enforcing the quantization constraints with convergence guarantees. We propose additional algorithmic enhancements (e.g., penalty scheduling, preconditioning, and a local search post-processing step) to make ADMM-Q efficient at LLM scale. ADMM-Q is modular and can be used as a drop-in replacement for any weight quantizer within existing quantization pipelines: ADMM-Q is fully composable with existing techniques including range clipping, learned or random rotations, and activation scaling. Using ADMM-Q in place of GPTQ on Qwen3-8B, we decrease WikiText-2 perplexity in: (i) the W3A16 weight-only setting (12.85 $\rightarrow$ 10.06); (ii) the W4A8 SmoothQuant procedure (9.29 $\rightarrow$ 8.68); and (iii) the W2A4KV4 SpinQuant procedure (66.11 $\rightarrow$ 19.42).
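
For context, the RTN baseline named above can be sketched in a few lines. This is a minimal round-to-nearest quantizer over a uniform min/max grid, not ADMM-Q itself: it rounds each weight independently and ignores the layer-wise reconstruction error that ADMM-Q is described as minimizing. The weight values and helper names are hypothetical.

```python
def rtn_quantize(weights, bits):
    """Round-to-nearest (RTN) onto a uniform grid spanning [min, max]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    # Guard against a constant row, where the grid would have zero width.
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]

def max_abs_error(original, quantized):
    """Largest per-weight rounding error."""
    return max(abs(a - b) for a, b in zip(original, quantized))

weights = [-0.9, -0.2, 0.05, 0.3, 0.8]  # hypothetical weight row
err_2bit = max_abs_error(weights, rtn_quantize(weights, bits=2))
err_4bit = max_abs_error(weights, rtn_quantize(weights, bits=4))
```

The per-weight error is bounded by half the grid spacing, which is why it grows quickly at sub-4-bit widths, the regime the abstract targets.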

2605.11218 2026-05-13 cs.AI

Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs

M. Shalankin

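Affiliations: M. Shalankin

AI summary: This study shows that vision-language models (VLMs) are systematically biased by numeric anchors embedded in images when judging image quality, and that the bias is widespread across model architectures. Layer-wise analysis finds a dissociation: the layers at which anchor classification saturates are suboptimal for quality prediction, whereas deeper layers predict quality better. The study also shows that models differ in how they fuse anchor information, offering a causal account of visual anchoring bias and its link to representation dynamics.

Abstract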

Embedded numeric anchors on images systematically bias Vision-Language Model quality judgments across six VLMs from five architectural families (ANOVA eta^2 = 0.18-0.77, all p < 0.001). Anchor effects are 2.5x larger than severe image quality degradation, confirming bias is not reducible to visual changes. Layer-wise probing reveals consistent dissociation: layers where anchor classification saturates (L12-L34) are suboptimal for quality prediction, with optimal layers deeper (R^2 = 0.69-0.91). Fusion analysis identifies architecture-dependent integration -- instant fusion at L1-L2 in two models versus partial or no fusion in three others. These results establish a causal account of visual anchoring bias, linking behavioral susceptibility to representation dynamics.
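
The effect size reported above, ANOVA eta^2, is the share of total variance explained by group membership. A minimal sketch of the one-way definition follows; the grouping by anchor value and the scores are made-up illustration data, not results from the paper.

```python
def eta_squared(groups):
    """One-way ANOVA effect size: SS_between / SS_total."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Hypothetical quality scores, grouped by the anchor number shown on the image.
scores_by_anchor = [
    [3.0, 4.0, 5.0],  # low anchor
    [5.0, 6.0, 7.0],  # high anchor
]
effect = eta_squared(scores_by_anchor)  # SS_between = 6, SS_total = 10
```

Here 60% of the score variance is attributable to the anchor group alone.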

2605.11217 2026-05-13 cs.LG cs.AI cs.CR

Leveraging RAG for Training-Free Alignment of LLMs

John T. Halloran

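Affiliations: Leidos

AI summary: This paper proposes RAG-Pref, an alignment method based on retrieval-augmented generation (RAG) that improves the ability of large language models (LLMs) to refuse agentic attacks without additional training. The method performs online alignment by using contrastive information from preferred and dispreferred samples during inference, with low computational overhead and compatibility with existing tools. Experiments show that RAG-Pref substantially improves attack refusal on five mainstream LLMs, also performs well on general human-preference alignment tasks, and does not markedly increase compute requirements.

Comments 19 pages, 4 figures, and 6 tables

Abstract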

Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that state-of-the-art alignment algorithms require significant computational resources while being far less capable of enabling refusal guardrails for recent agentic attacks. Thus, to improve refusal guardrails against such attacks without drastically increasing computational overhead, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a simple RAG-based alignment algorithm which conditions on preferred and dispreferred samples to leverage contrastive information during inference. RAG-Pref is online (training-free), compatible with off-the-shelf packages, and, when combined with offline (training-based) alignment algorithms, enables an average improvement of more than 3.7x in agentic attack refusals across five widely used LLMs, compared to 2.9x for other online alignment algorithms and 1.5x for offline alignment alone. We conclude by showing that, in stark contrast to other online alignment methods, RAG-Pref similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.

2605.11214 2026-05-13 cs.LG

Enforcing Constraints in Generative Sampling via Adaptive Correction Scheduling

Noah Trupin, Yexiang Xue

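Affiliations: Department of Computer Science

AI summary: This paper studies how to enforce hard constraints effectively during generative sampling. It points out that the conventional choices of projecting once at the end of sampling or at every step ignore the effect of projection on the state distribution, which can yield samples that satisfy the constraints but are inconsistent with the intended dynamics. The authors formalize constraint enforcement as a correction scheduling problem over the generative process and propose a state-dependent adaptive correction schedule that allocates projection budget according to the constraint defect at each step, improving sampling accuracy with fewer corrections. Experiments on several generative models show that the method improves the efficiency and quality of constrained sampling.

Abstract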

Hard constraints in generative sampling are typically enforced by projection, applied either once at the end of sampling or after every update. This binary framing overlooks a fundamental issue: projection changes the distribution of states which future updates depend on. As a result, delayed projection can produce samples that are feasible but inconsistent with the intended sampling dynamics, even after final projection. We formalize constraint enforcement as a correction scheduling problem over the generative rollout. Using one-step constraint defect as a local signal of geometric mismatch, we introduce adaptive correction scheduling, a state-dependent policy that allocates projection budget to the steps that most strongly perturb the trajectory. Terminal and stepwise projection arise as limiting cases of this family. Across controlled manifold rollouts and a learned projected diffusion sampler, adaptive scheduling improves the cost-accuracy frontier at matched projection budgets, recovering 71.2% of full stepwise benefit with 75% fewer corrections. These results show that constraint timing is a first-class design variable in generative sampling, and that enforcing feasibility alone is insufficient to preserve the intended constrained sampling dynamics.
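
The idea that terminal and stepwise projection are limiting cases of one schedule family can be shown on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's method: the constraint set is the unit circle, the update is a plain Euler step along the tangent, and the policy is a fixed threshold on the current distance from the circle rather than a learned or budgeted rule.

```python
import math

def project_to_circle(x, y):
    """Project a 2-D point onto the unit circle (the constraint set)."""
    r = math.hypot(x, y)
    return x / r, y / r

def rollout(steps, step_size, threshold):
    """Euler steps along the tangent direction, projecting only when the
    constraint defect | ||x|| - 1 | exceeds `threshold`."""
    x, y = 1.0, 0.0
    projections = 0
    for _ in range(steps):
        # Tangent update: drifts slightly off the circle at every step.
        x, y = x - step_size * y, y + step_size * x
        defect = abs(math.hypot(x, y) - 1.0)
        if defect > threshold:
            x, y = project_to_circle(x, y)
            projections += 1
    # A final projection guarantees feasibility under every schedule.
    x, y = project_to_circle(x, y)
    return (x, y), projections

_, n_stepwise = rollout(steps=100, step_size=0.1, threshold=0.0)
_, n_terminal = rollout(steps=100, step_size=0.1, threshold=math.inf)
_, n_adaptive = rollout(steps=100, step_size=0.1, threshold=0.02)
```

A threshold of zero recovers stepwise projection and an infinite threshold recovers terminal projection. Intermediate thresholds spend projections only when the drift has grown large enough to matter.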

2605.11210 2026-05-13 cs.RO

Distributed Pose Graph Optimization via Continuous Riemannian Dynamics

Jaeho Shin, Maani Ghaffari, Yulun Tian

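Affiliations: University of Michigan Ann Arbor

AI summary: This paper proposes a distributed pose graph optimization (PGO) framework based on a second-order continuous-time dynamical system on Lie groups. By modeling pose variables as particles subject to damping, the equilibria of the resulting Riemannian dynamics coincide with the first-order critical points of the original PGO problem. Using the damped Euler-Poincaré equations and a semi-implicit geometric integrator, the method yields an optimization algorithm that generalizes existing Riemannian gradient descent and Gauss-Newton methods, and in multi-robot settings it gives a fully distributed, parallel solver based on block-diagonal mass and damping matrices, with low communication overhead and good convergence. Experiments show that the solver outperforms existing distributed methods in both synchronous and asynchronous settings.

Abstract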

We present a framework for distributed Pose Graph Optimization (PGO) by formulating the problem as a second-order continuous-time dynamical system evolving on Lie groups. By modeling pose variables as massive particles subject to damping, the equilibrium points of the resulting Riemannian dynamics coincide with first-order critical points of the original PGO problem. Using the governing damped Euler--Poincaré equations and a semi-implicit geometric integrator, we design an optimization algorithm that generalizes existing algorithms such as Riemannian gradient descent and Gauss--Newton. In multi-robot settings, we present a fully distributed and parallel method based on block-diagonal mass and damping matrices, where each robot solves an ordinary differential equation for its own poses with minimal communication overhead. Moreover, modeling both state and velocity enables principled neighbor prediction that significantly improves convergence under delayed communication. Theoretically, we present an analysis and establish a sufficient condition that ensures energy dissipation under the employed geometric discretization scheme. Experiments on benchmark PGO datasets demonstrate that the proposed solver achieves superior performance compared to state-of-the-art distributed baselines in both synchronous and asynchronous regimes.

2605.11209 2026-05-13 cs.LG

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

Eungyeup Kim, Chenchen Gu, Vashisth Tiwari, J. Zico Kolter

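Affiliations: Carnegie Mellon University

AI summary: Existing benchmarks show large language models performing near-perfectly on many tasks, which obscures the need to evaluate their reliability rigorously. This paper proposes a sample-efficient evaluation method that identifies systematic patterns in model failures and uses the cross-entropy method to learn a sampling distribution concentrated on failure-prone inputs, greatly reducing the number of inferences required. Experiments across several models and tasks show efficiency gains of up to 156x and reveal that models with similar benchmark performance can differ substantially in reliability, underscoring reliability as a distinct and measurable dimension of model quality.

Comments Project page: https://five-nines-reliability.notion.site/Measuring-Five-Nines-Reliability-Sample-Efficient-LLM-Evaluation-in-Saturated-Benchmarks-312b998d4f39802d88c0e9886db1b9cd

Abstract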

While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world deployment, however, achieving extremely high reliability (e.g., "five-nines" (99.999%) vs. "three-nines" (99.9%)) is fundamentally critical, as this gap results in an order-of-magnitude increase in failures, which is catastrophic in reliability-critical applications. Still, estimating such a rare failure probability with tight confidence bounds requires prohibitively large LLM inference sizes, making standard Monte Carlo evaluation infeasible under limited compute budgets. In this paper, we observe that LLM failures exhibit strong systematic patterns: across broad parameterized input spaces, a small subset of inputs disproportionately accounts for the majority of failures. Leveraging this observation, we propose to learn a sampling distribution concentrated on failure-prone inputs via the cross-entropy method (CEM). We evaluate our framework on three LLMs, Qwen2.5-Math-7B-Instruct, gpt-oss-20b-low, and Gemini 2.5 Flash Lite, across parameterized GSM8K templates and achieve up to 156.22x reduction in required inferences compared to naive uniform sampling. Our estimates reveal that models with indistinguishable accuracy on standard benchmarks can differ substantially in estimated failure rates, underscoring that reliability is a distinct and measurable axis of model quality. Our simple yet practical framework enables the evaluation of extreme reliability in LLMs, a distinct and underexplored dimension of evaluation beyond existing benchmarks, for their growing use in reliability-sensitive applications.
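
A back-of-the-envelope calculation shows why standard Monte Carlo evaluation becomes infeasible at this reliability level. The sketch below uses the exact one-sided binomial bound for the zero-failure case with independent trials. It is a generic statistical illustration with a 95% confidence level chosen as an assumption, not the estimator used in the paper.

```python
import math

def zero_failure_upper_bound(n, confidence=0.95):
    """Exact one-sided upper confidence bound on the failure probability
    after observing 0 failures in n independent trials."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

def trials_needed(target_failure_rate, confidence=0.95):
    """Smallest n such that 0 failures in n trials certifies the target."""
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - target_failure_rate)
    )

three_nines = trials_needed(1e-3)  # certify 99.9% reliability
five_nines = trials_needed(1e-5)   # certify 99.999% reliability
```

Even in the best case of observing no failures at all, certifying five-nines needs roughly 100 times as many uniform samples as three-nines, on the order of 3 / p trials for a target failure rate p.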

2605.11205 2026-05-13 cs.LG cs.AI

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

Jung Min Kang

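Affiliations: Independent Researcher

AI summary: This paper studies how simple averaging fails to produce correct evaluation rankings when data are sparse and items differ widely in difficulty, and shows that item response theory (IRT) recovers the true ranking more accurately. In experiments across several domains (such as natural language processing and clinical trials), the authors find that the rank correlation of simple averaging drops markedly as data coverage decreases, whereas IRT-based models retain high accuracy. The study characterizes how evaluation failure scales and offers a more reliable evaluation approach for benchmarking in areas such as Physical AI.

Comments 15 pages, 4 tables, 1 figure. Code at https://github.com/testofschool/evaluation-failure-scaling-law

Abstract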

Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that Spearman rank correlation $ρ$ between simple-average rankings and ground-truth rankings degrades from $ρ= 1.000$ at 100% coverage to $ρ= 0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $ρ\geq 0.996$ across all conditions. A 150-condition grid sweep over sparsity $S \in [0, 0.70]$ and difficulty gap $D \in [0.5, 5.0]$ confirms that ranking error forms a failure surface with a strong $S \times D$ interaction ($γ_3 = +0.20$, $t = 13.05$), while IRT maintains $ρ\geq 0.993$ throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.
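
The failure mode described above can be reproduced with two models and four items. The sketch below uses the standard 2PL item response function and compares expected mean scores on a sparse matrix against the full matrix. The abilities, difficulties, and coverage pattern are hypothetical values picked to make the effect visible, and no IRT fitting is performed.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """2PL item response function: P(correct) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

abilities = {"strong": 1.0, "weak": 0.0}  # hypothetical latent abilities
difficulties = {"easy_1": -2.0, "easy_2": -1.5, "hard_1": 2.0, "hard_2": 2.5}

# Sparse evaluation matrix: each model was run on only three of four items.
observed = {
    "strong": ["easy_1", "hard_1", "hard_2"],
    "weak": ["easy_1", "easy_2", "hard_1"],
}

def simple_average(model, items):
    """Expected mean score of a model over the given items."""
    total = sum(p_correct(abilities[model], difficulties[i]) for i in items)
    return total / len(items)

sparse = {m: simple_average(m, observed[m]) for m in abilities}
full = {m: simple_average(m, list(difficulties)) for m in abilities}
```

On the sparse matrix the weaker model gets the higher average because it happened to skip a hard item, while full coverage restores the correct order. A model that accounts for item difficulty avoids rewarding an easier item mix.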

2605.11203 2026-05-13 cs.LG cs.CV

FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry

Elias B. Krey, Nils Neukirch, Nils Strodthoff

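Affiliations: Division AI4Health, Carl von Ossietzky Universität Oldenburg

AI summary: This paper studies the geometric structure of intermediate feature representations in deep neural networks by applying a variety of image transformations in input space and assessing whether a mapping from the original features to the transformed features can be learned in feature space. It designs several kinds of mappings, linear and non-linear, local and global, and analyzes their reconstruction quality and semantic content. The results show that even for complex semantic transformations, a shared linear model operating on a single feature vector achieves good reconstruction, suggesting that the feature space may be linearly structured to some degree. The study offers a new perspective on how the feature space is organized and shows the potential of generative image editing models in this area.

Comments 27 pages, 24 figures, 3 tables, Code is available at https://github.com/AI4HealthUOL/FeatMap

Abstract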

Intermediate feature representations represent the backbone for the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this submission, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space, mapping from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non-linear and local to global mappings and assess both the reconstruction quality of the mapping as well as the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve best results, we show that the same can be achieved with a shared linear model operating on a single feature vector typically with very little degradation in reconstruction quality, even for highly non-trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is to a first degree of approximation organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.

2605.11196 2026-05-13 cs.LG

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

Vishal Pandey, Gopal Singh

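Affiliations: Independent Researcher; Metriqual

AI summary: This paper proposes Variational Linear Attention (VLA), a method that addresses the memory interference that conventional linear attention suffers on long contexts. VLA models the memory update as an online regularized least-squares problem with an adaptive penalty matrix, which controls the growth of the state norm and guarantees stability. Experiments show that VLA greatly reduces the norm of the memory state while maintaining high retrieval performance, and outperforms existing methods in efficiency and accuracy on long sequences.

Comments 20 pages

Abstract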

Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce Variational Linear Attention (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly $1$ for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T = 1{,}000$, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective per-head memory regime ($n_\text{pairs} < d_h$), maintaining substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and maintains 62% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves $14\times$ speedup over sequential Python and $\mathcal{O}(T)$ scaling, crossing below softmax attention latency at approximately 43,000 tokens.
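
The Sherman-Morrison rank-1 formula mentioned above updates a known inverse after a rank-1 change without re-inverting the matrix. The sketch below implements the textbook identity in plain Python. How VLA builds its penalty matrix is not specified in the abstract, so the identity start and the key vector k are assumptions for illustration.

```python
def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def sherman_morrison(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(d^2) rather than O(d^3)."""
    a_inv_u = mat_vec(a_inv, u)  # A^{-1} u
    v_a_inv = vec_mat(v, a_inv)  # v^T A^{-1}
    denom = 1.0 + sum(x * y for x, y in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 update makes the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(len(v))]
        for i in range(len(u))
    ]

# Start from A = I (so A^{-1} = I) and add the rank-1 term k k^T.
identity = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 2.0]
updated_inv = sherman_morrison(identity, k, k)  # inverse of [[2, 2], [2, 5]]
```

Because each token contributes one rank-1 term, this kind of update keeps the per-step cost quadratic in the head dimension, consistent with the linear-in-T scaling the abstract reports.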

2605.11195 2026-05-13 cs.CL

How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation

Eduardo Tenorio, Karuna Bhaila, Xintao Wu

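Affiliations: University of Arkansas

AI summary: This paper systematically evaluates how differential privacy (DP) affects social bias in large language models (LLMs), comparing a DP-trained model against non-DP baselines across four complementary task paradigms. It finds that DP reduces bias in sentence scoring tasks but has mixed effects on the other tasks, revealing a discrepancy between logit-level and output-level bias. The results show that reducing memorization does not necessarily reduce unfairness, which underscores the importance of multi-paradigm analysis when assessing fairness in LLMs.

Comments 14 pages, 1 figure

Abstract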

Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.

2605.11192 2026-05-13 cs.SD cs.AI cs.LG

Exploring Token-Space Manipulation in Latent Audio Tokenizers

Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

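Affiliations: Mila – Québec AI Institute; Université Laval; Concordia University

AI summary: This paper explores token-space manipulation in latent audio tokenizers and proposes LATTE, a new audio tokenizer that introduces learnable latent tokens to make global speech attributes editable. While preserving high-quality speech reconstruction, the method makes it possible to modify global attributes such as speaker identity or background noise by swapping tokens; its effectiveness is validated on voice conversion and denoising tasks, suggesting a new route to controllable audio editing without supervision.

Abstract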

Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.

2605.11189 2026-05-13 cs.LG q-bio.BM

Deep Learning for Protein Complex Prediction and Design

Ziwei Xie

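Affiliations: Toyota Technological Institute at Chicago

AI summary: This thesis studies how to use deep learning to model and design protein complex structures accurately, a central problem in computational structural biology with important implications for understanding cellular function and developing drugs. It proposes deep learning architectures tailored to the hierarchical nature of protein structures and designs efficient search algorithms that find interacting homologs in vast sequence spaces, improving the accuracy of complex structure prediction and protein sequence design.

Comments PhD thesis

Abstract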

Accurately modeling and designing protein complex structures is a central problem in computational structural biology, with broad implications for understanding cellular function and developing therapeutics. This thesis investigates two fundamental aspects of this problem using deep learning: domain-specific architectures that capture the hierarchical nature of protein structures, and search algorithms that efficiently navigate the vast sequence spaces of protein complexes to identify interacting homologs for improving complex structure prediction and to design protein sequences.

2605.11186 2026-05-13 cs.LG cs.AI

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

Yuning Han, Yangchenchen Jin, Dylan Zhao, Jingwei Sun

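Affiliations: University of Florida

AI summary: When running large language model inference on memory-limited devices, auto-regressive decoding is bound by memory bandwidth, and existing speculative decoding methods usually assume that device memory can hold both the target model and an auxiliary draft model, which does not hold on edge devices. This paper proposes CATS, a cascaded adaptive tree speculation framework that performs cascaded verification and correction based on the memory budget and parameter offloading patterns, substantially speeding up inference without increasing peak memory footprint. Experiments on several real edge devices show speedups of up to 5.08x with no loss in generation quality, outperforming the best existing method by up to 1.45x.

Abstract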

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously -- an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
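
The verification step that speculative decoding relies on can be sketched generically. This is plain greedy draft verification over a single token chain, not the cascaded tree scheme of CATS, and `toy_target` is a hypothetical stand-in for a real target model. A real system would score all draft positions in one batched forward pass instead of calling the target once per token as done here.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Greedy speculative verification: accept draft tokens while they match
    the target model's own next-token choice, then append one target token."""
    accepted = list(prefix)
    for token in draft_tokens:
        expected = target_next_token(accepted)
        if token != expected:
            accepted.append(expected)  # correction from the target model
            return accepted
        accepted.append(token)
    # Every draft token was accepted: the target still yields one bonus token.
    accepted.append(target_next_token(accepted))
    return accepted

def toy_target(tokens):
    """Toy target model: always continues counting upward."""
    return tokens[-1] + 1

out = verify_draft([1, 2], draft_tokens=[3, 4, 9, 10], target_next_token=toy_target)
# out == [1, 2, 3, 4, 5]: two drafts accepted, the third replaced by the target
```

Each verification round therefore produces at least one target-quality token, and more whenever the draft agrees with the target, which is how the cost of a target-model call is amortized.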

2605.11181 2026-05-13 cs.LG cs.AI cs.NA math.NA math.OC stat.ML

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

Zakhar Shumaylov, Nathaël Da Costa, Peter Zaika, Bálint Mucsányi, Alex Massucco, Yoav Gelberg, Carola-Bibiane Schönlieb, Yarin Gal, Philipp Hennig

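Affiliations * University of Cambridge; University of Tübingen; University of Oxford

AI summary This paper challenges the prevailing view that the Muon optimizer's success in non-Euclidean optimization depends on geometric structure, arguing that precise geometry is not the key factor in optimization performance. It introduces Freon, a family of optimizers based on Schatten (quasi-)norms; on GPT-2 the best-performing Schatten parameters lie strictly within the quasi-norm regime, which standard LMO theory cannot represent. It then proposes Kaon, which replaces singular values with random noise yet still matches Muon's performance, indicating that strict adherence to a precise geometry is unnecessary. The authors argue that performance is governed by two local quantities, alignment and descent potential, rather than by global geometry.

Comments 45 pages

Abstract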

The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
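As a reading aid, the interpolation between SGD and Muon can be written out with the standard linear minimization oracle over a Schatten-$p$ ball. This is a textbook consequence of Hölder's inequality, not a formula taken from the paper. The way Freon extends it to the quasi-norm regime is not given in the abstract and is left out. The form shown for Kaon is only an assumed reading of "replaces singular values with random noise", and the noise distribution is unspecified here.

```latex
% G = U \operatorname{diag}(\sigma) V^\top is the SVD of a gradient (or momentum) matrix.
% For p \ge 1 with 1/p + 1/q = 1, the maximizer of \langle G, X \rangle over \|X\|_{S_p} \le 1 is
\[
  X_p \;=\; U \operatorname{diag}\!\left( \frac{\sigma_i^{\,q-1}}{\|\sigma\|_q^{\,q-1}} \right) V^\top ,
  \qquad \langle G, X_p \rangle = \|G\|_{S_q},
\]
% and the descent step moves along -X_p.
% Special cases:
\[
  p = 2:\;\; X_2 = \frac{G}{\|G\|_F} \quad \text{(normalized SGD direction)},
  \qquad
  p = \infty:\;\; X_\infty = U V^\top \quad \text{(Muon-style orthogonalized update)}.
\]
% Assumed reading of Kaon: keep U and V, replace the spectrum by random values \xi_i.
\[
  X_{\mathrm{Kaon}} \;=\; U \operatorname{diag}(\xi_1, \dots, \xi_r)\, V^\top .
\]
```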

2605.11178 2026-05-13 cs.LG cs.AI math.RT

Oversmoothing as Representation Degeneracy in Neural Sheaf Diffusion

Arif Dönmez, Axel Mosig, Ellen Fritsche, Katharina Koch

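Affiliations * IUF – Leibniz Research Institute for Environmental Medicine; DNTOX GmbH; Bioinformatics Group, Ruhr University Bochum; Swiss Centre for Applied Human Toxicology (SCAHT)

AI summary This paper studies oversmoothing in Neural Sheaf Diffusion (NSD) and interprets it as a degeneration of the representation geometry. By identifying cellular sheaves on graphs with representations of the associated incidence quiver, the authors expose the algebraic structure of the harmonic space that NSD reaches in the diffusion limit, and show that learned sheaf geometries may collapse toward low-complexity representations that lose discriminative information. They introduce moment-map-inspired regularizers that bias restriction maps toward more balanced geometries, identify a stability obstruction in equal-stalk architectures, and show that non-uniform stalk dimensions remove it. Experiments on heterophilic benchmarks indicate that breaking stalk-dimension symmetry can reduce variance or improve validation behavior.

Comments 15 pages, Comments welcome

Abstract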

Neural Sheaf Diffusion (NSD) generalizes diffusion-based Graph Neural Networks by replacing scalar graph Laplacians with sheaf Laplacians whose learned restriction maps define a task-adapted geometry. While the diffusion limit of NSD is known to be the space of global sections, the representation-theoretic structure of this harmonic space remains largely implicit. We develop a quiver-theoretic interpretation of NSD by identifying cellular sheaves on graphs with representations of the associated incidence quiver. Under this correspondence, learned sheaf geometries become points in a finite-dimensional representation space. We show that direct-sum decompositions of the underlying incidence-quiver representation induce decompositions of the harmonic space reached in the diffusion limit. This gives an algebraic interpretation of oversmoothing as representation degeneration: learned sheaves may collapse toward low-complexity summands whose global sections fail to preserve discriminative information. Building on this viewpoint, we connect sheaf diffusion to stability and moment-map principles from Geometric Invariant Theory. We introduce moment-map-inspired regularizers that bias restriction maps toward balanced representation geometries, and identify a structural obstruction in equal-stalk architectures: when $d_v = d_e$, admissibility for learnable stability parameters forces the trivial all-object summand onto a stability wall. Non-uniform stalk dimensions remove this obstruction, making adaptive stability meaningful. Experiments on heterophilic benchmarks are consistent with this mechanism: breaking stalk symmetry can reduce variance or improve validation behavior, and adaptive stability becomes more effective in selected rectangular settings. Overall, our framework reframes oversmoothing as a degeneration phenomenon in the representation geometry underlying learned sheaf diffusion.

2605.11172 2026-05-13 cs.LG

Optimistic Dual Averaging Unifies Modern Optimizers

Thomas Pethick, Wanyun Xie, Roman Machacek, Volkan Cevher

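Affiliations * EPFL (LIONS); University of Bern

AI summary This paper proposes SODA, a generalization of Optimistic Dual Averaging that unifies state-of-the-art optimizers such as Muon, Lion, AdEMAMix and NAdam as optimistic instances of a single framework. Building on this view, the authors propose a practical SODA wrapper for any base optimizer that removes the need to tune weight decay by using a theoretically grounded $1/k$ decay schedule. Experiments across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.

Abstract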

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.

2605.11169 2026-05-13 cs.AI

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

Sheldon Yu, Junda Wu, Xintong Li, Nikki Lijing Kuang, Sizhe Zhou, Tong Yu, Jiawei Han, Jingbo Shang, Julian McAuley

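Affiliations * UC San Diego; University of Illinois at Urbana-Champaign; Adobe Research

AI summary This paper proposes OLIVIA, an inference-time action adaptation framework that improves the deployment-time decision making of ReAct-style LLM agents. OLIVIA models the agent's final action-selection layer as a contextual linear bandit with upper-confidence-bound (UCB) exploration, using frozen hidden states as decision contexts, so that action selection can be adjusted directly and with explicit uncertainty estimates while the underlying reasoning process is preserved. On four benchmarks, OLIVIA consistently improves task performance over static ReAct and prompt-based inference-time baselines, supporting efficient, fine-grained, and uncertainty-aware online adaptation during deployment.

Abstract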

Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.
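The decision layer described above, a contextual linear bandit with upper-confidence-bound exploration, can be illustrated with a plain LinUCB learner. This is a generic sketch under stated assumptions, not OLIVIA's implementation: the class name `LinUCB`, the shared parameter vector, the `alpha` and `ridge` values, and the use of short hand-written feature vectors in place of frozen LLM hidden states are all illustrative choices.

```python
import math
from typing import List, Sequence


class LinUCB:
    """Linear UCB with one shared parameter vector (illustrative, pure Python)."""

    def __init__(self, dim: int, alpha: float = 1.0, ridge: float = 1.0) -> None:
        self.dim = dim
        self.alpha = alpha  # exploration strength (assumed value)
        # Inverse of the design matrix A = ridge * I + sum of x x^T.
        self.a_inv = [
            [(1.0 / ridge) if i == j else 0.0 for j in range(dim)]
            for i in range(dim)
        ]
        self.b = [0.0] * dim  # running sum of reward * x

    def _matvec(self, x: Sequence[float]) -> List[float]:
        return [sum(row[j] * x[j] for j in range(self.dim)) for row in self.a_inv]

    def score(self, x: Sequence[float]) -> float:
        """Estimated reward plus an uncertainty bonus for context x."""
        theta = self._matvec(self.b)
        mean = sum(t * xi for t, xi in zip(theta, x))
        variance = sum(xi * v for xi, v in zip(x, self._matvec(x)))
        return mean + self.alpha * math.sqrt(max(0.0, variance))

    def select(self, candidates: Sequence[Sequence[float]]) -> int:
        """Index of the candidate action with the highest UCB score."""
        scores = [self.score(x) for x in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)

    def update(self, x: Sequence[float], reward: float) -> None:
        """Rank-one Sherman-Morrison update of A^{-1}, then accumulate b."""
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(self.dim):
            for j in range(self.dim):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(self.dim):
            self.b[i] += reward * x[i]


# Two candidate actions with hypothetical 2-d contexts; action 1 is rewarded.
bandit = LinUCB(dim=2, alpha=0.1)
actions = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(20):
    bandit.update(actions[0], 0.0)
    bandit.update(actions[1], 1.0)
best = bandit.select(actions)
```

The uncertainty bonus shrinks as a context direction accumulates feedback, which is what makes each update lightweight while still giving an explicit confidence estimate per action.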

2605.11167 2026-05-13 cs.CL cs.AI cs.LG

The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models

Cedric Flamant, Udaya Ghai, Kanna Shimizu

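Affiliations * AWS Agentic AI

AI summary This paper proposes the Bicameral Model, which couples two pretrained language models bidirectionally through a trainable neural interface on their intermediate hidden states, letting them coordinate over a continuous, concurrent channel rather than through generated text. The two models run in lockstep at every generation step: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, and each conditions on the other's activations through a translation network and a learned suppression gate. Experiments on arithmetic, logic grid puzzles, and mathematical reasoning show substantial gains, demonstrating the approach's effectiveness for multi-model collaboration.

Comments 9 pages main text, 5 figures, 24 pages appendix

Abstract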

Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate ($\sim$1\% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36\% to 96\%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves $1.7\times$ the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.

2605.11166 2026-05-13 cs.CV

Unpacking the Eye of the Beholder: Social Location, Identity, and the Moving Target of Political Perspectives

Elena Sirotkina

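Affiliations * Center for Data Science

AI summary This paper examines how political and social identities shape people's evaluations of political information and notes that conventional computational tools often ignore these differences. The author develops the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from a large set of evaluations by U.S. adults to predict how groups defined by political and social identities will evaluate the same image. The method preserves systematic disagreement between groups and shows that the meaning of a political image is a moving target, so understanding what an image conveys requires accounting for the identity of its audience.

Abstract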

Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.

2605.11161 2026-05-13 cs.LG cs.AI

Interpretability Can Be Actionable

Hadas Orgad, Fazl Barez, Tal Haklay, Isabelle Lee, Marius Mosbach, Anja Reusch, Naomi Saphra, Byron Wallace, Sarah Wiegreffe, Eric Wong, Ian Tenney, Mor Geva

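Affiliations * Kempner Institute at Harvard University; University of Southern California; Mila – Quebec AI Institute; McGill University; Google DeepMind; Tel Aviv University; University of Pennsylvania; University of Maryland; University of Oxford; Northeastern University; Boston University

AI summary This position paper addresses the practical value of interpretability research on deep neural networks, arguing that the field lacks evaluation criteria for whether insights translate into concrete decisions and interventions. The authors propose actionability as the core criterion for evaluating interpretability, define actionable interpretability along two dimensions, concreteness and validation, and analyze the barriers that currently prevent real-world impact. They identify five domains where interpretability offers unique leverage and present an evaluation framework aligned with practical outcomes, with the aim of establishing actionability as a core objective alongside exploratory research.

Comments Accepted to ICML 2026

Abstract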

Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability--the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions--concreteness and validation--and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.

2605.11153 2026-05-13 cs.CL cs.LG cs.NE

Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary

Ramchand Kumaresan

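Affiliations * Murai Labs

AI summary This paper decomposes the performance of an evolutionary mixture-of-LoRA system on a specific base model into three factors: a router rewrite, a per-domain evaluation scope, and a lifecycle policy. The experiments attribute the performance gain to the router rewrite, find the per-domain evaluation scope to be null, and show that the lifecycle is a net drag. The study also finds that evolutionary search on the routing channel is effective only when adapters are pre-aligned to the task, which marks a substrate-conditional boundary for this class of architectures.

Abstract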

We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V approx 0.048; the "widened-1536" substrate) into three factors -- a router rewrite (parallel sigmoid gate with learnable per-adapter floor and bounded temperature anneal, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death plus alpha-blend inheritance plus SVD mutation plus slot reallocation -- and report a 5-of-8 partial 2^3 factorial run at n=3 seeds and 25000 adaptation steps per cell. The attribution chain is sharp on this substrate: the router rewrite carries the entire +0.0426 nat balanced log-PPL improvement (Delta = log PPL_ref - log PPL_test, positive = improvement; t=12.86, p=0.006) attributed to "the full evolutionary system vs the static B3 baseline"; the headline full-system-vs-B3 balanced contrast itself is +0.015 nats, t=1.94, p=0.19 at n=3 and does not clear alpha=0.05. The per-domain evaluation scope is null at seed-resolution, and the lifecycle is a net drag of approx -0.028 nats (t=-4.46,p=0.047 in the primary chain). An auxiliary alpha=0 inheritance counterfactual at n=3 seeds is sign-inconsistent at the headline metric and underpowered for either an equivalence or load-bearing conclusion (corrected from an earlier arithmetic-mean aggregator that erroneously cleared inheritance; see Appendix B.11). A base-perturbation probe directionally refutes a "genomic-context" reframe of the lifecycle role. A controllable synthetic sandbox locates a substrate-conditional regime boundary: evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task; in every other regime tested it underperforms, ties, or actively degrades the gradient solution.

2605.11144 2026-05-13 cs.RO

Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation

Kaixin Jia, Jiacheng Xu

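Affiliations * KTH Royal Institute of Technology

AI summary This paper proposes Forecast-aware Gaussian Splatting (Forecast-GS), a predictive 3D representation framework for language-guided robotic pick-and-place manipulation. By explicitly modeling the task-completed state, the method improves the robot's ability to evaluate whether a candidate action is feasible under partial observations. In real-world trials on three tasks, Forecast-GS achieves higher success rates than the ReKep baseline, indicating that it provides an interpretable bridge between language understanding, 3D perception, and robotic manipulation planning.

Abstract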

We introduce Forecast-aware Gaussian Splatting (Forecast-GS), a predictive 3D representation framework for language-conditioned robotic manipulation. While recent manipulation systems have made progress by grounding language instructions into robot affordances, value maps, or relational keypoint constraints, they usually reason over the current scene and do not explicitly model the task-completed state. This limitation is critical when success depends on satisfying spatial and semantic goals under partial observations, where the robot must evaluate whether a candidate action leads to a feasible task-consistent outcome. We validate Forecast-GS on real-world pick-and-place manipulation tasks, including Cutter-to-Box, Apple-to-Bowl, and Sponge-to-Tray. For each task, we conduct 25 real-world trials under varied initial object configurations using the same robot platform and sensing setup. Forecast-GS with automatic candidate selection achieves success rates of 21/25, 23/25, and 16/25 on the three tasks, respectively, outperforming the ReKep baseline, which achieves 15/25, 19/25, and 10/25. A diagnostic human-assisted setting further improves success rates to 23/25, 24/25, and 19/25, suggesting that candidate generation is effective while automatic ranking remains imperfect. These results suggest that explicitly forecasting task-completed 3D states enables more reliable action evaluation, while the gap between automatic and human-assisted selection indicates that robust final-state ranking remains an important challenge for fully autonomous manipulation. Overall, Forecast-GS provides an interpretable bridge between language understanding, 3D perception, and robotic manipulation planning.

2605.11142 2026-05-13 cs.LG

Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models

Nikolaos Nakis, Panagiotis Promponas, Konstantinos Tsirkas, Katerina Mamali, Eftychia Makri, Leandros Tassiulas, Nicholas A. Christakis

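Affiliations * Human Nature Lab, Yale University; Department of Electrical and Computer Engineering, Yale University; Department of Statistics and Data Science, Yale University; Department of Computer Science, Yale University

AI summary This paper examines the latent dimension in graph representation learning, a conventional hyperparameter, and points out that it does not coincide with the quantity that actually governs model behavior. The authors propose Spectra, a spectral method that replaces rank as the unit of analysis with the spectrum of a learned positive semidefinite kernel and uses the normalized eigenvalues to build a controllable training-time coordinate, so that realized capacity can be steered during training. On several network datasets the method exposes the trade-off between predictive performance and model capacity, giving both theoretical grounding and a practical handle for capacity control in the overparameterized regime.

Comments Preprint

Abstract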

Graph representation learning has become a standard approach for analyzing networked data, with latent embeddings widely used for link prediction, community detection, and related tasks. Yet a basic design choice, the latent dimension, is still treated as a brittle hyperparameter, fixed before training and tuned by held-out performance. Learned factors are also identifiable only up to rotation and rescaling, so the nominal rank rarely coincides with the quantity that governs model behavior. We propose Spectral Prefix Extraction and Capacity-Targeted Representation Analysis (Spectra), which replaces rank as the unit of analysis with the spectrum of a learned positive semidefinite kernel, trace-normalized so that spectra are comparable across fits. The normalized eigenvalues form a distribution on the simplex, and their Shannon effective rank acts both as a summary of learned capacity and as a controllable training-time coordinate: a single scalar shapes this realized dimension during training, and bisection targets any desired value within the rank cap. To theoretically support that, we show local regularity and monotonicity of the realized-dimension profile. Across collaboration, social, biological, and infrastructure networks, Spectra traces performance--capacity frontiers that make the trade-off between predictive accuracy and realized dimension visible. It performs competitively with strong link-prediction baselines, yields aligned lower-capacity views of the same fitted model through spectral prefixes, and provides a principled handle on capacity in the overparameterized regime. Capacity thus becomes a property of the fitted model rather than a hyperparameter of the training.
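The capacity summary named in the abstract, the Shannon effective rank of trace-normalized eigenvalues, can be computed in a few lines. The sketch below uses the standard definition (the exponential of the Shannon entropy of the normalized spectrum) and assumes the eigenvalues of the learned kernel are already available as a list. The function name `shannon_effective_rank` is illustrative, and the training-time control and bisection steps are not shown.

```python
import math
from typing import Sequence


def shannon_effective_rank(eigenvalues: Sequence[float]) -> float:
    """exp(H(p)) where p is the trace-normalized spectrum of a PSD kernel."""
    # Clip tiny negative values that numerical eigensolvers can produce.
    clipped = [max(0.0, ev) for ev in eigenvalues]
    trace = sum(clipped)
    if trace <= 0.0:
        raise ValueError("kernel must have positive trace")
    probs = [ev / trace for ev in clipped]  # a point on the simplex
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


flat = shannon_effective_rank([1.0, 1.0, 1.0, 1.0])    # all directions equally used
spiked = shannon_effective_rank([5.0, 0.0, 0.0, 0.0])  # a single dominant direction
mixed = shannon_effective_rank([3.0, 1.0])             # between one and two
```

A flat spectrum over four directions yields an effective rank of 4 and a single nonzero eigenvalue yields 1, so two fits with the same nominal rank can occupy very different capacity. Because the spectrum is trace-normalized, the value is unchanged when all eigenvalues are rescaled by a common factor.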

2605.11136 2026-05-13 cs.AI

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

Yaolun Zhang, Tianyi Xu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang

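Affiliations * Oregon State University; University of Wisconsin–Madison; Johnson & Johnson; Pennsylvania State University; AG2AI, Inc.

AI summary This paper proposes EVOCHAMBER, a training-free framework for test-time co-evolution of multi-agent systems at the individual, team, and population levels. Its core protocol, CODREAM, is triggered after team failure or disagreement: agents reflect collaboratively, distill insights, and route them asymmetrically from strong to weak agents, preserving specialization while filling knowledge gaps. Experiments on math, code, and multi-domain reasoning tasks show clear gains, and several stable specialist agents emerge spontaneously, a structural signature of multi-agent evolution.

Abstract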

We argue that multi-agent test-time evolution is not single-agent evolution replicated N times. A single-agent learner can only evolve its own context and memory. A multi-agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single-agent counterpart and can produce phenomena such as emergent specialization. Yet prior test-time methods either confine experiences to individual agents, forfeiting cross-agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training-free framework that instantiates test-time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post-task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team-level operators assemble niche-conditioned teams and select collaboration structures online. Population-level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3-8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi-domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross-agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi-agent evolution that no single-agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber

2605.11133 2026-05-13 cs.LG math.DG

Steerable Neural ODEs on Homogeneous Spaces

Emma Andersdotter, Daniel Persson, Fredrik Ohlsson

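Affiliations * Department of Mathematics and Mathematical Statistics, Umeå University; Department of Mathematical Sciences, Chalmers University of Technology and Gothenburg University

AI summary This paper introduces steerable neural ordinary differential equations (Steerable Neural ODEs) on homogeneous spaces $M=G/H$, which build the transformation of feature vectors under the local symmetry group $H$ into the model design. Features are interpreted as sections of associated vector bundles over the homogeneous space and their evolution as parallel transport, yielding a coupled system of differential equations: a flow equation on the space and a steering equation on the features. The models are $G$-equivariant when specific invariance conditions hold, providing a geometric foundation for learning continuous-time equivariant dynamics of general vector-valued features on homogeneous spaces.

Comments 39 pages, 3 figures

Abstract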

We introduce steerable neural ordinary differential equations on homogeneous spaces $M=G/H$. These models constitute a novel geometric extension of manifold neural ordinary differential equations (NODEs) that transport associated feature vectors transforming under the local symmetry group $H$. We interpret features as sections of associated vector bundles over $M$, and describe their evolution as parallel transport. This results in a coupled system of ODEs consisting of a flow equation on $M$ and a steering equation acting on features. We show that steerable NODEs are $G$-equivariant whenever the vector field generating the flow and the connection governing parallel transport are both $G$-invariant. Furthermore, we demonstrate how steerable NODEs incorporate existing NODE models and continuous normalizing flows on Lie groups. Our framework provides the geometric foundation for learning continuous-time equivariant dynamics of general vector-valued features on homogeneous spaces.