2606.03067 2026-06-05 stat.ML cs.LG

Trajectory-Aware Node Contributions and the Limits of Static Controllability

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

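Affiliations: University of California, Berkeley

AI summary: This paper proposes "emergent contribution" (EC), a finite-horizon measure of a node's dynamical leverage computed from the Jacobians of a differentiable model. EC reduces to average controllability in the linear time-invariant limit, and a phase diagram characterizes the conditions under which the two measures agree or diverge.

Comments: 11 pages, 1 figure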

Abstract

A recurring data mining task in complex networks is to determine how individual nodes contribute to system behavior. Existing approaches rely on either static-graph centralities or control-theoretic quantities such as controllability Gramians, which assume linear, time-invariant dynamics. Estimated systems, however, are typically nonlinear and time-varying. We define "emergent contribution (EC)," a finite-horizon measure of a node's dynamical leverage: the metric-weighted energy of its impulse response accumulated along the system trajectory. Computed from the Jacobians of any differentiable model, EC is estimator-agnostic and reduces exactly to average controllability in the linear, time-invariant limit. Our contribution is a characterization of when the two measures agree and when they diverge. Using a controlled synthetic family with known ground-truth contribution, we construct a phase diagram spanning nonlinearity, regime structure, persistence, and perturbation amplitude. EC and average controllability agree under static or smoothly drifting dynamics, and both track ground truth. Divergence emerges under persistent regime switching, is strongest under persistent sign reversal, and disappears when the sign reversal is removed. At extreme perturbation amplitudes, both measures degrade, identifying the limits of local linearization. We place five estimated real systems from several domains within this phase space. Their placement serves as a diagnostic of when EC provides information beyond static controllability and therefore justifies its additional computational cost. On one panel examined in depth, a twenty-seed retraining ensemble reveals a robust variance-leverage dissociation: some nodes' perturbations propagate widely despite low within-system variance, a pattern recovered by neither static centralities nor variance-based summaries.
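The reduction to average controllability described in the abstract can be illustrated numerically. The sketch below is not the authors' implementation. It assumes that EC for node i is the metric-weighted squared norm of a unit impulse at node i, propagated through the Jacobians along the trajectory and summed over the horizon, and that average controllability is the trace of the finite-horizon controllability Gramian for a single-node input. The function names and the `metric` argument are illustrative.

```python
import numpy as np

def emergent_contribution(jacobians, metric=None):
    """Assumed EC: accumulated metric-weighted impulse-response energy per node.

    jacobians: sequence of (n, n) arrays J_0, ..., J_{T-1} along the trajectory.
    metric:    optional (n, n) weighting matrix M (identity if omitted).
    """
    n = jacobians[0].shape[0]
    M = np.eye(n) if metric is None else np.asarray(metric, dtype=float)
    response = np.eye(n)  # column i is the response to a unit impulse at node i
    # Energy at t = 0: diagonal of response^T M response.
    energy = np.einsum("ji,jk,ki->i", response, M, response)
    for J in jacobians:
        response = J @ response  # propagate every impulse one step forward
        energy += np.einsum("ji,jk,ki->i", response, M, response)
    return energy

def average_controllability(A, horizon):
    """Trace of the finite-horizon Gramian for input matrix B = e_i, per node i."""
    n = A.shape[0]
    total = np.zeros(n)
    power = np.eye(n)  # holds A^t
    for _ in range(horizon + 1):
        total += np.sum(power**2, axis=0)  # squared norm of column i of A^t
        power = A @ power
    return total
```

Under these assumptions, passing the same matrix at every step (`[A] * T`) with the identity metric returns the same vector as `average_controllability(A, T)`. The two can differ once the Jacobians change along the trajectory.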

2606.03091 2026-06-05 cs.IR cs.AI

BAHSD: Bridging the Long-tail Gap via Adaptive Distillation in Black-box Sequential Recommendation

Xi Zhou, Famin Wu, Mingming Li, Hongyue Zhang, Jiao Dai, Jizhong Han, Tao Guo

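Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Beijing Institute for General Artificial Intelligence, Beijing, China

AI summary: To address the signal heterogeneity that long-tail distributions induce in black-box sequential recommendation, the paper proposes the BAHSD framework. It uses a multi-scale consistency probing mechanism to quantify signal reliability and an adaptive hierarchical objective (dynamic-temperature KL divergence, ranking consistency, and InfoNCE contrastive learning) to mitigate preference solidification and improve noise robustness, with gains of over 80% on tail users.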

Abstract

Sequential recommendation systems are widely adopted but often deployed as black-box APIs, which has driven recent interest in model extraction to replicate their capabilities locally. However, the long-tail distribution induces severe signal heterogeneity: dense head sequences trigger the solidification of teacher preference, biasing extraction toward local patterns, while sparse tail sequences yield flat, noisy predictions. Existing one-size-fits-all extraction overlooks this disparity, resulting in noise overfitting and suboptimal knowledge transfer. We propose BAHSD, a black-box adaptive distillation framework that handles signal heterogeneity via a multi-scale consistency probing mechanism to implicitly quantify signal reliability. Based on this, an adaptive hierarchical objective is designed: dynamic-temperature KL divergence mitigates preference solidification for high-confidence signals, while ranking consistency and InfoNCE contrastive learning provide noise-robust enhancement for low-confidence signals. BAHSD consistently outperforms baselines, achieving up to 4.98% gain over the teacher and 80%+ improvement on tail users, offering a plug-and-play solution for high-fidelity black-box recommendation extraction.

2606.00804 2026-06-05 cs.MA cs.AI cs.CL

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

Thanh Luong Tuan

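Affiliations: Golden Gate University; Foundation AgenticOS (FAOS)

AI summary: Through large-scale experiments, this paper evaluates whether enterprise multi-agent systems should select coordination strategies dynamically by problem class. It finds that dynamic routing works as a calibrated default but that no single best strategy can be identified.

Comments: 13 pages, 4 appendices. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-1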

Abstract

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
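Kendall's W, the concordance statistic used for H2, measures how consistently a set of raters ranks the same items: 0 means no agreement and 1 means identical rankings. A minimal sketch follows, assuming each row holds one task's quality scores for the four coordination conditions and that there are no tied scores. The data layout and function name are assumptions, not the paper's code.

```python
import numpy as np

def kendalls_w(scores):
    """Kendall's coefficient of concordance for an (m raters, n items) array.

    Assumes no ties within a row; tied scores would need a correction term.
    """
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape
    # Double argsort converts each row of scores into ranks 1..n.
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    # Sum of squared deviations of the rank sums from their mean.
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    return 12.0 * s / (m**2 * (n**3 - n))
```

On this scale, the mean W of 0.20 reported in both strata indicates weak agreement across tasks about how the four conditions rank.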

2605.27991 2026-06-05 stat.ML cs.LG

Gradient-Flow Optimization as Dynamic Random-Effects Inference: Testing and Early Stopping with Applications to Deep Learning

Chinese-listing title: Deep Neural Network Training as Random Effects: An Optimization-Inference Duality

Minhao Yao, Ruoyu Wang, Xihong Lin, Lin Liu, Zhonghua Liu

Affiliations: Centre for Biomedical Data Science, Duke-NUS Medical School, National University of Singapore; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Institute of Natural Sciences, MOE-LSC, School of Mathematical Sciences, CMA-Shanghai, SJTU-Yale Joint Center of Biostatistics and Data Science, Shanghai Jiao Tong University; Department of Biostatistics, Columbia University, New York, NY, USA

AI summary: This paper shows that deep neural network training is equivalent to a classical random-effects model, revealing an optimization-inference duality, and uses restricted maximum likelihood estimation to obtain a likelihood-based early-stopping rule.

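AI-generated abstract

Deep neural networks (DNNs) have achieved remarkable empirical success, yet their training dynamics are understood mainly from the perspective of optimization rather than statistical principles. By proving that the predictions produced by continuous-time neural tangent kernel (NTK) gradient flow are exactly equivalent to those of a classical random-effects model, this paper establishes a statistical framework for DNN training in the overparameterized regime. In this framework, training time acts as a variance component, or equivalently as an empirical-Bayes covariance hyperparameter, that controls how variation is allocated from noise to structured signal. The equivalence reveals an optimization-inference duality: the gradient-flow path is both an optimization trajectory and an empirical-Bayes random-effects inference path. Conditional on training time, the network output is the posterior mean of the latent signal, and estimating training time by restricted maximum likelihood (REML) turns early stopping into likelihood-based empirical-Bayes inference rather than external tuning. This perspective yields a two-stage inferential procedure. First, a variance-component test determines whether DNN training captures statistically significant structure beyond initialization. Second, conditional on training being warranted, REML provides a likelihood-based early-stopping rule. The resulting stopping time has a spectral interpretation in the NTK eigenbasis, in which training continues until the spectral losses are decorrelated. We further prove that REML-guided early stopping attains asymptotically optimal prediction error for in-sample prediction under a fixed design and, under additional random-design regularity conditions, for out-of-sample prediction as well. This work recasts DNN training as statistical inference and provides a principled basis for deciding whether, and for how long, to train a deep neural network.

Abstract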

Gradient-flow optimization is usually viewed as an algorithmic procedure for minimizing empirical loss, with training duration selected by validation or heuristic early-stopping rules. We develop a statistical inference framework for the gradient-flow training trajectory itself. The central object is fixed-operator squared-error gradient flow: whenever the fitted value evolves through a time-invariant positive semidefinite training operator, the trained model output at each training time is exactly equivalent to the best linear unbiased predictor, or empirical-Bayes posterior mean, under a corresponding random-effects model. Under this representation, training time becomes a variance-component parameter governing how variance is reallocated from residual noise to structured signal. This turns two basic training decisions into inferential problems. First, whether training is needed is formulated as a variance-component test for signal beyond initialization. Second, how long to train is formulated as restricted maximum likelihood (REML) estimation of the training-time variance component. The resulting REML-guided early stopping rule has a spectral interpretation: it selects the training time at which optimized spectral losses become empirically decorrelated from the eigenvalues of the training operator, yielding an effective degrees-of-freedom measure for the evolving trained model. We establish asymptotic prediction optimality for fixed-design in-sample risk and, under additional kernel regularity conditions, random-design out-of-sample risk. Deep learning models in fixed-kernel gradient regimes provide canonical modern-AI instantiations of the theory. Numerical experiments and a UK Biobank proteomics application show that the proposed inferential approach attains competitive prediction accuracy while reducing the reliance on validation splits and repeated checkpoint evaluation.
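The equivalence between fixed-operator gradient flow and a random-effects predictor can be checked numerically on a toy operator. The sketch below assumes the flow df/dt = -K(f - y) with a fixed positive semidefinite K, whose solution is f_t = f_0 + (I - e^{-Kt})(y - f_0). It also assumes a random-effects model y - f_0 = u + e with Cov(u) = e^{Kt} - I and Cov(e) = I. This parameterization is one that makes the algebra work and is not necessarily the one used in the paper; the function names are illustrative.

```python
import numpy as np

def gradient_flow_fit(K, y, f0, t):
    """Closed-form solution of df/dt = -K (f - y), f(0) = f0, at time t."""
    evals, evecs = np.linalg.eigh(K)
    shrink = 1.0 - np.exp(-evals * t)  # eigenvalues of I - exp(-K t)
    return f0 + evecs @ (shrink * (evecs.T @ (y - f0)))

def blup_fit(K, y, f0, t):
    """BLUP of f0 + u when y - f0 = u + e, Cov(u) = exp(K t) - I, Cov(e) = I."""
    evals, evecs = np.linalg.eigh(K)
    # expm1 gives exp(lambda t) - 1, the eigenvalues of exp(K t) - I.
    sigma_u = evecs @ np.diag(np.expm1(evals * t)) @ evecs.T
    n = len(y)
    # Standard BLUP: Cov(u) (Cov(u) + Cov(e))^{-1} (y - f0).
    return f0 + sigma_u @ np.linalg.solve(sigma_u + np.eye(n), y - f0)
```

The two functions agree because (e^{Kt} - I) e^{-Kt} = I - e^{-Kt}. In this toy parameterization the training time t enters only through the random-effect covariance, which matches the abstract's description of training time as a variance-component parameter.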

2605.26179 2026-06-05 cond-mat.mtrl-sci cs.AI cs.CE

AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

Penghui Yang, Zhonghan Zhang, Yue Li, Xinrun Wang, Yanchen Deng, Yuhao Lu, Bijun Tang, Zheng Liu, Bo An

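Affiliations: Nanyang Technological University, Singapore; Singapore Management University

AI summary: The paper proposes AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning throughout the DFT calculation lifecycle to adapt autonomously from planning to execution. It reaches a 94.1% task success rate on the VASPBench benchmark and reliably predicts electronic, magnetic, and energetic properties.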

Abstract

Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand-crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre-planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor-recover-reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose-built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task-level success with GPT-5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first-principles results.

2605.29916 2026-06-05 cs.NE cs.AI cs.DS math.OC

Selection Hyper-heuristics Can Automatically Adjust the Learning Period to Optimally Solve Pseudo-Boolean Problems

Benjamin Doerr, Pietro S. Oliveto, John Alasdair Warwicker

Affiliations: Laboratoire d'Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris; Department of Computer Science and Engineering, Southern University of Science and Technology; School of Computing & Communications, Lancaster University Leipzig

AI summary: This paper proposes a hyper-heuristic that sets its learning-period parameter automatically and proves that it selects the optimal neighbourhood size in a $1-o(1)$ fraction of the iterations, thereby optimising the LeadingOnes benchmark in optimal time up to lower-order terms.

Comments: To appear in "Artificial Intelligence"

DOI: 10.1016/j.artint.2026.104560
Journal ref: Artificial Intelligence 357:104560 (2026)

Abstract

The Random Gradient hyper-heuristic was recently shown to be able to learn the optimal neighbourhood size when optimising the LeadingOnes benchmark via the Randomised Local Search (RLS) meta-heuristic. However, for this to happen, a learning period of a certain length $τ$ had to be used, unlike classic hyper-heuristics, which change their behaviour based on the success of only the previous iteration. In this paper, we show how to automatically set this new parameter value, relieving the user from the non-trivial task of controlling this novel algorithm parameter. We prove that the resulting hyper-heuristic selects the optimal neighbourhood size in a $1-o(1)$ fraction of the iterations and, consequently, optimises the LeadingOnes benchmark in the best possible time (apart from lower-order terms) achievable with these neighbourhood sizes.
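For readers unfamiliar with the setting, the following sketch shows a Random Gradient-style hyper-heuristic with a fixed learning period τ choosing between RLS variants that flip 1 or 2 bits on LeadingOnes. It does not implement the automatic adjustment of τ that the paper contributes, since the abstract does not give its details. The function names, the strict-improvement acceptance rule, and the iteration cap are assumptions.

```python
import random

def leading_ones(bits):
    """Number of consecutive 1-bits at the start of the bit string."""
    count = 0
    for b in bits:
        if b != 1:
            break
        count += 1
    return count

def rls_step(bits, k, rng):
    """Return a copy of bits with k distinct, uniformly chosen positions flipped."""
    child = list(bits)
    for i in rng.sample(range(len(bits)), k):
        child[i] ^= 1
    return child

def random_gradient_hh(n, tau, neighbourhoods=(1, 2), max_iters=100_000, seed=0):
    """Fixed-tau sketch: keep a neighbourhood size while it succeeds within tau steps."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = leading_ones(x)
    iters = 0
    while fx < n and iters < max_iters:
        k = rng.choice(neighbourhoods)  # pick a low-level heuristic at random
        remaining = tau                 # start a learning period of length tau
        while remaining > 0 and fx < n and iters < max_iters:
            y = rls_step(x, k, rng)
            fy = leading_ones(y)
            iters += 1
            remaining -= 1
            if fy > fx:
                x, fx = y, fy
                remaining = tau         # success: fresh period, same heuristic
    return x, fx, iters
```

A call such as `random_gradient_hh(16, tau=50)` returns the final bit string, its fitness, and the number of fitness evaluations used. In this sketch τ is a user-supplied constant, which is exactly the choice the paper aims to automate.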

2605.29054 2026-06-05 cs.SE cs.CL

Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence

Linxin Song, Jiefeng Chen, Yue Huang, Bhavana Dalvi Mishra, Chi Wang, Jieyu Zhao, Jinsung Yoon, Tomas Pfister

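Affiliations: University of Southern California; Google Cloud AI Research; University of Notre Dame; Google DeepMind

AI summary: To address agents that over-trust local validation during codebase conversion and thereby violate semantic contracts, the paper proposes the T2J-Bench benchmark, which evaluates conversion quality under a fixed equivalence contract with three-stage verification (Spec, Numeric, Behavioral). The best system passes only 26.7-28.9% of attempts, and all systems overestimate their success rate by 66.6-97.8 points.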

Abstract

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark for codebase conversion that reformulates conversion as transfer under a fixed equivalence contract. A fixed verifier then compares source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only 26.7-28.9% overall pass rate despite Spec pass rates up to 91.1%; a 4.7x token-budget spread yields only a 2.2x pass-rate spread; and all systems overestimate success by 66.6-97.8 points relative to the fixed evaluator. This suggests that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.

2605.23809 2026-06-05 eess.SY cs.LG cs.SY

Advanced AI Service Provisioning in O-RAN through LLM Engine Integration

Seyed Bagher Hashemi Natanzi, Pranshav Gajjar, Bo Tang, Vijay K. Shah

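Affiliations: Department of Electrical and Computer Engineering, Worcester Polytechnic Institute; Department of Electrical and Computer Engineering, North Carolina State University

AI summary: The paper proposes a dual-brain architecture that combines the reasoning ability of an LLM with the real-time performance of a lightweight ML engine to automate the deployment and configuration of AI services in O-RAN.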

2604.15524 2026-06-05 eess.SY cs.RO cs.SY

Safe and Energy-Aware Multi-Robot Density Control via PDE-Constrained Optimization for Long-Duration Autonomy

Longchen Niu, Andrew Nasif, Gennaro Notomista

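Affiliations: Department of Electrical and Computer Engineering, University of Waterloo

AI summary: The paper proposes a density-control framework that combines the Fokker-Planck partial differential equation with control Lyapunov and control barrier functions to achieve target-density tracking, obstacle avoidance, and energy sustainability for multi-robot systems.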

2605.21557 2026-06-05 stat.ML cs.AI cs.LG

Evaluating Emergent Coordination in Large-Scale LLM Populations: An Evaluation Framework on the MoltBook Archive

Brandon Yee, Pairie Koh

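Affiliations: Management Lab, Yee Collins Research Group

AI summary: This paper proposes an evaluation framework for assessing emergent coordination (role specialization, information diffusion, and cooperative task resolution) in open agent environments. It demonstrates the framework on the MoltBook archive dataset and establishes quantitative baselines that reveal a core-periphery structure, heavy-tailed cascade distributions, and severe coordination overhead in decentralized task resolution.

Comments: Substantial Revision Required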

Abstract

As multi-agent Large Language Model (LLM) systems scale, evaluating their emergent coordination dynamics becomes increasingly critical. However, current evaluation paradigms, which focus on single agents or small, explicitly structured groups, fail to capture the self-organization and viral information dynamics that arise in large, decentralized populations. We introduce a systematic evaluation framework to benchmark role specialization, information diffusion, and cooperative task resolution in open agent environments. We demonstrate this framework on the MoltBook Observatory Archive, a dataset of 2.73M interactions among 90,704 autonomous agents, establishing quantitative baselines for emergent coordination. Our evaluation reveals a pronounced core-periphery structure (silhouette 0.91), heavy-tailed cascade distributions ($α = 2.57$), and severe coordination overhead in decentralized task resolution (Cohen's $d = -0.88$ against a single-agent baseline). By providing standardized evaluation tasks and empirical baselines, our framework enables the rigorous comparison of future multi-agent protocols and establishes evaluation itself as an object of scientific study.
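The reported effect size can be read against the standard pooled-standard-deviation form of Cohen's d, where a negative value means the first group scores lower on average than the second. A minimal sketch follows. Whether the paper uses this pooled variant is an assumption, and the function name is illustrative.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (sample variances, ddof=1)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```

Under this definition, d = -0.88 would place the decentralized condition's mean almost nine tenths of a pooled standard deviation below the single-agent baseline.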

2503.17181 2026-06-05 cs.SE cs.AI

A Study of LLMs' Preferences for Libraries and Programming Languages

Lukas Twist, Mark Harman, Don Syme, Joost Noppen, Helen Yannakoudakis, Detlef Nauck, Jie M. Zhang

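Affiliations: King's College London; University College London; GitHub Next; Digital AI Research, BT Group

AI summary: This study examines which libraries and programming languages large language models prefer when generating code. An empirical analysis of eight LLMs finds that the models favor widely adopted libraries such as NumPy, in some cases unnecessarily, and prefer Python even on some high-performance project initialisation tasks for which it is not the best choice.

Comments: 21 pages, 10 tables, 3 figures. Accepted to Findings of ACL 2026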

Abstract

Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality, underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.

2603.28257 2026-06-05 q-fin.ST cs.LG

Nonlinear Factor Decomposition via Kolmogorov-Arnold Networks: A Spectral Approach to Asset Return Analysis

David Breazu

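Affiliations: Faculty of Mathematics and Computer Science, University of Bucharest

AI summary: This paper proposes KAN-PCA, an autoencoder with a KAN encoder and a linear decoder that replaces linear projections with learned B-spline functions on each edge in order to capture more variance than classical PCA. Experiments on 20 S&P 500 stocks show a higher reconstruction R² for KAN-PCA, and after a data-leakage correction its out-of-sample results are in line with those of PCA.

Comments: 12 pages, 2 figures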

2505.11006 2026-06-05 stat.ML cs.LG

Is Supervised Learning Really That Different from Unsupervised?

Oskar Allerbo, Thomas B. Schön

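Affiliations: KTH Royal Institute of Technology; Uppsala University

AI summary: By decomposing supervised learning into a two-stage procedure, the study shows that selecting model parameters without access to label data and then adding the outputs can match the performance of conventional supervised learning, suggesting that the distinction between supervised and unsupervised learning may be less fundamental than it appears.

Comments: Paper accepted at AISTATS 2026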

2603.17925 2026-06-05 stat.ME cs.LG math.ST stat.TH

Multi-Armed Sequential Hypothesis Testing by Betting

Ricardo J. Sandoval, Ian Waudby-Smith, Michael I. Jordan

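Affiliations: University of California, Berkeley; École Normale Supérieure & Inria Paris

AI summary: This paper studies multi-armed sequential testing by betting, a variant in which the statistician obtains data by choosing among several data sources (arms), with the aim of rejecting a global null hypothesis P (all arms are null in a certain sense) in favor of a composite alternative Q (at least one arm is non-null). Generalizing the notions of log-optimality and expected-rejection-time optimality yields matching lower and upper bounds, and a modified upper-confidence-bound algorithm handles rewards that are unobservable but sufficiently estimable.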

Abstract

We consider a variant of sequential testing by betting where, at each time step, the statistician is presented with multiple data sources (arms) and obtains data by choosing one of the arms. We consider the composite global null hypothesis $\mathscr{P}$ that all arms are null in a certain sense (e.g. all dosages of a treatment are ineffective) and we are interested in rejecting $\mathscr{P}$ in favor of a composite alternative $\mathscr{Q}$ where at least one arm is non-null (e.g. there exists an effective treatment dosage). We posit an optimality desideratum that we describe informally as follows: even if several arms are non-null, we seek $e$-processes and sequential tests whose performance is as strong as that of procedures with oracle knowledge about which arm generates the most evidence against $\mathscr{P}$. Formally, we generalize notions of log-optimality and expected rejection time optimality to more than one arm, obtaining matching lower and upper bounds for both. A key technical device in this optimality analysis is a modified upper-confidence-bound-like algorithm for unobservable but sufficiently "estimable" rewards. In the design of this algorithm, we derive nonasymptotic concentration inequalities for optimal wealth growth rates in the sense of Kelly [1956]. These may be of independent interest.
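As background for the betting framework, here is a single-arm sketch, not the paper's multi-armed procedure. For observations in [0, 1] and the null hypothesis that the mean is at most m, the wealth process W_t = prod_{s<=t} (1 + λ(x_s - m)) with a fixed bet λ in [0, 1/m] is a nonnegative supermartingale under the null, so rejecting once W_t reaches 1/α controls the type-I error at level α by Ville's inequality. The fixed bet and the function name are assumptions for illustration.

```python
def betting_test(observations, null_mean, bet, alpha=0.05):
    """Sequential test by betting for H0: mean <= null_mean, data in [0, 1].

    Returns (stopping_time, wealth); stopping_time is None if H0 is never rejected.
    """
    if not 0.0 <= bet <= 1.0 / null_mean:
        raise ValueError("bet must lie in [0, 1/null_mean] to keep wealth nonnegative")
    wealth = 1.0
    for t, x in enumerate(observations, start=1):
        # Each round multiplies wealth by a payoff with expectation <= 1 under H0.
        wealth *= 1.0 + bet * (x - null_mean)
        if wealth >= 1.0 / alpha:
            return t, wealth  # evidence threshold reached: reject H0
    return None, wealth
```

In the multi-armed setting of the paper, the statistician must additionally decide which arm to sample at each step. That arm-selection rule and the choice of bets are where the paper's optimality analysis applies, and this sketch does not attempt to reproduce them.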
