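arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.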

2606.10673 2026-06-10 stat.OT cs.LG New submission

ClusBench: The Clustering Benchmark Data Resource You've All Been Waiting For (?)

David P. Hofmeyr

Affiliation: School of Mathematical Sciences, Lancaster University

AI summary: By fitting flexible nonparametric distributions, this paper generates nearly 3,000 synthetic datasets from more than 200 public datasets for large-scale evaluation of clustering methods, preserving the nuances of real data.

2606.10454 2026-06-10 eess.AS cs.SD New submission

Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR

Mohan Shi, Kaiyuan Zhang, Zilai Wang, Natarajan Balaji Shankar, Eray Eren, Abeer Alwan

Affiliation: University of California, Los Angeles, USA

AI summary: Proposes a mixture-of-experts Speech-LLM that combines classifier-based domain routing, a mixture of projectors, a mixture of LoRA modules, and an entropy-aware routing mechanism to achieve unified child-adult ASR across environments and age groups, with consistent improvements on public child speech corpora.

Comments: Accepted to Interspeech 2026

Abstract

While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
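The entropy-aware routing idea in the abstract can be pictured with a small sketch. It assumes the router produces one logit per domain expert, measures uncertainty as normalized Shannon entropy, and blends in the shared expert only above a threshold. The function names, the threshold, and the blending rule are all hypothetical illustrations, not the authors' EAR mechanism.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the value lies in [0, 1] (needs K >= 2)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def route(router_logits, domain_outputs, shared_output, threshold=0.5):
    """Blend in the shared expert only when the router is uncertain.

    Hypothetical rule: below `threshold` the top domain expert is used alone;
    otherwise the shared expert is mixed in with weight equal to the entropy.
    """
    probs = softmax(router_logits)
    h = normalized_entropy(probs)
    top = max(range(len(probs)), key=probs.__getitem__)
    if h < threshold:
        return domain_outputs[top], h
    mixed = [(1.0 - h) * d + h * s
             for d, s in zip(domain_outputs[top], shared_output)]
    return mixed, h
```

A confident router leaves the domain expert's output untouched, while a router that is undecided near a domain boundary leans on the shared expert.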

2606.10361 2026-06-10 stat.ML cs.LG New submission

Near-Exponential Convergence Rates for kNN Classification based on Boltzmann Margin

Luyuan Yang, Shayan Shafaei, Chao Lan

Affiliation: School of Computer Science, University of Oklahoma

AI summary: Proposes the Boltzmann margin condition, which lies between the Tsybakov and Massart margin conditions, and proves for the first time that the kNN classifier can achieve near-exponential convergence rates.

Comments: Conference on Uncertainty in Artificial Intelligence (UAI)

2606.10317 2026-06-10 eess.AS cs.SD New submission

SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space

Tomoya Tanabu, Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu

Affiliation: The University of Tokyo, Japan

AI summary: Proposes SSL-GMMVC, which models source-target features with a Gaussian mixture model in a self-supervised speech representation space and performs interpretable voice conversion through posterior-weighted affine transforms, improving speaker similarity while preserving intelligibility and naturalness.

Comments: Accepted to Interspeech 2026

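2606.10238 2026-06-10 q-bio.NC cs.AI New submission

Hyperbolic Neural Population Geometry Benefits Computation

Dennis Wu, Yi-Chun Hung, Braden Yuille, James E. Fitzgerald, Han Liu

Affiliation: University of California, Berkeley

AI summary: Presents a theoretical framework in which hippocampal population activity induces hyperbolic geometry, proves that the modern Hopfield network update rule computes the minimum mean squared error estimate, and introduces a new associative memory model in hyperbolic space whose capacity significantly exceeds that of existing models.

Comments: Accepted at ICML 2026, 37 pages, 5 figures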

2606.10233 2026-06-10 eess.AS cs.LG cs.SD New submission

ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling

Zhuoyan Tao, Jiatong Shi, Hye-jin Shim, Shinji Watanabe

Affiliations: University of Southern California, USA; Carnegie Mellon University, USA

AI summary: Proposes ANCHOR, which recasts incremental speech quality assessment as a multi-resolution autoregressive task. Dual-resolution tokens and a resolution-aware hierarchy enable coarse-to-fine refinement from chunks to the full utterance, significantly reducing error on partial inputs and revealing how perceived quality accumulates over time.

Comments: Accepted at Interspeech 2026

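2606.10187 2026-06-10 stat.ML cs.LG New submission

Decision-Calibrated Conformal Uncertainty for Pacing Decisions in Streaming Advertising

Prashant Shekhar, Caroline Howard

Affiliation: Department of Mathematics, Embry-Riddle Aeronautical University

AI summary: Proposes a decision-calibrated conformal framework that calibrates uncertainty by the largest impact of forecast error on the policies that could actually be deployed. The paper proves that this score is the smallest valid uncertainty measure protecting all deployable pacing policies, and it substantially reduces the uncertainty radius on public datasets.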

Abstract

We develop a decision-calibrated conformal framework for pacing decisions in streaming advertising. Pacing depends on uncertain future inventory, demand pressure, incremental response, and member-experience load. Instead of calibrating a generic forecast residual, the framework measures forecast error by its largest impact on the policies that could actually be deployed. The main theorem shows that the proposed score is the smallest valid uncertainty measure that uniformly protects all deployable pacing policies. Geometrically, it is the support function of the signed policy sensitivity set. Split conformal calibration gives finite-sample coverage for this score. A high-dimensional separation theorem shows that traditional residual calibration can be arbitrarily more conservative by paying for nuisance inventory dimensions, and a robust pacing result combines inventory, response, and experience uncertainty. On public-data-calibrated pacing replays built from Criteo Uplift and KuaiRand datasets, traditional conformal pacing remains unresolved with high residual radii of 7236.7 on Criteo and 4629.4 on KuaiRand. With the proposed decision calibration approach, the uncertainty radii are reduced to 18.4 and 278.6 respectively, with separate margins for value, delivery, budget, and member load. On Criteo, the proposed method certifies a less aggressive pacing policy than the point-forecast baseline, and reduces held-out any-violation rate from 16.7% to 3.3%, with zero budget and member-load violations. On KuaiRand, the choice remains unresolved. In a nutshell, the paper establishes that forecasts, response estimates, and member-experience models should be judged by whether they shrink the uncertainty that the pacing decision uses, as this leads to confident decisions that are not overly conservative.
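The contrast between residual calibration and decision calibration can be sketched in a few lines. The sketch assumes a finite list of hypothetical policy sensitivity vectors, takes the score to be the support function of the signed set {+v, -v}, and applies the standard split-conformal quantile. The names and the finite-set simplification are illustrative assumptions, not the paper's construction.

```python
import math


def decision_score(error, sensitivities):
    """Support function of the signed set {+v, -v}: the max over v of |v . error|."""
    return max(
        abs(sum(v_i * e_i for v_i, e_i in zip(v, error)))
        for v in sensitivities
    )


def split_conformal_radius(cal_scores, alpha):
    """Standard split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]
```

If every deployable policy depends only on the first inventory dimension, say `sensitivities = [[1.0, 0.0, 0.0]]`, then large forecast errors in the other two dimensions leave the score unchanged. A residual norm over all three dimensions would still pay for them, which is the nuisance-dimension effect the abstract describes.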

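2606.10125 2026-06-10 stat.ML cs.DB cs.LG New submission

Robust Active Learning for Few-Shot Example Selection in Text-to-SQL

Arash Pourhabib

Affiliation: NVIDIA

AI summary: For few-shot example selection in text-to-SQL, proposes a robust active learning method in which a stratified greedy algorithm maximizes a heteroscedastic mutual information objective, giving a constant-factor approximation guarantee on the embedding manifold and significantly reducing annotation cost.

Comments: 31 pages, 4 figures, 5 tables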

Abstract

Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning frameworks, our setting introduces three critical challenges: varying, query-dependent annotation reliability (heteroscedasticity), strict requirements for spatial diversity across semantic topics (partition matroid constraints), and the inherent reality that the true covariance structure of the embedding space is unknown (misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, yielding a theoretical constant-factor approximation guarantee. We establish a spectral bound demonstrating that this approximation guarantee degrades gracefully, rather than catastrophically, when the assumed surrogate kernel diverges from the true underlying data-generating process. Empirical results demonstrate that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.

2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD New submission

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

Affiliations: E.SUN Financial Holding Co., Ltd.; United Link Co., Ltd.; Institute of Information Science, Academia Sinica; Department of Computer Science and Information Engineering, National Taiwan Normal University

AI summary: Proposes DeRA-MOS, a decoupled optimization framework that uses a batch-aware listwise ranking loss and a score-anchored modality alignment loss to optimize ranking metrics for music impression and text alignment, respectively, significantly improving evaluation performance on MusicEval.

Comments: Accepted to IEEE Signal Processing Letters (SPL)

Abstract

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
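To make the difference between point-wise and listwise objectives concrete, here is a minimal sketch of a ListNet-style top-one loss computed over one mini-batch. ListNet is swapped in purely for illustration; the abstract does not say that DeRA-MOS uses this particular loss, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def listnet_loss(pred_scores, mos_scores):
    """Cross-entropy between the top-one distributions of one mini-batch.

    Both score lists are turned into distributions over the batch, so the
    loss depends on relative order within the batch, not on absolute values.
    """
    target = softmax(mos_scores)
    pred = softmax(pred_scores)
    return -sum(t * math.log(p) for t, p in zip(target, pred))
```

Adding a constant to every predicted score leaves this loss unchanged, whereas a point-wise regression loss would penalize the shift. That is one sense in which a listwise objective sits closer to rank-based metrics such as SRCC.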

2606.09953 2026-06-10 eess.IV cs.AI cs.LG New submission

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

Luis Cortés Ferre, Miguel A. Gutiérrez-Naranjo, Marcin Balcerzyk

Affiliations: Department of Computer Science and Artificial Intelligence, University of Seville; Bioaraba Health Research Institute; IKERBASQUE, Basque Foundation of Science

AI summary: Proposes a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing while denoising implicitly, and outperforming classical interpolation and video frame interpolation methods on structural measures.

Abstract

Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.
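For orientation, the simplest classical baseline of this kind is linear interpolation at the midpoint between two neighboring slices. The sketch below assumes slices are nested lists of floats and that linear interpolation is among the baselines, which the abstract does not state; the function names are hypothetical.

```python
def midpoint_slice(slice_a, slice_b):
    """Linear interpolation at the halfway position between two slices."""
    return [
        [0.5 * (a + b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(slice_a, slice_b)
    ]


def interleave(volume):
    """Insert one synthesized slice between each neighboring pair.

    A volume with n >= 1 slices becomes one with 2n - 1 slices, which
    halves the effective through-plane spacing.
    """
    out = []
    for a, b in zip(volume, volume[1:]):
        out.append(a)
        out.append(midpoint_slice(a, b))
    out.append(volume[-1])
    return out
```

Averaging two slices whose noise is independent, zero-mean, and of equal variance halves the noise variance of the result, which is one elementary reason an interpolated slice can look cleaner than its neighbors. Whether this matches the paper's theoretical analysis of implicit denoising is not stated in the abstract.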

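2606.09944 2026-06-10 econ.GN cs.AI q-fin.EC New submission

GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring

Sivasathivel Kandasamy

Affiliation: Independent Researcher

AI summary: Proposes the GAGI index, which adjusts GDP per capita by the Gini coefficient and the price level to monitor distributional effects on welfare; applied to the G7 countries, it shows welfare growth diverging persistently from GDP growth.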

Abstract

GDP per capita is the default lens through which governing bodies track economic prosperity and the consequences of economic events, yet it is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact. Inequality-adjusted income measures are themselves not new, but what is missing from the macroeconomic monitoring toolkit specifically is not a welfare concept but an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change (the quantity a regulator needs to act on) is legible. We assemble such an instrument, the Gini-Adjusted GDP per Capita Index (GAGI): a reproducible, publicly computable formulation that rescales each country's GDP per capita by its inequality-adjustment factor (1-G) and its price level, normalised to a 2010 baseline. GAGI is a general-purpose welfare index, not inherently specific to AI automation, applicable wherever welfare-adjusted prosperity needs tracking. Applying GAGI to the G7 economies over 2010-2026, we show that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth, and that the divergence widens sharply after 2022, temporally coincident with, though not on this evidence alone demonstrated to be caused by, the aftereffects of COVID and the acceleration of generative-AI deployment. We argue that GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.
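One plausible reading of the construction is sketched below: GDP per capita is multiplied by (1 - G), divided by a price-level index, and expressed relative to the 2010 value. Dividing by the price level, scaling the base year to 100, and the function names are assumptions made for illustration; the abstract does not give the exact formula.

```python
def welfare_level(gdp_per_capita, gini, price_level):
    """GDP per capita scaled by (1 - G) and deflated by the price level (assumed form)."""
    return gdp_per_capita * (1.0 - gini) / price_level


def gagi(series, base_year=2010):
    """Map year -> index value, with the base year set to 100.

    `series` maps year -> (gdp_per_capita, gini, price_level).
    """
    base = welfare_level(*series[base_year])
    return {
        year: 100.0 * welfare_level(*values) / base
        for year, values in series.items()
    }
```

With made-up inputs such as `{2010: (40000.0, 0.30, 1.00), 2020: (50000.0, 0.35, 1.20)}`, headline GDP per capita rises by 25 percent while the index falls below 100, because higher inequality and higher prices more than offset the growth. This is the kind of divergence the abstract reports, shown here with purely illustrative numbers.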

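2606.09941 2026-06-10 stat.AP cs.LG stat.OT New submission

Stochastic weather generators for high-frequency wind vector time series

Mingshi Cui, Kevin Eng, Justin T. Greene, Zern Ke, Abolfazl Sodagartojgi, Zhiqiu Xia, Gemma E. Moran, Michael L. Stein

Affiliation: Department of Statistics, Rutgers University

AI summary: For minute-scale wind vector time series, develops machine learning models based on time vector-quantized variational autoencoders that generate realistic sequences, capturing diurnal variation but falling short in matching the distribution of extreme wind speeds.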

Abstract

Surface winds can vary substantially from one minute to the next, so there is scope for studying their variation on this fine time scale. Restricting to the month of June to minimize seasonality, this work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high quality measurements at the minute time scale. Such a generator could be used as an input into models from a range of disciplines, notably for wind energy, but also wildfire spread and aviation, among others. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day's worth of data at a time and generating a day of wind vectors conditional on the previous day's winds. We also study methods for incorporating a discrete weather state variable in the generator. We evaluate the generators using a wide range of formal and informal methods. The best of these generators can capture many but not all of the complex features present in the observational data. In particular, the best of our approaches accurately mimic diurnal changes in wind volatility but struggle to match the observed distribution of extreme wind speeds.

2606.09893 2026-06-10 eess.IV cs.AI cs.LG New submission

Tractogram foundation model

Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

Affiliations: The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou; Department of Radiology, Brigham and Women’s Hospital, Mass General Brigham, Boston; Harvard Medical School, Boston; Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu; Psychiatry Neuroimaging Laboratory, Brigham and Women’s Hospital, Mass General Brigham, Boston; Department of Psychiatry, Center for Morphometric Analysis, Massachusetts General Hospital, Boston

AI summary: Proposes TractFM, a foundation model that learns reusable representations directly from whole-brain streamline sets. It combines a local streamline encoder with a permutation-equivariant tractogram encoder and is pretrained on dense anatomical tract parcellation, transferring to both streamline-level and subject-level tasks.

Abstract

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.

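2606.11169 2026-06-10 cs.DC cs.AI New submission

Piper: A Programmable Distributed Training System

Megan Frisella, Shubham Tiwari, Andy Ruan, Yi Pan, Parker Gustafson, Mat Jacob, Gilbert Bernstein, Stephanie Wang

Affiliations: University of Washington; University of Washington and Shanghai Jiao Tong University

AI summary: Presents Piper, a system that decouples the training strategy from the runtime implementation: users declare a distributed training strategy with a small set of annotations and scheduling directives, which Piper compiles into per-device execution plans. It supports common strategies and enables joint scheduling optimizations for composed strategies.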

Abstract

Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy then implement the corresponding low-level execution strategy, making it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible but their implementations are still tied to a fixed set of common parallelism strategies, making it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe.

2606.11117 2026-06-10 cs.AR cs.AI cs.PF New submission

Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA

Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano

Affiliation: School of Computing Science, University of Glasgow, Scotland, UK

AI summary: Proposes SECDA-DSE, a framework that integrates large language models to guide design space exploration for FPGA accelerators, combining a structured explorer with LLM reasoning to generate synthesizable accelerator designs with less manual intervention.

Comments: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Abstract

Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the process very time consuming. While existing methodologies such as SECDA enable rapid hardware-software co-design through SystemC simulation and FPGA execution, identifying efficient accelerator configurations remains a largely manual process requiring extensive domain knowledge. SECDA-DSE is a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem to guide design space exploration (DSE) of FPGA-based accelerators. It combines a structured DSE Explorer for generating candidate architectures with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, coupled with a feedback loop for iterative and reinforced refinement. Building on our previous work introducing SECDA-DSE, this paper extends its evaluation by generating three accelerator designs, including element-wise vector multiplication, 2D convolution, and matrix transpose, and performing end-to-end execution on FPGA hardware. The results show that SECDA-DSE can generate SECDA-compliant accelerator designs that are successfully synthesized and executed on FPGA hardware. Furthermore, the generated designs capture kernel-specific trade-offs between compute parallelism and data movement, highlighting the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.

2606.11116 2026-06-10 cs.CY cs.AI cs.HC New submission

Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News

Pooja Prajod

Affiliation: Centrum Wiskunde & Informatica

AI summary: Finds that detailed disclosures trigger a transparency dilemma that lowers trust, while brief disclosures leave an information gap. Readers prefer designs centered on user agency (such as detail-on-demand and AI-ratio visualizations), and the paper calls on the HCI community to redesign disclosure mechanisms.

Comments: Accepted to CHIWORK Workshop (Interrogating GenAI Augmentation for CHIworkers: Strategies for Professional Autonomy and Accountability)

Abstract

As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers shows that detailed disclosures trigger a transparency dilemma, reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency; they proposed disclosure designs centered on user agency: detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.

2606.11098 2026-06-10 cs.CR cs.LG New submission

Do Transformers Actually Help Intrusion Detection? A Temporal Sequence Evaluation on CIC-IDS2017

Zach Moczkodan, Hany Ragab

Affiliation: Department of Electrical and Computer Engineering, Faculty of Engineering, Royal Military College of Canada (RMC)

AI summary: Reformulates CIC-IDS2017 as a temporal-sequence intrusion detection task and finds that padding convention, not architecture, determines Transformer performance, and that random splits and padding choices can overestimate model robustness.

Comments: 11 pages, 9 figures, 9 tables. Preprint. Code: https://github.com/zachmocz/temporal-ids-bench

Abstract

Recent deep learning approaches for network intrusion detection increasingly incorporate temporal architectures such as recurrent networks and Transformers, often reporting near-perfect performance on CIC-IDS2017. However, many existing studies neither supply their temporal modules with genuine sequence inputs nor evaluate under realistic, leakage-free conditions, making it unclear whether reported gains arise from true sequence-modeling capability. In this work, we reformulate CIC-IDS2017 as a temporal intrusion-detection task by constructing ordered flow sequences from network conversations and benchmarking nine classical and deep learning architectures under a random split, two leakage-free splits, and a padding-scheme ablation. The central finding is that padding convention, not architecture, determines the Transformer's performance: on genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model in the experiment (0.89); under zero-pad+mask evaluation it drops markedly (-0.24 macro-F1), while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer's false-alarm rate grows from 0.04% to 2.7%, a 67-fold increase invisible under conventional protocols. These findings demonstrate that evaluation methodology -- specifically padding convention and split protocol -- has a larger effect on reported performance than architectural choice, and that widely used random splits with repeat-last padding can overestimate model robustness by up to 0.24 macro-F1. We advocate leakage-free splits, explicit padding disclosure, and sequence-aware benchmarking as standard practice in future IDS research. Code and implementation details are available at https://github.com/zachmocz/temporal-ids-bench.
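The two padding conventions contrasted in the abstract can be illustrated with a small helper. It assumes each flow is a fixed-length list of numeric features and that short conversations are padded at the end; the function name and return shape are hypothetical and are not taken from the linked repository.

```python
def pad_window(flows, length, scheme):
    """Pad a short sequence of flow feature vectors to `length`.

    Returns (padded, mask), where mask is 1 for real flows and 0 for padding.
    """
    if not flows:
        raise ValueError("need at least one flow")
    flows = flows[:length]
    n_pad = length - len(flows)
    if scheme == "zero":
        filler = [0.0] * len(flows[0])  # zero-pad; the mask marks these rows
    elif scheme == "repeat_last":
        filler = flows[-1]  # padding rows copy the final real flow
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    padded = [list(f) for f in flows] + [list(filler) for _ in range(n_pad)]
    mask = [1] * len(flows) + [0] * n_pad
    return padded, mask
```

Under `repeat_last`, every padded position still carries the features of a real flow, so a model can score well without using any ordering information. Under `zero` with the mask applied, only the genuine flows contribute. Reporting which convention was used is the kind of explicit padding disclosure the authors advocate.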

2606.11023 2026-06-10 cs.IR cs.CL cs.LG New submission

Generative Archetype-Grounded Item Representations for Sequential Recommendation

Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen, Yankai Chen, Wenhao Yu, Jianting Chen, Irwin King

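Affiliations: The Chinese University of Hong Kong; McGill University; Tongji University

AI summary: Proposes the GenAIR framework, which uses a large language model to generate item archetype descriptions and extract their embeddings, and adds a behavioral calibration objective to bridge the gap between semantics and behavior, significantly improving sequential recommendation performance.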

Comments Accepted by WWW 2026 (Oral)

DOI: 10.1145/3774904.3792587

Abstract

Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models (LLMs) can provide rich semantic representations, existing approaches only rely on static encoding of fixed attributes, overlooking the crucial role of target audiences in defining item identity. Moreover, the semantic space struggles to reflect actual user behavior, resulting in a significant gap between semantic representations and behavioral patterns. To address these limitations, we propose GenAIR, a general framework that empowers sequential recommendation with Generative Archetype-grounded Item Representations. Specifically, we first leverage an LLM to analyze item metadata and infer textual description of the Archetype, which represents the conceptual profile of the item's ideal target audience. We then extract the corresponding embeddings in a single forward pass. Further, to ground these generative archetypes in real-world behavior, we introduce a behavioral calibration objective, which explicitly incorporates behavioral signals from actual interactions. This objective adjusts the structure of the embedding space to reflect empirical patterns. GenAIR enables seamless integration with most existing models while maintaining high efficiency. Comprehensive experiments conducted on three real-world datasets demonstrate that GenAIR significantly improves the performance of various sequential recommendation models and consistently outperforms state-of-the-art baseline approaches. Implementation codes are available at https://github.com/AI-Santiago/GenAIR.

2606.11007 2026-06-10 cs.CR cs.AI cs.SE New submission

Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

Junchang Zheng, Junfeng Tan, Jialiang Lin

Affiliations: School of Computer Science and Engineering, Guangzhou Institute of Science and Technology, Guangzhou, China; Science and Education Evaluation Lab, Guangzhou Institute of Science and Technology, Guangzhou, China

AI summary: For non-technical users, identifies seven core risk categories of OpenClaw, explains them in plain language, provides actionable defensive strategies, and develops a Skill that automates security configuration, lowering the barrier to use.

Comments Work in progress

Abstract

OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.

2606.10942 2026-06-10 cs.NI cs.AI cs.LG New submission

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

Kiarash Rezaei, Omran Ayoub, Sebastian Troia, Francesco Lelli, Paolo Monti, Carlos Natalino

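Affiliations: Swedish Innovation Agency; Swiss Innovation Agency

AI summary: Proposes a framework that uses a large language model and mutual feature interaction data to generate natural-language explanations; in an optical quality-of-transmission estimation use case, it improves explanation usefulness and scope over the baseline by 12.2% and 6.2%, respectively, with 97.5% correctness.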

Comments 7 pages, with one page for appendix. Accepted for publication at the 2025 21st International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)

Journal ref Proc. WiMob, Marrakesh, Morocco, 2025

DOI: 10.1109/WiMob66857.2025.11257542

Abstract

As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.
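The idea of a structured prompt enriched with interaction data can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_explanation_prompt`, the prompt wording, the feature names, and all numeric values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def build_explanation_prompt(
    prediction: str,
    shap_values: Dict[str, float],
    interactions: Dict[Tuple[str, str], float],
    top_k: int = 3,
) -> str:
    """Assemble a structured prompt from per-feature and pairwise influence values."""
    # Rank single features and feature pairs by absolute influence.
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    ranked_pairs = sorted(interactions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]

    lines: List[str] = [f"Model prediction: {prediction}", "Feature influence (SHAP):"]
    lines += [f"- {name}: {value:+.3f}" for name, value in ranked]
    lines.append("Mutual feature interactions:")
    lines += [f"- {a} x {b}: {value:+.3f}" for (a, b), value in ranked_pairs]
    lines.append("Explain this prediction in plain language for a network operator.")
    return "\n".join(lines)


# Hypothetical values for an optical QoT estimate.
prompt = build_explanation_prompt(
    "QoT below threshold",
    {"path_length": -0.42, "num_spans": 0.10, "launch_power": 0.25},
    {("path_length", "launch_power"): 0.08},
)
```

The resulting string would be sent to the LLM; a baseline in the spirit of the abstract would simply omit the interaction section.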

2606.10937 2026-06-10 cs.DB cs.AI New submission

Provenance Tracking in AI Compilers through the Lens of Coalgebra

Zilu Tian, Liying Liu

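Affiliations: OmniVision Technology Singapore; Black Sesame Technology Singapore

AI summary: Addresses the difficulty of tracking provenance under graph rewrites in AI compilers by proposing a lightweight approach based on observational semantics, formalized with coalgebra and bisimulation and validated in the prototype compiler COVAN.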

Abstract

AI compilers aggressively rewrite computation graphs through normalization, lowering, and optimization, making it difficult to track the provenance of tensors and operators across compilation. Reliable provenance is essential for attaching platform-specific postprocessing, debugging compiler behavior, and validating transformations, yet existing solutions are either invasive or ad hoc under non-injective graph rewrites. We present a lightweight, generative approach to provenance tracking based on observational semantics. Instead of propagating identifiers through compiler passes, we observe graph transformations and reason about provenance in terms of observable computational actions. We formalize this approach using a coalgebraic model and bisimulation, which preserves provenance even when intermediate nodes are eliminated. Furthermore, we implement this approach in a prototype AI compiler COVAN, demonstrating stable provenance across compilation pipelines with minimal engineering overhead.

2606.10861 2026-06-10 cs.SE cs.AI cs.HC New submission

From Perception to Action: Can UI Interventions Foster Sustainable LLM Chatbot

Nitish Patkar, Pooja Rani, Jack Glässer, Simon Lüscher, Martin Kropp

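Affiliations: University of Applied Sciences and Arts Northwestern Switzerland (FHNW); University of Mannheim

AI summary: Studies whether UI interventions (such as mode switching and energy feedback) can raise users' awareness of LLM chatbot energy use and encourage energy-saving behavior, and finds that mode switching is the primary behavioral mechanism.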

Abstract

LLM-powered chatbots are increasingly embedded in everyday workflows, raising sustainability concerns due to their energy use. Most mitigation strategies emphasize model or infrastructure efficiency, while the user-interface (UI) layer remains underexplored despite its potential to shape interaction behavior. We investigate whether sustainability-oriented UI interventions can increase users' energy awareness and encourage more energy-responsible chatbot use without reducing usability. We first conducted a baseline survey with 77 participants to assess awareness and receptiveness to intervention concepts. Guided by prior work on persuasive technology and choice architecture, we implemented a web-based chatbot prototype with a three-mode switch (Energy-efficient, Balanced, Performance), per-response energy feedback, pre-send energy estimates, a usage metrics dashboard, and energy analogies. We then evaluated the prototype in a five-day field study with 11 participants. In the baseline survey, 94.8% of respondents reported at least some awareness of AI energy use, yet 88.3% misestimated actual consumption. Although concern about environmental impact was high, only 39.0% indicated willingness to accept a performance trade-off for lower energy use. In the field study, Energy-efficient mode accounted for 55.8% of logged prompts, while 90.9% self-reported actively choosing Eco-mode when high accuracy was not required. Participants did not reduce prompt length, suggesting mode switching as the primary behavioral mechanism. Sustainability-oriented UI interventions can improve awareness and support more energy-responsible interaction patterns in LLM chatbots. These effects are best interpreted as behavioral and model-based estimates that complement backend efficiency work, and the provided prototype and replication package support further research on energy-aware conversational AI design.

2606.10860 2026-06-10 cs.CR cs.CL New submission

Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

Lena S. Bolliger, Lena A. Jäger

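Affiliations: Department of Computational Linguistics, University of Zurich, Switzerland

AI summary: Proposes Gravity-Weighted DPO (GW-DPO), which weights the structural distance between conflicting levels under a linear or bilateral schedule; combined with hierarchy delimiters and Instructional Segment Embeddings, it raises multi-level instruction priority adherence and lowers over-refusal on Llama-3.1-8B-Instruct.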

Abstract

Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.
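The count of ten pairwise priority relations follows from choosing 2 of the k=5 levels, and a distance-scaled offset can be sketched in a few lines. The abstract does not give the exact schedule, so `linear_offset`, its normalization by k-1, the `scale` parameter, and the convention that level 0 is the most privileged are all assumptions for illustration; the bilateral schedule is not reproduced here.

```python
from itertools import combinations
from typing import Dict, List, Tuple

K = 5  # number of hierarchy levels; level 0 is treated as most privileged (assumption)


def priority_pairs(k: int) -> List[Tuple[int, int]]:
    """All (higher-privilege, lower-privilege) level pairs in a k-level hierarchy."""
    return list(combinations(range(k), 2))


def linear_offset(high: int, low: int, k: int, scale: float = 1.0) -> float:
    """Hypothetical linear schedule: offset proportional to the privilege gap."""
    return scale * (low - high) / (k - 1)


pairs = priority_pairs(K)
offsets: Dict[Tuple[int, int], float] = {
    pair: linear_offset(pair[0], pair[1], K) for pair in pairs
}
```

Under this assumed schedule a conflict between the most and least privileged levels receives the largest offset, and conflicts between adjacent levels receive the smallest.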

2606.10827 2026-06-10 cs.NI cs.AI New submission

A Unified Siamese Learning Framework for Zero-Day Anomaly Detection and Classification in Optical Networks

Carlos Natalino, Flávia P. Monteiro, Paolo Monti

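Affiliations: Department of Electrical Engineering, Chalmers University of Technology; Federal University of Western Pará (UFOPA)

AI summary: Proposes a multi-similarity Siamese neural network that unifies zero-day anomaly detection and one-shot classification in optical networks, achieving over 99% accuracy across lightpaths and unseen anomaly types without retraining.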

Comments Authors' version of the manuscript accepted and published at the Optical Fiber Communication Conference (OFC) 2026. 4 pages, 3 figures

Journal ref Optical Fiber Communication Conference (OFC) 2026

2606.10782 2026-06-10 cs.CR cs.AI cs.LG New submission

A Bayesian Network Approach for Enhancing Security-Focused Decision Support Systems

Carolina Fernández-Martínez, Shuaib Siddiqui, Vanesa Daza

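Affiliations: University of Granada; University of Birmingham

AI summary: Proposes a Bayesian-network-based decision support system that helps infrastructure operators select security tools by capturing user requirements and running inference to recommend the best-suited security mechanisms; evaluated in terms of time and prediction accuracy.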

Journal ref Proc. 2025 IEEE 50th Conference on Local Computer Networks (LCN), 2025

DOI: 10.1109/LCN65610.2025.11146363

Abstract

The adoption and integration of heterogeneous stacks in most of today's open-source based networks brings clear benefits like interoperability and availability of advanced features. Yet, on the other hand the increasing number of interconnecting components and moving parts requires maintaining an ever increasing base of interdisciplinary knowledge of different tools in different domains to ensure proper operation. To alleviate such efforts, this work proposes a Decision Support System (DSS) to guide infrastructure operators through the selection of security approaches (e.g. tools) to adopt in their environments. This framework easily captures the end-user high-level requirements on the security triad for different domains and runs inference on the designated models to provide the identified tools (security mechanisms) that better serve such needs. The presented DSS aims at delivering an understandable and extensible framework to accommodate varying requirements and Bayesian Network (BN) models. The architecture and modelling of the system are proposed, aligned with its theoretical framework. Its performance is evaluated in terms of time and prediction accuracy.

2606.10749 2026-06-10 cs.CR cs.AI New submission

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

Yuchen Ling, Shengcheng Yu, Zhenyu Chen, Chunrong Fang

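Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Technical University of Munich

AI summary: Synthesizes 247 papers through a lifecycle-based, systems-oriented framework centered on information flow, delegated authority, and persistent state, systematically surveying the threat surfaces, attacks, defenses, and evaluation of LLM agents; it finds that prompt injection and tool-mediated control-flow hijacking remain the dominant threats, while persistent state corruption and multi-agent propagation are emerging concerns.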

Abstract

Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle-based, systems-oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool-mediated control-flow hijacking still dominate the field, while persistent state corruption and multi-agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long-horizon, stateful, and deployment-sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance-aware state management, and evaluation practices aligned with realistic operational settings.

2606.10742 2026-06-10 cs.CR cs.LG New submission

MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents

Yv Zhang, Hao Sun, Hao Fang, Kuofeng Gao, Fan Mo, Bin Chen, Shu-Tao Xia, Yaowei Wang

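Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Tianjin University; Shenzhen International Graduate School, Tsinghua University; Huawei Technologies Ltd.

AI summary: Proposes MemVenom, a framework targeting the external memory of web agents that achieves black-box multimodal memory poisoning through trigger-conditioned retrieval and attack induction, reaching a high success rate with minimal impact on benign performance.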

Comments Preprint. 27 pages, 6 figures, 6 tables

Abstract

External memory has become a core component of modern web agents, enabling long-horizon reasoning through the retrieval of past experiences. However, this paradigm introduces a critical vulnerability: malicious content injected into memory can be persistently recalled and repeatedly influence agent behavior. In this work, we identify and systematically study multimodal memory poisoning, an overlooked yet practical attack surface in web-agent systems. We propose MemVenom, a unified black-box attack framework that poisons graph-structured external memory with coordinated text-image evidence. Our method consists of a two-stage design: (1) a trigger-conditioned retrieval attack that ensures high-probability recall of malicious memory, and (2) a post-retrieval attack induction that leverages adversarial perturbations and stealthy OCR injection to override the original user objective. Unlike prior attacks that operate on prompts or text-only memory, our approach enables persistent, reusable, and goal-agnostic attacks without modifying model parameters or re-optimizing malicious tasks. Experiments across multiple web-agent frameworks and vision-language models demonstrate that MemVenom achieves strong end-to-end attack success with minimal impact on benign performance, reaching up to 99.15% on GPT-5-family web agents, while transferring effectively across architectures and model scales.

2606.10709 2026-06-10 cs.IR cs.AI New submission

Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training

João Coelho, João Magalhães, Bruno Martins, Chenyan Xiong

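Affiliations: Language Technologies Institute, Carnegie Mellon University; Instituto Superior Técnico and INESC-ID, University of Lisbon; NOVA LINCS, NOVA School of Science and Technology

AI summary: Proposes query recycling, which returns zero-variance queries encountered during training to the sampling pool so that the effective training distribution evolves with the policy; a 1.7B model reaches an average Pass@1 of 66.0 on seven multi-hop QA benchmarks, matching or surpassing systems with up to 7B parameters.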

Abstract

The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these algorithms, a query contributes to parameter updates only when its rollout group mixes successes and failures; all-correct (too-easy) and all-incorrect (too-hard) groups are zero-variance and waste rollout cost. Existing approaches treat zero-variance as a static property and either discard or pre-filter such groups. We hypothesize and empirically validate that queries flip between zero-variance and signal-bearing states as the policy evolves during training. Building on this intuition, we propose query recycling, which returns zero-variance groups to a mutable pool for future resampling, so that the effective training distribution co-evolves with the policy. With the proposed technique, a 1.7B parameter model trained on synthetic data can reach 66.0 average Pass@1 across seven multi-hop QA benchmarks, matching or surpassing systems with up to 7B parameters trained on benchmark-derived supervision. Analysis of recycling patterns shows that recycled queries supply roughly three quarters of the effective batch by the end of training, with contributions split between recovery from policy improvement and policy drift.
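The recycling idea can be sketched in a few lines, assuming binary outcome rewards. The names `is_zero_variance` and `training_step`, the FIFO pool, and the stand-in rollout function are hypothetical; the paper's actual sampling and pool policy may differ, including what happens to queries that do carry signal.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def is_zero_variance(rewards: List[int]) -> bool:
    """A rollout group carries no group-relative signal when all rewards are equal."""
    return len(set(rewards)) <= 1


def training_step(
    pool: Deque[str],
    batch_size: int,
    rollout: Callable[[str], List[int]],
) -> Tuple[List[Tuple[str, List[int]]], int]:
    """Draw queries from the pool, keep mixed groups, recycle zero-variance ones."""
    effective: List[Tuple[str, List[int]]] = []
    recycled = 0
    for _ in range(min(batch_size, len(pool))):
        query = pool.popleft()
        rewards = rollout(query)
        if is_zero_variance(rewards):
            pool.append(query)  # return to the mutable pool for later resampling
            recycled += 1
        else:
            effective.append((query, rewards))  # contributes to the parameter update
    return effective, recycled


# Stand-in rollouts: fixed reward groups instead of real policy samples.
fake_rewards = {"easy": [1, 1, 1, 1], "hard": [0, 0, 0, 0], "mixed": [1, 0, 1, 0]}
pool = deque(["easy", "hard", "mixed"])
effective, recycled = training_step(pool, 3, lambda q: fake_rewards[q])
```

In a real run the rollout results would change as the policy updates, so a query recycled now may yield a mixed group when it is drawn again later.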

2606.10692 2026-06-10 cs.CR cs.LG New submission

Do LLMs Make Neural Distinguishers Wise?

Tatsuya Sakagami, Masashi Hisai, Naoto Yanai

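Affiliations: University of Tokyo

AI summary: Proposes LLM-based neural distinguishers built through prompt design and evaluates them on SPECK-32/64, finding that LLMs yield no observable performance gain, that the choice of differences stops mattering at high rounds, and that including XOR results in the prompt improves performance.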

Journal ref DeMeSSAI 2026 poster

Abstract

Neural distinguishers are a cryptanalysis method for symmetric-key cryptography that trains machine learning models on pairs of plaintexts and ciphertexts with specific differences in order to recover a secret key. To the best of our knowledge, no existing work has explored the use of large language models (LLMs) for neural distinguishers. In this paper, we propose LLM-based neural distinguishers through a prompt design and conduct extensive experiments with them on SPECK-32/64 to investigate whether LLMs can strengthen neural distinguishers. We then found three key insights. First, by comparing the results of LLM-based neural distinguishers with ResNet in the existing work, we demonstrate that LLMs provide no observable improvement in the performance of neural distinguishers. Second, we confirm that, at high rounds, the choice of differences is no longer effective for LLM-based neural distinguishers as well as ResNet. Third, we show that the performance of LLM-based neural distinguishers can be significantly improved by incorporating only the XOR operation results as a prompt design.
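The third finding, feeding the model only XOR results, can be illustrated with a small sketch. SPECK-32/64 uses 32-bit blocks, so the XOR of a ciphertext pair fits in 32 bits; the function names `xor_bits` and `build_prompt`, the prompt wording, and the example ciphertext values are assumptions and do not reproduce the paper's prompt design.

```python
def xor_bits(c0: int, c1: int, block_bits: int = 32) -> str:
    """XOR two ciphertext blocks and render the difference as a fixed-width bit string."""
    mask = (1 << block_bits) - 1
    return format((c0 ^ c1) & mask, f"0{block_bits}b")


def build_prompt(c0: int, c1: int) -> str:
    """Hypothetical prompt that exposes only the XOR of the ciphertext pair."""
    return (
        "Ciphertext XOR (32 bits): " + xor_bits(c0, c1) + "\n"
        "Answer 1 if this pair came from the cipher with the chosen input difference, "
        "else 0."
    )


# Arbitrary example values, not real SPECK outputs.
difference = xor_bits(0x12345678, 0x1234567F)
prompt = build_prompt(0x12345678, 0x1234567F)
```

Presenting the difference directly spares the model from computing a bitwise XOR over two long numbers itself, which may be why this prompt variant helps.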

2606.10662 2026-06-10 cs.MA cs.AI New submission

Decentralized Multi-Agent Systems with Shared Context

Yuzhen Mao, Azalia Mirhoseini

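Affiliations: Stanford University

AI summary: Proposes DeLM, a framework that decentralizes coordination through parallel agents, a shared context, and a task queue to remove the bottleneck of centralized MAS, improving performance and lowering cost in software engineering and long-context reasoning.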

Abstract

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
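The coordination pattern described in the abstract (agents claim subtasks from a queue, read shared progress, and write back verified updates) can be sketched with standard threads. This is a toy illustration, not the DeLM implementation: `run_agents`, the upper-casing stand-in for LLM reasoning, and the trivial verification check are all hypothetical.

```python
import queue
import threading
from typing import Dict, List


def run_agents(subtasks: List[str], n_agents: int = 3) -> Dict[str, str]:
    """Run worker agents that pull from a task queue and share a verified context."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for task in subtasks:
        tasks.put(task)

    shared_context: Dict[str, str] = {}
    lock = threading.Lock()

    def agent() -> None:
        while True:
            try:
                task = tasks.get_nowait()  # claim a subtask; no central assigner
            except queue.Empty:
                return
            with lock:
                progress = dict(shared_context)  # snapshot of verified progress so far
            # Placeholder for local LLM reasoning that would condition on `progress`.
            result = task.upper()
            if result:  # placeholder verification step
                with lock:
                    shared_context[task] = result  # write back a compact update

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared_context


context = run_agents(["parse", "patch", "test", "review"])
```

No agent routes its output through a controller; the only shared objects are the queue and the lock-protected context, which is the property the abstract credits with removing the integration bottleneck.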
