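arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.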

2606.12713 2026-06-12 cs.AI New submission

Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI

J. E. Aguilera Briones

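Affiliations: Universidad Internacional de Investigación México

AI summary: To address disputes caused by inconsistent definitions of AGI, the paper proposes the DAF-AGI framework, which comprises five ordinal criteria and a structured governance audit for assessing candidate definitions and adjudicating AGI claims.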

Comments 31 pages, 1 table, 2 appendices

Abstract

Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping evidence. "AGI" lacks a single shared and stable referent and competing operationalizations can return different verdicts on the same system. This article treats that under-specification as a design and governance problem. Following Design Science Research Methodology, it develops DAF-AGI, a second-order conceptual artifact with two coupled components: five ordinal criteria for assessing the adjudicative fitness of candidate definitions and a structured governance audit of authorship, interest, certification, external verification and revision authority. The artifact is demonstrated on five prominent measurement families and one deflationary boundary position in a documented corpus and then stress-tested against a stylized strong arrival claim: that current generative systems constitute AGI because they outperform a well-educated adult on many cognitive tasks. On evidence from the cited 2024-2025 sources, the claim was certifiable only under a performance-based operationalization; capability-ontology, psychometric and skill-acquisition approaches did not certify it, the economic family remains indeterminate and the deflationary position refuses binary adjudication. The contribution is a novel integration and operationalization, not an empirical validation: independent application, inter-rater testing and author-external cases remain necessary. The paper further proposes definitional sovereignty as an enabling component of algorithmic sovereignty: the institutional capacity to contest, certify and revise imported technological categories under public accountability.

2606.12708 2026-06-12 cs.CL cs.AI New submission

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum

Affiliations: Princeton University; Laboratory for Artificial Intelligence, Princeton University; Gaston Berger University; Mila, McGill University; Canada CIFAR AI Chair; Paris Nanterre University; Paris-Saclay University; CNRS; Inria; LORIA; Université de Lorraine; University of Trento; University of Minnesota–Twin Cities; Imperial College London; Binghamton University; Makerere University; Penn State University; Mbarara University of Science and Technology; Chalmers University of Technology; University of Ibadan; Nnamdi Azikiwe University; South African Centre for Digital Language Resources

AI summary: To address the shortage of NLP resources for African languages, the authors build AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine African languages; evaluating a range of models on it reveals a significant syntax gap.

Abstract

Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture typological key features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

2606.12706 2026-06-12 cs.CV New submission

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Thach Nguyen, Danhua Guo, Tom Lampo, Fei Wu, Burhan Yaman

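Affiliations: Uber AV Labs

AI summary: Proposes the VLADriveBench framework, which combines observational metrics with a CoT intervention protocol to assess the correlation and causality between chain-of-thought and driving actions in VLA models, and finds marked differences across models.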

2606.12699 2026-06-12 cs.LG cs.AI New submission

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Affiliations: Department of Information Systems and Cybersecurity, The University of Texas at San Antonio; School of Engineering Medicine, Texas A&M University; Department of Family and Community Medicine, The University of Texas at San Antonio

AI summary: Proposes GlyLLM, a framework that uses large language models to integrate wearable sensor data with structured metadata for personalized modeling of glycemic dynamics; it improves on traditional ML methods by 13.66% on glucose forecasting and 13.08% on diabetes categorization.

Comments The 14th IEEE International Conference on Healthcare Informatics, 2026

Abstract

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic curve (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

2606.12690 2026-06-12 cs.RO cs.AI New submission

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

Xin Zhou, Cong Miao

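Affiliations: Astronex Robotics; Nanjing University of Information Science and Technology

AI summary: Proposes the EWAM architecture, built on a frozen Cosmos3 backbone, which achieves zero-shot online adaptation through four lightweight neural layers without fine-tuning or additional demonstration data, significantly reducing the deployment data needed for new task layouts.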

Abstract

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
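
The three-way routing described in the abstract above (direct execution, conservative replanning, rollback recovery, chosen by anomaly severity) can be illustrated with a toy rule. This is only a sketch under stated assumptions: in EWAM the routing layer is a learned neural component, whereas the code below uses a hand-written threshold rule, and the function names, the mean-absolute-divergence severity measure, and both threshold values are hypothetical.

```python
from enum import Enum


class Route(Enum):
    DIRECT_EXECUTION = "direct_execution"
    CONSERVATIVE_REPLANNING = "conservative_replanning"
    ROLLBACK_RECOVERY = "rollback_recovery"


def anomaly_severity(predicted_state, actual_state):
    """Mean absolute divergence between predicted and observed state vectors.

    This particular severity measure is an assumption made for illustration.
    """
    if len(predicted_state) != len(actual_state):
        raise ValueError("state vectors must have the same length")
    total = sum(abs(p - a) for p, a in zip(predicted_state, actual_state))
    return total / len(predicted_state)


def route(severity, replan_threshold=0.1, rollback_threshold=0.5):
    """Map a severity score to one of the three routes named in the abstract.

    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    if severity >= rollback_threshold:
        return Route.ROLLBACK_RECOVERY
    if severity >= replan_threshold:
        return Route.CONSERVATIVE_REPLANNING
    return Route.DIRECT_EXECUTION
```

A fixed rule like this makes the escalation order explicit; a learned router would replace `route` with a classifier trained on labeled routing decisions.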

2606.12689 2026-06-12 cs.CL New submission

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard

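Affiliations: Université Grenoble Alpes, CNRS, Grenoble INP, LIG; Université Paris-Saclay; NAVER LABS Europe

AI summary: Using controlled experiments and causal interventions, the paper finds that observable patterns in latent reasoning models (such as BFS frontiers) also appear in controls and do not always causally affect behavior; it argues that the use of latent thoughts is graded, that their causal effect concentrates in low-rank directions, and that the geometry becomes more ordered as behavioral influence grows.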

Abstract

Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

2606.12687 2026-06-12 cs.LG New submission

Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

Yunbo Wang, Bolbi Liu

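Affiliations: University of California, Irvine; AdsGency AI

AI summary: Targeting graph-based neural marketing mix models that forecast accurately yet fail at attribution, the paper proposes the DICE-MMM framework, which restricts decoder communication paths to diagnose and localize attribution bypass; experiments show that low forecasting error does not guarantee correct attribution.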

Abstract

Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder's perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

2606.12683 2026-06-12 cs.AI cs.CY cs.LG New submission

From AGI to ASI

Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

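Affiliations: Google DeepMind; University of Waterloo; Australian National University; University College London

AI summary: Examines pathways for the transition from human-level AGI to superintelligence, including scaling, paradigm shifts, recursive improvement, and multi-agent emergence, and analyzes frictions and bottlenecks.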

Abstract

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

2606.12680 2026-06-12 cs.LG stat.ML New submission

How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe, Fanny Yang

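Affiliations: Department of Computer Science, ETH Zurich; Causal Artificial Intelligence Lab, Columbia University; Department of Statistics, Columbia University

AI summary: Studies how causal invariance can improve supervised domain adaptation in linear regression; it derives matching upper and lower bounds in terms of the target-risk margins between candidate predictors and the finite-sample estimation error, and shows that adaptive aggregation avoids negative transfer when the margins are large enough.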

Abstract

Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to $n_Q$, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.
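
The idea of choosing among source-trained candidate predictors with a few labeled target samples, while keeping a target-only fallback, can be sketched as follows. The abstract does not spell out the paper's adaptive aggregation procedure, so this sketch swaps in a plain "pick the lowest holdout risk" rule; the function names and the use of squared error are assumptions made for illustration.

```python
def empirical_risk(predictor, samples):
    """Mean squared error of a predictor on a list of (x, y) pairs."""
    return sum((predictor(x) - y) ** 2 for x, y in samples) / len(samples)


def select_predictor(candidates, target_only, holdout):
    """Return (name, predictor) with the lowest risk on held-out target samples.

    candidates:  dict mapping a name to a source-trained predictor
    target_only: predictor fitted on target samples only (the fallback)
    holdout:     labeled target samples not used to fit target_only
    """
    pool = dict(candidates)
    pool["target_only"] = target_only
    best = min(pool, key=lambda name: empirical_risk(pool[name], holdout))
    return best, pool[best]
```

Including the target-only predictor in the pool is what guards against negative transfer in this toy version: if every candidate is worse on the holdout set, the fallback is selected. When the candidates' target risks are nearly tied, a handful of samples cannot separate them reliably, which is the regime the abstract's lower bound addresses.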

2606.12679 2026-06-12 cs.LG cs.CR eess.IV New submission

Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

Weijie Chen, Alan B. McMillan

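Affiliations: University of Wisconsin–Madison

AI summary: Proposes Fed-FBD, a modular federated architecture that decomposes a ResNet into six functional blocks and maintains a warehouse of color variants, providing block-level isolation, privacy by design, and sub-second surgical unlearning, and trading a small accuracy cost for these security guarantees on several datasets.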

Comments 12 pages, 3 figures, 8 tables. Code: https://github.com/wchen-ai/functional-block-diversification

Abstract

Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant's right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant's contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client's own blocks with at most +/-0.01 AUC drift on the clean colors.
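
The bookkeeping implied by "contributor-stamped blocks" in a "warehouse of N color variants" can be pictured with a small registry. This is a minimal sketch, assuming a simple mapping from (block, color) to the set of clients that touched it; the class and method names and the block labels are hypothetical, and the abstract does not say what the real system does with a departed client's blocks, so the sketch only locates them.

```python
from dataclasses import dataclass, field

# Six functional blocks as listed in the abstract; the short labels are illustrative.
BLOCK_NAMES = ("stem", "res1", "res2", "res3", "res4", "head")


@dataclass
class Block:
    name: str
    color: int
    # Clients whose updates were applied to this block (the "stamp").
    contributors: set = field(default_factory=set)


class Warehouse:
    def __init__(self, num_colors):
        self.blocks = {
            (name, color): Block(name, color)
            for name in BLOCK_NAMES
            for color in range(num_colors)
        }

    def record_update(self, name, color, client_id):
        """Stamp a block with the client that contributed to it."""
        self.blocks[(name, color)].contributors.add(client_id)

    def tainted_by(self, client_id):
        """Sorted (block, color) keys that carry the given client's stamp."""
        return sorted(
            key for key, block in self.blocks.items()
            if client_id in block.contributors
        )

    def clean_colors(self, client_id):
        """Colors in which no block carries the given client's stamp."""
        tainted = {color for _, color in self.tainted_by(client_id)}
        all_colors = {color for _, color in self.blocks}
        return sorted(all_colors - tainted)
```

Because a lookup like this touches only metadata, finding everything a client influenced costs far less than retraining, which is consistent with (though not a demonstration of) the sub-second unlearning cost the abstract reports.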

2606.12673 2026-06-12 cs.LG cs.AI New submission

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang

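Affiliations: School of Computing, KAIST

AI summary: Proposes the AlignGAD framework for zero-shot cross-domain graph anomaly detection, with a global unification module that aligns heterogeneous features, a clustering module that captures group-level anomaly patterns, and a node discrepancy scoring module that aggregates anomaly evidence across views.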

Abstract

Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

2606.12662 2026-06-12 cs.SD cs.AI cs.LG New submission

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

Damien Martins Gomes, François Capman

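Affiliations: Thales SIX GTS, France

AI summary: Proposes BASENet, which partitions the spectrum into Bark-scale bands, assigns each band an encoder of adapted capacity, and adds a cross-band attention module, reaching high PESQ and STOI with very few parameters and suiting resource-constrained devices.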

Abstract

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI~96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
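
Partitioning a spectrum into Bark-scale bands can be sketched with the widely used Zwicker-style approximation of the Bark scale. The sketch below shows only the partitioning step; the integer-band assignment, the function names, and the bin layout are assumptions made for illustration, and how BASENet turns critical-band density into encoder capacity is not specified in the abstract.

```python
import math


def hz_to_bark(freq_hz):
    """Zwicker-style approximation of the Bark scale."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


def bark_band_of_bins(num_bins, sample_rate):
    """Assign each linear-frequency bin (0 Hz .. Nyquist) to an integer Bark band."""
    if num_bins < 2:
        raise ValueError("need at least two frequency bins")
    nyquist = sample_rate / 2.0
    bands = []
    for k in range(num_bins):
        freq = nyquist * k / (num_bins - 1)
        bands.append(int(hz_to_bark(freq)))
    return bands


def bins_per_band(bands):
    """Count how many linear-frequency bins fall into each Bark band."""
    counts = {}
    for band in bands:
        counts[band] = counts.get(band, 0) + 1
    return counts
```

For a 257-bin spectrum at 16 kHz, the lowest Bark bands each cover only a handful of linear bins while the highest cover dozens, which is the non-uniform resolution the abstract refers to when it calls low frequencies perceptually dense.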

2606.12658 2026-06-12 cs.LG q-bio.QM stat.ML New submission

TEDD: Robust Detection of Unstable Temporal Features

Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo

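Affiliations: Feedzai

AI summary: Proposes TEDD, a method that uses a regression model to detect the features responsible for temporal distribution change; it needs no parameter tuning, scales well, and detects univariate and multivariate drift in both numerical and categorical features.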

Comments 8 pages, 9 figures

DOI: 10.1109/ICDMW51313.2020.00063

Abstract

When working with real-world temporal data, it is common to encounter features whose distribution is changing over time. The naive employment of Machine Learning models on this unstable data might lead to rapidly degrading performance, especially if the new distribution is much different from what was previously seen during training. In order to cope with this problem, it is critical to automatically identify features that are changing over time. With these features detected, data scientists and other practitioners will be able to mitigate the issue (for instance, by applying data transformations), deploying more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause such lack of robustness. In order to achieve it, we leverage a regression model to highlight which features contribute to a good prediction of an instance's timestamp. We compare our approach to other methods in real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, both for numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and is scalable both on number of features and instances of the dataset.
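
The core idea in the abstract, that a feature which helps predict an instance's timestamp is a feature whose distribution changes over time, can be shown with a deliberately simplified stand-in. This is a sketch under stated assumptions, not the TEDD method: it fits a separate one-variable least-squares regression per feature and uses R^2 as the score, so it only catches linear shifts in the mean, whereas the paper reports detecting all basic change types and multivariate drift. All names and the synthetic data are hypothetical.

```python
import random


def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def timestamp_predictability(features, timestamps):
    """Score each feature column by how well it alone predicts the timestamp."""
    return {name: r_squared(column, timestamps) for name, column in features.items()}


# Synthetic example: one stationary feature and one with a slow upward drift.
rng = random.Random(0)
timestamps = list(range(500))
features = {
    "stable": [rng.gauss(0.0, 1.0) for _ in timestamps],
    "drifting": [0.01 * t + rng.gauss(0.0, 1.0) for t in timestamps],
}
scores = timestamp_predictability(features, timestamps)
```

Here `scores` gives a per-feature value on a common 0-to-1 scale, so features can be ranked by how strongly they leak time; a stationary feature scores near zero because it carries no information about when an instance was recorded.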

2606.12640 2026-06-12 cs.LG cs.RO cs.SY eess.SY New submission

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

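Affiliations: Department of Electrical Engineering and Automation, Aalto University; School of Computing and Data Science, Xiamen University Malaysia; Department of Computer Science, University of Toronto

AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers control policies via inverse dynamics, markedly improving the safety of generated trajectories while preserving reward.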

Comments Accepted to the 23rd IFAC World Congress, 2026

2606.12639 2026-06-12 cs.LG q-bio.QM New submission

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

Dhruv Agarwal, Riya Bisht

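Affiliations: University of California, Berkeley

AI summary: Using VCPI contest data, the study finds that model rankings for drug-response prediction flip with the evaluation metric: simple baselines win under a proxy metric, whereas deep models significantly beat the linear fingerprint baseline under the true metric, which the authors describe as the first demonstration of the metric-calibration effect on real drug chemistry data.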

Abstract

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
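
The non-parametric retrieval stage named in the abstract, a Tanimoto-weighted average over a held-out compound's nearest training compounds, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' pipeline: fingerprints are represented as plain Python sets of on-bit indices, the neighbour count `k` and the fallback to the mean training response when nothing overlaps are choices made here, and all names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_response(query_fp, train_fps, train_responses, k=3):
    """Tanimoto-weighted average of the k most similar training responses.

    train_responses[i] is the per-gene response vector of training compound i.
    """
    scored = sorted(
        ((tanimoto(query_fp, fp), idx) for idx, fp in enumerate(train_fps)),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    n_genes = len(train_responses[0])
    if total == 0:
        # No overlap with any neighbour: fall back to the mean training response.
        return [
            sum(resp[g] for resp in train_responses) / len(train_responses)
            for g in range(n_genes)
        ]
    return [
        sum(sim * train_responses[idx][g] for sim, idx in scored) / total
        for g in range(n_genes)
    ]
```

Weighting by similarity means a near-duplicate training compound dominates the prediction, while a query with no close neighbours degrades gracefully toward the mean-response baseline that the abstract lists among its trivial baselines.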

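2606.12635 2026-06-12 cs.CV New submission

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

Affiliations: University of California, Irvine

AI summary: Proposes the PersonaDrive pipeline, which conditions a vision-language-action (VLA) driving agent on retrieved human driving demonstrations recorded under style instructions, enabling diverse non-ego agent behavior in closed-loop simulation without per-style retraining.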

English abstract

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
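
The inference-time idea in the abstract, where one backbone is steered toward a style by changing which per-style database is queried, can be sketched as follows. This is only a toy stand-in: the paper describes a learned retrieval head over frozen visual features, whereas this sketch uses plain cosine similarity over hand-written two-dimensional embeddings, and every name and data value here is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A demonstration is (scene embedding, behavior label); both are toy values.
Demo = Tuple[Sequence[float], str]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_context(
    style: str,
    query: Sequence[float],
    databases: Dict[str, List[Demo]],
    k: int = 1,
) -> List[str]:
    """Return the k demonstrations from the chosen style's database closest to the query."""
    ranked = sorted(databases[style], key=lambda d: cosine(query, d[0]), reverse=True)
    return [label for _, label in ranked[:k]]


# One database per style; the same query hits a different one per style.
databases: Dict[str, List[Demo]] = {
    "aggressive": [((1.0, 0.0), "close-follow overtake"), ((0.0, 1.0), "late brake")],
    "conservative": [((1.0, 0.0), "keep gap"), ((0.0, 1.0), "early brake")],
}
scene = (0.9, 0.1)
aggressive_ctx = retrieve_context("aggressive", scene, databases)
conservative_ctx = retrieve_context("conservative", scene, databases)
```

The point of the sketch is that nothing about the model changes between the two calls; only the database key does, which is why selecting a style needs no per-style retraining.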

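2606.12615 2026-06-12 cs.LG New submission

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

Owen O'Neill, Fintan Costello

Affiliations: University College Dublin

AI summary: Proposes the Fair Bayesian classifier, which enforces determinism and statistical consistency to achieve zero consistency error on multiple datasets while maintaining accuracy and multicalibration, addressing the inconsistent predictions that regularisation causes for minority groups.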

English abstract

ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

2606.12614 2026-06-12 cs.RO New submission

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

Benjamin Alcorn, Eman Hammad

Affiliations: Texas A&M University

AI summary: Proposes the DARRMS algorithm, which optimizes the attention radius together with decision-making to reduce computational demand under resource constraints and improve coordination and scalability.

English abstract

Multi-agent systems are integral tools for various domains such as robotics, cybersecurity, and autonomous vehicle planning. These types of systems often have constraints on computational resources, leading to a need for efficient lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not align with real-world challenges. In this paper, we introduce a new algorithm that allows for reduced demand on computational resources without a large cost to other performance metrics. Agents will limit their observability to some attention radius, which intentionally allows them to ignore parts of the environment that might be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
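
To make the attention-radius idea concrete, here is a minimal sketch of an agent that only observes entities within a given radius, plus one very simple way to pick that radius. The abstract does not say how DARRMS actually optimizes the radius, so the rule used here (the smallest candidate radius that still covers at least k entities) is a hypothetical heuristic, and all names and coordinates are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def observe(agent: Point, others: Sequence[Point], radius: float) -> List[Point]:
    """Keep only the entities within the agent's attention radius."""
    return [p for p in others if math.dist(agent, p) <= radius]


def smallest_covering_radius(
    agent: Point,
    others: Sequence[Point],
    k: int,
    candidates: Sequence[float],
) -> float:
    """Hypothetical heuristic: smallest candidate radius that observes at least k entities."""
    for radius in sorted(candidates):
        if len(observe(agent, others, radius)) >= k:
            return radius
    # No candidate covers k entities, so fall back to the largest one.
    return max(candidates)


agent = (0.0, 0.0)
others = [(1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (10.0, 0.0)]
radius = smallest_covering_radius(agent, others, k=2, candidates=[1.5, 3.0, 8.0, 12.0])
visible = observe(agent, others, radius)
```

With a radius of 3.0 the agent plans over two entities rather than four, which is the kind of reduced observation the abstract credits with lowering computational demand.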

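2606.12610 2026-06-12 cs.LG New submission

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

Miquel Noguer i Alonso, David Pacheco Aznar

Affiliations: AIFI; Staq.io

AI summary: Offers a mathematical account of the AI winters, analysing why early AI paradigms failed through mathematical bottlenecks such as perceptron impossibility results, the complexity of neural-network training, rates for high-dimensional nonparametric estimation, vanishing gradients, and statistical learning theory, and relating them to later breakthroughs.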

Comments 33 pages, 1 figure

English abstract

Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.
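
The first bottleneck listed in the abstract, the perceptron impossibility results of Minsky and Papert, has a well-known small instance: a single-layer perceptron can learn AND but cannot represent XOR, because XOR is not linearly separable. The sketch below illustrates that textbook fact with the classic perceptron update rule; it is not code from the paper, and the epoch budget and function name are arbitrary choices.

```python
from typing import List, Tuple

Sample = Tuple[Tuple[int, int], int]


def train_perceptron(data: List[Sample], epochs: int = 100) -> bool:
    """Return True if the perceptron rule reaches an epoch with zero training errors."""
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        errors = 0
        for (x0, x1), target in data:
            pred = 1 if w0 * x0 + w1 * x1 + b > 0 else 0
            delta = target - pred
            if delta != 0:
                # Classic perceptron update: move the boundary toward the missed sample.
                w0 += delta * x0
                w1 += delta * x1
                b += delta
                errors += 1
        if errors == 0:
            return True
    return False


AND: List[Sample] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR: List[Sample] = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
and_learned = train_perceptron(AND)
xor_learned = train_perceptron(XOR)
```

AND converges after a handful of epochs, while XOR never reaches a zero-error epoch however large the budget, since no single line separates its two classes.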

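2606.12609 2026-06-12 cs.LG q-bio.QM New submission

Viral Proteins Reveal Geometry of Protein Language Models

Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

Affiliations: University of Washington; DeepMind

AI summary: Studies how protein language models trained on imbalanced data represent viral proteins, finding a dominant "naturalness" axis in embedding space that orders sequences by model perplexity; the effect of scaling varies by viral family, yet the embeddings still retain virus-specific signal.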

Comments Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at https://github.com/MisteFr/viral-proteins-plms

2606.12608 2026-06-12 cs.CL cs.LG New submission

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin

Affiliations: Amazon

AI summary: Presents a benchmark for multi-turn conversational shopping reasoning with 525 missions authored by retail experts and 10863 weighted rubrics; an evaluation of 9 models shows pass rates of only 57-77%, with performance on multi-turn missions dropping 4-18 points.

English abstract

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57-77% overall. On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4-18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
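
For readers unfamiliar with importance-weighted binary rubrics, the sketch below shows one plausible way such rubrics could be rolled up into a single score for a response. The abstract does not specify the benchmark's actual aggregation or pass threshold, so the weighted-fraction formula, the function name, and the example weights are all assumptions for illustration.

```python
from typing import List, Tuple

# Each rubric is (importance weight, whether the response satisfied it).
Rubric = Tuple[float, bool]


def weighted_pass_rate(rubrics: List[Rubric]) -> float:
    """Fraction of total importance weight carried by satisfied rubrics (assumed aggregation)."""
    total = sum(weight for weight, _ in rubrics)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(weight for weight, passed in rubrics if passed) / total


# Hypothetical mission with three rubrics of differing importance.
rubrics: List[Rubric] = [(3.0, True), (2.0, False), (1.0, True)]
score = weighted_pass_rate(rubrics)
```

Here two of three rubrics pass, but because the failed one carries a weight of 2 out of 6, the weighted score is 4/6 rather than the unweighted 2/3 count would suggest by coincidence; with other weights the two figures would diverge.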
