arXivDaily: arXiv daily academic digest, updated Monday to Friday
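2606.19151 2026-06-18 cs.CY cs.CV New submission

The Market in the Model: Latent Diffusion as Neural Economy

Eryk Salvaggio

Affiliations: Cambridge Digital Humanities; University of Cambridge; Machine Visual Culture Research Group; Max Planck Institute

AI summary: Starting from the computer vision engineering problems that latent diffusion models were built to solve, this paper analyzes the model's mechanisms, argues that it operates as a neural economy that abstracts social communication into commensurable vectors, and warns that critique focused solely on copyright and commodity defenses may reinforce the fetishism the model produces.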

Abstract

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

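2606.19135 2026-06-18 cs.MA cs.AI cs.NI New submission

A Technical Taxonomy of LLM Agent Communication Protocols

Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Affiliations: Technische Universität München

AI summary: To address the fragmentation of LLM agent communication protocols, the paper proposes a five-dimension technical taxonomy, analyzes nine open-source protocols, identifies recurring architectural patterns, and projects how protocols are likely to evolve.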

Abstract

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

2606.19129 2026-06-18 cs.CR cs.LG New submission

Giskard: Byzantine Robust and Confidential Aggregation for Large-Scale Decentralized Learning

Ousmane Touat, César Sabater, Mohamed Maouche, Sonia Ben Mokhtar

Affiliations: INSA Lyon, LIRIS, CNRS; INRIA, INSA Lyon

AI summary: To guarantee confidentiality and resist Byzantine behavior at the same time in decentralized learning, the paper proposes the Giskard protocol, which computes an approximate median aggregate using a tree of committees and BGW-style MPC, reducing communication complexity while preserving model utility with up to one million participants.

Comments 17 pages, with appendix

Abstract

Dealing simultaneously with confidentiality and Byzantine behaviors in decentralized learning is a challenging problem. Indeed, in decentralized learning, clients train a machine learning model while keeping their data locally and share their model parameters or gradients with a set of neighbors. While enforcing confidentiality calls for hiding the exchanged model parameters/gradients (e.g., by using cryptographic techniques), dealing with Byzantine contributions often requires inspecting the latter. Hence, most research works address these objectives separately. A recent line of work proposes to employ secure multi-party computation (MPC) to implement robust aggregators against model poisoning, thereby enforcing both confidentiality and Byzantine resilience. However, these solutions scale badly: they either require all-to-all communication between participants or delegate the entire computation to a small subset, whose computational and communication load grows proportionally with the size of the network. In this paper, we present Giskard, a protocol for confidential and Byzantine-robust decentralized aggregation. Giskard organizes $n$ parties into a tree of committees of size $O(\log n)$ and evaluates a coordinate-wise approximate median via a committee-adapted distributed binary search over the value domain, using BGW-style MPC within each committee. We assess Giskard both theoretically by proving its security and confidentiality properties and experimentally through extensive experiments involving up to one million participants. Compared to its closest competitors, Giskard reduces per-party communication complexity asymptotically while exhibiting comparable model utility under up to $n/4$ Byzantine parties.
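The "binary search over the value domain" idea can be illustrated without any cryptography. The sketch below is a plaintext analogue only, not the Giskard protocol: it assumes all values lie in a known interval `[lo, hi]`, and the names `approx_median`, `coordinate_wise_median`, and `tol` are hypothetical. The point it shows is that the search touches the inputs only through a count of how many values are at most a threshold, which is the kind of query one would expect to be evaluated under MPC inside a committee.

```python
from typing import List, Sequence


def approx_median(values: Sequence[float], lo: float, hi: float,
                  tol: float = 1e-6) -> float:
    """Approximate the lower median by bisecting the value domain [lo, hi].

    Assumption: every value lies in [lo, hi]. The inputs are only accessed
    through the counting query "how many values are <= mid".
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    target = (n + 1) // 2  # 1-based rank of the lower median
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        count = sum(1 for v in values if v <= mid)  # the only data access
        if count >= target:
            hi = mid  # median is at or below mid
        else:
            lo = mid  # median is above mid
    return hi


def coordinate_wise_median(vectors: Sequence[Sequence[float]], lo: float,
                           hi: float, tol: float = 1e-6) -> List[float]:
    """Apply the scalar search independently to each coordinate."""
    dim = len(vectors[0])
    return [approx_median([v[j] for v in vectors], lo, hi, tol)
            for j in range(dim)]
```

The number of counting rounds is about `log2((hi - lo) / tol)` and does not depend on the number of parties, and a single extreme value cannot move the result far, which is why a median is a common choice for robust aggregation.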

2606.19121 2026-06-18 cs.SE cs.CL cs.HC New submission

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

Hui Zhang, Shuren Song

Affiliations: Shenzhen Yunxi Technology Co., Ltd.; Information Technology Center, Tsinghua University

AI summary: Through action research on a real software project, the paper finds that adding formal constraints in long-horizon LLM collaboration can instead cause "Index Sickness", and proposes a "Baseline-Log Physical Separation" mechanism that eliminated the problem in the reported project.

Comments 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

Abstract

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

2606.19069 2026-06-18 eess.SY cs.LG cs.SY New submission

Model-Free Reinforcement Learning Control for Resilient Cyber-Physical Systems

Hugo O. Garcés, Alejandro J. Rojas, Bernardo A. Hernández, Andrés Escalona, Jonathan M. Palma, Md. Rezwan Parvez, Bhushan Gopaluni, Sirish L. Shah

Affiliations: Departamento de Ingeniería Eléctrica, Universidad de Concepción, Concepción, Chile; Department of Electrical & Computer Engineering, University of Alberta, Edmonton, T6G 1H9, Alberta, Canada; Department of Chemical and Biological Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada; Department of Chemical & Materials Engineering, University of Alberta, Edmonton, T6G 1H9, Alberta, Canada

AI summary: The paper compares model-free controllers on a nonlinear system under cyberattacks (false data injection and denial of service) and analyzes four RL reward types: the Lyapunov reward gives the best resilience with low tracking error, the exponential reward offers a good trade-off under moderate training conditions, and the progressive and linear rewards converge quickly but are less robust.

Comments Accepted to the 23rd IFAC World Congress 2026

Abstract

This paper compares the performance of model-free controllers on a nonlinear system under cyberattacks, including false data injection and denial-of-service attacks. Four RL reward types are analyzed for accuracy, cost, and resilience. Results show that the Lyapunov reward offers the best resilience with low tracking error. Exponential mode also provides good trade-offs with acceptable resilience under moderate training conditions. Progressive and linear rewards converge faster but are less robust. RL-MPCs show strong steady-state resilience but require longer training times; RL-PID controllers are faster with significantly less training time. Proximal Policy Optimization outperforms Deep Deterministic Policy Gradient with a significant reduction in KPI variance. This study serves to highlight how well-designed RL rewards can improve performance and resilience against cyber threats.

2606.19042 2026-06-18 cs.SE cs.AI New submission

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

Xhevahire Tërnava

Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

AI summary: The paper studies the absence of variability in AI-driven programming (vibe coding) and proposes Variability by Regeneration (VbR), in which an LLM acts as the derivation engine and generates a dead-code-free binary for each variant.

Comments VARIABILITY 2026

Abstract

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., at compile time and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built, dead-code-free binary for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.
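The role of the variant dispatcher can be sketched as a lookup from a feature selection to a pre-generated binary. This is an illustration under assumptions, not the paper's implementation: the class name `VariantDispatcher`, the feature names, and the `build/...` paths are all hypothetical, and the generation step itself (the LLM producing each binary) is outside the sketch.

```python
from typing import Dict, FrozenSet, Iterable


class VariantDispatcher:
    """Maps a feature selection to the binary generated for that variant."""

    def __init__(self) -> None:
        self._binaries: Dict[FrozenSet[str], str] = {}

    def register(self, features: Iterable[str], binary_path: str) -> None:
        # A frozenset makes the lookup independent of feature order.
        self._binaries[frozenset(features)] = binary_path

    def route(self, requested: Iterable[str]) -> str:
        key = frozenset(requested)
        try:
            return self._binaries[key]
        except KeyError:
            # No variant was generated for this selection; a real system
            # might trigger regeneration here instead of failing.
            raise LookupError(
                f"no binary generated for variant {sorted(key)}") from None


# Hypothetical wc-like product family with one binary per variant.
dispatcher = VariantDispatcher()
dispatcher.register(["lines"], "build/wc_lines")
dispatcher.register(["lines", "words"], "build/wc_lines_words")
```

Because each binary contains only the code for its own variant, the dispatcher is the single place where the variability remains visible at run time.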

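2606.19039 2026-06-18 cs.NE cs.LG cs.SD New submission

Adaptive Speech-to-Spike Encoding for Spiking Neural Networks

Taharim Rahman Anon, Jakaria Islam Emon

Affiliations: PI LLC

AI summary: The paper proposes a learnable residual speech-to-spike encoder trained jointly with an R-LIF backbone; it reaches 94.97% accuracy on GSC-v2, is parameter-efficient, and learns task-aligned spike representations.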

Comments Accepted at Interspeech 2026. This version is a preprint

Abstract

The mismatch between continuous acoustic signals and discrete event-driven processing remains a fundamental bottleneck for neuromorphic speech processing. Current systems typically rely on fixed spike encoders, forcing downstream Spiking Neural Networks (SNNs) to compensate for non-adaptive input representations. To address this, we present a learnable residual speech-to-spike encoder jointly trained end-to-end with a Recurrent Leaky Integrate-and-Fire (R-LIF) backbone. We validate this approach on the Google Speech Commands v2 (GSC-v2) benchmark, achieving up to 94.97% accuracy. Notably, the learned encoder remains highly parameter-efficient with a compact 35k-parameter variant that reaches 89.8%, matching or exceeding prior baselines that require an order of magnitude more parameters. Our encoder-focused analysis, including linear probing and gradient-residual inspection, indicates that the encoder does not target faithful signal reconstruction but instead learns task-aligned spike representations that enhance class separability. Finally, we benchmark bio-inspired, hardware-friendly credit assignment by comparing Direct Feedback Alignment (DFA) with surrogate-gradient BPTT under identical architectures and training conditions. We find that DFA reaches 91.5% accuracy, quantifying the performance trade-off of bio-inspired learning rules for modern neuromorphic audio.
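For readers unfamiliar with spike encoding, the following sketch shows what a fixed (non-learned) encoder of the kind the abstract contrasts against might look like: a single leaky integrate-and-fire neuron turning a continuous signal into a binary spike train. It is not the paper's learnable residual encoder. The function name `lif_encode`, the leak factor `beta`, the `threshold`, and the soft-reset rule are assumptions chosen for illustration.

```python
from typing import List, Sequence


def lif_encode(signal: Sequence[float], beta: float = 0.9,
               threshold: float = 1.0) -> List[int]:
    """Encode a continuous signal as spikes with one LIF neuron.

    beta is the per-step leak factor (1.0 means no leak); on a spike the
    threshold is subtracted from the membrane potential (soft reset).
    """
    membrane = 0.0
    spikes: List[int] = []
    for x in signal:
        membrane = beta * membrane + x  # leaky integration of the input
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold       # soft reset keeps the residual
        else:
            spikes.append(0)
    return spikes
```

Nothing in this encoder adapts to the task: `beta` and `threshold` are fixed in advance, so the downstream network has to cope with whatever spike pattern they produce.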

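2606.19023 2026-06-18 cs.CR cs.LG New submission

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

Gabriele Digregorio, Marco Di Gennaro, Francesco Pastore, Stefano Zanero, Stefano Longari, Michele Carminati

Affiliations: Politecnico di Milano

AI summary: The paper proposes Moat, a dynamic lifecycle-aware approach that detects malicious behavior by monitoring the structured interactions between each phase of model execution and the host system, detecting all evaluated attack classes with a close-to-zero false-positive rate across multiple frameworks.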

Abstract

The growing reliance on pre-trained Machine Learning (ML) models has introduced new attack surfaces. Recent vulnerabilities demonstrate that malicious behavior can be embedded within model artifacts, often bypassing existing defenses. Current model-scanning solutions primarily rely on static, format-specific rules or known attack signatures, which limit their ability to generalize across frameworks and to detect novel exploitation paths. In contrast, we propose a solution that focuses on the effects an attack has on the host system executing the model and builds on foundational intuitions about ML model execution. In particular, we observe that ML models operate within well-defined lifecycle phases and that, within each phase, interactions with the host system are highly structured and predictable. We translate these intuitions into Moat, a dynamic lifecycle-aware approach for securing ML model execution, and instantiate this design in Re-Moat, our reference implementation. We evaluate Re-Moat across multiple ML frameworks using 77,974 real-world model artifacts from the Hugging Face Hub, 31 Proofs-of-Concept (PoCs) from CVEs, and 334 models from a state-of-the-art dataset, and compare it against state-of-the-art model-scanning solutions. Our results show that our approach detects all evaluated attack classes while maintaining a close-to-zero false-positive rate, validating our intuitions and motivating dynamic analysis for securing ML model execution.

2606.19004 2026-06-18 cs.DC cs.AI cs.LG New submission

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

Affiliations: NTU Singapore; Hong Kong University of Science and Technology; Alibaba Group

AI summary: To cut the high cost of DiT RL post-training, the paper proposes Spotlight, a system that trains efficiently on spot GPUs by exploiting exploration's tolerance of stale weights and fast reconfiguration of SP groups, reaching the target score 4x faster and lowering cost by 1.4x to 6.4x.

Abstract

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1) we show that exploration can tolerate stale model weights because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training. (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4\times$ to $6.4\times$ while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution $512\times512$ and $1280\times1280$.

2606.18976 2026-06-18 cs.SE cs.AI New submission

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario

Affiliations: Department of Information Engineering, University of Florence, Florence, Italy

AI summary: The paper proposes CAPRA, a multi-agent LLM system that automatically generates personalized LaTeX feedback on software architecture deliverables using multi-modal document extraction, deterministic evidence anchoring, and consistency management; on 10 student reports it satisfied 88.8% of the evaluation criteria.

Comments Accepted for publication at the 38th International Conference on Software Engineering Education and Training

Abstract

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.
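The evidence-anchoring step, fuzzy matching via normalized Levenshtein distance, can be sketched as follows. This is one plausible reading rather than CAPRA's actual code: the function names, the sliding-window search, the whitespace and case normalization, and the `0.85` threshold are all assumptions. The idea is that a quote an LLM attributes to a report is accepted only if a sufficiently similar span really occurs in the extracted text.

```python
from typing import Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def anchor_evidence(quote: str, source: str,
                    threshold: float = 0.85) -> Optional[Tuple[int, float]]:
    """Return (start, score) of the best-matching window, or None.

    start indexes into the normalized source (lowercased, whitespace
    collapsed), not the raw text.
    """
    q = " ".join(quote.lower().split())
    s = " ".join(source.lower().split())
    if not q or len(s) < len(q):
        return None
    best_start, best_score = -1, 0.0
    for start in range(len(s) - len(q) + 1):
        score = similarity(q, s[start:start + len(q)])
        if score > best_score:
            best_start, best_score = start, score
    return (best_start, best_score) if best_score >= threshold else None
```

The check is deterministic: the same quote and source always give the same verdict, so a finding whose quoted evidence cannot be anchored can be dropped without consulting a model again.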

2606.18897 2026-06-18 cs.IR cs.AI New submission

SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

Jiangnan Xia, Xuansheng Wu, Yu Yang, Xin Wang, Ninghao Liu

Affiliations: University of Georgia; Shanghai AI Laboratory; The Education University of Hong Kong; Jilin University; The Hong Kong Polytechnic University

AI summary: The paper proposes SAERec, which uses a sparse autoencoder to disentangle fine-grained, interpretable intents from LLM text embeddings, uses them as priors to guide recommendation, and fuses personal and public intents through a multi-branch attention mechanism, improving both recommendation performance and interpretability.

Abstract

Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high-information-density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. These priors comprise personal intents matching a user's current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, consistently outperforming state-of-the-art baselines while providing human-understandable explanations.
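How a sparse autoencoder turns a dense text embedding into a handful of candidate "intents" can be sketched with the encoder half alone. This is a generic illustration, not SAERec's model: the top-k sparsity rule, the function name `sae_encode`, and the toy weights are assumptions, and a trained SAE would learn `enc_weights` and `enc_bias` with a reconstruction objective.

```python
from typing import List, Sequence, Tuple


def sae_encode(embedding: Sequence[float],
               enc_weights: Sequence[Sequence[float]],
               enc_bias: Sequence[float],
               k: int) -> List[Tuple[int, float]]:
    """Return up to k (feature_index, activation) pairs, strongest first.

    Each row of enc_weights is one latent feature (one candidate intent).
    Activations are ReLU(w . x + b); features with zero activation are
    dropped, so the output is sparse by construction.
    """
    activations: List[Tuple[int, float]] = []
    for index, (row, bias) in enumerate(zip(enc_weights, enc_bias)):
        pre = sum(w * x for w, x in zip(row, embedding)) + bias
        act = max(0.0, pre)  # ReLU
        if act > 0.0:
            activations.append((index, act))
    activations.sort(key=lambda pair: pair[1], reverse=True)
    return activations[:k]


# Toy example: 2-dimensional embedding, 4 latent features (hypothetical).
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
toy_bias = [0.0, 0.0, 0.0, 0.0]
active = sae_encode([1.0, 0.0], toy_weights, toy_bias, k=2)
```

The indices of the active features are what a retrieval step could then treat as intent identifiers, with the activations serving as relevance weights.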

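2606.18837 2026-06-18 cs.MA cs.AI cs.LG New submission

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Hehai Lin, Qi Yang, Chengwei Qin

Affiliations: Ant Group; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: The paper proposes Skill-MAS, which decouples high-level orchestration capability into an evolvable Meta-Skill to retain experience without parameter updates, refines the Meta-Skill through multi-trajectory rollout and selective reflection, and achieves significant gains across several benchmarks and LLMs at a controlled cost.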

Abstract

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

2606.18836 2026-06-18 cs.HC cs.AI New submission

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

Taewoon Kim, Emma van Zoelen, Mark Neerincx

Affiliations: HumemAI, The Netherlands; Vrije Universiteit Amsterdam, The Netherlands; TNO, The Netherlands

AI summary: The paper stores historical collaboration patterns as knowledge-graph episodic memories, selects a representative memory through graph representation learning to initialize the robot, and in the MATRX USAR environment raises rescue success from 25.7% to 41.3% while cutting average task time by 283 seconds.

Abstract

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

2606.18816 2026-06-18 cs.HC cs.AI cs.ET New submission

SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface

Gourav Siddhad, Yogesh Kumar Meena

Affiliations: Human-AI Interaction (HAIx) Lab, Indian Institute of Technology Gandhinagar

AI summary: The paper proposes SwitchBraidNet, a compact EEG classification architecture with a dual-path temporal braid, an adaptive squeeze-and-excitation spatial switch, and a log-variance readout layer; with quantisation-aware training on the OpenBMI dataset it decodes hybrid BCI signals accurately at low power, and its INT8 model occupies only 3.03 KB.

Comments 6 pages, 5 figures, Preprint accepted at IEEE SMC 2026

详情
AI中文摘要

混合脑机接口(BCI)结合运动想象(MI)和稳态视觉诱发电位(SSVEP),提供高维神经解码,但通常超出嵌入式硬件的计算限制。为解决此问题,我们提出SwitchBraidNet,一种专为低功耗部署设计的紧凑型EEG分类架构。该模型采用双路径时间辫提取多尺度振荡特征,自适应挤压激励空间开关进行电极门控,以及对数方差读出层直接编码频带功率。此外,通过在OpenBMI数据集上进行系统量化感知训练,我们将SwitchBraidNet与四种基线方法在FP32、FP16和INT8精度下进行比较。实验结果表明其优越的效率和性能,在FP16下MI准确率达到69.49%,FP32下SSVEP准确率达到93.48%,FP16下混合信息传输率为64.82 bits/min。INT8模型仅占用3.03 KB,SwitchBraidNet在不同数值精度下保持高准确率,证明了其适用于低功耗嵌入式BCI部署。

英文摘要

Hybrid brain-computer interfaces (BCIs) that integrate motor imagery (MI) and steady-state visual evoked potentials (SSVEP) provide high-dimensional neural decoding but typically exceed the computational limits of embedded hardware. To address this, we propose SwitchBraidNet, a compact EEG classification architecture designed for low-power deployment. The model employs a dual-path temporal braid to extract multiscale oscillatory features, an adaptive squeeze-and-excitation spatial switch for electrode gating, and a log-variance readout layer for direct band-power encoding. Furthermore, through systematic quantisation-aware training on the OpenBMI dataset, we compared SwitchBraidNet against four established baselines across FP32, FP16, and INT8 precisions. Experimental results demonstrate superior efficiency and performance, achieving MI accuracy of 69.49% (FP16), SSVEP accuracy of 93.48% (FP32), and a hybrid information transfer rate of 64.82 bits/min (FP16). With an INT8 footprint of only 3.03 KB, SwitchBraidNet maintains high accuracy across varying numerical precisions, demonstrating its suitability for low-power embedded BCI deployment.
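The log-variance readout mentioned in the abstract can be illustrated with a minimal sketch. It assumes the input is already band-limited and roughly zero-mean, so that the variance over time equals band power. The function name, the `eps` stabiliser, and the toy signals are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_variance_readout(x, eps=1e-6):
    """Map a (channels, time) band-limited signal to (channels,) log-power features.

    Hypothetical helper: variance over time approximates band power for a
    zero-mean signal, and the log compresses its dynamic range.
    """
    return np.log(np.var(x, axis=-1) + eps)

# Toy data (assumed): two 10 Hz sinusoids sampled at 250 Hz for one second.
fs = 250
t = np.arange(fs) / fs
weak = 1.0 * np.sin(2 * np.pi * 10 * t)    # amplitude 1 -> power 0.5
strong = 3.0 * np.sin(2 * np.pi * 10 * t)  # amplitude 3 -> power 4.5
features = log_variance_readout(np.stack([weak, strong]))
```

A ninefold difference in power becomes an additive offset of log 9 between the two features, which is one reason a log-power encoding is convenient for a linear classifier.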

2606.18811 2026-06-18 cs.IR cs.AI 新提交

Rescaling MLM-Head for Neural Sparse Retrieval

重新缩放MLM头部用于神经稀疏检索

Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim

发表机构 * Korea University(韩国大学)

AI总结 针对SPLADE中MLM头部尺度不匹配导致训练不稳定和性能下降的问题,提出初始化时对MLM头部投影进行常数因子重缩放,零成本提升训练稳定性,使大范数骨干网络成为有竞争力的稀疏检索器。

详情
AI中文摘要

学习型稀疏检索(LSR)模型(如SPLADE)传统上使用BERT风格的掩码语言模型作为骨干编码器。一个自然的期望是,用更强的预训练编码器替换BERT应能提高检索效果。然而,我们发现,在标准的SPLADE训练方案下,具有大MLM头部L2范数的骨干网络可能会遭受性能下降,甚至出现训练崩溃。我们将此失败归因于MLM头部中的尺度不匹配:SPLADE直接使用MLM头部输出来构建稀疏词汇表示,查询-文档相关性通过这些表示上的未归一化点积计算。因此,膨胀的MLM头部尺度会放大稀疏激活,扭曲匹配分数,并在常见训练设置下破坏对比训练的稳定性。为了解决这个问题,我们引入了一个简单的初始化时修正,在SPLADE训练之前通过一个常数因子重新缩放MLM头部投影。这种零成本调整提高了训练稳定性,而无需修改模型架构或训练目标。在领域内和跨领域检索基准测试中,这种简单的修正显著改善了诸如ModernBERT和Ettin等大范数骨干网络,将不稳定的训练运行转变为有竞争力的稀疏检索器。在多个设置中,修正后的模型进一步匹配或超越了经典的BERT-SPLADE基线。这些发现表明,将预训练编码器适应于LSR的瓶颈不仅仅是编码器容量,而是用于构建稀疏词汇表示的MLM头部尺度的校准。

英文摘要

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.
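The effect of head scale on unnormalized dot-product scores can be sketched as follows. The abstract only says the projection is rescaled "by a constant factor"; the rule used here (match the mean row L2 norm to a target) is an assumption for illustration, as are the helper names and the toy dimensions. The activation is the familiar SPLADE-style log(1 + ReLU) with max pooling over tokens.

```python
import numpy as np

def splade_rep(logits):
    """SPLADE-style sparse representation: log(1 + ReLU(logit)), max-pooled over tokens."""
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

def rescale_head(weight, target_norm):
    """Hypothetical rule: scale the head so its mean row L2 norm equals target_norm."""
    factor = target_norm / np.linalg.norm(weight, axis=1).mean()
    return weight * factor, factor

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))         # 4 tokens, hidden size 8 (toy values)
head = rng.normal(size=(16, 8)) * 10.0   # vocabulary of 16, deliberately inflated scale

scaled_head, factor = rescale_head(head, target_norm=1.0)
rep_before = splade_rep(hidden @ head.T)
rep_after = splade_rep(hidden @ scaled_head.T)

# Unnormalized self-match score before and after the one-off rescaling.
score_before = float(rep_before @ rep_before)
score_after = float(rep_after @ rep_after)
```

Because the rescaling is applied once to the initial weights, it changes neither the architecture nor the loss, which is consistent with the "zero-cost" framing in the abstract.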

2606.18807 2026-06-18 cs.DS cs.LG 新提交

Learning Augmented Exact Exponential Algorithms

学习增强的精确指数时间算法

Tatiana Belova, Yuriy Dementiev, Danil Sagunov

发表机构 * ITMO University(ITMO大学)

AI总结 提出一种通用方法,利用略优于随机猜测的噪声预测器,可证明地减少NP难子集选择问题的搜索空间,运行时间加速随预测质量平滑扩展,且仅需预测的成对独立性或无需知道预测器精度。

详情
AI中文摘要

学习增强算法领域已经证明,机器学习预测可以在广泛的问题中绕过最坏情况下的下界。然而,到目前为止,关注点几乎完全集中在多项式时间算法上,其中预测改进了竞争比、近似保证或运行时间。在本文中,我们提出了一个问题:预测能否推动NP难问题的精确指数时间算法的前沿?我们通过提出一种通用方法对此问题给出肯定回答,该方法增强了一整类用于各种子集选择问题的最先进精确算法。我们表明,一个仅略优于随机猜测的噪声预测器足以可证明地减少搜索空间,并且由此产生的运行时间加速随预测质量平滑扩展。重要的是,我们的算法仅需要预测的成对独立性,或者,不需要知道预测器的精度——这两种设置都比通常假设的更弱且更现实。

英文摘要

The field of learning-augmented algorithms has demonstrated that machine-learned predictions can bypass worst-case lower bounds across a wide range of problems. So far, however, the focus has been almost exclusively on polynomial-time algorithms, where predictions improve competitive ratios, approximation guarantees, or running times. In this paper, we raise the question of whether predictions can push the frontier of exact exponential-time algorithms for NP-hard problems. We answer this question affirmatively by proposing a general approach that augments an entire family of state-of-the-art exact algorithms for a variety of subset selection problems. We show that a noisy predictor that is only marginally better than random guessing suffices to provably reduce the search space, and that the resulting runtime speedup scales smoothly with the prediction quality. Importantly, our algorithms require only pairwise independence of predictions or, alternatively, do not require the knowledge of the predictor's accuracy - both strictly weaker and more realistic settings than typically assumed.

2606.18801 2026-06-18 cs.IR cs.AI 新提交

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

SHIFT: 通过索引侧特征变换实现多语言信息检索的语义对齐

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University(韩国大学计算机科学与工程系)

AI总结 提出SHIFT方法,在索引阶段通过平行翻译对估计相对语言向量并修正文档嵌入,以缓解多语言密集检索中的语言偏差,无需训练即可提升检索性能。

详情
AI中文摘要

随着大规模多语言语料库的迅速扩展,多语言信息检索(MLIR)已成为全球信息访问的关键技术。MLIR使用户能够使用单语言查询从多语言文本集合中检索语义相关的文档。然而,最近的多语言密集检索模型通常表现出对与查询相同语言的文档的强烈偏好。这导致了严重的语言偏差,即排名靠前的结果被特定语言的文档主导,即使其他语言的文档包含更多语义相关信息。为了解决这个问题,我们提出了SHIFT,一种在索引阶段适用的无需训练的方法。具体来说,SHIFT利用平行翻译对来估计每个目标语言相对于源语言的相对语言向量。随后,SHIFT通过在索引期间从文档嵌入中减去该相对语言向量来纠正语言特定的偏移。我们在四个MLIR基准测试和多种密集检索模型上的全面评估证实,SHIFT可以有效缓解语言偏差并提升MLIR性能。

英文摘要

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.
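The index-side correction can be sketched in a few lines. The abstract states that a relative language vector is estimated from parallel translation pairs and subtracted from document embeddings; the specific estimator below (mean embedding difference over the pairs) and the function names are assumptions made for illustration.

```python
import numpy as np

def relative_language_vector(src_emb, tgt_emb):
    """Assumed estimator: mean difference between target- and source-language
    embeddings of the same parallel sentences, both of shape (pairs, dim)."""
    return (tgt_emb - src_emb).mean(axis=0)

def shift_documents(doc_emb, lang_vec):
    """Subtract the language offset from document embeddings at indexing time."""
    return doc_emb - lang_vec

# Toy setup (assumed): target-language embeddings equal the source-language
# ones plus a constant language-specific offset.
rng = np.random.default_rng(0)
meaning = rng.normal(size=(5, 4))
offset = np.array([2.0, 0.0, -1.0, 0.5])
src = meaning
tgt = meaning + offset

lang_vec = relative_language_vector(src, tgt)
corrected = shift_documents(tgt, lang_vec)
```

In this idealised case the offset is recovered exactly. With real encoders the offset is only approximately constant, so the correction would reduce rather than eliminate the language-specific component.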

2606.18733 2026-06-18 cs.SE cs.AI 新提交

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

SWE-Future: 面向未来软件工程智能体的预测条件数据合成

Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

发表机构 * Baidu Inc(百度公司)

AI总结 提出SWE-Future方法,利用仓库历史证据预测未来任务类型(如功能实现、缺陷修复),并基于预测条件合成200个编码智能体任务,减少对历史PR回放的依赖,在80个仓库中达到58.1%的未来工作相关性。

详情
AI中文摘要

真实的编码智能体基准测试通常回放公开的GitHub问题和拉取请求,这使得它们容易与模型预训练、微调、合成数据生成或基准驱动的模型选择产生重叠。完全合成的任务避免了直接的历史回放,但可能偏离真实的仓库需求。我们提出了SWE-Future,一种面向未来编码任务的预测条件数据合成方法。给定时间$T_0$的预测快照,该方法仅使用$T_0$之前的仓库证据来预测未来的功能实现/增强、缺陷修复和重构任务族。我们首先回顾性地验证了这一预测步骤:在预测固定后,后续的拉取请求仅用于衡量预测的任务族是否与未来的仓库工作匹配。在一项80个仓库的研究中,预测器在主要语义匹配指标下达到了58.1%的未来工作相关性。然后,我们使用经过验证的预测族作为条件信号,从任务生成快照中跨61个仓库合成了一个包含200个任务的编码智能体数据集,而不是回放用于验证的后续拉取请求。SWE-Future表明,仓库演化预测可以指导现实的、面向未来的编码任务合成,同时减少对历史拉取请求回放的直接依赖。

英文摘要

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.

2606.18668 2026-06-18 cs.MA cs.CL 新提交

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

EARS:大规模多智能体系统中可靠子智能体建模的解释性弃权

Shuang Xie, Yunan Lu, Han Li, Lingyun Wang

发表机构 * Shopify Columbia University(哥伦比亚大学)

AI总结 针对大规模多智能体系统中子智能体过度回答导致幻觉的问题,提出EARS框架,通过将弃权重构为智能体间通信协议,利用校准的LLM裁判模型生成结构化弃权标签和理由,微调子智能体以检测故障并返回理由,在电商助手系统中将响应通过率从68.5%提升至78.9%。

详情
AI中文摘要

在大规模企业环境中,集中式多智能体系统(MAS)日益被采用,其中协调器将用户请求委托给轻量级、领域专业化的子智能体。虽然这种架构提高了模块化、可扩展性和成本效率,但其可靠性不仅取决于准确的路由,还取决于子智能体根据能力约束校准其响应的能力。特别是,基于较小微调模型的子智能体通常难以进行这种校准,导致它们过度回答模糊、未明确说明、路由错误或不支持的请求,并产生幻觉输出,而不是可操作的反馈。为了应对这一挑战,我们提出了EARS(用于可靠子智能体建模的解释性弃权),这是一个面向生产的框架,将子智能体弃权重新定义为智能体间通信协议:子智能体不仅弃权,而且向协调器暴露可操作的故障状态。EARS使用一组校准的LLM裁判模型来策划人机交互数据,在子智能体故障模式的分类法下生成结构化的弃权标签和理由。这些数据用于微调子智能体,使其能够检测故障条件并返回理由,以便协调器进行澄清、重新路由或回退。我们在一个支持企业商业智能工作流程的大规模生产电商助手中评估了EARS。EARS将整体响应通过率从68.5%提高到78.9%,证明了子智能体侧的解释性弃权提高了MAS的可靠性。

英文摘要

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents' ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.

2606.18619 2026-06-18 cs.CR cs.AI cs.SE 新提交

Code-Augur: Agentic Vulnerability Detection via Specification Inference

Code-Augur:通过规约推断的智能体漏洞检测

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出安全规约优先范式,通过显式化智能体假设并运行时反证,结合引导式模糊测试提升漏洞检测能力,在真实项目中比现有智能体检测更多漏洞。

详情
AI中文摘要

智能体漏洞检测的出现已成为软件安全的分水岭。完全由自主LLM智能体进行的审计正在发现数字社会基础软件中的关键漏洞。许多漏洞多年来一直隐藏,直到现在才被AI智能体发现。然而,这些发现背后的推理仍然令人担忧地不透明且未经验证。当智能体认为某个函数安全时,它对函数输入做了哪些假设?推理失败和错误假设可能导致遗漏漏洞,并降低对智能体分析的信任。我们提出了一种安全规约优先范式,该范式(1)将智能体的隐性假设明确暴露为安全规约,并(2)通过运行时反证持续细化这些规约。我们在Code-Augur中实现了我们的方法,这是一种用于智能体漏洞检测的新型框架。给定一个代码库,Code-Augur分析系统的每个组件以查找漏洞代码。当它认为某个组件安全时,它会将该判断背后的局部不变量作为源代码中的断言提交。同时,Code-Augur利用引导式模糊测试器尝试反证这些假设。当模糊测试器触发断言时,要么揭示一个真实漏洞,要么揭示一个需要细化的有缺陷规约。在这两种情况下,这一过程都夯实了智能体的理解,使其对代码意图的看法与代码实际行为保持一致。在真实世界的测试对象上,Code-Augur有效利用安全规约检测到比其他最先进智能体更多的漏洞。此外,Code-Augur在关键开源项目中发现了22个新漏洞。与精心策划的专用模型(如Claude Mythos)相比,Code-Augur提供了基于广泛可用的LLM(如Sonnet和DeepSeek)构建的有效智能体漏洞检测。

英文摘要

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.

2606.18617 2026-06-18 cs.CY cs.AI 新提交

AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

AI驱动的人类导师评估:将培训表现与实际教学实践联系起来

Danielle R. Thomas, Marie Cynthia Abijuru Kamikazi, Clara Brandt, Conrad Borchers, Kenneth R. Koedinger

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Vanderbilt University(范德比大学)

AI总结 提出一种AI系统,利用生成式AI分析真实辅导转录,评估导师技能迁移,发现培训表现显著预测实际教学得分(效应量0.25 SD),并贡献开放数据集和评分标准。

Comments Full research paper accepted at EC-TEL 2026

详情
AI中文摘要

存在大量的导师培训平台。然而,很少有平台基于实际表现提供AI驱动的人类导师培训和评估。我们提出一个AI驱动系统,评估培训中的开放式回答和真实的实际辅导。与仅通过在线培训或模拟评估学习的平台不同,我们的系统利用生成式AI(Gemini-2.5-pro)分析真实辅导的转录,衡量导师技能向实际应用的迁移。远程辅导学生数学的人类导师(N=86)完成了六个基于场景的课程,平均显著学习增益为7.4%。使用跨405个会话-课程对的混合效应模型,我们发现培训表现显著预测实际辅导转录得分,效应量为0.25 SD。模型比较(AIC/BIC)表明,培训期间开放式回答和多项选择表现的平均值最能预测实际辅导表现,尽管开放式回答相对更具预测性。探索性分析显示,培训后,导师遇到应用技能的教学机会的可能性显著增加(从61.1%到68.9%),并且在这些机会中表现出更高的执行质量(从65.5%到68.1%)。中断时间序列分析表明,这些导师改进是随时间逐渐趋势的一部分,而非培训的即时干预效果。我们展示了一种将导师培训与实际评估联系起来的AI驱动方法。为此,我们贡献了开放数据集、AI提示和评分标准,以支持透明度和可重复性。

英文摘要

There exist numerous tutor training platforms. However, few provide AI-driven training and evaluation for human tutors based on real-life performance. We present an AI-driven system that assesses both open responses during training and authentic real-life tutoring. Unlike platforms that only assess learning through online training or simulations, our system utilizes Generative AI (Gemini-2.5-pro) to analyze transcriptions of authentic tutoring, measuring the transfer of tutor skills to real-life application. Human tutors instructing students remotely in math (N=86) completed six scenario-based lessons, averaging a significant 7.4% learning gain. Using mixed-effects models across 405 session-to-lesson pairs, we found that training performance significantly predicted real-life transcript scores with an effect size of 0.25 SD. Model comparison (AIC/BIC) indicated averaging open response and multiple choice performance during training predicted real-life tutor performance best, although open responses were comparatively more predictive. Exploratory analysis showed that after training, tutors were significantly more likely to encounter pedagogical opportunities to apply their skills (61.1% to 68.9%) and demonstrated higher execution quality within those opportunities (65.5% to 68.1%). Interrupted time series analysis suggested that these tutor improvements were part of a gradual trend over time rather than an immediate intervention effect of training. We illustrate an AI-driven method to link tutor training with real-life assessment. In doing so, we contribute open datasets, AI prompts, and scoring rubrics to support transparency and reproducibility.

2606.18599 2026-06-18 cs.CR cs.AI 新提交

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

MIDS:通过双向Mamba检测CAN总线上的隐蔽伪装和篡改攻击

Qiqi Liu, Runhan Song, Lei Cui, Heng Zhang, Yuyan Sun, Limin Sun

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) Zhongguancun Laboratory(中关村实验室)

AI总结 针对CAN总线缺乏加密认证易受攻击的问题,提出MIDS双流框架,利用双向状态空间模型并行处理标识符和载荷,在特斯拉Model 3数据集上F1达96.94%,优于基线8个百分点以上。

详情
AI中文摘要

控制器局域网(CAN)协议是现代车辆中电子控制单元(ECU)的主要通信标准,但其缺乏加密和认证,使其面临一系列安全威胁。现有的入侵检测系统主要针对制造型攻击(通过帧注入实现的DoS、模糊测试、ID欺骗),此类攻击中每ID到达间隔统计等检测信号易于获取。我们转而解决更困难的伪装场景,其中内部攻击者在其原始传输时隙原位替换合法帧,保持流量周期性,使基于流量统计的防御失效。我们提出Mamba入侵检测系统(MIDS),一种创新的双流框架,并行处理CAN标识符和载荷,并通过双向选择性状态空间建模重建其联合时间语义。为评估MIDS,我们从物理特斯拉Model 3在三种驾驶模式下收集了超过1亿个CAN帧,并合成了54种伪装攻击变体,涵盖仅ID、仅数据和组合修改。MIDS在该数据集上达到96.94%的F1分数,超过最强可复现基线8个百分点以上,同时保持1.147毫秒的单窗口推理延迟——为实时车载部署留有充足余量。为验证泛化能力,我们进一步在四个公开基准(ROAD、CrySyS、OTIDS、CT&T)上评估MIDS,涵盖伪装和注入场景;在统一的5折协议下,MIDS的F1分数从93.70%到99.61%,超过八个复现基线中最强者最多13.94个百分点。

英文摘要

The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting, in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency, which leaves ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.

2606.18596 2026-06-18 cs.HC cs.AI 新提交

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

更好的依从性,更丰富的上下文:基于LLM的对话式语音睡眠日记的现场评估

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

发表机构 * The Johns Hopkins University(约翰霍普金斯大学) Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine(精神病学与行为科学系,约翰霍普金斯大学医学院)

AI总结 通过现场实验评估基于LLM的对话式语音睡眠日记,发现相比文本日记,语音日记提高了依从性并收集了更详细的上下文信息,但结构化字段完整性较低。

详情
AI中文摘要

睡眠日记是行为睡眠医学和失眠认知行为疗法的核心,但每日完成难以维持,静态形式通常为解释夜间睡眠变化提供的上下文有限。我们设计了一个基于LLM的对话式语音日记,通过主动智能音箱提示、结构化对话输入和自适应后续对话,提供临床基础的早晚睡眠日记问题。我们在为期四周的受试者间现场研究中评估了该系统,涉及30名大学生,使用匹配的日记项目、报告窗口和提醒间隔,与基于文本的移动日记进行比较。与文本日记相比,对话式语音日记显示出更高的依从性,并引发了关于日常习惯、压力源、环境条件和其他睡眠相关因素的更详细上下文自我报告。参与者还描述语音日记更容易融入日常,尽管感知完成时间更长。然而,基于语音的对话输入导致某些结构化日记字段的完整性较低,揭示了表达丰富性与结构化精度之间的权衡。这些发现展示了使用基于LLM的对话式语音助手进行纵向健康自我报告的前景和挑战。

英文摘要

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

2606.18588 2026-06-18 cs.DC cs.CV 新提交

Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

Splaxel:通过像素级通信实现大规模场景重建的高效分布式3D高斯泼溅训练

Wenqi Jia, Zhewen Hu, Ying Huang, Yu Gong, Stavros Kalafatis, Yuke Wang, Wei Niu, Chengming Zhang, Ang Li, Sheng Di, Yuede Ji, Bo Fang, Miao Yin

发表机构 * Independent Researcher(独立研究者) Rice University(莱斯大学) University of Georgia(佐治亚大学) University of Houston(休斯顿大学) University of Washington(华盛顿大学) Argonne National Labs(阿贡国家实验室)

AI总结 提出Splaxel框架,通过像素级局部渲染与全局组合替代高斯同步,在保持数学一致性的同时稳定通信开销,结合可见性预测和冲突消除策略,实现大规模3DGS分布式训练加速7.6倍。

Comments 17 pages, 25 figures

详情
AI中文摘要

3D高斯泼溅(3DGS)能够实现高保真、实时的3D场景重建,但将训练扩展到大规模场景需要跨多个GPU优化数亿个高斯体。现有的分布式方法要么将场景划分为孤立区域,导致全局不一致,要么依赖全局高斯级交换,导致GPU间通信量大幅增长并迅速主导迭代时间。我们提出Splaxel,一种基于像素级局部渲染和全局组合的通信高效分布式3DGS训练框架。每个GPU渲染其局部子集并仅交换部分像素值,而非同步高斯体,从而在保持数学一致性的同时,使通信成本随场景规模增长保持稳定。Splaxel通过几何和透射率可见性预测进一步减少像素级冗余,并通过无冲突的相机视图整合提高GPU利用率。在包含多达1.2亿个高斯体的大规模数据集上评估,Splaxel相比最先进的分布式3DGS框架实现了高达7.6倍的加速,同时保持高重建质量。

英文摘要

3D Gaussian Splatting (3DGS) enables high-fidelity and real-time 3D scene reconstruction, but scaling training to large-scale scenes requires optimizing hundreds of millions of Gaussians across multiple GPUs. Existing distributed approaches either partition scenes into isolated regions, causing global inconsistency, or rely on global Gaussian-level exchanges, which lead to substantial growth in inter-GPU communication and quickly dominate iteration time. We propose Splaxel, a communication-efficient distributed 3DGS training framework based on pixel-level local rendering and global composition. Instead of synchronizing Gaussians, each GPU renders its local subset and exchanges only partial pixel values, maintaining mathematical consistency while keeping communication cost stable as the scene size increases. Splaxel further reduces pixel-level redundancy through geometric and transmittance visibility prediction and improves GPU utilization via conflict-free camera-view consolidation. Evaluated on large-scale datasets with up to 120M Gaussians, Splaxel achieves up to 7.6× speedup over the state-of-the-art distributed 3DGS framework while preserving high reconstruction quality.
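Why exchanging partial pixel values can stay mathematically consistent with a single-device render is easiest to see for one pixel. The sketch below uses standard front-to-back alpha blending and assumes the local subsets are depth-ordered along the ray (the first partition lies entirely in front of the second). How Splaxel actually partitions and orders Gaussians is not stated in the abstract, so the function names and this ordering are illustrative assumptions.

```python
import numpy as np

def render(colors, alphas):
    """Front-to-back alpha blending for one pixel.

    Returns the accumulated colour and the remaining transmittance.
    """
    color, trans = np.zeros(3), 1.0
    for c, a in zip(colors, alphas):
        color = color + trans * a * c
        trans *= (1.0 - a)
    return color, trans

def compose(partials):
    """Combine per-device (colour, transmittance) pairs, nearest partition first."""
    color, trans = np.zeros(3), 1.0
    for c, t in partials:
        color = color + trans * c
        trans *= t
    return color, trans

rng = np.random.default_rng(0)
colors = rng.uniform(size=(6, 3))            # 6 depth-sorted Gaussians on one ray (toy data)
alphas = rng.uniform(0.1, 0.6, size=6)

full_color, full_trans = render(colors, alphas)
partials = [render(colors[:3], alphas[:3]),  # "GPU 0": nearest three
            render(colors[3:], alphas[3:])]  # "GPU 1": farthest three
comp_color, comp_trans = compose(partials)
```

Each device sends only a colour and a transmittance per pixel, so in this sketch the message size depends on image resolution rather than on the number of Gaussians.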

2606.18548 2026-06-18 cs.CY cs.AI 新提交

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

参与强度作为自适应AI伦理教学的学习者建模信号

Yongkyung Oh, Lynn Talton, Alex Bui

发表机构 * University of California, Los Angeles (UCLA)(加州大学洛杉矶分校)

AI总结 本研究比较了三种学习者特征(使用频率、自评熟悉度、先前AI教育)与AI感知结果的关系,发现使用频率与所有五项结果显著相关,为自适应AI伦理教学提供了简单的入学者建模信号。

详情
AI中文摘要

在研究生研究训练中,自适应AI伦理教学受益于反映先前LLM经验差异的入学者测量指标。先前的课程或研讨会参与是一个明显的候选指标,但尚不清楚它是否与关键AI感知项目的教学前评分相关。我们比较了三种候选入学者特征:自我报告的使用频率、自评LLM熟悉度和先前AI教育,针对93名参加必修研究伦理课程的生命科学研究生和博士后学员的五项基线感知结果。使用频率与所有五项结果显示出Holm校正的关联,自评熟悉度与三项结果相关,而先前AI教育与任何结果均无关联。在量表低端呈现阈值模式,在训练兴趣和准确性信任方面最为明显,而非在所有五项结果上呈现均匀梯度。在简短的入学者调查中,报告的LLM使用比先前的课程或研讨会更一致地与这些感知相关,自评熟悉度作为次要指标。这些结果表明,简单的教学前行为信号可以为自适应AI伦理教育的轻量级入学者画像提供信息。

英文摘要

Adaptive AI ethics instruction in graduate research training benefits from intake measures that reflect differences in prior LLM experience. Prior coursework or workshop attendance is an obvious candidate, but it is not clear whether it is associated with pre-instruction ratings on key AI perception items. We compare three candidate intake features, self-reported usage frequency, self-rated LLM familiarity, and prior AI education, across five baseline perception outcomes in 93 bioscience graduate and postdoctoral trainees enrolled in a required research ethics course. Usage frequency shows Holm-corrected associations with all five outcomes, self-rated familiarity with three, and prior AI education with none. A threshold-like pattern at the lower end of the scale is most visible for training interest and accuracy trust rather than appearing as a uniform gradient across all five outcomes. In a short intake survey, reported LLM use is more consistently associated with these perceptions than prior coursework or workshops, with self-rated familiarity serving as a secondary indicator. These results suggest that simple pre-instruction behavioral signals can inform lightweight intake profiling for adaptive AI ethics education.

2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE 新提交

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

AI沙箱:威胁模型、分类法与测量框架

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

发表机构 * Fujitsu Research of Europe(富士通欧洲研究)

AI总结 提出AI沙箱的威胁模型、分类法和测量框架,形式化沙箱边界与最弱链规则,定义网络物理威胁模型,并通过三个案例验证。

Comments 50 pages, 8 figures, 10 tables

详情
AI中文摘要

AI系统越来越多地在结合隔离、仿真、仪器化、监督和证据捕获的有界环境中进行评估。对于物理AI、AIoT和网络物理系统,这种转变不仅仅是术语问题:被测系统可能通过物理过程、网络设备和人类操作员进行感知、决策、执行、通信和故障。本文开发了一种面向保证的AI沙箱描述,将其作为数字AI、具身自主和网络物理部署中测试、评估、验证和确认的受控环境。我们形式化了沙箱边界和用于将每个维度的证据组合成有界部署声明的“最弱链”规则;分离了主要的沙箱原型;定义了一个包括对保证装置本身攻击的网络物理威胁模型;并引入了一个跨越保真度、可控性、可观测性、遏制性、可重复性和治理工件的测量框架,在三个实际沙箱的工作案例研究中实例化。由此产生的威胁模型、分类法和测量框架阐明了沙箱可以有效测试什么、它可以遏制哪些风险,以及它可以为安全、安保和监管保证支持哪些形式的证据。

英文摘要

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
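A weakest-link rule over the six measurement dimensions named in the abstract can be sketched as a minimum over per-dimension evidence levels. The ordinal 0 to 4 scale, the function name, and the example scores are hypothetical; the paper's actual formalisation may differ.

```python
# Dimension names follow the abstract; "governance" abbreviates "governance artifacts".
DIMENSIONS = (
    "fidelity", "controllability", "observability",
    "containment", "reproducibility", "governance",
)

def weakest_link(evidence):
    """Return (dimension, level) that bounds the overall deployment claim.

    `evidence` maps each dimension to an ordinal level (assumed scale: 0-4).
    The composed claim is no stronger than the weakest dimension.
    """
    missing = [d for d in DIMENSIONS if d not in evidence]
    if missing:
        raise ValueError(f"no evidence for: {', '.join(missing)}")
    limiting = min(DIMENSIONS, key=lambda d: evidence[d])
    return limiting, evidence[limiting]

# Hypothetical sandbox: strong everywhere except containment.
scores = {
    "fidelity": 4, "controllability": 3, "observability": 4,
    "containment": 1, "reproducibility": 3, "governance": 2,
}
limiting_dimension, claim_level = weakest_link(scores)
```

Under this rule, improving a dimension that is already strong does not raise the claim; only improving the limiting dimension does.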

2606.18530 2026-06-18 cs.CR cs.CL cs.LG 新提交

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

评估基于提示的防御策略对抗领域伪装注入攻击

Aaditya Pai

发表机构 * Data Science Institute(数据科学研究所)

AI总结 针对领域伪装注入攻击,评估五种基于提示的防御方法(如释义、重点标记等)在三个模型家族和三个部署领域中的有效性,发现释义法最有效,可将伪装攻击成功率降低55-84%。

Comments 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026

详情
AI中文摘要

领域伪装注入攻击使用领域特定词汇将恶意指令嵌入检索内容中,从而逃避依赖句法注入标记的标准检测器。当检测失败时,从业者需要知道哪些防御架构能降低攻击成功率。我们评估了五种基于提示的防御方法(重点标记、释义、提示夹层以及两种组合)对抗领域伪装注入攻击,涉及三个模型家族(Claude Haiku、Llama 3.1 8B、Gemini 2.0 Flash)和三个部署领域(金融、法律、通用),共进行3,510次试验。在代理处理之前对检索内容进行释义是最一致有效的防御方法,根据模型不同,可将伪装攻击成功率降低55-84%,并且在所有测试模型上均实现了比我们的Llama Guard 4配置更低的攻击成功率。防御效果强烈依赖于模型:重点标记在Claude Haiku上将攻击成功率减半,但在Llama 3.1 8B上没有任何益处。金融领域部署面临最高的残余风险,基线攻击成功率为26-33%,在较弱模型上没有任何基于提示的防御能完全消除威胁。这些结果首次系统评估了专门针对伪装类注入攻击的基于提示的防御方法,并为从业者建立了基于基准的建议。所有任务均使用合成构建的专业文档;这些基准排名是否能推广到真实企业文档仍是一个开放问题。

英文摘要

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.

2606.18425 2026-06-18 cs.SE cs.AI cs.DC New submission

From Specification to Execution: AI Assisted Scientific Workflow Management

From Specification to Execution: AI-Assisted Scientific Workflow Management

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Affiliations: RENCI, University of North Carolina at Chapel Hill, NC, USA; Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

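AI summary: Proposes an AI-assisted approach that combines specification-driven workflow generation, automated debugging, and distributed execution, integrating Pegasus with an MCP layer to manage scientific workflows end to end, from natural language to large-scale execution.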

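AI Chinese abstract (translated into English)

Scientific workflow management systems (WMS) support scalable, reproducible execution of complex pipelines, but designing, implementing, and debugging workflows is still largely manual and demands considerable expertise. Recent approaches based on large language models (LLMs) show promise for generating workflows from natural language, but they usually rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We propose an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The approach introduces a structured specification phase that separates workflow intent, design, and implementation, so that validation can happen before code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate the widely used WMS Pegasus with a Model Context Protocol (MCP) layer, which provides a unified interface for workflow submission, monitoring, and control. We evaluate the approach on a federated learning workflow for medical imaging, which has a parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows containing thousands of jobs, reduced debugging effort, and allowed non-expert users to build workflows using expert-level design patterns. These results show that end-to-end AI-assisted workflow generation and execution is feasible, and they point toward AI-driven platforms for managing the scientific workflow lifecycle.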

English abstract

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.
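
The idea of validating a specification before any code is generated can be sketched as a dependency check on a declarative job list. The example below is an illustration only: the dictionary-based spec format, the job names, and `validate_spec` are hypothetical and are not taken from the paper, Pegasus, or MCP. It uses the standard-library `graphlib` module (Python 3.9+) to confirm that every dependency is defined and that the workflow graph has no cycles.

```python
from graphlib import CycleError, TopologicalSorter


def validate_spec(jobs: dict[str, list[str]]) -> list[str]:
    """Check a workflow spec mapping each job to the jobs it depends on.

    Returns one valid execution order, or raises ValueError if a
    dependency is undefined or the dependencies form a cycle.
    """
    for job, deps in jobs.items():
        missing = [d for d in deps if d not in jobs]
        if missing:
            raise ValueError(f"{job} depends on undefined jobs: {missing}")
    try:
        # TopologicalSorter expects node -> predecessors, which matches the spec.
        return list(TopologicalSorter(jobs).static_order())
    except CycleError as err:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"dependency cycle: {err.args[1]}") from err


if __name__ == "__main__":
    # Hypothetical single round of a federated learning workflow.
    spec = {
        "init_model": [],
        "train_site_a": ["init_model"],
        "train_site_b": ["init_model"],
        "aggregate": ["train_site_a", "train_site_b"],
    }
    print(validate_spec(spec))
```

Catching an undefined or circular dependency at this stage is cheaper than discovering it after a workflow with thousands of jobs has been submitted.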

2606.18393 2026-06-18 eess.SY cs.AI cs.SY New submission

Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation

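Learning-Based Combustion Phasing Control Decisions with Latent Fuel Reactivity Estimation for Multi-Fuel Compression-Ignition Engines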

Rajasree Sarkar, Aditya Satish Patil, Arunava Banerjee, Ihsan Berk Altiner, Zongxuan Sun, Kenneth Kim, Chol-Bum Mike Keown

Affiliations: Department of Mechanical Engineering, University of Minnesota Twin Cities; DEVCOM Army Research Laboratory, Aberdeen Proving Ground

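AI summary: To handle unknown, time-varying fuel reactivity (cetane number) in multi-fuel compression-ignition engines, proposes a GRU-guided reinforcement learning framework that learns a compact fuel-reactivity representation from combustion history and achieves stable CA50 control with a mean absolute tracking error below 0.25° CA.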

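AI Chinese abstract (translated into English)

Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity (expressed as cetane number, CN), which complicates cycle-to-cycle combustion-phasing control. This paper models CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and the proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled, reproducible evaluation environment. The results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence falls short when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions the actor and critic on this estimated signal rather than on the true CN. Because the policy is trained on the same imperfect fuel-reactivity information that is available at deployment, the controller avoids the train-deploy inconsistency of conventional online estimate-then-control pipelines. On unseen CN trajectories, the policy achieves stable CA50 regulation with a mean absolute tracking error below 0.25° CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results indicate that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state that is available at deployment.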

English abstract

Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity, represented by cetane number (CN), which complicates cycle-to-cycle combustion-phasing control. This work formulates CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and a proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled and reproducible evaluation environment. Results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence is insufficient when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions both actor and critic on this estimated signal rather than oracle CN. By training the policy on the same imperfect fuel-reactivity information available at deployment, the controller avoids train-deploy inconsistency in conventional online estimate-then-control pipelines. Across unseen CN trajectories, the policy achieves stable CA50 regulation with mean absolute tracking error below 0.25° CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results show that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state available during deployment.

2606.18379 2026-06-18 cs.IR cs.AI New submission

RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

RankGraph-2: Lifecycle Co-Design of Billion-Node Graph Learning for Recommendation

Renzhi Wu, Zikun Cui, Junjie Yang, Tai Guo, Hong Li, Xian Chen, Li Yu, Ke Pan, Sri Reddy, Mahesh Srinivasan, Nipun Mathur, Haomin Yu, Hong Yan

Affiliations: Meta Platforms

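AI summary: Billion-scale graph retrieval usually treats graph construction, representation learning, and real-time serving as isolated stages. This work proposes the RankGraph-2 framework, which co-designs the stages (for example, a jointly trained cluster index and pre-computed neighborhoods), cutting serving compute cost by 83% while reaching 3.8x higher recall than GAT + Deep Graph Infomax and delivering CTR and CVR gains.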

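AI Chinese abstract (translated into English)

Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems (graph construction, representation learning, and real-time serving), yet existing work handles each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), with each stage's requirements shaping the others. Serving needs a jointly learned cluster index to avoid expensive online KNN, which forces index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, which removes the need for online graph infrastructure and in turn requires construction to produce self-contained data. Construction must also support hour-level refresh to cover items. Building on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions through subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and jointly learns a residual-quantization cluster index that cuts serving compute cost by 83%. This lifecycle co-design lets a simple architecture achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher recall than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers gains of up to +0.96% CTR and +2.75% CVR and has supported more than 20 retrieval launches across major surfaces.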

English abstract

Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems -- graph construction, representation learning, and real-time serving -- yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage's requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN -- this pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure -- this requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.
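
Pre-computing multi-hop neighborhoods with personalized PageRank can be illustrated on a toy user-item graph. The abstract does not say how RankGraph-2 computes these scores, so the power-iteration routine below, the restart probability `alpha`, the helper `top_neighbours`, and the five-node graph are all assumptions chosen for illustration, not the deployed method.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Random walk with restart to `seed`, computed by power iteration.

    adj maps each node to a list of out-neighbours; every neighbour
    must also appear as a key. Returns a dict node -> score summing to 1.
    """
    scores = {n: 0.0 for n in adj}
    scores[seed] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        nxt[seed] += alpha  # restart mass returns to the seed
        for node, mass in scores.items():
            nbrs = adj[node]
            if nbrs:
                share = (1.0 - alpha) * mass / len(nbrs)
                for m in nbrs:
                    nxt[m] += share
            else:
                # Dangling node: send its walk mass back to the seed.
                nxt[seed] += (1.0 - alpha) * mass
        scores = nxt
    return scores


def top_neighbours(adj, seed, k):
    """Return the k highest-scoring nodes other than the seed."""
    scores = personalized_pagerank(adj, seed)
    ranked = sorted((n for n in scores if n != seed),
                    key=lambda n: scores[n], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy bipartite graph: users u1, u2 and items i1, i2, i3.
    graph = {
        "u1": ["i1", "i2"],
        "u2": ["i2", "i3"],
        "i1": ["u1"],
        "i2": ["u1", "u2"],
        "i3": ["u2"],
    }
    print(top_neighbours(graph, "u1", k=3))
```

Storing the top-k list for each node ahead of time is what lets training read neighborhoods from static data instead of querying a live graph service.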