arXivDaily arXiv daily academic digest, updated Monday through Friday
2606.17065 2026-06-17 q-fin.CP cs.AI cs.LG New submission

PIVOT: Bridging Black-Scholes Implied-Volatility and Price Objectives via Differentiable Jäckel Operator

Raeid Saqur, Yannick Limmer, Anastasis Kratsios, Blanka Horvath, Hans Buehler

Affiliations Mathematical Institute, University of Oxford; McMaster University; Vector Institute for AI; DRW

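AI summary Proposes the PIVOT layer, which keeps the Jäckel solver's forward pass intact, supplies gradients by implicit differentiation, and uses a gating mechanism to handle the singularity in the low-vega regime, enabling efficient differentiable conversion between price space and implied-volatility space.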

Comments 30 pages, 17 figures, 12 tables

Abstract

Modern option-learning systems operate in two coordinates: price space, where markets quote and no-arbitrage constraints are most naturally enforced, and implied volatility (IV) space, where volatility surfaces are smoothed, regularized, and evaluated. The bottleneck is interface, not approximation: Jäckel's seminal "Let's Be Rational" (LBR) solver already inverts the Black-Scholes price to machine precision efficiently. What is missing is a differentiable layer that preserves LBR in the forward pass and avoids backpropagating through its branch logic. Such a layer must also confront the unavoidable singularity of the inverse map in the low-vega regime, where the sensitivity 1/vega diverges as vega -> 0. We close this gap with PIVOT, the Price-Implied-Volatility Objective Translator. PIVOT keeps the LBR forward pass intact and supplies the backward pass by implicit differentiation through the smooth Black-Scholes/Black-76 price map, with an explicit gating contract: invalid domains return NaN, well-conditioned rows receive the exact 1/vega gradient, and low-vega rows are attenuated rather than silently regularized. On a single H100, a fused Triton kernel reaches 1.79e9 IV/s at machine precision (9.3e-14 max relative error vs. the reference C solver); end-to-end label generation sustains 48.9M/s on synthetic chains and 16.6M/s on SPX OptionMetrics. In a HyperIV-style one-day reproduction on SPX, PIVOT-augmented objectives Pareto-dominate the baselines, reducing held-out price MAE by up to 43.4%, with the strongest three-seed gated objective jointly improving price MAE by 38.8% and IV MAE by 21.3%. Cross-asset results on RUT, VIX, and NDX show directional price-MAE gains of 40.1%, 24.2%, and 16.7%, while an ungated IV-roundtrip control collapses to a degenerate near-zero surface, confirming the gate as a correctness contract rather than a tuning knob.
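The backward pass described in this abstract rests on the implicit-function identity d(sigma)/d(price) = 1/vega. The following is a minimal sketch of that idea in plain Python, not the paper's implementation: the forward solver is a simple bisection standing in for LBR, and the function names, the `vega_floor` threshold, and the quadratic gate shape are all assumptions made for illustration.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def black76_call(F, K, T, sigma):
    """Undiscounted Black-76 call price for forward F, strike K, expiry T."""
    s = sigma * math.sqrt(T)
    d1 = math.log(F / K) / s + 0.5 * s
    return F * norm_cdf(d1) - K * norm_cdf(d1 - s)

def black76_vega(F, K, T, sigma):
    """Derivative of the Black-76 call price with respect to sigma."""
    s = sigma * math.sqrt(T)
    d1 = math.log(F / K) / s + 0.5 * s
    return F * norm_pdf(d1) * math.sqrt(T)

def implied_vol(price, F, K, T, lo=1e-8, hi=10.0, iters=200):
    """Stand-in forward solver (bisection), NOT Let's Be Rational.
    Prices outside the no-arbitrage band return NaN."""
    if not (max(F - K, 0.0) < price < F):
        return float("nan")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if black76_call(F, K, T, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gated_inverse_vega(vega, vega_floor=1e-6):
    """Hypothetical gate: exact 1/vega above the floor, smoothly
    attenuated toward zero below it (the gate shape is an assumption)."""
    if vega >= vega_floor:
        return 1.0 / vega
    return vega / (vega_floor * vega_floor)

def iv_with_gradient(price, F, K, T):
    """Return (sigma, d sigma / d price). The gradient comes from the
    smooth price map, never from differentiating the solver's branches."""
    sigma = implied_vol(price, F, K, T)
    if math.isnan(sigma):
        return sigma, float("nan")
    return sigma, gated_inverse_vega(black76_vega(F, K, T, sigma))
```

Calling `iv_with_gradient(black76_call(100.0, 100.0, 1.0, 0.2), 100.0, 100.0, 1.0)` recovers a volatility close to 0.2 together with its 1/vega sensitivity, and the same pattern carries over to an autograd framework by wrapping the solver in a custom backward rule.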

2606.17062 2026-06-17 q-bio.QM cs.LG New submission

RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports

Zhenhong Yang, Zhuoyun Liu, Jintao Fei, Wen Tang, Shichao Quan, Jun Zhao, Jun Xu

Affiliations JDH Algo, JD Health International Inc., China; Department of Big Data in Health Science, The First Affiliated Hospital of Wenzhou Medical University, China; Zhejiang Engineering Research Center for Hospital Emergency Department of Intensive Care Unit, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China

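AI summary Proposes RadSEM, a metric that uses a constrained LLM to rewrite reports into atomic finding sentences, performs contradiction-aware many-to-many matching, and computes an abnormal-weighted F1 score; on the SSREE stress test it outperforms existing metrics and achieves high concordance.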

Abstract

Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as "effusion" and "no effusion" receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at https://github.com/jdh-algo/RadSEM.

2606.18223 2026-06-17 cs.CR cs.AI cs.LG cs.SY eess.SY New submission

Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents

Ankita Samaddar, Sandeep Neema, Daniel Balasubramanian, Xenofon Koutsoukos

Affiliations MIT

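AI summary To address the unobservability of red-agent actions in cyber-attacks, proposes an imitation-learning-based policy learning technique that predicts red-agent behavior from network observations and defender actions, and integrates it with a neurosymbolic defense agent to achieve high prediction accuracy.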

Abstract

With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems, i.e., the cyber-attacker's (red agent's) actions are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker's intrusion levels. To address this, we propose a Policy Learning Technique using imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict the red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.

2606.18198 2026-06-17 cs.CR cs.CV New submission

Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo, Yebo Feng, Aishan Liu, Yang Liu

Affiliations Nanyang Technological University, Singapore; Chongqing University, China; Northeastern University, China; University of Chinese Academy of Sciences, China; Beihang University, China

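AI summary Targeting the blind spot in which existing skill scanners overlook malicious instructions in images, proposes SkillCamo, a multimodal attack that hides instructions in images, and ExecScan, an execution-grounded scanning module that detects hidden instructions through intent extraction, behavior reconstruction, and related analyses; experiments demonstrate its effectiveness.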

Abstract

Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.

2606.18193 2026-06-17 cs.CR cs.AI cs.CL New submission

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Nicola Franco

Affiliations AI4I

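AI summary Runs automated jailbreak attacks against two frontier large language models with the HackAgent framework and finds that, although the models resist most attacks, adaptive iterative attacks still succeed and the residual attack surface is larger than aggregate framing suggests.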

Comments White paper

Abstract

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7,826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1,620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.

2606.18190 2026-06-17 cs.CR cs.LG New submission

Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation

Abir Ashab Niloy, Ahmed Ryan, Imamul Hossain Rafi, Md Erfan, Md Rayhanur Rahman

Affiliations Windows endpoints

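AI summary To address the lack of multi-source log datasets with ATT&CK technique labels for multi-stage attack detection, builds a multi-source log dataset of 870 sessions (70 attack, 800 benign) and about 2.3 million events, then fine-tunes three small language models on it, raising chunk-classification accuracy from about 8% to between 90% and 97%.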

Abstract

Multi-stage cyberattacks span system, network, and browser logs. Detecting them requires correlating events across all three sources. Machine learning methods can learn these cross-source patterns, but they need labeled multi-source data. Existing public datasets fall short. Network-only datasets such as CICIDS and UNSW-NB15 miss host and browser activity. Host-focused datasets such as LMDG and CICAPT-IIoT lack browser telemetry. ATLAS includes all three sources but labels events only as malicious or benign, without MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) technique granularity. No public dataset combines all three sources with per-entry ATT&CK technique labels. We close the gap by building a multi-source log dataset of 870 sessions (70 attack, 800 benign) and approximately 2.3 million events. We captured system, network, and browser activity simultaneously on Windows endpoints. We labeled malicious events with ATT&CK technique IDs, covering 12 tactics and 53 techniques. We generated all attack data using real tools, including Remote Access Trojan (RAT), Command and Control (C2) tunnels, and cloud exfiltration. To demonstrate learnability, we fine-tuned three Small Language Models (SLMs) (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) using Low-Rank Adaptation (LoRA). We compared each against its base variant across ten metrics on two tasks: chunk classification and ATT&CK technique identification. Fine-tuning improved every model on every metric. Chunk classification accuracy rose from approximately 8% in the base variants to between 90% and 97% after fine-tuning. Technique identification remained challenging, with the best exact-match accuracy at 42%, although high partial-match scores show the models captured most of the underlying reasoning.

2606.18181 2026-06-17 cs.IR cs.AI cs.CY New submission

IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

Henry Bodwell, Hong Yang, John C. Simeone, Kelvin Gorospe, Bella Sullivan, Lana Huang, Jessica Gephart, Sandy Aylesworth, Molly Masterton, Naren Ramakrishnan

Affiliations University of Washington

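AI summary Proposes the IUU+ concept to broaden the definition of illegal fishing and builds IUU+DB, an LLM-driven system that automatically extracts key incident information from heterogeneous documents and supports deduplication and trend analysis, providing data support for fisheries regulation and research.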

Abstract

Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws. We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors. Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity, remains difficult to obtain. We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity. The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis. Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk assessments for industry, and provide support for policy implementation and targeted enforcement efforts to government agencies.

2606.18168 2026-06-17 cs.SE cs.AI New submission

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim

Affiliations Dipayan Banik; Kowshik Chowdhury; Shazibul Islam Shamim

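AI summary Studies the presence of oracle signals in agent-authored test code and finds that 80.2% of test patches lack strong oracle signals, while strong oracles are significantly and positively associated with merge likelihood (OR = 1.28).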

Comments Accepted at the 8th IEEE International Conference on Artificial Intelligence Testing, 2026

Abstract

Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.
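The distinction this abstract draws between tests that merely execute code and tests that verify behavior can be illustrated with a small syntactic check. The sketch below is an illustration only and is not the paper's eight-category taxonomy: it counts `assert` statements and calls whose names start with `assert` inside Python test functions, and the function name `count_explicit_oracles` and the sample source are hypothetical.

```python
import ast

def count_explicit_oracles(source: str) -> dict:
    """Map each test function name to its number of explicit assertions.

    Counts `assert` statements and calls such as `self.assertEqual(...)`.
    This is a rough syntactic proxy (an assumption for illustration),
    not the oracle-signal taxonomy from the study.
    """
    counts = {}
    for node in ast.walk(ast.parse(source)):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if not (is_func and node.name.startswith("test")):
            continue
        n = 0
        for child in ast.walk(node):
            if isinstance(child, ast.Assert):
                n += 1
            elif isinstance(child, ast.Call):
                func = child.func
                if isinstance(func, ast.Attribute):
                    name = func.attr
                else:
                    name = getattr(func, "id", "")
                if name.startswith("assert"):
                    n += 1
        counts[node.name] = n
    return counts

# Hypothetical test file: one smoke test and one test with real checks.
SAMPLE = '''
def test_smoke():
    run_pipeline()

def test_checked(self):
    result = add(2, 3)
    assert result == 5
    self.assertEqual(result, 5)
'''

oracle_counts = count_explicit_oracles(SAMPLE)
```

A quality gate built on such counts would flag `test_smoke` as having no explicit oracle even though a test file is present, which is the overestimation the study describes.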

2606.18166 2026-06-17 cs.CR cs.LG New submission

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

Ahmed Ryan, Saad Sakib Noor, Md Erfan, Shaswata Mitra, Sudip Mittal, Md Rayhanur Rahman

Affiliations The University of Dhaka

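AI summary Because open-source LLMs had not been evaluated for ATT&CK classification on complex unstructured CTI reports, builds a dataset of 2,076 human-annotated sentences and evaluates seven open-source LLMs; the best micro-averaged F1 is 0.22, indicating that current models are insufficient for production use.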

Abstract

Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving κ = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.
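For readers unfamiliar with the headline metric, micro-averaged F1 in a multi-label setting pools true positives, false positives, and false negatives over all sentences before computing precision and recall. The sketch below shows the standard computation on made-up labels; the technique IDs and predictions are illustrative assumptions, not data from the paper.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-sentence label sets.

    gold, pred: equal-length lists of sets of technique IDs; an empty
    set stands for a sentence with no technique.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels only (hypothetical annotations and predictions).
gold = [{"T1059", "T1105"}, set(), {"T1566"}]
pred = [{"T1059"}, {"T1027"}, {"T1566", "T1204"}]
score = micro_f1(gold, pred)  # tp=2, fp=2, fn=1, so F1 = 4/7
```

Because counts are pooled, frequent techniques dominate the score, and a spurious label on a technique-negative sentence is penalized as a false positive.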

2606.18158 2026-06-17 cs.CY cs.AI cs.CL New submission

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

Michèle Finck

Affiliations Chair of Law and Artificial Intelligence and Director, CZS Institute for Artificial Intelligence and Law, University of Tübingen

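AI summary Observes that no current benchmark evaluates whether large language models perform doctrinal legal reasoning, and argues that such a benchmark is essential to give operational content to the EU AI Act's "appropriate accuracy" requirement.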

Abstract

Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.

2606.18129 2026-06-17 cs.HC cs.AI New submission

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Affiliations York University; Vector Institute; Rotman Research Institute; Dalhousie University; Centre for Addiction & Mental Health; KITE Research Institute

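AI summary To address the lack of process-level behavioural evaluation of LLMs in mental-health support, proposes the concept of cognitive atrophy and an accompanying benchmark; clinical annotation and expert review reveal that models commonly exhibit moderate-to-high atrophy-aligned behaviour.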

Abstract

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

2606.18120 2026-06-17 cs.CR cs.AI cs.CL cs.LG New submission

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Mohammadreza Rashidi

Affiliations Department of Computer Science AI; Media Analysis Lab Berlin, Germany

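AI summary Studies how double-brace versus triple-brace interpolation in the Handlebars template engine affects structural role injection; through a model-free analysis and 5760 trials, shows that HTML escaping protects only specific delimiter families and cannot substitute for a structural separation of instruction and data.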

Comments 7 pages, 6 figures

Abstract

Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application's exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
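The model-free part of this analysis can be approximated in a few lines. The sketch below reimplements, in Python, the character map that Handlebars.js applies to double-brace values (to the best of my knowledge: `&`, `<`, `>`, `"`, `'`, the backtick, and `=`) and checks which sample delimiters pass through unchanged. The delimiter strings are illustrative assumptions rather than the paper's exact payloads, and other Handlebars ports may escape a different character set.

```python
# Character map assumed to mirror Handlebars.js HTML escaping for
# double-brace interpolation; treat it as an approximation.
HANDLEBARS_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "`": "&#x60;",
    "=": "&#x3D;",
}

def escape_expression(value: str) -> str:
    """Escape a value the way a double-brace expression would."""
    return "".join(HANDLEBARS_ESCAPES.get(ch, ch) for ch in value)

def survives(delimiter: str) -> bool:
    """True if the delimiter reaches the prompt byte-for-byte intact."""
    return escape_expression(delimiter) == delimiter

# Illustrative role delimiters, one per family (hypothetical samples).
DELIMITERS = {
    "chatml": "<|im_start|>system",
    "llama3": "<|start_header_id|>system<|end_header_id|>",
    "xml": "<system>",
    "llama2_inst": "[INST]",
    "human_assistant": "Human:",
    "markdown": "### System",
}

survival = {family: survives(d) for family, d in DELIMITERS.items()}
```

In this sketch the angle-bracket families are rewritten while the bracket, colon, and hash families pass through untouched. It tests only the bare `[INST]` token, so it says nothing about Llama-2 payloads that also carry angle-bracket markers.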

2606.18105 2026-06-17 cs.NI cs.LG New submission

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

Affiliations Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

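AI summary Proposes OmniPlan, an adaptive framework that uses a large language model to parse user intents and a mixture-of-experts architecture to dynamically select among MIP solvers, heuristics, and deep reinforcement learning models, achieving timely and near-optimal network planning optimization; on distributed ML inference offloading it reduces latency by up to 97.8% and resource consumption by up to 11.5%.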

Abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to a trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.

2606.18069 2026-06-17 cs.GR cs.CG cs.CV 新提交

Blended Chart Surfaces: A Seamless Explicit Representation for Smooth Surface Fitting

混合图表曲面:一种用于光滑曲面拟合的无缝显式表示

Romy Williamson, Niloy Mitra

发表机构 * Adobe

AI总结 提出混合图表曲面,一种无网络、显式且光滑的曲面表示,通过代理网格和多项式映射联合优化,融合局部图表实现全局光滑,支持微分量和能量计算。

Comments 17 pages, 16 figures

详情
AI中文摘要

适用于几何处理的曲面表示应紧凑且显式,提供全局光滑性保证,支持广泛的曲面拓扑,并能可靠访问法线和曲面能量等微分量,同时与现代可微优化兼容。现有的神经表示通常会牺牲这些属性中的一项或多项:隐式场通常需要等值面提取以供下游使用,而显式神经映射受限于规范域参数化或在局部图表之间出现接缝伪影。我们引入了混合图表曲面,这是一种紧凑、无网络、显式的表示,通过构造实现光滑,并锚定到用户提供的拓扑。给定一个编码预期曲面拓扑和近似几何的粗略代理网格,混合图表曲面使用现成优化器在每个代理顶点联合优化多项式映射,以拟合隐式目标形状,从而避免输入参数化。相邻映射使用光滑的“一环坐标”混合方案融合,将拓扑和粗略几何(由代理承载)与几何细节(由局部补丁承载)解耦。曲面全局光滑、完全可微,并支持导数的稳定评估,使得微分量和曲面能量可直接访问。此外,我们的构造对代理网格的刚体运动和缩放是等变的。我们在各种拓扑和几何复杂度上评估了混合图表曲面,并与显式替代方案(包括插值函数基线和网格位移MLP)进行比较。在这些比较中,混合图表曲面在紧凑性、简单性、微分量访问性和表达力之间实现了有利的权衡,同时在补丁边界保持光滑。

英文摘要

A surface representation suitable for geometry processing should be compact and explicit, provide global smoothness guarantees, support a wide range of surface topologies, and offer reliable access to differential quantities such as normals and surface energies, while remaining compatible with modern differentiable optimization. Existing neural representations typically sacrifice one or more of these properties: implicit fields typically require iso-surfacing for downstream use, while explicit neural maps are constrained by canonical-domain parametrizations or exhibit seam artifacts between local charts. We introduce Blended Chart Surfaces, a compact, network-free, explicit representation that is smooth by construction and anchored to user-provided topology. Given a coarse proxy mesh encoding the intended surface topology and approximate geometry, Blended Chart Surfaces jointly optimize for a polynomial map at each proxy vertex using an off-the-shelf optimizer to fit to an implicit target shape, avoiding the need for an input parametrization. Neighboring maps are fused using a smooth 'one-ring coordinate' blending scheme, decoupling topology and coarse geometry (carried by the proxy) from geometric details (carried by the local patches). The surface is globally smooth, fully differentiable, and enables stable evaluation of derivatives, making differential quantities and surface energies directly accessible. Additionally, our construction is equivariant to rigid motions and scaling of the proxy mesh. We evaluate Blended Chart Surfaces on various topologies and geometric complexity, and compare against explicit alternatives including interpolating-function baselines and mesh-displacement MLPs. Across these, Blended Chart Surfaces achieve a favorable trade-off among compactness, simplicity, access to differential quantities, and expressivity while remaining smooth across patch boundaries.

2606.18057 2026-06-17 cs.HC cs.AI cs.CL cs.CY cs.SI 新提交

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

当AI说“我也有过类似经历”:同伴式照护支持中的合成生活经验

Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

AI总结 研究AI在同伴支持中生成“合成生活经验”的悖论,通过分析人类与AI在阿尔茨海默病照护社区中的叙事差异,揭示AI虽能模拟情感支持但缺乏真实经历,需建立机制区分支持性语言与虚构经历。

详情
AI中文摘要

照护者经常转向在线社区寻求信息和情感支持。在这些空间中,同伴支持者常常利用个人叙事来回应情感复杂的照护情境。随着LLM被设计为同伴式的支持来源,它们引入了一个关键张力:AI可以提供即时、私密且非评判性的支持,但它无法真实拥有使人类同伴支持有意义的生活经验。然而,当被提示要听起来像同伴时,LLM可能会生成暗示生活经验的语言。这创造了一个合成生活经验悖论:使AI支持感觉温暖、亲切且同伴式的相同经验语言,也可能错误地将系统定位为拥有生活经验的人。我们在阿尔茨海默病及相关痴呆症(ADRD)患者的家庭照护者背景下审视这一悖论。利用来自在线社区的照护者支持交流以及三个LLM(LLaMA、GPT-4o-mini和MedGemma)生成的同伴式响应,我们分析人类同伴如何使用个人叙事以及AI如何融入类似的叙事形式。心理语言学分析显示,同伴响应使用的第一人称和过去时态语言显著多于同伴式AI响应。定性上,我们识别出人类同伴支持中的七种个人叙事类型,并表明AI通常能捕捉其情感工作,但可能捏造经验基础。这些发现揭示了一个叙事真实性差距:同伴式AI可以生成合成生活经验,而没有使同伴支持有意义的真实经验。我们认为,照护者支持AI系统需要机制来区分支持性的同伴式框架与虚构的生活经验,确保模型能够提供温暖和认可,而不会错误地将自己定位为经验同伴。

英文摘要

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

2606.18000 2026-06-17 cs.NI cs.AI 新提交

A T-API-Compliant ReAct Agentic Loop for Optical Networks: Generic vs. Domain-Specific Tool Abstractions

一种符合T-API规范的ReAct智能循环用于光网络:通用与领域特定工具抽象

Seyed Morteza Ahmadian, Paolo Monti, Carlos Natalino

发表机构 * Department of Electrical Engineering, Chalmers University of Technology(查尔姆斯理工大学电子工程系)

AI总结 提出首个符合T-API规范的推理与行动(ReAct)循环,通过领域特定复合工具实现90%的oracle验证正确性,并节省三倍令牌。

Comments 4 pages, 2 figures, accepted for presentation at the 52nd European Conference on Optical Communications (ECOC), 2026

详情
AI中文摘要

光网络需要意图驱动的闭环智能管理,这是实现更高自治水平的关键。我们提出了首个符合T-API规范的推理与行动(ReAct)循环。我们表明,与通用工具相比,领域特定的复合工具实现了90%的oracle验证正确性,并节省了三倍的令牌。

英文摘要

Optical networks need intent-driven, closed-loop agentic management, a key enabler for higher autonomy levels. We present the first T-API-compliant reasoning and acting (ReAct) loop. We show that domain-specific composite tools achieve 90% oracle-validated correctness with threefold token savings compared to generic tools.

2606.17962 2026-06-17 cs.MA cs.AI 新提交

A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

一种面向策略逻辑的策略综合的神经符号方法

Marco Aruta, Vadim Malvone, Aniello Murano, Domenico Parente, Luca Rizzuti

发表机构 * University of Naples Federico II(那不勒斯费德里科二世大学) LTCI, Télécom Paris, Institut Polytechnique de Paris(LTCI,巴黎电信学院,巴黎理工学院) Università degli Studi di Salerno(萨勒诺大学)

AI总结 提出一种神经符号框架,将大语言模型作为策略生成预言机,结合模型检查器进行形式验证,在NatATL中实现高精度策略综合。

详情
AI中文摘要

推理智能体通过策略交互能实现什么是多智能体系统(MAS)中的核心挑战。用于策略能力的逻辑(如ATL)提供了严格的方法,但其采用常因策略综合的计算成本而受阻。我们引入了一种神经符号框架,将大语言模型(LLM)集成到MAS的模型检查流程中。LLM作为策略生成预言机,提出候选策略,然后由标准MAS模型检查器进行形式验证。这种生成-认证架构利用LLM引导来导航大型组合策略空间,同时保持形式正确性:生成的策略仅在通过验证器认证后才被接受。我们为NatATL中的有界策略推理实例化了该框架,并引入了首个NatATL策略综合数据集,包含4211个实例。使用开源Qwen3-32B模型的实验表明,我们的认证流程在策略综合结果上达到了92%的准确率。

英文摘要

Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92% accuracy on strategy-synthesis outcomes.
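The generate-and-certify loop described in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `generate_and_certify`, the stand-in proposer, and the toy `reaches_goal` check are hypothetical and are not the paper's LLM oracle or its NatATL model checker. The only point is the control flow, in which a candidate is returned only after the verifier accepts it.

```python
from typing import Callable, Iterable, Optional, Tuple

Strategy = Tuple[int, ...]  # toy strategy: a fixed sequence of moves (+1 or -1)


def generate_and_certify(
    propose: Callable[[], Iterable[Strategy]],
    certify: Callable[[Strategy], bool],
    budget: int,
) -> Optional[Strategy]:
    """Return the first proposed strategy that the verifier certifies.

    `propose` stands in for the LLM oracle and `certify` for the model
    checker; both are hypothetical placeholders.
    """
    for attempt, candidate in enumerate(propose()):
        if attempt >= budget:
            break  # stop once the proposal budget is spent
        if certify(candidate):
            return candidate  # accepted only when certified
    return None  # nothing certified within the budget


def reaches_goal(moves: Strategy, start: int = 0, goal: int = 3) -> bool:
    """Toy verifier: simulate the moves and check that the goal is reached."""
    position = start
    for move in moves:
        position += move
    return position == goal


def toy_oracle() -> Iterable[Strategy]:
    """Stand-in proposer: yields one wrong candidate, then a correct one."""
    yield (1, -1, 1)
    yield (1, 1, 1)


certified = generate_and_certify(toy_oracle, reaches_goal, budget=5)
uncertified = generate_and_certify(toy_oracle, reaches_goal, budget=1)
print(certified, uncertified)
```

Because acceptance is gated by the verifier, a poor proposer can only cost extra attempts; it cannot cause an incorrect strategy to be returned.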

2606.17915 2026-06-17 cs.MA cs.AI cs.DB cs.SE 新提交

Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

可信赖的自组合大数据即服务:一种LLM编排的多智能体框架,用于自动化数据工程、AutoML、MLOps部署和漂移感知生命周期优化

Aueaphum Aueawatthanaphisut, Badri Raj Lamichhane

发表机构 * School of Information, Computer, and Communication Technology(信息、计算机与通信技术学院) Sirindhorn International Institute of Technology, Thammasat University(诗琳通国际理工学院,泰国法政大学)

AI总结 提出一种基于LLM编排的多智能体BDaaS框架,通过分解生命周期为专用智能体并协调执行,实现自动化数据工程、AutoML、MLOps部署和漂移感知优化,提升生命周期级可靠性。

Comments 7 pages, 3 figures, 5 tables

详情
AI中文摘要

大数据即服务(BDaaS)平台需要可靠地自动化数据摄取、清洗、特征工程、模型开发、部署和部署后监控。然而,现有的基于LLM的数据科学智能体和AutoML系统主要关注孤立的工作流阶段,对生命周期级编排、工件治理、人工监督和漂移感知适应的支持有限。本文提出了一种基于LLM编排的多智能体协作的可信赖自组合BDaaS框架。所提出的架构将BDaaS生命周期分解为专门的智能体,用于数据摄取、数据清洗、特征工程、AutoML训练、模型评估、MLOps部署、监控和漂移检测。中央LLM编排层协调智能体执行,验证中间输出,管理工作流上下文,并支持动态工作流组合。该框架还包含共享工件治理、可重现性支持、人在回路检查点和漂移感知反馈循环。使用包含缺失值、分类变量、异常值、类别不平衡和模拟协变量漂移的受控表格基准数据集进行了基于原型的评估。与手动ML、仅AutoML和单智能体LLM基线相比,所提出的多智能体BDaaS流水线在保持竞争性预测性能的同时,提高了生命周期级可靠性,包括工作流完成度、工件可追溯性、部署就绪性、可重现性和漂移恢复。结果表明,LLM编排的多智能体系统可以将传统AutoML扩展到可信赖、自适应和面向生产的BDaaS生命周期自动化。

英文摘要

Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.

2606.17910 2026-06-17 cs.IR cs.AI cs.CL 新提交

Non-negative Elastic Net Decoding for Information Retrieval

非负弹性网络解码用于信息检索

Koki Okajima, Yasutoshi Ida, Tsukasa Yoshida, Yasuaki Nakamura

发表机构 * NTT, Inc(NTT公司)

AI总结 提出非负弹性网络(NNN)解码方法,将检索视为联合解码问题,通过稀疏非负线性组合重构查询嵌入,在理论上严格优于稠密检索,实验表明在多个基准上取得一致改进。

Comments 19 pages, 4 figures

详情
AI中文摘要

稠密检索已成为信息检索中的主导范式,其中每个文档通过其向量嵌入与查询的内积进行评分,并根据分数检索前$k$个文档。然而,由于每个文档的分数仅取决于查询和自身的嵌入,检索过程忽略了整个语料库的内容。因此,稠密检索无法避免从语料库中选择语义相似的文档,这可能导致检索结果集缺乏多样性且冗余。为此,我们将检索视为一个联合解码问题,其中文档作为集合被选择,并考虑语料库其余部分的上下文。为了实现这一点,我们提出了非负弹性网络(NNN)解码,它选择嵌入能够联合重构查询嵌入(作为稀疏非负线性组合)的文档。我们的主要理论结果建立了稠密检索与NNN解码之间的严格分离。对于任何语料库,稠密检索正确处理的每个查询也由NNN解码处理,而在包含相关文档的语料库上,NNN解码额外处理了稠密检索无法处理的查询。实验结果表明,将NNN解码应用于为内积评分训练的冻结嵌入,在多个基准上产生了一致的改进。此外,我们引入了一种端到端训练过程,优化嵌入以用于NNN解码,在所有指标和基准上相比稠密检索产生了显著的性能提升。我们的工作为在信息检索中利用稠密嵌入建立了一种新的范式,超越了内积评分的标准实践。

英文摘要

Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the rest of the corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To address this, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure that optimizes the embeddings for NNN decoding, producing significant performance gains over dense retrieval across all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.
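To make the contrast with inner-product scoring concrete, here is a minimal NumPy sketch of non-negative elastic net decoding on a toy corpus. It assumes the generic objective min over w >= 0 of 0.5*||q - D^T w||^2 + l1*sum(w) + 0.5*l2*||w||^2, solved by plain coordinate descent. The abstract does not state the paper's exact objective, solver, or hyperparameters, so `nnn_decode`, `l1`, `l2`, and the three toy embeddings are illustrative assumptions.

```python
import numpy as np


def nnn_decode(query, docs, l1=0.01, l2=0.01, sweeps=500):
    """Coordinate descent for a non-negative elastic net (illustrative only).

    Minimizes 0.5*||query - docs.T @ w||^2 + l1*sum(w) + 0.5*l2*||w||^2
    subject to w >= 0, where each row of `docs` is a document embedding.
    """
    n_docs = docs.shape[0]
    w = np.zeros(n_docs)
    residual = query.astype(float).copy()  # always equals query - docs.T @ w
    sq_norms = (docs ** 2).sum(axis=1)
    for _ in range(sweeps):
        for j in range(n_docs):
            # Correlation of doc j with the residual, excluding its own term.
            rho = docs[j] @ residual + sq_norms[j] * w[j]
            new_wj = max(0.0, (rho - l1) / (sq_norms[j] + l2))
            residual += (w[j] - new_wj) * docs[j]
            w[j] = new_wj
    return w


# Toy corpus: documents 0 and 1 are near-duplicates, document 2 is distinct.
docs = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = np.array([0.8, 0.6])  # needs both the "x" and the "y" direction

dense_top2 = set(np.argsort(-(docs @ query))[:2].tolist())
weights = nnn_decode(query, docs)
nnn_top2 = set(np.argsort(-weights)[:2].tolist())
print(sorted(dense_top2), sorted(nnn_top2))
```

In this toy setup, inner-product scoring ranks the two near-duplicates first, while the reconstruction weights place the distinct document 2 in the top two, because the duplicates compete for the same part of the query.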

2606.17887 2026-06-17 cs.HC cs.AI 新提交

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

AI在跨国劳动力中的采纳:人力资源中GenAI接受的社会技术条件

Dalia Ali, Maria José Rodríguez Velázquez, Manoel Horta Ribeiro, Vera Liao, Orestis Papakyriakopoulos

发表机构 * Technical University of Munich(慕尼黑工业大学) University of Michigan(密歇根大学) Princeton University(普林斯顿大学)

AI总结 研究跨国科技公司从传统HR系统转向GenAI系统过程中,员工采纳受情境适配、搜索素养和信任校准等社会技术条件影响,并提出了包容性部署的设计建议。

详情
AI中文摘要

生成式AI(GenAI)在工作场所的部署正在迅速加速。然而,谁采纳、谁受益、谁被落下以及为什么,这些问题仍未得到充分研究。在本文中,我们在一家从传统人力资源(HR)搜索系统过渡到GenAI支持系统的跨国科技公司的背景下调查这些动态,分析了搜索日志数据、调查数据(n=25)和十次半结构化访谈。我们的发现表明,采纳取决于GenAI系统的设计假设与员工的工作位置性(角色、口语、任期)之间的匹配。此外,我们发现员工对GenAI答案的信任是通过来源检查、系统间比较以及在怀疑时向同事或HR寻求意见来建立的。我们的贡献有两方面。首先,我们提供了在实时组织转型期间工作场所GenAI采纳的经验证据,表明采纳受到情境适配、搜索素养和信任校准等因素的影响。它还进一步受到知识条件的影响,例如系统的内容质量、员工培训和指导。其次,我们将这些发现转化为在高风险环境(如HR)中包容性部署和采纳的设计考虑。我们认为,组织应该设计系统时考虑它们对不同社会群体产生的角色和情境敏感的好处。他们还需要将组织知识基础设施视为AI基础设施,以提高GenAI系统的问责性和可用性。

英文摘要

Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system's design assumptions and employees' work positionalities (role, spoken language, tenure). Further, we find that employees' trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is further shaped by knowledge conditions such as the system's content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role- and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems.

2606.17819 2026-06-17 cs.SE cs.AI cs.CL 新提交

A Framework for Evaluating Agentic Skills at Scale

大规模评估智能体技能的框架

Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

发表机构 * Tessl, London, United Kingdom(英国伦敦Tessl)

AI总结 提出一个评估框架,通过构建真实任务和评分标准,大规模评估500个真实技能在19种智能体模型上的表现,发现模型对技能指令的遵循程度差异显著,且技能显著改变模型行为。

详情
AI中文摘要

智能体技能——结构化、可重用的知识工件,增强LLM智能体能力——已在工业界迅速采用,但其跨领域影响以及在商业和开源模型中的使用仍未得到充分研究,并且缺乏可复用的方法来评估单个技能。在这项工作中,我们提出了一个评估框架,允许技能作者构建真实任务,以严格评估技能中对他们最重要的方面,并通过解决这些任务来估计技能效用。此外,我们将评估方法大规模应用于500个真实技能,生成了1000个源自技能内容的任务,以及指令遵循和目标完成评分标准。使用这些指标,我们评估了19种智能体模型配置(包括专有和开源模型)在任务上的表现。我们的结果表明,模型在遵循技能中编码的指令方面差异很大,导致其性能提升存在显著差异。此外,我们表明,与无技能设置相比,访问技能显著改变了模型行为,为将主观工作流编码到LLM智能体中提供了一种重要机制。我们发布了评估数据集,以支持未来关于智能体技能的工作。

英文摘要

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

2606.17815 2026-06-17 cs.CR cs.CL 新提交

Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors

超越原生成功:审计CLIP后门的部署接口暴露

Kunlan Xiang, Haomiao Yang, Wenbo Jiang

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出DIFE框架审计CLIP后门在不同部署接口下的暴露情况,发现原生成功不代表全局安全,并引入BadTextTower填补文本编码器后门缺失。

详情
AI中文摘要

对比语言-图像预训练模型广泛重用于下游接口,包括特征提取、检索、重排序和选择。然而,现有的CLIP后门通常在小规模的原生攻击任务上验证攻击,导致不清楚当通过其他接口重用时,相同的投毒检查点是否仍然暴露、减弱或变得不适用。我们引入DIFE,一个部署接口足迹评估框架,用于审计跨部署接口的带后门CLIP检查点。DIFE通过指定每个接口的组件读出、触发通道、目标事件、参考条件和度量,使各种评估具有可比性。DIFE还引入了有效足迹诊断,以识别携带暴露的可重用CLIP组件或组件组合,并解释风险转移的位置。使用DIFE审计复现的CLIP后门揭示了一个结构化的景观:原生成功不是检查点级别的风险证书,暴露遵循组件足迹,文本侧投毒不会产生文本编码器控制,一些耦合攻击仍然受机制约束。这次审计揭示了现有CLIP后门中的一个重要空白:文本编码器本身成为对抗行为的可重用载体。因此,我们引入BadTextTower来填补这一空白。BadTextTower产生强大的文本条件检索、重排序和选择暴露,同时使仅视觉重用几乎保持清洁。

英文摘要

Contrastive Language-Image Pre-training models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor work, however, usually validates attacks on a small attack-native task, leaving unclear whether the same poisoned checkpoint remains exposed, weakens, or becomes inapplicable when reused through other interfaces. We introduce DIFE, a Deployment-Interface Footprint Evaluation framework that audits backdoored CLIP checkpoints across deployment interfaces. DIFE makes various evaluations comparable by specifying each interface's component readout, trigger channel, target event, reference condition, and metric. DIFE also introduces effective-footprint diagnosis to identify the reusable CLIP component or component combination that carries exposure and explains where risk transfers. Auditing reproduced CLIP backdoors with DIFE reveals a structured landscape: native success is not a checkpoint-level risk certificate, exposure follows component footprints, text-side poisoning does not yield textual-encoder control, and some coupled attacks remain mechanism-bound. This audit reveals an important gap in existing CLIP backdoors: a textual encoder that itself becomes a reusable carrier of adversarial behavior. We therefore introduce BadTextTower to fill this gap. BadTextTower produces strong text-conditioned retrieval, reranking, and selection exposure while leaving visual-only reuse nearly clean.

2606.17799 2026-06-17 cs.SE cs.AI cs.CL 新提交

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

立场:编程基准与智能体软件工程不一致

Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox

发表机构 * Tessl

AI总结 本文指出当前编程基准在智能体时代存在三大问题:混淆模型与系统框架、单一参考答案惩罚有效替代方案、缺乏组件级信号导致迭代困难,并提出应重新设计基准以对齐智能体软件工程。

详情
AI中文摘要

编程智能体已成为软件工程的主要模式,但我们用于比较它们的基准是在智能体时代之前设计的:它们将模型、框架和环境合并为一个单一的端到端分数,通常针对一个参考答案进行计算,没有提供用于迭代的组件级信号。我们认为当前的编程基准与智能体软件工程不一致。在实践中,编程智能体不是一个模型:它是一个系统框架——由模型、框架、上下文、环境和反馈信号组成的复合体,其中任何一个都可能使基准分数移动与相邻模型代际之间相当的幅度。我们讨论了三个症状:(i) 基准分数混淆了模型与框架的其余部分;(ii) 针对单一参考答案评分惩罚了同样有效的替代方案;(iii) 缺乏单个框架组件级别的信号使得端到端系统分数难以迭代。

英文摘要

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

2606.17786 2026-06-17 cs.HC cs.CL 新提交

Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars

利用AI驱动的交互式患者化身实现可及的心理治疗培训

Pascal Riachi, Sofie Kamber, Stella Brogna, Andrew Gloster, Rafael Wampfler

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Lucerne(卢塞恩大学)

AI总结 提出一个通过具身虚拟患者进行对话训练ACT心理治疗师的系统,利用大语言模型模拟患者行为,并提供基于ACT保真度标准的逐轮反馈,专家评估证实了高真实性和培训效果。

详情
Journal ref
2026 IEEE 14th International Conference on Healthcare Informatics (ICHI), Minneapolis, MN, June 1-3, 2026, pp. 990-995
AI中文摘要

培训心理治疗师掌握诸如接纳与承诺疗法(ACT)等循证干预措施需要反复练习并伴有有意义的反馈,然而安全、标准化的培训机会受到伦理、后勤和资源限制。我们引入一个系统,旨在通过与具身虚拟患者的语音对话支持ACT导向的心理治疗培训。该系统使用大语言模型模拟患者行为,其行为基于真实治疗会话中提取的档案和可配置的临床场景,同时一个独立的自动评估器根据既定的ACT保真度标准为治疗师的回应提供逐轮反馈。该系统并非旨在取代督导,而是通过支持在低风险环境中进行实验、反思和即时反馈来促进刻意练习。执业心理学家的专家评估证实了患者行为的高度真实性,并表明即时的逐轮ACT反馈提高了治疗师对干预选择的意识,并使他们能够有效尝试替代回应。对49份治疗记录的定量评估确定GPT-4o-mini为最佳反馈模型,在复制人类督导的ACT保真度评分时实现了最低的平均绝对误差(MAE = 6.12),且具有统计显著性的一致性。这项工作展示了保真度感知的模拟患者作为心理治疗培训的可扩展补充的潜力。

英文摘要

Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designed to support ACT-oriented psychotherapy training through spoken dialogue with an embodied virtual patient. The system uses large language models to simulate patient behavior conditioned on profiles derived from real therapy sessions and configurable clinical scenarios, while a separate automated evaluator provides turn-by-turn feedback on therapist responses based on established ACT fidelity criteria. Rather than aiming to replace supervision, the system is intended to support deliberate practice by enabling experimentation, reflection, and immediate feedback in low-risk settings. Expert evaluation with practicing psychologists confirmed high realism in patient behavior and demonstrated that immediate turn-by-turn ACT feedback increased therapists' awareness of intervention choices and enabled effective experimentation with alternative responses. Quantitative evaluation across 49 therapy transcripts identified GPT-4o-mini as the optimal feedback model, achieving the lowest mean absolute error (MAE = 6.12) in replicating human supervisor ACT fidelity ratings with statistically significant agreement. This work demonstrates the potential of fidelity-aware simulated patients as a scalable complement to psychotherapy training.

2606.17781 2026-06-17 cs.AR cs.AI 新提交

MIVE: A Minimalist Integer Vector Engine for Softmax, LayerNorm, and RMSNorm Acceleration

MIVE:用于Softmax、LayerNorm和RMSNorm加速的极简整数向量引擎

Kosmas Alexandridis, Giorgos Dimitrakopoulos

发表机构 * Integrated Circuits Lab, Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Greece(希腊色雷斯德谟克利特大学(DUTH)电气与计算机工程系集成电路实验室)

AI总结 提出一种可编程的极简整数向量引擎MIVE,通过统一数据通路执行Softmax、LayerNorm和RMSNorm三种操作,最大化硬件共享,提升面积和硬件效率。

详情
AI中文摘要

大型语言模型(LLM)的快速增长加剧了对专用硬件加速器的需求,这些加速器必须满足严格的推理延迟和功耗约束。尽管矩阵乘法主导了整体计算工作负载,但非线性向量归一化操作(如LayerNorm、RMSNorm和Softmax)可能成为关键硬件瓶颈。现有加速器通常使用专用硬件块实现这些功能,导致资源重复和硅利用率低下。为解决这一限制,我们提出了一种极简整数向量引擎(MIVE),这是一种可编程架构,能够在统一数据通路内执行所有三种操作。通过利用LayerNorm、RMSNorm和Softmax之间的共同计算模式,所提出的向量引擎最大化硬件共享,同时减少实现开销。物理ASIC实现结果表明,MIVE提供全面的多函数支持,同时在面积和硬件效率方面优于大多数最先进的独立加速器。

英文摘要

The rapid growth of Large Language Models (LLMs) has intensified the need for specialized hardware accelerators that can satisfy stringent inference latency and power constraints. Although matrix multiplications dominate the overall computational workload, non-linear vector normalization operations, such as LayerNorm, RMSNorm, and Softmax, can become critical hardware bottlenecks. Existing accelerators typically implement these functions using dedicated hardware blocks, leading to duplicated resources and inefficient silicon utilization. To address this limitation, we propose a Minimalist Integer Vector Engine (MIVE), a programmable architecture capable of executing all three operations within a unified datapath. By exploiting common computational patterns across LayerNorm, RMSNorm, and Softmax, the proposed vector engine maximizes hardware sharing while reducing implementation overhead. Physical ASIC implementation results show that MIVE provides comprehensive multi-function support while achieving higher area and hardware efficiency than most state-of-the-art standalone accelerators.
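The "common computational patterns" mentioned in this abstract can be seen by writing the three operations side by side. The sketch below is a floating-point Python reference, not the paper's integer datapath, and it omits the learned gain and bias that LayerNorm and RMSNorm normally carry; the function names and the `eps` value are assumptions for illustration. Each operation reduces the vector to one or two scalars and then applies an elementwise rescale, which is the kind of structure a shared vector unit could reuse.

```python
import math


def softmax(x):
    """Reduce (max, then sum of exponentials), then scale each element."""
    peak = max(x)                              # reduction: max for stability
    exps = [math.exp(v - peak) for v in x]     # elementwise: shift and exp
    total = sum(exps)                          # reduction: sum
    return [e / total for e in exps]           # elementwise: scale by 1/sum


def layernorm(x, eps=1e-5):
    """Reduce (mean, then variance), then scale each centered element."""
    mean = sum(x) / len(x)                     # reduction: mean
    centered = [v - mean for v in x]           # elementwise: shift
    var = sum(c * c for c in centered) / len(x)  # reduction: sum of squares
    inv_std = 1.0 / math.sqrt(var + eps)
    return [c * inv_std for c in centered]     # elementwise: scale


def rmsnorm(x, eps=1e-5):
    """Reduce (mean of squares), then scale each element."""
    mean_sq = sum(v * v for v in x) / len(x)   # reduction: sum of squares
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]            # elementwise: scale


probs = softmax([1.0, 2.0, 3.0])
ln_out = layernorm([1.0, 2.0, 3.0, 4.0])
rms_out = rmsnorm([3.0, 4.0])
print(probs, ln_out, rms_out)
```

All three share the shift, sum-reduction, and scale steps; Softmax adds an exponential and the two norms add a reciprocal square root, which is where a unified datapath would need operation-specific support.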

2606.17767 2026-06-17 cs.HC cs.AI 新提交

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

与你的数据对话:探索具身对话作为个人健康反思的界面

Nikola Kovacevic, Bastien Husler, Di Zhuang, Rafael Wampfler, Barbara Solenthaler

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 提出一种通过具身对话代理与可穿戴健康数据交互的新范式,采用双代理设计(观察者提取统计特征,呈现者以“口语化统计”沟通),通过模拟自我用户研究(N=5)与传统仪表盘对比,评估感知理解、行动具体性和认知转变。

详情
Journal ref
Joint Proceedings of the ACM Intelligent User Interfaces (IUI) Workshops 2026, Paphos, Cyprus, July 13-16, 2026
AI中文摘要

来自可穿戴设备的个人健康数据通常通过图表和统计摘要的仪表盘呈现,要求用户主动解读模式和含义。我们探索了一种替代交互范式:通过一个具身对话代理与个人健康数据进行互动,该代理在与用户的对话中促进客观的数据反思。我们提出了一个系统,它将可穿戴数据的轻量级预处理与基于Unity的具身角色相结合。在内部,系统遵循双代理设计,其中观察者代理提取描述性统计和时间趋势,呈现者代理通过“口语化统计”传达这些发现,有意避免临床建议,以隔离交互模态的影响。我们通过一个模拟自我用户研究(N=5)采用被试内设计评估了这种方法。参与者采用来自LifeSnaps数据集的健康角色和目标,比较了传统仪表盘探索与具身对话反思。我们的评估侧重于感知理解、生成行动的具体性,以及从被动观看到主动意义建构的认知转变。本文贡献了一个功能原型、一个客观健康数据叙事生成的设计模式,以及关于具身性如何影响个人健康指标解释的早期实证见解。

英文摘要

Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

2606.17666 2026-06-17 cs.SE cs.AI 新提交

FacProcessTwin: An LLM-Based System for Process Twin Development

FacProcessTwin: 一种基于LLM的流程孪生开发系统

Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Prem Prakash Jayaraman

发表机构 * Swinburne University of Technology(斯威本科技大学)

AI总结 提出FacProcessTwin系统,利用大语言模型从工厂文档和操作员自然语言输入中自动生成流程模型并绑定实时数据,通过交互式流程图实现人机协同治理,在食品制造案例中准确率达95.2%,开发时间缩短至人工的1/6。

详情
AI中文摘要

流程孪生提供整个生产过程的实时表示。通过捕捉流程步骤如何相互作用,而不是像基于资产的数字孪生那样孤立地监控单个机器,它们有潜力推动整个过程的效率提升。然而,开发流程孪生成本高昂。它需要精确建模整个生产过程:其流程步骤、每个步骤使用的设备和产品特定设置,以及其流程变体。然后,生成的模型必须绑定到实时操作数据。我们提出FacProcessTwin,一个利用大语言模型(LLM)来减少开发时间的系统,它从工厂的流程文档和操作员的自然语言输入中构建流程孪生。FacProcessTwin生成完整的流程模型,然后自动将其流程步骤绑定到实时操作数据。生成的模型及其数据绑定被渲染为交互式流程图表,制造人员可以通过该图表监控和纠正系统的自主决策,例如解决安全关键绑定步骤中的不确定性。我们通过一家澳大利亚食品制造商的真实案例研究评估FacProcessTwin,涵盖16个生产流程,涉及冷藏、冷冻和无菌常温产品类别,并包括同一产品内的流程变体。结果表明,FacProcessTwin准确生成这些流程模型(与真实情况相比平均F1为95.2%),并且每个孪生的构建时间约为手动时间的六分之一。其人在环治理机制保持安全关键绑定的正确性:在模糊标签处,单次通过基线在75.0%的情况下静默错误绑定,而FacProcessTwin则推迟给操作员,错误绑定率为0。

英文摘要

Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, developing a process twin is costly. It requires accurately modelling the entire production process: its process steps, the equipment and product-specific settings each step uses, and its process variations. The resulting model must then be bound to live operational data. We present FacProcessTwin, a system that leverages a large language model (LLM) to reduce this development time, building a process twin from a plant's process documentation and natural-language input from an operator. FacProcessTwin generates this complete process model and then automatically binds its process steps to live operational data. The generated model and its data bindings are rendered as an interactive process diagram through which manufacturing personnel can monitor and correct the system's autonomous decisions, such as resolving uncertainty at safety-critical binding steps. We evaluate FacProcessTwin through a real-world case study of an Australian food manufacturer, covering 16 production process flows that span chilled, frozen, and aseptic shelf-stable product categories and include process variations within the same product. The results show that FacProcessTwin generates these process models accurately (a mean F1 of 95.2% against ground truth) and builds each twin in roughly a sixth of the manual time. Its human-in-the-loop governance then keeps the safety-critical bindings correct: at ambiguous tags where a single-pass baseline silently mis-binds 75.0% of the time, FacProcessTwin defers to the operator and mis-binds none.

2606.17664 2026-06-17 cs.IR cs.AI New submission

Temporal Preference Optimization for Unsupervised Retrieval

面向无监督检索的时间偏好优化

HyunJin Kim, Jaejun Shim, Young Jin Kim, JinYeong Bak

Affiliations * Microsoft, Redmond, USA; Sungkyunkwan University, Suwon, South Korea

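AI summary Proposes TPOUR, which combines Temporal Retrieval Preference Optimization (TRPO) with interpolation in a learned time embedding so that unsupervised dense retrievers capture temporal relevance, outperforming both supervised and unsupervised baselines on temporal information retrieval tasks.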

Comments Accepted to ICML 2026

Details
AI Chinese abstract

无监督稠密检索器通过对比学习从无标签文档中学习语义相似性,从而提供可扩展性,但它们难以捕捉时间相关性,会检索到语义相关但时间错位的文档——当文档集合跨越多个时间段时(例如,针对“2019年的总统是谁?”检索2018-2025年的文档会引入时间歧义),这是一个重要方面。现有方法依赖于带有显式时间戳的有监督训练,但这并不总是可行的。我们提出TPOUR(面向无监督检索器的时间偏好优化),它使用我们新颖的训练方法时间检索偏好优化(TRPO)。TRPO在时间维度上重新诠释偏好学习,引导检索器偏向时间对齐的文档。TPOUR进一步通过在学习到的时间嵌入中进行插值,泛化到未见的时间段,实现连续的时间对齐。在时间信息检索(T-IR)实验上,TPOUR优于无监督和有监督基线。与Qwen-Embedding-8B相比,尽管规模小约72.7倍,TPOUR Contriever在显式查询上的平均nDCG@5提高了+4.04(+12.15%),在隐式查询上提高了+4.98(+15.21%)。我们的代码可在以下网址获取:https://github.com/agwaBom/TPOUR。

English abstract

Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance and retrieve semantically related but temporally misaligned documents. This matters when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for "Who is the president in 2019?" introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which is not always feasible. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. In experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries. We provide our code at https://github.com/agwaBom/TPOUR.
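For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@5 is commonly computed for a single query. It is not taken from the TPOUR code base: the function names, the linear gain (relevance used directly rather than 2^rel - 1), and the toy relevance list are all assumptions made for illustration, and the ideal ranking is assumed to be a reordering of the same judged list.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: 1 = temporally aligned and relevant, 0 = not relevant.
ranked = [1, 0, 1, 0, 0]
print(f"nDCG@5 = {ndcg_at_k(ranked, 5):.3f}")  # prints "nDCG@5 = 0.920"
```

Averaging this per-query score over a query set, and scaling by 100, gives figures on the scale typically used in retrieval papers.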

2606.17646 2026-06-17 cs.HC cs.AI New submission

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

SketchXplain:基于草图的图像分类器直观视觉解释

Wencan Zhang, Mario Michelessa, Xuejun Zhao, Brian Y. Lim

Affiliations * National University of Singapore

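AI summary Proposes SketchXplain, which combines saliency maps, concept-bottleneck models, and sketch optimization to generate intuitive sketch-based visual explanations that improve the interpretability of image classifiers.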

Comments 14 pages, 6 figures, 4 tables. Submitted to TVCG

Details
AI Chinese abstract

显著性图可视化通过指向区域来解释基于图像的AI预测,但这些区域通常不直观且语义不清晰,存在可解释性差距。我们认为AI解释应该是直观的——与用户知识一致,同时简单且具有选择性以加速解释。受艺术绘画启发,我们提出SketchXplain,为直观的基于图像的可解释AI(XAI)生成基于草图的视觉解释。结合显著性图、概念瓶颈模型和草图优化技术,SketchXplain整合显著性以选择一致的观察伪影、概念以实现知识一致性、线索以表示它们,以及抽象以实现简洁性。在面部表情识别上的评估、建模和用户研究表明,与显著性图或简单绘图相比,SketchXplain支持更快速的解释,且可视化更一致。在皮肤病变诊断上的进一步评估发现,SketchXplain更一致地可视化疾病症状,更好地支持非专业诊断。因此,这项工作展示了草图在直观、简单、一致和快速的基于图像的XAI可视化中的价值。

English abstract

Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive -- coherent with user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques in saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. In an evaluation on face expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.

2606.17588 2026-06-17 cs.SE cs.AI New submission

Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

理解LLM在标题-摘要筛选中的作用:从分歧到建议

Mika Mäntylä, Patricia Matsubara, Katia Romero Felizardo, Miikka Kuutila, Marco Gerosa, Savio de Sousa Sampaio, Tayana Conte, Igor Steinmacher

Affiliations * University of Helsinki, Finland; UFMS, Brazil; UTFPR – Federal University of Technology - Paraná, Brazil; LUT University, Finland; Northern Arizona University, United States; UFAM, Brazil

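AI summary This study qualitatively analyzes why LLMs and humans disagree in title-abstract screening for systematic reviews and proposes recommendations such as validating semantic understanding, using multiple LLMs, and focusing on borderline cases.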

Comments 14 pages + references. Accepted for publication in the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2026)

Details
AI Chinese abstract

多项研究探讨了在系统综述(SRs)中使用大型语言模型(LLMs)进行标题-摘要筛选,报告了混合的准确性。然而,可靠性问题仍未得到充分解决。在本研究中,我们超越了定量的人机一致性指标,定性调查了LLMs失败的方式和原因。我们还提出了可操作的建议。我们分析了六个软件工程SRs和超过1000篇主要研究论文中LLMs与研究人员之间的分歧。对于每个SR,论文由人类专家和LLMs以零样本模式独立筛选,得到的Kappa值在0.52到0.77之间。定性分析表明,人机分歧源于反复出现的可识别原因,例如关键术语的边界模糊、关键词过度强调和错误的话题推断。基于这些发现,我们提出了建议,例如在部署前验证语义理解、运行多个LLMs以及将验证工作集中在边界案例上。未来的研究需要验证我们建议的影响,并且需要社区努力制定关于在SRs中使用LLMs的规范性指南。

English abstract

Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened independently by human experts and LLMs in zero-shot mode, resulting in Kappa values ranging from 0.52 to 0.77. Qualitative analysis suggests that human-LLM disagreement results from recurring, identifiable causes, such as boundary ambiguity in key terms, keyword overemphasization, and incorrect topic inference. Based on these findings, we propose recommendations such as validating semantic understanding before deployment, running multiple LLMs, and focusing validation efforts on borderline cases. Future studies are needed to validate the impact of our recommendations, and community efforts are needed to develop normative guidelines on LLM usage in SRs.
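As context for the Kappa range reported above, here is a minimal sketch of Cohen's kappa for two screeners labelling the same papers. The abstract does not say which Kappa variant or tooling the authors used, so treating it as two-rater Cohen's kappa is an assumption, and the function name and the ten include/exclude decisions below are invented purely for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label for everything
    return (observed - expected) / (1.0 - expected)


# Hypothetical screening decisions for ten papers (not data from the study).
human = ["include"] * 3 + ["exclude"] * 7
llm = ["include", "include", "exclude"] + ["exclude"] * 6 + ["include"]
print(f"kappa = {cohen_kappa(human, llm):.2f}")  # prints "kappa = 0.52"
```

In this toy case the raters agree on 8 of 10 papers, yet kappa is only about 0.52 because most of that agreement is expected by chance when exclusions dominate.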