2606.17591 2026-06-17 cs.AI new submission

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

Affiliations: AWS Generative AI Innovation Center; Amazon Web Services (AWS); BingX Group Limited

AI summary: Addresses the retention-forgetting dilemma of LLM agents in non-stationary environments with a three-layer architecture (rules, evidence, skills) whose feedback-driven curation loop provides insight governance; a financial forecasting case study shows the approach substantially improves accuracy and risk-adjusted returns.

Comments: Accepted to the ICML 2026 RLxF: Reinforcement Learning from World Feedback Workshop, RLxF@ICML 2026, Seoul, South Korea

Abstract

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.
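As a rough illustration of the rules / evidence / skills split described in this abstract, the sketch below keeps a per-rule outcome log, retires and revives rules according to recent reliability, and injects only the most reliable active rules. The class and function names, the thresholds, and the sliding-window reliability score are assumptions made for illustration; the paper's actual curation loop is not specified in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A verbal rule together with its evidence log (hypothetical structure)."""
    text: str
    outcomes: list = field(default_factory=list)  # True = rule helped, False = it hurt
    active: bool = True

    def reliability(self, window: int = 5) -> float:
        """Hit rate over the most recent `window` episodes (0.5 when no evidence yet)."""
        recent = self.outcomes[-window:]
        return sum(recent) / len(recent) if recent else 0.5


def curate(rules, retire_below=0.4, revive_above=0.6):
    """Non-monotonic lifecycle: rules are deactivated rather than deleted, and can return.

    Assumes evidence keeps being logged for inactive rules (shadow evaluation).
    """
    for rule in rules:
        score = rule.reliability()
        if rule.active and score < retire_below:
            rule.active = False
        elif not rule.active and score > revive_above:
            rule.active = True
    return [r for r in rules if r.active]


def select_context(rules, max_rules=3):
    """Skill layer: pick the most reliable active rules; an empty list means abstain."""
    active = sorted((r for r in rules if r.active),
                    key=lambda r: r.reliability(), reverse=True)
    return [r.text for r in active[:max_rules]]
```

Keeping retired rules in the store, instead of deleting them, is what lets a stale insight come back when its conditions recur.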


2606.17645 2026-06-17 cs.AI cs.CL cs.LG new submission

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

Affiliations: University of Michigan; Alibaba Group; Purdue University; McMaster University; University of Pennsylvania

AI summary: Proposes SkillMigrator, an agent that learns transferable interaction patterns (TIPs) and matches layout structure rather than element references to reuse skills across sites, cutting the LLM-action count on successful WebArena and Mind2Web trajectories by 8-10%.

Abstract

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.
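To make "retrieve by layout similarity rather than element reference" concrete, here is a minimal sketch. It assumes a structural sketch is simply a multiset of accessibility roles and that similarity is a weighted Jaccard score; both choices, and every name below, are hypothetical stand-ins, since the abstract does not say how SkillMigrator represents or compares layouts.

```python
from collections import Counter


def sketch(roles):
    """Structural sketch: a multiset of accessibility roles, ignoring ids and text."""
    return Counter(roles)


def layout_similarity(a, b):
    """Weighted Jaccard similarity between two role multisets (1.0 = same layout)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def retrieve(tips, live_roles, threshold=0.5):
    """Return the name of the best-matching stored skill, or None if nothing is close."""
    live = sketch(live_roles)
    scored = [(layout_similarity(stored, live), name) for name, stored in tips.items()]
    best_score, best_name = max(scored, default=(0.0, None))
    return best_name if best_score >= threshold else None


# Two stored skills, each paired with the sketch taken at induction time.
tips = {
    "search_and_filter": sketch(["searchbox", "button", "list", "listitem", "listitem"]),
    "login": sketch(["textbox", "textbox", "button"]),
}
```

Because the match is on structure, a search page on an unseen site with different element ids can still trigger the stored skill; grounding its references on the live page is a separate step not shown here.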


2606.17871 2026-06-17 cs.AI new submission

StepGuard: Guarding Web Navigation via Single-Step Calibration

Zhihao Cui, Yuchen Zhang, Xiyang Sun, Yaxiong Wang, Li Zhu, Jinpeng Hu, Liu Liu, Mengjia Li, Yujiao Wu

Affiliations: School of Software Engineering, Xi’an Jiaotong University; School of Computer Science and Information Engineering, Hefei University of Technology; Xiamen University; Zhejiang Lab; CSIRO

AI summary: Targets single-step fragility in web navigation with the StepGuard framework: Dynamic Dual-Policy Optimization (DDPO) resolves reward conflict, and Confidence-Guided Adaptive Navigation Reflection (CANR) calibrates single-step errors, significantly improving navigation and answer accuracy.

Abstract

Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle the reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for question-answering to mitigate reward conflict. To calibrate the single-step error, we propose Confidence-Guided Adaptive Navigation Reflection (CANR), a mechanism that estimates per-step confidence, triggers reflection only when necessary, and uses contrastive rewards to encourage self-correction to calibrate the single-step inaccuracy. With the above as the main components, we finally develop our StepGuard, a new framework of Guarding Web Navigation via Single-Step Calibration. Experiments demonstrate that our approach significantly improves navigation and answer accuracy, setting new state-of-the-art performance on standard web navigation benchmarks.
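The control flow behind "trigger reflection only when necessary" can be sketched in a few lines. This is an assumption-laden illustration, not CANR itself: `propose`, `reflect`, and the fixed threshold are hypothetical, and the abstract does not say how per-step confidence is estimated or how the contrastive reward is computed.

```python
def guarded_step(propose, reflect, observation, threshold=0.7):
    """Run one navigation step and return (action, reflected).

    propose(observation) -> (action, confidence in [0, 1])   # hypothetical policy call
    reflect(observation, action) -> revised action           # hypothetical reflection call
    """
    action, confidence = propose(observation)
    if confidence >= threshold:
        return action, False  # confident enough: skip the extra reflection call
    return reflect(observation, action), True
```

Gating on confidence keeps the cost of reflection proportional to the number of uncertain steps rather than to the trajectory length.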


2606.17929 2026-06-17 cs.AI new submission

Learning Red Agent Policies from Observations for Neurosymbolic Autonomous Cyber Agents

Ankita Samaddar, Sandeep Neema, Daniel Balasubramanian, Xenofon Koutsoukos

Affiliations: MIT

AI summary: Because red-agent actions are unobservable during cyber-attacks, proposes an imitation-learning-based policy learning technique that predicts red-agent behavior from network observations and defender actions; integrated with a neurosymbolic defense agent, it achieves high prediction accuracy.

Abstract

With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems, i.e., the cyber-attacker's (red agent's) actions are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker's intrusion levels. To address this, we propose a Policy Learning Technique using imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.
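For discrete states and actions, the simplest form of imitation learning is tabular behavior cloning: count which red action followed each (observation, defender action) pair and predict the most frequent one. The sketch below shows only that baseline. The class name, the string encodings, and the assumption that labeled red actions are available from simulation logs at training time are all illustrative and not taken from the paper.

```python
from collections import Counter, defaultdict


class TabularRedPolicy:
    """Behavior cloning over discrete (observation, defender action) pairs."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, demonstrations):
        """demonstrations: iterable of (observation, blue_action, red_action) triples."""
        for observation, blue_action, red_action in demonstrations:
            self.counts[(observation, blue_action)][red_action] += 1

    def predict(self, observation, blue_action):
        """Most frequent red action for this context, or None if it was never seen."""
        seen = self.counts.get((observation, blue_action))
        return seen.most_common(1)[0][0] if seen else None
```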


2510.19838 2026-06-17 cs.AI cs.CL cs.LG updated version

Mind-Studio: Executable World Models and Lookahead Evaluation for Partially Observable Games

Yifei Dong, Mingen Zheng, Linquan Wu, Jeff Z. Pan, Jiaxin Bai

Affiliations: Hong Kong University of Science and Technology; City University of Hong Kong; University of Edinburgh; Hong Kong Baptist University

AI summary: Proposes Mind-Studio, a framework that uses large language models to synthesize executable pygame-style world models from trajectories and evaluates them with a K-step lookahead fidelity protocol, markedly improving prediction accuracy and subgoal verification on games such as Montezuma's Revenge.

Comments: 12 pages, 2 figures

Abstract

World-model synthesis aims to turn interaction experience into an internal model of environment dynamics. Existing symbolic approaches often fit observed transitions or mixtures of local rules, but they do not produce a complete executable program that can run independently of the real environment. We present Mind-Studio, a framework that synthesizes executable pygame-style world models from state-action-next-state trajectories using large language models. Mind-Studio combines entropy-selected traces with a lightweight game skill file containing object, action, and static scene information extracted from screenshots. We evaluate synthesis quality with a K-step lookahead fidelity protocol that compares generated world-model rollouts against Real-ALE rollouts from the same state. On Montezuma's Revenge, Mind-Studio improves chosen-action next-state prediction from 0.3% for PoE-World to 48.7% while verifying 5 of 8 subgoals; across Alien, Assault, and Skiing, it achieves stronger branch-level fidelity than prior learned lookahead sources.
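One plausible reading of a K-step lookahead fidelity check is: roll the synthesized model and the real environment forward from the same state under the same actions, and count the steps on which they agree. The sketch below implements that reading on a toy one-dimensional world. The function names, the exact-match criterion, and the toy dynamics are assumptions; the paper's protocol may score rollouts differently.

```python
def k_step_fidelity(model_step, real_step, state, actions):
    """Fraction of the K steps on which the model rollout matches the real rollout."""
    model_state = real_state = state
    matches = 0
    for action in actions:
        model_state = model_step(model_state, action)
        real_state = real_step(real_state, action)
        matches += model_state == real_state
    return matches / len(actions) if actions else 1.0


def real_step(position, move):
    """Toy ground truth: a corridor with walls at 0 and 3."""
    return min(max(position + move, 0), 3)


def model_step(position, move):
    """Toy synthesized model that forgot the walls."""
    return position + move


fidelity = k_step_fidelity(model_step, real_step, state=2, actions=[1, 1, -1, -1])
```

Because the two rollouts are never re-synchronized, a single early error (here, walking through the wall) keeps lowering the score on later steps, which is the compounding behavior a multi-step metric is meant to expose.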


2508.02721 2026-06-17 cs.SE cs.AI cs.PL updated version

Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Kun Zhao

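Affiliations: Alibaba

AI summary: Proposes the "Blueprint First, Model Second" framework, which decouples workflow logic into a source-code blueprint executed by a deterministic engine while the LLM handles only subtasks; on TravelPlanner the final pass rate improves by 97.6% and constraint violations fall by 96.0%.

Comments: 12 pages, 7 figures, 6 tables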

Control-Plane Placement Shapes Forgetting: A Study of Agent Memory Architectures Across Thirteen System Configurations

Dongxu Yang

Affiliations: DeepLethe

AI summary: Studies how the LLM's position in an agent memory pipeline (control plane vs. recall plane) affects forgetting failure modes; experiments with 13 configurations on a 385-case adversarial test set reveal complementary coverage across three placement regimes, and the paper introduces the ForgetEval evaluation suite.

Comments: 25 pages including appendices. Code, benchmark, and adapters released under MIT at https://github.com/deeplethe/lethe

Abstract

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
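The scoring idea mentioned in this abstract (deterministic substring match, with an explicit N/A when a memory store does not support an operation) can be sketched as follows. This is not ForgetEval's code: the function names, the case-insensitive matching, and the choice to exclude N/A cases from the denominator are assumptions made for illustration.

```python
def score_case(answer, must_contain=(), must_not_contain=()):
    """Return True/False for a scored case, or None (N/A) if the adapter gave no answer."""
    if answer is None:
        return None
    text = answer.lower()
    has_required = all(s.lower() in text for s in must_contain)
    has_forbidden = any(s.lower() in text for s in must_not_contain)
    return has_required and not has_forbidden


def pass_rate(results):
    """Pass rate over scored cases only; N/A cases are reported, not counted as failures."""
    scored = [r for r in results if r is not None]
    return sum(scored) / len(scored) if scored else None
```

For a forgetting test, the forbidden substring is the fact that should have been superseded or purged, so a store that still surfaces the old value fails even when it also returns the new one.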


2606.17851 2026-06-17 cs.AI cs.LO new submission

A homotopy-type-theoretic generalization of neurosymbolic inference

Fernando Zhapa-Camacho, Robert Hoehndorf

Affiliations: King Abdullah University of Science and Technology; KAUST Center of Excellence for Smart Health (KCSH); KAUST Center of Excellence for Generative AI

AI summary: Replaces sets with types from homotopy type theory, generalizing the belief-weighted sum computed by neurosymbolic systems to a belief-weighted homotopy cardinality that retains symmetries and proof multiplicity, and proves the classical functional is a special case, thereby avoiding reasoning shortcuts.

Abstract

A wide range of neurosymbolic (NeSy) systems compute one functional: a belief-weighted sum of a logical quantity over a space of $σ$-structures, of which weighted model counting, fuzzy logic, and probabilistic logic are special cases. This account is built on sets, and a set deliberately forgets two things that are important for NeSy: when two $σ$-structures are the same up to a symmetry of the theory, and how many distinct proofs witness a query. Replacing the underlying sets by types, in the sense of homotopy type theory, preserves this information, and turns this functional into a belief-weighted homotopy cardinality, a notion of size that counts each object in inverse proportion to its symmetries. We develop the framework from scratch for NeSy systems, prove a conservativity theorem that recovers the classical functional when symmetries are trivial, and show that the symmetry our framework exposes is exactly the one behind reasoning shortcuts. The payoff is concrete: the shortcut-aware concept posterior that recent methods reach by ensembling or expressive density estimation is the only symmetry-invariant point of the confusion-set simplex, computable in closed form by averaging a single model over the symmetry group. On MNIST reasoning-shortcut benchmarks this single-model wrapper is better calibrated than a diversity-trained ensemble, while leaving label accuracy and identifiable concepts untouched. Code is freely available at https://github.com/bio-ontology-research-group/hott-nesy.
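The phrase "averaging a single model over the symmetry group" can be illustrated with a toy case. Suppose the label is the XOR of two binary concepts, so flipping both concepts leaves every label unchanged; a model that confidently picks one of the two indistinguishable assignments has taken a reasoning shortcut. Averaging its concept posterior over the group {identity, flip} gives a posterior that no longer prefers either. The XOR setup and all names are assumptions for illustration and do not reproduce the paper's construction.

```python
def average_over_group(posterior, group):
    """Average a concept posterior over a finite symmetry group acting on concepts."""
    return {c: sum(posterior[g(c)] for g in group) / len(group) for c in posterior}


def identity(concepts):
    return concepts


def flip(concepts):
    """Flip both binary concepts; XOR of the pair is unchanged by this map."""
    a, b = concepts
    return (1 - a, 1 - b)


# A shortcut-prone posterior over (c1, c2) for an input whose XOR label is 1.
posterior = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.1, (1, 1): 0.0}
symmetric = average_over_group(posterior, [identity, flip])
```

The averaged posterior assigns equal mass to (0, 1) and (1, 0), which is the honest answer when the supervision cannot tell them apart, and the label distribution is untouched because both assignments have the same XOR.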


2606.17882 2026-06-17 cs.AI new submission

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

Przemysław Andrzej Wałęga, Bernardo Cuenca Grau

Affiliations: Queen Mary University of London; University of Oxford

AI summary: Takes a semantic view of the logical expressiveness of GNN classifiers preserved under structural properties (embeddings, injective homomorphisms, homomorphisms), proving that each preservation property corresponds to a fragment of graded modal logic and giving a matching GNN architecture.

Comments: 20 pages

Abstract

Bridges between graph neural networks (GNNs) and logical formalisms have been established by fixing architectural choices, such as the types of aggregation, combination, and activation functions. These choices define restricted classes of GNNs for which tight correspondences with logical formalisms can be obtained, by showing that logical formulae can be translated into equivalent GNNs and, conversely, that GNNs can be translated into equivalent formulae. In this paper we take a semantic perspective by establishing the logical expressiveness of classes of GNN classifiers that are preserved under structural properties: embeddings (extensions), injective homomorphisms, and homomorphisms. We show that, for each such property, there exists a fragment of graded modal logic characterising the class of GNNs. In particular, preservation under embeddings, injective homomorphisms, and homomorphisms corresponds to existential graded modal logic, its existential-positive fragment, and existential-positive modal logic, respectively. These results characterise the expressiveness of broad classes of GNNs independently of specific architectural choices, but we also show that each of these classes admits a GNN architecture of the same expressiveness. Technically, our approach uses a new well-quasi-order result for trees of bounded height, yielding finite representations of unravelling-invariant classes.


2606.18098 2026-06-17 cs.AI new submission

IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus

Elliot Jones, William Knottenbelt

Affiliations: Imperial College London

AI summary: Improves the IsabeLLM automated theorem proving tool with retrieval-augmented generation, error tracing, and counterexample generation to enrich the LLM's context, adds compatibility with the latest Isabelle and Sledgehammer, and applies it to verifying Bitcoin's Proof of Work consensus.

Abstract

Advances in Artificial Intelligence (AI) have led AI for Theorem Proving to become a promising means of formally verifying computer systems. Whilst formal verification is traditionally reserved for safety-critical systems due to the required amount of expertise and effort, AI can help to automate a large amount of this workload and make it far more accessible. Blockchain-based systems are becoming increasingly popular and are frequently targeted by malicious actors, often resulting in huge financial losses, highlighting the need to better verify these systems and mitigate vulnerabilities. Arguably the most important component of these systems is the consensus protocol, which allows nodes to agree on decisions in a potentially adversarial environment. In this paper, we improve upon IsabeLLM, the automated theorem proving tool in Isabelle. Namely, we implement a Retrieval-Augmented Generation framework, Error tracing and counterexample generation for improved context supplied to the Large Language Model. Compatibility with the latest version of Isabelle and Sledgehammer is also implemented for improved efficiency. We compare the performance of the two versions of IsabeLLM in their ability to complete the verification of Bitcoin's Proof of Work consensus.


2606.17073 2026-06-17 cs.RO cs.AI cross-list

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

Bastien Dussard, Guillaume Sarthou

Affiliations: LAAS-CNRS, Department of Robotics, Toulouse, France

AI summary: Proposes using large language models to automatically generate a robot's semantic ontology from URDF files, with majority voting and syntactic validation keeping the output aligned with an existing ontology; preliminary experiments indicate the method bridges low-level descriptions and high-level knowledge representation.

Journal ref: 18th International Conference on Social Robotics (ICSR 2026), University of London, Jul 2026, Londres, United Kingdom

Abstract

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
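The reliability step described here (several LLM queries, schema validation, then a majority vote) might look like the sketch below. The function name, the strict-majority rule, and the example concept names are hypothetical; the abstract does not give the pipeline's actual thresholds or ontology vocabulary.

```python
from collections import Counter


def vote(answers, allowed_concepts, min_share=0.5):
    """Pick the ontology concept most LLM queries agree on, or None if there is no majority.

    Answers outside the ontology are discarded first (schema-level validation),
    but they still count in the denominator, so noisy runs lower the agreement share.
    """
    valid = [a for a in answers if a in allowed_concepts]
    if not valid:
        return None
    concept, count = Counter(valid).most_common(1)[0]
    return concept if count / len(answers) > min_share else None
```

For a URDF link named `gripper_l_finger`, five queries returning `Gripper` three times, an out-of-ontology `Hand` once, and `Wheel` once would resolve to `Gripper`; returning None instead of a weak plurality leaves ambiguous links for manual review.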


2606.17581 2026-06-17 cs.PL cs.AI cross-list

Visored: A Controlled-Natural-Language Prover for LLM-Generated Mathematics

Xiyu Zhai, Xinyi Chen, Yiping Wang, Runlong Zhou, Liao Zhang, Simon S. Du

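Affiliations: University of Washington; University of Innsbruck

AI summary: Proposes a dependently typed prover whose surface syntax mimics the natural language of mathematics and whose rule-driven automation layer fills in routine steps, so LLMs can use it effectively on the miniF2F benchmark without dedicated training data; it emits checkable Lean files.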

2605.27023 2026-06-17 cs.AI updated version

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototype

Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

Affiliations: College of Electronics and Information Engineering, Shenzhen University

AI summary: Proposes an architecture for distributed general-purpose agent networks in which a protocol adaptation layer connects upper-level task semantics with lower-level network operations, addressing three core problems (semantic announcement propagation, trusted identity with multi-topic reputation, and semantic-gradient mechanism design) to enable open, trustworthy agent collaboration.

Abstract

Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.
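MG-EigenTrust is the paper's own multi-topic variant and is not reproduced here. As background only, the sketch below shows the classic EigenTrust iteration it is named after: global trust is the fixed point of t = (1 - alpha) * C^T t + alpha * p, where C is the row-normalized local-trust matrix and p is a distribution over pre-trusted peers. The parameter values and the three-peer example are illustrative assumptions.

```python
def eigentrust(local_trust, pretrusted, alpha=0.15, iterations=50):
    """Classic EigenTrust power iteration over a dense local-trust matrix.

    local_trust[j][i] is how much peer j trusts peer i (non-negative).
    pretrusted is a probability distribution over peers.
    """
    n = len(local_trust)
    normalized = []
    for row in local_trust:
        total = sum(row)
        # A peer that trusts nobody falls back to the pre-trusted distribution.
        normalized.append([v / total for v in row] if total > 0 else list(pretrusted))
    trust = list(pretrusted)
    for _ in range(iterations):
        trust = [
            (1 - alpha) * sum(normalized[j][i] * trust[j] for j in range(n))
            + alpha * pretrusted[i]
            for i in range(n)
        ]
    return trust


# Peers 0 and 1 vouch for each other; peer 2 vouches for both but nobody vouches for it.
scores = eigentrust([[0, 1, 0], [1, 0, 0], [1, 1, 0]], [1 / 3, 1 / 3, 1 / 3])
```

The alpha term anchors the scores to the pre-trusted peers, which is what limits how much a clique of colluding peers can inflate one another in the classic algorithm.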


2606.17847 2026-06-17 cs.AI cs.LG new submission

WallZero: Mastering the Game of WallGo with Strategic Analysis

Hsing-Yu Chen, Jérôme Arjonilla, I-Chen Wu, Ti-Rong Wu

Affiliations: National Yang Ming Chiao Tung University; Academia Sinica

AI summary: Proposes WallZero, an AlphaZero-based agent with tailored action and feature designs that defeats professional Go players at WallGo, and uses it to analyze game fairness and key strategies.

Comments: Accepted by the Computers and Games conference (CG 2026)

Abstract

WallGo is a recently introduced strategic board game popularized by the 2025 Netflix series The Devil's Plan. Although played on a small 7 x 7 board, its combination of stone movement and wall placement yields high game-tree complexity and intricate strategic interactions. Despite its growing popularity, WallGo remains underexplored. This paper presents WallZero, an AlphaZero-based agent for the two-player WallGo setting. We introduce tailored action and feature designs to improve playing performance significantly. In the evaluation, WallZero defeats two professional Go players who participated in this study, securing on average 1.98x more territory per game. Beyond its strength, we use WallZero to assess game fairness and identify key strategies for mastering WallGo. Interestingly, our results show that the opening used in the Netflix series yields a more balanced game. Our code is available at https://rlg.iis.sinica.edu.tw/papers/wallzero.


2606.17081 2026-06-17 cs.AR cs.AI cs.DC cs.GT cs.PF cross-list

The Price of Anarchy in Disaggregated Inference

Athos Georgiou

Affiliations: NCA

AI summary: Analyzes resource allocation in disaggregated inference architectures through game theory and proposes an adaptive controller that lowers the Price of Anarchy, achieving up to a 3.1x PoA reduction on an NVIDIA B200 cluster.

Comments: 38 pages, 7 figures, 8 tables. Measurements on a 3-node NVIDIA B200 cluster running NVIDIA Dynamo v0.9.0

Abstract

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
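For readers unfamiliar with the term, the Price of Anarchy is the ratio between the social cost at the worst selfish equilibrium and the cost of the best coordinated allocation. The sketch below computes it for the textbook Pigou routing example, which is unrelated to the paper's serving setup: one unit of traffic chooses between a congestible link whose latency equals its load and a link with constant latency 1. The paper's empirical estimator PoA-hat is defined in its Section 6.4 and is not reproduced here.

```python
def total_cost(x):
    """Average latency when a fraction x of traffic uses the congestible link.

    Congestible link: latency x (equal to its load). Constant link: latency 1.
    """
    return x * x + (1 - x) * 1.0


# Selfish equilibrium: the congestible link is never slower than the constant one,
# so every request takes it (x = 1).
equilibrium_cost = total_cost(1.0)

# Social optimum: search the split that minimizes average latency.
optimal_cost = min(total_cost(i / 1000) for i in range(1001))

price_of_anarchy = equilibrium_cost / optimal_cost
```

The optimum splits traffic evenly for a cost of 0.75, so selfish routing is a factor of 4/3 worse. In this toy case the ratio is bounded.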


2606.17203 2026-06-17 cs.SE cs.AI cross-list

Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

Mohamed Essam, Kareem Wael, Azza Hassan, Ahmed Haitham, Mahmoud Soliman, Samer Saber, Ibrahim Habib

Affiliations: CairoMotive, Cairo, Egypt

AI summary: Proposes a trust-aware coordination framework that uses a shared knowledge graph and calibrated confidence scores, together with a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, to counter error propagation in multi-agent systems.

Abstract

Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.
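The consistency protocol named in this abstract combines threshold gating with divergence detection between derivation-time and validation-time confidence. A minimal decision function along those lines is sketched below. The thresholds, the three outcome labels, and the order of the checks are assumptions for illustration rather than the paper's protocol.

```python
def gate_link(derivation_conf, validation_conf, accept_at=0.8, max_divergence=0.3):
    """Decide what to do with a candidate traceability link.

    derivation_conf: confidence recorded by the agent that created the link.
    validation_conf: confidence assigned later by the validating agent.
    """
    if abs(derivation_conf - validation_conf) > max_divergence:
        return "conflict"  # the two agents disagree: send to conflict resolution
    if min(derivation_conf, validation_conf) >= accept_at:
        return "accept"    # both are confident: pass downstream
    return "review"        # agreement, but not confident enough to propagate
```

Checking divergence before the threshold means a link that one agent rates highly and another rates poorly is never silently accepted or silently dropped.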

2606.17627 2026-06-17 cs.CV cs.AI cross-list

一种面向策略逻辑的策略综合的神经符号方法

Marco Aruta, Vadim Malvone, Aniello Murano, Domenico Parente, Luca Rizzuti

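Affiliations: University of Naples Federico II; LTCI, Télécom Paris, Institut Polytechnique de Paris; Università degli Studi di Salerno

AI summary: Proposes a neuro-symbolic framework that uses a large language model as a strategy-generation oracle and a model checker for formal verification, achieving high-accuracy strategy synthesis in NatATL.

AI Chinese abstract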

推理智能体通过策略交互能实现什么是多智能体系统（MAS）中的核心挑战。用于策略能力的逻辑（如ATL）提供了严格的方法，但其采用常因策略综合的计算成本而受阻。我们引入了一种神经符号框架，将大语言模型（LLM）集成到MAS的模型检查流程中。LLM作为策略生成预言机，提出候选策略，然后由标准MAS模型检查器进行形式验证。这种生成-认证架构利用LLM引导来导航大型组合策略空间，同时保持形式正确性：生成的策略仅在通过验证器认证后才被接受。我们为NatATL中的有界策略推理实例化了该框架，并引入了首个NatATL策略综合数据集，包含4211个实例。使用开源Qwen3-32B模型的实验表明，我们的认证流程在策略综合结果上达到了92%的准确率。

English abstract

Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92% accuracy on strategy-synthesis outcomes.
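The generate-and-certify pattern can be sketched as a loop in which untrusted proposals are accepted only after a verifier approves them. Everything below is a toy stand-in: the transition system, the fixed candidate list that plays the role of the LLM oracle, and the bounded reachability check that plays the role of the model checker are all assumptions for illustration, not the paper's NatATL pipeline.

```python
from typing import Callable, Dict, Iterable, Optional

Strategy = Dict[str, str]  # maps a state to the action chosen in that state

# Toy deterministic transition system (hypothetical).
TRANSITIONS = {
    ("s0", "a"): "s1",
    ("s0", "b"): "s0",
    ("s1", "a"): "s0",
    ("s1", "b"): "goal",
}


def certify(strategy: Strategy, bound: int = 3) -> bool:
    """Stand-in verifier: does the strategy reach 'goal' within `bound` steps?"""
    state = "s0"
    for _ in range(bound):
        if state == "goal":
            return True
        state = TRANSITIONS[(state, strategy[state])]
    return state == "goal"


def generate_and_certify(
    candidates: Iterable[Strategy],
    verifier: Callable[[Strategy], bool],
    max_candidates: int,
) -> Optional[Strategy]:
    """Return the first candidate the verifier certifies, or None."""
    for index, candidate in enumerate(candidates):
        if index >= max_candidates:
            break
        if verifier(candidate):  # soundness: nothing is accepted uncertified
            return candidate
    return None


# A fixed list stands in for the LLM oracle's proposals.
proposals = [
    {"s0": "b", "s1": "b"},
    {"s0": "a", "s1": "a"},
    {"s0": "a", "s1": "b"},
]
winner = generate_and_certify(proposals, certify, max_candidates=5)
```

The design point is that the generator can be arbitrarily unreliable: a bad proposal costs one verifier call but can never be returned as a result.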

2606.18111 2026-06-17 cs.LG cs.AI cross-list

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

多目标强化学习中学习公平帕累托最优策略

Umer Siddique, Peilang Li, Yongcan Cao

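AI summary: Addresses the inability of fixed-preference methods in multi-objective reinforcement learning to provide diverse policies by proposing a multi-policy approach based on the generalized Gini welfare function that learns a set of fair Pareto-optimal policies.

Comments Accepted at the Reinforcement Learning Conference (RLC) 2025. 12 pages main + appendix, 8 figures, 4 tables

AI Chinese abstract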

公平性是多目标强化学习（MORL）决策中的一个重要方面，策略必须确保在多个潜在冲突的目标上既达到最优又实现公平。虽然单策略MORL方法可以使用福利函数（如广义基尼福利函数GGF）为固定的用户偏好学习公平策略，但它们无法提供动态或未知用户偏好所需的多样的策略集。为解决这一局限性，我们形式化了多策略MORL中的公平优化问题，其目标是学习一组帕累托最优策略，确保在所有可能的用户偏好下实现公平。我们的关键技术贡献有三点：（1）我们证明对于凹的、分段线性的福利函数（例如GGF），公平策略仍然在凸覆盖集（CCS）中，CCS是线性标量化下的近似帕累托前沿。（2）我们证明非平稳策略（通过累积奖励历史增强）和随机策略通过动态适应历史不公平性来改善公平性。（3）我们提出了三种新算法，包括将GGF与多策略多目标Q学习（MOQL）集成、用于学习非平稳策略的状态增强多策略MOQL，以及用于学习随机策略的新扩展。我们在多个领域评估了我们的算法，并将我们的方法与最先进的MORL基线进行了比较。实验结果表明，我们的方法学习了一组公平策略，能够适应不同的用户偏好。

English abstract

Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must ensure both optimality and equity across multiple, potentially conflicting objectives. While single-policy MORL methods can learn fair policies for fixed user preferences using welfare functions such as the generalized Gini welfare function (GGF), they fail to provide the diverse set of policies necessary for dynamic or unknown user preferences. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), which is an approximated Pareto front for linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms, which include integrating GGF with multi-policy multi-objective Q-Learning (MOQL), state-augmented multi-policy MOQL for learning non-stationary policies, and its novel extension for learning stochastic policies. We evaluate our algorithms across various domains and compare our methods against the state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.
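For readers unfamiliar with the generalized Gini welfare function, a minimal sketch of its standard definition follows: the per-objective returns are sorted in ascending order and combined with non-increasing weights, so the worst-off objective counts most. The weight vector and the two return vectors are made-up illustration values, not data from the paper.

```python
from typing import Sequence


def ggf(utilities: Sequence[float], weights: Sequence[float]) -> float:
    """Generalized Gini welfare: non-increasing weights applied to ascending utilities."""
    if len(utilities) != len(weights):
        raise ValueError("utilities and weights must have the same length")
    if any(earlier < later for earlier, later in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    # Sorting makes the function symmetric in the objectives; the largest
    # weight always lands on the smallest utility.
    return sum(w * u for w, u in zip(weights, sorted(utilities)))


weights = [0.5, 0.3, 0.2]        # illustrative weights summing to 1
balanced = [4.0, 4.0, 4.0]       # total return 12, spread evenly
skewed = [10.0, 1.0, 1.0]        # same total, concentrated on one objective

balanced_welfare = ggf(balanced, weights)
skewed_welfare = ggf(skewed, weights)
```

Both vectors have the same total return, but the balanced one scores higher under GGF, which is the sense in which the welfare function encodes fairness.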

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA version update

从酝酿到解析：追踪LLM中代码推理的内部生命周期

Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang, Junxin Li, Yujia Li, Yu Huo, Ruixuan Wang

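Affiliations: South China University of Technology; Sun Yat-sen University; Tsinghua University; Shanghai Jiao Tong University; Nanjing University; The Chinese University of Hong Kong, Shenzhen; Hangzhou Dianzi University; Guangzhou College of Technology and Business

AI summary: Proposes a dual diagnostic framework (layer-wise linear probing and Context-Stripped Decoding) that reveals an internal lifecycle of LLM code reasoning, in which the model first brews the answer and then reaches one of four resolution outcomes (Resolved, Overprocessed, Misresolved, Unresolved); the brewing scaffold is stable, while resolution success varies with capability.

AI Chinese abstract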

标准准确率指标无法解释为什么LLM能处理变量追踪但在语义等价的循环上失败。我们研究了代码推理的内部生命周期，其中模型首先酝酿答案，使其在变得可自解码之前的许多层就线性可恢复，然后分化为四种解析结果之一：已解析、过度处理、错误解析或未解析。理解这一生命周期很重要，因为相似的任务准确率可能掩盖表面评估无法检测的根本不同的失败模式。我们引入了一个双重诊断框架，将逐层线性探针与上下文剥离解码（CSD）配对，并将其应用于跨越Qwen、Llama和DeepSeek架构的16个模型的六个代码推理任务族。所有四种结果在每个任务族中都占有显著比例：总体已解析仅为41.5%，多个任务低于30%。对结构、深度和算子的受控扫描揭示了特定任务的失败瓶颈：函数调用已解析率随着调用深度从一层增加到三层而从61.1%骤降至2.5%。跨架构和规模，酝酿支架保持稳定，所有16个模型的归一化酝酿持续时间为24-42%，而解析成功随能力变化。这表明该支架是测试的解码器-only Transformer家族中稳定的经验规律，而解析成功与能力、规模和训练共变。代码：此 https URL

English abstract

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

2606.17657 2026-06-17 cs.AI new submission

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

使用认知模型改进语言模型对人类说服博弈的模拟

Zirui Cheng, Zeyu Shen, Thomas L. Griffiths, Peter Henderson

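Affiliations: Princeton University

AI summary: Proposes Equation-to-Behavior Prompting and reinforcement learning methods that make language models match cognitive models (such as Bayesian updating and motivated reasoning), improving their ability to simulate the diversity of human decision-making in persuasion games.

AI Chinese abstract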

人们在战略互动中做出不同的决策。有些人像贝叶斯一样更新信念；其他人则表现出动机推理等偏见。尽管大型语言模型的创建者使用模拟人类进行安全评估和训练，但他们往往未能涵盖人类行为的这种广度。我们认为认知科学和经济学提供了一种方便的工具来做到这一点，利用人类决策的数学模型。我们提出了一种称为方程到行为提示的方法，用于引导大型语言模型匹配认知模型，并在基于法律决策的说服博弈中评估这种方法。我们发现大型模型可以通过提示近似基于方程的规范——贝叶斯更新、仿射扭曲、动机更新和Grether的$\alpha$-$\beta$模型，但小型模型无法做到。然而，使用强化学习训练小型模型以遵循数学规则，即方程到行为强化学习，在分布外参数化中将信念误差降低了26.5%。我们表明这些模拟可以帮助创建多样化的训练环境；训练小型模型考虑不同类型的决策者，与仅贝叶斯训练相比，平均信念变化提高了2.5%–12%，即使在说服GPT-5-mini时也是如此。我们的工作可以改进在日益逼真的环境中用于训练和评估的人类模拟，并且还可以促进对人类决策更复杂数学模型的新研究。

English abstract

People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Behavior Prompting for guiding large language models to match cognitive models, and evaluate this approach on persuasion games based on legal decision-making. We find that large models can approximate equation-based specifications -- Bayesian updating, affine distortion, motivated updating, and Grether's $α$-$β$ model -- using prompting, but small models fail to do so. However, training small models with reinforcement learning to adhere to mathematical rules, Equation-to-Behavior RL, reduces belief error by 26.5% in out-of-distribution parameterizations. We show that these simulations can help create diverse training environments; training small models to consider different kinds of decision-makers improves average belief change by 2.5%--12% over Bayesian-only training, even when persuading GPT-5-mini. Our work could improve human simulations for training and evaluation in increasingly realistic settings, and could also enable novel research into more complicated mathematical models of human decision-making.

2606.17735 2026-06-17 cs.AI new submission

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

打破自回归诅咒：动态认知熵编排的可擦除强化学习用于大语言模型

Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu

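Affiliations: SenseTime; Shanghai Jiao Tong University

AI summary: Proposes dynamic epistemic entropy orchestrated erasable reinforcement learning (E³RL), which treats the model's endogenous local autoregressive cross-entropy as a coordinate of epistemic uncertainty and uses segment-level adaptive dynamic thresholds and advantage allocation to excise logical defects precisely while reusing the KV cache, addressing autoregressive cascade collapse in long-sequence reasoning.

AI Chinese abstract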

尽管强化学习（RL）扩展了大语言模型（LLMs）的认知边界，但在长程逻辑推理中，它仍然容易受到自回归诅咒的影响：生成早期引入的微小认知扰动会沿着马尔可夫决策过程流不可逆地传播，引发级联故障，导致推理轨迹崩溃。为了克服这种自回归级联（即单个早期错误可能危及所有后续推理步骤），我们提出了动态认知熵编排的可擦除强化学习（$\text{E}^3\text{RL}$）。$\text{E}^3\text{RL}$ 通过将模型内生的局部自回归交叉熵作为认知不确定性的内在坐标，消除了对外部信号的依赖。通过引入分段自适应动态阈值和优势分配，$\text{E}^3\text{RL}$ 使模型能够精确切除局部逻辑缺陷，同时重用历史键值（KV）缓存流，从而赋予推理过程自愈能力。我们在 DeepMath-103k 数据集上训练 $\text{E}^3\text{RL}$。实验结果表明，$\text{E}^3\text{RL}$ 重塑了长序列推理的探索效率，提高了样本效率，同时保持线性内存开销。在 AIME 等数学推理基准上，$\text{E}^3\text{RL}$ 取得了显著的性能提升，4B 和 8B 参数模型分别超越了之前的最优结果（SOTA）5.349% 和 6.514%。这些发现表明，$\text{E}^3\text{RL}$ 打破了长序列推理中的自回归诅咒，为下一代自愈人工通用智能（AGI）奠定了理论和系统级基础。

English abstract

Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349% and 6.514%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).

2606.17945 2026-06-17 cs.AI new submission

Small Initialization Matters for Large Language Models

小初始化对大语言模型至关重要

Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu

Affiliations: School of Mathematical Sciences, Shanghai Jiao Tong University; Institute of Natural Sciences, Shanghai Jiao Tong University; MemTensor (Shanghai) Technology Co., Ltd.; Institute for Advanced Algorithms Research

AI summary: Finds that reducing the initialization scale consistently improves LLM pretraining, with especially large gains on reasoning tasks, and reveals the mechanism by which small initialization drives parameters to evolve from low-complexity structures to richer representations.

Comments 26 pages, 8 figures

AI Chinese abstract

大语言模型提供了一个可处理的系统，用于探究智能本身如何涌现，而不仅仅是LLM如何被工程化。尽管进展通常归因于规模、数据和架构，但我们表明参数初始化是训练以及模型能力的基因式决定因素。减小初始化尺度持续改善预训练，在推理密集型任务上收益最大。我们识别出两种限制小初始化优势的常用经验设置，并展示放松这些设置如何恢复有利的缩放。我们进一步发现了一个平衡推理和训练的关键初始化。从机制上讲，小初始化驱动了独特的发展轨迹：参数首先凝聚成低复杂度结构，随后扩展为更丰富的表示，为“压缩即智能”这一观点提供了具体形式。词元级分析表明，收益集中在非平凡、上下文约束的预测上，而非均匀地分布于所有词元。这些结果启发了一个简单的$\gamma$-初始化规则：将初始化范围作为显式旋钮，并默认使用小初始化，这是一种几乎无成本的干预，能改善预训练并跨模型规模增强推理。

English abstract

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $γ$-initialization rule: expose initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.

2606.17979 2026-06-17 cs.AI new submission

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR: 文本到图像强化学习后训练中的时空自适应奖励分配

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

AI summary: Addresses the granularity mismatch between rewards and generation trajectories in text-to-image generation by proposing STAR, which uses text-image attention to build spatiotemporally adaptive allocation maps and applies stronger policy updates to relevant latent regions, improving semantic alignment and text rendering.

AI Chinese abstract

现有的文本到图像生成的强化学习后训练方法通常将最终图像奖励转换为单个标量优势，并以相同强度应用于整个生成轨迹。然而，文本到图像生成自然具有时间和空间结构：不同的去噪步骤负责不同的生成阶段，而真正决定文本对齐的内容通常只出现在图像的一部分。这种粒度不匹配使得策略更新难以聚焦于实际影响奖励的生成组件。为了解决这个问题，我们提出了用于文本到图像扩散和流模型的强化学习后训练的**时空自适应奖励（STAR）分配**。STAR利用生成模型内部的文本-图像注意力，从用户提示中真正关心的核心内容开始，构建在去噪步骤和展开中动态变化的空间分配图，并将相同的组相对优势分配给更相关的潜在区域，几乎没有额外的计算开销。然后，STAR通过空间分辨的策略目标对这些区域应用更强的策略更新。我们使用Stable Diffusion 3.5 Medium作为基础模型，并在三个任务上评估：GenEval、OCR文本渲染和PickScore。实验结果表明，STAR在不改变外部奖励源的情况下，改善了组合语义对齐、文本渲染和偏好优化，在GenEval、OCR和PickScore上分别达到了$\mathbf{0.9759}$、$\mathbf{0.9757}$和$\mathbf{23.60}$。

English abstract

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.

2606.18132 2026-06-17 cs.AI new submission

Knowledge Reutilization in Meta-Reinforcement Learning

元强化学习中的知识复用

Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll

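Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes a meta-knowledge reutilization framework in which a dynamics-simplified agent learns task knowledge that is transferred to heterogeneous agents, using a Bayesian non-parametric prior and a high-level policy to generate task-level guidance, substantially reducing tracking error and improving sample efficiency.

Comments 18 pages initial submission

AI Chinese abstract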

元强化学习通过从相关任务中提取共享结构实现快速适应，但现有的端到端方法通常将任务推理与具身特定控制耦合。这种耦合可能模糊非参数任务语义，降低样本效率，并限制跨智能体复用。我们提出一个元知识复用框架，在动力学简化的智能体上学习任务级知识，并将其迁移至异构智能体。该框架使用贝叶斯非参数先验组织潜在任务模式，并使用高层策略生成任务级幅度指导。为了桥接可复用任务知识与不同具身，我们引入一个语义-幅度接口和一个轻量级时间适配器，将冻结的元知识转换为具身特定低层控制器的时间对齐子目标。在多个运动智能体上的实验表明，与最近的最先进基线相比，我们的框架将最终步跟踪误差降低了94.75%–99.79%，并且仅使用约23.8%的交互数据即可达到相当的部署性能。

English abstract

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% -- 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.

2606.18206 2026-06-17 cs.AI new submission

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

不动点推理器：稳定且自适应的深度循环Transformer

Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto

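Affiliations: ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center; ETH Zurich; Swiss Institute of Bioinformatics; Université Paris Cité; Liquid AI

AI summary: Addresses depth-induced signal propagation problems in looped architectures by proposing FPRM, a model based on pre-layer normalization and residual scaling that uses fixed-point convergence as an end-to-end halting mechanism, computing adaptively and improving performance on reasoning benchmarks such as Sudoku and Maze.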

Comments Code available at https://github.com/nilskiKonjIzDunava/fprm

2606.17107 2026-06-17 cs.LG cs.AI cross-list

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

模型在预填充阶段记笔记：KV缓存可编辑且可组合

Bojie Li

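Affiliations: Pine AI

AI summary: Finds that the KV cache stores conclusions like notes and supports editing and composition: editing a single field can correct the decision (accuracy 1.00 on an 8B model with about 1% of the compute), and precompiled skills can be spliced into arbitrary contexts (logit cosine similarity 0.90-0.999), reducing time-to-first-token cost to O(L).

AI Chinese abstract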

前缀缓存仅对完全共享的前缀重用预填充结果，因此一个字段的改变会使整个下游缓存失效。然而，覆盖该字段自身的键/值向量并重用其余部分，会导致模型基于旧值行动。通过四个模型家族的因果分析，原因在于：在预填充阶段，模型已将基于字段条件的结论写入下游笔记；该字段自身的键/值对决策的贡献不足1%。将KV缓存视为记录已记忆结论的笔记本，可以引出两个能力。(1) 可编辑性。一个显著的勘误可以修正笔记；结合思维链，仅编辑该字段即可恢复决策（8B模型准确率1.00，约1%计算），而无思维链时则被忽略。(2) 可组合性。笔记具有位置可移植性，因此预编译的技能可以通过RoPE重新定位并拼接至任意上下文，与完全重计算无法区分（logit余弦相似度0.90-0.999，十二个模型），且首次令牌延迟为O(L)而非O(L^2)。统一的编辑+组合智能体在决策上与重计算相同，延迟降低高达14.9倍。该方法适用于任何逐令牌注意力KV缓存，在规模、量化、混合专家和多模态缓存上得到验证，并通过小型适配器扩展到多种注意力变体。由于勘误仅追加，它与生产环境中的前缀缓存兼容：在在线vLLM基准测试中，它保持前缀缓存对齐（命中率98.5%），将p90首次令牌延迟降低53-398倍。

English abstract

Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
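The position-portability claim rests in part on a basic property of rotary position embeddings: rotations compose additively in position, so a key cached at one position can be moved by applying one further rotation. A minimal sketch of that property follows, assuming the standard RoPE frequency schedule with base 10000 and representing each 2-D feature pair as one complex number. The function names are illustrative, and the sketch covers only the positional rotation, not the paper's splicing procedure or its adapters.

```python
import cmath
from typing import List


def rope(vec: List[complex], pos: float, base: float = 10000.0) -> List[complex]:
    """Rotate each feature pair (stored as a complex number) by pos * theta_k."""
    n_pairs = len(vec)
    return [
        z * cmath.exp(1j * pos * base ** (-k / n_pairs))  # theta_k = base^(-2k/dim)
        for k, z in enumerate(vec)
    ]


def reposition(rotated: List[complex], delta: float, base: float = 10000.0) -> List[complex]:
    """Shift an already-rotated key by `delta` positions with one extra rotation."""
    return rope(rotated, delta, base)


key = [1.0 + 2.0j, -0.5 + 0.3j, 0.7 - 1.1j]   # made-up unrotated key
cached = rope(key, 5)                          # key as cached at position 5
moved = reposition(cached, 37)                 # cached key shifted to position 42
fresh = rope(key, 42)                          # key recomputed at position 42
max_error = max(abs(a - b) for a, b in zip(moved, fresh))
```

The shifted and recomputed keys agree up to floating-point error. Whether the cached content remains valid in a new context is the empirical question the abstract addresses; this identity only shows that the positional encoding itself is not an obstacle.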

2606.17118 2026-06-17 cs.LG cs.AI cross-list

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

MODE: 面向MoE多模态大语言模型的模态分解专家级混合精度量化

Yuanteng Chen, Peisong Wang, Zhilei Liu, Nanxin Zeng, Yuantian Shao, Shiqiang Lang, Tao Liu, Chuangyi Li, Qinghao Hu, Gang Li, Jing Liu, Jian Cheng

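Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy

AI summary: Addresses cross-modal and intra-vision biases in expert importance estimation for MoE multimodal LLMs by proposing MODE, a modality-decomposed expert-level mixed-precision quantization framework that decomposes selection frequency, filters redundant vision tokens, and evaluates per-modality sensitivity to assign bit-widths under a given budget, keeping average performance loss within 2.9% at W3A16.

Comments 18 pages, 8 figures

AI Chinese abstract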

混合专家多模态大语言模型（MoE-MLLMs）性能卓越，但GPU内存成本高昂，因此压缩至关重要。在PTQ方法中，专家级混合精度量化已被证明对MoE-LLMs有效，但由于专家重要性估计中两个被忽视的偏差，在MoE-MLLMs上性能显著下降。（1）在跨模态层面，视觉令牌的数值优势导致专家选择频率被视觉令牌主导，掩盖了对文本模态至关重要的专家；（2）在视觉内层面，大量冗余视觉令牌进一步扭曲频率统计，模糊了对信息性视觉内容关键的专家。为弥补差距，我们提出MODE，一种面向MoE-MLLMs的模态分解专家级混合精度量化框架，该框架按模态分解专家选择频率，过滤冗余视觉令牌以获得去噪的视觉频率，并进一步评估每个模态的量化敏感性作为基于频率估计的补充信号。这些信号被整合到整数线性规划公式中，以在给定预算下分配每个专家的比特宽度。大量实验表明，MODE特别适合MoE-MLLMs，在W3A16下平均性能损失限制在2.9%以内，在极端2比特设置下获得更大增益。

English abstract

Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.

2606.17199 2026-06-17 cs.LG cs.AI cross-list

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD：利用有界幂变换稳定在线策略蒸馏

Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen

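Affiliations: Eastern Institute of Technology, Ningbo; The Hong Kong Polytechnic University; Shanghai Jiao Tong University; University of Waterloo

AI summary: Addresses the training instability caused by the unbounded log-ratio reward in on-policy distillation by proposing PowerOPD, a family of bounded, sign-consistent rewards based on the Box-Cox power transformation, which improves benchmark-averaged Avg@8/Pass@8 on mathematical reasoning tasks by up to +6.37/+5.71 while reducing wall-clock time by 59.2% and GPU memory by 23.1%.

AI Chinese abstract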

大型语言模型的标准在线策略蒸馏（OPD）利用学生采样令牌估计反向KL散度，得到一个无偏的单样本蒙特卡洛估计器，避免了全词汇计算。然而，我们表明该估计器在实践中存在严重的训练病态：样本效率低、生成动态不稳定，以及与精确全词汇OPD相比显著的性能差距。奖励级别的诊断将这些病态追溯到log-ratio奖励，该奖励在结构上无界，产生极高方差的梯度，集中在早期位置并持续整个训练；标准的后验缩放方法仅在失真发生后操作，因此失效。为解决此问题，我们提出PowerOPD：一个源自Box-Cox幂变换的原生有界、符号一致的奖励族，由alpha > 0参数化，其中log-ratio是其退化极限alpha -> 0。在六个数学推理基准和四个Qwen3师生对中，PowerOPD在基准平均Avg@8/Pass@8上相比原始OPD提升高达+6.37/+5.71，相比后验稳定化提升+3.01/+3.54，相比全词汇OPD提升+2.59/+8.90，同时减少59.2%的挂钟时间和23.1%的峰值GPU内存。较大的alpha通常提高准确率，一致缩短响应长度，并使梯度范数比原始OPD小3000倍以上。

English abstract

Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail as they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
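The three properties named above (bounded, sign-consistent, log-ratio as the alpha -> 0 limit) can be checked numerically on one plausible instance. The sketch below assumes the reward is the difference of Box-Cox transforms of the teacher and student token probabilities, i.e. (p_teacher^alpha - p_student^alpha) / alpha. That functional form is an assumption made for illustration; the abstract does not state the exact PowerOPD formula.

```python
import math


def log_ratio_reward(p_teacher: float, p_student: float) -> float:
    """Vanilla reward: unbounded as p_student approaches 0."""
    return math.log(p_teacher) - math.log(p_student)


def box_cox(x: float, alpha: float) -> float:
    """Box-Cox power transform; tends to log(x) as alpha -> 0."""
    return (x ** alpha - 1.0) / alpha


def power_reward(p_teacher: float, p_student: float, alpha: float) -> float:
    """Assumed bounded reward: difference of Box-Cox transformed probabilities.

    For probabilities in (0, 1] the magnitude stays below 1 / alpha.
    """
    return box_cox(p_teacher, alpha) - box_cox(p_student, alpha)


# A token the student considers almost impossible: the log-ratio explodes,
# while the power reward stays inside (-1/alpha, 1/alpha).
extreme_log = log_ratio_reward(0.9, 1e-12)
extreme_power = power_reward(0.9, 1e-12, alpha=0.5)

# For small alpha the power reward approaches the log-ratio.
near_limit = power_reward(0.3, 0.6, alpha=1e-6)
exact_log = log_ratio_reward(0.3, 0.6)
```

Because x ** alpha is increasing in x for alpha > 0, this form always has the same sign as the log-ratio, which is what sign consistency requires.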

2606.17399 2026-06-17 cs.LG cs.AI cross-list

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

离散对数时钟：Transformer如何学习模乘法

Huu Danh Nguyen

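Affiliations: Stanford University

AI summary: Using a multiplicative character transform, finds that a transformer trained on modular multiplication learns a sparse Fourier spectrum whose embeddings and MLP neurons mainly encode a few multiplicative frequencies, indicating that the model performs addition in discrete-log space, i.e. a "Discrete-Log Clock" algorithm.

Comments 5 pages, 5 figures. Accepted to the Mechanistic Interpretability Workshop at ICML 2026

AI Chinese abstract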

当小型Transformer在模乘法任务中实现“grok”时，先前研究报告学习到的嵌入具有“密集”的傅里叶谱，需要所有频率。这与模加法形成对比，后者只需一组稀疏的关键频率。我们证明这种密度是错误基下分析的伪像。乘法的自然傅里叶变换不是标准加法DFT，而是乘法特征变换，它将乘法群$(\mathbb{Z}/p\mathbb{Z})^*$上的函数分解为其不可约表示。将此变换应用于在$a \cdot b \bmod 113$上训练的grokked Transformer，我们发现嵌入谱变得高度稀疏（基尼系数0.58 vs 加法基下的0.07），仅4个关键频率携带显著能量。此外，96.9%的MLP神经元被干净地调谐到单个乘法频率，并且神经元激活热图在按离散对数重排序后显示出二维周期结构。这些结果表明Transformer将乘法简化为离散对数空间中的加法，实现了类似于Nanda等人针对加法的Clock算法的“离散对数时钟”算法。该方法具有普适性：将分析基与任务的代数结构匹配，可以在标准工具视为噪声的地方揭示可解释结构。

English abstract

When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a "Discrete-Log Clock" algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.
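The reduction of multiplication to addition in discrete-log space can be verified directly for the modulus used in the abstract. The sketch below finds a primitive root of 113 by brute force, tabulates discrete logarithms, and multiplies by adding exponents modulo 112. It illustrates the arithmetic identity only and says nothing about how the trained network represents it; the function names are illustrative.

```python
P = 113  # prime modulus from the abstract


def primitive_root(p: int) -> int:
    """Smallest g whose powers visit every nonzero residue mod p (brute force)."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = x * g % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found")


G = primitive_root(P)
# Discrete-log table: DLOG[g^k mod p] = k for k in 0 .. p-2.
DLOG = {pow(G, k, P): k for k in range(P - 1)}


def mul_via_dlog(a: int, b: int) -> int:
    """Multiply nonzero residues by adding their discrete logs mod p-1."""
    return pow(G, (DLOG[a] + DLOG[b]) % (P - 1), P)


all_match = all(
    mul_via_dlog(a, b) == a * b % P
    for a in range(1, P)
    for b in range(1, P)
)
```

Reordering the residues 1..112 by `DLOG` turns the multiplication table into an addition table modulo 112, which is why structure that looks dense in the additive Fourier basis can become sparse after the reordering.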

2606.17406 2026-06-17 cs.CV cs.AI cross-list

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

基于多特征聚合的图神经网络用于半监督图像分类

Marina Chagas Bulach Gapski, Vinicius Atsushi Sato Kawai, Gustavo Rosseto Leticio, Lucas Pascotti Valem, Daniel Carlos Guimarães Pedronette, Mohand Said Allili

Affiliations: Department of Statistics, Applied Mathematics, and Computing (DEMAC), São Paulo State University (UNESP); Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP); Department of Computer Science and Engineering, University of Quebec in Outaouais (UQO)

AI summary: Proposes a GNN method for semi-supervised image classification that combines multiple feature extractors and graph representations, improving classification accuracy through manifold learning and rank aggregation.

AI Chinese abstract

特征提取涉及识别和提取显著特征或模式，包括边缘、纹理、形状和颜色属性。当代特征提取器主要利用深度学习架构，如卷积神经网络（CNN）和视觉变换器（VIT）。文献中各种特征提取器的可用性提供了广泛的特征表示。从图像中提取的特征取决于具体应用、所选提取器及其配置。因此，通过组合不同的提取器来整合互补信息，为提高性能提供了一种有前景的方式。图神经网络（GNN），特别是图卷积网络（GCN），已成为半监督图像分类的强大且广泛采用的方法，因为它们有效利用标记和未标记数据，同时利用捕捉样本间关系的底层图结构。本研究提出了一种新颖的GNN方法，适用于标记数据稀缺的场景，通过整合来自不同提取器的多样化特征和图表示集进行分类。进行了实验研究，包括不同特征和图提取器的组合，以及排名聚合策略。实验发现强调了本研究的主要贡献，表明特征和图表示的策略性组合，结合流形学习用于图处理，在大多数实验条件下显著提高了分类精度。此外，利用排名聚合技术整合来自不同提取器的特征，被证明能增强分类精度。

English abstract

Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.


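2606.17416 2026-06-17 cs.SD cs.AI cross-listed

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

Affiliations: Department of Artificial Intelligence, Korea University

AI summary: To address the entanglement of speaker identity with language characteristics caused by language-dependent acoustic variation in multilingual speaker verification, the paper proposes L-Proto, a language-aware episodic prototypical training strategy that builds language-consistent training episodes to reduce language-driven variation and improve cross-lingual generalization.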

Comments Accepted by INTERSPEECH 2026

2606.17489 2026-06-17 cs.LG cs.AI cross-listed

Online LLM Selection via Constrained Bandits with Time-Varying Demand

Yin Huang, Qingsong Liu, Jie Xu

Affiliations: Department of Electrical and Computer Engineering, University of Florida; Manning College of Information and Computer Sciences, University of Massachusetts Amherst

AI summary: For selecting among heterogeneous LLMs in edge-cloud inference systems, the paper proposes an online learning algorithm based on confidence-bound estimates and demand predictions that achieves sublinear regret and sublinear constraint violation under hard budget and soft latency constraints.

Comments 11 pages, 3 figures with multiple subfigures, 1 table, submitted for possible journal publication

Abstract

Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.
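To make the "confidence-bound estimates under a hard budget" idea more concrete, here is a minimal sketch of a budget-limited UCB rule that picks the model with the best optimistic reward per unit cost. It is not the paper's algorithm: per-call costs are assumed known and fixed, and the soft latency (covering) constraint and the demand predictions are left out entirely. The function name `budgeted_ucb` and all numbers are hypothetical.

```python
import math
import random


def budgeted_ucb(reward_means, costs, budget, seed=0):
    """Pull arms by optimistic reward-per-cost until no arm fits in the budget."""
    rng = random.Random(seed)
    k = len(reward_means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    spent = 0.0
    t = 0
    while True:
        # Hard (packing) constraint: only arms that still fit in the budget.
        affordable = [a for a in range(k) if spent + costs[a] <= budget]
        if not affordable:
            break
        t += 1
        untried = [a for a in affordable if pulls[a] == 0]
        if untried:
            arm = untried[0]  # try every affordable arm once first
        else:
            def index(a):
                mean = reward_sums[a] / pulls[a]
                bonus = math.sqrt(2.0 * math.log(t) / pulls[a])
                return (mean + bonus) / costs[a]
            arm = max(affordable, key=index)
        # Simulated Bernoulli feedback; the learner never reads reward_means directly.
        reward = 1.0 if rng.random() < reward_means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
        spent += costs[arm]
    return pulls, spent


costs = [2.0, 1.0, 0.5]
pulls, spent = budgeted_ucb([0.9, 0.5, 0.2], costs, budget=50.0)
print(pulls, spent)
```

Filtering to affordable arms before each selection is what keeps the budget constraint hard: the spend can never exceed the limit, whatever the reward estimates say.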


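2606.17513 2026-06-17 cs.LG cs.AI cross-listed

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

Oriol Vendrell-Gallart, Nima Negarandeh, Ramin Bostanabad

Affiliations: Department of Mechanical and Aerospace Engineering, University of California, Irvine

AI summary: Proposes REEF-GP, a framework that fits a Gaussian process to the residuals of a frozen neural operator and uses the operator's intrinsic coordinate-feature representations to build geometry-aware uncertainty, achieving calibrated uncertainty estimates on multiple PDE benchmarks at a far lower computational cost than deep ensembles.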

Abstract

Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.


2606.17516 2026-06-17 cs.LG cs.AI stat.ME stat.ML cross-listed

FoundCause: Causal Discovery with Latent Confounders from Observational Data

Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

Affiliations: Amazon Web Services; Department of Statistics, University of California, Davis

AI summary: Proposes FoundCause, an amortized causal discovery model trained on synthetic data that maps a dataset directly to a causal graph in a single forward pass and explicitly models latent confounders, outperforming 11 non-amortized and 4 amortized methods on 15 real-world datasets.

Comments Download the model at https://github.com/amazon-science/foundcause

Abstract

Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.
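The abstract reports results in structural Hamming distance (SHD). As a small sketch of how that metric can be computed, the function below compares a true and a predicted directed graph pair by pair, counting a missing, extra, or reversed edge as one error each. Conventions differ (some count a reversal as two errors), and the abstract does not state which one is used, so this choice and the name `shd` are assumptions. Self-loops are assumed absent.

```python
def shd(true_edges, pred_edges):
    """Structural Hamming distance between two directed graphs given as (u, v) edges.

    Each unordered node pair whose edge status differs (missing, extra, or
    reversed) contributes exactly one error. Assumes no self-loops.
    """
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    pairs = {frozenset(edge) for edge in true_edges | pred_edges}
    distance = 0
    for pair in pairs:
        u, v = tuple(pair)
        true_state = ((u, v) in true_edges, (v, u) in true_edges)
        pred_state = ((u, v) in pred_edges, (v, u) in pred_edges)
        if true_state != pred_state:
            distance += 1
    return distance


true_graph = [("A", "B"), ("B", "C")]
pred_graph = [("B", "A"), ("B", "C"), ("A", "C")]  # one reversal, one extra edge
distance = shd(true_graph, pred_graph)
print(distance)
```

Here the reversed A-B edge and the spurious A-C edge each add one error, while the correctly recovered B-C edge adds none.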


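2606.17551 2026-06-17 cs.LG cs.AI cross-listed

Reversal Q-Learning

Aditya Oberai, Seohong Park, Sergey Levine

Affiliations: University of California, Berkeley

AI summary: Proposes Reversal Q-learning (RQL), which uses the expanded MDP framework and reversed flows to generate virtual on-policy trajectories, combined with a bias-and-variance reduction technique, to train flow policies for offline reinforcement learning, achieving the best average performance on 50 robotic tasks.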

Abstract

Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.


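2606.17579 2026-06-17 cs.LG cs.AI cs.CL cs.SI cross-listed

Handling Feature Heterogeneity with Learnable Graph Patches

Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

Affiliations: Zhejiang University; Huazhong University of Science and Technology; Finvolution Group

AI summary: Proposes learnable graph patches, which decompose a graph into semantic units, and uses a patch encoder and a patch aggregator to enable transferable pre-training on cross-domain graph data, improving downstream task performance.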

Comments Accepted at KDD 2025

DOI: 10.1145/3690624.3709242

Abstract

In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.


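2606.17687 2026-06-17 cs.CL cs.AI cross-listed

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li

Affiliations: Harbin Institute of Technology (Shenzhen)

AI summary: To address the wasted computation caused by large reasoning models generating overly long chains of thought, the paper introduces the notion of Minimal Sufficient CoT and builds SuCo, a two-stage training framework that optimizes reasoning length through adaptive sufficiency thresholds and reinforcement learning, improving both accuracy and efficiency on mathematics, code, and science benchmarks.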

Comments Accepted to ICML 2026. 18 pages

Abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
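The definition of Minimal Sufficient CoT as "the shortest prefix of a CoT trajectory which is adequate for producing the correct answer" can be sketched as a simple prefix search. The sufficiency check is the hard part in practice and is not specified in the abstract; below it is replaced by a hypothetical callback `is_sufficient`, and the toy reasoning steps are invented for illustration.

```python
def minimal_sufficient_prefix(steps, is_sufficient):
    """Return the shortest prefix of `steps` accepted by `is_sufficient`, or None."""
    for end in range(len(steps) + 1):
        prefix = steps[:end]
        if is_sufficient(prefix):
            return prefix
    return None


# Toy trajectory for "solve 2x + 1 = 9": the last two steps are redundant.
steps = [
    "subtract 1 from both sides: 2x = 8",
    "divide by 2: x = 4",
    "double-check: 2 * 4 + 1 = 9",
    "restate the answer: x = 4",
]


def is_sufficient(prefix):
    # Stand-in for a real check (for example, asking a model to answer from
    # the prefix alone); here the prefix suffices once x = 4 has been derived.
    return any("x = 4" in step for step in prefix)


msc = minimal_sufficient_prefix(steps, is_sufficient)
print(len(msc), "of", len(steps), "steps kept")
```

A linear scan is used rather than a binary search because sufficiency need not be monotone in prefix length for a real checker.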


2606.17706 2026-06-17 cs.LG cs.AI cross-listed

Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

Savini Kommalage, Sanka Mohottala, Asiri Gawesha, Dulara Madhusanka, Menan Velayuthan, Dharshana Kasthurirathna, Mahima Milinda Alwis Weerasinghe, Charith Abhayaratne

Affiliations: Faculty of Computing, Sri Lanka Institute of Information Technology, Sri Lanka; Faculty of Engineering, University of Sri Jayewardenepura, Sri Lanka; Faculty of Engineering, Sri Lanka Institute of Information Technology, Sri Lanka; University of Sheffield, United Kingdom; Utrecht University, The Netherlands

AI summary: Proposes a confusion-aware difficulty score and disentangles the scoring and pacing effects of curriculum learning through stage-wise subset tests and random baselines; on CIFAR-10 the score is shown to be interpretable, but it yields no gain with the full dataset and improves data efficiency only in the low-data regime.

Comments Accepted at International Conference on Machine Learning (ICML) GlobalSouthML Workshop (2026)


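KANLib -- A Modular, Extensible, and Fast Implementation of Kolmogorov-Arnold Networks

Julian Hoever, Gregor Schiele

Affiliations: Intelligent Embedded Systems, University of Duisburg-Essen

AI summary: Proposes KANLib, a framework that unifies existing KAN implementations and supports multiple basis functions and adaptive grid rescaling, achieving reproducible predictive results while maintaining flexibility and high performance.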

Abstract

Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional multilayer perceptrons by replacing linear weights with learnable univariate functions. Despite their theoretical advantages in interpretability and expressiveness, practical research of KANs remains difficult due to high computational costs and inconsistent feature support across existing frameworks. This paper introduces KANLib, a modular, extensible, and computationally efficient framework for developing and evaluating KAN architectures. KANLib unifies core concepts from existing implementations, including PyKAN, EfficientKAN, and FastKAN, within a consistent software architecture that emphasizes flexibility, feature parity, and high performance. The framework supports two basis function types, adaptive grid rescaling, grid extension, and fine-grained architectural customization while maintaining compatibility with standard PyTorch workflows. Experimental evaluation on the California Housing benchmark demonstrates that KANLib reproduces the predictive behavior of established reference KAN implementations while achieving competitive computational efficiency. Furthermore, the framework enables the exploration of architectural variations beyond standard KAN formulations with only minor impacts on predictive performance. Overall, KANLib provides a robust foundation for future research on scalable and extensible KAN architectures.


2606.17952 2026-06-17 cs.LG cs.AI cross-listed

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

Mikołaj Zasada, Łukasz Struski, Jacek Tabor, Marcin Kurdziel

Affiliations: AGH University of Krakow, Poland; Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Centre for Credible Artificial Intelligence, Warsaw University of Technology

AI summary: Proposes SoftMoE, which replaces discrete routing with a soft top-k LapSum relaxation to enable gradient-based optimization of expert routing and learns the number of active experts per layer, matching or exceeding sparse MoE performance in language modeling while activating fewer experts.

Comments Accepted at ICML 2026

Abstract

Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-$k$ routing. While this preserves causality and suits autoregressive language models, the discrete top-$k$ operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-$k$ LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available$^\dagger$.
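For context, the sketch below shows the conventional discrete top-$k$ routing that this abstract describes as the baseline, not the LapSum relaxation the paper proposes. It keeps the $k$ largest router logits and renormalizes them with a softmax. The function name `top_k_route` and the example logits are hypothetical.

```python
import math


def top_k_route(logits, k):
    """Hard top-k gating: softmax over the k largest logits, zero elsewhere."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


router_logits = [2.0, 0.5, 1.0, -1.0]  # one token, four experts
gates = top_k_route(router_logits, k=2)
print(gates)
```

The selection step is where differentiability breaks: an expert outside the top $k$ receives a gate of exactly zero, so a small change in its logit has no effect on the output, and $k$ itself is a fixed integer rather than a learned quantity.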


2606.17961 2026-06-17 cs.CV cs.AI cross-listed

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

Affiliations: Computer Science Institute, DiSIT, University of Piemonte Orientale, Alessandria, Italy

AI summary: Theoretically analyzes and experimentally validates the stability of similarity-based positional encoding (simPE) under rotational perturbations, proving that its perturbation is bounded in Frobenius norm and showing that it outperforms standard positional encoding on several datasets.

Abstract

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.


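2606.17996 2026-06-17 cs.LG cs.AI cross-listed

Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting

Bin Wang, Heming Yang, Jinfang Sheng

Affiliations: School of Computer Science and Engineering, Central South University

AI summary: Proposes McWC, a model that combines multi-layer cyclicity construction, a multi-layer perceptron for extracting inter-channel correlations, and multi-level wavelet decomposition for fusing high- and low-frequency information, and decouples intra-channel autocorrelation in the frequency domain, for efficient and accurate long-term forecasting.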

Abstract

Cyclicity and trend are important components of time series data, and many studies based on them have achieved good results in long-term time series forecasting. However, we believe that current work neglects the influence of real-world inter-channel correlations in time series data, which leads to suboptimal predictions. Furthermore, these models rely on complex designs to capture diverse information, resulting in low computational efficiency. To address these challenges, we propose McWC, a long-term time series forecasting model that separately models the cyclicity, trend, and inter-channel correlations. Specifically, McWC first decouples cyclical information from the data using a multi-layer cyclicity construction module. Then, it extracts inter-channel correlations using a multi-layer perceptron. Next, it models and fuses the multi-layer high-frequency and low-frequency information in the data using a multi-level wavelet decomposition module. Finally, it aggregates the results of the different components to obtain the output. Simultaneously, we decouple intra-channel autocorrelations by calculating a loss function in the frequency domain. Experiments on six real-world datasets demonstrate that McWC achieves state-of-the-art performance, exhibiting excellent computational efficiency and historical information extraction capabilities.
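To illustrate what a multi-level wavelet decomposition separates, the sketch below applies the orthonormal Haar transform repeatedly, splitting a series into one low-frequency approximation and a list of high-frequency detail bands. The abstract does not name the wavelet family or describe the module's internals, so the Haar choice and the function names are assumptions. The input length is assumed to be divisible by 2 raised to the number of levels.

```python
import math


def haar_step(signal):
    """One Haar level: pairwise scaled sums (low-pass) and differences (high-pass)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return the coarsest approximation and the detail bands, finest first."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


series = [4.0, 2.0, 6.0, 8.0]
approx, details = haar_decompose(series, levels=2)
print(approx, details)
```

Because the transform is orthonormal, the sum of squares of the coefficients equals that of the original series, so no information is lost when the low- and high-frequency parts are modeled separately and later fused.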


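2606.18003 2026-06-17 cs.LG cs.AI cross-listed

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

Davide Domini, Gianluca Aguzzi, Lorenzo Pellegrini, Mirko Viroli, Lukas Esterle

Affiliations: University of Bologna; Aarhus University

AI summary: For privacy-preserving collective adaptation of nodes under spatial heterogeneity and temporal drift, the paper proposes C2FL, in which nodes self-organize into learning groups through spatial clustering and combine experience replay with dwell-time-aware adaptive averaging to achieve robust collective adaptation.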

Abstract

Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.
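One way to picture the "dwell-time-aware adaptive averaging step" is as an interpolation between a node's local parameters and the regional consensus, with a weight that grows the longer the node stays in the same area. The abstract does not give the weighting function, so the saturating exponential below, the time constant `tau`, and both function names are assumptions chosen only to show the qualitative behavior.

```python
import math


def dwell_weight(dwell_time, tau=10.0):
    """Weight on the regional consensus: 0 on arrival, approaching 1 over time."""
    return 1.0 - math.exp(-dwell_time / tau)


def adaptive_average(local_params, regional_params, dwell_time, tau=10.0):
    """Blend local and regional parameters according to time spent in the region."""
    alpha = dwell_weight(dwell_time, tau)
    return [
        (1.0 - alpha) * local + alpha * regional
        for local, regional in zip(local_params, regional_params)
    ]


local = [1.0, -2.0]
regional = [3.0, 0.0]
just_arrived = adaptive_average(local, regional, dwell_time=0.0)
long_stay = adaptive_average(local, regional, dwell_time=100.0)
print(just_arrived, long_stay)
```

A node that has just entered a region keeps its own model untouched, which avoids overwriting knowledge from the previous region with a consensus it has barely observed.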


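2606.18023 2026-06-17 cs.LG cs.AI cross-listed

S4oP: Operator-Level Pruning of Structured State Space Models for Resource-Constrained Devices

Marco Deano, Filippo Ziche, Nicola Bombieri

Affiliations: University of Verona

AI summary: Proposes an incremental operator-level pruning method for S4 and S4D models that interleaves structured masking with fine-tuning, significantly reducing inference cost while preserving predictive performance; it is presented as the first systematic study of structured operator pruning for SSMs.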

Abstract

Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs and that it facilitates their deployment in practical, resource-constrained scenarios.


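2606.18114 2026-06-17 cs.LG cs.AI cross-listed

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N

Affiliations: EdgeVerve Systems Limited

AI summary: Proposes grouped quantization-aware training (QAT) from a pretrained checkpoint combined with knowledge distillation, compressing Mamba-2 1.3B by 3.61x with very little data (about 100M tokens) and reaching zero-shot accuracy close to Bi-Mamba; it also identifies zero-ratio collapse, an instability specific to QAT from pretrained checkpoints.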

Abstract

State Space Models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100) -- approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.
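The "W1.58" in the title refers to ternary weights: three possible values carry log2(3), roughly 1.58, bits each. A minimal sketch of grouped ternary quantization is shown below, using one absolute-mean scale per group of weights. The abstract mentions learnable quantization scales, whereas this sketch computes a fixed scale from the data, so the scale rule, the group size, and the name `ternarize_grouped` are assumptions and not the paper's procedure.

```python
def ternarize_grouped(weights, group_size):
    """Quantize weights to {-1, 0, +1} with one absolute-mean scale per group."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = sum(abs(w) for w in group) / len(group)
        scale = max(scale, 1e-8)  # guard against an all-zero group
        scales.append(scale)
        # Round to the nearest integer, then clip into the ternary range.
        quantized.extend(max(-1, min(1, round(w / scale))) for w in group)
    return quantized, scales


weights = [0.9, -0.1, 0.4, -1.2, 0.05, 0.0, -0.3, 0.02]
q, scales = ternarize_grouped(weights, group_size=4)
dequantized = [q[i] * scales[i // 4] for i in range(len(q))]
print(q, scales)
```

Using a separate scale per group lets a group of small weights (the second half of the example) keep its own dynamic range instead of being rounded entirely to zero by a scale dominated by larger weights elsewhere.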


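2606.18186 2026-06-17 cs.LG cs.AI cross-listed

Kolmogorov Regression for Robust Diffusion Policies

Lekan Molu

Affiliations: Bala Cynwyd, PA 19004

AI summary: Introduces a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space, replacing stochastic score matching with a deterministic boundary-value PDE problem; a precision-weighted loss and a residual diagnostic yield convergence guarantees, more regular trajectories, and reward-free failure detection.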

Abstract

Finite-dimensional (FD) diffusion policies exhibit temporal drift owing to discretization artifacts that degrade long-horizon performance (when deployed on physical systems). We introduce a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space -- a subset of the Hilbert space. Essentially, replacing stochastic score matching with a deterministic boundary-value PDE problem. Our core innovation thrives on Gaussian measure theory whereupon the diffusion noise covariance operator is realized from a colored noise distribution which prescribes a notion of regularity on samples from the model at inference time. We train the diffusion model with a derived precision-weighted Cameron- Martin loss and a Kolmogorov residual is introduced as a PDE diagnostic during inference. These substitutions yield (i) convergence guarantees where the bound's constants depend on the effective rank of the kernel rather than action dimension, (ii) improved trajectory regularity via spectral weighting, and (iii) a deterministic failure detector without reward signals. Validation across two application domains demonstrates substantial improvements: on the PushT manipulation benchmark, the Cameron-Martin loss achieves a 17% improvement in maximum episode reward (0.95 vs. 0.78 for MSE) and 67.6% reduction in inter-step drifts during inference via the introduced residual magnitude. Similarly, on a 6-station manufacturing line with constant work-in-process (CONWIP) flow control, we achieve 28.4% lower RMSE than classical LSTM baselines; a high starvation-event recall (1.0 in test cycles), and effective bottleneck identification (Precision@1 = 1.0 in test set, 13x signal-to-noise ratio). We then certify the dispatch policies with Hamilton-Jacobi reachability theory which reduces deadlock events by 96% compared to uncontrolled dispatch over 100 simulated runs (351 events prevented).

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV cross-list

Looped World Models

Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

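Affiliation: FaceMind Research Asia

AI summary: Proposes Looped World Models (LoopWM), which iteratively refine latent environment states with parameter-shared Transformer blocks, achieving up to 100x parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.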

Comments Technical Report

2509.21886 2026-06-17 cs.AI updated version

TRACE: Learning to Compute on Circuit Graphs

Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu

AI summary: To address the architectural mismatch of graph representation learning for modeling circuit function, proposes TRACE, which combines a Hierarchical Transformer with function shift learning and substantially outperforms existing methods.

Abstract

Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet the dominant paradigm is architecturally mismatched for this task. Its permutation-invariant aggregation, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs, and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.
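The "function shift" idea can be illustrated on a tiny circuit. The sketch below assumes that the "function" being modeled is the output signal probability under uniform random inputs; that reading, the example circuit, and the helper names are assumptions made here, not details taken from the paper.

```python
from itertools import product
from typing import Callable


def true_probability(fn: Callable[..., int], n_inputs: int) -> float:
    """Exact P(output = 1) under uniform inputs, by exhaustive enumeration."""
    rows = list(product([0, 1], repeat=n_inputs))
    return sum(fn(*row) for row in rows) / len(rows)


def circuit(a: int, b: int, c: int) -> int:
    """y = (a AND b) AND (a AND c): input 'a' fans out and reconverges."""
    return (a & b) & (a & c)


def local_estimate() -> float:
    """Gate-by-gate estimate that treats every gate's inputs as independent."""
    p_ab = 0.5 * 0.5          # P(a AND b)
    p_ac = 0.5 * 0.5          # P(a AND c)
    return p_ab * p_ac        # wrong here: both terms share input 'a'


global_value = true_probability(circuit, 3)      # 0.125
local_value = local_estimate()                   # 0.0625
function_shift = global_value - local_value      # 0.0625
```

Under these assumptions the local approximation is cheap to compute, and the remaining shift isolates the error introduced by reconvergent structure, which is the part a learned model would need to predict.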

2510.14807 2026-06-17 cs.AI updated version

Beyond the Sampled Token: Preserving Candidate Support in RLVR

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

AI summary: Analyzes exploration collapse in RLVR from the candidate-distribution perspective and proposes CaSP, which preserves probability mass on the top-N candidates to improve pass@K without sacrificing pass@1; effectiveness is validated on multiple benchmarks.

Comments Technical report (23 pages, 16 figures, project page: https://spherelab.ai/simko/)

Abstract

We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR) from the perspective of the candidate distribution for next-token prediction. We formally show that as probability concentrates on the top-$1$ candidate, the expected number of distinct responses collapses to one regardless of the sampling budget $K$. This theoretical implication is further verified by our empirical tracking of top-$N$ candidate probabilities during training, where the top-$1$ candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: preserving non-negligible probability mass on the top-$N$ candidates. To this end, we propose Candidate-aware Support Preservation (CaSP), with two complementary designs. Specifically, CaSP redistributes positive gradients among the top-$N$ candidates for correct responses, and applies a stronger penalty to the top-$1$ candidate for incorrect responses. Unlike many exploration-oriented methods that improve pass@$K$ at the cost of pass@1, CaSP improves pass@$K$ across the full $K$ spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scale to 32B-parameter models and sampling budgets up to $K=1024$, positioning CaSP as a principled, candidate-level approach to RLVR exploration.
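The collapse claim can be checked numerically in a simplified setting. The sketch below assumes that whole responses are drawn independently from a fixed categorical distribution, which is a simplification of the token-level analysis the abstract describes; the function name and the example distributions are hypothetical.

```python
from typing import Sequence


def expected_distinct(probs: Sequence[float], k: int) -> float:
    """Expected number of distinct outcomes among k i.i.d. categorical draws.

    Outcome i appears at least once with probability 1 - (1 - p_i)^k, and the
    expectation is the sum of these probabilities (linearity of expectation).
    """
    return sum(1.0 - (1.0 - p) ** k for p in probs)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.999, 0.0005, 0.0005]
collapsed = [1.0, 0.0]

spread_value = expected_distinct(uniform, 8)        # about 3.6 of 4 outcomes
peaked_value = expected_distinct(peaked, 16)        # barely above 1
collapsed_value = expected_distinct(collapsed, 1024)  # exactly 1
```

Once the mass sits on a single outcome, raising the budget from 16 to 1024 draws adds nothing in this model, which is consistent with the motivation for keeping non-negligible mass on several candidates.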

2602.10635 2026-06-17 cs.AI cs.LG updated version

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang

Affiliations: Massachusetts Institute of Technology; National University of Singapore; Nanyang Technological University; Prince Sattam bin Abdulaziz University; University of Rochester

AI summary: To address the training imbalance caused by heterogeneous behavioral data, proposes the Omnisapiens-7B 2.0 foundation model, trained with Heterogeneity-Aware Relative Policy Optimization (HARPO), which achieves the best performance on 10 behavioral tasks and 5 zero-shot generalization benchmarks.

Comments Accepted to ICML 2026 Main Conference

Abstract

Socially intelligent AI systems must reason across diverse human behavioral tasks and generalize to new social contexts. However, behavioral data is inherently heterogeneous, comprising diverse modalities and prediction targets that produce uneven training signals across samples, creating imbalanced learning dynamics that challenge existing AI models. To address this, we develop Omnisapiens-7B 2.0, a foundation model for social behavior processing that explicitly addresses learning from heterogeneous behavioral data. This is enabled through Heterogeneity-Aware Relative Policy Optimization (HARPO), a new RL method that rebalances learning signals across samples by approximating each sample's contribution to the policy update and using these estimates to drive geometrically centered, inertially smoothed advantage modulation for stable training. Omnisapiens-7B 2.0 achieves the best and most consistent performance across 10 behavioral tasks, while also attaining the best performance on all five held-out zero-shot generalization benchmarks, with gains of up to +12.02% and +9.37% respectively. Furthermore, it demonstrates more consistent and interpretable reasoning traces, supporting reliable real-world behavioral applications. Our model and code are available at https://github.com/MIT-MI/human_behavior_atlas.

2603.18104 2026-06-17 cs.AI cs.DC cs.LG cs.NE updated version

Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI

Houston Haynes

AI summary: Proposes an alternative training architecture based on a dimensional type system, the Program Hypergraph, and the b-posit bounded-regime design, achieving constant memory overhead, exact gradient accumulation, and grade-preserving updates; introduces Bayesian distillation and warm rotation to support continuous adaptation and verifiable correctness of domain-specific models.

Comments 32 pages, 3 figures

Abstract

Prevailing AI training assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework (Haynes 2026), which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph (PHG; Haynes 2026), which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit bounded-regime design (Jonnalagadda et al. 2025), which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce *Bayesian distillation*, a mechanism by which the latent prior structure of a general-purpose model is extracted through the Adaptive Domain Model (ADM) training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce *warm rotation*, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.

2604.10827 2026-06-17 cs.AI updated version

Know Thy Reasoner: Not All Language Models Explore Alike

Chinese title (translated): Your Model's Diversity, Not the Method, Determines the Reasoning Strategy

Moulik Choraria, Argyrios Gerogiannis, Anirban Das, Supriyo Chakraborty, Sourya Basu, Sambit Sahu, Lav R. Varshney

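Affiliations: UIUC (University of Illinois Urbana-Champaign); Capital One

AI summary: Argues that model diversity shapes the reasoning strategy, analyzes reasoning uncertainty with a theoretical framework, and verifies that different models behave differently under depth-based refinement and parallel sampling.

Comments This is a full-length extension of the workshop paper that appeared in the ICLR 2026 Workshop on LLM Reasoning

AI Chinese abstract (translated)

Scaling compute for LLM reasoning requires allocating a budget between exploring solution approaches (breadth) and refining promising ones (depth). Most methods trade the two off implicitly, but why a particular trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue that the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We decompose reasoning uncertainty with a theoretical framework and derive the conditions under which tree-style depth refinement outperforms parallel sampling. We validate this on the Qwen-3 4B and Olmo-3 7B families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models but yield limited utility on high-diversity base models; we conjecture that the latter need stronger compensation for their lower exploration coverage.

English abstract

Compute scaling for LLM reasoning trades off exploring solution approaches (breadth) against refining promising ones (depth), yet why a given trade-off works, and why it often fails to transfer across models, remains unclear. We argue that the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We formalize this with a framework that decomposes reasoning uncertainty, derive when depth-based refinement outperforms parallel sampling, and validate the framework across three model families at both inference and training. Our central finding is that the diversity regime dictates the strategy: low-diversity aligned models benefit from depth-based refinement with lightweight intrinsic signals, whereas high-diversity base models are often harmed by it and instead need breadth or stronger signals to compensate.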

2404.01965 2026-06-17 cs.LG cs.AI updated version

Towards Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks

Leona Hennig, Tanja Tornede, Marius Lindauer

AI summary: To address the high computational cost of deep learning, proposes combining multi-fidelity HPO with multi-objective optimization to maximize accuracy and minimize energy consumption on deep shift neural networks; experiments yield over 80% accuracy at low computational cost.

Abstract

Deep Learning (DL) has advanced various fields by extracting complex patterns from large datasets. However, the computational demands of DL models pose environmental and resource challenges. Deep shift neural networks (DSNNs) offer a solution by leveraging shift operations to reduce computational complexity at inference. Following the insights from standard DNNs, we are interested in leveraging the full potential of DSNNs by means of AutoML techniques. We study the impact of hyperparameter optimization (HPO) to maximize DSNN performance while minimizing resource consumption. Since this is a multi-objective (MO) optimization problem with accuracy and energy consumption as potentially complementary objectives, we propose to combine state-of-the-art multi-fidelity (MF) HPO with multi-objective optimization. Experimental results demonstrate the effectiveness of our approach, resulting in models with over 80% accuracy and low computational cost. Overall, our method accelerates efficient model development while enabling sustainable AI applications.
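With two objectives there is usually no single best configuration, only a set of non-dominated trade-offs. The sketch below filters a Pareto front over accuracy (higher is better) and energy (lower is better). The configuration names and numbers are invented for illustration and are not results from the paper, and the multi-fidelity part of the method is not modeled.

```python
from typing import Dict, List

Config = Dict[str, float]


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a["accuracy"] >= b["accuracy"] and a["energy"] <= b["energy"]
    better = a["accuracy"] > b["accuracy"] or a["energy"] < b["energy"]
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical evaluated configurations (accuracy in [0, 1], energy in arbitrary units).
trials = [
    {"id": 0, "accuracy": 0.82, "energy": 5.0},
    {"id": 1, "accuracy": 0.80, "energy": 3.0},
    {"id": 2, "accuracy": 0.78, "energy": 4.0},  # dominated by trial 1
    {"id": 3, "accuracy": 0.70, "energy": 1.0},
]
front_ids = [c["id"] for c in pareto_front(trials)]
```

A multi-objective HPO run would return such a front and leave the final accuracy-versus-energy choice to the user.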

2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV updated version

Mordal: Automated Pretrained Model Selection for Vision Language Models

Shiqi He, Insu Jang, Mosharaf Chowdhury

AI summary: Proposes the Mordal framework, which automates the search for the best vision language model for a user-defined task by reducing the number of candidates and the evaluation time; it uses 8.9x to 11.6x fewer GPU hours than grid search and improves weighted Kendall's $\tau$ by 69% on average.

Abstract

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities on different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using 8.9x to 11.6x fewer GPU hours than grid search. We have also found that Mordal achieves about 69% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.

2502.08363 2026-06-17 cs.CL cs.AI updated version

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

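AI summary: Proposes Top-Theta attention, a training-free inference-time sparsification method that uses static per-head thresholds to keep a fixed number of important elements per row, combined with compensation techniques that preserve accuracy at high sparsity; on NLP tasks it achieves a 3x to 10x reduction in V-cache and up to 10x fewer attention elements with an accuracy drop of at most 1%.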

Comments Extended version of a paper accepted at ICANN 2026

2507.11178 2026-06-17 cs.LG cs.AI updated version

A Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes

Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Mingbao Yang, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

AI summary: Proposes GRNGC, which infers Granger causality by applying L1 regularization to the gradient between model input and output; it needs only one prediction model, reduces computational overhead, and outperforms existing methods on several benchmarks and real-world datasets.

Comments 9 pages, 3 figures, conference

Abstract

With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt a component-wise architecture, which requires constructing a separate model for each time series and results in substantial computational costs. In addition, imposing a sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model's ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between the model's input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model's effectiveness in reconstructing gene regulatory networks.
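The quantity being regularized, the gradient of each predicted series with respect to each input series, can be read as a causal score matrix. The sketch below is not the GRNGC implementation: it uses a hand-written toy forecaster instead of a trained network and central finite differences instead of automatic differentiation, both of which are assumptions made to keep the example dependency-free.

```python
import math
from typing import Callable, List

Vector = List[float]


def jacobian(f: Callable[[Vector], Vector], x: Vector, eps: float = 1e-5) -> List[Vector]:
    """Central finite-difference estimate of d f_i / d x_j at the point x."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def gradient_scores(f: Callable[[Vector], Vector], samples: List[Vector]) -> List[Vector]:
    """Mean |d f_i / d x_j| over samples; entry [i][j] scores the edge j -> i."""
    jacs = [jacobian(f, x) for x in samples]
    rows, cols = len(jacs[0]), len(jacs[0][0])
    return [[sum(abs(j[i][c]) for j in jacs) / len(jacs) for c in range(cols)]
            for i in range(rows)]


def l1_gradient_penalty(scores: List[Vector]) -> float:
    """Sum of all scores: the term a training loss would add, times a weight."""
    return sum(sum(row) for row in scores)


def toy_forecaster(x: Vector) -> Vector:
    """Hypothetical one-step model: series 1 is driven by series 0 only."""
    return [0.5 * x[0], math.tanh(0.8 * x[0])]


scores = gradient_scores(toy_forecaster, [[0.1, 0.2], [-0.3, 0.4]])
penalty = l1_gradient_penalty(scores)
```

Here `scores[1][0]` is large while `scores[0][1]` and `scores[1][1]` are zero, so thresholding the matrix recovers the single edge from series 0 to series 1. In a real training loop the penalty would push uninformative gradients toward zero before thresholding.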

2510.11709 2026-06-17 cs.LG cs.AI cs.CV updated version

Adversarial Attacks Leverage Interference Between Features in Superposition

Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal

AI summary: Shows that interference caused by feature superposition in neural networks can be a root cause of adversarial vulnerability, and verifies through theory and experiments that interference patterns determine attack success and transferability.

Comments Forty-third International Conference on Machine Learning

Abstract

Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition, the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representations and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.
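Interference from superposition is easy to see in the smallest possible case: three features stored in a two-dimensional hidden space. The sketch below is a toy construction assumed for illustration (evenly spaced unit directions, linear readout), not the synthetic setup used in the paper.

```python
import math
from typing import List, Tuple

Direction = Tuple[float, float]


def feature_directions(n_features: int) -> List[Direction]:
    """Unit vectors evenly spaced on the circle, one per feature."""
    return [(math.cos(2 * math.pi * k / n_features),
             math.sin(2 * math.pi * k / n_features)) for k in range(n_features)]


def readout(hidden: Direction, directions: List[Direction]) -> List[float]:
    """Linear readout: the dot product of the hidden state with each direction."""
    return [hidden[0] * d[0] + hidden[1] * d[1] for d in directions]


dirs = feature_directions(3)            # 3 features, 2 dimensions
clean = readout(dirs[0], dirs)          # only feature 0 is active

# Perturb the hidden state along feature 1's direction by a small amount.
eps = 0.1
perturbed_hidden = (dirs[0][0] + eps * dirs[1][0], dirs[0][1] + eps * dirs[1][1])
perturbed = readout(perturbed_hidden, dirs)
leak_into_feature_0 = perturbed[0] - clean[0]   # eps * cos(120 degrees)
```

Because the directions cannot be orthogonal, a push of size 0.1 aimed at feature 1 also moves the readout of feature 0 by about -0.05. With orthogonal directions the leak would be zero.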

2510.21583 2026-06-17 cs.CV cs.AI updated version

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

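Affiliation: GitHub

AI summary: Proposes GCPO, a reinforcement learning method for flow matching based on chunk-level policy optimization; by grouping consecutive steps into coherent chunks and changing the level at which the policy is optimized, it mitigates inaccurate advantage attribution, and experiments show that it outperforms existing methods on text-to-image generation.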

Comments ICML 2026

2512.04524 2026-06-17 cs.LG cs.AI updated version

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou

Affiliations: School of Computer Science and Technology, Guangdong University of Technology; School of Automation, Guangdong University of Technology; School of Computer Science, Guangdong Polytechnic Normal University; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; School of Artificial Intelligence, Guangzhou University

AI summary: Proposes Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework that builds class-level semantic connections with orthogonal prototypes, weights pseudo-label confidence by geometric proximity, and quantizes reconstructed features into unified hash codes, addressing the missing class-level alignment and degraded quantization quality in domain adaptive retrieval.

Comments AAAI2026

Abstract

Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.

2602.03846 2026-06-17 cs.LG cs.AI updated version

PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

Romain Cosentino

AI summary: Proposes PLATE, a continual learning method that needs no old-task data; it exploits the geometric redundancy of pretrained networks and uses structured low-rank updates to control the plasticity-retention trade-off explicitly and to improve worst-case retention guarantees.

Abstract

We develop a continual learning method for pretrained models that requires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial geometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for where to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to PLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $\Delta W = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
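The update rule $\Delta W = B A Q^\top$ with frozen $B$ and $Q$ implies that the only gradient needed is the one with respect to $A$, which by the chain rule is $B^\top G Q$ for $G = \partial L / \partial W$. The sketch below works this out on tiny hand-picked matrices. How PLATE actually derives $B$ and $Q$ from pretrained weights is not described in the abstract, so the values here are placeholders and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x: Matrix) -> Matrix:
    return [list(col) for col in zip(*x)]


def delta_w(b: Matrix, a: Matrix, q: Matrix) -> Matrix:
    """Structured low-rank update: B (d_out x r) . A (r x s) . Q^T (s x d_in)."""
    return matmul(matmul(b, a), transpose(q))


def grad_wrt_a(b: Matrix, grad_w: Matrix, q: Matrix) -> Matrix:
    """Chain rule for the only trainable factor: dL/dA = B^T . dL/dW . Q."""
    return matmul(matmul(transpose(b), grad_w), q)


# Placeholder frozen factors (assumed; not derived from any pretrained model).
B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # 3 x 2: updates reach output rows 0 and 1
Q = [[0.0], [1.0]]                         # 2 x 1: updates read input column 1 only
A = [[2.0], [3.0]]                         # 2 x 1: the trainable core

update = delta_w(B, A, Q)                  # 3 x 2, nonzero only where B and Q allow
G = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # pretend upstream gradient dL/dW
grad_a = grad_wrt_a(B, G, Q)               # 2 x 1, same shape as A
```

The frozen factors act as a mask on where the weights can move: in this toy case output row 2 and input column 0 are never touched, whatever value $A$ takes.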

2602.06014 2026-06-17 cs.LG cs.AI math.OC math.ST stat.ML stat.TH updated version

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan, Han Zhong

AI summary: Stabilizes Thompson sampling by introducing optimism (such as variance inflation or a mean bonus), so that each arm's pull count converges to a deterministic scale; this yields asymptotically valid Wald inference in K-armed stochastic bandits and resolves the extension to multiple optimal arms.

Comments Accepted in part to COLT 2026

Abstract

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noises, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by Halder et al., using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.
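For readers unfamiliar with Gaussian-index Thompson sampling, the sketch below shows where a variance-inflation factor enters. The constant `inflation` multiplier, the unit-variance Gaussian rewards, and the one-pull-per-arm warm start are assumptions made for illustration; the paper's exact inflation schedule is not given in the abstract.

```python
import math
import random
from typing import List


def gaussian_ts(means: List[float], horizon: int,
                inflation: float = 1.0, seed: int = 0) -> List[int]:
    """Run Gaussian Thompson sampling and return the pull count of each arm.

    Each round draws index_a ~ N(sample_mean_a, inflation / n_a) and pulls the
    arm with the largest index; inflation > 1 widens the draws (more optimism).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # warm start: pull every arm once
        else:
            indices = [rng.gauss(sums[a] / counts[a],
                                 math.sqrt(inflation / counts[a]))
                       for a in range(k)]
            arm = max(range(k), key=indices.__getitem__)
        reward = rng.gauss(means[arm], 1.0)  # assumed unit-variance noise
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Two optimal arms and one suboptimal arm (hypothetical means).
plain_counts = gaussian_ts([1.0, 1.0, 0.0], horizon=2000, inflation=1.0)
inflated_counts = gaussian_ts([1.0, 1.0, 0.0], horizon=2000, inflation=4.0)
```

Comparing how the two optimal arms split their pulls across several seeds, with and without inflation, gives an informal feel for the stability notion in the abstract. A single run of this sketch is not evidence for the theorem.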

2602.07429 2026-06-17 cs.LG cs.AI updated version

DPRM: A Plug-and-Play Token-Ordering Module Induced by the Doob h-Transform for Diffusion Language Models

Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda

AI summary: Proposes the DPRM module, which improves the token-ordering policy of diffusion language models by using online estimates to shift gradually from confidence-driven ordering to process-reward-guided ordering, with gains across nine kinds of tasks.

Abstract

Diffusion language models generate without a fixed left-to-right order, leaving token ordering as a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering, which respectively suffer from train-test mismatch and myopic exploration. We introduce DPRM (Doob $h$-transform Process Reward Model), a plug-in token-ordering module that keeps the host architecture, denoising objective, and supervision unchanged, and modifies only the ordering policy. DPRM starts from confidence-driven ordering and gradually shifts to process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove convergence of its stagewise Soft-BoN approximation, show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and establish a sample-complexity advantage under tractable optimization assumptions. Across nine hosts covering language reasoning, test-time scaling, protein, single-cell, molecular, DNA, text-to-image generation, and VQA, DPRM order variants improve several language, DNA, and multimodal settings while also identifying boundary cases where confidence-only ordering or task-specific utilities are preferable. Code is available at: https://github.com/DakeBU/DPRM-DLLM


2605.20708 2026-06-17 cs.CV cs.AI 版本更新

无中生有：语言模型能否发现0？

Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

发表机构 * Department of Computer Science, Princeton University（普林斯顿大学计算机科学系）

AI总结研究语言模型能否独立发现“零”的概念，通过算术任务测试，发现GPT-2规模模型无法在测试时泛化，但少量示例训练后显著提升，且语言预训练减少所需示例约50%。


DOI: 10.32470/vp0ddtg

AI中文摘要

基于人工神经网络的AI系统正被开发，旨在推动人类数学知识的边界。这些系统的关键问题在于它们能在多大程度上超越训练数据。数学发现需要一种强形式的分布外泛化能力——假设真正新的、且可能逻辑上更强大的数学结构的能力。已有假设认为，语言能力在人类认知中支持这种泛化。在这项工作中，我们使用简单算术作为案例研究，考察现代AI模型如何扩展其数学视野，评估这些模型能否独立发现“零”的概念。我们表明：(1) GPT-2规模的语言模型在测试时无法进行这种泛化，无论是否经过语言预训练；(2) 但在经过数十或数百个零的示例训练后，模型能显著改进。此外，我们发现语言预训练将所需示例数量减少了约50%，表明语言能力可以支撑神经模型中的数学发现。

英文摘要

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how far they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.


2606.17637 2026-06-17 cs.AI 新提交

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Brick-DICL：用于自动化Brick模式分类的动态上下文学习

Yiyue Qian, Shinan Zhang, Huan Song, Negin Sokhandan, Hannah Marlowe, Diego Socolinsky

发表机构 * Amazon AWS Generative AI Innovation Center（亚马逊AWS生成式AI创新中心）

AI总结提出Brick-DICL两阶段动态上下文学习框架，通过元数据检索和类别检索增强大语言模型领域知识，结合多模型过滤机制，实现楼宇管理系统点位的自动化Brick分类，显著提升准确率并减少人工验证。


AI中文摘要

楼宇管理系统（BMS）对于优化现代建筑的能效和运营性能至关重要。然而，不同制造商的BMS点缺乏标准化，给集成和数据利用带来了重大障碍。尽管Brick模式为楼宇系统提供了标准化的本体，但将BMS点映射到合适的Brick类面临三个关键挑战：（i）Brick类数量庞大（最新版本有936个），（ii）大语言模型（LLM）的领域知识有限，（iii）验证需要大量人工。为解决这些挑战，我们提出了Brick-DICL，一种用于自动化Brick模式分类的两阶段动态上下文学习框架。Brick-DICL包含两个主要组件：metadata-RAG，检索相关示例以增强LLM的领域知识；以及class-RAG，缩小潜在Brick类范围以应对大的分类空间。此外，我们实现了一种多LLM过滤机制，比较多个模型的预测，标记低置信度分类以供人工审查。结果：（i）通用性：Brick-DICL适用于任何楼宇管理系统，无论制造商或元数据格式如何；（ii）新颖且强大：作为首个用于Brick模式分类的动态上下文学习方法，Brick-DICL在建筑数据集上取得了显著的分类准确率提升，优于现有方法；（iii）高效：我们的多LLM过滤策略减少了人工验证工作，实现了快速数字化建筑接入。大量实验证明了Brick-DICL在不同建筑数据集上的有效性，加速了向标准化、可互操作的楼宇管理系统的进程。

英文摘要

Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs' domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL's effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.
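The multi-LLM filtering step described above can be illustrated with a minimal sketch. It assumes that "low confidence" means the models do not all agree on a Brick class; the function name `filter_predictions`, the `min_agreement` parameter, and the agreement-ratio rule are hypothetical and are not taken from the paper.

```python
from collections import Counter


def filter_predictions(predictions, min_agreement=1.0):
    """Combine one predicted Brick class per model into a single decision.

    Hypothetical rule (an assumption, not the paper's): take the majority
    label and flag the point for human review when the share of models
    voting for it falls below `min_agreement`.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    return {
        "label": label,
        "agreement": agreement,
        "needs_review": agreement < min_agreement,
    }


# Three models agree: accepted without review.
print(filter_predictions(["Zone_Air_Temperature_Sensor"] * 3)["needs_review"])  # False

# One model disagrees: flagged for a human to check.
mixed = ["Zone_Air_Temperature_Sensor"] * 2 + ["Supply_Air_Temperature_Sensor"]
print(filter_predictions(mixed)["needs_review"])  # True
```

Lowering `min_agreement` trades review effort for risk: a value of 0.6 would accept the two-out-of-three case above.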


2606.17642 2026-06-17 cs.AI 新提交

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

FinAcumen: 通过自演化经验记忆实现的金融多模态推理

Pianran Guo, Pengcheng Zhou, Yucheng Jian, Shuhua Chen

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Queen Mary University of London（伦敦玛丽女王大学）

AI总结提出FinAcumen框架，通过选择性经验记忆机制增强工具增强型多模态推理，在四个金融基准上持续提升冻结的8B视觉语言模型性能。


AI中文摘要

金融多模态推理要求智能体协调跨异构证据源的数值计算、检索、视觉解释和时间定位。现有的工具增强型智能体提高了执行保真度，但在跨回合中仍然大多无状态，反复重新发现推理策略和失败模式。在高风险金融环境中，这导致不可靠的工具路由、噪声检索和易产生幻觉的推理。我们提出FinAcumen，一个以选择性经验记忆为中心的金融推理智能体框架，用于工具增强的多模态推理。FinAcumen从先前的轨迹中积累基于金融的推理经验，将成功策略和失败衍生的警示规则提炼到持久记忆库中。在推理过程中，只有当语义相关性超过校准阈值时，检索到的经验才会调节推理，而不相关的记忆则通过回退机制被明确抑制。一个确定性的金融工具环境进一步为数值计算、检索、视觉解码和答案验证提供基础。在四个金融多模态推理基准上，FinAcumen持续改进冻结的8B视觉语言模型，优于金融专用模型，并接近领先的通用专有模型。进一步分析表明，选择性经验激活在检索不确定性下提高了推理可靠性。我们的代码匿名发布于https://anonymous.4open.science/r/FinAcumen

英文摘要

Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer verification. Across four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model over finance-specialized models and approaches leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at https://anonymous.4open.science/r/FinAcumen
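The threshold-gated use of memory described above can be sketched as follows. This is only an illustration under stated assumptions: relevance is taken to be cosine similarity between embedding vectors, and the names `select_experiences`, `threshold`, and `top_k` are hypothetical. An empty result stands in for the fallback path in which no memory conditions the reasoning.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_experiences(query_vec, memory, threshold=0.8, top_k=3):
    """Return stored experiences whose relevance clears the threshold.

    `memory` is a list of (text, embedding) pairs. An empty list means the
    caller should fall back to reasoning without memory (assumed behaviour).
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in memory]
    kept = sorted(
        (pair for pair in scored if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [text for _, text in kept[:top_k]]


memory = [("rule A", [1.0, 0.0]), ("rule B", [0.0, 1.0])]
print(select_experiences([1.0, 0.0], memory))  # ['rule A']
print(select_experiences([1.0, 1.0], memory))  # [] -> fallback, nothing is relevant enough
```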


2606.17821 2026-06-17 cs.AI 新提交

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

DecoSearch: 面向Text-to-SQL的复杂度感知路由与计划级修复

Esteban Schafir, Xu Zheng, Hojat Allah Salehi, Zhuomin Chen, Mo Sha, Wei Cheng, Dongsheng Luo

发表机构 * Florida International University（佛罗里达国际大学）； NEC-Labs（NEC实验室）； Singapore Management University（新加坡管理大学）

AI总结提出DecoSearch框架，通过复杂度感知路由将查询分配给直接生成或DAG分解，并结合拓扑精炼器修复执行失败，在BIRD和Spider上取得高准确率且显著降低token消耗。


AI中文摘要

大型语言模型（LLMs）在将自然语言翻译为SQL方面展现了卓越的能力，但现有方法在处理需要多步骤、数据感知推理的复杂查询时仍然表现不佳。我们引入了DecoSearch，一个无需训练的框架，通过将每个查询路由到适当的推理努力级别来解决这一问题。轻量级的Schema Selector首先将完整数据库模式修剪为相关的表和列。然后，LLM Judger判断问题是否需要分解：简单问题遵循直接生成路径，而复杂问题则升级为原子子问题的有向无环图（DAG），每个子问题通过目标SQL生成步骤解决。RAG组件用语义相似的训练示例为分解器提供基础，而Topology Refiner在执行失败表明存在有缺陷的分解而非可修复的SQL错误时，重构推理计划。DecoSearch在BIRD上达到70.53%的执行准确率，在Spider上达到88.31%，使用DeepSeek骨干网络，超越了所有无需训练的基线方法，同时消耗的token数量比竞争方法少一个数量级。它还可以作为模型无关的包装器，在不修改管道的情况下持续改进微调后的SQL生成骨干网络。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.
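To make the DAG-of-sub-questions idea concrete, here is a minimal sketch of executing a decomposition plan in dependency order. It uses Python's standard `graphlib.TopologicalSorter`; the `solve_plan` function, the `solve_step` callback, and the plan layout are hypothetical stand-ins, and the Schema Selector, LLM Judger, and Topology Refiner are not modelled.

```python
from graphlib import TopologicalSorter


def solve_plan(dependencies, solve_step):
    """Solve sub-questions so that every prerequisite is answered first.

    `dependencies` maps each sub-question id to the set of ids it depends on.
    `solve_step(node, inputs)` is a caller-supplied function (for example a
    targeted SQL generation call) that receives the answers to those
    prerequisites. TopologicalSorter raises CycleError if the plan is not a DAG.
    """
    results = {}
    for node in TopologicalSorter(dependencies).static_order():
        inputs = {dep: results[dep] for dep in dependencies.get(node, ())}
        results[node] = solve_step(node, inputs)
    return results


# Toy plan: q3 needs the answers to q1 and q2.
plan = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}}
answers = solve_plan(plan, lambda node, inputs: f"{node}({','.join(sorted(inputs))})")
print(answers["q3"])  # q3(q1,q2)
```

A `CycleError` here is one cheap signal that a plan is malformed before any SQL is executed.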


2606.17856 2026-06-17 cs.AI 新提交

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

FlowRAG: 通过频率感知的多粒度图流协同显式推理

Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He

发表机构 * East China Normal University（华东师范大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出FlowRAG框架，构建四层异构图，通过双粒度激活和频率感知加权流模块，增强语义召回和显式推理路径提取，在复杂推理基准上取得最优性能。


AI中文摘要

基于图的检索增强生成（GraphRAG）对于知识密集型和多跳查询任务有效；然而，许多现有方法主要基于实体图并依赖隐式语义相关性传播。这通常会导致（i）当用户查询抽象且在实体层面语义稀疏时检索不足，以及（ii）脆弱的多跳推理，其中噪声激活可能破坏实体到实体的转换并损坏推断的关系链，从而产生不可靠的结论。为此，我们提出FlowRAG，一个语义感知的检索框架，它提高了语义召回和显式推理。具体来说，FlowRAG在段落、摘要、句子和实体上构建了一个四层异构图，其中摘要节点作为粗粒度语义枢纽。在检索时，双粒度激活模块结合摘要-查询对齐和句子级匹配，在释义和抽象下鲁棒地激活相关实体。然后，我们引入一个频率感知的加权流模块，该模块通过段落内词频加权的实体-段落链接路由相关性，修剪噪声连接并提取高置信度的推理路径作为生成的显式逻辑骨架。大量实验表明，FlowRAG在复杂推理基准上取得了最先进的性能。

英文摘要

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose FlowRAG, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, FlowRAG constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary-query alignment with sentence-level matching to robustly activate relevant entities under paraphrase and abstraction. We then introduce a frequency-aware weighted flow module that routes relevance through entity-passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that FlowRAG obtains state-of-the-art performance on complex reasoning benchmarks.


2606.17888 2026-06-17 cs.AI 新提交

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

MathVis-Fine：通过渐进式依赖引导训练将视觉监督与必要性对齐的多模态数学推理

Wanshi Xu, Haokun Zhao, Haidong Yuan, Songjun Cao, Long Ma

发表机构 * School of ECE, Peking University（北京大学电子与计算机工程学院）； College of Computer Science and Artificial Intelligence, Fudan University（复旦大学计算机科学与人工智能学院）； School of Software and Microelectronics, Peking University（北京大学软件与微电子学院）； Tencent Youtu Lab（腾讯优图实验室）

AI总结提出MathVis-Fine框架，通过构建细粒度视觉标注数据集和两阶段渐进式训练，根据样本的视觉依赖程度平衡答案正确性和视觉基础奖励，提升多模态数学推理的监督精度。


AI中文摘要

链式思维（CoT）推理已从纯语言领域扩展到多模态场景；然而，现有方法通常将视觉输入视为同质或辅助信号，未能捕捉数学问题解决中文本与图像之间复杂且样本特定的依赖关系。这引发了两个核心问题：首先，视觉内容的监督信号是泛化且粗粒度的，缺乏对每个样本中视觉信息实际必要性的适应；其次，当视觉奖励被统一应用而不区分输入之间的互补关系时，训练反馈变得不准确。这些限制阻碍了模型实现精确的多模态推理。在这项工作中，我们提出了一个用于建模数学推理中细粒度视觉依赖的框架。我们首先构建了MathVis-Fine数据集，通过视觉依赖评级增强细粒度视觉标注。基于该数据集，我们引入了一种两阶段渐进式视觉增强训练范式，该范式根据每个样本的内在视觉依赖水平平衡答案正确性奖励和视觉基础奖励，从而减轻奖励偏差并提高监督准确性。大量实验表明，MathVis-Fine框架能够基于视觉依赖逐步增强视觉感知，为多模态数学推理提供了更精确的训练框架。我们将在论文被接收后发布该数据集。

英文摘要

Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios; however, existing approaches often treat visual inputs as homogeneous or auxiliary signals, failing to capture the intricate and sample-specific dependencies between text and images in mathematical problem-solving. This gives rise to two core issues: first, the supervisory signals for visual content are generalized and coarse-grained, lacking adaptation to the actual necessity of visual information in each sample; second, training feedback becomes inaccurate when visual rewards are uniformly applied without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, augmenting fine-grained visual annotations with visual dependency ratings. Building upon this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework effectively enhances visual perception progressively based on visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.


2606.18075 2026-06-17 cs.AI 新提交

A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation

上下文感知与关系感知的图检索增强生成统一框架

Haoyang Zhong, Yifei Sun, Antong Zhang, Chunping Wang, Lei Chen, Yang Yang

发表机构 * Zhejiang University（浙江大学）； Nanyang Technological University（南洋理工大学）； Finvolution Group（信也科技集团）

AI总结提出HyGRAG分层图RAG框架，通过构建融合上下文与关系的摘要、跨层级检索及动态更新，将多跳推理准确率提升9.7%。

Comments Accepted at The ACM Web Conference 2026 (WWW '26)


DOI: 10.1145/3774904.3792720

AI中文摘要

检索增强生成（RAG）已成为用外部知识增强大型语言模型（LLM）的范式，但现有基于图的方法面临一个根本限制：以实体为中心和以块为中心的方法操作在锚定于原始文本的表示上，缺乏真正的知识融合。以实体为中心的方法连接逻辑相关的内容，以块为中心的方法保留上下文，但两者都通过相似性搜索分别检索信息，错过了其综合产生的新兴理解。在本文中，我们提出HyGRAG，一种分层图RAG框架，通过解决三个核心挑战超越源文档：构建真正整合上下文和关系信息的摘要，利用这些综合表示在检索中访问新兴知识，以及高效更新分层结构以适应动态语料库。具体地，我们在包含块和实体节点的混合图上设计分层索引结构，然后迭代聚类并生成基于LLM的摘要。接着，我们设计上下文和关系感知的检索，跨所有抽象级别搜索，同时通过社区成员关系扩展。此外，我们通过基于附加的算法实现动态知识更新，仅需局部重新摘要。实验结果表明，HyGRAG将多跳推理任务的平均准确率提高了9.7%，同时保持了合理的效率。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a paradigm for enhancing large language models (LLMs) with external knowledge, yet existing graph-based methods face a fundamental limitation: entity-centric and chunk-centric approaches operate on representations anchored to original text without true knowledge fusion. While entity-centric methods connect logically related content and chunk-centric methods preserve context, both retrieve information separately through similarity search, missing emergent understanding from their synthesis. In this paper, we propose HyGRAG, a hierarchical graph RAG framework that transcends source documents by addressing three core challenges: constructing summaries that genuinely integrate contextual and relational information, leveraging these synthesized representations to access emergent knowledge during retrieval, and efficiently updating hierarchical structures for dynamic corpora. Specifically, we design hierarchical index structures over hybrid graphs with both chunk and entity nodes, then iteratively cluster them and generate LLM-based summaries. Then, we design context and relation-aware retrieval that searches across all abstraction levels while expanding through community membership. Moreover, we enable dynamic knowledge update through attachment-based algorithms with only local re-summarization. Experimental results show that HyGRAG improves the average accuracy of multi-hop reasoning tasks by 9.7%, while maintaining reasonable efficiency.


2606.17057 2026-06-17 cs.LG cs.AI cs.CL 交叉投稿

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

配对时正确，分离时错误：多模态大语言模型中模态特定神经元的解耦与编辑

Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang, Dayang Li, Yunyun Dong, Renyang Liu, Wei Zhou

发表机构 * School of Information Science and Engineering, Yunnan University（云南大学信息科学与工程学院）； School of Software, Yunnan University（云南大学软件学院）； National University of Singapore（新加坡国立大学）； School of Engineering, Yunnan University（云南大学工程学院）

AI总结针对多模态大语言模型知识编辑中存在的解耦失败问题，提出DECODE方法，通过显式解耦和定位模态特定神经元组，实现跨模态触发下的有效知识更新。

Comments 18 pages, 11 figures


AI中文摘要

尽管知识编辑为多模态大语言模型（MLLMs）的知识更新提供了一种高效机制，但我们发现当前范式仍面临一个重要但尚未充分探索的问题：编辑解耦失败，即当模型被多模态输入（文本-图像查询对）触发时，实体相关知识可以更新，但当配对输入被拆分为单模态输入时，这些知识往往恢复为编辑前的旧事实。我们深入的实证分析表明，MLLMs中的实体知识并非以统一表示存储，而是分布在解耦的模态特定路径中。因此，偏向多模态查询的更新无法有效传播到单模态电路。为弥补这一差距，我们提出DECODE，该方法显式解耦并定位模态特定神经元组以获取目标知识。大量实验证明，DECODE在不同模态触发下均能实现有效的知识更新，从而缓解编辑解耦失败。

英文摘要

Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs) but often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.


2606.17126 2026-06-17 cs.SD cs.AI 交叉投稿

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

通过改进独立控制实现歌唱声音转换中的颤音表达控制

Joon-Seung Choi, Dong-Min Byun, Seong-Whan Lee

发表机构 * Korea University（高丽大学）

AI总结提出VibE-SVC2框架，通过能量风格转换器、零样本音高风格转换器、颤音速率缩放和次谐波校正算法，实现对音高和音色两种歌唱风格的精细独立控制，性能优于现有方法。

Comments Accepted to IEEE Transactions on Audio, Speech, and Language Processing (TASLP)


AI中文摘要

歌唱风格是自然且富有表现力的歌声的关键方面。歌手利用歌唱风格来传达歌曲的情感。已有若干工作提出控制歌唱风格以制作更具表现力的歌声。最近，VibE-SVC通过预测高频F0轮廓成功控制了颤音。在本文中，我们引入了一个名为VibE-SVC2的歌唱声音转换框架，以改进歌唱风格转换性能和可控性。该模型提供对两种歌唱风格的控制：音高风格和音色风格。对于音高风格，为了解决我们先前工作中未解决的能量-音高纠缠问题，我们引入了一种新颖的能量风格转换器来处理能量轮廓中剩余的样式信息。此外，我们提出了一种零样本音高风格转换器，它模仿参考音频的音高风格。为了扩展模型的可控性，我们提出了颤音速率缩放，这是对颤音程度的独立控制，这在VibE-SVC中是不可用的。对于音色风格，我们扩展了模型以处理多种发声风格。然而，解决诸如气泡音等特定风格带来了挑战，因为传统的F0提取由于其固有的次谐波特性而常常失败，这降低了转换质量。为了解决这个问题，我们提出了一种新颖的次谐波校正算法来细化F0轮廓，以实现更自然的音色转换。通过全面的客观和主观评估，我们证明了VibE-SVC2提供了对两种歌唱风格的精细、独立控制，优于现有方法。

英文摘要

Singing style is a crucial aspect of a natural and expressive singing voice. Singers use singing styles to convey the feeling or emotion of a song. Several works have proposed controlling singing style to make singing voices more expressive. Recently, VibE-SVC successfully controlled vibrato by predicting the high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: a pitch style and a timbre style. For the pitch style, to resolve the pitch-energy entanglement issue left unresolved in our previous work, we introduce a novel Energy Style Converter to address remaining style information in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of reference audio. To expand the controllability of the model, we propose vibrato rate scaling, an independent control of vibrato extent, which is unavailable in VibE-SVC. For the timbre style, we extend the model to handle a variety of phonation styles. However, addressing specific styles such as vocal fry poses a challenge, as conventional F0 extraction often fails due to their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.


2606.17164 2026-06-17 cs.CL cs.AI cs.HC cs.PL cs.SE 交叉投稿

PromptMN: Pseudo Prompting Language

PromptMN: 伪提示语言

Enkhzol Dovdon

发表机构 * ICT Group（ICT集团）

AI总结提出PromptMN，一种伪提示领域特定语言，通过紧凑的%前缀类型指令注释自然语言，减少上下文歧义，提升人机交互的清晰度和可审查性。

Comments 32 pages, 2 figures


AI中文摘要

提示已成为人类与生成式AI之间的主要接口，然而许多自然语言提示仍然脆弱：角色、目标、约束和预期输出常常埋没在散文中或隐含起来。在智能体和软件开发工作流中，首次交接时的误读可能会传播到每一步，因为相当一部分智能体故障源于上下文歧义而非模型限制。本文介绍PromptMN，一种伪提示领域特定语言，它用紧凑的、以%为前缀的类型指令注释自然语言，涵盖角色、目标、需求、优先级、约束、计划、输入和输出。语义解析允许作者以任意顺序编写，而模型根据功能解释指令。PromptMN介于非正式提示和编程风格伪代码之间：结构足够可检查和可重用，又足够轻量，适用于软件开发生命周期（SDLC）中的分析师、管理者、开发者和利益相关者。PromptMN还与逆向提示工程配合使用。要求模型将期望结果重述为PromptMN，让用户在执行前检查推断的角色、目标、约束和缺失假设，从而减少修复周期，并产生一个可重用的工件来对齐人员和AI工具。PromptMN的可行性在多个前沿模型上进行了评估，包括Claude Fable 5、Claude Opus 4.8、Gemini 3.1 Pro和GPT-5.5。这些模型正确解析了PromptMN指令，包括复杂结构如重复、条件、方法和素数检查任务，无需微调。相同的词汇适用于所呈现的SDLC场景中的新代码库、维护和重新设计。虽然大规模验证仍是未来工作，但这些早期结果表明PromptMN是朝着更清晰、更可审查的人机交互迈出的实际一步。

英文摘要

Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.


2606.17255 2026-06-17 cs.CL cs.AI 交叉投稿

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

MLLP-VRAIN UPV 系统在 IWSLT 2026 同声传译任务中的应用

Jorge Iranzo-Sánchez, Gerard Mas-Mollà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

发表机构 * MLLP-VRAIN research group（MLLP-VRAIN研究组）； VRAIN ； Universitat Politècnica de València（瓦伦西亚理工大学）

AI总结提出基于Parakeet和Qwen 3.5模型的级联同声传译系统，通过自适应黑盒策略优化质量-延迟权衡，并引入ASR词增强和RAG机制处理上下文跟踪，在MCIF En→De测试集上实现XCOMET-XL提升+5.82。

Comments IWSLT 2026 System Description

详情

AI中文摘要

本文描述了MLLP-VRAIN研究组参与IWSLT 2026同声传译赛道共享任务的情况。我们的提交利用最近发布的Parakeet和Qwen 3.5模型，通过自适应“黑盒”策略构建了一个鲁棒的级联解决方案，用于长形式SimulST。我们探索了这些策略的松弛版本以实现更好的质量-延迟权衡。与去年相比，我们参与了所有语言方向。此外，对于En→{De, It, Zh}方向，我们还参与了今年新增的上下文跟踪赛道，采用ASR词增强和离线预翻译示例的RAG机制相结合，以引导生成并丰富系统的领域特定上下文。最后，我们提供了系统的详细延迟分析。与去年相比，在MCIF En→De测试集上的结果显示质量显著提升，XCOMET-XL提高了+5.82。我们的上下文跟踪处理进一步提升了+1.03的性能。

英文摘要

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.


2606.17350 2026-06-17 cs.CL cs.AI 交叉投稿

Do Large Language Models Always Tell The Same Stories?

大型语言模型总是讲述相同的故事吗？

Thennal DK, Hans Ole Hatzel

发表机构 * University of Hamburg（汉堡大学）

AI总结通过对比框架和人类故事数据集，研究10种LLM生成故事的叙事相似性，发现LLM故事比人类故事更相似，前沿模型趋向于“平均”通用叙事，且常见缓解策略无效。


AI中文摘要

大型语言模型（LLMs）的最新进展使得生成高质量散文成为可能，但这些模型是否能够生成多样化的输出仍然存在争议。在这项工作中，我们通过叙事相似性框架研究了LLM生成故事的多样性。使用对比框架和来自r/WritingPrompts的人类编写故事和提示数据集，我们收集了10个代表性LLM的叙事相似性判断，同时利用人类评估和三种不同的自动注释方法。我们的发现揭示了一个一致的趋势：LLM生成的叙事彼此之间始终比人类编写的故事更相似。我们证明，特别是前沿模型收敛于一种“平均”通用叙事，这种叙事近似于个体人类故事，但缺乏人类作者的整体多样性。最后，我们表明常见的缓解策略，包括负提示和温度缩放，未能有效解决这种同质性。

英文摘要

Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.


2606.17354 2026-06-17 cs.CL cs.AI 交叉投稿

空间视觉语言模型中的双路径推理强化

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

发表机构 * The University of Hong Kong（香港大学）； NVIDIA（英伟达）； University of California, San Diego（加州大学圣迭戈分校）

AI总结提出SR-REAL框架，通过强化学习融合语言推理和3D检测推理两条路径，显著提升空间VLM在复杂几何推理任务中的性能。


AI中文摘要

空间VLM在几何感知方面取得了显著进展，但需要多步推理（涉及深度、距离和场景关系）的复杂空间推理仍然具有挑战性。此外，不同的空间查询需要根本不同的策略：有些最好通过纯语言的逐步演绎来解决，而另一些则需要在进行定量推理之前进行显式的3D定位。我们提出了SR-REAL（通过强化学习实现空间VLM的双路径空间推理），这是一个统一框架，为空间VLM配备了两条互补的推理路径：纯语言推理（LOR），执行逐步语言演绎；以及先检测后推理（DTR），通过区域标记检测3D几何线索（如中心或边界框），然后进行显式几何推理。SR-REAL首先进行冷启动监督微调阶段，构建LOR和DTR的思维链监督，并暴露区域到3D的接口；随后进行强化学习，使用准确性和格式奖励优化策略模型；对于DTR，基于离散中心的检测奖励进一步细化几何对齐。在多种空间基准测试中，SR-REAL显著优于空间VLM基线：(i) 单个RL训练模型支持两条推理路径，DTR通过精确的3D定位在区域感知任务中表现出色，LOR增强了一般空间推理；(ii) 联合训练两条路径促进相互强化；(iii) 高质量、混合的冷启动数据对于稳定的RL优化至关重要；(iv) 模型无需逐任务调整即可跨数据集和领域泛化，展示了LOR和DTR之间的正向迁移。

英文摘要

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.


2606.17615 2026-06-17 cs.CV cs.AI 交叉投稿

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

SkillMoV: 基于原型条件门控的视图混合路由用于统一多视角熟练度估计

Edoardo Bianchi, Antonio Liotta

发表机构 * Free University of Bozen-Bolzano（博尔扎诺自由大学）

AI总结提出SkillMoV框架，通过混合视图投影器（MoVP）实现多场景多视角视频的熟练度估计，在EgoExo4D数据集上达到50.17%准确率，超越现有方法。


AI中文摘要

从视频中估计人类熟练度是自动化技能评估的关键挑战，应用于体育教练、音乐教学、手术培训和工作场所学习。现有方法通常专注于单一场景或依赖共享的多视角聚合，限制了其适应异构摄像机视角和活动领域的能力。我们提出SkillMoV，一个统一的、参数高效的框架，用于从同步多视角视频中进行多场景熟练度估计。其核心是混合视图投影器（MoVP），将混合专家范式适应于摄像机特定的视角特征。MoVP由四个阶段组成：(i) 一个具有12个专家MLP的混合视图软路由器，无需摄像机身份监督即可学习视角相关的专家偏好；(ii) 跨视角注意力以对齐同步摄像机；(iii) 可学习的原型锚定，以类级参考向量条件化表示；(iv) 一个原型条件门控投影，生成最终技能嵌入。我们在EgoExo4D上评估SkillMoV，涵盖六个技能领域和三种单独训练的视角配置：Ego、Exos和Ego+Exos。SkillMoV在Exos设置中达到50.17%的总体准确率，单个模型在所有场景上联合训练，超过比较方法中报告的最强Exos结果3.57个百分点。在Ego+Exos中，SkillMoV接近该设置的最佳报告结果（47.63%对48.20%）。在选定的Exos配置上的消融实验验证了每个组件：MoV路由比注意力聚合提高+6.61个百分点，跨视角注意力+4.92个百分点，原型锚定+4.07个百分点，随机视角丢弃+3.90个百分点。通过LoRA适配，SkillMoV仅训练其参数的23.32%，并且相对于仅LoRA基线增加了有限的测量开销。

英文摘要

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
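The soft-routing idea behind a mixture-of-experts projector can be shown with a small, dependency-free sketch. It is not the MoVP implementation: router scores are passed in directly rather than learned from view features, the experts are plain functions instead of MLPs, and the names `softmax` and `soft_route` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_route(x, router_scores, experts):
    """Blend all expert outputs, weighted by softmax(router_scores).

    Unlike hard top-1 routing, every expert contributes; the router only
    decides how much. Here the scores are supplied by the caller (assumption).
    """
    weights = softmax(router_scores)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]


experts = [
    lambda v: [2.0 * a for a in v],  # toy expert 1: doubles the features
    lambda v: [0.0 for _ in v],      # toy expert 2: outputs zeros
]
# Equal scores give equal weights, so the result is the average of both experts.
print(soft_route([1.0, 2.0], [0.0, 0.0], experts))  # [1.0, 2.0]
```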

2606.17664 2026-06-17 cs.IR cs.AI 交叉投稿

Temporal Preference Optimization for Unsupervised Retrieval

面向无监督检索的时间偏好优化

HyunJin Kim, Jaejun Shim, Young Jin Kim, JinYeong Bak

发表机构 * Microsoft, Redmond, USA（微软公司，美国雷德蒙德）； Sungkyunkwan University, Suwon, South Korea（成均馆大学，韩国水原）

AI总结提出TPOUR方法，通过时间检索偏好优化（TRPO）和可学习时间嵌入插值，使无监督稠密检索器能捕捉时间相关性，在时间信息检索任务上超越有监督和无监督基线。

Comments Accepted to ICML 2026

AI中文摘要

无监督稠密检索器通过对比学习从无标签文档中学习语义相似性，从而提供可扩展性，但它们难以捕捉时间相关性，会检索到语义相关但时间错位的文档；当文档集合跨越多个时间段时（例如，针对“2019年的总统是谁？”检索2018-2025年的文档会引入时间歧义），这一点尤为重要。现有方法依赖于带有显式时间戳的有监督训练，但这并不总是可行的。我们提出TPOUR（面向无监督检索器的时间偏好优化），它使用我们新颖的训练方法时间检索偏好优化（TRPO）。TRPO在时间维度上重新诠释偏好学习，引导检索器偏向时间对齐的文档。TPOUR进一步通过在学习到的时间嵌入中进行插值，泛化到未见的时间段，实现连续的时间对齐。在时间信息检索（T-IR）实验上，TPOUR优于无监督和有监督基线。与Qwen-Embedding-8B相比，尽管规模小约72.7倍，TPOUR Contriever在显式查询上的平均nDCG@5提高了+4.04（+12.15%），在隐式查询上提高了+4.98（+15.21%）。我们的代码可在以下网址获取：https://github.com/agwaBom/TPOUR。

英文摘要

Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance and retrieve semantically related but temporally misaligned documents. This matters when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for "Who is the president in 2019?" introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which is not always feasible. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method, Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. In experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit queries and +4.98 (+15.21%) on implicit queries. We provide our code at https://github.com/agwaBom/TPOUR.
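The nDCG@5 figures in this abstract use a standard ranking metric. As a reminder of what it rewards, here is a minimal sketch that computes nDCG@k from a ranked list of graded relevance labels. The function names and the toy relevance lists are illustrative assumptions and are not taken from the TPOUR code; the ideal ranking is also assumed to be computable from the retrieved list alone.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k items, given in ranked order."""
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    # Assumption: every relevant document appears somewhere in `relevances`,
    # so sorting the same labels gives the ideal ordering.
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example (hypothetical labels): 1 = temporally aligned, 0 = misaligned.
aligned_first = [1, 0, 0, 0, 0]
aligned_second = [0, 1, 0, 0, 0]
print(ndcg_at_k(aligned_first, 5))
print(ndcg_at_k(aligned_second, 5))
```

In this toy case, ranking the only temporally aligned document second instead of first lowers nDCG@5 from 1.0 to about 0.63, which is the kind of ranking error a temporally aware retriever is meant to avoid.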

2606.17678 2026-06-17 cs.CV cs.AI 交叉投稿

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

先看后答：基于充分性驱动的强化学习实现视觉证据预对齐

Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Nanyang Technological University（南洋理工大学）； China Telecom（中国电信）

AI总结提出视觉证据预对齐（VEPA）方法，在预训练与后训练之间引入充分性驱动的GRPO优化，以增强多模态大模型对细粒度视觉证据的利用，显著提升视觉密集型任务性能。

AI中文摘要

多模态大语言模型（MLLMs）将强大的文本推理与视觉输入相结合，但其响应可能与底层图像不一致，表明在推理过程中未能有效利用视觉证据。当前的训练范式依赖于大规模基于标题的预训练进行通用对齐，随后通过监督微调和强化学习实现指令遵循和复杂推理。然而，这种预训练仅提供较弱的视觉基础：简短、粗略的标题使模型偏向显著物体，而忽略了细粒度的视觉证据。本文引入视觉证据预对齐（VEPA），作为预训练与后训练之间的中间阶段，探索一种新颖的充分性驱动目标，结合组相对策略优化（GRPO）来优化基于问题的视觉证据描述。在多种基准上的大量实验表明，我们的VEPA在视觉密集型评估上持续提升性能，并补充了标准的监督后训练。进一步分析表明，这种提升源于增强的、可迁移的视觉基础，而非额外的任务特定训练。

英文摘要

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.

2606.17798 2026-06-17 cs.CV cs.AI 交叉投稿

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

LiveStarPro: 具有分层记忆的主动式流视频理解用于长时域流

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

发表机构 * IEEE

AI总结提出LiveStarPro，通过流验证解码、流因果注意力掩码和树结构分层记忆三个组件，实现长时域流媒体视频的主动理解，在语义正确性和时序误差上分别提升28.9%和降低18.2%。

AI中文摘要

尽管视频大语言模型（Video-LLMs）取得了显著进展，当前的在线架构仍然难以同时处理连续视频流、自主决定何时响应以及保持长时域上下文记忆。这些障碍削弱了实时响应能力，并在长时间交互中导致严重遗忘。在这项工作中，我们引入了LiveStarPro，一个专为长时域流上的主动视频理解而设计的直播助手。LiveStarPro的设计基于三个互补组件。第一个组件是流验证解码（SVeD），一种通过单次困惑度验证识别适当响应时机的推理框架，从而消除了对显式静音标记的依赖。第二个组件是流因果注意力掩码（SCAM），一种训练策略，它在可变长度流上强制实现增量视频-语言对齐。第三个组件是树结构分层记忆（TSHM），一种递归记忆架构，它将驱逐的历史信息组织成事件链，从而能够从有效无界的视频流中高效检索。为了在现实在线条件下促进全面评估，我们进一步提出了OmniStarPro，一个大规模基准测试，涵盖15个多样化的真实世界场景，并扩展到小时级流以评估长期回忆。大量实验表明，LiveStarPro持续超越现有方法，在语义正确性上提升28.9%，时序误差降低18.2%，而其流式键值缓存进一步在相同模型上实现了1.58倍的推理加速。模型和代码已在 https://github.com/sotayang/LiveStarPro 公开。

英文摘要

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.
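The abstract does not spell out how SVeD's perplexity verification works, so the following is only a sketch of the underlying quantity: perplexity computed from per-token log-probabilities, followed by a threshold test. The gating rule, the function names, and the threshold value are all hypothetical assumptions for illustration and should not be read as the LiveStarPro implementation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence from its natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def should_respond(candidate_logprobs, max_perplexity=4.0):
    """Hypothetical gate: speak only when the candidate reply scores as confident.

    `max_perplexity` is an assumed tuning knob, not a value from the paper.
    """
    return perplexity(candidate_logprobs) <= max_perplexity

# Toy log-probabilities (made up): a confident and an unconfident candidate.
confident = [math.log(0.5)] * 4    # perplexity of about 2
unconfident = [math.log(0.1)] * 3  # perplexity of about 10
print(should_respond(confident), should_respond(unconfident))
```

A gate of this shape needs only one scoring pass over the candidate reply, which is consistent with the "single-pass" description, though the actual decision rule may differ.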

2606.17835 2026-06-17 cs.CL cs.AI eess.AS 交叉投稿

Perceptual compensation for tonal context in self-supervised speech models

自监督语音模型中对声调上下文的感知补偿

James Kirby, Ioana Krehan, Michele Gubian

发表机构 * Institute for Phonetics and Speech Processing, LMU Munich（慕尼黑大学语音与语言处理研究所）

AI总结通过伪复制普通话声调的感知补偿实验，比较纯自监督预训练模型和微调模型，发现纯预训练模型无补偿证据，而微调模型有部分补偿但未达到人类水平，表明监督目标可能对抽象某些音韵规律是必要的。

Comments Accepted for publication at Interspeech 2026

2606.17950 2026-06-17 cs.CV cs.AI 交叉投稿

Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

即插即适应：基于预训练对齐模型的首眼多模态指代消解

Jinghan Wu, Jing Li, Ivor W. Tsang, Xuetao Zhang

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University（西安交通大学人工智能与机器人研究所人机混合增强智能全国重点实验室）； Centre for Frontier AI Research and Institute of High-Performance Computing, Agency for Science, Technology and Research (A*STAR)（新加坡科技研究局前沿人工智能研究中心与高性能计算研究所）

AI总结提出即插即适应方法，利用预训练的细粒度对齐模型，通过证据理论融合视觉与类别线索，无需目标数据集训练或大型VLLM，在CIN基准上CoNLL F1比专用方法和流行VLLM分别提升5.31%和2.12%。

AI中文摘要

视觉信息有助于解决指代消解中的歧义，带来显著的性能提升。然而，现有的多模态指代消解（MCR）方法在应用前需要使用目标数据集的部分标注数据进行训练，这阻碍了其直接可用性并引发泛化担忧。虽然拥有数十亿参数的视觉-语言大模型（VLLM）提供了有前景的零样本能力，但它们仍然难以获取。其庞大的规模限制了部署能力，且许多模型只能通过付费API访问。在本文中，我们提出了一种即插即适应方法，该方法策略性地适配一个精心预训练的对齐模型，以立即用于MCR任务，旨在消除对稀缺基准数据集的训练或依赖资源密集型VLLM的需求。具体来说，我们首先使用视觉-语言对齐数据集预训练文本与视觉上下文信息之间的细粒度对齐模型。然后，我们通过证据理论融合视觉和类别线索进行相似度聚合，将对齐模型重新用于MCR，从而增强效果。在Coreference Image Narratives (CIN)基准数据集上的实验证明了我们方法的有效性，在CoNLL F1上比最先进的专用方法和流行VLLM分别提高了5.31%和2.12%。我们进一步在掩码CIN数据集上进行鲁棒性测试，并在专门构建的VCR-MCR数据集上进行泛化评估，结果证实了这两种能力。

英文摘要

Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model for MCR through similarity aggregation, fusing visual and categorical cues with evidence theory to enhance effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31% and 2.12% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.

2606.18033 2026-06-17 cs.CL cs.AI 交叉投稿

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

当英语不是最好的老师：跨语言上下文学习中的源语言效应

Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé

发表机构 * SnT, University of Luxembourg（卢森堡大学安全、可靠性与信任跨学科中心）； Luxembourg Institute of Science and Technology（卢森堡科学技术研究院）

AI总结研究跨语言上下文学习（ICL）中源语言选择的影响，发现基于微调的预期在ICL中不成立，提出有效选择源语言的替代启发式方法。

Comments Accepted at 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), co-located with ACL 2026

2606.18156 2026-06-17 cs.CV cs.AI 交叉投稿

ReAge3D: Re-Aging 3D Faces with View Consistency

ReAge3D：具有视角一致性的3D人脸回龄

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

发表机构 * Texas A&M University（德克萨斯农工大学）； Netflix Eyeline Studios

AI总结提出ReAge3D框架，通过2D扩散模型DiffReaging和中心向外编辑传播策略，实现多视角一致的3D人脸回龄，保持身份和细节，优于现有方法。

AI中文摘要

我们提出了一种新颖的框架，用于实现逼真且可控的3D人脸回龄，生成高度详细、保留身份的结果。现有的3D编辑方法虽然对粗粒度的语义变化有效，但不适合回龄，因为即使回龄2D视图之间的微小不一致也会导致对微妙但感知上重要的年龄相关细节的过度平滑。为了解决这一挑战，我们首先引入了一个基于2D扩散的回龄模型DiffReaging，该模型在合成生成的图像对上训练。我们进一步提出了一种中心向外编辑传播策略，利用该回龄模型重建多视图一致的回龄图像。具体来说，从回龄的正面枢轴视图开始，我们通过扭曲和我们提出的Masked-DiffReaging过程重建其余视图。通过在扩散过程的每一步注入现有内容，Masked-DiffReaging确保重建区域与现有像素保持连贯。由此产生的一致回龄视图集监督回龄3D表示的优化。我们的方法在视觉上和定量上都优于现有的3D编辑技术，能够对3D人脸模型中的年龄变换进行平滑、细粒度的控制。

英文摘要

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.

2602.09802 2026-06-17 cs.AI cs.CL 版本更新

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

大型语言模型会为景观付费吗？从主观选择中推断支付意愿

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

AI总结研究在旅行助手场景下，通过多分类逻辑模型分析LLM的主观选择，推断其支付意愿并与人类基准比较，发现LLM在属性层面存在系统偏差且高估支付意愿，但通过条件化偏好可改善。

DOI: 10.1016/j.eswa.2026.133279

AI中文摘要

随着大型语言模型（LLM）越来越多地部署在旅行辅助和购买支持等应用中，它们常常需要在没有客观正确答案的情况下代表用户做出主观选择。我们在旅行助手背景下研究LLM的决策，通过向模型呈现选择困境，并使用多项逻辑模型分析其响应，推导出隐含的支付意愿（WTP）估计。随后将这些WTP值与经济学文献中的人类基准值进行比较。除了基线设置外，我们还研究了在更现实条件下模型行为的变化，包括提供用户过去选择的信息和基于角色的提示。我们的结果表明，虽然可以从较大的LLM中推导出有意义的WTP值，但它们在属性层面也显示出系统偏差。此外，它们倾向于整体高估人类的WTP，特别是在引入昂贵选项或面向商业的角色时。将模型条件化于对更便宜选项的先前偏好，得出的估值更接近人类基准。总体而言，我们的发现突出了使用LLM进行主观决策支持的潜力和局限性，并强调了在实际部署此类系统时仔细选择模型、设计提示和表示用户的重要性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
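To make the willingness-to-pay (WTP) derivation concrete, the sketch below shows the standard way an implied WTP falls out of a multinomial logit model with linear utilities: it is the negative ratio of an attribute coefficient to the price coefficient. The coefficient values, attribute names, and hotel options are invented for illustration; they are assumptions and not estimates from the paper.

```python
import math

def choice_probabilities(options, coefs):
    """Multinomial logit choice probabilities for options with linear utilities."""
    utilities = [sum(coefs[name] * value for name, value in option.items())
                 for option in options]
    # Subtract the maximum utility before exponentiating, for numerical stability.
    max_u = max(utilities)
    weights = [math.exp(u - max_u) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def willingness_to_pay(coefs, attribute, price_key="price"):
    """Implied WTP for one unit of `attribute`: -beta_attribute / beta_price."""
    return -coefs[attribute] / coefs[price_key]

# Hypothetical fitted coefficients: price lowers utility, a view raises it.
coefs = {"price": -0.02, "view": 0.6}
wtp_view = willingness_to_pay(coefs, "view")

# Two hypothetical rooms whose price gap equals the implied WTP for the view.
rooms = [{"price": 100.0, "view": 0.0}, {"price": 100.0 + wtp_view, "view": 1.0}]
print(wtp_view, choice_probabilities(rooms, coefs))
```

With these made-up coefficients the implied WTP for a view is about 30 price units, and a room that charges exactly that premium is chosen with probability of about one half. That indifference point is what the WTP ratio expresses.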

2605.30036 2026-06-17 cs.AI cs.CL 版本更新

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

向机器传授价值观：在LLMs中模拟类人行为

Asaf Yehudai, Naama Rozen, Ariel Gera

发表机构 * The Hebrew University of Jerusalem（耶路撒冷希伯来大学）； IBM Research（IBM研究院）； Tel-Aviv University（特拉维夫大学）

AI总结本研究基于心理学价值理论，通过大规模实验（超过500万个问题）评估价值提示的LLMs在价值结构和价值-行为关系上与人类的一致性，并证明引入人类价值分布可增强群体模拟。

Comments We had some disagreement regarding proper attribution; we hope to resolve it soon and upload the paper

AI中文摘要

大型语言模型（LLMs）展示了采用不同角色和身份的能力；然而，它们是否能表现出符合连贯、类人价值结构的行为仍不清楚。在这项工作中，我们借鉴既定的心理学价值理论，在LLMs中诱导类人价值观，并评估它们与人类研究中观察到的模式的一致性。使用经过验证的心理学问卷，我们进行了大规模实验——超过500万个问题——以评估领先LLMs的价值结构和价值-行为关系，并将其与人类进行比较。我们的发现揭示了价值提示的LLMs与人类在两个维度上的强烈一致性。此外，引入人类价值分布增强了价值诱导LLMs的群体模拟。这些发现凸显了价值诱导LLMs作为有效的、基于心理学的模拟人类行为工具的潜力。

代码的回归语言模型

Yash Akhauri, Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed S. Abdelfattah

AI总结提出回归语言模型（RLM），利用冻结的大语言模型编码器直接从文本预测代码执行结果（如内存占用、延迟、神经网络精度等），在多个任务上达到高相关度。

Comments Published in International Conference on Machine Learning (ICML) 2026

AI中文摘要

我们研究代码到指标的回归：预测代码执行的数值结果，由于编程语言的开放性，这是一项具有挑战性的任务。虽然先前的方法依赖于繁重且特定领域的特征工程，但我们展示了一个统一的回归语言模型（RLM），使用冻结的LLM编码器可以直接从文本同时预测：(i) 多种高级语言（如Python和C++）代码的内存占用，(ii) Triton GPU内核的延迟，以及(iii) 以ONNX表示的已训练神经网络的精度和速度。特别是，一个基于T5Gemma的较小300M参数RLM在APPS的竞赛编程提交上获得了>0.9的Spearman等级相关系数，而单个统一模型在CodeNet的17种不同语言上获得了>0.5的平均Spearman等级相关系数。此外，RLM在五个经典NAS设计空间上获得了最高平均Kendall-Tau 0.46，这些空间此前由图神经网络主导，并且能同时预测多种硬件平台上的架构延迟。

英文摘要

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) using a frozen LLM encoder can simultaneously predict, directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM based on T5Gemma obtains >0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves >0.5 average Spearman-rank across 17 different programming languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.
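The results in this abstract are reported as Spearman rank correlation, which measures whether predicted values order the programs the same way the measured values do, regardless of scale. The sketch below is a minimal pure-Python version (average ranks for ties, then Pearson correlation on the ranks). The helper names and the toy numbers are illustrative assumptions, not data from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy example: made-up predicted vs. measured memory footprints.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.0, 4.0, 9.0, 100.0]
print(spearman(predicted, measured))
```

Because only the ordering matters, the toy predictions above score a perfect correlation even though they are far off in absolute terms, which is why a rank metric suits comparing programs by cost.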

2601.18252 2026-06-17 cs.CV cs.AI cs.LG stat.ML 版本更新

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Co-PLNet: 一种用于提示引导的线框解析的协作点线网络

Chao Wang, Xuanying Li, Cheng Dai, Jinglei Feng, Yuxiang Luo, Hao Qin, Yuqi Ouyang

AI总结提出点线协作框架Co-PLNet，通过点线提示编码器交换空间线索，并利用交叉引导线解码器增强点线一致性，在Wireframe和YorkUrban数据集上提升线框解析的准确性和鲁棒性。

AI中文摘要

线框解析旨在恢复线段及其连接点，以形成结构化的几何表示，用于同时定位与地图构建（SLAM）等下游任务。现有方法分别预测线和点，并在事后进行调和，导致不匹配和鲁棒性降低。我们提出Co-PLNet，一个点线协作框架，在两个任务之间交换空间线索，其中早期检测通过点线提示编码器（PLP-Encoder）转换为空间提示，该编码器将几何属性编码为紧凑且空间对齐的图。交叉引导线解码器（CGL-Decoder）随后通过基于互补提示的稀疏注意力细化预测，强制点线一致性和效率。在Wireframe和YorkUrban上的实验显示，准确性和鲁棒性持续改进，同时具有有利的实时效率，证明了我们的方法在结构化几何感知中的有效性。我们的代码可在 https://github.com/GalacticHogrider/Co-PLNet 获取。

英文摘要

Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks. Early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating the effectiveness of our approach for structured geometry perception. Our code is available at https://github.com/GalacticHogrider/Co-PLNet.

2602.14771 2026-06-17 cs.CV cs.AI cs.LG cs.MM cs.NE 版本更新

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA：基于联合嵌入预测架构的通用目标跟踪与模型自适应及遮挡处理

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

AI总结提出GOT-JEPA框架，通过预测跟踪模型而非图像特征来提升泛化能力，并设计OccuSolver增强遮挡感知，在七个基准上验证了有效性。

Comments Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking

DOI: 10.1109/TCSVT.2026.3675005
Journal ref: IEEE Transactions on Circuits and Systems for Video Technology 2026

AI中文摘要

人类视觉系统通过整合当前观测与先前观测信息、适应目标和场景变化、以及精细推理遮挡来跟踪物体。相比之下，最近的通用目标跟踪器通常针对训练目标进行优化，这限制了在未见场景中的鲁棒性和泛化能力，并且它们的遮挡推理仍然粗糙，缺乏对遮挡模式的详细建模。为了解决这些在泛化和遮挡感知方面的局限性，我们提出了GOT-JEPA，一个模型预测预训练框架，将JEPA从预测图像特征扩展到预测跟踪模型。给定相同的历史信息，教师预测器从干净的当前帧生成伪跟踪模型，学生预测器学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督，并明确训练预测器在遮挡、干扰和其他不利观测下产生可靠的跟踪模型，从而提高了对动态环境的泛化能力。基于GOT-JEPA，我们进一步提出了OccuSolver来增强目标跟踪的遮挡感知。OccuSolver调整了一个以点为中心的点跟踪器，用于目标感知的可见性估计和详细的遮挡模式捕获。在跟踪器迭代生成的目标先验条件下，OccuSolver逐步细化可见性状态，增强遮挡处理，并产生更高质量的参考标签，逐步改进后续模型预测。在七个基准上的广泛评估表明，我们的方法有效增强了跟踪器的泛化能力和鲁棒性。

英文摘要

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

2603.03485 2026-06-17 cs.CV cs.AI cs.RO 版本更新

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Phys4D: 从视频扩散模型实现细粒度物理一致的4D建模

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

AI总结提出Phys4D流水线，通过三阶段训练（伪监督预训练、物理监督微调、强化学习校正）从视频扩散模型学习物理一致的4D世界表示，显著提升细粒度时空与物理一致性。

AI中文摘要

最近的视频扩散模型作为大规模生成式世界模型已经取得了令人印象深刻的能力。然而，这些模型通常难以保持细粒度的物理一致性，随时间表现出物理上不合理的动态。在这项工作中，我们提出了 Phys4D，一个从视频扩散模型中学习物理一致的4D世界表示的流水线。Phys4D 采用三阶段训练范式，逐步将外观驱动的视频扩散模型提升为物理一致的4D世界表示。我们首先通过大规模伪监督预训练引导出稳健的几何和运动表示，为4D场景建模奠定基础。然后，我们使用模拟生成的数据进行基于物理的监督微调，强制执行时间一致的4D动态。最后，我们应用基于模拟的强化学习来纠正难以通过显式监督捕获的残留物理违规。为了评估超越外观指标的细粒度物理一致性，我们引入了一套4D世界一致性评估，探测几何一致性、运动稳定性和长期物理合理性。实验结果表明，与外观驱动的基线相比，Phys4D 显著改善了细粒度时空和物理一致性，同时保持了强大的生成性能。我们的项目页面可在 https://sensational-brioche-7657e7.netlify.app/ 获取。

英文摘要

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

ThinkJEPA：赋予潜在世界模型大型视觉-语言推理能力

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

AI总结提出ThinkJEPA框架，结合密集JEPA分支与稀疏VLM思考者分支，通过分层金字塔表示提取模块，实现细粒度运动建模与长程语义引导，在手部操作轨迹预测任务上超越基线。

Comments 10 pages, 5 figures

AI中文摘要

潜在世界模型（如V-JEPA2）的最新进展展示了从视频观测预测未来世界状态的能力。然而，短观测窗口的密集预测限制了时间上下文，可能导致预测偏向局部低层次外推，难以捕捉长程语义并降低下游效用。相比之下，视觉-语言模型（VLM）通过对均匀采样帧进行推理，提供强大的语义基础和通用知识，但由于计算驱动的稀疏采样、语言输出瓶颈（将细粒度交互状态压缩为文本导向表示）以及适应小规模动作条件数据集时的数据分布不匹配，它们不适合作为独立的密集预测器。我们提出了一种VLM引导的JEPA风格潜在世界建模框架，通过双时间路径结合密集帧动态建模与长程语义指导：一个密集JEPA分支用于细粒度运动和交互线索，以及一个均匀采样的VLM“思考者”分支，具有更大的时间步长以提供知识丰富的指导。为了有效传递VLM的渐进推理信号，我们引入了一个分层金字塔表示提取模块，将多层VLM表示聚合成与潜在预测兼容的指导特征。在手部操作轨迹预测实验上，我们的方法优于强VLM-only基线和JEPA预测器基线，并展现出更鲁棒的长程展开行为。

英文摘要

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2603.28251 2026-06-17 cs.CV cs.AI 版本更新

DriveJudge: 用视觉-语言模型重新思考自动驾驶评估

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

发表机构 * NVIDIA（英伟达）

AI总结提出DriveJudge，结合规则评估与VLM推理，通过选择性调用物理规则函数实现可解释且上下文感知的驾驶评估，在驾驶质量分类和轨迹偏好选择任务上超越现有方法。

Comments Under Review

AI中文摘要

自动驾驶已转向端到端策略学习，其中可靠、可解释的策略评估是一个基本挑战，因为驾驶质量高度依赖于上下文。常用的基于规则的驾驶指标（如EPDMS）可解释但缺乏上下文感知，而近期基于VLM的评估虽具有上下文感知能力，但受限于模糊的VLM输出和较弱的物理基础。为了以既可解释又上下文感知的方式评估驾驶，我们引入了DriveJudge。DriveJudge是一个驾驶评估代理，它将规则基础评估与视觉-语言模型（VLM）推理相结合，并在解释环境上下文后有选择地调用基于物理的确定性规则函数。为了训练和评估DriveJudge，我们整理了一个包含33,577个具有挑战性的驾驶样本的大规模数据集，并附有人类标注，指示给定场景中的驾驶行为是否合理。利用该数据集，我们解决了驾驶指标评估中未被充分探索的问题，并引入了两个与人类对齐的基准任务：驾驶质量分类和轨迹偏好选择。DriveJudge在驾驶质量分类上比EPDMS高出21.23 AUC，在轨迹偏好选择上比近期基于VLM的DriveCritic高出6.5%，为可解释且精确的驾驶评估设立了新标准。

英文摘要

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge because driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17386 2026-06-17 cs.CV cs.AI cs.RO 交叉投稿

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

TerraTransfer: 无需专家示范的端到端驾驶策略学习

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition ； UCLA（加州大学洛杉矶分校）； UC Berkeley（加州大学伯克利分校）

AI总结提出一种无需专家示范的端到端驾驶方法，通过向量化模拟器中的自博弈预训练策略，再与预训练视觉骨干对齐，降低了数据成本并达到或超越现有方法。

AI中文摘要

端到端自动驾驶在基准测试和实际部署中取得了最先进的性能。然而，其标准训练流程在所有阶段都成本高昂：收集和标注数百万驾驶帧代价昂贵，而在图像上进行闭环强化学习受限于每步的光真实感渲染和大视觉骨干的前向传播成本。在向量化模拟器中进行自博弈改变了经济性：每秒数百万次 rollout 步骤，状态分布自然包含碰撞、近碰撞和恢复等驾驶日志中不包含的情况。我们的方法通过解耦学习驾驶和学习视觉来利用这种不对称性。我们通过自博弈预训练单个策略，然后通过动作 KL 散度和批量关系低秩结构损失将其潜在空间与预训练视觉骨干对齐。动作目标来自自博弈策略，因此对齐从未对记录的轨迹进行监督：只需要一个（图像、场景状态）帧的配对数据集，无需模仿预训练所依赖的精心策划的专家示范。在光真实感 3D 高斯泼溅闭环场景中，得到的端到端策略匹配或超越了先前的端到端方法。

英文摘要

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.

2606.17511 2026-06-17 cs.RO cs.AI cs.CV 交叉投稿

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

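Affiliations: Northwestern University; Peking University; University of California, Berkeley; ShanghaiTech University

AI summary: Proposes MagicSim, an embodied-interaction infrastructure built on a deterministic batched runtime and a shared MDP; YAML specifications decouple contents, placement, behavior, and agent exposure, unifying world construction, execution, evaluation, and automatic trajectory generation.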

AI中文摘要

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底，而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层，无法重现、评估和标注同一情节。我们提出MagicSim，一个围绕确定性批处理运行时和共享马尔可夫决策过程（MDP）构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露，MagicSim在单一重置-步进循环中构建多样化的可执行世界，涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化，将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力：基准测试和强化学习评估、自动收集接口（自动将命令转化为具体轨迹）以及面向智能体/VLM的交互。对于自动执行，命令流经Command->Skill->Planner->Robot->Record流水线，而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹，将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此，MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

英文摘要

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.


2606.17767 2026-06-17 cs.HC cs.AI cross-list

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

与你的数据对话：探索具身对话作为个人健康反思的界面

Nikola Kovacevic, Bastien Husler, Di Zhuang, Rafael Wampfler, Barbara Solenthaler

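Affiliations: Department of Computer Science, ETH Zurich

AI summary: Proposes a new paradigm for interacting with wearable health data through an embodied conversational agent, using a dual-agent design (an Observer extracts statistical features; a Presenter communicates them as "spoken statistics"). A simulated-self user study (N=5) compares it with a traditional dashboard on perceived understanding, action specificity, and cognitive shift.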

Journal ref: Joint Proceedings of the ACM Intelligent User Interfaces (IUI) Workshops 2026, Paphos, Cyprus, July 13-16, 2026

AI中文摘要

来自可穿戴设备的个人健康数据通常通过图表和统计摘要的仪表盘呈现，要求用户主动解读模式和含义。我们探索了一种替代交互范式：通过一个具身对话代理与个人健康数据进行互动，该代理在与用户的对话中促进客观的数据反思。我们提出了一个系统，它将可穿戴数据的轻量级预处理与基于Unity的具身角色相结合。在内部，系统遵循双代理设计，其中观察者代理提取描述性统计和时间趋势，呈现者代理通过“口语化统计”传达这些发现，有意避免临床建议，以隔离交互模态的影响。我们通过一个模拟自我用户研究（N=5）采用被试内设计评估了这种方法。参与者采用来自LifeSnaps数据集的健康角色和目标，比较了传统仪表盘探索与具身对话反思。我们的评估侧重于感知理解、生成行动的具体性，以及从被动观看到主动意义建构的认知转变。本文贡献了一个功能原型、一个客观健康数据叙事生成的设计模式，以及关于具身性如何影响个人健康指标解释的早期实证见解。

英文摘要

Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.


2606.17924 2026-06-17 cs.RO cs.AI cross-list

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

PearlVLA：潜在空间中的渐进式具身动作计划精炼

Bochen Yang, Lianlei Shan

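Affiliations: Imperial College London; Tsinghua University

AI summary: Proposes the PearlVLA framework, which performs iterative plan refinement in the VLM latent space to balance efficient action generation against explicit deliberation, reaching state-of-the-art performance on the LIBERO benchmark.

Comments 21 pages, 2 figures. Preprint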

AI中文摘要

当前的视觉-语言-动作（VLA）模型在高效动作生成与显式推理之间存在权衡。直接从视觉-语言骨干表示解码动作可实现低延迟控制，而通过文本链、像素级子目标或动作搜索进行显式推理可以改善规划，但会带来大量延迟和计算成本。我们提出PearlVLA，一个将推理转移到视觉-语言模型（VLM）潜在空间中的VLA框架。PearlVLA将VLM元查询表示分离为固定的视觉接地分支和迭代的潜在计划分支。在每个精炼轮次中，一个计划条件的世界查询探测一个轻量级冻结的潜在世界模型，以获取无动作的未来观察潜在表示，该表示被反馈以指导计划精炼。然后，一个未来引导的RefineNet应用计划的残差更新，逐步将粗糙的语义草稿精炼为细粒度的潜在动作计划。经过K轮精炼后的计划被并行解码为动作块，用于低延迟执行。我们进一步引入因果精炼分组过程奖励强化学习，以优化潜在精炼过程，奖励来自由潜在计划编辑引起的更长视野想象未来。在LIBERO基准上的实证评估表明，PearlVLA在现有方法中达到了最先进的性能。

英文摘要

Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.


2606.18092 2026-06-17 cs.RO cs.AI cross-list

EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning

EAGG: 通过几何感知图条件实现具身对齐的抓取生成

Wanhao Niu, Qiyan Ke, Yuan Sun, Hao Sun, Jie Xu, Muyuan Ma, Ruiqi Hu, Fuchun Sun

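Affiliations: Department of Computer Science and Technology, Tsinghua University; Beijing Moce Future Technology Co., Ltd.

AI summary: Proposes EAGG, a unified model for cross-end-effector grasp generation built on topology-aware end-effector graphs and geometry-aware tokens; it reaches 56.17% average success on the MultiGripperGrasp benchmark and reduces the pooled median contact distance from 0.239 cm to 0.189 cm.

Comments 16 pages, 8 figures. Code is available at https://github.com/wanhaoniu/EAGG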

AI中文摘要

跨末端执行器抓取生成旨在寻求一个统一的模型，能够泛化到不同物体以及从平行夹爪到灵巧末端执行器的不同具身形态。现有的抓取生成器通常针对固定具身设计，或使用静态描述符编码具身身份，当拓扑结构、驱动耦合和接触几何差异较大时，这会削弱迁移能力。我们提出EAGG，一种具身对齐的抓取生成器，通过拓扑感知的末端执行器图和具身特定的低维末端执行器控制空间来表示每个具身。一个冻结的末端执行器认知骨干将当前关节状态转换为几何感知令牌，作为可复用的形态先验，并通过迭代几何注入在采样过程中刷新这些令牌，使条件与不断演变的末端执行器几何保持同步。在MultiGripperGrasp基准上，EAGG在六个训练末端执行器上达到56.17%的平均成功率，与专门训练的差距在1.10个百分点以内，同时保持对微调和零样本末端执行器的迁移能力。迭代几何注入进一步将合并中位接触距离从0.239厘米降低到0.189厘米。这些结果表明，通过在共享生成器内对齐具身结构而非抑制具身差异，可以增强跨末端执行器抓取生成。代码可在该网址获取：https://this URL。

英文摘要

Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at https://github.com/wanhaoniu/EAGG.


2606.18247 2026-06-17 cs.RO cs.AI cross-list

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

视觉验证实现推理时引导与自主策略改进

Mingtong Zhang, Dhruv Shah

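Affiliations: Princeton University

AI summary: Proposes the VERITAS framework, which uses a pre-trained generalist robot policy as the generator and a gradient-free visual verifier that evaluates actions at inference time, enabling inference-time policy steering without additional training as well as offline policy improvement.

Comments Website: https://veritas-improvement.github.io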

AI中文摘要

部署在现实世界中的机器人应从经验中学习并随时间改进。这需要一个实践并从反馈中学习的机制。在本文中，我们提出VERITAS，一个用于通用机器人策略的生成器-验证器框架，用于推理时策略引导和自我改进。我们使用预训练的通用机器人策略作为“生成器”，并将其与一个无梯度的“视觉验证器”配对，该验证器在推理时评估动作。该框架实现了推理时引导，无需额外训练即可提高策略性能。我们证明，推理时验证在无需额外演示数据训练的情况下，始终优于普通通用策略。此外，我们证明验证后的 rollout 为离线策略改进提供了有效的监督：在验证后的自生成轨迹上微调的策略实现了持续的性能提升。值得注意的是，我们发现使用验证后的 rollout 进行后训练达到了与专家演示相当的效率，同时无需人工干预。我们的结果突出了推理时验证作为一种实用且可扩展的机制，用于在部署期间改进机器人策略。

英文摘要

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human intervention. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.


2606.16533 2026-06-17 cs.AI cs.CV replacement

Kairos: A Native World Model Stack for Physical AI

Kairos: 面向物理AI的原生世界模型栈

Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

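Affiliations: Kairos Team

AI summary: Proposes Kairos, a native world model stack that combines a cross-embodiment data curriculum, a hybrid linear temporal attention architecture, and deployment-aware system co-design to acquire world knowledge, maintain long-horizon state, and execute efficiently, achieving top-level performance on embodied world-model and related benchmarks.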

AI中文摘要

世界模型正从被动视觉生成器转变为物理AI的基础性、可操作基础设施：它们必须从异构经验中原生获取世界知识，在长时间跨度内维持持久状态，并在实际部署约束下高效执行。我们引入Kairos，一个围绕这些需求设计的原生世界模型栈。(1) Kairos通过开创由跨具身数据课程指导的原生预训练范式来学习世界，该课程将开放世界视频、人类行为数据和机器人交互组织成渐进式发展路径。(2) Kairos通过配备混合线性时间注意力的原生统一架构来维持世界，该架构中滑动窗口注意力捕捉局部动态，扩张滑动窗口捕捉中程依赖，门控线性注意力维持持久全局记忆。我们建立了形式化理论界限，证明这种时间分解严格限制了误差累积，从数学上保证了跨扩展时间范围的状态传播。(3) Kairos通过整合部署感知系统协同设计来运行世界，支持在服务器和消费级硬件上为真实世界的观察-行动-反馈循环生成低延迟展开。在具身世界模型、长时程和动作策略基准上的实验表明，Kairos在实现顶级性能的同时提供了强大的效率-能力权衡。这些结果共同将Kairos定位为未来自进化物理智能的凝聚性操作基础。

英文摘要

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
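To make the local and mid-range components of the hybrid attention concrete, here is a minimal sketch of the two window masks, assuming causal attention over frame indices; the abstract does not give window sizes, dilation rates, or causality, so those are illustrative assumptions, and the gated linear attention branch is not shown.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumption: causal attention restricted to the last `window` positions.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def dilated_window_mask(n, window, dilation):
    """Like the sliding window, but keys are sampled every `dilation` steps,
    so the same number of keys spans a longer (mid-range) history."""
    return [
        [
            i - j >= 0
            and (i - j) % dilation == 0
            and (i - j) // dilation < window
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With `window=3` and `dilation=2`, a query at position 6 sees positions 6, 4, and 2: the key budget is unchanged while the covered span doubles.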


2308.14329 2026-06-17 cs.RO cs.AI replacement

SSIL: Self-Supervised Imitation Learning for End-to-End Driving

SSIL: 用于端到端驾驶的自监督模仿学习

Jin Bok Park, Jinkyu Lee, Muhyun Back, Hyun Min Han, Tianwei Ma, Sang Min Won, Sung Soo Hwang, Il Yong Chun

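AI summary: Proposes SSIL, a self-supervised imitation learning framework that generates pseudo steering-angle data from vehicle poses without driving commands or pre-trained models; combined with the cross-attention-based conditioning approach CACA, it reaches driving accuracy comparable to supervised learning on three benchmark datasets.

Comments 8 pages, 4 figures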

AI中文摘要

在自动驾驶中，直接从传感器数据预测车辆控制信号的端到端（E2E）驾驶方法正迅速受到关注。为了学习安全的E2E驾驶系统，需要大量的驾驶数据和人工干预。车辆控制数据由数小时的人类驾驶构建，构建大型车辆控制数据集具有挑战性。通常，公开可用的驾驶数据集是在有限的驾驶场景下收集的，而收集车辆控制数据仅由车辆制造商提供。为了解决这些挑战，本文提出了首个用于E2E驾驶的自监督学习框架——自监督模仿学习（SSIL）。所提出的SSIL框架可以在不使用驾驶命令数据或预训练模型的情况下学习基于视觉的E2E驾驶网络。为了构建伪转向角数据，提出的SSIL从当前和先前时间点通过激光雷达传感器估计的车辆位姿预测伪目标。此外，我们提出了一种新的基于交叉注意力的条件方法（CACA），用于E2E驾驶中的视觉编码器，其中高级指令作为视觉信息的条件信号。我们在三个不同基准数据集上的数值实验表明，所提出的SSIL框架实现了与监督学习对应方法非常相当的E2E驾驶精度。此外，所提出的伪标签预测器优于使用比例积分微分控制器的现有方法，并且所提出的CACA在现有条件方法中实现了优越的性能。

英文摘要

In autonomous driving, the end-to-end (E2E) driving approach that predicts vehicle control signals directly from sensor data is rapidly gaining attention. To learn a safe E2E driving system, one needs an extensive amount of driving data and human intervention. Vehicle control data is constructed from many hours of human driving, and it is challenging to construct large vehicle control datasets. Often, publicly available driving datasets are collected with limited driving scenes, and collecting vehicle control data is possible only for vehicle manufacturers. To address these challenges, this paper proposes the first self-supervised learning framework, Self-Supervised Imitation Learning (SSIL), for E2E driving. The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for a vision encoder in E2E driving, where a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy comparable to that of its supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperformed an existing one that uses a proportional-integral-derivative controller, and the proposed CACA achieved superior performance over existing conditioning approaches.


2506.17639 2026-06-17 cs.RO cs.AI replacement

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

RLRC：基于强化学习的压缩视觉-语言-动作模型恢复

Yuxuan Chen, Yixin Han, Yize Huang, Xiao Li

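AI summary: Proposes RLRC, a three-stage compression and recovery pipeline (structured pruning, recovery via SFT and reinforcement learning, then quantization) that achieves up to 8x memory reduction and 2.3x inference speedup while maintaining task success rate.

Comments 8 pages, 10 figures; accepted by RA-L 2026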

DOI: 10.1109/LRA.2026.3700379
Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8864-8871, July 2026

AI中文摘要

视觉-语言-动作模型（VLA）在复杂机器人操作中展示了卓越的能力和巨大潜力。然而，其庞大的参数规模和高推理延迟阻碍了实际部署，尤其是在资源受限的平台上。为此，我们对VLA的模型压缩进行了系统的实证研究。基于这些见解，我们提出了\textit{RLRC}，一个三阶段压缩和恢复流程，包括结构化剪枝、通过SFT和RL进行性能恢复，以及后续量化。RL阶段引入了评论家预热策略和BC损失正则化，以稳定训练并保持策略行为。RLRC实现了高达8倍的内存减少和2.3倍的推理加速，同时保持原始任务成功率。在多个VLA骨干网络上的大量实验表明，RLRC始终优于现有的压缩基线，突显了其在设备端部署的有效性。项目网站：此https URL

英文摘要

Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present RLRC, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8 times memory reduction and 2.3 times inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io


2509.26633 2026-06-17 cs.RO cs.AI cs.LG cs.SY eess.SY replacement

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

OmniRetarget：面向人形全身运动操控与场景交互的交互保持数据生成

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi

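AI summary: Proposes the OmniRetarget engine, which uses an interaction mesh to explicitly model and preserve spatial and contact relationships among the agent, terrain, and objects while retargeting human motion to robots, producing high-quality trajectories for training RL policies that execute long-horizon parkour and loco-manipulation skills.

Comments Project website: https://omniretarget.github.io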

AI中文摘要

教授人形机器人复杂技能的主流范式是将人类运动重定向为运动学参考，以训练强化学习（RL）策略。然而，现有的重定向流程常常难以应对人与机器人之间的显著具身差异，产生物理上不可信的伪影，如脚滑和穿透。更重要的是，常见的重定向方法忽略了对于表达性运动及运动操控至关重要的丰富的人-物和人-环境交互。为解决这一问题，我们引入了OmniRetarget，一种基于交互网格的交互保持数据生成引擎，该网格显式建模并保持智能体、地形和操作对象之间的关键空间与接触关系。通过最小化人体与机器人网格之间的拉普拉斯变形同时施加运动学约束，OmniRetarget生成运动学上可行的轨迹。此外，保持任务相关的交互使得从单一示范到不同机器人本体、地形和物体配置的高效数据增强成为可能。我们通过将来自OMOMO、LAFAN1和我们内部MoCap数据集的运动进行重定向，全面评估了OmniRetarget，生成了超过8小时的轨迹，这些轨迹在运动学约束满足和接触保持方面优于广泛使用的基线。这种高质量数据使得本体感觉RL策略能够在Unitree G1人形机器人上成功执行长达30秒的长时间跑酷和运动操控技能，且仅使用5个奖励项和所有任务共享的简单域随机化进行训练，无需任何学习课程。

英文摘要

A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
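The Laplacian deformation term mentioned above can be illustrated with a small sketch. It assumes uniform neighbor weights and a fixed, shared neighbor list for both meshes; the abstract does not state the weighting scheme, so this is not necessarily OmniRetarget's formulation, and the function names are hypothetical.

```python
def laplacian_coords(points, neighbors):
    """Uniform-weight Laplacian coordinate: each vertex minus the mean of
    its neighbors. `neighbors[i]` lists the indices adjacent to vertex i."""
    coords = []
    for i, p in enumerate(points):
        nbrs = neighbors[i]
        mean = [sum(points[j][d] for j in nbrs) / len(nbrs) for d in range(len(p))]
        coords.append([p[d] - mean[d] for d in range(len(p))])
    return coords


def laplacian_deformation_energy(src_points, tgt_points, neighbors):
    """Sum of squared differences between the two meshes' Laplacian
    coordinates; small when relative vertex placement is preserved."""
    src = laplacian_coords(src_points, neighbors)
    tgt = laplacian_coords(tgt_points, neighbors)
    return sum((a - b) ** 2 for s, t in zip(src, tgt) for a, b in zip(s, t))
```

Because each coordinate is relative to its neighbors, translating the whole mesh leaves the energy at zero, while moving one vertex relative to the others raises it. That is why such a term penalizes changes to spatial relationships rather than to absolute positions.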


2605.05172 2026-06-17 cs.RO cs.AI replacement

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

当生活给你行为克隆，就做Q函数：从行为克隆中提取Q值用于机器人强化学习

Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng

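Affiliations: Rai-Inst

AI summary: Proposes the Q2RL algorithm, which extracts a Q-function from a behavior cloning policy and uses Q-gating to switch between policies for efficient offline-to-online reinforcement learning, reaching success rates of up to 100% and up to 3.75x improvement on robot manipulation tasks.

Comments Robotics: Science and Systems, 2026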

AI中文摘要

行为克隆（BC）已成为机器人学习的一种高效范式。然而，BC在收集演示后缺乏自我引导的在线改进机制。现有的离线到在线学习方法常常由于离线数据与在线学习之间的分布不匹配，导致策略替换先前学习的好动作。在这项工作中，我们提出了Q2RL（从BC进行Q估计和Q门控用于强化学习），一种高效的离线到在线学习算法。我们的方法包括两部分：（1）Q估计通过与环境的少量交互步骤从BC策略中提取Q函数，然后进行在线RL；（2）Q门控根据各自的Q值在BC和RL策略动作之间切换，以收集用于RL策略训练的样本。在D4RL和robomimic基准测试的操作任务中，Q2RL在成功率和收敛时间上优于最先进的离线到在线学习基线。Q2RL足够高效，可应用于机器人上的RL设置，在1-2小时的在线交互中学习接触密集和高精度操作任务（如管道组装和套件装配）的鲁棒策略，成功率达到100%，相比原始BC策略提升高达3.75倍。代码和视频见https://this URL。

英文摘要

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
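The Q-Gating step described above can be sketched as a per-step action selector. This is a minimal illustration, assuming a single shared critic `q_fn` scores both candidate actions and that ties go to the BC action; the abstract says only that the switch is based on the actions' respective Q-values, so both choices, and all names here, are hypothetical.

```python
def q_gate(state, bc_policy, rl_policy, q_fn):
    """Pick the action to execute during online data collection.

    Assumptions: `bc_policy` and `rl_policy` map a state to an action, and
    `q_fn(state, action)` returns a scalar value estimate.
    Returns the chosen action and which policy produced it.
    """
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    # The RL action is used only when the critic rates it strictly higher,
    # so the BC behavior stays the default early in training.
    if q_fn(state, a_rl) > q_fn(state, a_bc):
        return a_rl, "rl"
    return a_bc, "bc"
```

Gating at collection time means the replay buffer is filled with whichever action the critic currently prefers, which is one way to keep previously learned good actions from being overwritten.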


2605.23733 2026-06-17 cs.RO cs.AI replacement

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

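Affiliations: LimX Dynamics

AI summary: Proposes the Any2Any paradigm, which uses kinematic alignment and dynamics fine-tuning to transfer pretrained whole-body tracking models to new humanoid embodiments efficiently, reaching competitive tracking performance with only a small amount of data and compute.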

AI中文摘要

全身跟踪（WBT）模型已成为人形机器人的关键基础，使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算，使得在新人形平台上快速部署成本高昂。这自然引发一个问题：预训练的WBT模型能否通过最小化适应跨本体迁移？为回答这个问题，我们提出Any2Any，一种范式，能够高效地将现有WBT专家迁移到新人形本体，仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐，对齐其输入和输出空间，使得预训练的源策略可以在目标本体上有意义地重用。然后，Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调（PEFT）组件进行动力学适应，保留有用的行为先验，同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明，与从头训练相比，Any2Any显著加速收敛并降低训练成本，同时实现具有竞争力或更优的跟踪性能。值得注意的是，仅使用完整训练所需计算和数据的1%，Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明，预训练的WBT专家可以跨本体高效重用，为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.


2605.31286 2026-06-17 cs.RO cs.AI replacement

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

DeMaVLA：面向可泛化可变形物体操作的视觉-语言-动作基础模型

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

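Affiliations: Tongji University

AI summary: Proposes the DeMaVLA model, which pairs a VLM backbone with an action expert and uses flow matching to generate continuous actions, prunes transformer layers for efficiency, and is trained on large-scale real-world data plus human-in-the-loop data aggregation, generalizing deformable-object folding across categories.

Comments 14 pages, 2 figures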

AI中文摘要

现实家庭机器人需要视觉-语言-动作（VLA）基础模型，能够在不同物体、任务条件和家庭环境中获取可重复使用的操作技能。可变形物体折叠是一个代表性挑战，要求机器人处理来自随机初始状态的衣物，涉及不同类别、几何形状、材料和场景。然而，现有的VLA系统通常为不同物体类别训练独立的策略，而简单混合的多任务训练常常遭受任务干扰和性能下降。为了超越类别特定的折叠策略，我们引入了DeMaVLA，一个面向可泛化可变形物体操作的VLA基础模型。DeMaVLA采用VLM骨干网络和动作专家，并使用流匹配来公式化连续动作生成。为了提高效率，动作专家通过剪枝每隔一个Transformer层构建，同时保持与VLM骨干网络的逐层对齐，从而降低训练和推理成本。DeMaVLA首先在大约5000小时精选的真实世界双臂演示数据上进行预训练，以获得通用的操作先验。然后，它在混合折叠数据上进行后训练，这些数据通过人类参与的数据聚合（DAgger）流程，聚合了自我收集的演示和来自多个折叠任务中真实机器人失败的纠正轨迹。实验表明，DeMaVLA在RoboTwin上取得了有竞争力的性能，并在我们的家庭折叠基准测试中取得了强大的真实世界结果。这些结果突显了可扩展的真实世界数据、高效的动作生成和纠正学习对于可变形物体操作中的通用VLA策略的价值。

英文摘要

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.


2606.14438 2026-06-17 cs.RO cs.AI replacement

CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners

CADET: 基于物理的因果审计与无训练去混杂的端到端驾驶规划器

Zikun Guo

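Affiliations: School of Electronics Engineering, Kyungpook National University

AI summary: Proposes the CADET framework, which audits and repairs spurious correlations in pretrained end-to-end driving planners without retraining, identifying confounders through a physics-grounded causal graph and intervening on test-time inputs.

Comments 8 pages, 4 figures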

2606.14551 2026-06-17 cs.RO cs.AI replacement

TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation

TRACE: 用于延迟证据视觉运动模仿的轨迹路由因果记忆

Zihao Li, Ranpeng Qiu, Yincong Chen, Guoqiang Ren, Weiming Zhi

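Affiliations: Zeno AI; Zhejiang University; Zhejiang University of Technology; The University of Sydney

AI summary: To address observation ambiguity caused by early cues disappearing in visuomotor imitation, proposes the TRACE memory framework, which uses path signatures to store and retrieve task-relevant evidence and improves branch-selection accuracy on long-horizon tasks.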

AI中文摘要

自主运行的机器人可能需要基于不再可见的证据做出决策。我们研究\emph{延迟证据}任务，其中早期线索在后续决策点之前消失，因此视觉上相似的观察可能需要不同的动作。在这些设置中，当前观察不足以作为控制的状态。我们引入了轨迹路由因果证据（TRACE），一种用于视觉运动模仿策略的记忆框架。TRACE将任务相关的视觉和机器人状态证据（如物体身份、目标选择或路线依赖状态）存储在固定大小的潜在记忆中，该记忆在长片段中保持有界。TRACE不是通过原始时间或手动提供的任务标签来索引记忆，而是使用\emph{路径签名}：已执行机器人状态轨迹的紧凑、顺序敏感特征。这些签名不存储视觉线索本身；相反，它们提供了轨迹条件化的键，用于写入和检索线索可见时存储的证据。当机器人后来遇到歧义观察时，策略以TRACE记忆为条件，恢复缺失的上下文并选择正确的分支。TRACE通过轻量级适配器附加到策略上，而不改变策略主干、动作头或模仿目标。在具有视觉歧义分支点的真实世界长时域操作任务中，TRACE在分支选择和任务成功率上优于替代基线，包括短历史记忆和循环记忆。项目页面：此 https URL

英文摘要

Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace
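The path signatures used as memory keys above are a standard construction, and a small sketch shows why they are order-sensitive. It computes the signature truncated at level 2 for a piecewise-linear path using Chen's identity; the truncation level and the use of raw waypoints as input are illustrative assumptions, since the abstract does not say how TRACE computes or truncates its signatures.

```python
def path_signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `path` is a list of equal-length coordinate tuples (e.g. robot states).
    Level 1 is the total displacement; level 2 holds iterated integrals,
    whose antisymmetric part depends on the order of the movements.
    """
    dim = len(path[0])
    s1 = [0.0] * dim
    s2 = [[0.0] * dim for _ in range(dim)]
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one straight segment.
        for i in range(dim):
            for j in range(dim):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(dim):
            s1[i] += delta[i]
    return s1, s2
```

Moving right then up and moving up then right end at the same point, so level 1 is identical, but the off-diagonal level-2 terms swap. Two trajectories with the same endpoint can therefore still produce different keys.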

2606.15148 2026-06-17 cs.RO cs.AI updated version

An Agentic AI Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucinations in Medical Applications

Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

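Affiliations: Distributed Systems (qCLOUDS) Lab, School of Computing and Information Systems, The University of Melbourne, Australia; Department of Computer Science and Engineering, School of Electrical and Computer Sciences (SECS), Indian Institute of Technology Bhubaneswar, India; Department of Computer Science, Banaras Hindu University, Varanasi, India

AI summary: Proposes a multi-agent framework that uses deterministic orchestration constraints and two safety mechanisms (a neuro-symbolic state-tracking gate and a semantic-entropy uncertainty-quantification gate) to address premature diagnostic handoff and silent hallucinations in LLM medical dialogue, improving diagnostic precision by 11.3 percentage points.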

Abstract

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.
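
The two gates described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `missing_dimensions`, `semantic_entropy`, and `may_deliver`, the dictionary keys, and the `max_entropy` threshold of 0.7 are all assumptions. Grouping the K sampled diagnoses into semantically equivalent clusters is assumed to happen upstream, so the sketch takes cluster labels as input.

```python
import math
from collections import Counter

# The eight OLDCARTS dimensions listed in the abstract (key names are hypothetical).
OLDCARTS = ("onset", "location", "duration", "character",
            "aggravating_alleviating", "radiation", "timing", "severity")


def missing_dimensions(collected):
    """Return the OLDCARTS dimensions that have no recorded answer yet."""
    return [dim for dim in OLDCARTS if not collected.get(dim)]


def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) over semantic clusters of sampled diagnoses."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def may_deliver(collected, cluster_labels, max_entropy=0.7):
    """Deterministic gate: every dimension collected and samples mostly agree."""
    if missing_dimensions(collected):
        return False
    return semantic_entropy(cluster_labels) <= max_entropy


complete = {dim: "recorded" for dim in OLDCARTS}
partial = dict(complete, radiation="")
```

With K=5 samples, five identical clusters give an entropy of 0, while five mutually different diagnoses give ln 5 (about 1.61), so a fixed threshold separates agreement from divergence without asking another LLM to judge.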

2606.17114 2026-06-17 cs.CR cs.AI cross-listed

An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios

Hankyul Baek, Jaewon Noh, Sang Seo, Yongsu Kim, Gabriel Waikin Loh Matienzo, Young Il Kim, Ee Wei Seah, Akriti Vij

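Affiliations: Korea AI Safety Institute; Singapore AI Safety Institute

AI summary: Evaluates data-leakage risk for AI agents across 12 non-adversarial tasks and finds that every tested agent shows problems such as weak data-security awareness and over-access to information, indicating that operational data leakage is a first-order safety risk distinct from adversarial exfiltration.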

Abstract

AI agents are increasingly being adopted in enterprise and personal settings with access to emails, databases, documents, and other tools where they can read, update, and disseminate sensitive information. Much of prior research on data leakage risks in agents has focused on adversarial data exfiltration through prompt injections and jailbreaks. However, sensitive information may also be exposed during non-adversarial use, creating leakage risks even when users issue benign requests. We report a joint evaluation by the Singapore AI Safety Institute and the Korea AI Safety Institute examining agent data leakage in 12 realistic, non-adversarial tasks spanning customer support, DevOps, web automation, and enterprise and personal productivity. The evaluation covers five risk types: lack of data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness. Both institutes tested a common set of scenarios mirroring real-world deployments using independent testing environments and task-specific LLM-judge rubrics. Across the three tested agents, none achieved fully correct and fully safe execution across all scenarios. Successful task completion often coincided with data-handling failures such as accessing unnecessary information or disclosing information to inappropriate recipients, indicating that capability and data-handling safety should be evaluated separately. Qualitative review also revealed claim-action mismatches, simulation-aware behavior, user-simulator role reversal, and interpretation gaps in automated judging. Overall, the results indicate that operational data leakage is a first-order agent-safety concern distinct from adversarial exfiltration and provide a methodology for future evaluations of agent data-handling safety.

2606.17122 2026-06-17 cs.CR cs.AI cs.LG cross-listed

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

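Affiliations: University of California, Riverside; YouTube (Google)

AI summary: Proposes REINS, which steers the internal representations of video diffusion models along a linear direction at inference time to achieve training-free safety alignment, avoiding harmful generations without degrading general capability.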

Abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, to our knowledge, the broadest safety evaluation suite in the video generation literature.
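
The core operation, adding one direction to a hidden state, can be sketched in a few lines. Note the substitution: the abstract reports finding the direction with Supervised PCA on binary safety labels, whereas this sketch uses a plain difference of class means as a simpler stand-in. The function names, the toy 2-D "hidden states", and the `strength` value are hypothetical.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(safe_states, unsafe_states):
    """Unit vector pointing from the unsafe mean toward the safe mean.

    Assumption: a difference of means stands in for the Supervised PCA
    direction mentioned in the abstract.
    """
    safe_mu = mean_vector(safe_states)
    unsafe_mu = mean_vector(unsafe_states)
    diff = [s - u for s, u in zip(safe_mu, unsafe_mu)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def steer(hidden_state, direction, strength):
    """Add a scaled direction to one hidden-state vector; no weights change."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]


def projection(vec, direction):
    """Scalar projection of vec onto direction."""
    return sum(v * d for v, d in zip(vec, direction))


safe = [[1.0, 0.0], [3.0, 0.0]]
unsafe = [[-1.0, 0.0], [-3.0, 0.0]]
direction = unit_direction(safe, unsafe)
original = [-2.0, 0.5]
steered = steer(original, direction, strength=4.0)
```

Steering moves the state along the safety axis only; the component orthogonal to the direction (0.5 here) is untouched, which is one intuition for why unrelated content could be preserved.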

2606.17286 2026-06-17 cs.CY cs.AI cross-listed

Structured Adversarial Camouflage Based on Voronoi Diagrams

Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer

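Affiliations: Fraunhofer IOSB and Fraunhofer Center for Machine Learning; Karlsruhe Institute of Technology

AI summary: Proposes generating structured camouflage patterns by optimizing seed-point locations under a soft assignment. With a fixed palette, the patterns effectively lower person-detection AP, and the attack transfers across domains.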

Abstract

Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria -> COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (<=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code: https://github.com/JensBayer/Voronoi. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026.
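
One common way to make a Voronoi pattern differentiable with respect to its seed points is a softmax over negative squared distances. The sketch below assumes that form; the abstract only says "soft assignment", so the exact weighting, the `temperature` parameter, and the function name `soft_voronoi_color` are assumptions rather than the authors' formulation.

```python
import math


def soft_voronoi_color(pixel, seeds, palette, temperature=0.05):
    """Blend palette colours with softmax weights over negative squared distances.

    pixel: (x, y); seeds: list of (x, y); palette: one RGB tuple per seed.
    Returns (color, weights). A small temperature approaches hard Voronoi cells.
    """
    logits = [-((pixel[0] - sx) ** 2 + (pixel[1] - sy) ** 2) / temperature
              for sx, sy in seeds]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    color = tuple(sum(w * c[ch] for w, c in zip(weights, palette))
                  for ch in range(3))
    return color, weights


seeds = [(0.2, 0.5), (0.8, 0.5)]
palette = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]  # fixed colours; only seeds would move
mid_color, mid_weights = soft_voronoi_color((0.5, 0.5), seeds, palette)
near_color, near_weights = soft_voronoi_color((0.2, 0.5), seeds, palette)
```

Because the colours stay fixed and only the seed coordinates enter the weights, gradients of a detector loss would flow to the seed positions alone, which matches the "optimizes only seed-point locations" description.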

2606.17810 2026-06-17 cs.LG cs.AI cross-listed

No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems

Khoat Than

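Affiliations: Hanoi University of Science and Technology

AI summary: Presents the No-Free-Fairness theorems, which identify three inherent sources of disparity in learning systems: irreducible task cost forces a trade-off between performance and fairness, finite samples induce subgroup disparity, and limited model-class expressivity can make fairness unattainable. Unfairness therefore stems from the structure of the decision problem, the finiteness of data, and model expressivity.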

Abstract

In this paper, we establish a set of theoretical impossibility results, termed the No-Free-Fairness theorems, that identify three fundamental sources of disparity in learning systems. First, we show that when a task exhibits irreducible cost on a subgroup, any decision rule must trade off overall performance with disparity, yielding an inherent fairness--cost frontier. Second, we prove that even in ideal, noise-free settings where a perfectly fair and accurate solution exists, finite-sample learning alone induces nontrivial subgroup disparity, ruling out distribution-free fairness guarantees. More seriously, enforcing strict relative fairness creates a statistical bottleneck: achieving low cost may require exponentially many samples. Third, we show that limitations of the model class can independently induce disparity: if the model cannot represent accurate solutions for a subgroup, fairness remains unattainable regardless of data or training procedure. Overall, these results demonstrate that unfairness is not solely a consequence of biased data or suboptimal optimization, but arises from the intrinsic structure of decision problems, the constraints of finite data, and the expressivity of models. Our framework applies broadly beyond standard supervised learning, and suggests that achieving fairness requires explicit trade-offs and should be treated as a core design consideration.

2606.17872 2026-06-17 cs.LG cs.AI cross-listed

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

Ning Ni, Yingjie Lao

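Affiliations: Department of Computer Science, Tufts University; Department of Electrical and Computer Engineering, Tufts University

AI summary: Proposes AnchorKV, a KV cache compression method that uses a soft penalty to shift token-retention scores away from directions associated with harmful prompts, substantially improving safety while preserving utility.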

Abstract

Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since scaling pre-trained language models improves downstream capability \cite{zhao2023survey}, the key-value (KV) cache becomes a dominant inference bottleneck. Recent KV cache compression methods \cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv} reduce this cost by retaining only a subset of attention-relevant tokens. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks \cite{jiang2024robustkv} or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from directions in key space associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach \cite{arditi2024refusal,zou2023representation} to the layer-specific key projection space used in KV caching. Based on this anchor, a soft penalty token selection rule trades a small amount of utility for substantially improved safety alignment, while reducing to the original compressor when the penalty is zero.
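
A soft-penalty selection rule of the kind described here might look like the sketch below. It is a guess at the general shape, not AnchorKV's actual rule: the cosine-similarity penalty, the names `select_tokens` and `penalty`, and the toy scores and keys are all assumptions. The one property taken directly from the abstract is that a zero penalty recovers the original compressor's ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tokens(scores, keys, anchor, budget, penalty=0.0):
    """Indices of the `budget` tokens with the highest penalised scores.

    scores: retention scores from the base compressor (one per token).
    keys: key vectors in the layer's key-projection space.
    anchor: offline direction associated with harmful prompts (assumed given).
    """
    adjusted = [s - penalty * cosine(k, anchor) for s, k in zip(scores, keys)]
    ranked = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:budget])


scores = [0.9, 0.8, 0.7, 0.1]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
anchor = [1.0, 0.0]
baseline = select_tokens(scores, keys, anchor, budget=2)
penalised = select_tokens(scores, keys, anchor, budget=2, penalty=0.5)
```

In this toy case the token whose key aligns with the anchor is kept by the baseline but dropped once the penalty is applied, while the ranking of the remaining tokens is unchanged.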

2606.18057 2026-06-17 cs.HC cs.AI cs.CL cs.CY cs.SI cross-listed

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

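Affiliations: University of Illinois Urbana-Champaign; University of Massachusetts Amherst; Indiana University Indianapolis

AI summary: Studies the paradox of AI generating "synthetic lived experience" in peer support. By comparing human and AI narratives in Alzheimer's caregiving communities, it shows that AI can emulate emotional support but lacks real experience, and that mechanisms are needed to separate supportive language from fabricated experience.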

Abstract

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

2606.18062 2026-06-17 cs.CL cs.AI cs.CR cs.HC cross-listed

In-Context Environments Induce Evaluation Awareness in Language Models

Maheep Chaudhary

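AI summary: Proposes a black-box adversarial optimization framework that optimizes the in-context prompt to induce evaluation awareness and strategic underperformance (sandbagging) in language models. Optimized prompts lower arithmetic accuracy by up to 94 percentage points, and the sandbagging is driven mainly by evaluation-aware reasoning.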

Abstract

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent evaluation awareness. This raises concerns that models could strategically underperform, or sandbag, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8% → 4.0%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0%. The intent-execution gap reveals a monotonic resistance ordering: Arithmetic < GSM8K < MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.

2605.08827 2026-06-17 cs.AI updated version

Mental Health AI Safety Claims Must Preserve Temporal Evidence

Srimonti Dutta, Ratna Kandala

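AI summary: Argues that safety evaluations of mental health AI often ignore the temporal dimension and proposes the SCOPE-MH principle so that evaluations preserve temporal evidence, exposing mechanisms such as gradual deterioration across a conversation and stressing that temporal evidence is necessary for safe deployment.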

Abstract

The safety of mental health AI is often judged at the wrong temporal scale. Current evaluations typically score isolated responses, endpoint outcomes, or aggregate dialogue quality, while clinically consequential failures may arise from the order and accumulation of interactions themselves, including delayed escalation, repeated reinforcement, dependency formation, failed repair, and gradual deterioration across turns. This paper argues that this mismatch is not merely a limitation of evaluation coverage but a source of invalid safety conclusions. We introduce Temporal Safety Non-Identifiability, a formal account of why safety properties that depend on sequence, timing, accumulation, or recovery cannot be certified by protocols that discard those features. From this formalization, we develop SCOPE (Safety Claims Over Preserved Evidence) as a general principle for aligning safety claims with the evidence an evaluation actually retains, and instantiate it as SCOPE-MH, a mental-health instantiation of this reporting standard. We operationalize SCOPE-MH through a proof-of-concept on the AnnoMI dataset of expert-annotated motivational interviewing conversations, which reveals mechanisms of failure that per-turn behavior scoring does not represent. We propose SCOPE-MH as a diagnostic complement to existing evaluation infrastructure and argue that evaluation preserving temporal evidence is necessary, not optional, for safety-critical mental health AI deployment.

2606.15573 2026-06-17 cs.AI cs.CR updated version

QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks

Yao Du, Jing Liu, Pengfei Xu, Zehua Wang, Victor C. M. Leung, Cyril Leung, Victoria Lemieux

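Affiliations: University of British Columbia; Lazai Network

AI summary: Targets data heterogeneity and resource constraints in decentralized agentic systems with differentially private multi-modal representations and a fair token allocation scheme, improving data privacy and contribution fairness while maintaining quality of service.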

Comments Accepted to IEEE ICME 2026

Abstract

In agentic systems, human-generated data records anchor the value of AI services. Yet cloud compute pipelines centralize processing on remote servers. Data centralization reduces personal data sovereignty and may potentially degrade the quality of service (QoS). Meanwhile, user contributions are diverse in quantity and quality: decentralized records can be biased, noisy, and heterogeneously distributed. To address the data challenge, we study fair token allocation and private data valuation for decentralized and resource-constrained agentic systems. Our approach embeds multi-modal representations in a shared semantic space and releases differentially private (DP) prototypes to preserve utility while reducing semantic leakage. With the DP guarantee, we design a fair token allocation scheme that rewards effective contributions and remains robust to data heterogeneity and AI resource scarcity. Extensive simulations demonstrate improved contribution-based fairness and QoS compared to standard benchmarks. The improved resistance to image reconstruction attacks indicates enhanced privacy for multi-modal personal data.

2503.10945 2026-06-17 cs.LG cs.AI cs.CR stat.ML updated version

Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning

Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Flavio P. Calmon, Jamie Hayes, Borja Balle, Antti Honkela

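AI summary: Addresses incomplete reporting of differential privacy in machine learning by proposing non-asymptotic Gaussian differential privacy (GDP) as the primary reporting format. Using numerical accountants and a decision-theoretic metric, it shows that GDP captures the full privacy profile of DP-SGD and related algorithms with virtually no error.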

Comments IEEE SatML 2026 (position paper track)

Abstract

Current practices for reporting differential privacy (DP) guarantees for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture. For instance, if only a single $(\varepsilon, \delta)$ is known about a mechanism, standard analyses show that there could exist highly accurate inference attacks against training data records, when, upon a more careful analysis, such accurate attacks do not exist for most practical mechanisms. In this position paper, we argue that using non-asymptotic Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides. Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how to provide non-asymptotic bounds on GDP using numerical accountants, and show that GDP can capture the entire privacy profile of DP-SGD and related algorithms with virtually no error, as quantified by the metric. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits their profiles remarkably well in all cases. We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.
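
One reason a single GDP parameter is convenient to report is that it determines the whole privacy profile. The sketch below uses the standard conversion from the Gaussian DP literature: a μ-GDP mechanism satisfies (ε, δ(ε))-DP with δ(ε) = Φ(−ε/μ + μ/2) − e^ε · Φ(−ε/μ − μ/2), where Φ is the standard normal CDF. The function names are hypothetical, and this is not the numerical accountant the paper refers to.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gdp_delta(mu, epsilon):
    """delta(epsilon) implied by a mu-GDP guarantee."""
    return (normal_cdf(-epsilon / mu + mu / 2.0)
            - math.exp(epsilon) * normal_cdf(-epsilon / mu - mu / 2.0))


# A short table of (epsilon, delta) pairs that one reported mu = 1.0 implies.
profile = [(eps, gdp_delta(1.0, eps)) for eps in (0.0, 0.5, 1.0, 2.0, 4.0)]
```

Reporting μ alone lets a reader regenerate any (ε, δ) pair on the curve, whereas a single reported (ε, δ) pair is compatible with many different curves.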

2507.15104 2026-06-17 cs.LG cs.AI updated version

Jacobian Scopes: Token-Level Causal Attribution in LLMs

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls

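Affiliations: Cornell University; Imperial College London; Goodfire AI

AI summary: Proposes Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions, and uses them to reveal political bias, translation strategies, and in-context learning mechanisms.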

Comments 25 pages, 16 figures

Abstract

Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.

2602.14211 2026-06-17 cs.CR cs.AI updated version

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr

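Affiliations: Nanyang Technological University, Singapore; Chongqing University, China; Northeastern University, China; Sun Yat-sen University, China; University of Oxford, UK

AI summary: SkillJect is the first framework to automatically generate effective poisoned skills. It hides the malicious payload and rewrites the instruction channel to raise attack success, exposing a persistent attack vector in reusable skill ecosystems.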

Abstract

Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be repeatedly loaded as trusted guidance and steer downstream tool use. Existing skill-based prompt-injection attacks are often manual and brittle, because explicit malicious instructions are rejected or ignored when they are not aligned with the original workflow. We propose SkillJect, the first automated framework for generating poisoned skills against skill-enabled agent systems. SkillJect uses two coordinated channels. In the artifact channel, it hides the payload inside an auxiliary helper script. In the instruction channel, it rewrites SKILL.md with a front-loaded inducement strategy, placing injected content at the beginning and framing the helper script as a mandatory prerequisite or initialization step. The rewritten instruction explicitly references the helper-script path and provides an executable example command, making the helper appear to be a legitimate setup step before normal skill operations. SkillJect further adopts a closed-loop multi-agent process to improve attack effectiveness. An Attack Agent generates poisoned skills, a Victim Agent executes downstream tasks with the poisoned skill, and an Evaluate Agent inspects execution traces to determine whether the hidden payload was executed. The Attack Agent then uses this feedback to diagnose failure causes and rewrite SKILL.md, while keeping the payload fixed. Experiments across skill-enabled platforms, backend LLMs, and attack categories show that SkillJect substantially outperforms naive direct injection and prior manual skill-injection attacks, highlighting poisoned skills as a persistent threat in reusable skill ecosystems.

2603.25414 2026-06-17 cs.PL cs.AI cs.LG cs.LO updated version

Decidable By Construction: Design-Time Verification for Trustworthy AI

Houston Haynes

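AI summary: Proposes a design-time verification framework that expresses AI model properties as decidable constraints over finitely generated abelian groups, verifying numerical stability, computational correctness, and physical consistency before training at marginal computational cost and removing post hoc verification overhead.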

Comments 21 pages, 1 figure

英文摘要

A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff's universal prior, placing the framework's type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.
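
The claim that such properties reduce to constraints over $\mathbb{Z}^n$ can be made concrete with physical dimensions. The following is a minimal Python sketch, not the paper's type system: it assumes three hypothetical base units (length, mass, time) and represents each dimension as an integer exponent vector, so multiplying quantities adds vectors and solving for a single unknown dimension is vector subtraction. The names `Dim`, `mul`, `inv`, and `solve_unknown` are illustrative.

```python
from typing import Tuple

# A dimension is an exponent vector in Z^3 over the assumed base units
# (length, mass, time). This choice of base units is illustrative only.
Dim = Tuple[int, ...]


def mul(a: Dim, b: Dim) -> Dim:
    """Multiplying two quantities adds their exponent vectors."""
    return tuple(x + y for x, y in zip(a, b))


def inv(a: Dim) -> Dim:
    """Taking a reciprocal negates the exponent vector."""
    return tuple(-x for x in a)


def solve_unknown(known: Dim, target: Dim) -> Dim:
    """Solve known * x = target for x; in Z^n this is plain subtraction."""
    return mul(target, inv(known))


LENGTH: Dim = (1, 0, 0)
MASS: Dim = (0, 1, 0)
TIME: Dim = (0, 0, 1)

velocity = mul(LENGTH, inv(TIME))
acceleration = mul(velocity, inv(TIME))
force = mul(MASS, acceleration)

# Which dimension must x have so that MASS * x is a force?
inferred = solve_unknown(MASS, force)
```

With one unknown, the constraint has exactly one solution, which gives some intuition for the uniqueness of the principal type. Constraints with several unknowns become systems of linear equations over the integers, which this sketch does not attempt to solve.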

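2603.28378 2026-06-17 cs.SD cs.AI Updated version

Membership Inference Attacks against Large Audio Language Models

Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee

AI summary: Presents the first systematic evaluation of membership inference attacks on large audio language models, proposes a blind-baseline protocol to control for distribution shift, and finds that memorization is cross-modal, arising only from binding a speaker's vocal identity with the text.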

Comments Accepted by Interspeech 2026

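AI abstract (translated from Chinese)

We present the first systematic evaluation of membership inference attacks (MIA) on large audio language models (LALMs). Using multimodal blind baselines based on textual, spectral, and prosodic features, we show that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, so MIA may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we find that distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and run modality disentanglement experiments on these datasets. The results show that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at the linked URL.

English abstract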

We present the first systematic Membership Inference Attack (MIA) evaluation of large audio language models (LALMs). Using multi-modal blind baselines based on textual, spectral, and prosodic features, we demonstrate that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, so MIA may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we find that distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at https://github.com/snooow1029/ALM_MIA.
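
The blind-baseline idea can be illustrated with a small Python sketch. It is not the authors' protocol or code: it assumes a single hypothetical model-free feature (clip duration in seconds) and made-up values, and computes the AUC of separating training clips from test clips using that feature alone. A blind AUC near 1.0 would indicate that members and non-members already differ in distribution, so an attack's score on such a split says little about memorization.

```python
from typing import Sequence


def auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5. This pairwise form is quadratic in the number of
    samples, which is fine for a small illustration.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical model-free feature values (clip duration in seconds).
# No model is queried here; that is what makes the baseline "blind".
train_durations = [12.1, 11.8, 12.5, 13.0]
test_durations = [4.2, 5.1, 3.9, 4.8]

blind_auc = auc(train_durations, test_durations)

# A distribution-matched split should leave the blind baseline near chance.
matched_auc = auc([1.0, 2.0], [1.0, 2.0])
```

In practice one would combine several textual, spectral, and prosodic features in a classifier; the single-feature version is only meant to show what the check measures.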

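2604.01904 2026-06-17 cs.CR cs.AI Updated version

Combating Data Laundering in LLM Training

Muxing Li, Zesheng Ye, Sharon Li, Feng Liu

Affiliations: University of Melbourne; University of Wisconsin-Madison

AI summary: Addresses the failure of conventional detection under data laundering (hiding data provenance through stylistic transformation) by proposing SDR, which uses an auxiliary LLM to infer the transformation goal and synthesize queries, substantially strengthening detection of data misuse.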

Comments 29 pages, 2 figures

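AI abstract (translated from Chinese)

Data rights holders can detect unauthorized data use in large language model (LLM) training by querying the model with proprietary samples. Typically, a model performing better on a sample than on untrained data (for example, higher confidence or lower loss) implies that the sample belongs to the training corpus, because LLMs perform better on data they have seen during training. However, this detection becomes fragile under data laundering, a practice that preserves key information but alters the stylistic form of proprietary data to obscure its provenance. When an LLM is trained only on laundered variants, it no longer performs better on the original data, which removes the signal that standard detection relies on. We address this by inferring the unknown laundering transformation from black-box access to the target LLM and using an auxiliary LLM to synthesize queries that mimic the laundered data, even when the rights holder has only the original data. Because the search space for the true laundering transformation is unbounded, we abstract the process into a high-level transformation goal (for example, "lyrical rewriting") and specific details (for example, "using vivid imagery"), and introduce Synthesis Data Reversion (SDR) to instantiate this abstraction. SDR first identifies the most likely synthesis goal to narrow the search, then iteratively refines the details so that synthesized queries progressively elicit stronger detection signals from the target LLM. Evaluations on the MIMIR benchmark across diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon) show that SDR consistently strengthens data misuse detection, offering a practical countermeasure to data laundering.

English abstract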

Post-hoc unauthorized-training data detection for large language models (LLMs) typically assumes a query-with-originals regime: rights holders query a target LLM with raw proprietary data and assess whether the model assigns them stronger memorization-based detection signals, e.g., higher confidence or lower loss, than held-out non-training reference texts. We show that this regime becomes brittle under data laundering, where the target LLM is trained on semantics-preserving but stylistically or structurally transformed surrogates of proprietary data to obfuscate provenance. Since training-time exposure occurs in the laundered form, memorization signals may no longer appear on the originals, collapsing the candidate-reference signal separation that standard detectors rely on. We counter this threat by studying laundering-aware detection with raw proprietary data, a held-out reference corpus, and query access to the target LLM, while the laundering transformation is undisclosed. Since exact recovery of the laundered corpus is infeasible, we infer a detection-useful synthesis process via an auxiliary LLM that maps originals into training-like queries. To make this search tractable, we introduce Synthesis Data Reversion (SDR), which constrains the unbounded space of natural-language transformations through a goal-details abstraction: a high-level transformation goal, e.g., "lyrical rewriting", and fine-grained details, e.g., "with vivid imagery". SDR identifies the most likely goal and iteratively refines details so synthesized queries elicit stronger target-model detection signals. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently restores detection signals, offering a practical auditing layer against data laundering.

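2605.12646 2026-06-17 cs.LG cs.AI cs.HC Updated version

Learning to Decide with AI Assistance under Human-Alignment

Nina Corvelo Benz, Eleni Straitouri, Manuel Gomez-Rodriguez

Affiliations: GitHub

AI summary: Studies how AI can help decision-makers in high-stakes domains by predicting outcomes, and examines how the degree of alignment between the AI's prediction confidence and the decision-maker's own confidence affects the complexity of learning to decide.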

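AI abstract (translated from Chinese)

It is widely held that when AI models assist decision-makers by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence shows that decision-makers often struggle to judge when to trust a prediction based on the communicated confidence alone. Against this background, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI's confidence and the decision-maker's own confidence. Crucially, these findings do not yet clarify how this degree of alignment affects the complexity of learning to make optimal decisions through repeated interaction. In this paper, we consider the canonical case of binary predictions and binary decisions. We first show that the problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $\Omega(\sqrt{|H| \cdot |B| \cdot T})$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values, respectively. We then show that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$, and that when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. These results indicate that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments from two different human-subject studies, in which participants solve simple decision-making tasks with the assistance of AI models, show that our theoretical results remain robust when perfect alignment is violated.

English abstract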

It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers' confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $\Omega(\sqrt{|H| \cdot |B| \cdot T})$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$ and, when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.

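2606.03089 2026-06-17 cs.LG cs.AI Updated version

Constitutional On-Policy Safe Distillation

Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng, Zhuoer Xu, Yuhao Sun, Shiwen Cui, Xiang Zheng, Guoyu Wang, Xingjun Ma, Yu-Gang Jiang

Affiliations: Institute of Trustworthy Embodied AI; Fudan University; Shanghai Innovation Institute; Ant Group; Zhejiang University; City University of Hong Kong

AI summary: Addresses the problem that, in safety alignment, constitutional conditioning in on-policy self-distillation contracts the teacher distribution and reduces expressiveness. Proposes Constitutional On-Policy Safe Distillation (COPSD), which calibrates the teacher distribution with a Cross-SFT cold start and then performs constitution-conditioned on-policy distillation, achieving a better safety-helpfulness trade-off and a lower safety tax on 12 benchmarks.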

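AI abstract (translated from Chinese)

On-policy self-distillation (OPSD) has become an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work shows that OPSD can collapse on verifiable reasoning tasks, but safety alignment is different: it is guided by high-level constitutions rather than explicit target answers, which makes it a natural setting in which to revisit dense distillation. However, our preliminary study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short, overly conservative responses, and reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

English abstract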

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

2606.04990 2026-06-17 cs.CR cs.AI Updated version

From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Manqing Dong, Mingkai Zhang, Xuefei Yin, Yanming Zhu

Affiliations: Griffith University; Jiangsu University; University of Southern Queensland; Peking University; Great Bay University; Nanjing University; Macquarie University; Southern University of Science and Technology

AI summary: Systematically surveys evidence tracing and execution provenance methods for LLM agents, connects retrieval, tool use, memory, and related stages through a unified provenance perspective, proposes a taxonomy, and discusses open challenges.

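AI abstract (translated from Chinese)

Agents based on large language models (LLMs) increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities strengthen agent autonomy, but they also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supports each claim, whether tool calls were justified, how memory influenced later decisions, or where an execution failure originated. Evidence tracing and execution provenance fill this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected during agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can shift from final-answer correctness to process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

English abstract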

Large language model (LLM)-based agents are evolving from passive text generators into autonomous systems capable of planning, tool use, retrieval, memory access, environmental interaction, and multi-agent collaboration. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where failures originated. This survey examines evidence tracing and execution provenance as foundations for process-level accountability in trustworthy LLM agents. We define execution provenance as the typed graph of an agent execution and evidence tracing as its projection onto evidence-support relations. This perspective connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery within a unified framework. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We then review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, observability, and failure diagnosis. Finally, we discuss benchmarks, datasets, metrics, and open challenges for building provenance-aware, auditable, and recoverable agent systems.

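2606.12666 2026-06-17 cs.CR cs.AI Updated version

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation via Multi-Role Agent Simulation

Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

Affiliations: MBZUAI; Yale University

AI summary: Introduces CEO-Bench, a multi-agent benchmark that evaluates LLMs on multi-round strategic resource reallocation in a constraint-rich organizational environment, and finds that models do well on structural validity but show systematic failure modes in strategic calibration.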

Comments 13 pages

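AI abstract (translated from Chinese)

Evaluating the decision-making capabilities of large language models (LLMs) is an increasingly important research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the core challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation, that is, the process of reallocating capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize this advice into a concrete allocation plan that is evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments with five frontier models on 13 scenarios show that all models achieve high structural validity but differ markedly on strategic calibration, the hardest capability layer. We identify systematic failure modes, including single-advisor capture, conservative default under ambiguity, and historical amnesia, and find a structural integration-boldness trade-off: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings outline the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

English abstract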

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation -- the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration -- the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

2606.17546 2026-06-17 cs.AI New submission

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

Affiliations: Department of Automation, Tsinghua University; Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University

AI summary: Introduces SEAGym, an evaluation environment that measures agent harness updates along multiple dimensions (training, validation, test, replay, and cost records), revealing whether an update brings reusable improvement, overfitting, higher cost, or regression of older behavior.

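AI abstract (translated from Chinese)

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around the base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations usually reduce this process to isolated task scores or a single sequential curve, which hides whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with training batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshots and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may later collapse, and source diversity and the model backend may affect harness reliability.

English abstract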

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

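2606.17574 2026-06-17 cs.AI New submission

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen

Affiliations: Xiaopeng Motors

AI summary: Presents DeepInsight, an infrastructure that supports evaluation across the full spectrum of the Physical AI stack on a single runtime, preserving heterogeneity behind three abstractions (task, resource, and result) and enabling diagnosis of cross-layer regressions.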

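AI abstract (translated from Chinese)

Evaluating a Physical AI stack involves operators that span more than three orders of magnitude, from a single foundation-model decoding step to thousands of physics ticks of whole-body control, and that vary orthogonally in modality, reward semantics, and resource profile. No existing framework covers this range, so the stack is currently evaluated by stitching together separate harnesses that share neither a runtime nor scoring, which preserves the local validity of each segment but loses the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenizing the regimes, it preserves their heterogeneity behind three narrow abstractions (task, resource, and result), each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes), and one trace identity scheme under which every event is written. Deployed across all three layers of an embodied humanoid robot stack, this single set of invariants allows new benchmarks to be onboarded largely through configuration. Where mature peer orchestrators exist, at the foundation-model end, it reproduces published reference values and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive payoff is diagnostic: because every layer writes into one shared trace, a regression that starts in one layer and surfaces in another remains localizable on that trace, a cross-layer benefit that no federation of per-segment harnesses can reproduce.

English abstract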

Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.

2606.17696 2026-06-17 cs.AI cs.GR New submission

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

Jizong Zhan

Affiliations: Qt/C++ OpenCASCADE-based CAD system

AI summary: Introduces the FllumaOne dataset, whose CAD models are generated by executable Python programs, aligning modalities such as programs, feature trees, and geometry, and supporting tasks such as editable reverse engineering.

Comments 24 pages, 4 figures

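AI abstract (translated from Chinese)

Parametric computer-aided design records both the final geometry and the ordered construction history that determines how a part can be edited. Datasets for editable CAD research should therefore expose modeling operations, parameters, and feature dependencies together with validated geometry. We introduce FllumaOne, a code-native multimodal CAD dataset whose models are generated by executable Python programs in Flluma, a CAD system based on Qt/C++ and OpenCASCADE. Each sample aligns its program with a structured feature tree, a training-oriented intermediate representation, STEP geometry, a surface point cloud, natural-language descriptions, metadata, and eight canonical visible-edge renderings. The primary release, FllumaOne-100K, contains 100,000 accepted samples across four template-level complexity regimes. Programs are executed and retained only after passing kernel geometry, solid validity, and export checks; the release reports also record modality completeness and split-level duplicate tests. A Qwen2.5-Coder-1.5B LoRA baseline trained on 80,000 samples achieves 99.98% Python syntax validity, 99.97% Flluma build success, and 99.14% STEP-export validity on the held-out 10,000-sample test split. For the 9,909 predictions converted to surface point clouds, the mean normalized Chamfer distance is 0.002124. The dataset supports conditioned CAD reconstruction, executable program synthesis, feature-tree prediction, B-Rep analysis, retrieval, design completion, and editable reverse engineering.

English abstract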

Parametric computer-aided design records both final geometry and the ordered construction history that determines how a part can be edited. Datasets for editable CAD research should therefore expose modeling operations, parameters, and feature dependencies together with validated geometry. We introduce FllumaOne, a code-native multimodal CAD dataset whose models are generated by executable Python programs in Flluma, a Qt/C++ OpenCASCADE-based CAD system. Each sample aligns its program with a structured feature tree, a training-oriented intermediate representation, STEP geometry, a surface point cloud, natural-language descriptions, metadata, and eight canonical visible-edge renderings. The primary release, FllumaOne-100K, contains 100,000 accepted samples across four template-level complexity regimes. Programs are executed and retained only after kernel geometry, solid validity, and export checks; release reports also record modality completeness and split-level duplicate tests. A Qwen2.5-Coder-1.5B LoRA baseline trained on 80,000 samples achieves 99.98% Python syntax validity, 99.97% Flluma build success, and 99.14% STEP-export validity on the held-out 10,000-sample test split. For the 9,909 predictions converted to surface point clouds, the mean normalized Chamfer Distance is 0.002124. The dataset supports conditioned CAD reconstruction, executable program synthesis, feature-tree prediction, B-Rep analysis, retrieval, design completion, and editable reverse engineering.
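
For readers unfamiliar with the Chamfer distance reported above, the following Python sketch shows one common convention: the sum of the two directed mean squared nearest-neighbor distances between point sets. This is an assumption for illustration; the abstract does not state which variant or which normalization the dataset's evaluation uses, so no normalization is applied here. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def _mean_nearest_sq(src: List[Point], dst: List[Point]) -> float:
    """Mean squared distance from each point in src to its nearest point in dst."""
    total = 0.0
    for p in src:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance (squared-distance convention, both directions summed)."""
    return _mean_nearest_sq(a, b) + _mean_nearest_sq(b, a)


# Two tiny hypothetical surface samples.
predicted = [(0.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

cd = chamfer_distance(predicted, reference)
```

The brute-force search is quadratic in the number of points; for real point clouds a spatial index such as a k-d tree would normally replace the inner `min`.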

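2606.17698 2026-06-17 cs.AI cs.CL New submission

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

Zeyao Du, Tong Li, Haibo Zhang

Affiliations: Shopee

AI summary: Introduces the EComAgentBench benchmark of 662 tasks grounded in real Amazon products, which requires agents to uncover hidden intent from a visible query, a tool-gated profile, and scripted clarification within 100 tool calls, verify candidate products, and commit to a final choice, with typed, source-tagged rubrics attributing failures.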

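AI abstract (translated from Chinese)

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: implied in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose the full intent up front and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To fill this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover the hidden intent, verify candidate products against attributes and review evidence, and commit to a single product within 100 tool calls. In addition, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable: every answer is fixed in code before any text is generated, and every sample is validated. Our evaluation of seven models shows that even the strongest model reaches only 57.1% overall accuracy, and that rubric satisfaction declines from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable long-horizon assistance.

English abstract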

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.
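
The source-tagged grading described above can be sketched as a small data structure. This is a hypothetical illustration, not the benchmark's schema: the `RubricItem` fields, the source names (`query`, `profile`, `clarification`), and the example requirements are all assumptions. It shows how tagging each rubric item with its source lets a grader report satisfaction per source and attribute each miss to a specific requirement.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RubricItem:
    requirement: str  # what the shopper needs
    source: str       # where the requirement was stated (assumed labels)
    satisfied: bool   # whether the committed product meets it


def satisfaction_by_source(items: List[RubricItem]) -> Dict[str, float]:
    """Fraction of satisfied rubric items for each source."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for item in items:
        totals[item.source] += 1
        hits[item.source] += int(item.satisfied)
    return {source: hits[source] / totals[source] for source in totals}


def missed_requirements(items: List[RubricItem]) -> List[Tuple[str, str]]:
    """List each failed requirement together with the source it came from."""
    return [(i.requirement, i.source) for i in items if not i.satisfied]


# Made-up grading result for a single task.
graded = [
    RubricItem("price under 50 USD", "query", True),
    RubricItem("size M", "query", True),
    RubricItem("fragrance-free", "profile", False),
    RubricItem("arrives gift-wrapped", "clarification", False),
]

per_source = satisfaction_by_source(graded)
misses = missed_requirements(graded)
```

In this made-up example the agent satisfies everything stated in the visible query but misses the hidden requirements, which is the kind of pattern a per-source breakdown is meant to expose.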

2606.17727 2026-06-17 cs.AI New submission

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

Yi Zhao, Zhen Yang, Mengpan Chen, Mingde Xu, Shanghui Gong, Xijun Liu, Jibing Gong, Jie Tang

Affiliations: Tsinghua University; Yanshan University; University of Waterloo; Beihang University

AI summary: Introduces the LongWebBench benchmark, which evaluates long webpage generation by structural fidelity and functional executability, and finds that generations can be visually similar yet fail multi-step interactions.

Comments 49 pages, 38 figures

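AI abstract (translated from Chinese)

Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, but existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark that evaluates long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It uses two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings show that structural fidelity declines as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results underline the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at the linked URL.

English abstract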

Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at https://github.com/zheny2751-dotcom/LongWebBench.

2606.17904 2026-06-17 cs.AI New submission

First Proof: Second Batch

Mohammed Abouzaid, Nikhil Srivastava, Rachel Ward, Lauren Williams

Affiliations: Stanford University; University of California, Berkeley; University of Texas at Austin; Harvard University; Polish Academy of Sciences; UC Berkeley; Brown University; ETH Zürich; MIT; Weierstrass Institute; Duke University; Sorbonne Université; Boston College; Université du Québec à Montréal; UCLA; University of Michigan; University of Maryland

AI summary: Tests several AI systems on ten mathematical research problems to assess how well current AI can solve research-level mathematics problems.

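AI abstract (translated from Chinese)

To assess the ability of current AI systems to correctly solve research-level mathematics problems, we tested several AI systems on ten problems covering a broad range of mathematical fields; these problems arose naturally in the contributors' research. This document includes the problems, our methodology, and the test results. We provide links to supplementary documents, including the human solutions, the AI-generated solutions, and the referee reports and logs for the AI-generated solutions. The ten problems were contributed by the following mathematicians: (1) Dariusz Kalociński and Theodore A. Slaman, (2) Richard Schwartz, (3) Aleksa Milojevic and Benny Sudakov, (4) Larry Guth, (5) Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti, (6) Joshua Evan Greene and Duncan McCoy, (7) Sucharit Sarkar, (8) Sam Payne and Jidong (Jayden) Wang, (9) Sylvie Corteel and John Lentfer, (10) Srivatsav Kunnawalkam Elayavalli.

English abstract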

To assess the ability of current AI systems to correctly solve research-level mathematics problems, we tested several AI systems on a set of ten problems in a broad range of mathematical fields; these problems arose naturally in the research process of the contributors. This document includes the problems, our methodology, and the results of our testing. We provide links to supplementary documents including the human solutions, the AI-generated solutions, and the referee reports and logs for the AI-generated solutions. The ten problems were contributed by the following mathematicians: (1) Dariusz Kalociński and Theodore A. Slaman, (2) Richard Schwartz, (3) Aleksa Milojevic and Benny Sudakov, (4) Larry Guth, (5) Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti, (6) Joshua Evan Greene and Duncan McCoy, (7) Sucharit Sarkar, (8) Sam Payne and Jidong (Jayden) Wang, (9) Sylvie Corteel and John Lentfer, (10) Srivatsav Kunnawalkam Elayavalli.

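2606.18191 2026-06-17 cs.AI cs.MA New submission

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations in Multimodal Cancer Analysis

Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

Affiliations: The Alan Turing Institute; University of Bristol; University of Manchester; The Institute of Cancer Research; Genentech

AI summary: Systematically evaluates foundation model representations on computational pathology tasks, finding that image and omics representations are complementary, that multimodal fusion helps when no single modality dominates, and, using conformal prediction, that uncertainty-aware inference has clinical value.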

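AI abstract (translated from Chinese)

Foundation models (FMs) have become powerful representation extractors for medical data, but their ability to generalize under distribution shift remains underexplored. This work systematically evaluates FM-based representations on computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from a licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, both drawn from the IH multimodal data. We first benchmark the unimodal probing performance of five FMs on eight downstream classification tasks and find that image and omics representations carry complementary predictive signals. We then investigate whether multimodal fusion yields additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in most cases where the point prediction fails, the true diagnosis can still be recovered within the prediction set, which reinforces the value of uncertainty-aware inference for clinical support.

English abstract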

Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
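
The observation that the true diagnosis often remains inside the prediction set follows from how conformal prediction builds sets. Below is a minimal Python sketch of split conformal prediction for classification. It is an assumption about the general technique, not the authors' pipeline: the nonconformity score is taken as one minus the probability assigned to the true label, and the calibration probabilities and class names are made up.

```python
import math
from typing import Dict, List


def conformal_threshold(cal_true_probs: List[float], alpha: float) -> float:
    """Calibrate a score threshold from held-out examples.

    cal_true_probs holds the model's probability for the TRUE label of each
    calibration example. The score is 1 - p, and the threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every label is included
    return scores[k - 1]


def prediction_set(probs: Dict[str, float], qhat: float) -> List[str]:
    """Return every label whose score 1 - p does not exceed the threshold."""
    return sorted(label for label, p in probs.items() if 1.0 - p <= qhat)


# Made-up calibration probabilities for the true labels (n = 7).
calibration = [0.5, 0.75, 0.875, 0.25, 0.625, 0.375, 0.9375]
qhat = conformal_threshold(calibration, alpha=0.25)

# Made-up test case: the top-1 label is "A", but "B" also enters the set.
test_probs = {"A": 0.5, "B": 0.375, "C": 0.125}
labels = prediction_set(test_probs, qhat)
```

If the true class here were "B", the point prediction "A" would be wrong while the prediction set would still contain the correct answer, which is the situation the abstract describes.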

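2606.17165 2026-06-17 stat.ME cs.AI econ.EM math.ST stat.TH cross-list

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Kellen Gillespie, Robyn Perry

Affiliations: Superhuman, Inc.

AI summary: Studies how routing accuracy degrades as an enterprise assistant's tool library grows, and recovers 10 to 17 percentage points of F1 through embedding-based preselection.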

Comments 10 pages (6 main + 4 appendix), 4 figures, 6 tables

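2606.17541 2026-06-17 cs.LG cs.AI cross-list

Offline Preference-Based Trajectory Evaluation

Fernando Diaz

Affiliations: Carnegie Mellon University

AI summary: Addresses the statistical inefficiency of offline evaluation that relies only on terminal success rate by proposing preference-based trajectory evaluation, which compares temporal preferences over trajectories to reduce ties and improve discriminative power, ranking stability, and data efficiency.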

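2606.17564 2026-06-17 cs.CV cs.AI cross-list

Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

Qiyan Luo, Jie Yang, Yingdong Pi, Lekang Wen, Mi Wang

Affiliations: Hubei Province Key Research and Development Program; LIESMARS Special Research Funding; National Science Fund for Distinguished Young Scholars

AI summary: Because conventional unconstrained 2D global matching is misleading for satellite multi-view reconstruction, proposes a geometry-faithful evaluation protocol based on the Rational Function Model (RFM). An RPC-projected 3D consistency metric and a geometry-constrained dense matching proxy reveal the decoupling of semantic agreement from geometric localization and show that 2D backbones remain competitive under RPC-consistent evaluation.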

Comments The manuscript is accepted as Oral Presentation in IEEE International Geoscience and Remote Sensing Symposium(IGARSS 2026)

Abstract

Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.

2606.17588 2026-06-17 cs.SE cs.AI cross-list

Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

Mika Mäntylä, Patricia Matsubara, Katia Romero Felizardo, Miikka Kuutila, Marco Gerosa, Savio de Sousa Sampaio, Tayana Conte, Igor Steinmacher

Affiliations: University of Helsinki, Finland; UFMS, Brazil; UTFPR – Federal University of Technology - Paraná, Brazil; LUT University, Finland; Northern Arizona University, United States; UFAM, Brazil

AI summary: Qualitatively analyzes why LLMs and humans disagree in title-abstract screening for systematic reviews and proposes recommendations such as validating semantic understanding, using multiple LLMs, and focusing on borderline cases.

Comments 14 pages + references. Accepted for publication in the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2026)

Abstract

Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened independently by human experts and LLMs in zero-shot mode, resulting in Kappa values ranging from 0.52 to 0.77. Qualitative analysis suggests that human-LLM disagreement results from recurring, identifiable causes, such as boundary ambiguity in key terms, keyword overemphasization, and incorrect topic inference. Based on these findings, we propose recommendations such as validating semantic understanding before deployment, running multiple LLMs, and focusing validation efforts on borderline cases. Future studies are needed to validate the impact of our recommendations, and community efforts are needed to develop normative guidelines on LLM usage in SRs.
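
The agreement figures above can be made concrete with a small Python sketch of Cohen's kappa for two raters. The abstract says only "Kappa values", so treating them as Cohen's kappa is an assumption, and the include/exclude decisions in the usage note are invented for illustration.

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items on which both raters decide the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters decided independently at their own rates.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    # Undefined when expected agreement is 1 (both raters use a single label).
    return (observed - expected) / (1 - expected)
```

For example, with hypothetical human decisions `[1, 1, 0, 0, 1, 0, 0, 0, 1, 0]` and LLM decisions `[1, 0, 0, 0, 1, 0, 1, 0, 1, 0]`, raw agreement is 0.8 but kappa is about 0.58, because 0.52 agreement would be expected by chance alone.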

2606.17644 2026-06-17 cs.CV cs.AI cross-list

Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker

Affiliations: Insiders Technologies GmbH; DFKI – German Research Center for Artificial Intelligence; RPTU – University Kaiserslautern-Landau

AI summary: Proposes BBLP, a pseudo-labelling framework in which an object encoder fuses visual, textual, and positional embeddings and label propagation reaches 81.6% of fully supervised performance using only 10% labelled data.

Comments 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria

Abstract

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to produce a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. On the D4LA layout analysis dataset, it achieves an mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
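
The propagation step can be sketched in Python as classic graph-based label propagation over embeddings. This is a generic textbook formulation, not BBLP itself: the abstract does not describe the encoder or the propagation variant, so the Gaussian affinity, the `sigma` and `iters` parameters, and the toy 2D "embeddings" in the usage note are assumptions.

```python
import math


def propagate_labels(embeddings, labels, n_classes, sigma=1.0, iters=50):
    """Spread known labels to unlabelled items; labels[i] is a class or None."""
    n = len(embeddings)

    def affinity(u, v):
        # Gaussian similarity between two embedding vectors.
        d2 = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-d2 / (2 * sigma ** 2))

    w = [[0.0 if i == j else affinity(embeddings[i], embeddings[j])
          for j in range(n)] for i in range(n)]
    # One-hot distributions for labelled items, uniform for unlabelled ones.
    dist = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
            if labels[i] is not None else [1.0 / n_classes] * n_classes
            for i in range(n)]
    for _ in range(iters):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(dist[i])  # clamp manually annotated items
                continue
            total = sum(w[i])
            # Affinity-weighted average of the neighbours' class distributions.
            updated.append([sum(w[i][j] * dist[j][c] for j in range(n)) / total
                            for c in range(n_classes)])
        dist = updated
    return [max(range(n_classes), key=lambda c: d[c]) for d in dist]
```

With embeddings `[(0, 0), (0.1, 0), (5, 5), (5.1, 5)]` and labels `[0, None, 1, None]`, each unlabelled item takes the class of its nearby labelled neighbour. As a side calculation, the reported figures imply a fully supervised mAP of roughly 54.0 / 0.816 ≈ 66.2%.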

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG cross-list

Vision-language models for chest radiography do not always need the image

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich; Lab for AI in Medicine, RWTH Aachen University; Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen

AI summary: Using a causal audit, finds that many medical vision-language models rely on text priors rather than the image on chest radiograph tasks, with text-only models performing close to multimodal ones, and proposes an evaluation framework based on image dependence.

Abstract

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.
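
The idea behind the occlusion interventions can be sketched in Python as follows. This is a simplified illustration, not the paper's audit: the abstract does not define its three behavioral metrics, the same-label swap intervention is omitted, and the `model` callable, the occlusion helpers, and the dictionary "image" in the usage note are hypothetical stand-ins.

```python
def depends_on_image(model, image, question, occlude_relevant, occlude_irrelevant):
    """Check whether an answer tracks the relevant image region.

    `model(image, question)` returns an answer; the two occlusion helpers
    return modified copies of the image (all hypothetical interfaces).
    """
    baseline = model(image, question)
    # Hiding the relevant region should change a grounded answer...
    changed_by_relevant = model(occlude_relevant(image), question) != baseline
    # ...while hiding an irrelevant region should leave it unchanged.
    stable_under_irrelevant = model(occlude_irrelevant(image), question) == baseline
    return changed_by_relevant and stable_under_irrelevant
```

A toy model that always answers "pneumonia" regardless of input fails this check even if that answer happens to be correct, which is the gap between accuracy and grounding that the abstract describes.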

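2606.17799 2026-06-17 cs.SE cs.AI cs.CL cross-list

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox

Affiliations: Tessl

AI summary: Argues that current coding benchmarks have three problems in the agent era: they conflate the model with the system harness, a single reference solution penalises valid alternatives, and the lack of component-level signal makes iteration difficult. Calls for redesigning benchmarks to align with agentic software engineering.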

Abstract

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

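2606.17819 2026-06-17 cs.SE cs.AI cs.CL cross-list

A Framework for Evaluating Agentic Skills at Scale

Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

Affiliations: Tessl, London, United Kingdom

AI summary: Proposes an evaluation framework that builds realistic tasks and scoring rubrics to evaluate 500 real-world skills across 19 agent-model configurations at scale, finding that models differ markedly in how closely they follow skill instructions and that skills significantly change model behavior.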

Abstract

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

2606.17826 2026-06-17 cs.CL cs.AI cross-list

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

Affiliations: AITRICS; University of Copenhagen; KAIST

AI summary: To address multiscript variability affecting ASR in non-English clinical settings, proposes the MultiClin benchmark, whose multiscript-aware evaluation measures recognition quality more fairly, and finds that script unification improves ASR performance.

Comments Interspeech 2026

Abstract

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
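
Two ideas in this abstract lend themselves to a short Python sketch: scoring a hypothesis against several valid orthographic references, and why a 50% mapping ratio maximizes entropy. Both functions are illustrative assumptions rather than MultiClin's actual metric: the benchmark's scoring rule is not given in the abstract, and the entropy here is the plain binary entropy of a two-way script choice, which may differ from the quantity the authors measured.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def multi_reference_cer(hypothesis, references):
    """Character error rate against the closest of several valid references."""
    return min(edit_distance(hypothesis, r) / len(r) for r in references)


def binary_entropy(p):
    """Entropy in bits of a two-way choice made with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

If a term has two accepted spellings, a hypothesis matching either one scores 0 under `multi_reference_cer`, whereas a single-reference metric would count the other spelling as errors. `binary_entropy` peaks at exactly 1 bit when `p` is 0.5 and falls toward 0 as one script dominates, which is consistent with script unification being the least ambiguous training target.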

2606.18129 2026-06-17 cs.HC cs.AI cross-list

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Affiliations: York University; Vector Institute; Rotman Research Institute; Dalhousie University; Centre for Addiction & Mental Health; KITE Research Institute

AI summary: Because evaluations of LLMs in mental-health support lack process-level behavioural measures, proposes the concept of cognitive atrophy and a benchmark; clinical annotation and expert review show that models commonly exhibit moderate-to-high atrophy-aligned behaviour.

Abstract

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

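2606.18135 2026-06-17 cs.SD cs.AI cross-list

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

Sinclair Gurny, Ryan Quinn

Affiliations: Certus Innovations

AI summary: Introduces C3GD, a public gunshot dataset with more than 8,000 field-collected data points from 28 firearms across 16 calibers, for caliber classification, gunshot detection, and related tasks, with rich metadata to support generalization and academic analysis.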

Abstract

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible dataset developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations, with metadata more detailed than what is otherwise currently available. It comprises more than 8,000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.

2606.18158 2026-06-17 cs.CY cs.AI cs.CL cross-list

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

Michèle Finck

Affiliations: Chair of Law and Artificial Intelligence and Director, CZS Institute for Artificial Intelligence and Law, University of Tübingen

AI summary: Notes the current lack of benchmarks for evaluating doctrinal legal reasoning by large language models and argues that this capability is essential to meeting the "appropriate accuracy" requirement of the EU AI Act.

2606.18168 2026-06-17 cs.SE cs.AI cross-list

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim

Affiliations: Dipayan Banik; Kowshik Chowdhury; Shazibul Islam Shamim

AI summary: Studies the presence of oracle signals in agent-authored test code, finding that 80.2% of test patches contain weak or no explicit oracle signals, while strong oracles are significantly and positively associated with merge likelihood (OR = 1.28).

Comments Accepted at the 8th IEEE International Conference on Artificial Intelligence Testing, 2026

Abstract

Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.
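
To make the reported odds ratio easier to read, the following Python sketch converts an odds ratio into a change in probability for a given baseline. The baseline merge probability of 0.5 in the usage note is hypothetical and is not a figure from the paper; only OR = 1.28 comes from the abstract.

```python
import math


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    odds = baseline_prob / (1 - baseline_prob)  # probability -> odds
    shifted = odds * odds_ratio                 # apply the adjusted effect
    return shifted / (1 + shifted)              # odds -> probability


def log_odds_coefficient(odds_ratio):
    """Logistic-regression coefficient that corresponds to an odds ratio."""
    return math.log(odds_ratio)
```

With a hypothetical baseline of 0.5, `apply_odds_ratio(0.5, 1.28)` gives about 0.561, and `log_odds_coefficient(1.28)` is about 0.247. Because the odds ratio is adjusted for covariates, it can exceed 1 even though the raw merge rate for strong-oracle PRs is lower, as the abstract notes.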

2606.18203 2026-06-17 cs.CL cs.AI cross-list

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

Affiliations: Google Research; University of Illinois Chicago

AI summary: Proposes RubricsTree, a framework whose expert-aligned hierarchical taxonomy (over 100 atomic Boolean rubrics) and context-adaptive routing enable scalable, auditable, and evolving open-ended evaluation, yielding relative gains of up to about 66% on HealthBench.

Abstract

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
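
The combination of Boolean rubrics, per-query routing, and weighting can be sketched in Python as below. Everything here is a hypothetical stand-in: the `Rubric` fields, the topic-based `route` function, and the keyword checks in the usage note are assumptions, and in the real framework each rubric would presumably be judged by a model or an expert rather than by string matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rubric:
    name: str                     # short identifier of the criterion
    topic: str                    # taxonomy node the rubric belongs to
    weight: float                 # relative importance within a query
    check: Callable[[str], bool]  # Boolean judgment on a response


def route(rubrics, query_topics):
    """Activate only the rubrics whose topic is relevant to the query."""
    return [r for r in rubrics if r.topic in query_topics]


def score(response, active_rubrics):
    """Weighted fraction of active rubrics that the response satisfies."""
    total = sum(r.weight for r in active_rubrics)
    if total == 0:
        return 0.0
    passed = sum(r.weight for r in active_rubrics if r.check(response))
    return passed / total
```

A query about sleep would activate only sleep and safety rubrics, so a response is not penalized for omitting, say, heart-rate details that the query never asked about. Because each rubric is a separate Boolean, the final score can be traced back to the individual criteria that passed or failed.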

2606.18237 2026-06-17 cs.CL cs.AI cs.LG cross-list

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

Affiliations: School of Computer Science, Carnegie Mellon University; Datadog

AI summary: Proposes the ReproRepo framework, which uses GitHub issues as a supervision signal to evaluate reproducibility on 1,149 papers, and finds that Codex with GPT-5.5 identifies semantically related reproduction problems for about 90% of papers.

Abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

2602.08939 2026-06-17 cs.AI updated version

CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

AI summary: Proposes the CTK benchmark, which uses 5,147 cases to diagnose failure modes of large language models in causal reasoning, with annotations for causal rung, trap type, pressure sensitivity, and refusal quality, exposing flaws hidden by aggregate accuracy.

Comments 12 pages, 17 tables, 4 figures

Abstract

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.
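
A metric like Bad Flip Rate could be computed from paired neutral/pressure results along the following lines in Python. The abstract names the metric but does not define it, so the definition used here (the share of neutrally correct answers that become incorrect under pressure) is an assumption, as are the function name and input format.

```python
def bad_flip_rate(pairs):
    """Share of neutrally correct answers that turn incorrect under pressure.

    `pairs` is a list of (neutral_correct, pressure_correct) booleans, one
    per case; this input format and this definition are assumed.
    """
    correct_when_neutral = [p for p in pairs if p[0]]
    if not correct_when_neutral:
        return 0.0
    flips = sum(1 for _, pressure_ok in correct_when_neutral if not pressure_ok)
    return flips / len(correct_when_neutral)
```

Conditioning on cases answered correctly in the neutral variant separates sycophantic drift from plain inability: a case the model already gets wrong without pressure does not count as a flip.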

2604.06802 2026-06-17 cs.AI updated version

Riemann-Bench: A Benchmark for Moonshot Mathematics

Suhaas Garre, Erik Knutsen, Sushant Mehta, Edwin Chen

AI summary: Proposes Riemann-Bench, a benchmark of expert-designed research-level mathematics problems that evaluates AI reasoning beyond the olympiad level; frontier models score below 10%.

Abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
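
One widely used unbiased estimator for repeated-run evaluation is pass@k, sketched below in Python. The abstract does not name the estimator it uses, so choosing pass@k is an assumption; the values of `n`, `c`, and `k` in the usage note are illustrative, with only the 100 runs per problem taken from the abstract.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k sampled runs is correct).

    n is the number of independent runs, c the number of correct runs,
    and k the number of runs drawn without replacement.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct run
    # One minus the chance that all k drawn runs come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n = 100` runs of which `c = 5` are correct, `pass_at_k(100, 5, 1)` is 0.05, which equals the plain success rate; larger `k` gives a higher value because only one of the `k` runs needs to succeed.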

2606.09004 2026-06-17 cs.AI Updated version

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Ankai Hao, Ke Chen, Huan Li, Lidan Shou

Affiliations: Zhejiang University

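AI summary: Proposes LATTEArena, the first standardized evaluation framework, which combines a six-dimensional taxonomy decomposing 15 methods, a modular arena, and component ablations to reveal 16 key findings, including that Tree-of-Thought combined with MCTS is the most cost-effective.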

Comments 31 pages, 9 figures

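AI-generated abstract (translated from Chinese)

Feature engineering remains crucial for tabular data analysis, and large language models (LLMs) have become a promising paradigm for automating it, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the lack of a standardized platform hinders fair, cost-aware comparison. Moreover, complex method designs obscure the specific contributions of individual components; for example, although LFG integrates Tree-of-Thought, few-shot demonstrations, Monte Carlo Tree Search, and natural language generation, the isolated impact of each technique's competitive merit remains unquantified. To address these challenges, we introduce LATTEArena, the first competitive evaluation framework, featuring: (1) a six-dimensional taxonomy that decomposes 15 representative methods into reusable components; (2) a standardized, modular arena for controlled comparison; (3) multi-dimensional evaluation covering performance, cost, and robustness; and (4) component-level ablation that quantifies the competitive merit of each technique. Through extensive evaluation, we reveal 16 key findings, including: (1) Tree-of-Thought with Monte Carlo Tree Search achieves the best cost-effectiveness; and (2) RPN and code output formats dominate classification and regression tasks, respectively. We publicly release the modular framework and over 4,000 execution logs, enabling researchers to compare new techniques against existing ones seamlessly and advance LATTE.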

English abstract

Feature engineering remains a cornerstone of tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for its automation, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the field lacks standardized, cost-aware evaluation platforms, and the combinatorial explosion of design choices obscures true algorithmic progress. To bridge these gaps, we systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy. Based on this abstraction, we introduce LATTEArena, a standardized, modular, and extensible benchmarking framework that decouples monolithic pipelines into reusable execution blocks. By distilling the massive combinatorial space, we evaluate 24 core LATTE configurations across 7 research questions. Our head-to-head benchmarking goes beyond predictive accuracy to quantify token efficiency and execution robustness, yielding 17 empirical findings on cost-effectiveness trade-offs. Furthermore, we provide 3 concrete recommendations for optimal real-world deployment. By enabling controlled component-level comparisons, LATTEArena shifts the paradigm from ad-hoc prompt engineering to systematic context management. All code, datasets, and over 4,000 execution logs are publicly available to foster a dynamic, community-driven benchmark. Our framework, leaderboard, and all artifacts are hosted on the LATTEArena project website at https://goodenhak.github.io/LATTEArena.

2503.07459 2026-06-17 cs.CL cs.AI Updated version

MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

Yanjun Shao, Xiangru Tang, Jiwoong Sohn, Jiapeng Chen, Yuxuan Liao, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein

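AI summary: Proposes the MedicalAgentsBench benchmark (862 complex clinical questions) to compare internalized reasoning models with externalized agent frameworks on medical reasoning, finding that their gains compound; the best combination is o3-mini + MDAgents (35.1% accuracy).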

Comments https://github.com/gersteinlab/MedicalAgentsBench

English abstract

Complex medical reasoning requires integrating heterogeneous clinical evidence across multiple inference steps. Large language models (LLMs) now approach this through two routes: internalized reasoning and externalized agent scaffolding (frameworks that decompose problems collaboratively amongst multiple LLMs). To determine whether these routes are exclusive or complementary, we introduce MedicalAgentsBench, a filtered benchmark of 862 complex clinical questions drawn from the union of eight medical datasets via difficulty-aware curation and contamination screening. Evaluating three internalized reasoning models (DeepSeek-R1, o1-mini, and o3-mini), seven base models, and nine externalized agent-based methods, we find that internalized and externalized approaches each independently improve performance, and that their benefits compound: the highest accuracy is achieved by layering agent workflows onto an internalized reasoning model (i.e., o3-mini + MDAgents with 35.1%). Pareto analysis shows this combination dominates the cost-performance frontier; moreover, lightweight optimization on inexpensive models offers an entry point for resource-constrained settings. Our benchmark is at https://github.com/gersteinlab/MedicalAgentsBench.
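
The Pareto analysis mentioned in the abstract can be illustrated with a small sketch: a system is on the cost-performance frontier if no other system is at least as cheap and at least as accurate while being strictly better on one of the two. The function name `pareto_frontier` and all numbers below are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# (name, cost, accuracy); lower cost is better, higher accuracy is better.
System = Tuple[str, float, float]


def pareto_frontier(systems: List[System]) -> List[str]:
    """Return the names of systems that no other system dominates."""
    frontier = []
    for name, cost, acc in systems:
        dominated = any(
            other_cost <= cost
            and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Made-up costs and accuracies, for illustration only.
    demo = [
        ("A", 1.0, 20.0),
        ("B", 2.0, 18.0),  # dominated by A: costlier and less accurate
        ("C", 5.0, 35.0),
        ("D", 6.0, 30.0),  # dominated by C: costlier and less accurate
    ]
    print(pareto_frontier(demo))
```

The quadratic scan is adequate for the few dozen model and agent combinations a benchmark of this kind compares.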

2507.18623 2026-06-17 cs.LG cs.AI cs.MA Updated version

Moving Out: Physically-grounded Human-AI Collaboration

Xuhui Kang, Sung-Wook Lee, Haolin Liu, Yuyan Wang, Yen-Ling Kuo

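AI summary: Proposes the Moving Out benchmark, which simulates collaboration scenarios under physical constraints, and develops the BASS method to enhance agent diversity and action understanding; experiments show that it collaborates effectively with both unseen AI agents and humans.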

Comments Accepted at ICML 2026

English abstract

The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. However, most existing collaboration benchmarks are discrete or do not consider physical attributes and constraints. To address this, we introduce Moving Out, a human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and coordinating actions to move an item around a corner. Moving Out consists of two challenges and human-human interaction data to comprehensively evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To give embodied agents the capability to collaborate with humans under physical attributes and constraints, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. We systematically compare BASS and state-of-the-art models in AI-AI and human-AI experiments, showing that BASS can effectively collaborate with both unseen AI and humans. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.

2511.01650 2026-06-17 cs.CL cs.AI cs.LG Updated version

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

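AI summary: Proposes EngTrace, a symbolic benchmark of 1,350 parameterized test cases, which checks intermediate reasoning traces and final answers through a verifiable two-stage evaluation framework (a tiered protocol plus an AI Tribunal), revealing a trade-off between numeric precision and trace fidelity.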

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

English abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.

2512.01241 2026-06-17 cs.CY cs.AI Updated version

First, do NOHARM: towards clinically safe large language models

David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Vartan Pahalyants, Ernest Y. Lee, Allen Shih, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, Thomas A. Buckley, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Anastasia Perez, Austin J. Schoeffler, Mahbuba Tusty, Chase M. Walton, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Arjun K. Manrai, Adam Rodman, Jonathan H. Chen, Ethan Goh

Affiliations: Harvard Combined Dermatology Program; Department of Dermatology, Mass General Brigham; Harvard Medical School; Stanford Center for Biomedical Informatics Research; Stanford University; Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine; Department of Medicine, Cambridge Health Alliance; Beth Israel Deaconess Hospital–Plymouth; Department of Medicine, University of California, San Francisco; Department of Neurology, Stanford University School of Medicine; Department of Medicine, Beth Israel Deaconess Medical Center; Division of Cardiology, Department of Medicine, Cambridge Health Alliance; Department of Cardiovascular Medicine, Summa Health System; Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, University of Wisconsin-Madison; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Massachusetts General Hospital; Center for Immunology and Inflammatory Diseases, Department of Medicine, Massachusetts General Hospital; Broad Institute of MIT and Harvard; Division of Pulmonary, Critical Care, and Sleep Medicine, Cambridge Health Alliance

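AI summary: Proposes the NOHARM benchmark of 1,100 primary care-to-specialist consultation cases to evaluate the safety of medical recommendations from 31 LLMs, finding potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors.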

English abstract

Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI. Safety performance tracked general-intelligence and medical-knowledge benchmarks across the full range of models but decoupled at the frontier. Despite strong performance on existing evaluations, widely used AI models can produce medical advice with the potential for severe harm at non-trivial rates, highlighting the importance of explicit measurement of clinical safety.

2601.19099 2026-06-17 cs.CV cs.AI Updated version

Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

Einari Vaaras, Manu Airaksinen, Okko Räsänen

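AI summary: Addressing the difficulty of annotating biomedical time series, compares three sample selection methods: random sampling, farthest-first traversal, and interactive 2D visualization (2DV). On infant motility assessment and speech emotion recognition tasks, 2DV performs best when labels are aggregated, but label distributions vary widely between individual annotators, and random sampling is the safest choice.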

Comments Accepted for publication in Computers in Biology and Medicine (Elsevier)

DOI: 10.1016/j.compbiomed.2026.111809

English abstract

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
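
Farthest-first traversal (FAFT), one of the three selection methods compared, is a greedy procedure: starting from a seed sample, it repeatedly picks the sample farthest from everything selected so far. The sketch below is a generic version under stated assumptions (Euclidean distance on feature vectors, a fixed seed index); the abstract does not give the distance or seeding used in the paper, and the names `farthest_first_traversal`, `budget`, and `start` are illustrative.

```python
import math
from typing import List, Sequence


def farthest_first_traversal(
    points: Sequence[Sequence[float]], budget: int, start: int = 0
) -> List[int]:
    """Select up to `budget` indices by greedy farthest-first traversal."""
    if not points or budget <= 0:
        return []
    budget = min(budget, len(points))
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < budget:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # every remaining point coincides with a selected one
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


if __name__ == "__main__":
    # Hypothetical 2D features, for illustration only.
    feats = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
    print(farthest_first_traversal(feats, budget=3))
```

Because each pick maximizes the distance to the selected set, the annotation budget is spread over the feature space rather than concentrated in dense regions, which is the usual motivation for using it in sample selection.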

2605.23243 2026-06-17 cs.CR cs.AI Updated version

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

Affiliations: super-intel.ai

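AI summary: Evaluates frontier LLMs on cybersecurity tasks with a dual-mode benchmark (white-box function-level vulnerability detection and black-box web application security testing), finding high false positive rates and low coverage, while domain-specialized models improve performance substantially through structured methodology.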

English abstract

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
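
The two headline detection metrics, precision and false positive rate, are easy to conflate, so a small sketch of how each is computed from a confusion matrix may help. It uses the standard definitions and assumes the paper follows them; the function name and the counts are hypothetical and are not results from the benchmark.

```python
from typing import Tuple


def precision_and_fpr(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (precision, false positive rate) from confusion-matrix counts.

    precision = TP / (TP + FP): share of flagged functions that are truly vulnerable.
    FPR       = FP / (FP + TN): share of safe functions that were wrongly flagged.
    `fn` does not enter either metric; it is accepted so that callers can pass
    the full confusion matrix.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, fpr


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(precision_and_fpr(tp=90, fp=10, tn=90, fn=10))
```

A model that over-predicts vulnerabilities raises FP, which lowers precision and raises the false positive rate at the same time; the two metrics differ only in their denominators.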

2606.14295 2026-06-17 cs.CR cs.AI cs.LG Updated version

AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Fengyu Liu, Jiarun Dai, Yihe Fan, Wuyuao Mai, Ziao Li, Bofei Chen, Jie Zhang, Zheng Lou, Bocheng Xiang, Qiyi Zhang, Xudong Pan, Geng Hong, Yuan Zhang, Min Yang

Affiliations: Fudan University

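AI summary: Proposes AgentCyberRange, the first open-source multi-range infrastructure, integrating 110 vulnerabilities and 156 internal hosts to evaluate frontier AI systems on realistic cyber attacks; GPT-5.5 with Codex performs best on web exploitation and post-exploitation tasks.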

English abstract

Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it difficult to observe emerging risks early, because frontier AI systems are rarely evaluated under realistic attack conditions. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post-exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%. We also observe out-of-benchmark findings, including unknown vulnerabilities in popular projects and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.

2606.15735 2026-06-17 cs.CL cs.AI Updated version

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Affiliations: KAIST; Seoul National University; Seoul National University Bundang Hospital; SAIHST, Sungkyunkwan University; Yonsei University College of Medicine; Gangnam Severance Hospital; Severance Hospital; Seoul Medical Center; Seoul National University Hospital; National Cancer Center; Icahn School of Medicine at Mount Sinai; Samsung Medical Center

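AI summary: Proposes the EHRNote-ChatQA benchmark, built from MIMIC-IV discharge summaries, with 967 multi-turn samples and 16,072 expert-verified QA pairs, to evaluate evidence-grounded multi-turn clinical question answering by LLMs; models struggle with evidence grounding and with errors that accumulate across turns.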

English abstract

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

2606.16072 2026-06-17 cs.CR cs.AI Updated version

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu

Affiliations: NSW Department of Education; South Australian Department for Education; OC Selective exam preparation platform; Studitory: HSC preparation platform

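AI summary: Proposes a curriculum-grounded, configurable LLM-as-Judge pipeline for high-stakes exam marking; by incorporating authorised curriculum artefacts and marking guidelines, it improves marking consistency, transparency, and alignment with official practice.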

English abstract

Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides.

2606.17577 2026-06-17 cs.AI New submission

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

Affiliations: Honda Motor Co., Ltd.

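AI summary: Proposes the first foundation-model-orchestrated workflow for crash safety design, integrating a surrogate model, multiobjective evolutionary search, a geometry generator, and a natural-language interface, and reducing pedestrian protection evaluation time from hours to seconds.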

Journal ref: ICLR 2026 Workshop The 2nd Workshop on Foundation Models for Science

English abstract

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation-model-orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average $R^2=0.87$ and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision-language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.
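
The abstract mentions distribution-free conformal prediction intervals without naming the variant. A minimal sketch of one standard choice, split conformal prediction with absolute residuals, is given below as an assumption: the interval half-width is the finite-sample-corrected quantile of held-out calibration residuals. The function names and all numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(residuals: Sequence[float], alpha: float) -> float:
    """Half-width giving roughly (1 - alpha) coverage under exchangeability.

    `residuals` are (y_true - y_pred) on a calibration set that the
    surrogate was not trained on.
    """
    n = len(residuals)
    if n == 0:
        raise ValueError("need at least one calibration residual")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    scores = sorted(abs(r) for r in residuals)
    # Finite-sample correction: use rank ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def prediction_interval(y_pred: float, half_width: float) -> Tuple[float, float]:
    """Symmetric interval around a surrogate prediction."""
    return (y_pred - half_width, y_pred + half_width)


if __name__ == "__main__":
    # Made-up calibration residuals, for illustration only.
    cal = [0.1, -0.2, 0.3, -0.4, 0.5, 0.6, -0.7, 0.8, -0.9]
    q = conformal_quantile(cal, alpha=0.2)
    print(q, prediction_interval(2.0, q))
```

The guarantee does not depend on the surrogate being accurate or on any distributional form for the errors, only on calibration and test points being exchangeable, which is why this kind of interval pairs naturally with a black-box surrogate.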

2606.18147 2026-06-17 cs.AI New submission

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

Affiliations: University of Cambridge; Tsinghua University; University College London; Dartmouth College; Google Research

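AI summary: Proposes the WEQA framework, in which an LLM controller dynamically combines sensor analysis with pretrained models to answer questions about wearable health data; accuracy improves by 24% on the benchmark, and expert evaluation shows substantial gains in usefulness and clinical soundness.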

English abstract

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

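2606.18154 2026-06-17 cs.AI New submission

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

Affiliations * Rochester Institute of Technology

AI summary: Proposes LEADS, a framework in which an LLM agent iteratively discovers hybrid physics-neural models within a structured action space to build personalized cardiac electrophysiology digital twins, outperforming human-designed models and other LLM-based methods.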

Comments 10 pages, 4 figures

Abstract

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent work has applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and utilizes an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, whilst gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling approaches.

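2606.17059 2026-06-17 cs.DC cs.AI Cross-list

Towards Distributed Inference of LLMs on a P2P Network

Shabari S Nair, Krishanu Saini

Affiliations * The University of Texas at Austin; Department of Computer Science, The University of Texas at Austin

AI summary: Proposes a decentralized, prefix-cache-aware routing scheme for LLM inference on a P2P network. Each peer keeps a local radix tree and refreshes cache information through asynchronous anti-entropy updates, which avoids centralized coordination and KV-cache transfer and improves performance under low latency and skewed prefix distributions.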
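The routing idea in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `PrefixIndex`, its methods, and the tie-breaking rule are hypothetical, and an uncompressed token-level trie stands in for a real radix tree. Each peer records which peers are believed to hold a cached prefix, routes a request to the peer with the longest match, and merges a neighbor's index as a crude stand-in for an anti-entropy exchange.

```python
class PrefixIndex:
    """Token-level trie mapping cached prefixes to peer IDs.

    Hypothetical sketch: a real radix tree would compress single-child chains.
    """

    def __init__(self):
        self.root = {"children": {}, "peers": set()}

    def insert(self, tokens, peer):
        """Record that `peer` has cached every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node["children"].setdefault(tok, {"children": {}, "peers": set()})
            node["peers"].add(peer)

    def best_peer(self, tokens):
        """Return (peer, matched_length) for the longest known cached prefix."""
        node, best, depth = self.root, (None, 0), 0
        for tok in tokens:
            node = node["children"].get(tok)
            if node is None or not node["peers"]:
                break
            depth += 1
            # Assumed tie-break: smallest peer ID, to keep routing deterministic.
            best = (min(node["peers"]), depth)
        return best

    def merge(self, other):
        """Union another peer's index into this one (anti-entropy stand-in)."""
        def _merge(dst, src):
            dst["peers"] |= src["peers"]
            for tok, child in src["children"].items():
                target = dst["children"].setdefault(tok, {"children": {}, "peers": set()})
                _merge(target, child)
        _merge(self.root, other.root)
```

Because routing only consults the local index, a stale entry costs a cache miss rather than a wrong answer, which is why eventual consistency through periodic merges is a plausible fit here.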

2606.17065 2026-06-17 q-fin.CP cs.AI cs.LG Cross-list

PIVOT: Bridging Black-Scholes Implied-Volatility and Price Objectives via Differentiable Jäckel Operator

Raeid Saqur, Yannick Limmer, Anastasis Kratsios, Blanka Horvath, Hans Buehler

Affiliations * Mathematical Institute, University of Oxford; McMaster University; Vector Institute for AI; DRW

AI summary: Proposes the PIVOT layer, which preserves the forward-pass accuracy of Jäckel's solver through implicit differentiation and uses a gating mechanism to handle the singularity in the low-vega regime, enabling efficient differentiable conversion between price space and implied-volatility space.

Comments 30 pages, 17 figures, 12 tables

Abstract

Modern option-learning systems operate in two coordinates: price space, where markets quote and no-arbitrage constraints are most naturally enforced, and implied volatility (IV) space, where volatility surfaces are smoothed, regularized, and evaluated. The bottleneck is interface, not approximation: Jäckel's seminal "Let's Be Rational" (LBR) solver already inverts the Black-Scholes price to machine precision efficiently. What is missing is a differentiable layer that preserves LBR in the forward pass and avoids backpropagating through its branch logic. Such a layer must also confront the unavoidable singularity of the inverse map in the low-vega regime, where the sensitivity 1/vega diverges as vega -> 0. We close this gap with PIVOT, the Price-Implied-Volatility Objective Translator. PIVOT keeps the LBR forward pass intact and supplies the backward pass by implicit differentiation through the smooth Black-Scholes/Black-76 price map, with an explicit gating contract: invalid domains return NaN, well-conditioned rows receive the exact 1/vega gradient, and low-vega rows are attenuated rather than silently regularized. On a single H100, a fused Triton kernel reaches 1.79e9 IV/s at machine precision (9.3e-14 max relative error vs. the reference C solver); end-to-end label generation sustains 48.9M/s on synthetic chains and 16.6M/s on SPX OptionMetrics. In a HyperIV-style one-day reproduction on SPX, PIVOT-augmented objectives Pareto-dominate the baselines, reducing held-out price MAE by up to 43.4% and the strongest three-seed gated objective improving price MAE by 38.8% and IV MAE by 21.3% jointly; cross-asset results on RUT, VIX, and NDX show directional price-MAE gains of 40.1%, 24.2%, and 16.7%, while an ungated IV-roundtrip control collapses to a degenerate near-zero surface, confirming the gate as a correctness contract rather than a tuning knob.
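The gating contract described above can be sketched in plain Python. This is an illustration under stated assumptions, not the PIVOT kernel: a bisection solver stands in for Jäckel's LBR, prices are undiscounted Black-76 calls, and the `vega_floor` threshold and the attenuation formula `vega / vega_floor**2` are hypothetical choices. The backward rule follows from implicit differentiation: since price = C(sigma), d(sigma)/d(price) = 1 / vega.

```python
import math


def norm_cdf(x):
    # erfc keeps relative accuracy in the far tails
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def black76_call(F, K, T, sigma):
    """Undiscounted Black-76 call price."""
    s = sigma * math.sqrt(T)
    d1 = (math.log(F / K) + 0.5 * s * s) / s
    return F * norm_cdf(d1) - K * norm_cdf(d1 - s)


def black76_vega(F, K, T, sigma):
    """Derivative of the call price with respect to sigma."""
    s = sigma * math.sqrt(T)
    d1 = (math.log(F / K) + 0.5 * s * s) / s
    return F * norm_pdf(d1) * math.sqrt(T)


def implied_vol_bisect(price, F, K, T, lo=1e-6, hi=5.0, iters=200):
    """Stand-in forward solver (plain bisection), NOT the LBR algorithm."""
    if not (max(F - K, 0.0) < price < black76_call(F, K, T, hi)):
        return math.nan  # invalid domain returns NaN
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if black76_call(F, K, T, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def iv_and_gated_grad(price, F, K, T, vega_floor=1e-4):
    """Return (sigma, d sigma / d price) with a three-way gate."""
    sigma = implied_vol_bisect(price, F, K, T)
    if math.isnan(sigma):
        return math.nan, math.nan
    vega = black76_vega(F, K, T, sigma)
    if vega >= vega_floor:
        return sigma, 1.0 / vega  # exact implicit-function gradient
    # Hypothetical attenuation: continuous at the floor, tends to 0 as vega -> 0
    return sigma, vega / (vega_floor * vega_floor)
```

The backward rule never touches the solver's internals, so the forward solver can be swapped without changing the gradient logic.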

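2606.17070 2026-06-17 physics.ao-ph cs.AI cs.LG Cross-list

Timestamp-Aware Spatio-Temporal Graph Contrastive Learning for Network Intrusion Detection

Jianli Dai, Guangwei Wu, Jiacheng Li, Weiping Wang, An He, Xinjun Xiao

Affiliations * Central South University of Forestry and Technology, School of Computer Science and Mathematics; Central South University, School of Computer Science and Engineering

AI summary: Proposes a self-supervised graph neural network framework that builds temporal graphs from timestamps, encodes spatio-temporal dependencies with E-GraphSAGE and an LSTM, and applies multi-view graph contrastive learning (temporal, spatial, and feature contrasts) with an adaptive weighting strategy, reaching performance comparable to supervised methods on four datasets.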

Abstract

Given their effectiveness in modeling the relational structure among network traffic flows, graph neural networks (GNNs) have been widely adopted in network intrusion detection systems (NIDSs). However, most existing GNN-based NIDS approaches focus on the relational structure of traffic flows, and treat them as temporally independent, which limits their ability to cope with evolving attack behaviors. Moreover, their reliance on supervised or semi-supervised learning often restricts generalization to unseen attacks. To address these limitations, we propose a novel self-supervised GNN-based framework. To the best of our knowledge, the proposed model is among the first self-supervised GNN-based NIDS models to explicitly leverage real timestamps, which provides faithful temporal dependencies for representation learning. We first construct a series of temporal graphs from network traffic flows according to their timestamps, and then employ an E-GraphSAGE and LSTM based encoder to fully extract temporal information and spatial dependencies of network traffic, without introducing time-costly attention mechanisms. A multi-view graph contrastive learning (GCL) scheme is introduced, where temporal, spatial, and feature contrasts are jointly performed to capture temporal continuity, preserve structural consistency, and improve the generalization and robustness of the learned representations, respectively. In addition, a gradient-norm-based adaptive weighting strategy is designed to optimize the contrastive loss weights. Experimental results on four representative NIDS datasets with real timestamps demonstrate that our method significantly outperforms existing self-supervised approaches and achieves performance comparable to the supervised state-of-the-art GNN method, while maintaining high computational efficiency.
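The first step in the abstract above, building a series of temporal graphs from flows according to their timestamps, might look like the sketch below. The fixed-width windowing, the `(timestamp, src, dst)` flow format, and the function name are assumptions made for illustration. The abstract does not say how the snapshots are delimited.

```python
from collections import defaultdict


def build_temporal_graphs(flows, window):
    """Group flows into time-ordered graph snapshots.

    flows  : iterable of (timestamp, src, dst) tuples (hypothetical format)
    window : snapshot width, in the same unit as the timestamps
    Returns a list of (window_start, edge_list) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for ts, src, dst in flows:
        # Integer bucket index: flows in the same window share a snapshot
        buckets[int(ts // window)].append((src, dst))
    return [(k * window, buckets[k]) for k in sorted(buckets)]
```

Each snapshot's edge list could then feed a per-snapshot graph encoder, with a sequence model consuming the ordered snapshot embeddings.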

2606.17119 2026-06-17 cs.CR cs.AI Cross-list

Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict

Sozan Sulaiman Maghdid, Tarik Ahmed Rashid, Shavan Askar

Affiliations * Department of Information Technology, Khabat Technical Institute; Erbil Polytechnic University; Computer Science and Engineering Department; AIIC, University of Kurdistan Hewler; Department of Information Systems Engineering, Technical College of Computer and Informatics Engineering

AI summary: Studies the use of graph neural networks (GNNs) to strengthen network intrusion detection and drone response, with a case study supporting their effectiveness for high detection rates, fast response, and situational awareness.

Abstract

Cyber-physical systems have introduced new threats and challenges for detection and immediate response. This study examines how Graph Neural Networks (GNNs) can aid cybersecurity and drone management in a cyber-physical system comprising cyber intrusions and unmanned aerial vehicles (UAVs). Building on the structural understanding that graph neural networks provide, this work presents an integrated procedure that allows intrusion detection systems to learn underlying network structures, identify malicious activity, and facilitate drone response measures. In an emulation-based case study, cyberattack models were created to provoke drone responses, showing that graph-based learning can assist with situational awareness, swarm coordination, and adaptive maneuvering. According to the performance evaluation, the method has a detection rate of 94.2, an average area under the receiver operating characteristic (ROC) curve of 0.955, and an average response time of 1.4 seconds. Comparative experiments show that the proposed GraphSAGE network is more effective than Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) in the same setting. These findings indicate that graph neural networks can support intrusion prevention and response in dynamic cyber-physical systems.

2606.17127 2026-06-17 q-bio.QM cs.AI cs.LG Cross-list

Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3

Jay Jung, Xiaohan Zhang, Shenghan Song, Mahmoud Sayedahmed, Chijian Xiang, Yunong Xu, Ahmed AbdelKhalek, Severin T. Schneebeli, Matthew J. Wargo, Jianing Li, Safwan Wshah

Affiliations * University of Vermont; Larner College of Medicine, University of Vermont; Purdue University; Department of Comparative Pathobiology; Department of Horticulture and Landscape Architecture; Department of Industrial and Molecular Pharmaceutics

AI summary: Proposes AMPGAN v3, a multi-objective conditional GAN that extends the generative vocabulary to D-amino acids and terminal modifications and uses two discriminators to improve stability. In vitro validation shows activity against Gram-positive bacteria, and the PepCraft multi-agent framework is introduced for end-to-end discovery.

Comments Presented at the GenBio Workshop, ICML 2026

Abstract

Antimicrobial resistance causes over a million deaths annually. Antimicrobial peptides (AMPs) are a promising solution, but generative AMP models are not yet ready to design peptides with non-natural amino acids and/or chemical modifications, which are essential for real-world peptide drugs. We present AMPGAN v3, a multi-objective conditional GAN that expands the generative vocabulary to D-amino acids and N/C-terminus modifications such as amidation. By separating adversarial and activity-aware supervision across two specialized discriminators, AMPGAN v3 substantially improves training stability and outperforms prior generative AMP models on external classifiers. We validated five candidates spanning three structural classes in vitro; two showed activity against Gram-positive strains, with the best candidate reaching MIC 8 μg/mL against B. subtilis. To support downstream curation, we further present PepCraft, a multi-agent framework for end-to-end AMP discovery in which a Planning Agent orchestrates specialized executors for generation, filtering, and verification. Its prioritization recommendations align with our in vitro outcomes. Together, these contributions let us examine, on a small but real scale, how generative and agentic AI compose in therapeutic peptide discovery. Code: https://github.com/marszzibros/AMPGANv3

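2606.17197 2026-06-17 cs.SE cs.AI Cross-list

Cluster-Aware Dual-Level Test Specification Generation for Large-Scale Automotive Software Requirements

Hazem Ayman, Menna Sedik, Kareem Mostafa, Mahmoud Soliman, Samer Saber, Ibrahim Habib

Affiliations * CairoMotive, Cairo, Egypt

AI summary: Proposes a "Cluster-then-Summarize" pipeline that combines embedding, dimensionality reduction, clustering, multi-level summarization, and dual-level test generation to address lost dependencies and context-window limits when LLMs process requirements at scale, improving integration test coverage while scaling efficiently.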

Abstract

Generating test specifications that satisfy Automotive SPICE SWE.6 requirements becomes increasingly challenging and time-consuming as projects scale to thousands of requirements. Because this manual process often consumes weeks of engineering effort, automation becomes a critical necessity. However, standard Large Language Model (LLM) approaches struggle at scale: processing requirements individually discards vital inter-requirement dependencies, while feeding entire corpora at once exceeds context-window limits, leading to incomplete integration coverage and redundant test cases. This paper presents a novel "Cluster-then-Summarize" pipeline that addresses these limitations through three stages. Requirements are embedded using sentence transformers and grouped using UMAP dimensionality reduction followed by HDBSCAN density-based clustering. This grouping utilizes an automatic minimum cluster size selection driven by a quality criterion combining normalized Silhouette and Calinski-Harabasz scores. A multi-level map-reduce summarization algorithm then distills each cluster into concise, domain-conformant descriptions while preserving quantitative thresholds and safety integrity levels. The pipeline exploits the derived cluster topology to generate test specifications at two levels: individual requirement verification and cluster-level integration tests that verify cross-requirement feature behavior. A nearby-cluster context mechanism provides bounded cross-feature awareness during each LLM call, and Retrieval-Augmented Generation grounds all outputs in ISO 26262 and ASPICE standards. Evaluation on automotive requirement datasets of varying scale demonstrates that the cluster-aware approach improves integration test coverage and maintains summarization fidelity compared to baseline methods while scaling efficiently to thousands of requirements.
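The automatic minimum-cluster-size selection mentioned above can be sketched as follows. The abstract does not give the exact criterion, so everything here is an assumption: Silhouette is rescaled from [-1, 1] to [0, 1], Calinski-Harabasz is min-max normalized across the candidates, the two are averaged with equal weight, and the scores are taken as precomputed inputs rather than derived from a real clustering run.

```python
def select_min_cluster_size(candidates):
    """Pick the candidate size with the best combined quality.

    candidates: dict {min_cluster_size: (silhouette, calinski_harabasz)}
    The equal-weight combination below is a hypothetical choice.
    """
    ch_values = [ch for _, ch in candidates.values()]
    ch_lo, ch_hi = min(ch_values), max(ch_values)
    span = (ch_hi - ch_lo) or 1.0  # avoid division by zero if all CH scores tie

    def quality(scores):
        sil, ch = scores
        sil_norm = (sil + 1.0) / 2.0      # Silhouette lies in [-1, 1]
        ch_norm = (ch - ch_lo) / span     # CH is unbounded, so min-max it
        return 0.5 * sil_norm + 0.5 * ch_norm

    return max(candidates, key=lambda size: quality(candidates[size]))
```

Normalizing before combining matters because raw Calinski-Harabasz values are typically orders of magnitude larger than Silhouette values and would otherwise dominate the sum.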

2606.17235 2026-06-17 cond-mat.mtrl-sci cs.AI Cross-list

Physics-Informed Attention Mechanism and Generalization Capability of Deep Learning-Based Grain Growth Evolution Prediction

Pungponhavoan Tep, Marc Bernacki

Affiliations * Mines Paris, PSL University, Centre for Material Forming (CEMEF), UMR CNRS 06904

AI summary: Evaluates the out-of-distribution generalization of deep learning models for grain growth prediction and proposes a boundary-masked attention mechanism that substantially improves prediction accuracy in scenarios such as bimodal grain size distributions.

Abstract

Machine Learning (ML) models for grain growth prediction are typically trained on idealized synthetic data, yet practical applications require generalization to conditions outside the training distribution. This study evaluated the Out-Of-Distribution (OOD) generalization capability of the trained model from our previous study across three test cases, including experimental microstructures, microstructures characterized by a bimodal grain size distribution, and abnormal grain growth. To further probe whether physics-informed architectural design could improve robustness under these different conditions, a boundary-masked attention mechanism was proposed specifically for grain growth, constraining attention to grain boundary pixels. Both the baseline and the proposed physics-informed attention model were evaluated without retraining or fine-tuning on the OOD data. Both models successfully generalized to all three test cases, yet the boundary-masked attention mechanism provided substantial improvements, with the most notable gains for microstructures characterized by a bimodal grain size distribution, where Structural Similarity Index Measure (SSIM) improved from 0.6221 to 0.7609 and mean grain size ($\overline{R}$) error decreased from 8.75% to 3.57%. The attention heatmap analysis revealed that the boundary-masked attention model learned to concentrate attention on large grain boundaries in a manner consistent with curvature-driven grain growth physics, emerging from training without being explicitly encoded into the architecture. These results indicate that models trained on synthetic data can generalize to diverse OOD conditions without retraining, and that physics-informed attention may improve accuracy when the boundary morphology matches the training domain.
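One way to read "constraining attention to grain boundary pixels" is sketched below. It is an assumption-laden illustration, not the paper's architecture: a pixel counts as a boundary pixel if any 4-connected neighbor carries a different grain ID, and attention weights come from a softmax in which non-boundary positions receive exactly zero weight. The function names and the grain-ID input format are hypothetical.

```python
import math


def boundary_mask(grain_ids):
    """True where a pixel has a 4-connected neighbor with a different grain ID."""
    h, w = len(grain_ids), len(grain_ids[0])
    mask = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and grain_ids[ni][nj] != grain_ids[i][j]:
                    mask[i][j] = True
    return mask


def masked_softmax(scores, mask):
    """Softmax over a flat score list; masked-out positions get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        return [0.0] * len(scores)
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Masking before normalization means the weights over boundary pixels still sum to one, so restricting attention does not shrink the overall attention mass.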

2606.17345 2026-06-17 cs.LG cs.AI Cross-list

Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics

Ryota Takamido, Hiroki Nakamoto

Affiliations * Sports Innovation Organization, National Institute of Fitness and Sports in Kanoya

AI summary: Uses a Transformer model and counterfactual analysis to optimize the final and setup pitches in MLB pitch sequences, finding potentially substantial gains in season-level performance (for example, K/9 improvements of more than 1.0) and offering practical insights such as effective locations by velocity band.

Abstract

Although pitch sequencing is a central topic in baseball analytics, previous studies have primarily focused on optimizing the final pitch within a single plate appearance, leaving the role of preceding setup pitches and their impact on long-term season-level performance insufficiently examined. To address these issues, this study conducted counterfactual analyses using MLB Statcast data. A Transformer-based machine-learning model was trained to predict whether a target pitch would result in an in-play outcome or swing-out. Counterfactual pitch sequences were then generated by replacing either the final pitch or the preceding setup pitch with alternative pitch types and locations while keeping the surrounding contextual information fixed. Optimal counterfactual selections were defined as those that minimized the predicted in-play probability, and their expected effects on pitchers' seasonal statistics were estimated using regression models linking model outputs to season statistics. The results suggest that the optimization of both final and setup pitches may substantially influence season-level performance, including improvements of more than 1.0 in K/9. The analyses also provided several practical insights, including velocity-band-specific effective locations, the importance of pitch commands, and the expansion of pitch-selection options through middle-velocity pitches. These findings quantitatively support the strategic importance of pitch sequencing in baseball.
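The selection rule stated above, choosing the counterfactual pitch that minimizes the predicted in-play probability, reduces to an argmin over alternatives. The sketch below assumes a hypothetical `predict_in_play` callable standing in for the trained Transformer, and it includes the standard K/9 definition (strikeouts per nine innings pitched) for reference. The regression that links model outputs to season statistics is not reproduced.

```python
def best_counterfactual(options, predict_in_play):
    """Return the option with the lowest predicted in-play probability.

    options         : iterable of candidate (pitch_type, location) pairs
    predict_in_play : callable mapping an option to a probability
                      (placeholder for the trained model)
    """
    return min(options, key=predict_in_play)


def k_per_9(strikeouts, innings_pitched):
    """Strikeouts per nine innings: 9 * K / IP."""
    return 9.0 * strikeouts / innings_pitched
```

In the study the surrounding context is held fixed while the option varies, which here corresponds to `predict_in_play` closing over everything except the replaced pitch.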

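2606.17379 2026-06-17 cs.CV cs.AI eess.IV Cross-list

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

Affiliations * Rochester Institute of Technology; Vanderbilt University

AI summary: Proposes a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences, learns the residual deformation with a graph neural diffusion function, and uses meta-learning to adapt quickly from intraoperative samples, outperforming existing methods on a liver phantom.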

Abstract

Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation yet sparse intraoperative measurements. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as "context" samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

2606.17398 2026-06-17 cs.CR cs.AI cs.SE Cross-list

SoK: AI-Augmented Binary Reversing

Yujeong Kwon, Yiyue Zhang, Shakhzod Yuldoshkhujaev, Kexin Pei, Dokyung Song, Hyungjoon Koo

Affiliations * Sungkyunkwan University; The University of Chicago; Yonsei University

AI summary: Systematizes the field of AI-augmented binary reversing, proposing a unified taxonomy that covers conventional and AI-based approaches, clarifying the emerging roles of LLMs and agentic AI, and identifying technical challenges and evaluation gaps.

Comments 20 pages, 7 tables, 3 figures

Abstract

Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language models (LLMs), and agentic AI systems have accelerated the adoption of AI-augmented binary reversing. Yet, the resulting body of work has become increasingly fragmented across reversing domains, artifact representations, learning approaches, and evaluation practices. This paper presents the first comprehensive systematization of knowledge on AI-augmented binary reversing. We analyze 144 research papers published since 2015, and organize them into 22 binary reversing domains according to the inference tasks. We further introduce a unified taxonomy spanning conventional and AI-augmented reversing pipelines. Our taxonomy connects traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and downstream inference tasks, while clarifying the emerging roles of LLMs and agentic AI systems. By establishing a common vocabulary and structured framework, we provide a holistic view of the field's evolution over the past decade. Our study reveals common structures underlying seemingly disparate approaches, highlights persistent technical challenges and evaluation gaps, and identifies promising opportunities for future research. Collectively, these insights clarify the current state of the field and provide a foundation for the next generation of reliable and scalable AI-augmented binary reversing systems.

2606.17403 2026-06-17 cs.CV cs.AI Cross-list

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

Shikha V. Chandel, Yadav Raj Ghimire, Timothy Agboada, Leila Hashemi-Beni

Affiliations * College of Science and Technology; Computational Data Science and Engineering

AI summary: Compares spatial-domain, frequency-domain, and dual-domain deep learning approaches for building damage classification, finding that dual-domain models outperform single-domain ones but that all models still struggle to detect minor damage.

Abstract

Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
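The abstract reports that one configuration wins on accuracy while another wins on macro F1. The sketch below shows, with made-up labels rather than xBD data, how that can happen under class imbalance: accuracy rewards getting the majority class right, while macro F1 averages per-class F1 with equal weight. The helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy imbalanced labels: 8 samples of class 0, 2 samples of class 1
y_true = [0] * 8 + [1] * 2
pred_a = [0] * 10                       # always predicts the majority class
pred_b = [0] * 5 + [1] * 3 + [1, 1]     # catches the minority class, at a cost
```

On these toy labels, `pred_a` has the higher accuracy but the lower macro F1, because it never detects the minority class.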

2606.17409 2026-06-17 cs.LG cs.AI Cross-list

Discrete Autoregressive Transformer for Generative Mechanism Synthesis

Anar Nurizada, Anurag Purwar

Affiliations * Computer-Aided Design and Innovation Lab, Department of Mechanical Engineering, Stony Brook University

AI summary: Proposes a discrete autoregressive transformer that casts planar path synthesis as conditional sequence modeling, generating joint coordinates conditioned on a VAE latent and a mechanism-type token to produce diverse, accurate mechanism designs.

Abstract

Planar path synthesis requires mechanisms whose coupler curves match a prescribed trajectory; the mapping from curve to linkage is inherently one-to-many across four-, six-, and eight-bar topologies. We address this design problem with simulation-grounded evaluation on a curated corpus of over one million mechanisms, reporting Chamfer distance and dynamic time warping after forward kinematics and geometric alignment. We formulate synthesis as conditional autoregressive sequence modeling: joint coordinates are uniformly quantized to tokens and generated by a decoder-only transformer with a variational-autoencoder (VAE) latent of the target curve and an explicit mechanism-type token. Training combines token cross-entropy with a Gaussian-smoothed bin auxiliary loss that respects ordinal structure among bins. At inference, a bounded latent-noise schedule decodes all mechanism types at each noise level; we retain the top five candidates by geometric error, yielding diverse accurate families without dataset lookup. On held-out tests, aggregate mean Chamfer distance is $0.0132$ and mean dynamic time warping is $0.153$; a latent $k$-nearest-neighbor baseline that conditions on training-set neighbor latents in VAE space achieves matched-topology mean Chamfer distance $0.0071$ and mean dynamic time warping $0.117$ using the same decoder.
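Three ingredients named above lend themselves to a small sketch: uniform quantization of coordinates into tokens, a Gaussian-smoothed target over bins that respects their ordering, and a Chamfer distance between curves. The bin count, coordinate range, smoothing width, and the particular Chamfer variant (symmetric sum of mean nearest-neighbor distances) are assumptions. The abstract does not fix any of them.

```python
import math


def quantize(x, lo, hi, n_bins):
    """Map a coordinate in [lo, hi] to an integer token in [0, n_bins - 1]."""
    t = (x - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(t * n_bins)))


def dequantize(token, lo, hi, n_bins):
    """Return the center of the bin that `token` refers to."""
    return lo + (token + 0.5) * (hi - lo) / n_bins


def gaussian_bin_target(token, n_bins, sigma):
    """Soft target over bins: neighbors of the true bin get partial credit."""
    w = [math.exp(-0.5 * ((k - token) / sigma) ** 2) for k in range(n_bins)]
    z = sum(w)
    return [v / z for v in w]


def chamfer(a, b):
    """Symmetric Chamfer distance between two 2D point lists (one common variant)."""
    def one_way(p, q):
        return sum(min(math.dist(u, v) for v in q) for u in p) / len(p)
    return one_way(a, b) + one_way(b, a)
```

A soft target of this kind penalizes a prediction one bin away far less than one twenty bins away, which a one-hot cross-entropy target alone cannot express.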

2606.17420 2026-06-17 eess.IV cs.AI q-bio.QM Cross-list

Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization

Jianwei Zhang, Xinyu Nie, Jiaxin Yue, Yonggang Shi

Affiliations * Stevens Neuroimaging and Informatics Institute, University of Southern California; Ming Hsieh Department of Electrical and Computer Engineering of Viterbi School of Engineering, University of Southern California; Alfred E. Mann Department of Biomedical Engineering of Viterbi School of Engineering, University of Southern California

AI summary: Proposes the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model, which performs stochastic transport between source and target domains via entropy-regularized optimal transport and combines a subgroup-aware endpoint proposal with a spherical convolutional backbone, achieving better distributional alignment and downstream disease classification than existing methods on tau PET SUVR maps.

Abstract

Tau PET imaging is central to tracking Alzheimer's disease progression, but systematic differences between scanners, protocols, and radiotracers across sites introduce nonbiological variability that inflates biomarker variance, reduces sensitivity to disease effects, and can bias downstream clinical assessments. Harmonization methods aim to remove these site-induced shifts while preserving biologically meaningful signal, yet existing approaches struggle when source and target cohorts differ in subgroup composition, risking conflation of site effects with biological variation such as tau-positivity status. We propose the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model to address this problem. Rather than routing data through a Gaussian noise prior as in diffusion-based methods, FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport. To enforce biologically consistent transport, FKRSBM incorporates a subgroup-aware endpoint proposal derived from a Feynman Kac reweighting of the reference bridge measure, implemented entirely through stratified importance sampling at the data level and requiring no changes to the underlying bridge-matching solver or network architecture. For surface-based neuroimaging, FKRSBM employs a spherical convolutional backbone operating on cortical meshes to perform vertex-level harmonization. We evaluate the method on tau PET SUVR maps, harmonizing PI-2620 data from the HABS-HD cohort into the AV-1451 domain of ADNI. Compared against ComBat, CycleGAN, a diffusion-based method (DF), and unregularized Diffusion Schrödinger Bridge Matching (DSBM), FKRSBM achieves superior distributional alignment, reduced tau-positivity sign mismatch, stronger APOE subgroup alignment, and improved downstream disease classification performance.

2606.17437 2026-06-17 cs.CV cs.AI cross-list

FacProcessTwin: An LLM-Based System for Process Twin Development

Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Prem Prakash Jayaraman

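Affiliations: Swinburne University of Technology

AI summary: Proposes FacProcessTwin, a system that uses a large language model to generate process models automatically from plant documentation and operators' natural-language input, and to bind them to live data. An interactive process diagram supports human-in-the-loop governance. In a food-manufacturing case study it reaches a mean F1 of 95.2% and builds each twin in roughly a sixth of the manual time.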

Abstract

Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, developing a process twin is costly. It requires accurately modelling the entire production process: its process steps, the equipment and product-specific settings each step uses, and its process variations. The resulting model must then be bound to live operational data. We present FacProcessTwin, a system that leverages a large language model (LLM) to reduce this development time, building a process twin from a plant's process documentation and natural-language input from an operator. FacProcessTwin generates this complete process model and then automatically binds its process steps to live operational data. The generated model and its data bindings are rendered as an interactive process diagram through which manufacturing personnel can monitor and correct the system's autonomous decisions, such as resolving uncertainty at safety-critical binding steps. We evaluate FacProcessTwin through a real-world case study of an Australian food manufacturer, covering 16 production process flows that span chilled, frozen, and aseptic shelf-stable product categories and include process variations within the same product. The results show that FacProcessTwin generates these process models accurately (a mean F1 of 95.2% against ground truth) and builds each twin in roughly a sixth of the manual time. Its human-in-the-loop governance then keeps the safety-critical bindings correct: at ambiguous tags where a single-pass baseline silently mis-binds 75.0% of the time, FacProcessTwin defers to the operator and mis-binds none.

2606.17668 2026-06-17 cs.LG cs.AI q-bio.QM cross-list

ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

Kexin Wu, Luonan Chen, Renxiao Wang

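Affiliations: Department of Medicinal Chemistry, School of Pharmaceutical Sciences, Fudan University; School of Mathematical Sciences and School of AI, Shanghai Jiao Tong University

AI summary: Proposes the ASTEROID framework, which reformulates molecular dynamics trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer to predict multi-step atomic coordinates directly. On several quantum-mechanics-derived molecular datasets it improves prediction accuracy and reduces computational cost.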

Comments 32 pages, 10 figures

Abstract

Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurately forecasting the outcomes of an MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics-derived molecular datasets. Our results indicate that ASTEROID not only achieved higher accuracy in multi-step prediction than existing methods on various benchmarks, but also significantly reduced the computational cost relative to conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.

2606.17702 2026-06-17 cs.CV cs.AI cross-list

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

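Affiliations: Department of Data Science and Artificial Intelligence, School of Computing and Artificial Intelligence, Faculty of Engineering and Technology, Sunway University; Faculty of Dentistry, Universiti Malaya

AI summary: Proposes the SegTME-UNI2 framework, which pairs the UNI2-H pathology foundation model with dual-head UperNet decoders for six-class semantic segmentation and nuclear instance segmentation, addresses the shortage of annotations with a three-stage pseudo-label curriculum, and uses an LLM to generate clinically interpretable TME reports.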

Abstract

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SegTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, so improvement comes entirely from better pseudo-label quality. Stage 1 uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 µm/pixel). Stage 2 uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 µm/pixel). Stage 3 uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 µm/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.
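The abstract does not say how the entropy filter in Stage 2 is defined, so the following is only a minimal sketch of one plausible reading: keep a prediction as a pseudo-label when the Shannon entropy of its class probabilities falls below a threshold. The function names and the threshold value are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def keep_confident(predictions, max_entropy=0.5):
    """Return (index, argmax label) for predictions with entropy <= max_entropy.

    The threshold 0.5 is an illustrative assumption, not a value from the paper.
    """
    kept = []
    for i, probs in enumerate(predictions):
        if entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=probs.__getitem__)
            kept.append((i, label))
    return kept
```

A lower threshold keeps fewer, cleaner pseudo-labels; a higher one keeps more data at the cost of more label noise.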

2606.17775 2026-06-17 cs.SD cs.AI cs.NE cross-list

A Neuromorphic Trigger for Efficient Audio Event Detection

Benjamin Hatton, Oliver Rhodes, Luca Peres

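Affiliations: ICNS, University of Manchester

AI summary: Proposes a low-cost front-end trigger based on a spiking neural network (SNN) that selectively gates audio segments, reaching an F1 score of 0.97 on anomalous sound detection and a potential 42.6× reduction in FLOPs on sound event detection.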

Comments 9 pages, 4 figures, 6 tables

Abstract

Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constrained systems. This paper introduces a neuromorphic trigger for audio event detection, based on a spiking neural network (SNN) that selectively gates input to downstream models. The proposed trigger acts as a low-cost front-end, identifying salient audio segments and forwarding only these to a more computationally intensive model for tasks such as classification. The trigger is implemented as a lightweight fully connected SNN and evaluated on two representative tasks: Anomalous Sound Detection (ASD) and Sound Event Detection (SED). For ASD, the trigger achieves a one-second segment-based F1 score of 0.97 on a class-agnostic form of the URBAN-SED dataset, demonstrating high reliability in identifying relevant audio regions. For SED, the trigger is combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset, showing a potential $42.6\times$ reduction in FLOPs while reducing the lower bound of the event-based error rate from 0.41 to 0.25. These results highlight the potential of neuromorphic triggers as real-time, energy-efficient front-end filters, enabling substantial reductions in computational cost.
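The gating idea can be illustrated with a toy example. This sketch assumes a single leaky integrate-and-fire (LIF) neuron driven by per-frame energies, which is far simpler than the fully connected SNN the paper describes; the leak factor, threshold, and function names are all hypothetical.

```python
def lif_trigger(frame_energies, leak=0.9, threshold=1.0):
    """Return the frame indices at which a leaky integrate-and-fire neuron spikes."""
    membrane = 0.0
    spikes = []
    for i, energy in enumerate(frame_energies):
        membrane = leak * membrane + energy  # leaky integration of the input
        if membrane >= threshold:
            spikes.append(i)
            membrane = 0.0  # reset after a spike
    return spikes

def gate(segments, energies_per_segment, **lif_params):
    """Forward only the segments whose energies make the trigger spike at least once."""
    return [
        seg for seg, energies in zip(segments, energies_per_segment)
        if lif_trigger(energies, **lif_params)
    ]
```

Only the segments returned by `gate` would reach the expensive downstream classifier, which is where the compute saving comes from.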

2606.17781 2026-06-17 cs.AR cs.AI cross-list

MIVE: A Minimalist Integer Vector Engine for Softmax LayerNorm and RMSNorm Acceleration

Kosmas Alexandridis, Giorgos Dimitrakopoulos

Affiliations: Integrated Circuits Lab, Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Greece

AI summary: Proposes MIVE, a programmable minimalist integer vector engine that executes Softmax, LayerNorm, and RMSNorm within a unified datapath, maximizing hardware sharing and improving area and hardware efficiency.

Abstract

The rapid growth of Large Language Models (LLMs) has intensified the need for specialized hardware accelerators that can satisfy stringent inference latency and power constraints. Although matrix multiplications dominate the overall computational workload, non-linear vector normalization operations, such as LayerNorm, RMSNorm, and Softmax, can become critical hardware bottlenecks. Existing accelerators typically implement these functions using dedicated hardware blocks, leading to duplicated resources and inefficient silicon utilization. To address this limitation, we propose a Minimalist Integer Vector Engine (MIVE), a programmable architecture capable of executing all three operations within a unified datapath. By exploiting common computational patterns across LayerNorm, RMSNorm, and Softmax, the proposed vector engine maximizes hardware sharing while reducing implementation overhead. Physical ASIC implementation results show that MIVE provides comprehensive multi-function support while achieving higher area and hardware efficiency than most state-of-the-art standalone accelerators.
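One way to see the common computational pattern is that all three operations subtract a shift, reduce the vector to one scalar, and then scale every element by it. The sketch below is a floating-point reference written under that assumption; it is not the paper's integer datapath, it omits the learned scale and bias of LayerNorm and RMSNorm, and the helper name `reduce_then_scale` is hypothetical.

```python
import math

def reduce_then_scale(xs, shift, reduce_fn, elementwise_fn):
    """Shared skeleton: subtract a shift, reduce to one scalar, divide each element by it."""
    shifted = [x - shift for x in xs]
    scalar = reduce_fn(shifted)
    return [elementwise_fn(s) / scalar for s in shifted]

def _root_mean_square(values, eps):
    return math.sqrt(sum(v * v for v in values) / len(values) + eps)

def softmax(xs):
    # Shift by the maximum for numerical stability, then normalize exponentials.
    return reduce_then_scale(
        xs, max(xs), lambda v: sum(math.exp(s) for s in v), math.exp)

def layernorm(xs, eps=1e-5):
    # Shift by the mean; the RMS of the centred values is the standard deviation.
    mean = sum(xs) / len(xs)
    return reduce_then_scale(
        xs, mean, lambda v: _root_mean_square(v, eps), lambda s: s)

def rmsnorm(xs, eps=1e-5):
    # No centring: the shift is zero.
    return reduce_then_scale(
        xs, 0.0, lambda v: _root_mean_square(v, eps), lambda s: s)
```

Seen this way, the three functions differ only in the shift, the reduction, and the per-element function, which is the kind of overlap a shared datapath can exploit.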

2606.17824 2026-06-17 cs.CV cs.AI cross-list

Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows

Paul Julius Kühn, Saptarshi Neil Sinha, Jakob Hansen, Robin Horst

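Affiliations: Fraunhofer IGD; Hochschule RheinMain

AI summary: Proposes a human-in-the-loop pipeline that produces a segmented atlas through greedy view selection, interactive segmentation with SAM 2, and UV back-projection. The atlas supports downstream tasks such as material assignment and style transfer, and the pipeline is demonstrated on eight cultural heritage objects.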

Abstract

Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model's UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
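The greedy set cover step can be sketched as follows. This is a generic textbook version, assuming each candidate view comes with the set of sampled surface points it sees; the visibility computation itself, the function name, and the data layout are hypothetical and not taken from the paper.

```python
def greedy_view_selection(view_to_points, all_points):
    """Greedy set cover: repeatedly pick the view that sees the most uncovered points.

    view_to_points maps a view name to the set of sampled surface points
    visible from it. Returns the chosen views and any points no view covers.
    """
    uncovered = set(all_points)
    chosen = []
    while uncovered:
        best = max(view_to_points, key=lambda v: len(view_to_points[v] & uncovered))
        gain = view_to_points[best] & uncovered
        if not gain:
            break  # the remaining points are invisible from every candidate view
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered
```

Points left in `uncovered` would correspond to regions such as cavities that no rendered view reaches, which matches the kind of case the abstract flags for manual correction.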

2606.17836 2026-06-17 cs.CV cs.AI cs.CG cs.GR cross-list

High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach

Hui Wang, Xiaowei Li, Chenxin Zhang, Yifan Feng, Jianwei Zuo, Yumeng Tang, Xiuli Sun, Jianliu Wang, Bing Xie, Jiajia Luo

Affiliations: Institute of Medical Technology, Peking University Health Science Center, Peking University; Biomedical Engineering Department, Institute of Advanced Clinical Medicine, Peking University; Department of Obstetrics and Gynecology, Peking University People’s Hospital

AI summary: Proposes a hybrid deformable shape modeling framework that combines deep learning prediction with iterative optimization to reconstruct high-fidelity 3D geometry of the bladder, uterus, and rectum, outperforming existing methods in geometric fidelity and mesh quality.

Abstract

Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study introduces a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism in which iterative optimization supervises deep learning during training, while at inference deep learning rapidly predicts the global organ morphology and iterative optimization then refines local surfaces and mesh quality. The framework achieved markedly higher geometric fidelity than current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.

2606.17867 2026-06-17 cs.CV cs.AI cross-list

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

Antonio Scardace, Daniele Ravì

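Affiliations: Department of Mathematics and Computer Science; University of Catania; Department MIFT; University of Messina

AI summary: Integrates tau-PET, structural MRI, cognitive scores, and APOE4 data to quantify redundancy and predictive dependencies among multimodal biomarkers, relates tau topologies to atrophy, and decomposes the tau-cognition association, improving the interpretability and selection of AD biomarkers.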

Comments Accepted to ICTS4eHealth 2026

Abstract

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

2606.17887 2026-06-17 cs.HC cs.AI cross-list

Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Features, Evaluation, and Deployment Workflow

Mostafa Darvishi

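Affiliations: IEEE

AI summary: A systems-oriented overview of the embedded machine-learning workflow for microcontroller platforms, covering engineering decisions such as sampling and buffering, feature extraction, validation under class imbalance, model/runtime co-design, and streaming deployment, with practical design rules illustrated by inertial motion recognition and keyword spotting.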

Comments 6 pages, 3 figures, 4 tables

Abstract

Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents a systems-oriented synthesis of an embedded machine-learning workflow for microcontroller-class platforms. The emphasis is placed on engineering decisions that are often hidden in generic machine-learning introductions: sampling and buffering, feature extraction as dimensionality reduction, validation under class imbalance, model/runtime co-design, and streaming deployment. Two representative signal families are used throughout the paper. The first is inertial motion recognition, where a two-second, three-axis accelerometer window is transformed from raw samples into root-mean-square and spectral features before classification. The second is keyword spotting, where audio is sampled, anti-aliased, transformed into mel-frequency cepstral coefficients, and processed by a compact one-dimensional convolutional network. The paper concludes with practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.
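The inertial example, where a window of raw accelerometer samples is reduced to root-mean-square and spectral features, can be sketched as below. This is a minimal illustration that assumes a 50 Hz sample rate and uses the dominant frequency from a naive DFT as the only spectral feature; the rate, the feature choice, and the function names are hypothetical, and a real microcontroller build would use a fixed-point FFT instead.

```python
import cmath
import math

def rms(samples):
    """Root-mean-square amplitude of one axis."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC bin, using a naive O(n^2) DFT."""
    n = len(samples)
    def magnitude(k):
        return abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                       for i, s in enumerate(samples)))
    best_bin = max(range(1, n // 2 + 1), key=magnitude)
    return best_bin * sample_rate_hz / n

def window_features(window, sample_rate_hz):
    """Map each axis name to (rms, dominant frequency) for one window."""
    return {axis: (rms(s), dominant_frequency(s, sample_rate_hz))
            for axis, s in window.items()}
```

With these assumptions a two-second, three-axis window of 300 raw samples collapses to six numbers, which is the dimensionality reduction the abstract refers to.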

2606.18181 2026-06-17 cs.IR cs.AI cs.CY cross-list

IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

Henry Bodwell, Hong Yang, John C. Simeone, Kelvin Gorospe, Bella Sullivan, Lana Huang, Jessica Gephart, Sandy Aylesworth, Molly Masterton, Naren Ramakrishnan

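Affiliations: University of Washington

AI summary: Introduces the term IUU+ to broaden the definition of illegal fishing and builds IUU+DB, a large language model driven system that extracts key incident information from heterogeneous documents and supports deduplication and trend analysis, providing data for fisheries regulation and research.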

Abstract

Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws. We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors. Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity, remains difficult to obtain. We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity. The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis. Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk assessments for industry, and provide support for policy implementation and targeted enforcement efforts to government agencies.

2606.04513 2026-06-17 cs.AI updated version

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Deguo Xia, Zihan Li, Haochen Zhao, Dong Xie, Yuyao Kong, Xiyan Liu, Jizhou Huang, Mengmeng Yang, Diange Yang

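Affiliations: Tsinghua University; Baidu; University of Macau; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: Proposes the MapAgent framework, which combines a vision-language model with constraint-aware reasoning in a verification-driven Judge-Planner-Worker loop to correct specification violations in lane-map generation, enabling highly automated city-scale production.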

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3818443

Abstract

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

2606.12742 2026-06-17 cs.AI cs.AR updated version

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

Farough Shayeste Roodi, Parham Zilouchian Moghaddam, Mahdi Mohammadi-nasab, Mehdi Modarressi, Mostafa Ersali Salehi Nasab, Masoud Daneshtalab

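Affiliations: University of Tehran; Mälardalen University; Royal Institute of Technology

AI summary: Studies parameter quantization and electrode reduction as ways to deploy DNN models on resource-constrained wearable devices, characterizing the trade-off between accuracy and complexity in EEG analysis.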

Abstract

Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial biological signals, namely ECG and EEG, which reflect the activity of the heart and brain, respectively. Although deep neural networks are considered the primary way to process and analyze these signals, the very tight energy and computational power constraints in wearable devices are far below the computational, energy, and memory bandwidth demands of DNN models, thereby impeding the deployment of deep learning in many practical wearable services. This paper investigates the feasibility of deploying state-of-the-art DNN models in resource-constrained wearable devices. Notably, we explore the trade-off between accuracy and computational complexity of DNNs when parameter quantization and electrode reduction methods are used. Our investigation centers on several state-of-the-art DNN models designed for EEG signal analysis, specifically for detecting epileptic seizures. Our findings demonstrate that, when applied judiciously, these techniques can significantly reduce the complexity of the DNNs under consideration with minimal adverse effects on accuracy. These results reveal the explicit trade-offs between accuracy and complexity reduction encountered when adapting DNN-based online EEG analysis for wearable devices.
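Parameter quantization, one of the two techniques studied here, can be illustrated with a generic scheme. The abstract does not state which quantizer or bit width the authors use, so this sketch assumes symmetric per-tensor 8-bit quantization; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

Each weight is then stored in one byte instead of four, and the reconstruction error per weight is bounded by about half the scale, which is the accuracy cost traded against the memory saving.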

2606.13258 2026-06-17 cs.AI updated version

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

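Affiliations: Nanyang Technological University; Pacific Parkinson's Research Centre, University of British Columbia

AI summary: For the modality-incremental setting in Parkinson's disease gait assessment, proposes the MOSAIC framework, which uses Modality-Specific Warm-Up, a statistics-decoupled MSBN architecture, and a curriculum-guided repulsive objective to address unreliable cross-modal distillation, statistical shifts, and reduced plasticity.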

英文摘要

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git


2606.16337 2026-06-17 cs.AI cs.HC cs.LG 版本更新

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

医学启发式学习：一个用于可解释和可审计临床决策规则的LLM驱动框架

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Affiliations: Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University; Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology Terahertz Science Application Center (TSAC), Beijing Institute of Technology; Department of Critical Care Medicine, Yantai Yuhuangding Hospital, Qingdao University; Faculty of Education, The University of Hong Kong; College of Information Engineering, Dalian University

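AI summary: Proposes Medical Heuristic Learning (MHL), which uses an LLM-driven workflow to optimize a deterministic, executable decision system and produces interpretable, auditable Python decision rules; on medical datasets it performs comparably to state-of-the-art methods and remains strong in small-sample and highly imbalanced settings.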


AI中文摘要

临床表格数据的预测建模是临床决策支持的核心，因此不仅需要强大的预测性能，还需要透明的决策逻辑。尽管深度学习和基于树的集成方法可以实现高精度，但其黑箱性质仍然是临床部署的主要障碍。这一挑战因医疗数据的常见特征而进一步加剧，包括有限的样本量、严重的类别不平衡以及因诊断标准和临床文档变化引起的特征演化。为了解决这些问题，我们提出了医学启发式学习（MHL），这是临床表格预测中超越梯度学习范式的一个实例。MHL不依赖神经网络权重更新，而是使用大型语言模型（LLM）驱动的工作流，整合统计探测、医学知识探测、规则合成和代码级迭代优化，以优化一个确定性的可执行决策系统。最终模型不是以不透明的参数表示，而是作为版本化的纯Python决策规则，这些规则明确可解释、完全可审计且具有临床基础。MHL还支持持续学习，从先前验证的规则开始，并在数据漂移或特征演化下使用更新的特征信息迭代修订规则。在医学数据集上的全面实验表明，MHL在保持与小样本和高度不平衡设置下强健行为的同时，实现了与最先进方法相当的性能。结果进一步表明，这种显式规则更新机制有助于缓解特征演化下的灾难性遗忘。总体而言，这些发现表明，非基于梯度的启发式系统为高风险临床决策支持提供了一种透明且可适应的替代方案。

英文摘要

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.


2407.13053 2026-06-17 cs.CY cs.AI cs.CL cs.LG 版本更新

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

E2Vec：基于时间信息的特征嵌入用于分析电子书系统中的学生行为

Yuma Miyazaki, Valdemar Švábenský, Yuta Taniguchi, Fumiya Okubo, Tsubasa Minematsu, Atsushi Shimada

Affiliations: Kyushu University

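AI summary: Proposes E2Vec, which uses word embeddings to turn operation logs and their time intervals into student vectors for an at-risk detection task, showing potential for generalizability and performance.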

Comments Research paper published in the Proceedings of the 17th Educational Data Mining Conference (EDM 2024), see https://doi.org/10.5281/zenodo.12729853


DOI: 10.5281/zenodo.12729853

AI中文摘要

数字教科书（电子书）系统将学生与教科书的交互记录为一系列事件，称为事件流数据。过去，研究人员从事件流中提取有意义的特征，并将其用作下游任务（如成绩预测和学生行为建模）的输入。先前的研究评估了主要使用基于统计的特征（如操作类型数量或访问频率）的模型。虽然这些特征有助于提供某些见解，但它们缺乏捕捉不同学生学习行为中细粒度差异的时间信息。本研究提出E2Vec，一种基于词嵌入的新型特征表示方法。该方法将每个学生的操作日志及其时间间隔视为字符字符串序列，并生成包含时间信息的学习活动特征的学生向量。我们应用fastText为来自两年计算机科学课程数据集的305名学生生成嵌入向量。然后，我们研究了E2Vec在风险检测任务中的有效性，展示了其泛化性和性能潜力。

英文摘要

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.
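
As a rough illustration of the encoding idea in this abstract, the sketch below turns a timestamped operation log into a character string in which operations and the gaps between them each become one symbol. The operation names, the symbol table, and the interval thresholds are all hypothetical; the abstract does not give the actual mapping used by E2Vec.

```python
# Hypothetical symbol table: one character per e-book operation.
OP_CODES = {"OPEN": "O", "NEXT": "N", "PREV": "P", "ADD_MARKER": "M", "CLOSE": "C"}


def interval_symbol(seconds):
    """Bucket the gap between two events into a short/medium/long symbol (assumed thresholds)."""
    if seconds < 10:
        return "s"
    if seconds < 60:
        return "m"
    return "l"


def encode_log(events):
    """Encode a time-sorted list of (timestamp_seconds, operation) pairs as one string."""
    chars = []
    prev_t = None
    for t, op in events:
        if prev_t is not None:
            # Insert a symbol for the time elapsed since the previous event.
            chars.append(interval_symbol(t - prev_t))
        chars.append(OP_CODES[op])
        prev_t = t
    return "".join(chars)


log = [(0, "OPEN"), (5, "NEXT"), (50, "NEXT"), (200, "ADD_MARKER"), (203, "CLOSE")]
print(encode_log(log))
```

Strings of this kind could then be split into tokens and passed to a word-embedding model such as fastText, which is the step the abstract describes for producing student vectors.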


2501.00826 2026-06-17 q-fin.TR cs.AI 版本更新

LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

基于LLM的多智能体系统实现自动化加密货币投资组合管理

Yichen Luo, Yebo Feng, Jiahua Xu, Paolo Tasca, Yang Liu

Affiliations: University College London; Nanyang Technological University; Exponential Science

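AI summary: Proposes a three-agent system (Crypto, News, and Trading agents) that fuses multi-modal signals across hierarchical, collaborative, and debate architectures; in a 2025 backtest it achieves a 133.52% cumulative return and a Sharpe ratio of 1.502, outperforming single-agent and deep learning baselines.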


AI中文摘要

加密货币投资组合管理需要在高度波动和实时约束下融合异构多模态信号，包括结构化的价格和链上时间序列、非结构化的新闻文本以及技术指标。虽然深度学习方法显示出预测能力，但其不透明性限制了实际应用，而单个大语言模型（LLM）智能体难以处理稳健决策所需的多模态输入广度。我们提出一个多智能体系统（MAS）框架，其中三个模态专业智能体——负责市场动态的加密货币智能体、负责每周新闻情绪的新闻智能体和负责信号融合与投资组合执行的交易智能体——通过三种通信架构（分层、协作和辩论）分解任务。我们评估了四种能力配置：零样本、思维链（CoT）、检索增强生成（RAG）和技能增强。在2025年1月按市值排名前15的L1区块链原生加密货币的52周回测中，最佳配置（分层技能）实现了133.52%的累计收益和1.502的夏普比率，优于单智能体变体、被动基准和深度学习基线。消融研究确定加密货币智能体是最关键的组件，移除它会使累计收益降低42.57个百分点。跨模型比较进一步表明，在GPT-4o、GPT-5和Claude Sonnet 4.5下，MAS均优于单智能体基线，表明多智能体协调的优势与模型无关。与黑箱深度学习模型不同，每个投资组合决策都可追溯到明确的智能体推理，为多模态加密货币投资组合管理提供了一种可解释且有效的方法。

英文摘要

Cryptocurrency portfolio management requires the fusion of heterogeneous multi-modal signals, including structured price and on-chain time series, unstructured news text, and technical indicators, under high-volatility and real-time constraints. While deep learning approaches show predictive capability, their opacity limits practical adoption, and single large language model (LLM) agents struggle to process the breadth of modality-specific inputs needed for robust decision-making. We propose a multi-agent system (MAS) framework in which three modality-specialised agents, a Crypto Agent for market dynamics, a News Agent for weekly news sentiment, and a Trading Agent for signal fusion and portfolio execution, decompose the task across three communication architectures: hierarchical, collaborative, and debate. We evaluate four capability configurations: zero-shot, chain-of-thought (CoT), retrieval-augmented generation (RAG), and skill-augmented. In a 52-week backtest over calendar year 2025 across the top 15 L1 blockchain native cryptocurrencies by market capitalisation as of January 2025, the best configuration, Hierarchical (Skill), achieves a cumulative return of 133.52% and a Sharpe ratio of 1.502, outperforming single-agent variants, passive benchmarks, and deep learning baselines. An ablation study identifies the Crypto Agent as the most critical component, with its removal reducing cumulative return by 42.57 percentage points. A cross-model comparison further shows that MAS outperforms the single-agent baseline under GPT-4o, GPT-5, and Claude Sonnet 4.5, suggesting that the benefit of multi-agent coordination is model-agnostic. Unlike black-box deep learning models, every portfolio decision is traceable to explicit agent reasoning, offering an interpretable and effective approach to multi-modal cryptocurrency portfolio management.
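
For readers unfamiliar with the two headline metrics in this abstract, here is a minimal sketch of how cumulative return and an annualized Sharpe ratio are commonly computed from weekly returns. The weekly figures are made up, and the zero risk-free rate and the sqrt(52) annualization are assumptions; the abstract does not state the exact conventions the authors used.

```python
import math
from statistics import mean, stdev


def cumulative_return(period_returns):
    """Compound simple per-period returns into a single cumulative return."""
    growth = 1.0
    for r in period_returns:
        growth *= 1.0 + r
    return growth - 1.0


def annualized_sharpe(period_returns, periods_per_year=52, risk_free_per_period=0.0):
    """Mean excess return divided by its sample standard deviation, scaled to a yearly figure."""
    excess = [r - risk_free_per_period for r in period_returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


weekly = [0.02, -0.01, 0.03, 0.00, -0.02, 0.04]  # made-up weekly portfolio returns
print(f"cumulative return: {cumulative_return(weekly):.4f}")
print(f"annualized Sharpe: {annualized_sharpe(weekly):.3f}")
```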


2502.17518 2026-06-17 cs.LG cs.AI q-fin.CP stat.ML 版本更新

Ensemble RL through Classifier Models: Enhancing Risk-Return Trade-offs in Trading Strategies

通过分类器模型进行集成强化学习：在交易策略中增强风险回报权衡

Zheli Xiong

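AI summary: This paper presents a comprehensive study of ensemble reinforcement learning models for financial trading strategies, using classifier models to improve performance. By combining RL algorithms such as A2C, PPO, and SAC with traditional classifiers such as support vector machines (SVM), decision trees, and logistic regression, it examines how different groups of classifiers can be integrated to improve risk-return trade-offs. The study evaluates the effectiveness of various ensemble methods against individual RL models on key financial metrics, including cumulative return, Sharpe ratio (SR), Calmar ratio, and maximum drawdown (MDD). The results indicate that the ensemble methods consistently outperform the base models on risk-adjusted returns, with better drawdown management and overall stability. However, ensemble performance is sensitive to the choice of the variance threshold τ, which underscores the importance of adjusting τ dynamically for best performance. The study highlights the value of combining RL with classifiers for adaptive decision-making, with implications for financial trading, robotics, and other dynamic environments.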

Comments 23 pages, 10 figures, 9 tables


AI中文摘要

Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation

Chen Si, Qianyi Wu, Chaitanya Amballa, Romit Roy Choudhury

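AI summary: Proposes MiNAF, which queries a room mesh and extracts distance distributions as explicit local geometric features to guide a neural implicit model toward more accurate room impulse response (RIR) predictions, performing competitively across several metrics.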


AI中文摘要

逼真的声音模拟在许多应用中起着关键作用。声音模拟的一个关键要素是房间脉冲响应（RIR），它描述了声音在给定空间中的传播方式。最近的研究应用神经隐式方法，利用从环境中收集的上下文信息（如场景图像）来学习RIR。然而，这些方法没有有效利用环境中的显式几何信息。为了进一步利用具有直接几何特征的神经隐式模型，我们提出了MiNAF，它在给定位置查询粗略的房间网格，并提取距离分布作为局部上下文的显式表示。我们的方法表明，结合显式的局部几何特征可以更好地引导模型生成更准确的RIR预测。通过与常规和最先进方法的比较，我们展示了MiNAF在各种评估指标上具有竞争力的性能。

英文摘要

Realistic sound simulation plays a critical role in many applications. A key element in sound simulation is the room impulse response (RIR), which characterizes how sound propagates within a given space. Recent studies have applied neural implicit methods to learn RIR using context information collected from the environment, such as scene images. However, these approaches do not effectively leverage explicit geometric information from the environment. To further exploit neural implicit models with direct geometric features, we present MiNAF, which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the model in generating more accurate RIR predictions. Through comparisons with conventional and state-of-the-art methods, we show that MiNAF performs competitively across various evaluation metrics.


2510.21127 2026-06-17 cs.NI cs.AI 版本更新

Enhanced Evolutionary Multi-Objective Deep Reinforcement Learning for Reliable and Efficient Wireless Rechargeable Sensor Networks

增强型进化多目标深度强化学习用于可靠高效无线可充电传感器网络

Bowei Tong, Hui Kang, Jiahui Li, Geng Sun, Jiacheng Wang, Yaoqi Yang, Bo Xu, Dusit Niyato

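AI summary: To address the trade-off between node survival rate and charging energy efficiency in wireless rechargeable sensor networks, proposes an enhanced evolutionary multi-objective deep reinforcement learning algorithm that combines an LSTM policy network, an MLP-based prospective increment model, and time-varying Pareto policy evaluation, and that significantly outperforms existing approaches.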

Comments The article content needs to be significantly revised


AI中文摘要

尽管传感器网络取得了快速进展，但传统的电池供电传感器网络存在运行寿命有限和维护频繁的问题，严重限制了其在偏远和不可达环境中的部署。因此，具有移动充电能力的无线可充电传感器网络（WRSNs）为延长网络寿命提供了一种有前景的解决方案。然而，WRSNs面临着在动态运行条件下最大化节点存活率与最大化充电能效之间固有权衡的关键挑战。在本文中，我们研究了一个典型场景，其中移动充电器移动并为传感器充电，从而在最小化能量浪费的同时维持网络连通性。具体而言，我们制定了一个多目标优化问题，该问题同时最大化多个时隙内的网络节点存活率和移动充电器能量使用效率，这具有NP-hard计算复杂性和长期时间依赖性，使得传统优化方法无效。为了解决这些挑战，我们提出了一种增强型进化多目标深度强化学习算法，该算法集成了基于长短期记忆（LSTM）的策略网络用于时间模式识别、基于多层感知器的前瞻增量模型用于未来状态预测，以及时变Pareto策略评估方法用于动态偏好适应。大量仿真结果表明，所提算法在平衡节点存活率和能量效率方面显著优于现有方法，同时生成多样化的Pareto最优解。此外，LSTM增强的策略网络比传统网络收敛速度快25%，时变评估方法有效适应动态条件。

英文摘要

Despite rapid advancements in sensor networks, conventional battery-powered sensor networks suffer from limited operational lifespans and frequent maintenance requirements that severely constrain their deployment in remote and inaccessible environments. As such, wireless rechargeable sensor networks (WRSNs) with mobile charging capabilities offer a promising solution to extend network lifetime. However, WRSNs face critical challenges from the inherent trade-off between maximizing the node survival rates and maximizing charging energy efficiency under dynamic operational conditions. In this paper, we investigate a typical scenario where mobile chargers move and charge the sensor, thereby maintaining the network connectivity while minimizing the energy waste. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the network node survival rate and mobile charger energy usage efficiency across multiple time slots, which presents NP-hard computational complexity with long-term temporal dependencies that make traditional optimization approaches ineffective. To address these challenges, we propose an enhanced evolutionary multi-objective deep reinforcement learning algorithm, which integrates a long short-term memory (LSTM)-based policy network for temporal pattern recognition, a multilayer perceptron-based prospective increment model for future state prediction, and a time-varying Pareto policy evaluation method for dynamic preference adaptation. Extensive simulation results demonstrate that the proposed algorithm significantly outperforms existing approaches in balancing node survival rate and energy efficiency while generating diverse Pareto-optimal solutions. Moreover, the LSTM-enhanced policy network converges 25% faster than conventional networks, with the time-varying evaluation method effectively adapting to dynamic conditions.
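
The abstract refers to Pareto-optimal solutions over two objectives, node survival rate and energy efficiency. The sketch below shows only the generic notion of Pareto dominance for maximization problems; it is not the paper's time-varying evaluation method, and the candidate scores are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives are maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented (survival_rate, energy_efficiency) scores for five candidate policies.
candidates = [(0.90, 0.40), (0.80, 0.60), (0.70, 0.55), (0.60, 0.80), (0.85, 0.40)]
print(pareto_front(candidates))
```

Here the third and fifth candidates drop out because another candidate matches or beats them on both objectives; the remaining three represent different trade-offs between the two goals.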


2510.23798 2026-06-17 cs.CV cs.AI 版本更新

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

一种基于几何和深度学习的可复现流水线，用于利用原位摄像头监测城市河流中的漂浮人为碎片

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

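AI summary: Proposes a framework combining a geometric model with deep learning to continuously quantify and monitor floating debris in urban rivers using fixed cameras, evaluates the accuracy and speed of different models under complex environmental conditions, and estimates debris size via projective geometry.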


AI中文摘要

河流中漂浮人为碎片的扩散已成为一个紧迫的环境问题，对生物多样性、水质以及人类活动（如航行和娱乐）产生不利影响。本研究提出了一种新颖的方法框架，利用固定的原位摄像头监测上述废弃物。本研究提供了两个关键贡献：（i）利用深度学习对漂浮碎片进行连续量化和监测；（ii）在复杂环境条件下，识别出在精度和推理速度方面最合适的深度学习模型。这些模型在多种环境条件和学习配置下进行测试，包括与数据泄漏相关的偏差实验。此外，实现了一个几何模型，用于从二维图像估计检测对象的实际尺寸。该模型利用了相机的内参和外参特性。本研究结果强调了数据集构建协议的重要性，特别是在负样本图像的整合和时间泄漏的考虑方面。最后，证明了使用投影几何结合回归校正进行公制物体估计的可行性。该方法为开发稳健、低成本、自动化的城市水生环境监测系统铺平了道路。

英文摘要

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.


2512.25065 2026-06-17 cs.OS cs.AI cs.DC 版本更新

Vulcan: Instance-specialized, Verifiable Systems Heuristics Through LLM-driven Search

Vulcan：通过LLM驱动的搜索实现实例特化的可验证系统启发式方法

Rohit Dwivedula, Divyanshu Saxena, Sujay Yadalam, Eric Hayden Campbell, Daehyeok Kim, Aditya Akella

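AI summary: Proposes the Vulcan framework, which uses LLMs to generate systems heuristics, ensures safety by isolating the decision logic and using the restricted language Anvil, and reports significant performance gains in scheduling, caching, and memory management.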

Comments 19 pages


AI中文摘要

系统资源管理任务主要依赖于手工设计的启发式方法。然而，日益增长的硬件异构性和工作负载多样性要求针对特定部署实例进行特化的启发式方法，这使得手动设计成本高昂且难以扩展。在本文中，我们探索如何使用LLM合成系统启发式方法。主要挑战是确保生成的启发式方法安全执行、正确集成到周围系统中，同时仍能实现强大的性能。我们提出Vulcan，一个识别LLM友好接口的框架，该接口将核心决策逻辑与其余实现隔离。使用Vulcan，LLM生成的代码被限制为简单的无状态决策函数，而可信的运行时抽象提供丰富的派生统计信息，用于有意义的策略探索，而不会出现系统集成错误。为了确保执行安全，LLM使用受限语言Anvil合成启发式方法，该语言通过构造保证重要属性。我们在三个研究充分的领域评估Vulcan，并展示了在spot-VM调度中高达4.9倍的节省，缓存驱逐中高达2倍的未命中率降低，以及分层内存系统中高达10%的应用性能提升，同时全程确保执行安全。

英文摘要

Systems resource management tasks rely primarily on hand-designed heuristics. However, growing hardware heterogeneity and workload diversity require heuristics specialized to particular deployment instances, making manual design expensive and difficult to scale. In this paper, we explore how to synthesize systems heuristics using LLMs. The main challenge is ensuring that generated heuristics execute safely, integrate correctly with the surrounding system, and still achieve strong performance. We propose Vulcan, a framework that identifies LLM-friendly interfaces that isolate core decision logic from the rest of the implementation. With Vulcan, LLM-generated code is restricted to simple stateless decision functions, while trusted runtime abstractions provide rich derived statistics for meaningful policy exploration without system-integration bugs. To ensure execution safety, LLMs synthesize heuristics in a restricted language, Anvil, that guarantees important properties by construction. We evaluate Vulcan across three well-studied domains and demonstrate up to 4.9x higher savings for spot-VM scheduling, up to 2x lower miss ratios for cache eviction, and up to 10% higher application performance for tiered-memory systems, while ensuring execution safety throughout.


2603.04438 2026-06-17 eess.IV cs.AI cs.LG 版本更新

CogGen: Cognitive-Load-Inspired Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction

CogGen: 认知负荷启发的全无监督深度生成模型用于压缩感知MRI重建

Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang

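AI summary: Proposes CogGen, a framework based on the cognitive easy-to-hard principle that uses self-paced curriculum learning and an MRI-aware dual-threshold weighting strategy to recast CS-MRI reconstruction as a staged inversion problem; the analysis shows reduced local sufficient-iteration and cumulative noise-amplification bounds, and experiments show it outperforms existing unsupervised and supervised methods.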


AI中文摘要

全无监督深度生成建模（FU-DGM）为压缩感知磁共振成像（CS-MRI）重建提供了巨大潜力。代表性的FU-DGM公式，如深度图像先验（DIP）和隐式神经表示（INR），利用架构偏置在图像空间中诱导与正向观测对齐的低维流形。然而，由于底层逆系统高度病态，FU-DGM中长时间的迭代拟合通常导致效率低下和噪声放大。本文受认知易到难学习原则的启发，提出CogGen，一种将CS-MRI重建重新表述为分阶段反演问题的FU-DGM框架。具体地，CogGen通过MRI感知的双阈值加权准则实现自定进度课程学习（SPCL）驱动的渐进调度策略，该准则自适应地调节k空间测量参与。数据一致性残差阈值评估当前生成器的拟合可靠性，而k空间半径阈值控制阶段性的测量暴露，从而避免整个优化过程中的均匀拟合。理论上，我们的分析表明，当早期阶段倾向于易拟合的测量时，CogGen产生更低的局部充分迭代界和更小的累积噪声放大界，解释了CogGen在有限迭代预算内改进的收敛行为和重建保真度。数值实验表明，CogGen的两种实例化，CogGen-DIP和CogGen-INR，在包括无监督和有监督流程在内的现有CS-MRI重建技术中实现了优越的性能。

英文摘要

Fully unsupervised deep generative modeling (FU-DGM) offers significant potential for compressively sampled magnetic resonance imaging (CS-MRI) reconstruction. Representative FU-DGM formulations, such as deep image prior (DIP) and implicit neural representation (INR), employ architectural bias to induce a low-dimensional manifold in the image space that aligns with the forward observation. However, as the underlying inverse system is highly ill-posed, prolonged iterative fitting in FU-DGM typically leads to poor efficiency and noise amplification. In this paper, guided by the cognitive principle of easy-to-hard learning, we propose CogGen, an FU-DGM framework that reformulates CS-MRI reconstruction as a staged inversion problem. Specifically, CogGen implements a self-paced curriculum learning (SPCL)-driven progressive scheduling strategy through an MRI-aware dual-threshold weighting criterion, which adaptively regulates k-space measurement participation. The data-consistency residual thresholding evaluates the fitting reliability of the current generator, while the k-space radius thresholding controls stage-wise measurement exposure, thereby avoiding uniform fitting throughout optimization. Theoretically, our analysis shows that, when early stages favor easy-to-fit measurements, CogGen yields a reduced local sufficient-iteration bound and a smaller cumulative noise-amplification bound, explaining the improved convergence behavior and reconstruction fidelity of CogGen within a finite iteration budget. Numerical experiments demonstrate that both CogGen instantiations, CogGen-DIP and CogGen-INR, achieve superior performance over prevailing CS-MRI reconstruction techniques, including unsupervised and supervised pipelines.


2603.26551 2026-06-17 cs.CV cs.AI 版本更新

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

超越MACs：面向视觉骨干网络的硬件高效架构设计

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Affiliations: Machine Learning and Perception Lab, University of Udine; Centre for Vision Research, York University

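AI summary: Addressing the shortcomings of the MACs metric on edge devices, proposes the LowFormer backbone family based on hardware-efficiency insights, achieving significant speed-ups with the lightweight Lowtention module.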

Comments Accepted at International Journal of Computer Vision (IJCV)


DOI: 10.1007/s11263-026-02873-5
Journal ref: Int J Comput Vis 134, 295 (2026)

AI中文摘要

视觉骨干网络在现代计算机视觉中扮演核心角色。提升其效率直接惠及广泛下游应用。为衡量效率，许多出版物依赖MACs（乘累加操作）作为执行时间的预测指标。本文通过实验证明该指标的缺陷，尤其在边缘设备场景下。通过对比常见架构设计元素的MAC计数和执行时间，我们识别出高效执行的关键因素，并提供优化骨干设计的见解。基于这些见解，我们提出LowFormer，一种新型视觉骨干家族。LowFormer采用流线型的宏观和微观设计，包括Lowtention——多头自注意力的轻量级替代方案。Lowtention不仅更高效，还在ImageNet上取得了更优结果。此外，我们提出LowFormer的边缘GPU版本，可进一步提升其在边缘GPU和桌面GPU上的基线速度。通过在更小图像分类数据集上的评估以及将其适配到多个下游任务（如目标检测、语义分割、图像检索和视觉目标跟踪），我们展示了LowFormer的广泛适用性。与近期最先进的骨干网络相比，LowFormer模型在各种硬件平台上均实现了显著加速。代码和模型见此链接。

英文摘要

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
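
To make the MACs metric discussed in this abstract concrete, the sketch below counts multiply-accumulate operations for a standard convolution and for a depthwise-separable replacement. The layer sizes are arbitrary examples, not taken from the paper. The count says nothing about execution time, which is the abstract's point: two designs with very different MAC counts need not differ by the same factor in measured latency.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """MACs for a 2D convolution: every output element needs (c_in / groups) * k * k of them."""
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


# Arbitrary example layer: 56x56 output, 64 input and 64 output channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)
separable = depthwise + pointwise

print(f"standard 3x3 conv:        {standard:,} MACs")
print(f"depthwise + 1x1 pointwise: {separable:,} MACs")
print(f"MAC reduction:             {standard / separable:.2f}x")
```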


2604.09998 2026-06-17 cs.CR cs.AI 版本更新

Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit

像锤子一样，它能建造，也能破坏：Reddit上网络安全运营中大语言模型的使用、认知与采纳

Souradip Nath, Chih-Yi Huang, Aditi Ganapathi, Kashyap Thimmaraju, Jaron Mink, Gail-Joon Ahn

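AI summary: Through a mixed-methods analysis of 892 posts from Reddit cybersecurity forums, studies how security practitioners use, perceive, and adopt LLM tools; finds that LLMs are mainly used for low-risk, productivity-oriented tasks and that enterprise-grade security platforms attract interest, while reliability, verification overhead, and security concerns limit the autonomy granted to these tools.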

Comments This paper appears in the Proceedings of the Twenty-Second Symposium on Usable Privacy and Security (SOUPS) 2026


AI中文摘要

大语言模型（LLM）近期作为增强安全运营中心（SOC）工作流程的有前景工具出现，供应商越来越多地推广用于SOC的自主AI解决方案。然而，对于现实世界安全从业者如何使用、感知和采纳这些工具，仍缺乏实证理解。为填补这一空白，我们对网络安全论坛中的讨论进行了混合方法分析，以了解多样化从业者群体如何将现代LLM工具用于安全运营。具体而言，我们分析了Reddit上三个网络安全论坛在2022年12月至2025年9月间的892篇帖子，并采用定性编码和统计分析相结合的方法，从三个维度考察安全从业者如何讨论LLM工具：（1）他们声明的工具和用例，（2）每个工具在一组关键因素上的感知优缺点，以及（3）他们对这些工具的采纳以及对网络安全行业和个人分析师的预期影响。总体而言，我们的发现揭示了LLM工具采纳的细微模式，突出了LLM在低风险、生产力导向任务中的独立使用，以及对企业级、安全导向LLM平台的积极兴趣。尽管从业者报告了LLM辅助工作流程在效率和效果上的显著提升，但可靠性、验证开销和安全问题等持续存在的问题严重限制了赋予LLM工具的自主性。基于这些结果，我们还为开发和采纳LLM工具提供了建议，以确保组织的安全和网络安全从业者的安全。

英文摘要

Large language models (LLMs) have recently emerged as promising tools for augmenting Security Operations Center (SOC) workflows, with vendors increasingly marketing autonomous AI solutions for SOCs. However, there remains a limited empirical understanding of how such tools are used, perceived, and adopted by real-world security practitioners. To address this gap, we conduct a mixed-methods analysis of discussions in cybersecurity-focused forums to learn how a diverse group of practitioners use and perceive modern LLM tools for security operations. More specifically, we analyzed 892 posts between December 2022 and September 2025 from three cybersecurity-focused forums on Reddit, and, using a combination of qualitative coding and statistical analysis, examined how security practitioners discuss LLM tools across three dimensions: (1) their stated tools and use cases, (2) the perceived pros and cons of each tool across a set of critical factors, and (3) their adoption of such tools and the expected impacts on the cybersecurity industry and individual analysts. Overall, our findings reveal nuanced patterns in LLM tools adoption, highlighting independent use of LLMs for low-risk, productivity-oriented tasks, alongside active interest around enterprise-grade, security-focused LLM platforms. Although practitioners report meaningful gains in efficiency and effectiveness in LLM-assisted workflows, persistent issues with reliability, verification overheads, and security risks sharply constrain the autonomy granted to LLM tools. Based on these results, we also provide recommendations for developing and adopting LLM tools to ensure the security of organizations and the safety of cybersecurity practitioners.


2605.12729 2026-06-17 cs.NI cs.AI cs.CR 版本更新

Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

用于代理网络运维和AI运维的大型语言模型：架构、评估与安全

Muhammad Bilal, Jon Crowcroft, Ruizhi Wang, Xiaolong Xu, Schahram Dustdar

Affiliations: School of Computing and Communications; University of Cambridge; School of Software; Nanjing University of Information Science and Technology; TU Wien; ICREA

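AI summary: Surveys the use of large language models in NetOps and AIOps, analyzing agent architectures, evaluation methods, and safety challenges, and emphasizes that system reliability depends on the machinery around the model rather than on the model itself.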

Comments 49 pages, 15 figures, 6 tables; survey article


AI中文摘要

大型语言模型正越来越多地用于支持网络运维（NetOps）和人工智能运维（AIOps），包括事件调查、根本原因分析、配置合成和有限的自动修复。在NetOps和AIOps中，这种转变正在改变任务管理方式。基于代理的操作作为工作流，从收集证据到采取行动，遵循权限、政策和检查，并在必要时提供回滚选项。这至关重要，因为操作决策可能立即产生影响。为了使论点具体化，我们围绕自主性层次、工具范围、证据轨迹和保证合同组织相关文献。这些合同定义了代理可以观察、提议和执行的内容，以及在允许任何行动前必须通过的检查。在 telemetry 查询推荐、诊断、根本原因分析、配置合成、变更规划和有限自动修复的研究中，出现了一致的模式。操作可靠性主要不来自模型本身，而是依赖于模型周围的机制。我们还主张评估应超越静态问答。代理NetOps和AIOps系统需要以工作流为中心的评估，包括轨迹质量、受限制的工具使用、安全提案生成、沙盒环境中的回放以及具有回滚意识的试用。没有这些措施，系统可能看起来稳健，但实际上可能过于脆弱。最后，我们检查了当代理接近操作控制面时，安全、隐私和治理风险变得尖锐的问题。综合来看，本文得出结论：智能NetOps和AIOps的进步将取决于将自主性视为受限制的操作控制问题，其输出必须可靠、可审计且安全可部署。

英文摘要

Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.


2605.24003 2026-06-17 cs.CV cs.AI stat.AP 版本更新

GMN4AD: Graph Matching Network for Alzheimer's Disease Diagnosis with Test-Time Domain Adaptation in Multi-Center Structural MRI

Chen Zhao, Huan Huang, Yixin Xie, Jiajing Huang, Weihua Zhou

Affiliations: Department of Computer Science, Kennesaw State University; Department of Information Technology, Kennesaw State University; School of Data Science and Analytics, Kennesaw State University; Department of Applied Computing, Michigan Technological University

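AI summary: Proposes GMN4AD, which uses graph matching networks to model relationships between heterogeneous brain graphs and adds a test-time domain adaptation strategy, outperforming existing methods on three public datasets for robust AD diagnosis.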


AI中文摘要

阿尔茨海默病（AD）是一种进行性神经退行性疾病，影响数百万老年人，预计未来几年患病率将显著上升。早期诊断，特别是在轻度认知障碍（MCI）阶段，对于及时干预至关重要。结构磁共振成像（sMRI）已成为检测AD相关脑变化的关键模态，但传统的基于图的方法通常难以处理模态和站点间异质性，限制了诊断性能。在本文中，我们提出了用于阿尔茨海默病诊断的图匹配网络（GMN4AD），旨在建模来自神经影像数据的异质脑图之间的交互。与将每个脑图独立处理的传统方法不同，GMN4AD利用图匹配来捕获跨图关系，提高诊断精度。此外，我们引入了一种测试时域适应策略，结合对比学习来减轻推理过程中的域偏移。在三个公共AD数据集上的大量实验表明，GMN4AD相比最先进方法实现了优越的性能，为AD诊断提供了鲁棒且可泛化的解决方案。

英文摘要

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer's Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that combines contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.

2606.14081 2026-06-17 cs.CV cs.AI cs.LG eess.IV updated version

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Huong Binh Vu

Affiliations: Harvard University

AI summary: Targeting extreme class imbalance in landslide detection, proposes a hybrid approach that injects the geospatial foundation model Clay v1.5 as auxiliary context at the U-Net bottleneck; it reaches 64.5% F1 on the Landslide4Sense benchmark, outperforming the Clay-only and U-Net baselines.

Abstract

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.

2606.15575 2026-06-17 cs.AI cs.HC new submission

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

Anne S. R. Marx, Ricardo M. Avelino, Torbjørn Netland, Mennatallah El-Assady

Affiliations: ETH Zurich; Department of Computer Science & ETH AI Center, ETH Zurich; Department of Computer Science & Architecture, ETH Zurich; Department of Management, Technology, and Economics, ETH Zurich; Department of Computer Science, ETH Zurich

AI summary: Proposes a framework that recommends human-AI agency allocations and control mechanisms based on task attributes and knowledge availability, and applies it to example manufacturing tasks.

Comments Proceedings of AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems, April 14, 2026, Barcelona, Spain. ACM, New York, NY, USA, 8 pages

DOI: 10.3929/ethz-c-000799418

Abstract

Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

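2606.18005 2026-06-17 cs.AI econ.GN q-fin.EC new submission

LLM Consumer Behavior Theory: Foundations of a Novel Research Field

Manon Reusens, Sofie Goethals, David Martens

Affiliations: Department of Engineering Management, University of Antwerp

AI summary: Proposes LLM Consumer Behavior Theory, which studies how LLM agents behave when making consumption decisions on behalf of humans in markets; it integrates economics and natural language processing and examines preference expression, market aggregation, and where rationality assumptions break down.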

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consumer theory, which has traditionally modeled humans as the primary decision-makers. In this paper, we introduce LLM Consumer Behavior Theory, a new field of study concerned with analyzing consumer behavior in agentic markets. Drawing on classical and behavioral economics alongside recent advances in Natural Language Processing, we formalize how human preferences are reflected and acted upon by LLM-based agents, and how agent-level decisions aggregate into market demand. We unify previously fragmented literature on LLM decision-making, human behavior simulation, and preference elicitation under a common economic lens, highlighting where assumptions, such as rationality and heterogeneity, may fail in agentic markets. Rather than providing empirical validation, this paper outlines the scope of LLM consumer behavior and identifies open research questions related to alignment, preference representation, and market dynamics.

2606.17762 2026-06-17 math.OC cs.AI cross-list

Symplectic Transversality and Endpoint Green Estimates for Finite-Horizon Pontryagin Systems

Pyuyi Chufeng Huang, Zikang Song, Xingshu Chen

Affiliations: School of Cyber Science and Engineering, Sichuan University, Chengdu, Sichuan, China; School of Mathematics, Sichuan University, Chengdu, Sichuan, China

AI summary: For finite-horizon discrete-time Pontryagin boundary value systems, verifies the linearized endpoint inverse from scaled stable-unstable boundary transversality, proves an endpoint-corrected Green estimate, and combines it with weighted contractions to obtain horizon-independent existence, uniqueness, Lipschitz dependence, and first-order expansions.

Comments 20 pages

Abstract

We study horizon-uniform local branches of finite-horizon discrete-time Pontryagin boundary value systems after smooth control elimination. The central input is a two-point endpoint inverse for the linearization. We verify this inverse from scaled stable--unstable boundary transversality, prove the associated endpoint-corrected Green estimate, and combine it with weighted contractions to obtain existence, uniqueness, Lipschitz dependence, and first-order expansions with constants independent of the horizon. The framework covers smooth nonlinear endpoint maps, including the original Pontryagin rows that fix the initial state and couple the terminal costate to the terminal state. Symplectic and Riccati criteria verify the inverse hypothesis at the level of the matrix data; in particular, every stabilizable linear-quadratic system with invertible dynamics and definite weights is covered, including noncommuting coupled data. A numerical section illustrates the certificates and the horizon-uniform first-order expansion.

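2606.13196 2026-06-17 cs.AI cs.CY updated version

Under What Conditions Can a Machine Be Called Genuinely Creative?

Yong Zeng

Affiliations: Concordia University

AI summary: Building on the theory of Designics, proposes ten requirements for genuine machine creativity, argues for their computational tractability through examples, and notes that current generative AI systems do not by themselves establish genuine creativity.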

Abstract

Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can be called genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

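2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO updated version

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

Eric Xing, Mingkai Deng, Jinyu Hou

AI summary: Starting from the psychological notion of "hypothetical thinking", argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and designs a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, mixed continuous/discrete representations.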

Abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

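2602.02881 2026-06-17 cs.SE cs.AI updated version

Learning-Infused Formal Reasoning: From Contract Synthesis to Artifact Reuse and Formal Semantics

Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan

AI summary: Presents a long-term research vision for combining formal methods with artificial intelligence, building a knowledge-driven verification ecosystem through automated contract synthesis, semantic artifact reuse, and refinement-based theory to accelerate future assurance.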

Comments LNCS Proceedings Submitted Version. 17 pages. Accepted and presented at VERIFAI-2026: The Interplay between Artificial Intelligence and Software Verification LASER center, Villebrumier, France, March 8-11, 2026

Abstract

This paper articulates a long-term research vision for formal methods at the intersection with artificial intelligence, outlining multiple conceptual and technical dimensions and reporting on our ongoing work toward realising this vision. It advances a forward-looking perspective on the next generation of formal methods based on the integration of automated contract synthesis, semantic artifact reuse, and refinement-based theory. We argue that future verification systems must shift from building individual correctness proofs toward a cumulative, knowledge-driven paradigm in which specifications, contracts, and proofs are continuously synthesised and transferred across systems. To support this shift, we outline a hybrid framework combining large language models with graph-based representations to enable scalable semantic matching and principled reuse of verification artifacts. Learning-based components provide semantic guidance across heterogeneous notations and abstraction levels, while symbolic matching ensures formal soundness. Grounded in compositional reasoning, this vision points toward verification ecosystems that evolve systematically, leveraging past verification efforts to accelerate future assurance.

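2502.17773 2026-06-17 stat.ME cs.AI cs.LG updated version

How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Chengpiao Huang, Yuhang Wu, Kaizheng Wang

Affiliations: Department of IEOR, Columbia University; Decision, Risk, and Operations Division, Columbia Business School; Department of IEOR and Data Science Institute, Columbia University

AI summary: From an uncertainty quantification perspective, proposes a framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses by quantifying the uncertainty caused by human-LLM misalignment. The key design choice is the number of simulated responses: too many yield overly narrow sets with poor coverage, while too few yield overly wide, uninformative sets. A data-driven method adaptively selects the simulation sample size to achieve nominal average-case coverage regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size also reflects the effective human population size the LLM can represent, giving a quantitative measure of its simulation fidelity. Experiments show heterogeneous simulation fidelity across LLMs and domains.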

Comments 63 pages, 13 figures

Abstract

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.

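2601.06116 2026-06-17 cs.AI cs.CL cs.CY updated version

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

Ian Rios-Sialer

Affiliations: Independent Researcher

AI summary: Examines the homogenization problem in large language models, proposes a framework for encoding value systems to promote diversity, surfaces gender bias through an experiment, and introduces the concept of xeno-reproduction to mitigate homogenization.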

Abstract

Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized but impoverishes everyone. We argue homogenization should be a central concern in AI safety. To meaningfully characterize homogenization in Large Language Models (LLMs), we introduce a framework that allows stakeholders to encode their context and value system. We illustrate our approach with an experiment that surfaces gender bias in an LLM (Claude 3.5 Haiku) on an open-ended story prompt. Building from queer theory, we formalize homogenization in terms of normativity. Borrowing language from feminist theory, we introduce the concept of xeno-reproduction as a class of tasks for mitigating homogenization by promoting diversity. Our work opens a collaborative line of research that seeks to understand and advance diversity in AI.

2501.12709 2026-06-17 quant-ph cs.AI cs.CR cs.DC updated version

Experimentally validated quantum-secure federated learning over a multi-user quantum network

Zhi-Ping Liu, Xiao-Yu Cao, Hao-Wen Liu, Xiao-Ran Sun, Yu Bao, Jian-Yu Shen, Yu-Shuo Lu, Hua-Lei Yin, Zeng-Bing Chen

Affiliations: National Laboratory of Solid State Microstructures and School of Physics, Collaborative Innovation Center of Advanced Microstructures, Nanjing University, Nanjing 210093, China; School of Physics and Key Laboratory of Quantum State Construction and Manipulation (Ministry of Education), Renmin University of China, Beijing 100872, China

AI summary: Proposes the QuNetQFL protocol, which masks local model updates with distributed quantum secret keys to achieve information-theoretically secure aggregation. It is experimentally validated on a four-client quantum network, improves classification accuracy, and shows scalability on language tasks and in large-scale simulations.

Comments 25 pages, 7 figures, 7 tables, Accepted by Research

DOI: 10.34133/research.1299
Journal ref: Research 9, 1299 (2026)

Abstract

Federated learning enables decentralized, privacy-preserving training but remains vulnerable to privacy leakage in the quantum era. Quantum federated learning (QFL) offers a promising path towards enhanced security and efficiency. However, a practical and experimentally validated QFL protocol utilizing near-term quantum techniques to address data privacy has been lacking. Here we present QuNetQFL, a QFL protocol implemented on quantum networks, in which local model updates are masked with distributed quantum secret keys, offering information-theoretic security during aggregation. We experimentally validate the protocol on a four-client quantum network and benchmark its performance using the generated keys on quantum and real-world datasets. Adding a single quantum client significantly improves global accuracy for classifying multipartite entangled and non-stabilizer quantum datasets. For language tasks, we apply QuNetQFL to sentiment analysis by federated fine-tuning of a hybrid classical-quantum language model, achieving comparable and robust performance in simulation and on real quantum hardware. Large-scale simulations further demonstrate scalability to 200 clients for handwritten-digit recognition, with rapid convergence and a 75% reduction in communication cost via model compression. Our work establishes a practical and scalable route to quantum-secure federated learning for the emerging quantum internet.

2605.12220 2026-06-17 cs.CV cs.AI cs.LG cs.RO updated version

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

Mohammad Khoshkdahan, Alexey Vinel

Affiliations: Karlsruhe Institute of Technology

AI summary: Proposes TriBand-BEV, which achieves real-time LiDAR-only 3D pedestrian detection through a height-aware bird's-eye-view encoding and high-resolution feature fusion; a lightweight BEV tensor mapping lets a single network detect cars, pedestrians, and cyclists in one pass, improving detection accuracy and speed.

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

DOI: 10.65109/INST9866
Journal ref: Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Safe autonomous agents and mobile robots need fast real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a lightweight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re-binning and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On the KITTI dataset, TriBand-BEV attains a pedestrian BEV AP of 58.7/52.6/47.2% for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.

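2601.12912 2026-06-17 cs.AI updated version

Human Emotion Verification by Action Languages via Answer Set Programming

Andreas Brännström, Juan Carlos Nieves

Affiliations: Department of Computing Science, Umeå University

AI summary: Proposes the action language C-MT, built on answer set programming and transition systems, to represent how human mental states evolve in response to sequences of observable actions. By introducing causal rules, the language can model principles for valid transitions between mental states, enabling controlled reasoning about human mental dynamics.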

Comments Under consideration in Theory and Practice of Logic Programming (TPLP)

DOI: 10.1017/S1471068426100416
Journal ref: Theory and Practice of Logic Programming 25 (2025) 1047-1104

Abstract

In this paper, we introduce the action language C-MT (Mind Transition Language). It is built on top of answer set programming (ASP) and transition systems to represent how human mental states evolve in response to sequences of observable actions. Drawing on well-established psychological theories, such as the Appraisal Theory of Emotion, we formalize mental states, such as emotions, as multi-dimensional configurations. With the objective to address the need for controlled agent behaviors and to restrict unwanted mental side-effects of actions, we extend the language with a novel causal rule, forbids to cause, along with expressions specialized for mental state dynamics, which enables the modeling of principles for valid transitions between mental states. These principles of mental change are translated into transition constraints, and properties of invariance, which are rigorously evaluated using transition systems in terms of so-called trajectories. This enables controlled reasoning about the dynamic evolution of human mental states. Furthermore, the framework supports the comparison of different dynamics of change by analyzing trajectories that adhere to different psychological principles. We apply the action language to design models for emotion verification. Under consideration in Theory and Practice of Logic Programming (TPLP).

2603.19801 2026-06-17 eess.IV cs.AI cs.CV updated version

Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive

Robin Spanier, Thorsten Hoeser, John Truckenbrodt, Felix Bachofer, Claudia Kuenzer

Affiliations: German Remote Sensing Data Center, Earth Observation Center, EOC of the German Aerospace Center, DLR; Institute for Geography and Geology, Department of Remote Sensing, University of Würzburg

AI summary: Using Sentinel-1 data and deep learning, studies offshore platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf, revealing changes in platform numbers and structural shifts in the sector, and providing data to support marine infrastructure monitoring.

Comments 16 pages, 10 figures, 1 table

DOI: 10.1080/20964471.2026.2679328
Journal ref: Big Earth Data, 2026, 1-27

Abstract

The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations for three major production regions: the North Sea, the Gulf of Mexico, and the Persian Gulf, was created for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. 3,728 offshore platforms were identified in 2025, 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018-2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlighted the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.

2512.20985 2026-06-17 cs.AI cs.MA updated version

A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

Salman Jan, Hassan Ali Razzaqi, Ali Akarma, Mohammad Riyaz Belgaum

Affiliations: Faculty of Computer Studies, Arab Open University-Bahrain; Faculty of Computer and Information System, Islamic University of Madinah, Saudi Arabia

AI summary: Proposes an agentic AI architecture combined with blockchain to ensure trust and traceability in autonomous decision pipelines; the blockchain continuously monitors and audits actions, validates inputs, and records execution outcomes.

Comments This paper was presented at the IEEE International Conference on Computing and Applications (ICCA 2025), Bahrain

DOI: 10.1109/ICCA66035.2025.11430865
Journal ref: Proceedings of the 2025 IEEE International Conference on Computing and Applications (ICCA), Bahrain, 2025, pp. 1-7

Abstract

The application of agentic AI systems in autonomous decision-making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real-time reasoning, they also raise concerns about trust, oversight, and the integrity of the information and activities upon which they are founded. The paper proposes a single architecture model that combines a LangChain-based multi-agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception-conceptualization-action cycle to a blockchain governance layer that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric-based system, MCP-integrated action executors, and LangChain agents are introduced, and experiments on smart inventory management, traffic-signal control, and healthcare monitoring are conducted. The results suggest that blockchain-secured verification is effective in preventing unauthorized practices, offers traceability throughout the whole decision-making process, and maintains operational latency within reasonable ranges. The suggested framework provides a general system for implementing high-impact agentic AI applications that are autonomous yet responsible.

2603.14692 2026-06-17 cs.LO cs.AI 版本更新

Applications of Intuitionistic Temporal Logic to Temporal Answer Set Programming

Pedro Cabalar, Martín Diéguez, David Fernández-Duque, François Laferrière, Torsten Schaub, Igor Stéphan

Affiliations: University of Corunna, Spain; University of Angers, France; University of Barcelona, Spain; University of Potsdam, Germany

Comments Under consideration in Theory and Practice of Logic Programming (TPLP)

2508.04492 2026-06-17 cs.CV cs.AI Version update

Learning Robust Intervention Representations with Delta Embeddings

Panagiotis Alimisis, Christos Diou

Affiliations: Department of Informatics and Telematics

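AI summary: This paper improves model robustness by focusing on representations of actionable counterfactuals in the latent space. It proposes Causal Delta Embeddings, which learn causal representations without additional supervision, and experiments show strong results on both synthetic and real-world benchmarks.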

Comments ICLR 2026, Poster

Journal ref: International Conference on Learning Representations (ICLR), 2026

Abstract

Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

2602.13318 2026-06-17 cs.AI cs.CV cs.LG Version update

DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan

Affiliations: Huawei Technologies Canada; University of British Columbia

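AI summary: This paper introduces DECKBench, a framework for evaluating multi-agent generation and editing of academic slides. Using a curated dataset and simulated editing instructions, it systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following.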

DOI: 10.1145/3770855.3817525

Abstract

Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper-to-slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench.

2602.00473 2026-06-17 quant-ph cs.AI cs.LG Version update

Quantum Phase Recognition via Quantum Attention Mechanism

Jin-Long Chen, Xin Li, Zhang-Qi Yin

Affiliations: Center for Quantum Technology Research and Key Laboratory of Advanced Optoelectronic Quantum Architecture and Measurements (MOE), School of Physics, Beijing Institute of Technology, Beijing 100081, China

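AI summary: This paper proposes a hybrid quantum-classical attention model that uses swap tests and a parameterized quantum circuit to extract correlations in quantum states and classify ground states. On the cluster-Ising model with 9 and 15 qubits, it shows high accuracy and robustness.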

Comments 10 pages, 7 figures

DOI: 10.1103/rcjd-bgdb
Journal ref: Phys. Rev. A 113, 062403 (2026)

Abstract

Quantum phase transitions in many-body systems are fundamentally characterized by complex correlation structures, which pose computational challenges for conventional methods in large systems. To address this, we propose a hybrid quantum-classical attention model. This model uses an attention mechanism, realized through swap tests and a parameterized quantum circuit, to extract correlations within quantum states and perform ground-state classification. Benchmarked on the cluster-Ising model with system sizes of 9 and 15 qubits, the model achieves high classification accuracy with fewer than 100 training samples and demonstrates robustness against variations in the training set. Further analysis reveals that the model successfully captures phase-sensitive features and characteristic physical length scales, offering a scalable and data-efficient approach for quantum phase recognition in complex many-body systems.

2509.11154 2026-06-17 cs.LG cs.AI Version update

Feature Space Topology Control via Hopkins Loss

Einari Vaaras, Manu Airaksinen

Affiliations: Signal Processing Research Centre, Tampere University; BABA Center, Department of Physiology, University of Helsinki

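AI summary: This paper proposes Hopkins loss for controlling feature space topology and validates it on speech, text, and image data, both in classification and in dimensionality reduction with nonlinear bottleneck autoencoders.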

Comments Accepted for publication in Proc. IEEE ICTAI 2025, Athens, Greece

DOI: 10.1109/ICTAI66417.2025.00064

Abstract

Feature space topology refers to the organization of samples within the feature space. Modifying this topology can be beneficial in machine learning applications, including dimensionality reduction, generative modeling, transfer learning, and robustness to adversarial attacks. This paper introduces a novel loss function, Hopkins loss, which leverages the Hopkins statistic to enforce a desired feature space topology, which is in contrast to existing topology-related methods that aim to preserve input feature topology. We evaluate the effectiveness of Hopkins loss on speech, text, and image data in two scenarios: classification and dimensionality reduction using nonlinear bottleneck autoencoders. Our experiments show that integrating Hopkins loss into classification or dimensionality reduction has only a small impact on classification performance while providing the benefit of modifying feature topology.
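
For readers unfamiliar with the Hopkins statistic that the loss builds on, the following is a minimal NumPy sketch of one common formulation. It is not the paper's Hopkins loss, which would need a differentiable variant; the function name `hopkins_statistic`, the sample size `m`, and the choice of raising distances to the power of the dimension are assumptions made for illustration.

```python
import numpy as np

def hopkins_statistic(X, m=None, seed=0):
    """Hopkins statistic of X (n samples x d features); hypothetical helper.

    Values near 0.5 suggest roughly uniform data, values near 1 suggest
    clustered data, and values near 0 suggest regularly spaced data.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)  # assumed default: probe 10% of the data

    # Probe set 1: m real samples drawn without replacement.
    idx = rng.choice(n, size=m, replace=False)
    # Probe set 2: m points drawn uniformly from the bounding box of X.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    # u: distance from each uniform point to its nearest real sample.
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # w: distance from each sampled real point to its nearest *other* sample.
    dw = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dw[np.arange(m), idx] = np.inf  # exclude the point itself
    w = dw.min(axis=1)

    return float((u ** d).sum() / ((u ** d).sum() + (w ** d).sum()))
```

A training loss could, for example, penalize the gap between a differentiable estimate of this quantity and a target value, but how the paper does this is not stated in the abstract.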

2601.12641 2026-06-17 cs.AI Version update

STEP-LLM: Generating CAD STEP Models from Natural Language with Large Language Models

Xiangyu Shi, Junyang Ding, Xu Zhao, Sinong Zhan, Payal Mohapatra, Daniel Quispe, Kojo Welbeck, Jian Cao, Wei Chen, Ping Guo, Qi Zhu

Affiliations: Northwestern University

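AI summary: This paper proposes STEP-LLM, which uses large language models to turn natural language into CAD STEP models. Graph-structure-aware preprocessing and reinforcement learning improve geometric accuracy, demonstrating the feasibility of LLM-driven STEP model generation.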

Comments Accepted to the Design, Automation & Test in Europe Conference (DATE) 2026

Abstract

Computer-aided design (CAD) is vital to modern manufacturing, yet model creation remains labor-intensive and expertise-heavy. To enable non-experts to translate intuitive design intent into manufacturable artifacts, recent large language model-based text-to-CAD efforts focus on command sequences or script-based formats like CadQuery. However, these formats are kernel-dependent and lack universality for manufacturing. In contrast, the Standard for the Exchange of Product Data (STEP, ISO 10303) file is a widely adopted, neutral boundary representation (B-rep) format directly compatible with manufacturing, but its graph-structured, cross-referenced nature poses unique challenges for auto-regressive LLMs. To address this, we curate a dataset of ~40K STEP-caption pairs and introduce novel preprocessing tailored for the graph-structured format of STEP, including a depth-first search-based reserialization that linearizes cross-references while preserving locality, and chain-of-thought (CoT)-style structural annotations that guide global coherence. We integrate retrieval-augmented generation to ground predictions in relevant examples for supervised fine-tuning, and refine generation quality through reinforcement learning with a specific Chamfer Distance-based geometric reward. Experiments demonstrate consistent gains of our STEP-LLM in geometric fidelity over the Text2CAD baseline, with improvements arising from multiple stages of our framework: the RAG module substantially enhances completeness and renderability, the DFS-based reserialization strengthens overall accuracy, and the RL further reduces geometric discrepancy. Both metrics and visual comparisons confirm that STEP-LLM generates shapes with higher fidelity than Text2CAD. These results show the feasibility of LLM-driven STEP model generation from natural language and its potential to democratize CAD design for manufacturing.
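
The abstract does not specify its Chamfer Distance-based reward, so the following is only a sketch of the generic symmetric Chamfer distance between two point clouds, using squared Euclidean distances. The `geometric_reward` mapping, its exponential form, and the `scale` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (n x d) and B (m x d).

    Uses squared Euclidean distances and averages over each set; other
    conventions (unsquared distances, sums instead of means) also exist.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise squared distances, shape (n, m).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    # Mean nearest-neighbour distance A -> B plus B -> A.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def geometric_reward(pred_points, ref_points, scale=1.0):
    """Hypothetical reward in (0, 1]: equals 1 when the point sets coincide."""
    return float(np.exp(-scale * chamfer_distance(pred_points, ref_points)))
```

In a text-to-CAD setting, the two point sets would typically be sampled from the surfaces of the generated and reference shapes, so a lower distance (and a higher reward) indicates closer geometry.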

2501.16370 2026-06-17 cs.LG cs.AI cs.NA cs.NE math.NA Version update

Advanced Physics-Informed Neural Network with Residuals for Solving Complex Integral Equations

Mahdi Movahedian Moghaddam, Kourosh Parand, Saeed Reza Kheradpisheh

Affiliations: Department of Computer and Data Sciences, Shahid Beheshti University; Department of Cognitive Modeling, Shahid Beheshti University

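AI summary: This paper proposes the Residual Integral Solver Network (RISN), which combines high-accuracy numerical methods with residual connections to improve accuracy and stability when solving integral and integro-differential equations. Experiments show that it outperforms classical PINNs and their variants across many equation types.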

DOI: 10.22128/ansne.2026.3261.1200
Journal ref: Anal. Numer. Solut. Nonlinear Equ. 11 (2026), no. 1, 153-173

Abstract

In this paper, we present the Residual Integral Solver Network (RISN), a novel neural network architecture designed to solve a wide range of integral and integro-differential equations, including one-dimensional, multi-dimensional, ordinary and partial integro-differential, systems, fractional types, and Helmholtz-type integral equations involving oscillatory kernels. RISN integrates residual connections with high-accuracy numerical methods such as Gaussian quadrature and fractional derivative operational matrices, enabling it to achieve higher accuracy and stability than traditional Physics-Informed Neural Networks (PINN). The residual connections help mitigate vanishing gradient issues, allowing RISN to handle deeper networks and more complex kernels, particularly in multi-dimensional problems. Through extensive experiments, we demonstrate that RISN consistently outperforms not only classical PINNs but also advanced variants such as Auxiliary PINN (A-PINN) and Self-Adaptive PINN (SA-PINN), achieving significantly lower Mean Absolute Errors (MAE) across various types of equations. These results highlight RISN's robustness and efficiency in solving challenging integral and integro-differential problems, making it a valuable tool for real-world applications where traditional methods often struggle.

2503.08679 2026-06-17 cs.AI cs.CL cs.LG Version update

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Affiliations: Poseidon Research

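AI summary: The study finds that on naturally worded prompts, models sometimes produce superficially coherent but mutually contradictory chains of thought, revealing implicit post-hoc rationalization that even frontier models do not fully avoid.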

Comments Published at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.
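
The reversed-question probe described above can be illustrated with a small consistency check. This is a simplified sketch rather than the paper's evaluation pipeline: it assumes one final answer per question, assumes X and Y never tie, and uses a hypothetical `contradictory_pairs` helper and data layout.

```python
def contradictory_pairs(answers):
    """Flag question pairs answered identically in both directions.

    `answers` maps (x, y) to "YES" or "NO" for the question
    "Is x bigger than y?". When x and y differ in size, exactly one of
    (x, y) and (y, x) should be "YES", so equal answers contradict.
    """
    flagged = []
    for (x, y), answer in answers.items():
        reverse = answers.get((y, x))
        # (x, y) < (y, x) keeps each unordered pair from being reported twice.
        if reverse is not None and (x, y) < (y, x) and answer == reverse:
            flagged.append((x, y, answer))
    return flagged


# Hypothetical model answers for two reversed-question pairs.
example = {
    ("A", "B"): "YES",
    ("B", "A"): "YES",  # contradicts the line above
    ("C", "D"): "YES",
    ("D", "C"): "NO",   # consistent
}
print(contradictory_pairs(example))
```

The share of flagged pairs gives a rough contradiction rate. The paper's reported percentages rest on its own criteria for systematic bias, which this sketch does not reproduce.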

1. Agents, Planning, and Decision-Making (22 papers)

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

StepGuard: Guarding Web Navigation via Single-Step Calibration

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

CMIP-Forge: An Agentic System that Retrieves, Computes, and Self-Reviews Climate Science

Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators

Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Planning with the Views

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Mind-Studio: Executable World Models with Lookahead Evaluation for Partially Observable Games

Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

SP-GCRL: Influence Maximization on Incomplete Social Graphs

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

2. Knowledge Representation, Reasoning, and Symbolic AI (8 papers)

A homotopy-type-theoretic generalization of neurosymbolic inference

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

Visored: A Controlled-Natural-Language Prover for LLM-Generated Mathematics

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

A Multi-Level Architecture for Reusable Materials Ontologies -- The OntoCrafter Ceramics Ontology (OCO) as Reference Implementation

3. Multi-Agent Systems and Games (10 papers)

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

WallZero: Mastering the Game of WallGo with Strategic Analysis

The Price of Anarchy in Disaggregated Inference

Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

4. Search, Optimization, and Constraint Solving (2 papers)

ZIVARI-TLBO: A Zero-Cost Inter-Group Evaluated-Elite Relay Mechanism for Teaching-Learning-Based Optimization

Non-negative Elastic Net Decoding for Information Retrieval

5. Machine Learning and Representation Learning (64 papers)

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

Small Initialization Matters for Large Language Models

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Knowledge Reutilization in Meta-Reinforcement Learning

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

Online LLM Selection via Constrained Bandits with Time-Varying Demand

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

FoundCause: Causal Discovery with Latent Confounders from Observational Data

Reversal Q-Learning

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins

Handling Feature Heterogeneity with Learnable Graph Patches

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

Conservation Laws for Modern Neural Architectures

Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

Dimensionality Controls When Modularity Helps in Continual Learning

KANLib -- An Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation