arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.04648 2026-06-04 cs.AI

BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

Qi Wang, Peijie Wang, Fei Yin, Cheng-Lin Liu

Affiliations: MAIS, Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: Proposes the BiNSGPS framework, which uses bidirectional neuro-symbolic interaction between a multimodal LLM adviser and a symbolic solver to dynamically correct inconsistent formal representations or propose auxiliary hypotheses, addressing early-stage errors and symbolic conflicts in geometry problem solving.

Abstract

Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbolic methods, which exhibit limited adaptability, and neural methods, which are prone to hallucinations. Recent neuro-symbolic hybrids predominantly rely on a unidirectional pipeline where neural outputs are fed into solvers without feedback, making the system brittle to early-stage errors. To break this unidirectional bottleneck, we propose BiNSGPS, a framework that establishes Bidirectional Neuro-Symbolic Interaction (BiNS) between an MLLM Adviser and a Symbolic Solver. The MLLM Adviser actively incorporates feedback from the symbolic solver to dynamically rectify inconsistent formal representations or propose auxiliary hypotheses, resolving symbolic conflicts and facilitating complex deductions.

2606.04647 2026-06-04 cs.LG

ALINC: Active Learning for Inductive Node Classification via Graph Sampling

Pascal Plettenberg, Denis Huseljic, André Alcalde, Bernhard Sick, Josephine M. Thomas

Affiliations: Intelligent Embedded Systems, University of Kassel; CELUS GmbH; GAIN Group, Institute of Data Science, University of Greifswald

AI summary: Proposes the ALINC framework, which addresses active learning for inductive node classification through graph-level sampling strategies, and evaluates a range of strategies and aggregation methods.

Comments: Accepted at ECML PKDD 2026

Abstract

Active learning (AL) for node classification typically focuses on selecting the most informative nodes for annotation within one or a few large graphs (e.g., in social network analysis). However, in other domains, such as molecular chemistry or electronic design automation, datasets consist of thousands of independent graphs. In many of these inductive settings, annotating an individual node requires a full-graph analysis, which effectively yields the remaining node labels on-the-fly. Therefore, these scenarios require AL strategies that select entire graphs instead of single nodes, a problem which has not been tackled in the literature so far. Thus, we introduce ALINC, an AL framework for inductive node classification via graph sampling. It bridges the existing methodological gap by elevating node-level utility measures to graph-level selection criteria through various aggregation mechanisms. In an extensive benchmark including ten strategies, three aggregation methods, and four datasets, we identify CoreSet, TypiClust, and BADGE as the top-performing graph sampling strategies. Our detailed analysis further reveals that the choice of the aggregation method is pivotal, as it substantially affects model performance and annotation costs. Finally, we demonstrate the effectiveness of ALINC in two use case studies: site-of-metabolism prediction in molecules and design automation of printed circuit board schematics.
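The idea of lifting node-level utility to a graph-level selection criterion can be illustrated with a minimal sketch. It assumes predictive entropy as the node-level utility and `mean`, `max`, and `sum` as aggregation choices; the abstract does not name the three aggregation methods it benchmarks, and its best-performing strategies (CoreSet, TypiClust, BADGE) are not reproduced here. All function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one node's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical aggregation choices (assumption, not taken from the paper).
AGGREGATORS = {
    "mean": lambda scores: sum(scores) / len(scores),
    "max": max,
    "sum": sum,
}


def select_graphs(
    node_probs: Dict[str, List[Sequence[float]]],
    budget: int,
    aggregation: str = "mean",
) -> List[str]:
    """Rank whole graphs by aggregated node utility and return the top `budget` ids.

    Assumes every graph has at least one node.
    """
    aggregate = AGGREGATORS[aggregation]
    graph_scores = {
        gid: aggregate([entropy(p) for p in nodes])
        for gid, nodes in node_probs.items()
    }
    ranked = sorted(graph_scores, key=graph_scores.get, reverse=True)
    return ranked[:budget]


# Toy unlabeled pool: graph id -> per-node class probabilities.
pool = {
    "g1": [[0.5, 0.5], [0.5, 0.5]],   # few nodes, all maximally uncertain
    "g2": [[0.99, 0.01]],             # one confident node
    "g3": [[0.7, 0.3]] * 4,           # more nodes, moderately uncertain
}
print(select_graphs(pool, 1, "mean"))
print(select_graphs(pool, 1, "sum"))
```

On this toy pool, mean aggregation picks `g1` while sum aggregation picks `g3`, because summing rewards graphs with many nodes. This is one simple way the choice of aggregation can change which graphs get annotated.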

2606.04646 2026-06-04 cs.CL cs.AI cs.IR

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang

Affiliations: Asian Institute of Digital Finance, National University of Singapore

AI summary: Proposes the QO-Bench benchmark, which uses deterministic evaluation over typed event tuples to diagnose the execution bottlenecks of retrieval-augmented generation systems on query operators such as joins and intersection.

Comments: 14 pages

Abstract

Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
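The scoring scheme described in the abstract (gold answers computed deterministically from typed tuples, then exact-match recall) can be sketched as follows. The tuple schema `(company, event_type, date)`, the example events, and both function names are assumptions for illustration; the benchmark's actual schema and templates are not given in the abstract.

```python
from typing import Hashable, Iterable, List, Set, Tuple

# Hypothetical typed event tuple: (company, event_type, ISO date).
EventTuple = Tuple[str, str, str]


def companies_with_both(events: Iterable[EventTuple], type_a: str, type_b: str) -> Set[str]:
    """Intersection operator: companies that have events of both types."""
    events = list(events)
    with_a = {company for company, etype, _ in events if etype == type_a}
    with_b = {company for company, etype, _ in events if etype == type_b}
    return with_a & with_b


def exact_match_recall(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Fraction of gold items that appear verbatim among the predictions."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0  # convention chosen for this sketch: nothing to recall
    return len(gold_set & set(predicted)) / len(gold_set)


events: List[EventTuple] = [
    ("Acme", "merger", "2024-01-05"),
    ("Acme", "layoff", "2024-02-01"),
    ("Globex", "merger", "2024-03-10"),
    ("Initech", "layoff", "2024-03-11"),
    ("Initech", "merger", "2024-04-02"),
]

gold = companies_with_both(events, "merger", "layoff")
print(exact_match_recall(["Acme", "Globex"], gold))
```

Here a system that answers with one correct and one plausible-but-wrong company recalls half of the gold set. No LLM judge is needed because the gold answer is a set of typed values.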

2606.04645 2026-06-04 cs.CL cs.DB

CYGNET: Cypher Gate for Neural Execution Triage and Cost Containment

Nikodem Tomczak

Affiliations: Thulge Labs, Singapore

AI summary: Proposes the CYGNET gating mechanism, which uses pre-execution validation and error correction to intercept structurally broken Cypher queries and flag excessively costly execution plans while preserving generation accuracy.

Abstract

Language models acting as agents over knowledge graphs generate Cypher queries that fail structurally (crashing at the database) or semantically (executing but returning wrong results). We place a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain culminating in execution against a mirror graph at 5.6 ms median latency. Structurally broken queries are routed to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025) the pipeline maintains generation accuracy on every model tested, confirming it operates as a safe defensive layer. The corrector achieves 81% to 95% success across five models (mean 89%). On a template-generated corpus across nine schemas the gate catches 100% of parse errors, 100% of constraint violations, and 100% of schema-reference errors in path queries with labelled endpoints, at zero false positives across 1135 queries. Property sibling-swaps where the substituted name is valid on the target label score 0%, marking the formal boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.

2606.04634 2026-06-04 cs.LG

Explainably Safe Reinforcement Learning

Sabine Rieder, Stefan Pranger, Debraj Chakraborty, Jan Křetínský, Bettina Könighofer

Affiliations: Masaryk University; Graz University of Technology; Technical University of Munich

AI summary: Proposes an explainable safe reinforcement learning method based on hierarchical decision trees, which uses a world model to analyze state risk and construct the shield, producing understandable explanations while retaining safety guarantees.

Abstract

Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque. Shielding is a prominent model-based technique for enforcing safety in reinforcement learning. However, because shields are automatically synthesized using rigorous formal methods, their decisions are often similarly difficult for humans to interpret. Recently, decision trees have become a customary representation for controllers and policies. However, since shields are inherently non-deterministic, their decision tree representations become too large to be explainable in practice. To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human-interpretable explanations of the shield's decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top-down, case-based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this analysis, we construct both the shield and a high-level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), explaining why a situation may be safety-critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe. Our method facilitates explainability of the safety aspect in safe-by-shielding reinforcement learning, requires no additional information beyond what is already used for shielding, incurs minimal overhead, and integrates readily into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.

2606.04632 2026-06-04 cs.LG cs.CL

VentAgent: When LLMs Learn to Breathe -- Multi-Objective Arbitration for ARDS Ventilation

Teqi Hao, Yuxuan Fu, Xiaoyu Tan, Shaojie Shi, Bohao Lv, Yinghui Xu, Xihe Qiu

Affiliations: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; Tencent Youtu Lab; Artificial Intelligence Innovation and Incubation Institute, Fudan University

AI summary: Proposes VentAgent, a hierarchical framework that uses LLMs as transparent arbitrators and recasts mechanical ventilation control as a dynamic multi-objective arbitration process through three stages (perception, planning, orchestration); it outperforms RL and classical control baselines on a physiological simulator and provides interpretable reasoning chains.

Abstract

Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.

2606.04628 2026-06-04 cs.CL cs.MA

RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation

Nikodem Tomczak

Affiliations: Nikodem Tomczak

AI summary: Proposes RAMPART, a compile-time memory model and pure in-RAM block registry that assembles context through a programmable runtime operation and five primitives; experiments show that block position and grouping significantly affect task success, and shared-registry coordination is achieved at zero coordination-token cost.

Abstract

RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral's mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
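To make the five primitives concrete, the following is a minimal sketch of an in-memory block registry. It is not RAMPART's implementation: the class, method signatures, and the reading of `gate` as a per-block inclusion predicate are assumptions based only on the primitive names in the abstract.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    name: str
    content: str
    author: str = "agent"      # provenance tag (hypothetical field)
    evictable: bool = True     # non-evictable blocks survive evict()
    enabled: bool = True       # set by gate(); disabled blocks are not compiled


class Registry:
    """Ordered, named blocks that are compiled into a prompt on demand."""

    def __init__(self) -> None:
        self._blocks: List[Block] = []
        self._history: List[List[Block]] = []

    def _snapshot(self) -> None:
        self._history.append(copy.deepcopy(self._blocks))

    def write(self, name: str, content: str, author: str = "agent", evictable: bool = True) -> None:
        """Create a block, or overwrite the content of an existing one."""
        self._snapshot()
        for block in self._blocks:
            if block.name == name:
                block.content = content
                return
        self._blocks.append(Block(name, content, author, evictable))

    def promote(self, *names: str) -> None:
        """Move the named blocks to the front as one group, in the given order."""
        self._snapshot()
        group = [b for n in names for b in self._blocks if b.name == n]
        rest = [b for b in self._blocks if b.name not in names]
        self._blocks = group + rest

    def gate(self, keep: Callable[[Block], bool]) -> None:
        """Enable only the blocks for which `keep` returns True."""
        self._snapshot()
        for block in self._blocks:
            block.enabled = keep(block)

    def evict(self, name: str) -> bool:
        """Remove a block unless it is flagged non-evictable; report success."""
        self._snapshot()
        kept = [b for b in self._blocks if b.name != name or not b.evictable]
        removed = len(kept) < len(self._blocks)
        self._blocks = kept
        return removed

    def rollback(self) -> None:
        """Undo the most recent mutating call, if any."""
        if self._history:
            self._blocks = self._history.pop()

    def compile(self) -> str:
        """Assemble the prompt text from enabled blocks in registry order."""
        return "\n\n".join(f"[{b.name}]\n{b.content}" for b in self._blocks if b.enabled)


reg = Registry()
reg.write("system", "You are a careful assistant.", author="operator", evictable=False)
reg.write("notes", "scratch notes")
reg.write("schema", "tool schema text")
reg.promote("schema")
print(reg.compile())
```

In this sketch all operations touch only Python objects; prompt tokens are spent only when `compile()` output is sent to a model.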

2606.04623 2026-06-04 cs.LG

Learning symplectic model reduction based on an approximation theorem of symplectic embeddings

Liyi Feng, Yifa Tang, Yulin Xie, Ruili Zhang, Aiqing Zhu

Affiliations: School of Mathematics and Statistics, Beijing Jiaotong University; State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; Department of Mathematics, National University of Singapore

AI summary: To address the ease with which symplectic structure is destroyed when reducing high-dimensional Hamiltonian systems, proposes symplecticity-preserving autoencoders (SpAE), which parameterize the decoder as a symplectic embedding and the encoder as a symplectic projection, preserving symplectic structure while improving reconstruction and prediction accuracy.

Abstract

High-dimensional Hamiltonian systems play a central role in many scientific and engineering disciplines, with dynamics evolving on symplectic manifolds. Although deep learning provides powerful tools for constructing low-dimensional surrogates from data, the intrinsic symplectic structure is easily destroyed during model reduction. As a result, a standard autoencoder may produce latent coordinates that do not support a Hamiltonian flow, leading to unstable long-time prediction. In this paper, we first establish a universal approximation theorem for symplectic embeddings. Based on this theory, we propose symplecticity-preserving autoencoders (SpAE), in which the decoder is parameterized as a symplectic embedding and the encoder is constructed as the corresponding symplectic projection. This architecture is expressive enough to approximate nonlinear symplectic embeddings and the associated symplectic projections, preserves the symplectic structure exactly by construction, and can be trained by standard unconstrained optimization, thereby improving both reconstruction and prediction accuracy. Extensive experiments on high-dimensional lattice and particle systems demonstrate the effectiveness of the proposed method.
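For orientation, the condition usually meant by "symplectic embedding" in symplectic model reduction can be written out as below. This is the standard textbook formulation in canonical coordinates, stated here as an assumption; the paper's own definitions and notation may differ. The symbols $\varphi$, $A$, $n$, and $N$ are introduced for illustration only.

```latex
\begin{align*}
  % Canonical Poisson matrix on R^{2n}
  \mathbb{J}_{2n} &= \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix}, \\
  % A smooth map \varphi : R^{2n} -> R^{2N} (n <= N) satisfies, for all z,
  \bigl(\mathrm{D}\varphi(z)\bigr)^{\top} \mathbb{J}_{2N}\, \mathrm{D}\varphi(z) &= \mathbb{J}_{2n}, \\
  % Linear case \varphi(z) = A z: the symplectic inverse is a left inverse of A
  A^{+} := \mathbb{J}_{2n}^{\top} A^{\top} \mathbb{J}_{2N},
  \qquad
  A^{+} A &= \mathbb{J}_{2n}^{\top} \bigl(A^{\top} \mathbb{J}_{2N} A\bigr)
           = \mathbb{J}_{2n}^{\top} \mathbb{J}_{2n} = I_{2n}.
\end{align*}
```

In the linear case the decoder is $z \mapsto Az$ and the encoder is $x \mapsto A^{+}x$; the abstract's decoder and encoder can be read as nonlinear counterparts of this pair.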

2606.04620 2026-06-04 cs.LG cs.AI

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

Pasindu Wickramasinghe, Achyuta Muthuvelan, Rachmad Vidya Wicaksana Putra, Minghao Shao, Muhammad Shafique

Affiliations: eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi

AI summary: To ease LLM deployment, proposes the QuBLAST framework, which uses block-level mixed-precision quantization and an activation scaling strategy to reduce model size by 40%-45.2% while keeping the perplexity increase within 5%.

Comments: 10 pages, 9 figures, 5 tables

Abstract

LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs, thus making them difficult to deploy on embedded systems. Toward this, state-of-the-art methods typically employ uniform post-training quantization (PTQ) across attention blocks of the network, hence overlooking the potential of applying different quantization levels in the same network. They also employ complex operations to mitigate the negative impact of activation outliers, hence incurring high computational overheads. Moreover, they have not considered evaluation using emerging LLMs with non-conventional attention architectures (e.g., state-space models), which pose different challenges in applying quantization. To address these limitations, we propose QuBLAST, a novel PTQ methodology that employs a block-level compression approach with an activation scaling strategy for LLMs. The block-level compression approach enables mixed-precision quantization across blocks of the network, while the activation scaling strategy efficiently mitigates the negative impact of activation outliers. Specifically, QuBLAST first analyzes the sensitivity of different attention blocks in the pre-trained model through cross-entropy loss analysis. QuBLAST leverages this sensitivity analysis to determine the weight quantization level for each attention block in the model. Furthermore, QuBLAST employs an activation scaling map for each block to control the range of activation values and mitigate the negative impact of activation outliers, thereby enabling better quantization results. Experimental results show that QuBLAST reduces model sizes by 40%-45.2% across different model architectures (i.e., Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B), while maintaining the performance within a 5% perplexity increase for the WikiText-2 and WikiText-103 datasets.
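The step from per-block sensitivity to per-block bit-width can be sketched in a few lines. The allocation rule below (the most sensitive quarter of blocks keeps 8 bits, the rest get 4) is an assumption chosen for illustration. The abstract does not state QuBLAST's actual rule, bit-widths, or thresholds, and the sensitivity values are made up.

```python
from typing import Dict


def assign_bit_widths(
    sensitivity: Dict[str, float],
    high_bits: int = 8,
    low_bits: int = 4,
    high_fraction: float = 0.25,
) -> Dict[str, int]:
    """Give the most loss-sensitive blocks `high_bits`, all others `low_bits`.

    `sensitivity` maps a block name to the loss increase observed when that
    block alone is quantized (larger means more sensitive).
    """
    n_high = max(1, round(len(sensitivity) * high_fraction))
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    sensitive = set(ranked[:n_high])
    return {name: (high_bits if name in sensitive else low_bits) for name in sensitivity}


def average_bits(bit_widths: Dict[str, int]) -> float:
    """Mean bit-width, assuming all blocks hold the same number of weights."""
    return sum(bit_widths.values()) / len(bit_widths)


# Hypothetical cross-entropy loss increases per attention block.
sens = {"block0": 0.90, "block1": 0.10, "block2": 0.30, "block3": 0.05}
plan = assign_bit_widths(sens)
print(plan, average_bits(plan))
```

With these toy numbers only `block0` keeps 8 bits, giving an average of 5 bits per weight. A real allocation would also weigh each block's parameter count and a perplexity budget.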

2606.04619 2026-06-04 cs.AI cs.LO

A Normative Intermediate Representation for ASP-Based Compliance Reasoning

Yangfan Wu, Huanyu Yang, Jianmin Ji

Affiliations: University of Science and Technology of China

AI summary: Proposes MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning, with a staged operational semantics and an executable compilation; it is applied to Chinese ADAS regulations through an LLM-assisted pipeline, and extraction quality and the efficiency of modular and incremental ASP solving are evaluated.

Abstract

We propose MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning. Its core fragment has a staged operational semantics, while MONIR-ASP provides an executable compilation and extensions for external functions, temporal rules, and stable-model reasoning. We instantiate the framework on Chinese ADAS regulations and standards with an LLM-assisted pipeline. Experiments evaluate extraction quality and the efficiency of modular and incremental ASP solving.

2606.04618 2026-06-04 cs.RO

BPDA-GMM: Bayesian Probabilistic Data Association via Gaussian Mixture Models for Semantic SLAM

Thanh Nguyen Canh, Haolan Zhang, Xiem HoangVan, Antonio Sgorbissa, Nak Young Chong

Affiliations: School of Information Science, Japan Advanced Institute of Science and Technology; University of Engineering and Technology, Vietnam National University

AI summary: Proposes BPDA-GMM, an online Bayesian probabilistic data association framework that uses a Dirichlet-process prior and a Chinese Restaurant Process model to associate landmarks in a growing map for semantic SLAM, and applies α-divergence tempering to ambiguous associations, improving trajectory accuracy and the robustness of semantic mapping.

Abstract

Probabilistic data association (PDA) improves semantic SLAM in perceptually aliased scenes, but existing methods often assume a fixed landmark set, recompute association weights as the map grows, or rely on hand-tuned null-hypothesis weights. To address these limitations, we propose BPDA-GMM, an online Bayesian PDA framework for semantic SLAM with a growing object-level map. BPDA-GMM uses a Dirichlet-process prior to induce a Chinese Restaurant Process (CRP) association model, where accumulated evidence favors existing landmarks, and the concentration parameter assigns probability mass to new landmarks. For each semantic detection, plausible candidates are selected by a joint semantic-geometric gate, CRP-weighted association probabilities are computed, and object landmarks are updated as semantic Gaussians in closed form. The resulting landmark set forms a Gaussian mixture model, and its dominant component is passed to the back-end as a max-mixture semantic factor. When association weights are inconclusive, an ambiguity-triggered α-divergence tempering step improves discrimination. Finally, a decoupled back-end zeroes the pose Jacobian of semantic factors, allowing noisy detections to refine landmarks without directly perturbing the trajectory. Experiments in simulation and on a real indoor dataset demonstrate improved trajectory accuracy, semantic mapping quality, and robustness to perceptual aliasing and classifier errors over state-of-the-art baselines. Code and video are publicly available at https://github.com/thanhnguyencanh/BPDA-SLAM.
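A CRP-weighted association step of the kind the abstract describes can be sketched generically. An existing landmark is weighted by its observation count times the likelihood of the detection, and a new landmark by the concentration parameter times a base density. The isotropic Gaussian likelihood, the constant base density, and all names are assumptions. The semantic-geometric gate, the semantic class term, and the α-divergence tempering step are omitted, and this is not the code from the linked repository.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical landmark record: (mean position, isotropic std-dev, observation count).
Landmark = Tuple[Sequence[float], float, int]


def gaussian_pdf(x: Sequence[float], mean: Sequence[float], sigma: float) -> float:
    """Isotropic Gaussian density in len(x) dimensions."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm


def crp_association(
    z: Sequence[float],
    landmarks: List[Landmark],
    alpha: float,
    new_density: float,
) -> List[float]:
    """Posterior association probabilities; the last entry is 'new landmark'.

    Existing landmark k gets weight count_k * N(z; mean_k, sigma_k^2 I);
    a new landmark gets weight alpha * new_density.
    """
    weights = [count * gaussian_pdf(z, mean, sigma) for mean, sigma, count in landmarks]
    weights.append(alpha * new_density)
    total = sum(weights)
    return [w / total for w in weights]


landmarks = [((0.0, 0.0), 0.5, 10), ((5.0, 5.0), 0.5, 2)]
near = crp_association((0.1, 0.0), landmarks, alpha=1.0, new_density=0.01)
far = crp_association((20.0, 20.0), landmarks, alpha=1.0, new_density=0.01)
print(near, far)
```

A detection close to a well-observed landmark is assigned to it with probability near one, while a detection far from every landmark puts almost all mass on creating a new one. No hand-tuned null-hypothesis weight is involved; `alpha` plays that role.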

2606.04613 2026-06-04 cs.CV cs.LG

Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain

Alessandro Gambetti, Qiwei Han, Cláudia Soares, Hong Shen

Affiliations: NOVA School of Science and Technology; Nova School of Business and Economics; Carnegie Mellon University

AI summary: Proposes the asymmetric Spectral Alignment Score (SAS), which quantifies modality information imbalance through eigenvalue-weighted per-eigenmode correlations, and evaluates 15 VLMs on medical image-text datasets, finding that medical images retain richer structural information than clinical reports and that SAS correlates most strongly with retrieval performance.

Comments: 10 pages, 3 figures, 9 tables

Abstract

Vision-Language Models (VLMs) struggle when applied to medical image-text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross-modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue-weighted per-eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image-text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero-label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: https://github.com/iamalegambetti/medical-vlms-assessment.

2606.04612 2026-06-04 cs.CL

Hybrid Adversarial Defence for Natural Language Understanding Tasks

Manar Abouzaid, Yang Wang, Chenghua Lin, Stuart E. Middleton

Affiliations: School of Electronics and Computer Science, University of Southampton, UK; Department of Computer Science, University of Manchester, UK

AI summary: Proposes a hybrid defence framework combining entropy, uncertainty, and geometric features that improves both clean-task performance and adversarial robustness on several natural language understanding datasets.

Abstract

Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51% reduction in attack success rate compared to state-of-the-art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.

2606.04604 2026-06-04 cs.CV

COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, Liqiang Nie

Affiliations: School of Software, Shandong University; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); Department of Data Science, City University of Hong Kong; Department of Computer Science and Engineering, Southern University of Science and Technology

AI summary: To handle samples that are visually similar but differ in attributes in composed image retrieval, proposes COMBINER, a unified cross-modal representation based on attribute prototypes that improves retrieval accuracy through adaptive semantic disentanglement, unified prototype-based composition, and dual relations modeling.

Comments: Accepted by IEEE TIP 2026

Abstract

Composed Image Retrieval (CIR) is a challenging retrieval task that aims to locate specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) missing supervision signals. To tackle the above obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which is capable of disentangling attribute features based on multimodal primitive features. Secondly, we propose a Unified Prototype-based Composition module, which can construct cross-modal unified prototypes (CUP) and facilitate multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which can mine pairwise and neighbor relations based on attribute similarity. Compared to traditional CIR methods that model neighbor relations, COMBINER is the first study to address the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be available at https://github.com/Lee-zixu/COMBINER.

2606.04599 2026-06-04 cs.AI cs.CE

Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

先计划,后评判,更优运行:一种受DMAIC启发的工业异常检测智能体系统

Yongzi Yu, Ao Li, Le Wang, Ziyue Li, Fugee Tsung, Yuxuan Liang, Man Li

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) Shanghai University of Finance and Economics(上海财经大学) Technische Universität München(慕尼黑工业大学) Southwestern University of Finance and Economics(西南财经大学)

AI总结 提出受DMAIC启发的多智能体系统DMAIC-IAD,通过先制定标准化操作程序(SOP)再生成策略,并引入预训练的无执行评判模型来排序候选策略,无需昂贵运行时试验,在四种模态上平均检测性能提升37.76%。

详情
AI中文摘要

大型语言模型(LLM)智能体在自动化复杂数据分析工作流方面展现出潜力,但在高风险工业场景中的可靠部署仍具挑战。工业异常检测(IAD)对制造质量、安全和效率至关重要,然而现有基于LLM的IAD智能体主要关注执行,而策略制定方面利用不足。因此,它们难以以统一且经济高效的方式处理异构模态。受DMAIC质量管理框架启发,我们提出DMAIC-IAD(受DMAIC启发的智能体工业异常检测),一种“先计划,后评判”的多智能体系统,将LLM智能体与结构化工业问题解决相结合。DMAIC-IAD在策略生成前将异构参考提炼为标准化操作程序(SOP),并引入预训练的无执行评判模型,无需昂贵的运行时试验即可对候选策略进行排序。跨四种模态的大量实验表明,DMAIC-IAD在适用智能体基线上平均检测性能提升37.76%。

英文摘要

Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.

2606.04597 2026-06-04 cs.AI

Learning Admissible Heuristics via Cost Partitioning

通过成本划分学习可采纳启发式

Hugo Barral, Quentin Cappart, Marie-José Huguet, Sylvie Thiébaux

发表机构 * UCLouvain, Louvain-la-Neuve, Belgium(鲁汶大学,新鲁汶,比利时) Australian National University, Canberra, Australia(澳大利亚国立大学,堪培拉,澳大利亚)

AI总结 提出一个框架,利用成本划分与乘子预测的拉格朗日对偶等价性,通过图编码和自注意力网络学习可采纳成本划分,从而生成首个保证可采纳性的机器学习启发式。

详情
AI中文摘要

可采纳启发式对于最优规划至关重要,但由于存在高估风险,学习它们仍然具有挑战性。成本划分在保持可采纳性的同时结合多个抽象启发式,但在线计算最优划分代价高昂。我们提出了一个框架,通过利用成本划分与乘子预测之间的拉格朗日对偶等价性,学习推断可采纳成本划分。规划状态和模式被编码为带标签的图,并使用Weisfeiler-Leman算法的动作中心变体提取结构特征向量。一个具有轴向自注意力和softmax输出层的深度架构将这些特征映射到成本权重,这些权重通过构造满足划分约束,从而确保可采纳性。实验表明,与次优划分基线相比,节点扩展减少,同时保持严格的可采纳性。据我们所知,这是第一个保证可采纳性的机器学习启发式。

英文摘要

Admissible heuristics are essential for optimal planning, yet learning them remains challenging due to the risk of overestimation. Cost partitioning combines multiple abstraction heuristics while preserving admissibility, but computing optimal partitions online is expensive. We propose a framework that learns to infer admissible cost partitions by leveraging the Lagrangian dual equivalence between cost partitioning and multiplier prediction. Planning states and patterns are encoded as labelled graphs, and an action-centric variant of the Weisfeiler-Leman algorithm extracts structural feature vectors. A deep architecture with axial self-attention and a softmax output layer maps these features to cost weights that satisfy the partition constraints by construction, ensuring admissibility. Experiments demonstrate reduced node expansions compared to suboptimal partitioning baselines while maintaining strict admissibility. To our knowledge, this is the first machine-learned heuristic guaranteed to be admissible.
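The claim that a softmax output layer satisfies the partition constraints "by construction" can be illustrated with a minimal sketch. It assumes the simplest form of cost partitioning, in which each action's cost is split into non-negative shares across the component heuristics and the shares sum to the original cost. The names `partition_costs`, `logits`, and `action_costs` are hypothetical and not taken from the paper, and the random logits stand in for the output of a learned network.

```python
import numpy as np

def softmax(logits, axis=0):
    # Numerically stable softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def partition_costs(logits, action_costs):
    """Split each action's cost across heuristics (illustrative only).

    logits:       (n_heuristics, n_actions) scores, assumed to come from a model
    action_costs: (n_actions,) original non-negative action costs
    Returns an (n_heuristics, n_actions) array whose columns sum to the
    original action costs, so no action's cost is counted more than once.
    """
    weights = softmax(logits, axis=0)   # each column is non-negative and sums to 1
    return weights * action_costs       # broadcast over the heuristic axis

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))             # 3 heuristics, 4 actions (toy sizes)
costs = np.array([1.0, 2.0, 5.0, 1.0])
split = partition_costs(logits, costs)
```

Because every column of `split` sums to the corresponding entry of `costs` regardless of the logits, any prediction yields a valid partition; that invariant is what lets the sum of the component heuristics stay admissible.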

2606.04596 2026-06-04 cs.CL

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

多视频摘要中位置偏差的系统评估:基于多模态大语言模型

Huangchen Xu, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University(知识驱动人机智能工程研究中心,吉林大学) International Center of Future Science, Jilin University(未来科学国际中心,吉林大学)

AI总结 本研究系统评估了多模态大语言模型在多视频摘要任务中的位置偏差,通过构建基准和三种互补指标揭示了领域与模型依赖的偏差特性,并分析了提示级缓解方法。

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地用于视频理解,但它们在多视频输入下的可靠性仍知之甚少。我们研究了多视频摘要中的位置偏差,即每个视频摘要的质量可能随视频输入槽位的变化而变化,即使底层内容不变。我们从ActivityNet和新闻视频构建了一个基准,涵盖烹饪、家庭、休闲和新闻场景,包含两个和四个视频输入。我们评估了九个开源和专有MLLMs,并使用三种互补指标测量位置效应:覆盖率、方向性位置偏差(DPB)和中间边缘差距(MEG)。我们的结果表明,位置效应是领域和模型依赖的:即使中间位置表现不佳,有符号的方向性偏差也可能很小;增加视觉或生成预算并不能均匀地消除不平衡。我们进一步分析了提示级缓解方法。总之,结果表明多视频摘要仍然对输入协议和位置敏感,这促使开发更鲁棒的、顺序不变的多模态系统。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

2606.04593 2026-06-04 cs.CV

4D Reconstruction from Sparse Dynamic Cameras

来自稀疏动态相机的4D重建

Kazuki Ozeki, Shun Kenney, Yuto Shibata, Eisuke Takeuchi, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Yuki Mitsufuji, Yoshimitsu Aoki

发表机构 * Keio University(庆应大学) Sony AI(索尼人工智能) Sony Group Corporation(索尼集团)

AI总结 针对稀疏动态相机设置下的4D重建,提出一种通过集成跨相机特征匹配与帧内点跟踪来确保时空一致性的3D轨迹初始化方法,并引入噪声鲁棒的深度排序正则化损失和时空多样批次采样策略,在自建数据集LetCamsGo上验证了动态区域重建质量的提升。

Comments Accepted by 4DV Workshop at CVPR 2026

详情
AI中文摘要

尽管从单目动态相机进行动态3D(即4D)重建最近取得了进展,但其仍然受到深度模糊的根本限制。本文关注一种替代实用方案,即稀疏动态相机设置,其中少量独立移动的相机捕捉相同的对象。在保持低成本的同时,这种设置引入了多视图约束,并且对于现实世界的视频制作(如体育、音乐会和电视节目)仍然实用。尽管有潜力,但我们的实验表明,现有单目或密集固定相机方法的简单扩展是不够的,因为它们无法解决跨视图和时间的复杂时空不一致性。为填补这一空白,我们提出了一种简单而有效的3D轨迹初始化方法,通过集成跨相机特征匹配与帧内点跟踪来确保时空一致性。此外,我们引入了噪声鲁棒的深度排序正则化损失和时空多样批次采样策略,以增强优化稳定性和跨视图泛化。进一步地,为解决此任务缺乏标准化基准的问题,我们引入了LetCamsGo,这是一个新的真实世界视频数据集,包含4个不同环境中的5个序列,由三个独立移动的相机和一个固定相机记录。在LetCamsGo上的全面基准测试表明,与基线相比,我们提出的框架提高了动态区域的4D重建质量,为野外低成本4D重建范式铺平了道路。

英文摘要

Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on a practical alternative, namely a sparse dynamic camera setup, where a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense-fixed camera-based methods are insufficient since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrates that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.

2606.04591 2026-06-04 cs.CL cs.CV

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

多模态长对话中的细粒度片段检索

Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

发表机构 * Pattern Recognition Center, WeChat AI, Tencent Inc(模式识别中心、微信AI、腾讯公司) Aerospace Information Research Institute, Chinese Academy of Sciences(航天信息研究所、中国科学院)

AI总结 提出细粒度片段检索任务,通过强化学习训练的生成式检索模型F2RVLM和两阶段系统FFRS,实现多模态长对话中多语句、多图像片段的精准定位。

详情
AI中文摘要

随着多模态交流平台的广泛采用,文本和图像交织的长对话变得越来越普遍。用户通常需要检索与特定主题相关的连贯对话片段,而不是孤立的语句。我们提出了细粒度片段检索(FFR),用于在多模态长对话中定位语义相关的多语句、多图像片段。我们探索了两种设置:(1)单对话内的FFR,从给定对话中检索片段;(2)对话语料库内的FFR,从大规模语料库中为开放域场景检索片段。对于(1),我们引入了F2RVLM,一种基于生成的检索模型,使用强化学习训练,通过多目标奖励和难度感知课程采样来增强片段连贯性。对于(2),我们开发了FFRS,一个两阶段系统,结合了离线片段级索引和在线检索。具体来说,每个对话被分解为最小语义片段,由片段嵌入模型(FEM)编码到向量数据库中;在推理时,FEM快速召回Top-K候选,F2RVLM进行细粒度推理以识别最相关的子内容。为支持FFR,我们构建了MLDR,迄今为止最长的多模态对话检索数据集,以及一个基于微信的真实世界测试集。在两个基准上的实验表明,F2RVLM和FFRS在单对话和语料库级别的FFR上始终取得优越性能。

英文摘要

With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.

2606.04588 2026-06-04 cs.CL

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

VCIFBench:评估视频理解中的复杂指令遵循能力

Huangchen Xu, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University(知识驱动人机智能工程研究中心,吉林大学) International Center of Future Science, Jilin University(未来科学国际中心,吉林大学)

AI总结 提出VCIFBench基准,通过混合验证流水线评估多模态大模型在视频理解中遵循内容、格式、风格和结构约束的复杂指令能力,实验表明联合约束满足仍具挑战,DPO训练可提升性能。

详情
AI中文摘要

多模态大语言模型在视频理解方面取得了快速进展,然而现有基准主要依赖简单提示,且提供的证据有限,无法判断模型是否能满足明确的输出约束。我们引入了VCIFBench,这是一个用于评估视频理解中复杂指令遵循能力的基准。VCIFBench从基准适配和直接视频接地提示中构建了富含约束的指令,涵盖内容、格式、风格和结构要求,并通过混合验证流水线评估模型输出。该基准包含306个可满足的测试指令、一个540对的DPO偏好数据集以及一个30项的冲突诊断子集。在10个MLLM上的实验表明,联合约束满足仍然具有挑战性。我们进一步表明,在VCIFBench数据上进行DPO训练可以提高指令遵循性能。

英文摘要

Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.

2606.04584 2026-06-04 cs.SD

SHB-AE: Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array

SHB-AE:基于球谐波束形成的智能手机麦克风阵列Ambisonics编码与升级方法

Yuhuan You, Yufan Qian, Tianshu Qu, Bin Wang, Xueyang Lv

发表机构 * Peking University(北京大学) Beijing Xiaomi Mobile Software Co., Ltd(北京小米移动软件有限公司) Xiaomi Communications Co., Ltd(小米通讯有限公司)

AI总结 针对智能手机麦克风阵列,提出一种基于球谐波束形成的Ambisonics编码与升级方法SHB-AE,通过设计各阶球谐函数的波束形成器,仅用四个非规则排列的麦克风即可实现四阶Ambisonics编码与升级。

Comments Accepted for presentation at AES Europe 2025 Convention (AES 158th Convention), Warsaw, Poland, May 22-24, 2025

详情
AI中文摘要

随着虚拟现实(VR)和增强现实(AR)的快速发展,空间音频录制与回放引起了越来越多的研究兴趣。高阶Ambisonics(HOA)因其对各种播放设备的适应性以及整合头部朝向的能力而脱颖而出。然而,当前的HOA录制通常依赖于笨重的球形麦克风阵列(SMA),而智能手机等便携设备受到阵列配置和麦克风数量的限制。我们提出SHB-AE,一种基于球谐波束形成的Ambisonics编码方法,适用于智能手机麦克风阵列(SPMA)。通过基于阵列流形为各阶球谐函数设计波束形成器,该方法实现了Ambisonics编码与升级。在真实SPMA及其模拟自由场对应物上,在噪声和混响条件下的验证表明,该方法仅用四个非规则排列的麦克风即可成功编码并升级至四阶Ambisonics。

英文摘要

With the rapid development of virtual reality (VR) and augmented reality (AR), spatial audio recording and reproduction have gained increasing research interest. Higher Order Ambisonics (HOA) stands out for its adaptability to various playback devices and its ability to integrate head orientation. However, current HOA recordings often rely on bulky spherical microphone arrays (SMA), and portable devices like smartphones are limited by array configuration and number of microphones. We propose SHB-AE, a spherical harmonic beamforming based method for Ambisonics encoding using a smartphone microphone array (SPMA). By designing beamformers for each order of spherical harmonic functions based on the array manifold, the method enables Ambisonics encoding and up-scaling. Validation on a real SPMA and its simulated free-field counterpart in noisy and reverberant conditions showed that the method successfully encodes and up-scales Ambisonics up to the fourth order with just four irregularly arranged microphones.

2606.04583 2026-06-04 cs.LG

HalfNet: Randomized Neural Networks with Learned Subspace Geometry

HalfNet: 具有学习子空间几何的随机神经网络

Ethem Alpaydin

发表机构 * Ethem Alpaydin

AI总结 提出HalfNet,通过从可学习的低秩协方差矩阵中随机采样权重,在减少参数的同时匹配全连接网络的性能,揭示权重空间几何对预测能力的关键作用。

Comments 6 pages (+2 pages of appendix), 6 figures

详情
AI中文摘要

许多研究者研究了将部分权重固定为从给定分布(例如 $N(0, I)$)随机抽取值的神经网络。我们提出的 HalfNet 从 $N(0, Σ)$ 中抽取随机权重,其中定义分布几何的 $Σ$ 具有我们从数据中学习的低秩分解。在 MNIST 和 CIFAR-10 上的实验表明,HalfNet 在使用显著更少参数的情况下,能够匹配全训练多层感知器的性能。谱分析表明,神经网络的大部分预测能力在于其权重空间的几何结构,而非单个参数的精确值,并且我们观察到准确率随秩平滑扩展。HalfNet 并非针对低秩结构的神经架构技巧;它实现了一种数据相关的随机嵌入,也可以通过监督度量学习或随机特征和核视角进行解释。

英文摘要

Many researchers have investigated neural networks with some of their weights fixed to values randomly drawn from a given distribution, e.g., $N(0, I)$. Our proposed HalfNet draws random weights from $N(0, Σ)$, where $Σ$, which defines the geometry of the distribution, has a low-rank factorization that we learn from data. Experiments on MNIST and CIFAR-10 demonstrate that HalfNet can match the performance of fully trained multilayer perceptrons while using substantially fewer parameters. Spectral analysis indicates that much of the predictive power of neural networks lies in the geometry of their weight space rather than in the precise values of individual parameters, and we observe that accuracy scales smoothly with rank. HalfNet is not a neural architecture trick for low-rank structure; it implements a data-dependent random embedding that can also be interpreted through supervised metric learning or through random-feature and kernel perspectives.
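Drawing weights from $N(0, Σ)$ with a low-rank $Σ$ can be sketched as follows. This is a minimal illustration, assuming the factorization takes the form $Σ = A A^\top$ with a $d \times r$ factor $A$; the abstract does not state the exact parameterization, and the names `sample_weights` and `factor` are hypothetical. Here `factor` is random, whereas in the paper it would be learned from data.

```python
import numpy as np

def sample_weights(factor, n_units, rng):
    """Draw n_units weight vectors from N(0, factor @ factor.T).

    factor: (d, r) low-rank factor (assumed form of the covariance factorization)
    Returns an (n_units, d) weight matrix whose rows have covariance factor @ factor.T.
    """
    d, r = factor.shape
    z = rng.standard_normal((n_units, r))   # isotropic noise in the r-dim subspace
    return z @ factor.T                     # map the noise into the d-dim input space

rng = np.random.default_rng(0)
d, r = 6, 2
factor = rng.standard_normal((d, r))        # stand-in for a learned factor
weights = sample_weights(factor, n_units=20000, rng=rng)
empirical_cov = weights.T @ weights / weights.shape[0]   # approximates factor @ factor.T
```

Under this assumed form, only the `d * r` entries of `factor` would be trainable, while the `n_units * r` noise values stay fixed after sampling, which is one way to read the "substantially fewer parameters" claim.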

2606.04579 2026-06-04 cs.AI

SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification

SCI-PRM:用于科学推理验证的工具感知过程奖励模型

Xiangyu Zhao, Hengyuan Zhao, Yiheng Wang, Wanghan Xu, Yuhao Zhou, Qinglong Cao, Zhiwang Zhou, Lei Bai, Wenlong Zhang, Xiao-Ming Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Shanghai AI Lab(上海人工智能实验室) National University of Singapore(新加坡国立大学) Shanghai Jiao Tong University(上海交通大学) Sichuan University(四川大学) Tongji University(同济大学)

AI总结 针对科学推理中工具使用和事实一致性问题,提出Sci-PRM模型,通过构建包含工具链轨迹的数据集SCIPRM70K并训练过程奖励模型,在测试时扩展和强化学习中提供细粒度监督,提升基础模型性能。

Comments Accepted by KDD 2026 AI4Science Track

详情
AI中文摘要

虽然过程奖励模型(PRM)在数学推理中取得了显著成功,但它们在复杂科学领域(如生物学、化学和物理学)的应用仍基本未被探索。科学问题不仅要求逻辑严谨,还要求事实一致性和领域特定工具的精确使用,而当前模型在这些方面常常出现幻觉且缺乏验证。在本文中,我们首先构建了SCIPRM70K,这是一个大规模数据集,包含显式地将推理与科学工具执行交错的工具链轨迹。在此基础上,我们训练了一个名为Sci-PRM的高效奖励模型,以在单次推理的每一步提供关于工具选择、执行准确性和结果解释的细粒度监督。实验表明,Sci-PRM在两个关键方面显著增强了基础模型:(1)通过Best-of-N选择实现有效的测试时扩展;(2)当集成到强化学习中时,它作为密集奖励信号,缓解了优势消失的关键问题,使模型能够突破现有性能上限。

英文摘要

While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains such as biology, chemistry, and physics remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in one inference. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
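Best-of-N selection with a step-level reward model can be sketched in a few lines. The sketch below is illustrative only: `best_of_n`, `toy_scorer`, and the choice to aggregate step scores with `min` are assumptions, not details from the paper, and the toy scorer is a stand-in for a trained process reward model.

```python
from typing import Callable, List, Sequence

def best_of_n(candidates: Sequence[List[str]],
              step_scorer: Callable[[str], float]) -> int:
    """Return the index of the candidate trajectory with the best score.

    Each candidate is a non-empty list of reasoning or tool-call steps.
    A trajectory is scored by its weakest step (min aggregation), which is
    one common choice and is assumed here for illustration.
    """
    def trajectory_score(steps: List[str]) -> float:
        return min(step_scorer(step) for step in steps)

    return max(range(len(candidates)),
               key=lambda i: trajectory_score(candidates[i]))

def toy_scorer(step: str) -> float:
    # Hypothetical stand-in for a reward model: distrust unverified guesses.
    return 0.1 if "guess" in step else 0.9

candidates = [
    ["call tool A", "guess the result"],
    ["call tool A", "read tool output"],
]
best = best_of_n(candidates, toy_scorer)
```

With min aggregation a single poorly scored step sinks the whole trajectory, so the second candidate is selected here; averaging or taking the product of step scores would be equally plausible alternatives.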

2606.04574 2026-06-04 cs.LG cs.NE q-fin.ST q-fin.TR stat.ML

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

基于深度强化学习的加密货币市场动态多对交易策略

Damian Lebiedź, Robert Ślepaczuk

发表机构 * Politechnika Śląska(西里西亚理工大学)

AI总结 本研究提出一种结合深度强化学习执行覆盖层的层次化“过滤-排序”配对选择方法和“固定风险、自适应均值”执行模型,在加密货币市场实现优于启发式基准的统计套利表现。

Comments 61 pages, 37 figures, 16 tables

详情
AI中文摘要

本研究旨在确定深度强化学习(DRL)作为专门执行覆盖层是否能够增强高波动性加密货币市场中的配对交易。尽管该策略的经典实现在传统股票市场中已被证明成功,但在高方差环境中往往表现出刚性并面临严重的发散风险。为应对这一需求,本研究引入了新颖概念。为构建稳健系统,我们开发了层次化的“过滤-排序”配对选择方法和专有的“固定风险、自适应均值”执行模型。该系统采用带有长短期记忆(LSTM)层的近端策略优化(PPO)智能体,在严格确定性风险管理边界内控制执行决策。在币安USD-M期货市场的1小时间隔数据上评估,优化后的强化学习策略在样本外表现显著优于启发式基线。平稳循环块自举稳健性检验证实,智能体的风险调整后超额收益在10%水平上统计显著。尽管略低于更严格的5%阈值,这一结果凸显了数字资产特有的极端异质方差。最终,本论文通过引入结合统计套利与DRL执行策略的混合架构,为量化金融文献做出贡献。此外,它通过确定性屏蔽提供了一种安全强化学习的新框架,证明将神经策略锚定于统计稳健边界能成功缓解严重的发散风险。

英文摘要

This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical "Filter-then-Rank" pair selection methodology and a proprietary "Fixed Risk, Adaptive Mean" execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies. Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.

2606.04570 2026-06-04 cs.SD

Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching

Flow-HOA:基于流匹配的Ambisonics编码生成式联合优化

Yuhuan You, Yufan Qian, Tianshu Qu, Bin Wang, Xueyang Lv

发表机构 * State Key Laboratory of General Artificial Intelligence(通用人工智能国家重点实验室) School of Intelligence Science and Technology(智能科学与技术学院) Peking University(北京大学) Beijing Xiaomi Mobile Software Co., Ltd(北京小米移动软件有限公司) Xiaomi Communications Co., Ltd(小米通讯有限公司)

AI总结 提出Flow-HOA生成框架,通过条件流匹配联合优化时域、频谱和空间保真度,生成可部署的FIR编码滤波器组,在合成数据和真实录音上均优于强基线方法。

Comments Accepted for presentation at AES Europe 2026 Convention (AES 160th Convention), Copenhagen, Denmark, May 28-30, 2026

详情
AI中文摘要

从稀疏、不规则的麦克风阵列进行高阶Ambisonics(HOA)编码仍然是沉浸式通信和XR中消费级空间音频捕获的关键挑战。我们提出Flow-HOA,一个生成式框架,联合优化包含时域、频谱和空间保真度的多维目标,同时生成可部署的、时不变的有限脉冲响应(FIR)编码滤波器组。通过条件流匹配,模型学习将简单先验分布映射到FIR滤波器系数的目标分布。训练由复合损失引导,平衡时域波形保真度、多分辨率频谱一致性、子带能量保持和空间指向性约束。在合成模拟数据上的客观评估表明,在信号保真度和空间准确性指标上均优于强模型基线。在真实麦克风阵列录音上的主观听音测试进一步证实,Flow-HOA能产生更高的整体音质并减少伪影,展示了从合成训练数据到真实捕获条件的泛化能力。

英文摘要

Higher-Order Ambisonics (HOA) encoding from sparse, irregular microphone arrays remains a critical challenge for consumer spatial audio capture in immersive communication and XR. We propose Flow-HOA, a generative framework that jointly optimizes a multi-dimensional objective encompassing time-domain, spectral, and spatial fidelity while producing a deployable, time-invariant bank of Finite Impulse Response (FIR) encoding filters. Using conditional flow matching, the model learns to map a simple prior distribution to the target distribution of FIR filter coefficients. Training is guided by a composite loss that balances time-domain waveform fidelity, multi-resolution spectral consistency, sub-band energy preservation, and spatial directivity constraints. Objective evaluations on synthetically simulated data demonstrate improved performance over strong model-based baselines in both signal fidelity and spatial accuracy metrics. Subjective listening tests on real microphone array recordings further confirm that Flow-HOA yields higher overall sound quality with reduced artifacts, demonstrating generalization from synthetic training data to real-world capture conditions.

2606.04569 2026-06-04 cs.RO

MineXplore: An Open-Source Reinforcement Learning Exploration Benchmark for GNSS-Denied Underground Environment

MineXplore: 面向GNSS拒止地下环境的开源强化学习探索基准

Abhishek S, Badrikanath Praharaj, Sreeram MV

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出基于真实矿井数据的开源MuJoCo导航基准MineXplore,通过六阶段管道重建隧道网络,验证了在GNSS拒止、光照退化等极端条件下策略学习的稳定性与可复现性。

Comments 7 pages, 11 figures. Submitted to the workshop Xplore: Cross-Disciplinary Aspects of Exploration in Robotics, Reinforcement Learning and Search, held at the International Conference on Robotics and Automation (ICRA)

详情
AI中文摘要

地下矿井为自主机器人导航带来了极端条件:GPS被拒止,光照退化,隧道拓扑具有丰富的环路且非凸。目前开源生态中尚不存在基于真实生产矿井几何结构且兼容GPU加速学习管道的仿真基准。我们提出了MineXplore,一个基于Leung等人2017年智利地下铜矿数据集的开源MuJoCo导航基准。该环境通过六阶段轮廓到MJCF管道重建了一个104,423平方米的隧道网络,包含八边形墙壁横截面、LiDAR源锯齿状墙壁几何、三个地形摩擦区域、全局5度倾斜和周期性点光源。几何保真度通过交并比(IoU)为0.9538(与源测量图对比)得到验证,表面纹理相似度在六个结构维度上达到79.4%。通过RLlib在五个独立随机种子上训练的单智能体PPO基线实现了88.89%的最佳滚动覆盖率(5个种子中有3个达到90%覆盖目标),证实MineXplore在真实地下感知和拓扑条件下支持稳定且可复现的策略学习。

英文摘要

Underground mines present extreme conditions for autonomous robot navigation: GPS is denied, lighting is degraded, and tunnel topology is loop-rich and non-convex. Simulation benchmarks grounded in real production-mine geometry and compatible with GPU-accelerated learning pipelines do not yet exist in the open-source ecosystem. We present MineXplore, an open-source MuJoCo-based navigation benchmark derived from the Leung et al. 2017 Chilean underground copper mine dataset. The environment reconstructs a 104,423 sq. m tunnel network through a six-stage contour-to-MJCF pipeline incorporating octagonal wall cross-sections, LiDAR-sourced jagged wall geometry, three terrain friction zones, a global 5-degree incline, and periodic spot lighting. Geometric fidelity is validated at an Intersection over Union (IoU) of 0.9538 against the source survey map, and surface texture similarity scores 79.4% across six structural dimensions. A single-agent PPO baseline trained via RLlib across five independent random seeds achieves a best rolling coverage of 88.89% (3 of 5 seeds reaching the 90% coverage target), confirming that MineXplore supports stable and reproducible policy learning under realistic underground sensing and topology.
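The Intersection over Union figure quoted above compares two regions by dividing the area they share by the area either one covers. A minimal sketch, assuming both the survey map and the reconstruction have been rasterized to binary occupancy grids of the same shape (the paper's actual rasterization procedure is not described in the abstract, and the function name is hypothetical):

```python
import numpy as np

def intersection_over_union(mask_a, mask_b):
    """IoU of two binary masks with identical shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0                      # two empty masks are treated as identical
    return np.logical_and(a, b).sum() / union

# Toy 3x3 grids: the reconstruction misses one of the four surveyed cells.
survey = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]])
reconstruction = np.array([[1, 1, 0],
                           [1, 0, 0],
                           [0, 0, 0]])
iou = intersection_over_union(survey, reconstruction)   # 3 shared cells / 4 covered cells
```

In this toy case the score is 0.75; a value such as the reported 0.9538 means the reconstruction and the survey disagree on under 5% of the area covered by either.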

2606.04564 2026-06-04 cs.LG

SurvPFN: Towards Foundation Models for Survival Predictions

SurvPFN:面向生存预测的基础模型

Samuel Böhm, Lennart Purucker, Frank Hutter, Pascal Schlosser

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出SurvPFN,一种基于先验数据拟合网络(PFN)的生存预测模型,通过合成数据预训练和删失负对数似然损失,无需逐数据集拟合即可在真实任务中与经典和深度生存基线竞争。

Comments 10 pages, 1 figure. Accepted to "Foundation Models for Structured Data" Workshop at the International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

表格基础模型(TFM)在标准分类和回归任务中取得了快速进展,但时间至事件生存预测任务在很大程度上仍未涉及。与标准回归任务不同,生存预测模型必须处理删失数据。标准TFM无法原生处理删失数据,导致预测有偏且不准确,使其不适用于实际应用。为克服这一根本限制,我们提出了用于生存预测任务的先验数据拟合网络(PFN)SurvPFN。我们在数百万个合成生存预测任务上预训练SurvPFN,通过考虑删失数据的分布回归来学习生存。SurvPFN通过以下方式工作:(1)使用威布尔事件时间和非信息性删失机制生成数据;(2)整合删失事件指示符;(3)最小化删失负对数似然。在SurvSet(一个真实世界生存任务集合)上,SurvPFN无需逐数据集拟合、生存特定架构或特征工程,即可与经典和深度生存基线高度竞争。我们表明,生存可以被视为具有删失损失的连续时间分布回归问题,从而释放PFN在时间至事件预测中的潜力。

英文摘要

Tabular foundation models (TFMs) have made rapid progress in standard classification and regression, but time-to-event survival prediction tasks have remained largely untouched. Unlike in standard regression tasks, survival prediction models must account for censored data. Standard TFMs cannot natively handle censored data, leading to biased and inaccurate predictions and making them unsuitable for real-world applications. To overcome this fundamental limitation, we propose SurvPFN, a prior-data fitted network (PFN) for survival prediction tasks. We pretrain SurvPFN on millions of synthetic survival prediction tasks to learn survival via distributional regression that accounts for censored data. SurvPFN works by (1) generating data with Weibull event times and a non-informative censoring mechanism; (2) integrating a censored event indicator; and (3) minimizing a censored negative log-likelihood. On SurvSet, a collection of real-world survival tasks, SurvPFN is highly competitive with classical and deep survival baselines without per-dataset fitting, a survival-specific architecture, or feature engineering. We show that survival can be treated as a continuous-time distributional regression problem with censored loss, unlocking the power of PFNs for time-to-event predictions.
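A censored negative log-likelihood treats observed events and censored records differently: an observed event contributes the log-density at its time, while a censored record contributes only the log-probability of surviving past its time. The sketch below uses a Weibull distribution for concreteness. That is an assumption for illustration: the abstract says Weibull event times are used to generate synthetic data, but it does not say how the model's predictive distribution is parameterized. The function name is hypothetical.

```python
import numpy as np

def censored_weibull_nll(times, events, shape, scale):
    """Mean censored negative log-likelihood under a Weibull(shape, scale) model.

    times:  observed times (event time if events == 1, censoring time otherwise)
    events: 1 if the event was observed, 0 if the record is right-censored
    """
    t = np.asarray(times, dtype=float)
    d = np.asarray(events, dtype=float)
    z = (t / scale) ** shape
    # log f(t) = log(k) - log(lam) + (k - 1) * log(t / lam) - (t / lam)^k
    log_density = np.log(shape) - np.log(scale) + (shape - 1) * np.log(t / scale) - z
    # log S(t) = -(t / lam)^k
    log_survival = -z
    return -np.mean(d * log_density + (1 - d) * log_survival)

# With shape = scale = 1 the Weibull reduces to a unit-rate exponential,
# so both log f(t) and log S(t) equal -t.
nll = censored_weibull_nll(times=[1.0, 2.0], events=[1, 0], shape=1.0, scale=1.0)
```

For this toy input the loss is (1 + 2) / 2 = 1.5. Dropping the censored record or treating it as an observed event would change the likelihood and bias the fit, which is the failure mode the abstract attributes to standard tabular models.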

2606.04562 2026-06-04 cs.AI cs.LG cs.SI

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Neetyabhas: 理性主体模型中不确定性感知的公共政策优化框架

Janani Venugopalan, Gaurav Deshkar, Rishabh Gaur, Harshal Hayatnagarkar, Jayanta Kshirsagar

发表机构 * ThoughtWorks

AI总结 提出一种集成流行病测量和政策执行不确定性的分层强化学习框架,通过模拟个体行为与政策干预的交互,有效管理疫情并降低影响。

详情
AI中文摘要

目的:世界卫生组织的COVID-19非药物干预措施(如封锁、疫苗接种)有效遏制了传播,但带来了沉重的经济负担。现有研究常常忽略个体行为,并错误地假设完美的感染追踪和无误的政策执行,未能考虑现实世界的不确定性和错误。方法:我们提出了一种整合流行病测量(感染/住院)和政策执行中不确定性的方法。我们构建了一个包含1000名个体的模拟模型,这些个体实时做出关于佩戴口罩、接种疫苗和购物的选择。同时,政策制定者基于健康和经济观察部署干预措施(封锁、强制令)。该框架由分层强化学习智能体驱动,利用深度Q网络以及不确定性感知的策略梯度变体(DDPG和TD3)。结果:模拟有效管理了疫情的进展。佩戴口罩和疫苗接种被证明非常有效,显著降低了疫情高峰的高度和持续时间。通过整合个体行为、政策不确定性和多方面的干预措施,我们的动态控制方法成功减轻了疫情的影响。结论:我们的模型通过将不确定性和人类行为嵌入公共卫生政策框架,克服了以往研究的局限性。模拟表明,考虑个体选择和不完美数据对于设计复杂疫情期间的有效干预措施至关重要,其中口罩和疫苗是关键工具。

英文摘要

Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods: We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results: The simulations effectively managed the epidemic's progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak's peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic's impact. Conclusions: Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.

2606.04557 2026-06-04 cs.CL cs.IR cs.LG

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

大规模弹匣:训练模块化KV缓存以处理大型文档集合

Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert

发表机构 * Amazon AGI(亚马逊人工智能研究院)

AI总结 提出Cartridges at Scale (CAS)框架,通过动态干扰混合和内存高效预算管理器实现大规模多弹匣训练,在减少预填充开销的同时保持准确性,性能优于单块弹匣10-31点,接近全上下文学习。

Comments 21 pages, 5 figures, 17 tables

详情
AI中文摘要

大型语言模型能够对长上下文进行推理,但预填充数百万个标记是浪费的,因为许多内容在查询之间保持不变。弹匣通过将文档集合提炼为可重用的键值(KV)缓存来解决这一问题,从而消除预填充同时保持准确性。这种方法的一个关键限制是弹匣是单块且非组合的:将整个集合编码为单个KV块无法扩展,而简单地混合单独训练的弹匣会使性能下降到接近随机水平。我们引入了Cartridges at Scale (CAS),这是一个可扩展的多弹匣学习训练框架,具有动态干扰项混合和内存高效的预算管理器,可在GPU和持久存储之间轮换数百个每文档弹匣。我们的方法可扩展到超过一百万个标记的集合,在可比标记预算下,比单块弹匣提高10-31点。即使在高度压缩下,Oracle弹匣的准确率与完整上下文学习的差距也在2-6点以内。当与检索结合用于弹匣选择时,CAS匹配或超过传统RAG准确率,同时消耗的提示标记减少3-4倍。

英文摘要

Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.

2606.04555 2026-06-04 cs.CL cs.AI

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

时间顺序对智能体记忆至关重要:面向长程智能体的线段树

Yifan Simon Liu, Liam Gallagher, Faeze Moradi Kalarde, Jiazhou Liang, Armin Toroghi, Scott Sanner

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(人工智能向量研究所)

AI总结 提出线段树记忆架构SegTreeMem,通过在线右边缘更新规则保持对话历史的时间顺序,结合层次化时间上下文进行检索,在长程记忆基准上优于现有方法。

详情
AI中文摘要

长程对话智能体需要在不断演化的事件、任务和目标中与用户交互。这些历史记录本质上是时间性的,然而许多现有的记忆系统主要按主题相似性组织信息,可能忽略事件发生的顺序。我们引入线段树记忆(Segment Tree Memory,简称SegTreeMem),这是一种将对话历史表示为按时间顺序排列的话语线段树的记忆架构。SegTreeMem通过在线最右边缘更新规则逐步插入新话语,在形成层次化记忆片段的同时保持时间顺序。在检索时,SegTreeMem通过树传播相关性分数,将局部语义匹配与层次化时间上下文相结合。在三个长程记忆基准和两个LLM骨干网络上,SegTreeMem在答案质量上优于平面检索、图结构记忆和树结构记忆基线。额外的时间顺序置换分析表明,性能提升依赖于在记忆构建过程中保持时间顺序,这支持了时间顺序是智能体记忆关键结构的观点。

英文摘要

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.