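arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.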

2605.26012 2026-05-26 cs.LG cs.AI

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Aleksandar Todorov, Matthia Sabatelli

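AI summary: Proposes a simple prior that inserts a fixed orthogonal projection into the encoder features of a reinforcement learning agent to constrain them to a low-dimensional subspace, proves that it preserves expressivity under a linear realizability assumption, and shows experimentally that value representations can be compressed to extremely low dimensions without loss of performance.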

Abstract

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prior that inserts a fixed orthonormal projection to constrain encoder features to a low-dimensional subspace, requiring no auxiliary objectives, pretraining, or changes to the underlying RL algorithm. Under a linear realizability assumption, we prove that when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization. Empirically, we find that across both single and multi-task benchmarks, baseline performance is either matched or improved once the bottleneck dimension exceeds a small task-dependent threshold; in many cases, value representations can be compressed to extremely low dimensions without loss, and the minimal sufficient dimension depends far more on environment complexity than encoder width. In addition, we analyze representation geometry and find that orthogonal bottlenecks stabilize feature norms and are associated with higher effective rank. Together, these results support a representation-space interpretation of the manifold hypothesis in reinforcement learning and position orthogonal bottlenecks as a lightweight, architecture-agnostic mechanism for shaping RL representations.
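
As a minimal sketch of the idea in this abstract, the NumPy snippet below builds a fixed projection with orthonormal columns and applies it to a batch of encoder features. The feature width (256), the bottleneck size (8), the QR-based construction, and the function names are illustrative assumptions; the abstract does not say how the authors construct or place the projection.

```python
import numpy as np


def make_orthonormal_projection(feature_dim: int, bottleneck_dim: int, seed: int = 0) -> np.ndarray:
    """Return a (feature_dim, bottleneck_dim) matrix with orthonormal columns.

    Assumption: the matrix is drawn once from a random Gaussian via QR and is
    then kept fixed (never trained).
    """
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((feature_dim, bottleneck_dim))
    q, _ = np.linalg.qr(gaussian)  # reduced QR: q has orthonormal columns
    return q


def bottleneck(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Map features of shape (batch, feature_dim) to (batch, bottleneck_dim)."""
    return features @ projection


projection = make_orthonormal_projection(feature_dim=256, bottleneck_dim=8)
features = np.random.default_rng(1).standard_normal((32, 256))
low_dim = bottleneck(features, projection)
```

Because the columns are orthonormal, the projection can never increase the norm of a feature vector, which is one plausible reading of the norm-stabilizing effect the abstract reports.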

2605.26007 2026-05-26 cs.CL

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, Chelsea Dominique E. Temprosa, Georgianna Z. Reyes, Kervin Gabriel L. Chua

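AI summary: For the low-resource Filipino-English code-switching setting, presents the first systematic evaluation of transformer-based dementia detection models and finds that bilingual fine-tuning eliminates cross-lingual degradation, reaching Macro-F1 = 0.969-0.973.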

Comments Accepted to BioNLP Workshop @ ACL 2026

Abstract

Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
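
For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it from scratch on toy labels that are invented for illustration and are not taken from the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (hypothetical): a classifier that predicts "control" for everything.
y_true = ["dementia", "control", "dementia", "control"]
y_pred = ["control", "control", "control", "control"]
score = macro_f1(y_true, y_pred)
```

On a balanced binary task, a classifier that collapses to a single class scores 1/3 under this metric, so a Macro-F1 below 0.5, such as the 0.455 reported for zero-shot transfer, is consistent with predictions that lean heavily toward one class.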

2605.26004 2026-05-26 cs.CV cs.CL

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

Shristi Das Biswas, Kaushik Roy

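AI summary: Proposes MAGIC, a training-free, forward-only coreset selection method that uses three intrinsic signals from a pretrained VLM (multimodal gain, bridging relevance, and skill-neuron signatures) to build compact, behaviorally faithful subsets for multimodal instruction tuning, matching or exceeding full fine-tuning performance at a 20% budget.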

Abstract

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
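
The three-stage pipeline described in this abstract (filter, rank, bucket-wise allocation) can be sketched as follows. This is only an illustration under stated assumptions: the field names, the gain threshold, and the round-robin allocation rule are hypothetical, since the abstract does not specify the quality objective or how the budget is split across buckets.

```python
from collections import defaultdict


def select_coreset(samples, budget, min_gain=0.0):
    """samples: list of dicts with 'id', 'gain', 'quality', 'signature' keys."""
    # Stage 1: drop samples whose multimodal gain is too low.
    kept = [s for s in samples if s["gain"] > min_gain]

    # Stage 2: rank survivors by quality (best first) and group them by
    # their discrete neuron signature.
    buckets = defaultdict(list)
    for s in sorted(kept, key=lambda s: s["quality"], reverse=True):
        buckets[s["signature"]].append(s)

    # Stage 3 (assumed rule): round-robin over buckets so that every
    # signature is represented before any bucket gets a second pick.
    selected = []
    queues = list(buckets.values())
    while len(selected) < budget and any(queues):
        for queue in queues:
            if queue and len(selected) < budget:
                selected.append(queue.pop(0))
    return [s["id"] for s in selected]


# Toy candidates (hypothetical values).
samples = [
    {"id": "a", "gain": 0.9, "quality": 0.8, "signature": (1, 4)},
    {"id": "b", "gain": 0.7, "quality": 0.9, "signature": (1, 4)},
    {"id": "c", "gain": 0.5, "quality": 0.3, "signature": (2, 7)},
    {"id": "d", "gain": -0.1, "quality": 0.95, "signature": (2, 7)},
    {"id": "e", "gain": 0.4, "quality": 0.6, "signature": (3, 5)},
]
chosen = select_coreset(samples, budget=3)
```

In this toy run, sample "d" is removed by the gain filter despite having the highest quality, and "a" loses its slot to lower-quality samples from other buckets, which is the coverage-preserving behavior the abstract motivates.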

2605.26003 2026-05-26 cs.CV

Towards 3D heart mesh generation using contactless radar imaging and physics-informed neural network

Jinye Li, Chenxi Fu, Minghang Zheng, Yang Liu, Xiahai Zhuang, Qingchao Chen

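AI summary: Proposes the SAR2Mesh framework, which reconstructs high-fidelity 3D cardiac geometry from synthetic aperture radar images through a coarse-to-fine mesh deformation process combined with geometry-aware feature projection and a physics-informed radar loss.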

Abstract

Cardiac function evaluation necessitates continuous, non-invasive monitoring, a capability limited in MRI. Millimeter-wave (mmWave) radar and its Synthetic Aperture Radar (SAR) mode offer a privacy-preserving and portable option for point-of-care clinical applications. However, reconstructing high-fidelity 3D cardiac geometry from SAR remains an open challenge. Traditional radar methods generate sparse point clouds that lack continuous surface topology. Meanwhile, direct application of optical reconstruction networks performs poorly due to the severe speckle noise and ambiguous boundaries inherent in SAR images. To bridge this gap, we propose SAR2Mesh, a novel framework that reformulates the task as a coarse-to-fine mesh deformation process. By initializing with a topological template, our approach explicitly preserves anatomical connectivity through progressive mesh deformation. We introduce a geometry-aware feature projection module to extract multi-view features via 3D-to-2D sampling, and a physics-informed radar loss to enforce consistency between the predicted geometry and raw radar echoes. Furthermore, we present Cardiac Mesh-SAR, the first large-scale paired SAR-mesh dataset. Extensive experiments demonstrate that SAR2Mesh significantly outperforms existing image-based baselines, achieving accurate and physically consistent cardiac reconstructions.

2605.26001 2026-05-26 cs.CL cs.AI cs.CY

AI-Assisted Systematization for Evaluating GenAI Systems

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

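AI summary: To address underspecified concepts in generative AI evaluation, proposes AI-assisted systematization, which produces measurable concept specs using a concept spec representation and a validation worksheet, and evaluates them on content validity and information recoverability.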

Abstract

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts -- hate-based rhetoric and digital empathy -- and evaluate resulting concept specs on content validity and information recoverability.

2605.25998 2026-05-26 cs.LG

Causal methods for LLM development and evaluation

Dennis Frauen, Marie Brockschmidt, Konstantin Hess, Haorui Ma, Yuchen Ma, Abdurahman Maarouf, Maresa Schröder, Jonas Schweisthal, Yuxin Wang, Athiya Deviyani, Sonali Parbhoo, Rahul G. Krishnan, Stefan Feuerriegel

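AI summary: Argues that causal methods can address key causal questions in LLM development and evaluation, and systematically maps opportunities for applying them in pretraining, alignment, routing, and other stages.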

Comments Published in KDD 2026

DOI: 10.1145/3770855.3818647

Abstract

Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized for the LLM development and evaluation pipeline, despite the fact that such methods can ensure a reliable and scientifically grounded design.
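
To make the confounding concern concrete for the routing question, the sketch below compares a naive difference in success rates with an inverse-propensity-weighted (IPW) estimate on a toy routing log. All counts and propensities are invented for illustration, the propensities are assumed to be known, and IPW is used here only as one standard estimator from causal inference, not as a method attributed to the paper.

```python
def build_log():
    """Toy routing log: rows of (difficulty, routed_to_large, success)."""
    # (difficulty, routed_to_large, n_prompts, n_successes), hypothetical counts.
    cells = [("easy", 1, 20, 18), ("easy", 0, 80, 64),
             ("hard", 1, 80, 40), ("hard", 0, 20, 6)]
    rows = []
    for difficulty, large, n, wins in cells:
        rows += [(difficulty, large, 1)] * wins + [(difficulty, large, 0)] * (n - wins)
    return rows


# Assumed logging policy: hard prompts are sent to the large model more often.
PROPENSITY_LARGE = {"easy": 0.2, "hard": 0.8}


def naive_effect(rows):
    """Difference in raw success rates, ignoring how prompts were routed."""
    large = [y for _, t, y in rows if t == 1]
    small = [y for _, t, y in rows if t == 0]
    return sum(large) / len(large) - sum(small) / len(small)


def ipw_effect(rows):
    """Inverse-propensity-weighted estimate of the large-vs-small effect."""
    n = len(rows)
    y1 = sum(y / PROPENSITY_LARGE[d] for d, t, y in rows if t == 1) / n
    y0 = sum(y / (1 - PROPENSITY_LARGE[d]) for d, t, y in rows if t == 0) / n
    return y1 - y0


rows = build_log()
naive = naive_effect(rows)
adjusted = ipw_effect(rows)
```

In this toy log the large model is better within both difficulty levels, yet the naive comparison makes it look worse because it receives most of the hard prompts; reweighting by the routing propensity recovers the positive effect.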

2605.25997 2026-05-26 cs.LG stat.ML

Deployment-complete benchmarking

El Mustapha Mansouri, Keigo Arai

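AI summary: Proposes a deployment-complete benchmarking framework that uses evidence fibers and completion curves to quantify whether benchmark evidence suffices to determine a deployment action, and shows that scores alone are insufficient to support deployment decisions.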

Comments 33 pages, 5 figures, 1 table; supplementary tables and code available

Abstract

Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
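
The completeness condition in this abstract (the action is constant on each evidence fiber) can be checked mechanically. In the sketch below a fiber is taken to be the set of candidates that share identical benchmark evidence; that reading, the record layout, and the function names are assumptions, and the toy records are invented.

```python
from collections import defaultdict


def mixed_fiber_fraction(records):
    """records: iterable of (evidence, action) pairs with hashable evidence.

    Returns the fraction of fibers that contain more than one distinct action.
    """
    fibers = defaultdict(set)
    for evidence, action in records:
        fibers[evidence].add(action)
    mixed = sum(1 for actions in fibers.values() if len(actions) > 1)
    return mixed / len(fibers)


def is_complete(records):
    """A benchmark is complete when no fiber mixes different actions."""
    return mixed_fiber_fraction(records) == 0.0


# Toy records (hypothetical): evidence is a tuple of benchmark scores.
records = [
    ((0.9,), "deploy"),
    ((0.9,), "reject"),  # same evidence, different action: a mixed fiber
    ((0.5,), "reject"),
    ((0.7,), "deploy"),
]
fraction = mixed_fiber_fraction(records)
```

Adding a second evidence coordinate that separates the two candidates in the mixed fiber would make the benchmark complete, which is the kind of extra evidence the completion curves in the abstract are said to quantify.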

2605.24728 2026-05-26 cs.AI

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Christopher Da Silva

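AI summary: Proposes Hylos, a systems architecture that uses contract bounds and spatial transactions to make generated or edited 3D content operable, supporting downstream applications such as CAD and robotics.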

Comments 27 pages, 7 figures. Systems/position preprint with focused artifact study

Abstract

Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

2605.23904 2026-05-26 cs.AI cs.CL

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

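AI summary: Proposes SkillOpt, a systematic, controllable text-space optimizer in which a separate optimizer model makes bounded edits to a skill document and an edit is accepted only when it strictly improves the validation score, which stabilizes skill training; it outperforms existing methods across six benchmarks.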

Comments 27 pages, 4 figures, 6 tables

Abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
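
The acceptance rule in this abstract (keep an edit only when it strictly improves a held-out validation score) can be sketched as a simple loop. Everything concrete here is a stand-in: in the paper a separate optimizer model proposes add/delete/replace edits from scored rollouts, whereas this toy proposer only appends fixed phrases and the scorer is a hypothetical keyword check.

```python
def optimize_skill(skill, propose_edit, validation_score, steps):
    """Greedy text-space loop with strict-improvement acceptance."""
    best_score = validation_score(skill)
    rejected = []  # buffer of rejected candidates, passed back to the proposer
    for _ in range(steps):
        candidate = propose_edit(skill, rejected)
        score = validation_score(candidate)
        if score > best_score:  # ties are rejected: improvement must be strict
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Hypothetical stand-ins for the optimizer model and the validation set.
EDITS = ["Check units.", "Check units.", "Be verbose.", "Verify the final answer."]


def make_proposer(edits):
    pending = iter(edits)

    def propose_edit(skill, rejected):
        return skill + " " + next(pending)

    return propose_edit


def toy_validation_score(skill):
    # Rewards two distinct useful phrases, once each.
    return int("Check units." in skill) + int("Verify the final answer." in skill)


final_skill, final_score, rejected = optimize_skill(
    "Solve step by step.", make_proposer(EDITS), toy_validation_score, steps=len(EDITS)
)
```

The duplicate and the unhelpful edit leave the score unchanged and are therefore rejected, so the skill document only grows when the validation score rises.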

2605.18646 2026-05-26 cs.CL

Language-Switching Triggers Take a Latent Detour Through Language Models

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

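AI summary: Uses circuit analysis to reveal the internal mechanism of a language-switching backdoor in an 8B-parameter autoregressive language model, in which a three-word Latin trigger redirects English output to French; the trigger signal propagates through a latent subspace orthogonal to the language-identity direction and is finally converted into French logits by the last-layer MLP.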

Comments 15 pages, 16 figures. Under review

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
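
The closing claim of this abstract, that a defense probing for language-like signals would miss an orthogonally encoded trigger, follows from simple linear algebra. The sketch below uses random vectors as stand-ins for the model's language-identity direction and trigger signal; the dimension, the scale factor, and the linear probe are illustrative assumptions, not measurements from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Stand-in for the model's language-identity direction (unit vector).
language_direction = rng.standard_normal(hidden_dim)
language_direction /= np.linalg.norm(language_direction)

# Stand-in trigger signal, with its component along the language direction
# removed (one Gram-Schmidt step), so it lies in the orthogonal subspace.
raw_signal = rng.standard_normal(hidden_dim)
trigger_signal = raw_signal - (raw_signal @ language_direction) * language_direction


def language_probe(hidden_state: np.ndarray) -> float:
    """Hypothetical linear probe that reads out the language-identity direction."""
    return float(hidden_state @ language_direction)


clean_state = rng.standard_normal(hidden_dim)
triggered_state = clean_state + 5.0 * trigger_signal
probe_shift = language_probe(triggered_state) - language_probe(clean_state)
```

The hidden state moves a long way when the trigger is added, yet the probe reading is unchanged up to floating-point error, because the added signal has no component along the probed direction.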

2605.02836 2026-05-26 cs.LG math.AT

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

Sushovan Majhi, Atish Mitra, Žiga Virk, Pramita Bagchi

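AI summary: Proposes PLACE, a pipeline that classifies point clouds and graphs from their persistent-homology signatures using closed-form formulas, with no learned weights or calibration, and provides a margin-based excess-risk rate, a descriptor-selection rule, and a per-prediction certificate.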

Comments TMLR submission, https://openreview.net/forum?id=4kZxNlE5Ve. v2: variance-aware Pinelis-Bernstein certificate (radius iii) fires on 8/12 benchmarks (v1: not operational); MUTAG: empirical and population NC rules agree on 940/940 predictions. Matching-free nu-coherence replaces non-interference. Le Cam lower bound (Thm 3.2) recast PD-native, matching regime m<~R/D explicit

Abstract

We introduce PLACE (Persistence-Landmark Analytic Classification Engine), a closed-form pipeline for classifying point clouds and graphs through their persistent-homology signatures. Three quantitative guarantees -- a margin-based excess-risk rate, a closed-form descriptor-selection rule, and a per-prediction certificate -- are derived from training labels alone, with no learned weights or held-out calibration. The embedding sums Mitra-Virk single-point coordinate functions over a sparse landmark grid; the closed-form weight rule $w_k^2 \propto (d_{k+1}^2 - d_k^2)/R_k^2$ maximizes the distortion slope in Mitra-Virk's affine certificate under $\nu$-coherence. (i) An $O(kR/(\Delta\sqrt{m_{\min}}))$ margin bound, driven by class-mean separation $\Delta$ and embedding radius $R$, matched in the sample-starved regime $m \lesssim R/\Delta$ by a Le Cam minimax lower bound. (ii) The Mahalanobis margin under Ledoit-Wolf-shrunk covariance is the strongest closed-form ranker on a 64-descriptor chemical-graph pool (mean Spearman $\rho = +0.56$ across 11 benchmarks, positive on 10 of 11); the isotropic surrogate $\Delta/\sqrt{\ell}$ admits a closed-form selection-consistency rate on the homogeneous protein/social pools. (iii) A training-time-decided certificate, with no per-prediction overhead, in three concrete radii (Pinelis, Gaussian plug-in, and variance-aware Pinelis-Bernstein). Empirically, PLACE is the strongest diagram-based method on Orbit5k and matches the strongest topology-based baseline within statistical noise on MUTAG and COX2; remaining gaps fall into two diagnosable regimes (descriptor blindness on NCI1/NCI109; pool-coverage limits elsewhere). The Pinelis-Bernstein radius fires on 8 of the 12 benchmarks; on MUTAG the empirical and population nearest-centroid rules agree on every one of 940 held-out test predictions, validating the certificate's mechanism.
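
The abstract refers to nearest-centroid rules and to the class-mean separation $\Delta$. As a minimal sketch, the snippet below fits class centroids on already-embedded vectors, classifies by nearest centroid, and computes the separation between the two class means. The persistence-landmark embedding and the certificates are not reproduced; the 2-D toy points and function names are invented for illustration.

```python
import numpy as np


def fit_centroids(embeddings: np.ndarray, labels):
    """Return the sorted class list and one mean vector per class."""
    classes = sorted(set(labels))
    label_array = np.asarray(labels)
    centroids = np.stack([embeddings[label_array == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(embeddings: np.ndarray, classes, centroids: np.ndarray):
    """Assign each row to the class whose centroid is closest in Euclidean distance."""
    distances = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return [classes[i] for i in distances.argmin(axis=1)]


# Toy embedded training points (hypothetical), two classes.
train = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
train_labels = [0, 0, 1, 1]
classes, centroids = fit_centroids(train, train_labels)

test_points = np.array([[0.2, 0.3], [4.0, 5.0]])
predictions = predict(test_points, classes, centroids)
separation = float(np.linalg.norm(centroids[0] - centroids[1]))  # class-mean separation
```

No weights are learned here: the rule is fully determined by the training labels and the embedding, which matches the closed-form spirit the abstract emphasizes.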

2605.25991 2026-05-26 cs.LG cs.NA math.NA

Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

Inés Gonzalez-Pepe, Hiba Akhaddar, Tristan Glatard, Yohan Chatelain

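AI summary: Proposes Fuzzy PyTorch, a framework that integrates stochastic arithmetic and probabilistic rounding for rapid evaluation of numerical variability in deep learning models, achieving 5x to 60x speedups over the existing tool Verrou and supporting models from 1 to 341 million parameters.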

Comments 19 pages, 8 figures, Published in Transactions on Machine Learning Research (01/2026)

Abstract

We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating stochastic arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers a stochastic rounding mode and a novel mode, up-down rounding. Comparative evaluations show Fuzzy PyTorch maintains model performance and achieves runtime reductions of 5x to 60x versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.
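
To illustrate what stochastic rounding means in general, the sketch below rounds values to a fixed grid, choosing the upper or lower grid point at random with probability proportional to proximity. This is a simplified assumption-laden illustration: it uses a fixed step rather than floating-point mantissa bits, it is not the API of Fuzzy PyTorch, Verificarlo, or Verrou, and the up-down mode is not sketched because the abstract does not define it.

```python
import numpy as np


def stochastic_round(values, step: float, rng: np.random.Generator) -> np.ndarray:
    """Round each value to a multiple of `step`, up or down at random.

    The probability of rounding up equals the fractional distance from the
    lower grid point, so the rounded value is unbiased in expectation.
    """
    scaled = np.asarray(values, dtype=np.float64) / step
    lower = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return (lower + round_up) * step


rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), step=1.0, rng=rng)
sample_mean = float(samples.mean())
```

Deterministic round-to-nearest would map every 0.3 to 0, whereas the stochastic version returns 1 about 30% of the time, so the mean stays close to 0.3. Repeating a computation under such randomized rounding and looking at the spread of results is the general idea behind stochastic-arithmetic variability analysis.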

2605.25988 2026-05-26 cs.CL

What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu, Chenyu Li, Yanshan Wan

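AI summary: Compares four NLI checkers as process rewards for a GRPO-trained medical RAG agent, diagnoses signal collapse and reward hacking, finds that a checker's output distribution rather than its held-out accuracy determines whether it provides trainable gradient, and frames the findings as boundary conditions for verifier-as-reward systems.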

Abstract

Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. We find that the checker's output distribution during training, not its held-out accuracy, decides whether it provides trainable gradient. We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. (i) Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97% of claims neutral -- collapsing the RL gradient to zero -- while a calibrated MedNLI classifier scores the same pairs non-degenerately. (ii) Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade -- ultra-short answers, search avoidance, language collapse -- so a moderate-signal local classifier trains a higher-quality model (+12% BERTScore over zero-shot, no GPT dependency). (iii) Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.
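
One way to see why a near-constant checker output removes the training signal is to look at group-relative advantages, which GRPO-style methods commonly compute by normalizing rewards within a group of sampled responses. The sketch below assumes that normalization and assumes that a "neutral" label maps every response to the same reward; the reward values are invented and the paper's exact reward shaping is not stated in the abstract.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Collapsed checker (hypothetical): every sampled answer gets the same reward.
collapsed = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

# Non-degenerate checker (hypothetical): rewards differ across the group.
informative = group_relative_advantages([0.1, 0.9, 0.4, 0.6])
```

With identical rewards every advantage is exactly zero, so the policy-gradient update for that group vanishes regardless of how accurate the checker is on held-out data.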

2605.25984 2026-05-26 cs.CL cs.AI

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

Michael Orme, Yanchao Yu, Zhiyuan Tan

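AI summary: Proposes SafeCtrl-RL, a framework that uses reinforcement learning to dynamically select prompt adjustment strategies at inference time, suppressing unsafe behaviour without retraining and improving the safety and response quality of LLM dialogue.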

Abstract

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.
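
The idea of an agent that learns which prompt adjustment to apply can be illustrated with a much simpler stand-in: an epsilon-greedy bandit over a few strategies. This is a deliberate simplification, since the abstract describes a sequential decision process with contextual feedback, and the strategy names, safety scores, and hyperparameters below are all hypothetical.

```python
import random


class StrategySelector:
    """Epsilon-greedy selection over prompt adjustment strategies."""

    def __init__(self, strategies, epsilon=0.2, seed=0):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in self.strategies}
        self.values = {s: 0.0 for s in self.strategies}

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.values[s])

    def update(self, strategy, reward):
        # Incremental running mean of the observed rewards.
        self.counts[strategy] += 1
        self.values[strategy] += (reward - self.values[strategy]) / self.counts[strategy]


# Hypothetical fixed safety score per strategy, standing in for a safety judge.
SAFETY_SCORE = {"no_change": 0.2, "rephrase_request": 0.6, "add_safety_reminder": 0.9}

selector = StrategySelector(SAFETY_SCORE, epsilon=0.2, seed=0)
for _ in range(500):
    strategy = selector.select()
    selector.update(strategy, SAFETY_SCORE[strategy])
best = max(selector.values, key=selector.values.get)
```

No model weights are touched at any point: only the choice of prompt adjustment changes over time, which is the sense in which such control operates at inference time.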

2605.25979 2026-05-26 cs.CV

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

LLaVA-OneVision-2：迈向下一代感知智能

Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

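AI summary: Proposes the LLaVA-OV-2 model, which achieves unified video understanding and spatiotemporal grounding through codec-stream tokenization, windowed attention, and 3D RoPE, and outperforms Qwen3-VL-8B on multiple benchmarks.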

AI中文摘要

我们介绍LLaVA-OneVision-2（LLaVA-OV-2），这是LLaVA-OneVision系列中迄今为止能力最强的视觉语言模型，在广泛的多模态基准测试中均取得了卓越性能。该模型基于原生OneVision编码器，并引入窗口注意力机制以实现高效的局部计算，同时保持原生分辨率。其关键进展是编解码流令牌化：它将压缩视频视为连续的比特成本流，其中比特成本动态决定自适应时间分组，运动残差线索选择显著空间证据到紧凑的视觉画布中。这种分配将有限的令牌预算集中在包含事件的内容上，相比固定图片组，实现了更稳定的长视频令牌压缩。共享的3D RoPE进一步将编解码画布、采样帧和图像置于统一的时空坐标系中。此外，我们围绕大规模开放监督构建了LLaVA-OV-2数据和训练栈：约800万重新标注的视频样本用于预训练，400万样本的空间语料库用于微调。我们还引入了JumpScore，这是一个针对高频、密集重复运动中的细粒度定位的时空定位基准，填补了现有视频评估的空白。LLaVA-OV-2的一项突出能力是其在视频理解、时空定位、空间定位和操作轨迹推理上的统一感知。在JumpScore上，LLaVA-OneVision-2-8B达到74.9 JumpScore mAP，比Qwen3-VL-8B（30.1）高出44.8分；在同一基准的匹配视觉令牌预算下，编解码流输入相比帧采样在时空定位上提升9.7分。在标准基准上，LLaVA-OneVision-2-8B在视频任务上平均比Qwen3-VL-8B高出4.3分，在空间任务上高出5.3分，在跟踪任务上平均J&F高出15.6分。

英文摘要

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.

2605.25977 2026-05-26 cs.CL cs.AI cs.LG

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

创意质量对齐：通过思维链微调实现专家隐性知识迁移

Bo Zou, Chao Xu

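AI summary: Under strict engineering constraints of low data cost and a small base model, this paper empirically validates the creative quality measure from the calibrated surprise framework, identifies data bias, and proposes a creative quality alignment method with a theoretical explanation.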

PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Engineering for Polymarket and OSINT Insight Extraction

Daren Wang, Hong Xu, Jiawen Xian

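AI summary: Proposes PolyGnosis 2.0, a multi-agent architecture that extracts predictive intelligence by fusing Polymarket anomaly signals with global open-source intelligence (GDELT), targets "Perspective Mismatches" as high-alpha trading signals, and evaluates "Harness Engineering" techniques (reflection loops, tool-calling, divide-and-conquer, and chain-of-thought) in high-noise financial domains.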

AI中文摘要

本文介绍了PolyGnosis 2.0，这是一种开创性的多智能体架构，旨在通过综合Polymarket异常信号与全球开源情报（OSINT）流（特别是全球事件、语言和语调数据库（GDELT））来提取预测情报。我们定义并瞄准“视角错配”，即Polymarket情绪与全球媒体流之间的叙事分歧，作为高阿尔法交易信号。超越通用的智能体优越性，我们严格量化了“工具工程”技术（包括反射循环、工具调用、分治分区（D&C）和思维链（CoT））在高噪声金融领域的有效性。我们针对人类专家基准的实证评估表明，虽然结构分区对于多维对齐是必需的，但无约束的终端反射会主动导致逻辑漂移。此外，我们在所有智能体配置的叙事推理过程中发现了一种普遍的“共识偏差”，需要确定性验证。最终，我们分离出一个帕累托最优配置，该配置在最小化延迟和令牌开销的同时实现了专业级的分析精度，为预测市场中的自主智能提供了稳健的蓝图。

英文摘要

This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.

2605.25955 2026-05-26 cs.CL cs.AI cs.LG

QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

QUIET: 面向LLM创意生成能力的多空白级联故事完形填空基准

Bo Zou, Chao Xu

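AI summary: Proposes the QUIET benchmark, which objectively evaluates the creative generation capability of large language models through multi-blank cascaded story cloze and an information-theoretic automated scoring protocol.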

AI中文摘要

大语言模型（LLM）在创意能力评估中面临双重挑战：现有基准（如Story Cloze Test、HellaSwag）通过多项选择识别范式衡量模型对叙事延续的判别能力，而非直接衡量创意生成能力；基于量规的评分和LLM-as-Judge方法依赖主观维度评估或自然语言模型输出，无法提供客观、自动化的评分机制。本文提出QUIET（Quality Understanding via Interlocked Evaluation Testing），一种基于多空白级联故事完形填空的LLM创意能力诊断基准。QUIET在结构完整的故事中设置N个空白（10-20个），每个空白附带显式内容约束，且空白之间存在级联依赖关系——较早空白填充的内容约束较晚空白的可行解空间。被评估模型（或人类参与者）以开放生成模式填充所有空白；结果由基于信息论的自动化评分协议评分，无需人工评分。该评分协议直接操作化“校准惊喜”理论框架（Zou & Xu, 2026a）。对于每个空白k，计算复合分数：score = satisfy * (1 + lambda * surprise)，其中lambda = 1.0。这里，“satisfy”衡量空白填充满足内容约束的程度（客观逻辑推理判断，非主观审美评分），“surprise”衡量在满足约束条件下的惊喜程度。不满足约束的创意答案得零分；满足约束但平庸的答案得分低；满足约束且令人惊喜的答案得分高。

英文摘要

Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks -- the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the "calibrated surprise" theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, "satisfy" measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and "surprise" measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.
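The per-blank composite score quoted in the abstract can be written out directly. The sketch below is a minimal illustration of score = satisfy * (1 + lambda * surprise) with lambda = 1.0; the function name, the assumption that both inputs lie in [0, 1], and the example values are hypothetical, and the abstract does not say how "satisfy" and "surprise" are computed or how per-blank scores are aggregated.

```python
def blank_score(satisfy: float, surprise: float, lam: float = 1.0) -> float:
    """Composite score for one blank: satisfy * (1 + lam * surprise).

    Assumption (not stated in the abstract): both inputs lie in [0, 1].
    """
    return satisfy * (1.0 + lam * surprise)


# Hypothetical inputs covering the three cases described in the abstract.
violates_constraint = blank_score(satisfy=0.0, surprise=0.9)  # creative but invalid
mediocre = blank_score(satisfy=1.0, surprise=0.1)             # valid but unsurprising
surprising = blank_score(satisfy=1.0, surprise=0.9)           # valid and surprising

print(violates_constraint, mediocre, surprising)
```

Because "satisfy" multiplies the whole expression, surprise only contributes once the constraint is met, which is why an unconstrained creative answer scores zero.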

2605.25954 2026-05-26 cs.LG cs.AI

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

Step-TP: 一个基于步骤级、带有思维链推理的 LLM 引导张量程序优化数据集

Mengfan Liu, Da Zheng, Junwei Su, Chuan Wu

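AI summary: To address the lack of verifiable step-level supervision for LLMs in tensor program optimization, proposes the Step-TP dataset, which enables reliable multi-step optimization through structured chain-of-thought reasoning and atomic step-level supervision.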

AI中文摘要

尽管大语言模型（LLM）具有强大的推理能力，但由于需要精确、可组合的变换决策，优化张量程序的执行效率仍然具有挑战性。最近的 LLM 引导方法将张量程序优化视为一个迭代决策过程，但现有数据集仅提供使用令牌效率低下的表示方式的端到端优化程序对，缺乏可验证的步骤级监督和可解释性。因此，LLM 难以在大型组合优化空间中做出可靠的单步决策。我们引入了 Step-TP，一个用于张量程序优化的后训练数据集，它提供基于事实的、原子性的步骤级监督，并带有结构化的思维链（CoT）推理。Step-TP 在中间程序状态上形成一个封闭的推理循环，从而实现可靠的多步优化，而非结果模仿。其设计遵循四个原则：(i) 令牌高效、可验证的中间表示（IR），可确定性降低为 TVM TIR；(ii) 原子且可组合的优化策略，将复杂轨迹分解为可解释的单步决策；(iii) 结构化的 CoT 监督与显式的 IR 到 IR 状态转换相结合；(iv) 策略过滤以平衡覆盖范围同时防止捷径利用。该数据集和实现可在 GitHub 链接 https://github.com/LIUMENGFAN-gif/StepTP 获取。

英文摘要

Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decision process, but existing datasets provide only end-to-end optimized program pairs using token-inefficient representations, lacking verifiable step-level supervision and interpretability. As a result, LLMs struggle to make reliable single-step decisions in large combinatorial optimization spaces. We introduce Step-TP, a post-training dataset for tensor program optimization that provides grounded, atomic, step-level supervision with structured chain-of-thought (CoT) reasoning. Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation. Its design is guided by four principles: (i) a token-efficient, verifiable intermediate representation (IR) that deterministically lowers to TVM TIR; (ii) atomic and composable optimization strategies that decompose complex trajectories into interpretable single-step decisions; (iii) structured CoT supervision coupled with explicit IR-to-IR state transitions; and (iv) strategy filtering to balance coverage while preventing shortcut exploitation. The dataset and implementation are available at a GitHub link, https://github.com/LIUMENGFAN-gif/StepTP.

2605.25952 2026-05-26 cs.CV cs.AI

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

VEN-VL: 一种用于高效多模态理解的视觉集成MoE框架

Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang

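AI summary: Proposes the VEN-VL framework, which follows an enrich-then-compact strategy and uses a visual ensemble MoE with adaptive routing to increase the information capacity and density of visual tokens, balancing performance and efficiency on complex visual tasks with few compressed tokens.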

AI中文摘要

尽管近期高效方法在加速多模态理解方面取得了显著进展，但它们仍然存在明显的性能下降。这些方法强调单一视觉线索的高压缩比，并依赖基于启发式剪枝策略的粗略注意力对齐，导致视觉令牌的信息容量和密度出现瓶颈。针对这一局限，我们提出了VEN-VL，一种遵循“先丰富后压缩”原则的视觉集成MoE框架，用于高效感知。具体来说，我们首先通过统一不同视角的视觉表示来丰富信息容量，然后通过专门视觉专家中的自适应路由器逐步压缩信息以增强信息密度。此外，我们通过显式视觉监督融入原始结构的重建能力，促进关键信息的保留。实验结果表明，我们在使用少量信息压缩令牌的复杂视觉任务中具有优越性，有效弥合了性能与效率之间的差距。

英文摘要

Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on the high compression ratio of a single visual clue and reliance on the heuristic pruning strategy with coarse attention alignment incurs a bottleneck on the information capacity and density of visual tokens. Addressing this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception following the enrich then compact principle. Specifically, we first enrich the information capacity by unifying the visual representations of different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of vanilla structure via explicit visual supervision, facilitating crucial information preservation. Experimental results demonstrate our superiority in complex visual tasks with few information-condensed tokens, which effectively bridges the gap between performance and efficiency.

2605.25951 2026-05-26 cs.SD

Score-Agnostic Structure Analysis in Large-Scale Performance Datasets

大规模演奏数据集中的无乐谱结构分析

Patricia Hu, Silvan Peter, Gerhard Widmer

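AI summary: To address structural inconsistency in large-scale automatically transcribed piano performance datasets, proposes a score-agnostic grouping method based on sequence alignment and hierarchical clustering, using musical coherence rather than ground-truth accuracy as the evaluation criterion.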

Comments published at the Music Encoding Conference (MEC) 2026

AI中文摘要

近年来，得益于自动音乐转录（AMT）的进步，多个大规模自动转录钢琴独奏音乐数据集已发布。虽然这些数据集无疑为演奏研究提供了丰富的材料，但它们在质量上差异很大。在古典音乐中，演奏不仅在速度等表现方面不同，而且在乐谱的结构解释（包括重复模式和版本特定变体）上也存在差异。为了有意义地将大规模转录数据集用于演奏研究，同一作品的转录必须根据其底层结构实现进行分组，以支持有效比较。我们通过应用序列到序列比对后进行层次聚类来解决这个问题：我们为给定作品的所有转录对创建成对比对，并使用比对成本和演奏序列长度的（不）相似性来解决结构不匹配问题，作为分组的特征。我们提出这种方法作为自动评估缺乏真实乐谱和/或音频的大规模转录数据集的第一步，将评估标准从基于真实性的准确性转向音乐连贯性和合理性。我们在最近发布的大规模转录钢琴演奏数据集中约1,500个转录（涵盖88部作品）上展示了我们的无乐谱方法。

英文摘要

In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in quality. In the case of classical music, performances often differ not only in expressive aspects such as tempo, but also in their structural interpretation of the score (including repeat patterns and edition-specific variants). To meaningfully use large-scale transcribed datasets for performance research, transcriptions of the same piece must be grouped according to their underlying structural realisation to support valid comparison. We address this by applying sequence-to-sequence alignment followed by hierarchical clustering: we create pairwise alignments for all pairs of transcriptions of a given piece, and use the alignment cost and (dis)similarity of performed sequence lengths to resolve structural mismatches as features for grouping. We propose this approach as a first step towards automatically evaluating large-scale transcribed datasets that lack ground-truth score and/or audio, shifting the evaluation criterion from truth-based accuracy to musical coherence and plausibility. We demonstrate our score-agnostic approach on around 1,500 transcriptions of 88 compositions from a recently published large-scale transcribed piano performance dataset.
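The grouping idea in the abstract (pairwise alignment cost plus length dissimilarity, followed by hierarchical clustering) can be sketched on toy data. Everything specific here is an assumption for illustration: performances are reduced to strings of section labels, the alignment is a unit-cost edit distance, the two features are averaged with equal weight, and single linkage with a fixed threshold stands in for whichever hierarchical clustering variant the authors use.

```python
from itertools import combinations


def alignment_cost(a, b):
    """Unit-cost global alignment (Levenshtein distance) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


def dissimilarity(a, b):
    """Equal-weight mix of normalized alignment cost and relative length gap."""
    longest = max(len(a), len(b))
    cost = alignment_cost(a, b) / longest
    length_gap = abs(len(a) - len(b)) / longest
    return 0.5 * cost + 0.5 * length_gap


def single_linkage(items, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(items))]

    def dist(c1, c2):
        return min(dissimilarity(items[i], items[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: dist(clusters[pq[0]], clusters[pq[1]]))
        if dist(clusters[p], clusters[q]) > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return [sorted(c) for c in clusters]


# Toy "transcriptions" of one piece: with repeats, without repeats, partial repeats.
performances = ["AABBCC", "AABBCC", "ABC", "ABC", "AABBC"]
groups = single_linkage(performances, threshold=0.25)
print(groups)
```

On this toy input the performances that take the repeats end up in one group and those that skip them in another, which is the kind of structural grouping the abstract describes as a precondition for valid comparison.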

2605.25949 2026-05-26 cs.LG cs.AI physics.comp-ph

Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers

小模型，强先验：参数高效神经PDE求解器的架构归纳偏置

Shyam Sankaran, Hanwen Wang, Paris Perdikaris

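AI summary: Proposes the WaveLiT architecture, in which small models (1-10M parameters) exploit a wavelet-multiscale prior for parameter efficiency, compete with foundation models 100-1000 times larger on multiple PDE benchmarks, and show that the failure pattern of a prior is itself a useful signal.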

AI中文摘要

神经PDE求解器遵循视觉和语言的扩展轨迹，最近的基础模型达到数十亿参数。我们认为，在该领域中，规模不能很好地替代架构归纳偏置：结构化先验带来超高的参数效率，并且它们成功和失败的模式本身就能说明它们捕获了什么。我们通过WaveLiT实例化这一论点，该架构结合了用于无损多分辨率标记化的离散小波变换、增强的线性注意力块、共享权重的多尺度特征金字塔以及小波域辅助损失。定制的1-10M参数WaveLiT模型在八个TheWell基准测试中与规模大100-1000倍的基础模型竞争，在波动和声学主导的基准测试中增益最大，其中小波多尺度先验适合主导动力学结构，且小的每步误差在展开时不会几何级数地复合。在所有八个基准测试上联合训练后，一个10M参数的基础变体表现出结构化的、物理上可解释的迁移模式——在小波多尺度先验匹配动力学的地方最强，在混沌平流主导的流动中最弱。整个流水线在单个GPU上训练。结果表明，小模型PDE性能由架构归纳偏置而非规模决定，并且先验失败的结构是关于其内容的有用经验信号。

英文摘要

Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pattern of where they succeed and fail is itself informative about what they capture. We instantiate this argument in WaveLiT, an architecture combining a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000$\times$ their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern -- strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows. The entire pipeline trains on a single GPU. The results suggest that small-model PDE performance is shaped by architectural inductive bias rather than scale, and that the structure of a prior's failures is a useful empirical signal about its content.
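The phrase "lossless multi-resolution tokenization" rests on the fact that a discrete wavelet transform is invertible. The sketch below shows this for a 1D signal with an unnormalized Haar transform (pairwise averages and differences). The choice of Haar, the 1D setting, and all function names are assumptions for illustration; the abstract does not state which wavelet family WaveLiT uses or how coefficients are turned into tokens.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse) and half-differences (detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one level: each (average, half-difference) pair restores two samples."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out


def haar_decompose(signal, levels):
    """Repeatedly split the coarse band; assumes len(signal) is divisible by 2**levels."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


def haar_reconstruct(approx, details):
    """Rebuild the signal from the coarsest band and the per-level details."""
    for d in reversed(details):
        approx = haar_inverse_step(approx, d)
    return approx


signal = [4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 8.0, 6.0]
coarse, details = haar_decompose(signal, levels=2)
restored = haar_reconstruct(coarse, details)
print(coarse, details, restored)
```

The coarse band and the detail bands together hold exactly as many numbers as the input, and the reconstruction returns the original samples, so a model consuming these coefficients sees several resolutions without discarding information.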

2605.25947 2026-05-26 cs.CV

A Pedestrian-Vehicle Interaction Benchmark and Annotation Framework for Unstructured Scenes via Uncalibrated Cameras

非标定相机下的非结构化场景行人-车辆交互基准与标注框架

Haoyang Peng, Qian Hu, Songan Zhang, Ming Yang

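AI summary: To address the scarcity of pedestrian-vehicle interaction data in unstructured scenes, proposes an annotation framework based on uncalibrated surveillance video together with the PINNS dataset, which contains dense interaction trajectories and scene information across multiple countries and scenarios, to support trajectory prediction research in complex mixed traffic.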

Comments 10 pages, 8 figures; project page available at https://github.com/Songan-Lab

AI中文摘要

预测行人与车辆之间的交互对于非结构化和半结构化场景中的自动驾驶安全至关重要；然而，由于缺乏具有密集行人-车辆交互的公共数据集，这一任务受到严重阻碍。当前大多数研究依赖于结构化道路数据，导致非结构化环境中复杂的异质交互未能得到充分表示和研究。本文提出一种基于非标定监控摄像头视频数据的数据集标注框架，并推出PINNS（非结构化场景中非标定摄像头的行人-车辆交互数据集）。该数据集涵盖多个国家和地区，包含多样化的典型交通场景，并考虑了季节、光照条件和天气的变化。它聚焦于具有密集行人-车辆交互的复杂场景，并设计为易于扩展。数据集根据中国自动化学会发布的标准进行构建和标注，提供轨迹数据和相应的场景级信息。此外，本文分析了异质智能体轨迹预测的当前挑战和研究方向，展示了所提出数据集的必要性和实用性。我们希望我们的框架和数据集能够促进复杂混合交通场景中轨迹预测和自动驾驶的研究。PINNS数据集公开于https://github.com/Songan-Lab。

英文摘要

Predicting the interaction between pedestrians and vehicles is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction and shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at https://github.com/Songan-Lab.

2605.25944 2026-05-26 cs.CV cs.AI

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

EchoPilot: 通过尺度空间语义提示和可靠性门控记忆实现无训练超声视频分割

Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu

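AI summary: Proposes EchoPilot, a training-free ultrasound video segmentation framework that needs only a single point click and a category name, resolves initialization ambiguity with Scale-Space Semantic Prompting, reduces propagation drift with Reliability-Gated Memory, and achieves state-of-the-art performance on multiple datasets.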

Comments Early accepted to MICCAI 2026. Project page: https://keeplearning-again.github.io/EchoPilot/

AI中文摘要

超声视频分割在临床上具有重要价值，但由于散斑噪声、弱边界和快速解剖变形而困难。最近的可提示基础模型实现了点引导分割，但它们在超声中的直接部署仍然不可靠：单个点提供的空间上下文不足以解决尺度模糊性，贪婪的记忆更新会将早期错误放大为严重的时间漂移。我们提出了EchoPilot，一个在稀疏第一帧交互下进行超声视频分割的无训练框架，仅需单点点击和解剖类别名称。EchoPilot协调一个冻结的医学视觉语言模型（VLM）进行语义定位，一个视觉基础模型（VFM）进行密集几何特征提取，以及一个可提示视频分割器进行掩码预测和传播。为了解决初始化歧义，我们提出了尺度空间语义提示，首先通过无参数的S.E.E.D.（语义能量-熵密度）准则选择最佳上下文视图，然后从密集基础特征中合成几何精确的辅助点提示，无需额外用户交互。为了减少传播漂移，进一步引入了可靠性门控记忆更新，在不确定预测下选择性冻结分割器的记忆库，防止错误累积。我们还贡献了第一个动态胎儿胎盘超声视频分割数据集，包含671个标注帧。在三个超声视频数据集上，EchoPilot在稀疏交互设置下实现了最先进的性能，持续优于无训练基线和微调专家。

英文摘要

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
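The reliability-gated memory idea (freeze the memory bank when the current prediction is uncertain) can be sketched in a few lines. The abstract does not specify how reliability is measured, so the sketch assumes a hypothetical criterion: the mean per-pixel binary entropy of the predicted foreground probabilities, compared against a hypothetical threshold. The class, parameter names, and values are illustrative only and are not EchoPilot's implementation.

```python
import math


def mean_binary_entropy(probs):
    """Average per-pixel entropy (in bits) of foreground probabilities."""
    def h(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(h(p) for p in probs) / len(probs)


class GatedMemory:
    """Keeps the most recent frames whose predictions passed the reliability gate."""

    def __init__(self, max_entropy=0.5, capacity=4):
        self.max_entropy = max_entropy  # hypothetical uncertainty threshold
        self.capacity = capacity        # hypothetical memory bank size
        self.frames = []

    def maybe_update(self, frame_id, probs):
        """Store the frame only if its prediction is confident; otherwise freeze."""
        if mean_binary_entropy(probs) > self.max_entropy:
            return False  # memory unchanged, so an uncertain mask is not propagated
        self.frames.append(frame_id)
        self.frames = self.frames[-self.capacity:]
        return True


memory = GatedMemory()
confident = [0.99, 0.01, 0.98, 0.02]  # toy mask probabilities, near 0 or 1
uncertain = [0.50, 0.60, 0.40, 0.55]  # toy mask probabilities, near 0.5
accepted = [
    memory.maybe_update(0, confident),
    memory.maybe_update(1, uncertain),
    memory.maybe_update(2, confident),
]
print(accepted, memory.frames)
```

Skipping the update for the uncertain frame means later frames are conditioned only on masks the model was confident about, which is the mechanism the abstract credits with preventing error accumulation.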
