arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2606.17905 2026-06-17 cs.CL New submission

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai, Xueyan Niu

Affiliations: College of Mathematics, Sichuan University; Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd

AI summary: Proposes ChLogic, an English-Chinese aligned benchmark built from formal logical templates, to test how robust model logical reasoning is across English and multiple Chinese expressions. It finds an English-Chinese performance gap, and the effect of back-translation varies by dataset and model.

Abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English--Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

2606.17904 2026-06-17 cs.AI New submission

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

Guillermo Gil de Avalle, Laura Maruster, Shaina Raza, Christos Emmanouilidis

Affiliations: University of Groningen; Vector Institute for Artificial Intelligence

AI summary: Proposes the DiagFlowBench benchmark, comprising 1,676 multi-turn conversations converted from 50 industrial diagnostic flowcharts, and evaluates 10 models on recognising off-procedure inputs. Models often select a real but inappropriate step rather than fabricating facts.

Abstract

Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation to constrain them to approved steps. In practice, however, operator queries frequently stray from this path, requiring models to recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer manufacturer converted into 1,676 multi-turn conversations that contrast compliant with out-of-scope utterances. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inadequate step rather than fabricating facts. The inherent plausibility and authority of this mapped but wrong advice exposes a challenging vulnerability for grounding systems.

2606.17897 2026-06-17 cs.AI cs.RO New submission

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

Xiaodan Shi

Affiliations: Department of Computer and Systems Sciences, Stockholm University

AI summary: Proposes Learn to Cluster, a probabilistic latent-variable generative model that learns social interaction patterns from trajectory observations without labels and integrates them into pedestrian trajectory prediction.

Abstract

Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as self-driving cars and social robots) to avoid collisions and plan well. Although current research takes social interactions into account for prediction, it does not reveal the exact kinds of social interactions that occur among people or how those interactions affect pedestrians' decision-making, which limits its robustness. Social interactions in pedestrian walking are intuitively abundant and hard to label and quantify. In this paper, we explore how to quantify and interpret the way pedestrians interact with others by proposing Learn to Cluster. Our social-interaction clustering is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments on several trajectory prediction benchmarks demonstrate that our method learns patterns of social interaction and effectively integrates them into pedestrian trajectory prediction.

2606.17890 2026-06-17 cs.CL New submission

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu, Shasha Guo, Zenghao Duan, Jiahao Liu, Jingang Wang, Huawei Shen, Xueqi Cheng

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Targets overthinking in GRPO reinforcement learning, where models keep generating unnecessary reasoning after reaching the correct answer. Proposes Dynamic Rollout Editing (DRE), which edits the thinking that follows answer emergence in successful trajectories to weaken the preference signal for unnecessary thinking; experiments show its effectiveness.

Comments: 21 pages, 10 figures, 2 tables

Abstract

Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
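The credit-assignment point in this abstract can be illustrated with a minimal sketch. It assumes the commonly used group-relative advantage (reward minus the group mean, divided by the group standard deviation) copied to every token of a rollout; the function name and the toy rollouts are hypothetical, and this is not the authors' DRE implementation.

```python
from statistics import mean, pstdev

def group_relative_token_advantages(rewards, lengths, eps=1e-6):
    """Return one advantage per token for each rollout in a group.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    copied unchanged to every token of the rollout (sequence-level credit).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [[(r - mu) / (sigma + eps)] * n for r, n in zip(rewards, lengths)]

# Hypothetical group: two correct rollouts (reward 1) and two incorrect ones (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
lengths = [5, 9, 6, 7]  # the second rollout keeps "thinking" for 4 extra tokens
advantages = group_relative_token_advantages(rewards, lengths)

# Every token of the long successful rollout, including those produced after the
# answer appeared, gets the same positive advantage as the short successful one.
print(advantages[1][0] == advantages[1][-1] == advantages[0][0])
```

Under this assumption the update cannot tell the solution-reaching prefix from the unnecessary continuation, which is the imbalance the abstract describes.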

2606.17889 2026-06-17 cs.LG cs.AI cs.NE New submission

Dimensionality Controls When Modularity Helps in Continual Learning

Kathrin Korte, Christian Medeiros Adriano, Joachim Winther Pedersen, Eleni Nisioti, Sebastian Risi

Affiliations: IT University of Copenhagen, Denmark; Hasso Plattner Institute, University of Potsdam, Germany

AI summary: Studies how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning. In the low-dimensional "rich" regime modular structure becomes decisive, whereas in the high-dimensional "lazy" regime it has little impact.

Comments: Accepted to the 2nd Workshop on Compositional Learning (CompLearn) at ICML 2026, Seoul, South Korea. 8 pages, 5 figures

Abstract

Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional "lazy" regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional "rich" regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.

2606.17888 2026-06-17 cs.AI New submission

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

Wanshi Xu, Haokun Zhao, Haidong Yuan, Songjun Cao, Long Ma

Affiliations: School of ECE, Peking University; College of Computer Science and Artificial Intelligence, Fudan University; School of Software and Microelectronics, Peking University; Tencent Youtu Lab

AI summary: Proposes the MathVis-Fine framework, which builds a dataset with fine-grained visual annotations and uses two-stage progressive training to balance answer-correctness and visual-grounding rewards according to each sample's degree of visual dependency, improving supervision accuracy for multimodal mathematical reasoning.

Abstract

Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios; however, existing approaches often treat visual inputs as homogeneous or auxiliary signals, failing to capture the intricate and sample-specific dependencies between text and images in mathematical problem-solving. This gives rise to two core issues: first, the supervisory signals for visual content are generalized and coarse-grained, lacking adaptation to the actual necessity of visual information in each sample; second, training feedback becomes inaccurate when visual rewards are uniformly applied without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, augmenting fine-grained visual annotations with visual dependency ratings. Building upon this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework effectively enhances visual perception progressively based on visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.

2606.17882 2026-06-17 cs.AI New submission

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

Przemysław Andrzej Wałęga, Bernardo Cuenca Grau

Affiliations: Queen Mary University of London; University of Oxford

AI summary: Studies, from a semantic perspective, the logical expressiveness of GNN classifiers that are preserved under structural properties (embeddings, injective homomorphisms, homomorphisms). Shows that each preservation property corresponds to a fragment of graded modal logic and gives a GNN architecture of the same expressiveness.

Comments: 20 pages

Abstract

Bridges between graph neural networks (GNNs) and logical formalisms have been established by fixing architectural choices, such as the types of aggregation, combination, and activation functions. These choices define restricted classes of GNNs for which tight correspondences with logical formalisms can be obtained, by showing that logical formulae can be translated into equivalent GNNs and, conversely, that GNNs can be translated into equivalent formulae. In this paper we take a semantic perspective by establishing the logical expressiveness of classes of GNN classifiers that are preserved under structural properties: embeddings (extensions), injective homomorphisms, and homomorphisms. We show that, for each such property, there exists a fragment of graded modal logic characterising the class of GNNs. In particular, preservation under embeddings, injective homomorphisms, and homomorphisms corresponds to existential graded modal logic, its existential-positive fragment, and existential-positive modal logic, respectively. These results characterise the expressiveness of broad classes of GNNs independently of specific architectural choices, but we also show that each of these classes admits a GNN architecture of the same expressiveness. Technically, our approach uses a new well-quasi-order result for trees of bounded height, yielding finite representations of unravelling-invariant classes.

2606.17874 2026-06-17 cs.CV cs.LG New submission

Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations

Takaya Kawakatsu

Affiliations: Preferred Networks, Inc.

AI summary: Addresses the loss of global consistency caused by order-dependent cell representations in autoregressive multi-task table recognition. Produces order-independent cell features through non-causal attention, enabling parallel inference, with gains in cell localization and end-to-end recognition on two large datasets and roughly threefold shorter inference time.

Comments: ICDAR 2026

Abstract

Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.
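The contrast between causal and non-causal attention can be sketched as follows. This is a toy single-head dot-product attention without learned projections, written only to illustrate order dependence; the function names and the three "cell" vectors are hypothetical, and this is not the paper's refinement module.

```python
import math

def self_attention(tokens, causal):
    """Toy dot-product self-attention with no learned weights.

    With causal=True, token i only sees tokens 0..i, as in autoregressive decoding.
    """
    outputs = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1] if causal else tokens
        logits = [sum(q * k for q, k in zip(query, key)) for key in visible]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        dim = len(query)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) / total for d in range(dim)]
        )
    return outputs

def same(a, b):
    """Element-wise comparison with a floating-point tolerance."""
    return all(math.isclose(x, y, abs_tol=1e-9) for u, v in zip(a, b) for x, y in zip(u, v))

cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical cell features
reordered = cells[::-1]

# Undo the reordering on the outputs so that each row refers to the same cell.
non_causal_stable = same(self_attention(cells, False), self_attention(reordered, False)[::-1])
causal_stable = same(self_attention(cells, True), self_attention(reordered, True)[::-1])
print(non_causal_stable, causal_stable)
```

In this toy setting the non-causal features of a cell do not depend on where the cell sits in the sequence, while the causal features do, which is the order dependence the abstract refers to.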

2606.17872 2026-06-17 cs.LG cs.AI New submission

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

Ning Ni, Yingjie Lao

Affiliations: Department of Computer Science, Tufts University; Department of Electrical and Computer Engineering, Tufts University

AI summary: Proposes AnchorKV, a KV cache compression method that uses a soft penalty to bias token retention scores away from key-space directions associated with harmful prompts, substantially improving safety alignment at a small cost in utility.

Abstract

Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since scaling pre-trained language models improves downstream capability \cite{zhao2023survey}, the key-value (KV) cache becomes a dominant inference bottleneck. Recent KV cache compression methods \cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv} reduce this cost by retaining only a subset of attention-relevant tokens. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks \cite{jiang2024robustkv} or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from directions in key space associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach \cite{arditi2024refusal,zou2023representation} to the layer-specific key projection space used in KV caching. Based on this anchor, a soft penalty token selection rule trades a small amount of utility for substantially improved safety alignment, while reducing to the original compressor when the penalty is zero.
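A minimal sketch of the two ingredients named in this abstract, a difference-of-means anchor and a soft-penalty selection rule, might look like the code below. The abstract does not give the exact scoring formula, so the cosine-similarity penalty, the coefficient `lam`, the function names, and the toy key vectors are all assumptions made for illustration.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def safety_anchor(harmful_keys, benign_keys):
    """Difference-of-means direction: mean(harmful) - mean(benign)."""
    h, b = mean_vector(harmful_keys), mean_vector(benign_keys)
    return [hi - bi for hi, bi in zip(h, b)]

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_tokens(keys, scores, anchor, budget, lam):
    """Keep `budget` tokens with the highest penalized retention score.

    Assumption: penalized score = base score - lam * cos(key, anchor).
    """
    penalized = [s - lam * cosine(k, anchor) for k, s in zip(keys, scores)]
    ranked = sorted(range(len(keys)), key=lambda i: penalized[i], reverse=True)
    return sorted(ranked[:budget])

harmful_keys = [[1.0, 0.0], [0.8, 0.2]]  # hypothetical keys from harmful prompts
benign_keys = [[0.0, 1.0], [0.2, 0.8]]   # hypothetical keys from benign prompts
anchor = safety_anchor(harmful_keys, benign_keys)

keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.1], [0.1, 0.9]]
scores = [0.9, 0.5, 0.8, 0.4]            # retention scores from some base compressor

print(select_tokens(keys, scores, anchor, budget=2, lam=0.0))  # [0, 2]
print(select_tokens(keys, scores, anchor, budget=2, lam=1.0))  # [1, 3]
```

With `lam=0.0` the selection is exactly that of the base scores, mirroring the abstract's statement that the rule reduces to the original compressor when the penalty is zero.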

2606.17871 2026-06-17 cs.AI New submission

StepGuard: Guarding Web Navigation via Single-Step Calibration

Zhihao Cui, Yuchen Zhang, Xiyang Sun, Yaxiong Wang, Li Zhu, Jinpeng Hu, Liu Liu, Mengjia Li, Yujiao Wu

Affiliations: School of Software Engineering, Xi’an Jiaotong University; School of Computer Science and Information Engineering, Hefei University of Technology; Xiamen University; Zhejiang Lab; CSIRO

AI summary: Addresses single-step fragility in web navigation with the StepGuard framework, which uses Dynamic Dual-Policy Optimization (DDPO) to mitigate reward conflict and Confidence-Guided Adaptive Navigation Reflection (CANR) to calibrate single-step errors, significantly improving navigation and answer accuracy.

Abstract

Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle the reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for question-answering to mitigate reward conflict. To calibrate the single-step error, we propose Confidence-Guided Adaptive Navigation Reflection (CANR), a mechanism that estimates per-step confidence, triggers reflection only when necessary, and uses contrastive rewards to encourage self-correction to calibrate the single-step inaccuracy. With the above as the main components, we finally develop our StepGuard, a new framework of Guarding Web Navigation via Single-Step Calibration. Experiments demonstrate that our approach significantly improves navigation and answer accuracy, setting new state-of-the-art performance on standard web navigation benchmarks.

2606.17867 2026-06-17 cs.CV cs.AI New submission

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

Antonio Scardace, Daniele Ravì

Affiliations: Department of Mathematics and Computer Science, University of Catania; Department MIFT, University of Messina

AI summary: Integrates tau-PET, structural MRI, cognitive scores, and APOE4 data to quantify redundancy and predictive dependencies among multimodal biomarkers, relate tau topologies to atrophy, and decompose the tau-cognition association, improving the interpretability and selection of AD biomarkers.

Comments: Accepted to ICTS4eHealth 2026

Abstract

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

2606.17861 2026-06-17 cs.CL New submission

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xinyuan Wang, Tianyi Bai, Ziniu Li, Benyou Wang

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Hunyuan Team, Tencent; USTB; DualverseAI; SJTU; NUS

AI summary: Proposes GameCraft-Bench, a benchmark that evaluates coding agents on generating playable games end-to-end in the Godot engine. The strongest agent scores only 41.46%.

Abstract

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

2606.17858 2026-06-17 cs.LG New submission

Meta-classification of one-class classification models using ranking correlation and nearest neighbor

Toshitaka Hayashi, Hamido Fujita, Dalibor Cimr, Richard Cimler, Jitka Kühnová

Affiliations: Faculty of Science, University of Hradec Kralove; Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia; Regional Research Center, Iwate Prefectural University

AI summary: Proposes meta-classification of one-class classification models using ranking correlation and nearest neighbor. Experiments show high accuracy when class labels are datasets, and algorithms can be classified when the training datasets contain the same class; the task is essentially dataset classification.

Abstract

Machine Learning (ML) techniques have been applied to various problems. However, applying ML to ML models is an unexplored direction. For this purpose, this paper considers a meta-classification of one-class classification (OCC) models, because all ML models could be approximated as OCC models. The proposal represents OCC models as normality rankings and classifies them using nearest-neighbor and ranking-correlation metrics. The experiment classifies OCC models, where classes correspond to training datasets, algorithms, and hyperparameters. The proposal achieves high accuracy when class labels are datasets. Moreover, it can classify algorithms when the training datasets contain the same class. In addition, the discussion highlights that the classification of OCC models is essentially the classification of datasets that treats multiple samples as a single input. The experiment demonstrates the classification of datasets using sleeping records. The proposed method can provide a unified solution for classifying OCC models, datasets, and rankings. Source code is uploaded to the public repository https://github.com/ToshiHayashi/ClassOCC.
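The idea of representing each OCC model as a normality ranking and classifying it with nearest neighbor can be sketched as below. The sketch assumes that all models are scored on one shared probe set, that Spearman correlation is the ranking-correlation metric, and that scores have no ties; the function names, labels, and scores are hypothetical and do not come from the paper's repository.

```python
def ranks(scores):
    """Rank positions (0 = lowest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order):
        result[index] = position
    return result

def spearman(a, b):
    """Spearman rank correlation for two score lists without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

def nearest_label(query, labelled):
    """1-nearest-neighbour over models, using Spearman correlation as similarity."""
    return max(labelled, key=lambda item: spearman(query, item[0]))[1]

# Hypothetical normality scores of three known models on five shared probe samples.
labelled = [
    ([0.9, 0.8, 0.7, 0.2, 0.1], "dataset_A"),
    ([0.8, 0.9, 0.6, 0.3, 0.2], "dataset_A"),
    ([0.1, 0.2, 0.3, 0.8, 0.9], "dataset_B"),
]
query = [0.2, 0.1, 0.4, 0.7, 0.95]  # scores from an unlabelled model
print(nearest_label(query, labelled))  # dataset_B
```

Only the ordering of the probe samples matters here, so models with very different score scales can still be compared.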

2606.17856 2026-06-17 cs.AI New submission

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He

Affiliations: East China Normal University; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes the FlowRAG framework, which builds a quad-level heterogeneous graph and uses dual-granularity activation and a frequency-aware weighted flow module to strengthen semantic recall and extract explicit reasoning paths, achieving state-of-the-art performance on complex reasoning benchmarks.

Abstract

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose FlowRAG, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, FlowRAG constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary--query alignment with sentence-level matching to activate relevant entities under paraphrase and abstraction robustly. We then introduce a frequency-aware weighted flow module that routes relevance through entity--passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that FlowRAG obtains state-of-the-art performance on complex reasoning benchmarks.

2606.17851 2026-06-17 cs.AI cs.LO New submission

A homotopy-type-theoretic generalization of neurosymbolic inference

Fernando Zhapa-Camacho, Robert Hoehndorf

Affiliations: King Abdullah University of Science and Technology; KAUST Center of Excellence for Smart Health (KCSH); KAUST Center of Excellence for Generative AI

AI summary: Replaces sets with types in the sense of homotopy type theory, generalizing the belief-weighted sum computed by neurosymbolic systems to a belief-weighted homotopy cardinality that retains symmetries and proof multiplicity. Proves that the classical functional is recovered when symmetries are trivial, and shows that the exposed symmetry is the one behind reasoning shortcuts.

Abstract

A wide range of neurosymbolic (NeSy) systems compute one functional: a belief-weighted sum of a logical quantity over a space of $σ$-structures, of which weighted model counting, fuzzy logic, and probabilistic logic are special cases. This account is built on sets, and a set deliberately forgets two things that are important for NeSy: when two $σ$-structures are the same up to a symmetry of the theory, and how many distinct proofs witness a query. Replacing the underlying sets by types, in the sense of homotopy type theory, preserves this information, and turns this functional into a belief-weighted homotopy cardinality, a notion of size that counts each object in inverse proportion to its symmetries. We develop the framework from scratch for NeSy systems, prove a conservativity theorem that recovers the classical functional when symmetries are trivial, and show that the symmetry our framework exposes is exactly the one behind reasoning shortcuts. The payoff is concrete: the shortcut-aware concept posterior that recent methods reach by ensembling or expressive density estimation is the only symmetry-invariant point of the confusion-set simplex, computable in closed form by averaging a single model over the symmetry group. On MNIST reasoning-shortcut benchmarks this single-model wrapper is better calibrated than a diversity-trained ensemble, while leaving label accuracy and identifiable concepts untouched. Code is freely available at https://github.com/bio-ontology-research-group/hott-nesy.
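A worked toy computation may help make "counts each object in inverse proportion to its symmetries" concrete. The sketch below assumes the standard groupoid cardinality of a finite group action (one term 1/|stabilizer| per orbit) and leaves out the belief weighting; the two-element swap group and the function names are illustrative assumptions, not the paper's code.

```python
from fractions import Fraction
from itertools import product

def homotopy_cardinality(points, group, act):
    """Sum of 1/|stabilizer| over one representative per orbit."""
    seen, total = set(), Fraction(0)
    for x in points:
        if x in seen:
            continue
        orbit = {act(g, x) for g in group}
        seen |= orbit
        stabilizer_size = sum(1 for g in group if act(g, x) == x)
        total += Fraction(1, stabilizer_size)
    return total

# Toy space: pairs of binary concept values, with a symmetry that swaps the two slots.
points = list(product([0, 1], repeat=2))
group = ["identity", "swap"]

def act(g, x):
    """Apply a group element to a pair."""
    return x if g == "identity" else (x[1], x[0])

print(homotopy_cardinality(points, group, act))
```

Here the four points fall into three orbits with stabilizers of size 2, 2, and 1, so the total is 1/2 + 1/2 + 1 = 2, which equals |X|/|G| = 4/2. With a trivial group every stabilizer has size 1 and the count reduces to the ordinary set cardinality.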

2606.17847 2026-06-17 cs.AI cs.LG 新提交

WallZero: Mastering the Game of WallGo with Strategic Analysis

WallZero:通过战略分析掌握WallGo游戏

Hsing-Yu Chen, Jérôme Arjonilla, I-Chen Wu, Ti-Rong Wu

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) Academia Sinica(中央研究院)

AI总结 提出基于AlphaZero的WallZero智能体,通过定制动作和特征设计,在WallGo游戏中击败职业围棋选手,并分析游戏公平性与关键策略。

Comments Accepted by the Computers and Games conference (CG 2026)

详情
AI中文摘要

WallGo是一种新近出现的策略棋盘游戏,因2025年Netflix剧集《The Devil's Plan》而流行。尽管棋盘只有7x7大小,但棋子移动与墙壁放置的组合带来了很高的博弈树复杂度和复杂的策略交互。尽管日益流行,WallGo仍未得到充分研究。本文提出了WallZero,一个面向双人WallGo对局、基于AlphaZero的智能体。我们引入了定制的动作和特征设计,显著提升了对弈水平。在评估中,WallZero击败了参与本研究的两位职业围棋选手,平均每局获得1.98倍的地盘。除棋力之外,我们还使用WallZero评估游戏公平性,并识别掌握WallGo的关键策略。有趣的是,我们的结果显示,Netflix剧集中使用的开局能产生更平衡的对局。我们的代码可在 https://rlg.iis.sinica.edu.tw/papers/wallzero 获取。

英文摘要

WallGo is a recently introduced strategic board game popularized by the 2025 Netflix series The Devil's Plan. Although played on a small 7 x 7 board, its combination of stone movement and wall placement yields high game-tree complexity and intricate strategic interactions. Despite its growing popularity, WallGo remains underexplored. This paper presents WallZero, an AlphaZero-based agent for the two-player WallGo setting. We introduce tailored action and feature designs to improve playing performance significantly. In the evaluation, WallZero defeats two professional Go players who participated in this study, securing on average 1.98x more territory per game. Beyond its strength, we use WallZero to assess game fairness and identify key strategies for mastering WallGo. Interestingly, our results show that the opening used in the Netflix series yields a more balanced game. Our code is available at https://rlg.iis.sinica.edu.tw/papers/wallzero.

2606.17839 2026-06-17 cs.RO cs.HC 新提交

From Ad Hoc Pilots to Repeatable Patterns: Structuring Drone Collaboration in Emergency Services with DroneLets

从临时试点到可重复模式:用DroneLets结构化紧急服务中的无人机协作

Dzmitry Katsiuba, Samuel Brander, Mateusz Dolata, Gerhard Schwabe

发表机构 * University of Zurich(苏黎世大学) Zeppelin University(齐柏林大学)

AI总结 本文通过实地试验和访谈,提炼出44种交互模式并引入DroneLets设计构件,以结构化的方式实现紧急服务中无人机协作的可重复和可扩展。

Comments Presented at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/is_transformwork/is_transformwork/19/

Journal ref International Conference on Information Systems 2025: ICIS2025-2217

详情
AI中文摘要

无人机有望支持紧急服务,但其融入工作流程的方式仍是临时性的,且需要大量协调。本文探讨两个研究问题:紧急团队希望如何与无人机协作,以及如何将这些协作形式化为可重复的过程。基于四次实地试验和95次访谈,我们推导出44种交互模式,分为10个元模式,反映了侦察、通信和后勤支持等操作需求。为了使这些实践结构化,我们引入了DroneLets,这是一类新的设计构件,将协作工程(Collaboration Engineering,CE)扩展到具身代理。DroneLets捕获设置要求、无人机能力、环境约束以及人类和无人机参与者之间的协调行动。它们提供了一个模块化框架,用于设计紧急服务中可重复、可扩展的协作过程,并通过向旁观者广播和火灾后监测等模式加以说明。这项工作扩展了协作工程的范围,并为将自主无人机集成到高风险现场操作中提供了结构化基础。

英文摘要

Drones hold promise for supporting emergency services, but their integration into workflows remains ad hoc and coordination-intensive. This paper addresses two research questions: how emergency teams want to collaborate with drones, and how to formalize these collaborations into repeatable processes. Based on four field trials and 95 interviews, we derive 44 interaction patterns grouped into 10 meta-patterns reflecting operational needs such as reconnaissance, communication, and logistical support. To structure these practices, we introduce DroneLets, a new class of design artifacts that extend Collaboration Engineering (CE) to embodied agents. DroneLets capture setup requirements, drone capabilities, environmental constraints, and coordinated actions across human and drone actors. They offer a modular framework for designing repeatable, scalable collaboration processes in emergency services, illustrated through patterns such as broadcasting to bystanders and post-fire monitoring. This work expands the scope of CE and provides a structured foundation for integrating autonomous drones into high-stakes field operations.

2606.17838 2026-06-17 cs.CL 新提交

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

面向LLM游戏智能体的环境驱动自动提示优化

Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer

发表机构 * Lamarr institute for ML and AI(拉马尔机器学习与人工智能研究所) TU Dortmund University(多特蒙德工业大学) Leibniz University Hannover(莱布尼茨汉诺威大学) L3S Research Center(L3S研究中心)

AI总结 提出一种自动提示优化框架,将观察-动作管道分解为描述器和选择器,通过环境回报驱动的进化循环迭代优化提示,在BabyAI任务中显著提升成功率。

详情
AI中文摘要

交互环境中的LLM智能体对其提示高度敏感,但提示工程仍然是手动的、特定于任务的过程。我们为LLM智能体引入了一个自动提示优化框架,该框架将观察到动作的管道分解为一个目标条件描述器智能体和一个动作选择智能体,并通过由环境回报引导的LLM驱动进化循环迭代地优化每个模块的提示。我们提出一个行为分析器,将回合(episode)结果归因于特定的提示组件,以及一个变异器,对提示提出有针对性的修订,随后通过环境推演(rollout)对这些修订进行验证。我们在BALROG基准测试中的所有五个BabyAI任务上进行了评估,在普通和引导两种提示初始化下,将我们的管道与BALROG的RobustCoTAgent进行了比较。优化在各任务和条件下一致地提高了性能,且无需更新模型权重。在PutNext(一个多步协调任务,RobustCoTAgent的成功率为0%)上,我们的框架使用相同的底层LLM和优化后的提示达到了高达72.5%的成功率。这些结果表明,多智能体框架结合自动提示优化,无需微调或大量人工监督即可增强LLM。

英文摘要

LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-conditioned descriptor agent and an action selection agent, and iteratively refines each module's prompt through an LLM-driven evolutionary loop guided by environment returns. We propose a behavior analyzer to attribute episode outcomes to specific prompt components, and a mutator to propose targeted revisions to the prompt, before validating them through environment rollouts. We evaluate on all five BabyAI tasks in the BALROG benchmark, comparing our pipeline against BALROG's RobustCoTAgent under both plain and guided prompt initializations. Optimization improves performance consistently across tasks and conditions, without requiring updates to the model weights. On PutNext, a multi-step coordination task where the RobustCoTAgent achieves 0% success, our framework reaches up to 72.5% success rate using the same underlying LLM with optimized prompts. These results suggest that a multi-agent framework, combined with automatic prompt optimization, enhances LLMs without the need for fine-tuning or extensive human supervision.
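The propose-then-validate loop described above can be sketched as a simple accept-if-better search. This is a minimal illustration under stated assumptions, not the authors' implementation: `optimize_prompt`, `toy_propose`, and `toy_return` are hypothetical names, the mutator is a fixed list of hints rather than an LLM, and the "environment return" is a keyword count standing in for real rollouts.

```python
from typing import Callable, Tuple

def optimize_prompt(
    prompt: str,
    propose: Callable[[str, int], str],
    rollout_return: Callable[[str], float],
    steps: int,
) -> Tuple[str, float]:
    """Keep a proposed prompt revision only if its measured return improves."""
    best, best_score = prompt, rollout_return(prompt)
    for step in range(steps):
        candidate = propose(best, step)      # mutator suggests a revision
        score = rollout_return(candidate)    # validate it in the environment
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins (hypothetical): a real system would call an LLM mutator
# and run environment rollouts here.
HINTS = ["Describe the goal object.", "Be verbose.", "List reachable actions."]

def toy_propose(prompt: str, step: int) -> str:
    return prompt + " " + HINTS[step % len(HINTS)]

def toy_return(prompt: str) -> float:
    return float(sum(key in prompt for key in ("goal", "actions")))

best_prompt, best_score = optimize_prompt(
    "You are an agent.", toy_propose, toy_return, steps=3
)
print(best_prompt, best_score)
```

In this toy run the unhelpful hint is rejected because it does not raise the return, which is the role the abstract assigns to rollout validation.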

2606.17836 2026-06-17 cs.CV cs.AI cs.CG cs.GR 新提交

High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach

高保真盆腔器官MRI三维几何重建:一种混合深度学习与迭代优化方法

Hui Wang, Xiaowei Li, Chenxin Zhang, Yifan Feng, Jianwei Zuo, Yumeng Tang, Xiuli Sun, Jianliu Wang, Bing Xie, Jiajia Luo

发表机构 * Institute of Medical Technology, Peking University Health Science Center, Peking University(北京大学医学部医学技术研究院,北京大学) Biomedical Engineering Department, Institute of Advanced Clinical Medicine, Peking University(北京大学先进临床医学研究院生物医学工程系) Department of Obstetrics and Gynecology, Peking University People’s Hospital(北京大学人民医院妇产科)

AI总结 提出混合可变形形状建模框架,结合深度学习预测与迭代优化,实现膀胱、子宫和直肠的高保真三维几何重建,在几何保真度和网格质量上优于现有方法。

详情
AI中文摘要

从MRI中患者特定的盆腔器官几何三维重建对于盆底建模和下游患者特定分析至关重要。然而,以往研究主要关注图像分割或三维模型的下游使用,高保真、高质量几何的重建仍然劳动密集且缺乏标准化。本研究引入了一种混合可变形形状建模框架,将深度学习预测与迭代优化相结合,用于膀胱、子宫和直肠的重建。该框架包含三个核心组件:一种保持盆腔器官拓扑一致性的几何感知多级深度学习架构;一种平衡全局形状捕获和局部表面细化的两阶段摊销优化训练策略;以及一种整体协同机制——在训练阶段,迭代优化为深度学习提供监督,而在推理阶段,深度学习快速预测全局器官形态,随后通过迭代优化细化局部表面和网格质量。该框架在几何保真度上显著优于当前主流的基于深度学习的器官重建模型。对于各个解剖结构,重建的膀胱、直肠和子宫三维几何实现了显著更低的Chamfer距离值和更高的Dice相似系数分数。此外,在保持高计算效率的同时,所提出的架构产生了优越的整体体积网格质量。在患者层面,该框架在minSICN和minSIGE的10个最差元素上均获得了比传统几何后处理算法更高的平均值。

英文摘要

Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study introduces a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism in which iterative optimization provides supervision for deep learning during the training phase, while during inference deep learning rapidly predicts the global organ morphology and iterative optimization then refines local surfaces and mesh quality. The framework demonstrated markedly higher geometric fidelity than current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
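For readers unfamiliar with the two fidelity metrics named in the abstract, the sketch below shows one common way to compute them. Conventions differ between papers (squared versus unsquared distances, summing versus averaging the two directions), and the abstract does not say which variant is used, so the definitions here are assumptions: a symmetric Chamfer distance using mean unsquared nearest-neighbour distances, and Dice over sets of occupied voxel indices.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (lists of 3D tuples).

    Uses the mean unsquared nearest-neighbour distance in each direction
    and adds the two directions together.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def dice(x, y):
    """Dice similarity coefficient between two sets of voxel indices."""
    x, y = set(x), set(y)
    return 2 * len(x & y) / (len(x) + len(y))

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(pts, pts))            # identical surfaces
print(dice({1, 2, 3}, {2, 3, 4}))            # partial volume overlap
```

Lower Chamfer distance and higher Dice both indicate a closer match, which is the direction of improvement the abstract reports.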

2606.17835 2026-06-17 cs.CL cs.AI eess.AS 新提交

Perceptual compensation for tonal context in self-supervised speech models

自监督语音模型中对声调上下文的感知补偿

James Kirby, Ioana Krehan, Michele Gubian

发表机构 * Institute for Phonetics and Speech Processing, LMU Munich(慕尼黑大学语音学与语音处理研究所)

AI总结 通过伪复制普通话声调的感知补偿实验,比较纯自监督预训练模型和微调模型,发现纯预训练模型无补偿证据,而微调模型有部分补偿但未达到人类水平,表明监督目标可能对抽象某些音韵规律是必要的。

Comments Accepted for publication at Interspeech 2026

详情
AI中文摘要

本研究考察了wav2vec2.0架构在多大程度上表现出对音韵上下文的补偿证据。我们对普通话声调进行了感知补偿实验的伪复制,并比较了纯自监督预训练模型和针对普通话ASR微调模型之间的嵌入相似度和探测分类器输出。在纯预训练模型的嵌入相似度中没有发现补偿证据。探测分类器除了预期的逐层分类改进外,还显示出一些补偿证据,但未能复制人类在孤立测试音节上的表现。我们的发现与先前仅通过预训练就能产生对音韵结构敏感性的报告形成对比,并表明监督目标可能是鼓励至少某些类型的音韵规律抽象所必需的。

英文摘要

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

2606.17833 2026-06-17 cs.RO 新提交

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

HumanoidArena: 以自我为中心的层级全身学习基准

Taowen Wang, Zikang Xie, Bin Yang, Yunheng Wang, Zizhao Yuan, Yuetong Fang, Yixiao Feng, Yichi Wang, Xingyu Chen, Haodong Chen, Qiwei Wu, Weisheng Xu, Lihan Chen, Lusong Li, Zecui Zeng, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Beijing University of Technology(北京工业大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen MSU-BIT University(深圳北理莫斯科大学) JD Explore Academy(京东探索研究院)

AI总结 提出HumanoidArena基准,通过层级控制(高层策略输出全身动作,低层通用运动跟踪器执行)解决人形机器人全身交互学习问题,设计7个腿部关键任务评估策略的泛化与迁移能力。

Comments 29 pages, 13 figures, 10 tables

详情
AI中文摘要

人形机器人有望在人类中心环境中实现全身交互,但由于任务级决策与全身动态执行紧密耦合,可扩展的策略学习仍然困难。一个实用的解决方案是层级控制,其中高层策略预测中间全身动作,低层通用运动跟踪器(GMT)将其执行为稳定的人形运动。然而,现有基准很少评估策略-跟踪器接口本身,因此尚不清楚中间全身动作是否可执行、在任务分布变化下是否鲁棒以及是否可跨不同GMT后端迁移。我们引入HumanoidArena,一个以自我为中心的层级全身学习的仿真优先基准。该基准将策略学习形式化为一个层级决策问题:高层策略将自我中心视觉、本体感觉和指令转换为紧凑的全身动作,随后由低层GMT执行。HumanoidArena不将腿部视为平面运输工具,而是强调下肢协调在任务完成中结构上必要的交互。因此,我们设计了7个腿部关键的人-物交互/人-场景交互(HOI/HSI)任务,其中成功需要足部放置、平衡维持、姿势调整和全身重新定向。为了进一步诊断层级系统,我们从两个互补角度评估策略:扰动条件泛化和GMT条件迁移。实验表明,层级控制使学习策略能够解决多样的腿部关键交互,但性能强烈依赖于跟踪器,且跨GMT迁移仍然脆弱。这些结果使HumanoidArena成为研究可迁移中间动作表示和可扩展的自我中心全身策略学习的基准。

英文摘要

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary in task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.

2606.17831 2026-06-17 cs.RO cs.HC 新提交

Accountability in Autonomous Drone-Based Firefighting: Insights From a Field Trial

自主无人机消防中的问责制:来自实地试验的见解

Dzmitry Katsiuba, Anna Katharina Boos, Robin Hany, Mateusz Dolata, Gerhard Schwabe

发表机构 * University of Zurich(苏黎世大学) Zeppelin University(齐柏林大学)

AI总结 通过实地试验,研究自主无人机在消防中对问责制的影响,发现角色不确定性和人机交互新问题,并提出建议以负责任地整合无人机。

Comments Accepted for Publication at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/ethical_is/ethical_is/10/

Journal ref International Conference on Information Systems, 2025, ICIS2025-2162

详情
AI中文摘要

有一个不断增长的研究领域探索自主无人机如何提高应急响应效率。将这些(人工)智能体整合到现有的应急团队和工作流程中,可能会显著影响既定的问责关系。本文研究了自主无人机如何在复杂的社会技术系统中影响问责归属。通过两次真实的消防实地试验,该研究揭示了当无人机在组织层面部署时,围绕问责制存在显著的不确定性。利用Bovens的问责框架,识别出两个挑战:(1)无人机在层级结构中的角色不确定性,导致问责归属混乱;(2)新形式的人机交互引入了额外的问责相关问题。基于这些见解,本文提出了可操作的建议,以支持在不损害问责制的前提下将自主无人机负责任地整合到消防行动中。这些发现为政策制定者提供了实用指导,并有助于进一步研究自主系统中的问责制。

英文摘要

There is a growing research field exploring how autonomous drones can enhance emergency response effectiveness. Integrating these (artificial) agents into existing emergency teams and workflows may significantly impact established accountability relationships. This paper examines how autonomous drones affect accountability attribution within complex socio-technical systems. Drawing on two real-life field trials in firefighting, the study reveals substantial uncertainty around accountability when drones are organizationally deployed. Using Bovens' accountability framework, two challenges are identified: (1) uncertainty about the role of drones within hierarchical structures, leading to confused accountability ascriptions; and (2) new forms of human-drone interactions introducing additional accountability-relevant issues. Based on these insights, the paper proposes actionable recommendations to support the responsible integration of autonomous drones into firefighting operations without undermining accountability. These findings offer practical guidance for policymakers and contribute to further research on accountability in autonomous systems.

2606.17830 2026-06-17 cs.LG cs.AI 新提交

Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

注意力中的功能等价性:一项综合研究及其在线性模式连通性中的应用

Viet-Hoang Tran, Vinh Khanh Bui, Van-Hoan Trinh, Tan Lai Ngoc, Tan M. Nguyen

发表机构 * National University of Singapore(新加坡国立大学) Center for AI Research, VinUniversity(Vin大学人工智能研究中心) Independent Researcher(独立研究者) Technical University of Munich(慕尼黑工业大学)

AI总结 本文形式化研究了Transformer中位置编码对功能等价性的影响,发现正弦编码保持原始注意力的对称性,而旋转编码显著减少对称群从而增强表达力,并通过对齐算法实证了位置编码对线性模式连通性的关键作用。

Comments Published at the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

神经网络参数空间本质上是非单射的,因为不同的参数配置可以通过功能等价性实现相同的函数。虽然这种对称性在经典的全连接和卷积模型中已被充分理解,但在现代基于注意力的架构中变得更为复杂。现有的多头注意力分析主要关注原始公式,忽略了从根本上重塑架构对称性的位置编码。在这项工作中,我们提供了对带有位置编码的Transformer中功能等价性的形式化研究。聚焦于两种最广泛使用的变体——正弦和旋转位置编码(RoPE)——我们表明正弦编码保留了原始注意力的等价结构,而旋转编码显著减少了对称群,从而增强了表达力。这为RoPE在实践中日益突出的地位提供了原则性解释。我们进一步研究了位置编码如何影响线性模式连通性,并通过一种对齐算法,实证表明Transformer设置中连通性的存在和可变性关键取决于位置编码。

英文摘要

Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants--sinusoidal and rotary positional encodings (RoPE)--we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.
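A small numerical check makes the notion of functional equivalence concrete. The sketch below illustrates one standard symmetry of vanilla attention logits: multiplying the query projection by an invertible matrix A and the key projection by the inverse transpose of A leaves Q K^T unchanged. The matrices and helper names are hypothetical, no positional encoding is applied, and the example does not reproduce the paper's characterization of the sinusoidal or rotary cases.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention_logits(x, w_q, w_k):
    """Unnormalized attention scores Q K^T with Q = X W_Q and K = X W_K."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))

# Hypothetical toy parameters (integers keep the comparison exact).
X = [[1, 0], [2, 1], [0, 3]]          # three tokens, model width 2
W_Q = [[1, 2], [3, 4]]
W_K = [[0, 1], [1, 1]]
A = [[2, 1], [1, 1]]                  # invertible, determinant 1
A_INV_T = [[1, -1], [-1, 2]]          # transpose of the inverse of A

original = attention_logits(X, W_Q, W_K)
reparam = attention_logits(X, matmul(W_Q, A), matmul(W_K, A_INV_T))
print(original == reparam)            # different parameters, same logits
```

The identity behind the check is (X W_Q A)(X W_K A^{-T})^T = X W_Q A A^{-1} W_K^T X^T, so the two parameter settings define the same function.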

2606.17826 2026-06-17 cs.CL cs.AI 新提交

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

当多种文字重要时:在临床环境中评估ASR

Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

发表机构 * AITRICS University of Copenhagen(哥本哈根大学) KAIST(韩国科学技术院)

AI总结 针对非英语临床场景中ASR受多文字变异性影响的问题,提出MultiClin基准,通过多文字感知评估更公平地衡量识别质量,并发现文字统一化能提升ASR性能。

Comments Interspeech 2026

详情
AI中文摘要

非英语临床环境中的自动语音识别(ASR)面临多文字变异性的挑战,即同一术语可能以多种有效的正字法形式出现。传统的字符串匹配评估指标通常将正字法变体视为错误,从而低估ASR性能。为解决此问题,我们引入了MultiClin,一个旨在评估对多文字变异性鲁棒性的临床ASR基准。跨多种ASR模型的实验表明,与传统的单参考评估相比,多文字感知评估能更公平地评估识别质量。我们进一步研究了训练过程中文字一致性的影响,发现不一致的文字映射会增加正字法不确定性并阻碍模型收敛,其中50%的平衡映射比例产生最高的熵。相比之下,文字统一化始终能带来最佳的ASR性能。我们的数据集和代码公开于:https://github.com/aitrics-ronaldo/Interspeech_MultiClin。

英文摘要

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
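The contrast between single-reference and multiscript-aware scoring can be sketched with a character error rate that takes the best match over several accepted spellings. This is an assumption about how such a metric might look, not the MultiClin scoring code: the function names and the Korean/Latin example strings are hypothetical. The second helper illustrates why a balanced 50% mapping ratio maximizes entropy for a two-way script choice.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return edit_distance(hypothesis, reference) / len(reference)

def cer_multi(hypothesis: str, references) -> float:
    """Score against every accepted orthographic variant and keep the best."""
    return min(cer(hypothesis, ref) for ref in references)

def binary_entropy(p: float) -> float:
    """Entropy in bits of choosing one of two scripts with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical example: the same term written in Latin or Hangul script.
refs = ["CT 촬영", "씨티 촬영"]
hyp = "씨티 촬영"
print(cer(hyp, refs[0]), cer_multi(hyp, refs))
print(binary_entropy(0.5), binary_entropy(0.9))
```

Under single-reference scoring the valid Hangul spelling is counted as an error, while the multi-reference score treats it as correct.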

2606.17824 2026-06-17 cs.CV cs.AI 新提交

Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows

面向交互式内容工作流的人在回路、基于图集的3D资产分割

Paul Julius Kühn, Saptarshi Neil Sinha, Jakob Hansen, Robin Horst

发表机构 * Fraunhofer IGD(弗劳恩霍夫计算机图形学研究所) Hochschule RheinMain(莱茵美因应用科学大学)

AI总结 提出一种人在回路的流水线,通过贪心视图选择、SAM 2交互分割和UV反投影生成分割图集,支持材质分配、风格迁移等下游任务,在8个文化遗产物体上验证了有效性。

详情
AI中文摘要

将3D资产分割成有意义的区域仍然具有挑战性,尤其是当分割标准依赖于应用且需要用户控制时。我们提出了一种人在回路的流水线,用于从3D模型生成分割的2D参数化图集,适用于交互式媒体、游戏和XR内容工作流。我们的方法首先使用基于采样表面点的贪心集合覆盖策略选择一组紧凑的渲染视图,然后支持使用SAM 2和Label Studio对这些视图进行交互式分割。生成的掩码被反投影到模型的UV参数化上,以产生统一的图集分割,支持下游生产任务,如逐段材质分配、风格迁移和语义标注。我们通过对八个文化遗产物体的基于演示的技术评估来评估该流水线。结果表明,该方法可以在不同几何形状上生成可用的分割图集,同时揭示了需要手动校正的常见问题,特别是精细结构、空腔和弱外观边界。

英文摘要

Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model's UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
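The view-selection step, a greedy set cover over sampled surface points, can be sketched as follows. The visibility table, view names, and function name are hypothetical; in a real pipeline the visibility sets would come from rendering or ray casting, which this sketch does not attempt.

```python
def greedy_view_selection(visible, num_points):
    """Pick views greedily until no view adds newly covered surface points.

    visible: dict mapping a view name to the set of sample-point indices it sees.
    Returns the chosen views in order and the points no view can see.
    """
    uncovered = set(range(num_points))
    chosen = []
    while uncovered:
        # Take the view that covers the most still-uncovered points.
        best = max(visible, key=lambda v: len(visible[v] & uncovered))
        gain = visible[best] & uncovered
        if not gain:
            break  # remaining points are invisible from every candidate view
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

# Hypothetical visibility table for 8 sampled points; point 7 sits in a cavity.
visible = {
    "front": {0, 1, 2, 3},
    "back": {4, 5, 6},
    "top": {2, 3, 4},
    "left": {0, 6},
}
views, missed = greedy_view_selection(visible, num_points=8)
print(views, missed)
```

Points that end up in `missed` correspond to the kind of region, such as cavities, that the abstract flags as needing manual correction.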

2606.17821 2026-06-17 cs.AI 新提交

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

DecoSearch: 面向Text-to-SQL的复杂度感知路由与计划级修复

Esteban Schafir, Xu Zheng, Hojat Allah Salehi, Zhuomin Chen, Mo Sha, Wei Cheng, Dongsheng Luo

发表机构 * Florida International University(佛罗里达国际大学) NEC-Labs(NEC实验室) Singapore Management University(新加坡管理大学)

AI总结 提出DecoSearch框架,通过复杂度感知路由将查询分配给直接生成或DAG分解,并结合拓扑精炼器修复执行失败,在BIRD和Spider上取得高准确率且显著降低token消耗。

详情
AI中文摘要

大型语言模型(LLMs)在将自然语言翻译为SQL方面展现了卓越的能力,但现有方法在处理需要多步骤、数据感知推理的复杂查询时仍然表现不佳。我们引入了DecoSearch,一个无需训练的框架,通过将每个查询路由到适当的推理努力级别来解决这一问题。轻量级的Schema Selector首先将完整数据库模式修剪为相关的表和列。然后,LLM Judger判断问题是否需要分解:简单问题遵循直接生成路径,而复杂问题则升级为原子子问题的有向无环图(DAG),每个子问题通过目标SQL生成步骤解决。RAG组件用语义相似的训练示例为分解器提供基础,而Topology Refiner在执行失败表明存在有缺陷的分解而非可修复的SQL错误时,重构推理计划。DecoSearch在BIRD上达到70.53%的执行准确率,在Spider上达到88.31%,使用DeepSeek骨干网络,超越了所有无需训练的基线方法,同时消耗的token数量比竞争方法少一个数量级。它还可以作为模型无关的包装器,在不修改管道的情况下持续改进微调后的SQL生成骨干网络。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.
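The routing idea, a direct path for simple questions and a DAG of sub-questions for complex ones, can be sketched with the standard-library topological sorter. Everything below is a hypothetical stand-in: the word-count judge replaces the LLM Judger, the fixed dictionary replaces the decomposer, and no SQL is generated. The sketch only shows how a decomposition DAG yields an execution order in which prerequisites come first.

```python
from graphlib import TopologicalSorter

def plan(question, is_complex, decompose):
    """Return the ordered list of questions to turn into SQL."""
    if not is_complex(question):
        return [question]                      # direct generation path
    dag = decompose(question)                  # {sub_question: prerequisites}
    return list(TopologicalSorter(dag).static_order())

# Hypothetical stand-ins for the LLM-based components.
def toy_judge(question):
    return len(question.split()) > 8

def toy_decompose(question):
    return {
        "pick top singer per country": {"filter singers above average age"},
        "filter singers above average age": {"compute average age"},
        "compute average age": set(),
    }

simple = plan("List all singers.", toy_judge, toy_decompose)
hard = plan(
    "For each country, which singer older than the average age has the most concerts?",
    toy_judge,
    toy_decompose,
)
print(simple)
print(hard)
```

A plan-level repair step in the spirit of the abstract would edit the dictionary returned by the decomposer and re-run the ordering, rather than patching a single SQL string.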

2606.17820 2026-06-17 cs.CL 新提交

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

利用语言识别的双语微调改进低资源语音识别:跨语言评估

Reihaneh Amooie, Yun Hao, Wietse de Vries, Jelske Dijkstra, Matt Coler, Martijn Wieling

发表机构 * University of Groningen(格罗宁根大学) Fryske Akademy(弗里斯兰科学院) Vrije Universiteit Brussel(布鲁塞尔自由大学)

AI总结 研究双语微调对低资源语言语音识别的影响,在九种语言对中评估,通过语言识别令牌区分语言,发现高语言识别准确率时双语微调有效,低准确率时推理时加入令牌可提升性能。

详情
AI中文摘要

本研究探讨了双语微调如何影响低资源语言的自动语音识别(ASR)。我们在九种语言和地理多样化的语言对上评估了该方法,涵盖了多种语系和书写系统。为了区分两种语言,在训练期间,我们在每个输入文本前添加一个语言识别令牌。在推理时,模型仅从语音输入中联合预测语言和转录。由于语言被错误确定的文本显示出较低的ASR性能,我们还进行了一项后续实验,在训练和推理期间都提供语言识别令牌。我们的结果表明,当语言识别准确率高时,双语微调可能是有益的,而在语言识别性能低的情况下,在推理时包含语言识别令牌有助于提高ASR性能。

英文摘要

This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of language families and writing systems. To distinguish the two languages, during training, we prepend each input text with a language identification token. At inference, the model jointly predicts both the language and transcription from the speech input alone. As texts for which the language is incorrectly determined show low ASR performance, we also conduct a follow-up experiment in which the language identification token is provided both during training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that in cases where language identification performance is low, including the language identification token at inference helps to improve ASR performance.
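The language-token mechanism can be sketched as two small text-handling helpers: one that builds the training target and one that separates the predicted language from the transcription at inference. The `<lang>` token format, the function names, and the example strings are assumptions; the abstract does not specify how the token is encoded.

```python
from typing import Optional, Tuple

def add_language_token(transcript: str, lang: str) -> str:
    """Build a training target by prepending a language identification token."""
    return f"<{lang}> {transcript}"

def split_prediction(prediction: str) -> Tuple[Optional[str], str]:
    """Separate the predicted language token from the transcription."""
    token, _, text = prediction.partition(" ")
    if token.startswith("<") and token.endswith(">"):
        return token[1:-1], text
    return None, prediction  # the model emitted no language token

# Hypothetical bilingual pair: a low-resource language and a related major one.
target = add_language_token("goeie moarn", "fy")
print(target)
print(split_prediction(target))
```

In the follow-up condition described in the abstract, the token would be supplied as part of the input at inference instead of being predicted.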

2606.17816 2026-06-17 cs.LG cs.AI 新提交

Conservation Laws for Modern Neural Architectures

现代神经架构的守恒律

Viet-Hoang Tran, Vinh Khanh Bui, Tan Lai Ngoc, Nam Nguyen, Tuan Dam, Tan M. Nguyen

发表机构 * National University of Singapore(新加坡国立大学) Center for AI Research, VinUniversity(Vin大学人工智能研究中心) Independent Researcher(独立研究者) Hanoi University of Science and Technology(河内科学技术大学)

AI总结 本文提出统一框架,刻画GELU、SiLU、SwiGLU激活的前馈网络、多头注意力及混合专家模型中的梯度流守恒律,实验验证了理论预测的不变量。

Comments Published at the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

理解梯度下降动力学是解释过参数化模型成功的关键,其中隐式偏差通过梯度流中的守恒律体现。尽管这类定律在线性和ReLU网络中已被充分理解,但在现代架构中仍鲜有探索。本文开发了一个统一框架,用于刻画当代模型中的守恒律,包括具有GELU、SiLU和SwiGLU激活的前馈网络、具有正弦和旋转位置编码的多头注意力,以及多种门控设计下的混合专家架构。我们的理论发现得到了实验支持,实验验证了预测的不变量。

英文摘要

Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
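As background for what a conservation law looks like, the sketch below checks the classical balancedness invariant for a two-parameter linear model f(x) = a·b·x: under gradient flow on a squared loss, a² − b² stays constant. This is the well-understood linear case that the abstract takes as its starting point, not one of the paper's new results. Gradient descent with a small step size only approximates gradient flow, so a small drift is expected; all numbers are illustrative.

```python
def train(a, b, x, y, lr, steps):
    """Gradient descent on 0.5 * (a*b*x - y)**2, logging (a^2 - b^2, loss)."""
    history = []
    for _ in range(steps):
        residual = a * b * x - y
        history.append((a * a - b * b, 0.5 * residual ** 2))
        grad_a = residual * x * b   # dL/da
        grad_b = residual * x * a   # dL/db
        a, b = a - lr * grad_a, b - lr * grad_b
    return history

history = train(a=1.5, b=0.5, x=1.0, y=2.0, lr=1e-3, steps=2000)
drift = abs(history[-1][0] - history[0][0])
print(history[0], history[-1], drift)
```

Each discrete step rescales a² − b² by the factor 1 − lr²·g², where g is the residual times x, which is why the drift shrinks quadratically as the step size goes to zero.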

2606.17810 2026-06-17 cs.LG cs.AI 新提交

No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems

无免费公平:学习系统中的基本限制与权衡

Khoat Than

发表机构 * Hanoi University of Science and Technology(河内科学技术大学)

AI总结 本文提出无免费公平定理,揭示学习系统中三个固有差异来源:任务固有成本导致性能与公平的权衡、有限样本诱导子群差异、模型类表达力限制导致公平不可达,表明不公平源于决策问题结构、数据有限性和模型表达力。

详情
AI中文摘要

在本文中,我们建立了一组理论不可能性结果,称为无免费公平定理,这些定理识别了学习系统中三个根本性的差异来源。首先,我们证明当任务在某个子群上表现出不可约成本时,任何决策规则都必须在整体性能与差异之间进行权衡,从而产生固有的公平-成本前沿。其次,我们证明即使在理想的无噪声环境中,存在完全公平且准确的解,仅凭有限样本学习就会导致非平凡的子群差异,排除了分布无关的公平保证。更严重的是,强制执行严格的相对公平会造成统计瓶颈:实现低成本可能需要指数级数量的样本。第三,我们证明模型类的局限性可以独立地导致差异:如果模型无法为某个子群表示准确的解,那么无论数据或训练过程如何,公平性都无法实现。总体而言,这些结果表明不公平不仅仅是由于有偏数据或次优优化,而是源于决策问题的内在结构、有限数据的约束以及模型的表达力。我们的框架广泛适用于标准监督学习之外,并表明实现公平需要明确的权衡,应被视为核心设计考虑因素。

英文摘要

In this paper, we establish a set of theoretical impossibility results, termed the No-Free-Fairness theorems, that identify three fundamental sources of disparity in learning systems. First, we show that when a task exhibits irreducible cost on a subgroup, any decision rule must trade off overall performance with disparity, yielding an inherent fairness--cost frontier. Second, we prove that even in ideal, noise-free settings where a perfectly fair and accurate solution exists, finite-sample learning alone induces nontrivial subgroup disparity, ruling out distribution-free fairness guarantees. More seriously, enforcing strict relative fairness creates a statistical bottleneck: achieving low cost may require exponentially many samples. Third, we show that limitations of the model class can independently induce disparity: if the model cannot represent accurate solutions for a subgroup, fairness remains unattainable regardless of data or training procedure. Overall, these results demonstrate that unfairness is not solely a consequence of biased data or suboptimal optimization, but arises from the intrinsic structure of decision problems, the constraints of finite data, and the expressivity of models. Our framework applies broadly beyond standard supervised learning, and suggests that achieving fairness requires explicit trade-offs and should be treated as a core design consideration.

2606.17809 2026-06-17 cs.CV 新提交

Million-scale multimodal pollen microscopy with expert-guided foundation models

百万级多模态花粉显微镜图像与专家引导的基础模型

András Biricz, Björn Gedda, Donát Magyar, Antonio Spanu, János Fillinger, Péter Pollner, István Csabai

发表机构 * Department of Physics of Complex Systems, ELTE Eötvös Loránd University(ELTE罗兰大学复杂系统物理系) The Palynological Laboratory at the Swedish Museum of Natural History(瑞典自然历史博物馆孢粉学实验室) National Centre for Public Health and Pharmacy(国家公共卫生与药品中心) INRAE, UR 546 BioSP, Site Agroparc(法国国家农业、食品与环境研究院,UR 546 BioSP,阿格罗帕克园区) National Korányi Institute for Pulmonology(国家科拉尼肺病研究所) Health Data Science and AI Knowledge Centre, Health Services Management Training Centre, Faculty of Health and Public Administration, Semmelweis University(塞梅维什大学健康与公共管理学院卫生服务管理培训中心健康数据科学与人工智能知识中心) Department of Biological Physics, ELTE Eötvös Loránd University(ELTE罗兰大学生物物理系)

AI总结 提出百万级多模态花粉显微镜数据集Pollen AI Atlas,结合专家引导的视觉-语言模型生成形态描述,实现跨区域、跨设置的高精度花粉识别与检索。

Comments 31 pages, 5 main figures, supplementary information included. Submitted to Scientific Reports

详情
AI中文摘要

从显微镜图像自动识别花粉仍然是空气生物学、古生态学和生物多样性监测中的一个瓶颈,因为可扩展系统必须在样本制备、扫描仪设置和地理来源各不相同的情况下泛化,同时保持孢粉学可解释性。为解决这一问题,我们提出了一个百万级多模态花粉显微镜资源Pollen AI Atlas,该资源由单一物种的全切片明场图像组装而成,涵盖四个地理来源、四种扫描仪设置以及31个植物科的46个分类标签。以每个源切片上一个手动选择的示例为种子,令牌级挖掘和过滤产生了1,511,390个公开发布的花粉颗粒检测结果,在专家筛选的测试区域中提案精度达到99.6%。每个检测结果与来自五个开放权重视觉-语言模型的机器生成颗粒级形态描述配对,这些描述由专家验证的孢粉学锚点引导,提供了关于萌发孔系统、壁纹饰、形状和大小的结构化描述。在评估的模型中,Gemma4提供了最可控的主描述集,兼具严格的长度控制、无泄漏和最强的文本检索性能。使用冻结视觉特征的基线基准达到了88.16%的top-1准确率,而跨区域检索表明,当图像相似度下降时,由描述得到的文本嵌入仍然保持鲁棒(mAP@20 0.811对比0.262)。发布的数据、注释、描述、划分、代码和权重为花粉识别、跨区域领域适应和特定领域多模态显微镜学习提供了基准。

英文摘要

Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address this gap, we present a million-scale multimodal pollen microscopy resource, Pollen AI Atlas, assembled from pure-species whole-slide bright-field images spanning four geographic origins, four scanner settings and 46 taxon labels across 31 botanical families. Seeded by one manually selected exemplar per source slide, token-level mining and filtering produced 1,511,390 released grain detections with 99.6% proposal precision in expert-curated test regions. Each detection was paired with machine-generated grain-level morphological captions from five open-weight vision-language models, guided by expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape and size. Among the evaluated models, Gemma4 provided the most controlled primary caption set, combining tight length control, no leakage and the strongest text-retrieval performance. Baseline benchmarks with frozen visual features reached 88.16% top-1 accuracy, while cross-regional retrieval showed that caption-derived text embeddings remained robust when image similarity degraded (mAP@20 0.811 versus 0.262). Released data, annotations, captions, splits, code, and weights provide a benchmark for pollen recognition, cross-regional domain adaptation and domain-specific multimodal microscopy learning.