arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02939 2026-06-03 cs.LG eess.SP

ERP-XTTN: Interpretable Prototype-Guided Cross-Attention for Cross-Subject ERP Classification

Charlotte Genevier Wyman, Leanne Hirshfield

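Affiliations: University of Colorado Boulder

AI summary: Proposes ERP-XTTN, a prototype-guided cross-attention architecture that achieves interpretable ERP classification under calibration-free, cross-subject conditions and reveals the neurophysiological causes of classification errors.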

Abstract

Interpretable brain-computer interface classifiers that generalize across subjects without calibration remain an open challenge. We test whether prototype-based cross-attention can provide competitive, interpretable event-related potential (ERP) classification under deployment-compatible conditions. We propose ERP-XTTN, a cross-attention architecture that routes input EEG patches to fixed difference-wave prototypes via query-key-only cross-attention with no value projection, so classification depends entirely on attention routing and attention faithfulness is structural rather than post-hoc. Prototypes are derived automatically from extrema in the training-fold difference wave. We evaluate across three public sources (BNCI Horizon 2020, HRI Cursor, and ERP CORE) spanning eight ERP components (ERN, LRP, ErrP, N170, P300, N2pc, MMN, N400), using leave-one-subject-out (LOSO) evaluation with causal filtering at two channel counts (3-channel and full montage), against EEGNet and xDAWN with Riemannian geometry (xDAWN+RG). The mean gap between the best baseline and ERP-XTTN was .018 AUROC at 3 channels and .034 at full montage, arising from two largely distinct sources: a temporal-flexibility cost relative to EEGNet and a spatial-exploitation cost relative to xDAWN+RG, the latter driven by signal-to-noise ratio at full montage. Beyond accuracy, the transparent routing reveals cross-subject signal structure that black-box models cannot: false positives resembled true positives more than true negatives did, indicating that classification errors are neurophysiologically explicable. ERP-XTTN generalizes across diverse ERPs under causal, calibration-free conditions with a small interpretability cost at minimal montages. To our knowledge, this is the first epoch-level LOSO benchmark on ERP CORE.
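
As a rough illustration of the routing idea in this abstract, the Python sketch below computes query-key-only cross-attention from input patches to a fixed set of prototypes. With no value projection, the attention weights are the only output, so any class score has to be read directly off the routing. The function names (`qk_only_routing`, `class_evidence`), the dot-product scoring, the 1/sqrt(d) scale, and the mean pooling over patches are assumptions made for illustration; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qk_only_routing(patches, prototypes):
    """Attention weights from each patch (query) to each prototype (key).

    There is no value projection: the weights are the only output.
    """
    scale = 1.0 / math.sqrt(len(prototypes[0]))  # assumed scaling
    routing = []
    for query in patches:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in prototypes]
        routing.append(softmax(scores))
    return routing


def class_evidence(routing):
    """Mean attention mass per prototype across patches (assumed pooling)."""
    n_patches = len(routing)
    n_protos = len(routing[0])
    return [sum(row[j] for row in routing) / n_patches for j in range(n_protos)]


# Toy example: two 2-D patches, one hypothetical prototype per class.
patches = [[1.0, 0.0], [0.9, 0.1]]
prototypes = [[1.0, 0.0], [-1.0, 0.0]]
evidence = class_evidence(qk_only_routing(patches, prototypes))
```

Because each row of the routing matrix sums to one, `evidence` can be inspected per prototype, which is the sense in which the explanation and the prediction coincide in this toy version.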

2606.02936 2026-06-03 cs.LG

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning

Mingtao Xia, Qijing Shen

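Affiliations: University of Houston; University of Birmingham; University of Oxford

AI summary: Proposes and analyzes hierarchical Kolmogorov-Arnold neural network architectures that use radial basis functions as activation functions to approximate deterministic functions and random field models, and proves their universal approximation properties and their potential to alleviate the curse of dimensionality.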

Abstract

In this manuscript, we propose and analyze hierarchical Kolmogorov-Arnold neural network architectures employing radial basis functions as activation functions for approximating deterministic functions and random field models. Specifically, we develop a hierarchical radial-basis-function Kolmogorov-Arnold network (hierarchical RBF-KAN) for multidimensional deterministic function approximation and a hierarchical radial-basis-function stochastic Kolmogorov-Arnold network (hierarchical RBF-SKAN) for random field learning. From a theoretical perspective, we establish universal approximation results for both architectures. In particular, we derive quantitative approximation estimates for the hierarchical RBF-KAN, showing that the proposed framework has the potential to partially alleviate the curse of dimensionality in learning high-dimensional functions by reducing the effective dimensionality of the approximation problem. Furthermore, we show that the hierarchical RBF-SKAN can approximate random field models under the Wasserstein-2 metric. Empirically, we show that our proposed radial-basis-function-based neural network structure could effectively learn multivariate functions and random field models.

2606.02935 2026-06-03 cs.CV cs.CE

CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

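Affiliations: Department of Complex Systems, National Centre for Nuclear Research; ImagineRT sp. z o.o.; National Centre for Nuclear Research

AI summary: Proposes a two-stage geometric registration method that estimates the rotation axis by detecting elliptical cross-sections in CT slices, then voxelizes the CAD model and maximizes its volumetric overlap with the CT scan. The method registers cylindrical objects (ionization chambers) without intensity calibration or feature matching, with tilt and orientation errors below 0.1°.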

Abstract

Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below $0.1^\circ$ without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
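
The last step of the first stage, PCA on the fitted ellipse centers, can be sketched in a few lines. The sketch below assumes the centers have already been extracted and filtered (edge detection, ellipse fitting, and RANSAC outlier removal are not shown) and uses power iteration instead of a library eigensolver only to stay dependency-free. The function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import math


def principal_axis(centers, iterations=100):
    """Dominant direction of 3-D points via power iteration on their covariance."""
    n = len(centers)
    mean = [sum(c[i] for c in centers) / n for i in range(3)]
    centered = [[c[i] - mean[i] for i in range(3)] for c in centers]
    cov = [[sum(p[i] * p[j] for p in centered) / n for j in range(3)]
           for i in range(3)]
    axis = [1.0, 1.0, 1.0]  # arbitrary start; fails if orthogonal to the true axis
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; try another start vector")
        axis = [x / norm for x in nxt]
    return mean, axis


def tilt_degrees(axis):
    """Angle between the unit axis and the z (slice-normal) direction."""
    return math.degrees(math.acos(min(1.0, abs(axis[2]))))


# Synthetic ellipse centers on a line tilted so that x = 0.1 * z.
centers = [(0.1 * z, 0.0, float(z)) for z in range(10)]
_, axis = principal_axis(centers)
tilt = tilt_degrees(axis)
```

On this noise-free input the recovered tilt equals atan(0.1), roughly 5.71°; with real CT data the accuracy would depend on the ellipse fits and on the outlier removal that precedes this step.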

2606.02928 2026-06-03 cs.RO

Improved Postural Stability Using a Lightweight Semi-Active Soft Back Support Device Under Standing Perturbations

Rohan Khatavkar, Jiefeng Sun, Hyunglae Lee

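Affiliations: School for Engineering of Matter, Transport and Energy

AI summary: Presents a lightweight semi-active soft back support device that combines a pneumatic artificial muscle with an elastic band. By rapidly providing assistive force, it significantly reduces whole-body angular momentum and increases the margin of stability, improving balance recovery after standing perturbations.

Comments: 6 pages, 8 figures, submitted to IROS 2026, the IEEE/RSJ International Conference on Intelligent Robots and Systems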

Abstract

Older adults are particularly susceptible to falls following perturbations during standing, such as forward loss of balance. Back support devices that assist trunk extension may help mitigate fall risk by preventing excessive trunk flexion. Previous studies have investigated heavy back support devices; however, these systems often introduced adverse effects on stability due to their added mass, which shifted the body's natural center of mass unfavorably. In contrast, lightweight passive devices have shown limited benefits, as they can generate only modest assistive forces during the relatively small trunk flexion associated with forward balance loss. In this study, we evaluated the effects of a lightweight semi-active soft back support device on postural stability following standing perturbations. Our device combines an active element (a pneumatic artificial muscle) in parallel with a passive elastic band. The active element rapidly provides assistive force following a perturbation, overcoming the limitations of passive devices. Experiments conducted with five healthy individuals demonstrated that the semi-active device significantly reduced whole-body angular momentum and increased the margin of stability, indicating improved balance recovery performance. These results highlight the promise of semi-active soft wearable robots as an effective and lightweight strategy for fall prevention during standing perturbations.

2606.02927 2026-06-03 cs.CV

SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks

Mourad Zaied

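Affiliations: Department of Electrical Engineering; National Engineering School of Gabes (ENIG); University of Gabes

AI summary: Proposes the SALU activation function as a replacement for normalization layers and builds SaluNet, which trains deep networks stably without normalization and achieves strong performance on multiple datasets.

Comments: 34 pages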

Abstract

Normalization layers such as BatchNorm and LayerNorm have long been considered essential for stable training in deep networks. This work demonstrates that they can be fully replaced by a single learnable activation mechanism. We identify a plasticity suppression effect induced by standard normalization: learnable activation parameters rapidly lose adaptability when paired with normalization layers. Motivated by this observation, we introduce SALU (Saturated Adaptive Linear Unit), \[ \operatorname{SALU}(x;a,b) = \frac{a x}{\sqrt{1 + a b x^2}},\quad a>0,\; b>0 \] a bounded, learnable activation that provides intrinsic signal stabilization without relying on batch statistics or external affine parameters. Building on SALU, we propose SaluNet, a paradigm grounded in total plasticity: SALU replaces normalization layers, while SWALU and GALU replace standard activations. With ResNet-18, SaluNet-C-18 achieves 97.35% on CIFAR-10 and 83.25% on CIFAR-100 without normalization, maintaining 93.44% and 76.23% at batch size 1 where normalized architectures fail. For transformers, SaluNet-T improves over LayerNorm-GELU from 90.92% to 91.01% on CIFAR-10 and from 66.54% to 68.10% on CIFAR-100. SaluNet-C-50 reaches 78.67% Top-1 on ImageNet-1K at $224\times224$, and 79.23% at $288\times288$. These results suggest normalization layers suppress total plasticity, a property biological neurons inherently possess, enabling deep networks to learn effectively.
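
The SALU formula in the abstract is simple enough to evaluate directly. The scalar sketch below implements it as written and checks two properties that follow from the formula: the slope at the origin is a, and the output saturates at ±sqrt(a/b), since |a x| / sqrt(1 + a b x^2) < |a x| / sqrt(a b x^2) = sqrt(a/b). The function name and the sample parameter values are illustrative; how a and b are learned or constrained in SaluNet is not described in the abstract and is not modeled here.

```python
import math


def salu(x, a, b):
    """SALU(x; a, b) = a*x / sqrt(1 + a*b*x^2), defined for a > 0 and b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a * x / math.sqrt(1.0 + a * b * x * x)


a, b = 2.0, 0.5                              # illustrative parameter values
bound = math.sqrt(a / b)                     # saturation level, here 2.0
near_zero_slope = salu(1e-6, a, b) / 1e-6    # close to a
large_input = salu(1e6, a, b)                # close to, but below, the bound
```

The boundedness is what lets the activation keep signals in a fixed range without batch statistics, which is consistent with the batch-size-1 results quoted in the abstract.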

2606.02920 2026-06-03 cs.LG

Fast Unlearning at Scale via Margin Self-Correction

Federico Di Gennaro, Alexander Shevchenko, Fanny Yang

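Affiliations: ETH Zürich

AI summary: Proposes MASC, which uses an online stopping rule to unlearn efficiently in language models without downstream evaluation, significantly reducing computational cost.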

Abstract

Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model with a fixed training budget and select the final model afterwards by evaluating several saved checkpoints on downstream validation data. Two sources of unnecessary computation limit scalability: training beyond the desired forget-retain trade-off, and checkpoint selection that requires extra storage and repeated evaluations. To address these limitations, we introduce MArgin Self-Correction (MASC), an efficient unlearning method with an online stopping rule that does not require downstream evaluation. Given a text sequence to be forgotten, MASC actively reduces the logit gap between the original next token and the most likely alternatives. It outputs a final model once this gap is small on average over a sufficiently large proportion of token positions across all forget sequences. On TOFU, MUSE News, and MUSE Books, MASC achieves a competitive forget-retain trade-off at a fraction of the computational cost of existing baselines. We further observe that as we increase model size (i.e., the number of parameters), the trade-offs improve for both MASC and SimNPO: the forget metrics remain comparable while retain utility increases.
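
One way to read the stopping rule described in this abstract is sketched below: compute, at every token position, the gap between the logit of the original next token and the best alternative, then stop once a large enough fraction of positions has a small gap. This is an interpretation rather than the authors' rule. The function names, the use of a single best alternative, and the `gap_threshold` and `min_fraction` parameters are assumptions introduced here.

```python
def logit_gap(logits, original_token):
    """Logit of the original next token minus the best alternative logit."""
    best_alternative = max(v for i, v in enumerate(logits) if i != original_token)
    return logits[original_token] - best_alternative


def should_stop(position_logits, original_tokens,
                gap_threshold=0.0, min_fraction=0.9):
    """True once enough token positions have a gap at or below the threshold."""
    gaps = [logit_gap(logits, token)
            for logits, token in zip(position_logits, original_tokens)]
    small = sum(1 for gap in gaps if gap <= gap_threshold)
    return small / len(gaps) >= min_fraction


# Two token positions over a 3-token vocabulary (toy values).
position_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
original_tokens = [0, 2]
stop_loose = should_stop(position_logits, original_tokens, min_fraction=0.5)
stop_strict = should_stop(position_logits, original_tokens, min_fraction=0.9)
```

Because the check uses only logits on the forget sequences, it can run during training, which matches the abstract's point that no saved checkpoints or downstream evaluations are needed.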

2606.02915 2026-06-03 cs.CV

Any2Poster: Any-Source Poster Generation Across Modalities and Domains

Amogh Vinaykumar, Aiden Li, Suozhi Huang, Shilong Liu

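Affiliations: Flower Mound High School; University College London; Princeton University

AI summary: Introduces the Any2Poster Bench benchmark and the Any2Poster Agent, which generate posters from many input modalities and domains, and validates information fidelity and visual communication through quiz-based and visual evaluation.

Comments: Project page: https://github.com/Any2Poster/Any2Poster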

Abstract

Visual posters are a compact medium for communicating dense information, yet progress on automatic poster generation remains difficult to measure because existing evaluations are often restricted to paper-only inputs, narrow domains, or surface-level visual similarity. We introduce Any2Poster Bench, a benchmark for any-source poster generation that evaluates systems across eight input modalities (PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, and videos) and five content domains. Any2Poster Bench pairs each source with quiz-based probes of verbatim factual retention and interpretive understanding, together with VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow, enabling reproducible assessment of both information fidelity and visual communication. To instantiate and validate this benchmark, we further present Any2Poster Agent, an end-to-end reference agent that parses heterogeneous sources, organizes salient content, plans poster layouts, renders posters, and iteratively refines them using visual feedback. On Any2Poster Bench, Any2Poster Agent achieves 87.25% average accuracy across input modalities and 87.28% across content domains. On PaperQuiz-style evaluation, where prior paper-to-poster agents are directly comparable, Any2Poster Agent improves over PosterAgent-4o from 51.06-51.33% to 72.58% overall accuracy and from 116-121 to 145.16 in density-augmented score. Together, Any2Poster Bench and Any2Poster Agent provide a reusable evaluation resource and a competitive baseline for studying multimodal, domain-general poster generation.

2606.02911 2026-06-03 cs.CL

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

Mirko Lai, Alessandra Urbinati, Simona Frenda, Fabiana Vernero, Marco Antonio Stranisci

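Affiliations: Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University; Heriot-Watt University; aequa-tech; Università del Piemonte Orientale; Università degli Studi di Torino

AI summary: Proposes a framework that combines conformal prediction with collaborative-filtering-style annotator representations, using the Ghost Prediction metric and the Ghost Annotator representation to quantify cases where model predictions diverge from all human annotations. It finds that model uncertainty increases with annotator disagreement, that larger models are more confident on texts aligned with no human annotation, and that a structural demographic bias is present.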

Abstract

Current research primarily focuses on model performance, while comparatively less attention has been devoted to uncertainty estimation, particularly in settings where LLMs are increasingly used to generate annotated data. We introduce a framework combining conformal prediction with collaborative-filtering-style annotator representations to model LLM behavior in relation to human annotators and to analyze patterns of agreement and disagreement. Using non-conformity scores, we introduce the Ghost Prediction metric and the Ghost Annotator representation to quantify cases in which model predictions diverge from all available human annotations. We compute cosine similarity measures to explore differences in model behavior across sociodemographic axes. We evaluate four LLMs of different sizes and families across four content moderation datasets. Our findings show that, while every model's uncertainty increases with annotator disagreement, larger models tend to be more confident in the classification of texts that are not aligned with any human annotation. Finally, the Ghost Annotator framework reveals a consistent and robust pattern of demographic misalignment, suggesting a structural bias likely rooted in pretraining corpora.

2606.02908 2026-06-03 cs.CL cs.AI

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Hengrui Gu, Xiaotian Han, Kaixiong Zhou

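Affiliations: North Carolina State University; Case Western Reserve University

AI summary: To address the evidence burden that multi-turn user-facing agents face in information gathering and decision making, proposes WRIT, which synthesizes write-intensive and read-intensive trajectories to train agents to make evidence-grounded decisions under information load. With only 2K trajectories it improves performance and reduces inference-time token usage.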

Abstract

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

2606.02892 2026-06-03 cs.LG

Multi-Modal Machine Learning for Breast Cancer Recurrence Prediction

Jiahao Shao, Xudong Wang, Anam Nawaz Khan, Christopher Brett, Xueping Li, Bing Yao

Affiliations: Department of Industrial & Systems Engineering, The University of Tennessee, Knoxville, TN 37996, USA; The University of Tennessee Medical Center, Knoxville, TN 37920, USA

AI summary: Integrates structured and unstructured clinical data (treatment records, pathology reports, and clinician notes), combining rule-based regular expression extraction with a precedence-based conflict reconciliation strategy, to improve the accuracy of breast cancer recurrence prediction.

Comments: 33 pages, 10 figures

Abstract

Breast cancer recurrence, a leading cause of long-term mortality among survivors, requires timely and accurate risk assessment to guide follow-up care and treatment planning. Traditional predictive models, often limited to either structured or unstructured data alone, struggle to capture the full clinical context. This study examines the impact of integrating multi-modal clinical data, including treatment records, pathology reports, and clinician notes, on recurrence prediction. By integrating a rule-based regular expression extraction mechanism with a rigorous precedence-based conflict reconciliation strategy, our approach effectively recovers definitive tumor characteristics from free-text pathology narratives to augment structured records. We also benchmark performance against commonly used feature sets from prior breast cancer studies to assess the added value of multi-modal integration. Single-source and multi-modal inputs are evaluated across a range of machine learning models. Results show that multi-modal integration consistently improves predictive accuracy compared to single-modal methods.

2606.02888 2026-06-03 cs.RO

Impact of a Soft Wearable Back-Support Device on Postural Stability during Trip-Like Perturbations

Yuanhao Chen, Rohan Khatavkar, Soubhagya Nayak, Jiefeng Sun, Hyunglae Lee

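Affiliations: School for Engineering of Matter, Transport and Energy, Arizona State University

AI summary: Using perturbed-standing and perturbed-walking experiments, studies how a soft wearable back-support device affects postural stability under trip-like perturbations. Device use increased the minimum margin of stability, indicating improved reactive balance control and potential for fall prevention.

Comments: 6 pages, 6 figures, to be published in the proceedings of the 2026 11th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics (BioRob)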

Abstract

The effectiveness of a soft wearable back-support device in enhancing postural stability was investigated under trip-like perturbations using two experimental paradigms: perturbed standing and perturbed walking. Healthy subjects completed trials under three different back-support conditions: no device, device worn with low stiffness, and device activated with high stiffness. Whole-body stability was quantified using the minimum Margin of Stability (MOS) at the point of maximal instability. Results demonstrated increased MOS during device use, indicating enhanced postural stability. In standing, MOS increased significantly with device stiffness, whereas in walking, both device conditions improved MOS relative to no device but did not differ significantly from each other. These findings highlight the potential of soft wearable back-support devices with adjustable stiffness to improve reactive balance control against external perturbations, with important implications for fall prevention. Future research should explore personalized stiffness optimization and evaluate efficacy in populations at elevated risk of falls.

2606.02884 2026-06-03 cs.LG cs.AI

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Sanjit Dandapanthula, Nicholas M. Boffi

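Affiliations: University of California, Berkeley

AI summary: Through closed-form analysis of Gaussian mixture models with quadratic rewards, shows that reward hacking in reward-guided diffusion stems from finite-particle plug-in estimation of the Doob h-function, and proposes a closed-form reward damping schedule that corrects within-mode bias with no additional compute.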

Abstract

Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion, namely finite-particle plug-in estimation of the Doob h-function, even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.
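
The simplest setting named in this abstract, a Gaussian target with a quadratic reward, has an exact reward-tilted measure, which is what makes closed-form analysis possible. The sketch below assumes the tilt has the form p(x)·exp(r(x)) up to normalization (the abstract does not state its exact convention, for example any temperature), with p = N(mu, sigma2) in one dimension and r(x) = -(lam/2)(x - c)^2. Completing the square gives another Gaussian whose precision is 1/sigma2 + lam. The function name and parameterization are illustrative, not the paper's notation.

```python
def tilt_gaussian(mu, sigma2, lam, c):
    """Mean and variance of N(mu, sigma2) tilted by exp(-(lam / 2) * (x - c)**2)."""
    precision = 1.0 / sigma2 + lam               # precisions add
    mean = (mu / sigma2 + lam * c) / precision   # precision-weighted average
    return mean, 1.0 / precision


# Standard normal target, reward peaked at c = 2 with strength lam = 1.
tilted_mean, tilted_var = tilt_gaussian(mu=0.0, sigma2=1.0, lam=1.0, c=2.0)
```

In this example the exact tilted mean sits halfway between the prior mean and the reward peak, and the variance halves. A guided sampler that lands closer to the peak than this, or with less variance, is over-optimizing the reward relative to the true tilt.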

2606.02879 2026-06-03 cs.RO

Direct Informed Sampling on Riemannian Manifolds via Loewner Order Lower Bounds

Phone Thiha Kyaw, Jonathan Kelly

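Affiliations: Space and Terrestrial Autonomous Robotic Systems (STARS) Laboratory, University of Toronto Institute for Aerospace Studies (UTIAS)

AI summary: Proposes a matrix-valued admissible heuristic that uses the Loewner order to compute the tightest constant lower bound on the metric tensor, mapping the Riemannian informed set to a standard prolate hyperspheroid in an isotropic Euclidean space. This enables direct, rejection-free sampling and accelerates convergence of several optimal motion planners.

Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)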

Abstract

Informed sampling techniques accelerate sampling-based motion planners by focusing the search on promising regions of the state space, yet most existing methods rely on Euclidean heuristics that become inadmissible under configuration-dependent Riemannian metrics. While scalar eigenvalue bounds restore admissibility by uniformly scaling the Euclidean distance, they discard the directional structure of the metric, producing overly conservative informed sets. We propose a matrix-valued admissible heuristic that exploits the Loewner order on symmetric positive definite matrices to compute the tightest constant lower bound on the metric tensor while preserving its full directional structure. The Cholesky factorization of this bound defines a linear map to an isotropic Euclidean space in which the Riemannian informed set reduces to a standard prolate hyperspheroid, enabling direct, rejection-free sampling using existing algorithms. Experiments on manipulation tasks with a 6-DoF UR5, 7-DoF Franka, and 14-DoF PR2 under three distinct Riemannian metrics show that our heuristic produces consistently tighter informed sets than both the Euclidean and scalar eigenvalue bounds, accelerating convergence across multiple state-of-the-art asymptotically optimal planners.
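
The Cholesky step in this abstract can be illustrated on a small example. The sketch takes a constant symmetric positive definite bound as given (computing the tightest Loewner-order bound is the paper's contribution and is not shown), factors it as C Cᵀ with C lower triangular, and maps x to y = Cᵀx so that xᵀ(C Cᵀ)x equals the squared Euclidean norm of y. Which factor convention the authors use is not stated, so this choice, the function names, and the 2x2 matrix are all assumptions.

```python
import math


def cholesky_lower(m):
    """Lower-triangular C with m = C C^T for a symmetric positive definite m."""
    n = len(m)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(m[i][i] - s)
            else:
                c[i][j] = (m[i][j] - s) / c[j][j]
    return c


def to_isotropic(c, x):
    """y = C^T x, so that x^T (C C^T) x equals the squared Euclidean norm of y."""
    n = len(x)
    return [sum(c[k][i] * x[k] for k in range(i, n)) for i in range(n)]


def lower_bound_distance(c, start, goal):
    """Euclidean distance between two configurations in the mapped space."""
    diff = [g - s for s, g in zip(start, goal)]
    return math.sqrt(sum(y * y for y in to_isotropic(c, diff)))


bound = [[4.0, 2.0], [2.0, 3.0]]   # hypothetical constant lower bound on the metric
c = cholesky_lower(bound)
h = lower_bound_distance(c, [0.0, 0.0], [1.0, 1.0])
```

If the bound really sits below the metric tensor everywhere in the Loewner order, every path's Riemannian length is at least its length under the constant bound, which in turn is at least `h`. That makes `h` an admissible cost-to-go estimate, and in the mapped coordinates the informed set takes the familiar Euclidean form.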

2606.02876 2026-06-03 cs.LG

RRISE: Robust Radius Inference via a Surrogate Estimator

Jong-Ik Park, Shreyas Chaudhari, Carlee Joe-Wong, José M. F. Moura

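Affiliations: Carnegie Mellon University

AI summary: Proposes RRISE, a framework that replaces Monte Carlo sampling with a surrogate model for randomized smoothing certification and guarantees conservative radii through one-time conformal calibration, greatly reducing computational cost while maintaining certified accuracy.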

Abstract

Randomized smoothing (RS) uses a smoothed classifier to provide architecture-agnostic certificates of $\ell_2$ classification robustness, but its dependence on per-input Monte Carlo (MC) sampling undermines its use in real-time systems. We argue that this cost is structural rather than fundamental, such that it can be significantly reduced by sharing information across the deployment stream. We introduce RRISE, an RS framework that compresses certification into a single forward pass through a learned surrogate. RRISE trains the surrogate against precomputed MC class-count targets via a soft-label cross-entropy loss and converts surrogate predictions into provably conservative certified radii through a one-time conformal calibration step. The resulting certificate is deployment-verifiable: whenever the calibrated radius is positive, the surrogate's prediction provably matches the smoothed classifier's and the smoothed classifier is constant on a ball of that radius around the input. Across image classification benchmarks, RRISE matches fixed-budget MC certified accuracy within $0.84$ percentage points while replacing up to $10^4$ noisy base-model evaluations per query with a single surrogate forward pass, recouping MC training cost after $\approx 10^5$ deployment queries. On CIFAR-100 and Tiny ImageNet, where the only prior offline-surrogate method collapses, RRISE achieves $1.23$ to $1.91\times$ higher certified accuracy, establishing efficient randomized smoothing as a practical path to certified robustness in repeated-deployment settings.
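
For context on the per-input cost that RRISE aims to remove, the sketch below shows a Monte Carlo certification step of the kind commonly used for randomized smoothing: count how often the top class wins under Gaussian noise, lower-bound its probability, and convert that to an l2 radius sigma * Phi^{-1}(p_lower). The radius formula is the standard one from the randomized smoothing literature and is assumed here rather than taken from the abstract. The Hoeffding bound is a simple, conservative stand-in for the exact binomial bound that is usually used, and the function name and defaults are hypothetical. This is not RRISE's surrogate or its conformal calibration.

```python
import math
from statistics import NormalDist


def mc_certified_radius(top_count, n_samples, sigma, alpha=0.001):
    """Conservative l2 radius from Monte Carlo class counts; 0.0 means abstain."""
    p_hat = top_count / n_samples
    # One-sided Hoeffding lower confidence bound at level 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


# Each call summarizes n_samples noisy forward passes for a single input.
radius_confident = mc_certified_radius(top_count=9900, n_samples=10000, sigma=0.5)
radius_uncertain = mc_certified_radius(top_count=5000, n_samples=10000, sigma=0.5)
```

Every certified input needs its own batch of noisy evaluations here, which is the cost the abstract describes replacing with one surrogate forward pass.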

2606.02875 2026-06-03 cs.AI

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

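Dipesh KC, Anjila Budathoki

Affiliations: Independent Researcher; Georgia State University

AI summary: Introduces the concept of "handoff debt" to study the rediscovery cost coding agents incur when resuming an interrupted task from a partial state, and proposes a takeover protocol that quantifies how different handoff views affect successor-agent efficiency.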

Abstract

Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through handoff debt: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20-59% and cumulative prompt tokens by 42-63% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.

2606.02871 2026-06-03 cs.CL cs.AI

Adaptive Latent Agentic Reasoning

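Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen

Affiliations: University of California, Davis; University of Waterloo; Greenshoe, Inc.

AI summary: Proposes ALAR, a dual-mode framework that uses compact latent reasoning at routine decision steps and switches to explicit chain-of-thought only when deeper deliberation is needed, substantially reducing generated tokens while maintaining or improving task accuracy.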

Abstract

Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

2606.02866 2026-06-03 cs.AI cs.CL cs.MA

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

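Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar

Affiliations: Meta Platforms, Inc.

AI summary: Studies multi-agent debate for data cleaning and finds that it degrades generation performance but improves error detection. By deriving a debate benefit condition and using an adversarially separated debate configuration, the work reports the first debate setup to significantly outperform a single agent on a generative task.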

Comments 27 pages, 4 figures, 12 tables. Includes appendix with full experimental results, prompt templates, and dataset statistics

Abstract

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
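One way to read the benefit condition is as a comparison of two expected flows: wrong outputs that debate rescues versus correct outputs that it destroys. The sketch below is a simplified reading under stated assumptions, not the paper's derivation. The names `base_accuracy`, `p_rescue`, and `p_destroy` are hypothetical, weighting each probability by the generator's base accuracy is an assumption, and the paper's decomposition of the rescue probability into Critic verification odds and fixability is not modeled.

```python
def accuracy_after_debate(base_accuracy: float, p_rescue: float, p_destroy: float) -> float:
    """Expected accuracy once a debate round has revised the generator's output.

    p_rescue:  chance that a wrong output is turned into a correct one.
    p_destroy: chance that a correct output is turned into a wrong one.
    """
    kept_correct = base_accuracy * (1.0 - p_destroy)
    newly_correct = (1.0 - base_accuracy) * p_rescue
    return kept_correct + newly_correct


def debate_helps(base_accuracy: float, p_rescue: float, p_destroy: float) -> bool:
    """True when the rescued mass exceeds the destroyed mass (assumed weighting)."""
    return (1.0 - base_accuracy) * p_rescue > base_accuracy * p_destroy


# A strong generator leaves little to rescue, so modest damage already hurts.
print(debate_helps(base_accuracy=0.9, p_rescue=0.3, p_destroy=0.1))
# A weaker starting point leaves much more to rescue.
print(debate_helps(base_accuracy=0.4, p_rescue=0.3, p_destroy=0.1))
```

Under this reading, the same Critic can hurt a task where the Generator is already accurate and help one where it is not, which is consistent with the sign reversal the abstract reports.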

2606.02863 2026-06-03 cs.AI

Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

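Marquita Ellis, Paul Castro

Affiliations: IBM Research

AI summary: Proposes the GAMBLe framework, which decomposes the behavior of AI-driven research systems into four parameters (generator G, assessor A, discovery mechanism M, budget B) and an effective landscape L_eff = A ∘ G. Experiments show that component choices can substantially improve performance and search efficiency.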

Comments Preprint. 21 pages (10 main, 11 appendix). 6 figures (2 in main, 4 in appendix)

Abstract

AI-Driven Research Systems (ADRS) -- systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs -- are being optimized and adopted across domains, but the tools to analyze them have not kept pace. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and (as we show) not well captured by standard convergence guarantees. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize. We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters (generator $G$, assessor $\mathcal{A}$, discovery mechanism $\mathcal{M}$, budget $B$) and one compositional object, the effective landscape $L_{\text{eff}} = \mathcal{A} \circ G$, which reveals that distinct generator-assessor pairs induce structurally different per-problem optimization landscapes. We exercise the framework on 760+ replicated runs (>46,000 iterations) spanning generators from single LLMs to dynamically-adaptive ensembles, mechanisms from greedy selection to co-evolutionary meta-search, and three NP-hard problems whose assessors range from continuous scoring to cliff functions. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open-source alternatives and the simplest mechanism sometimes outperforms state-of-the-art meta-search. Results show that even under limited budgets (60 iterations per run), the right component choices can improve performance by 13-67% and search efficiency by 6-39x.

2606.02862 2026-06-03 cs.AI cs.MA

Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

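Marcus Rüb, Michael Gerhards

Affiliations: ETH Zurich

AI summary: Proposes a modular reference architecture that uses a tiered design to decouple on-device agents from cloud agents and adds a governance layer, addressing the strict resource constraints of deploying autonomous AI on embedded microcontrollers.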

Abstract

The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

2606.02860 2026-06-03 cs.LG cs.AI

Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys

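Archie Chaudhury

Affiliations: Axionic Labs

AI summary: Using a stitched evaluation protocol and compact task-specific transport keys, finds that a significant portion of apparent catastrophic forgetting is attributable to interface drift between internal stages rather than permanent erasure of task-relevant computation, and that most earlier-task performance can be recovered after sequential training.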

Comments Technical report showcasing results from transport keys

Abstract

Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that supported performance on earlier tasks. We challenge the stronger form of this view. Across controlled continual-learning settings, we find that a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task-relevant computation. We study this phenomenon through a stitched evaluation protocol that combines early computation from a post-update network with late computation from its predecessor, optionally mediated by a compact, task-specific transport key. We describe transport keys at a systems level as compact interface-alignment operators estimated from a small set of paired anchor activations and evaluated through model stitching. On split CIFAR-100 with a ResNet-style network, transport keys recover most of the original Task A performance after sequential training on Task B. On a compact vision transformer, we observe a similar recovery pattern. These results suggest that continual learning may require better mechanisms for indexing and re-accessing latent computations, not only methods that prevent weight change.

2606.02859 2026-06-03 cs.CL cs.AI cs.MA

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

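Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du

Affiliations: Harvard University; Massachusetts Institute of Technology

AI summary: Inspired by Hayek's economic theory, uses simple economic signals from auctions and wealth accumulation to achieve decentralized credit assignment, so that a population of weak agents develops emergent multi-step reasoning strategies and outperforms stronger monolithic baselines on five agentic tasks.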

Abstract

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.

2606.02857 2026-06-03 cs.LG cs.AI

GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

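Liyan Tan, Yequan Zhao, Yifan Yang, Ruijie Zhang, Xinling Yu, Zheng Zhang

Affiliations: University of California, Santa Barbara

AI summary: Proposes the GRZO optimizer, which aggregates per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost. This lowers variance, improves convergence, and outperforms MeZO across multiple models and tasks.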

Comments Preprint. Under review

Abstract

Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployment is limited by the high variance of gradient estimation. We propose GRZO, a Group-Relative Zeroth-Order optimizer that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost while preserving inference-level memory. We prove that GRZO is directionally unbiased with variance shrinking proportionally to the batch size, yielding a tighter nonconvex convergence bound than MeZO. Across RoBERTa-large, Llama3-8B, and OPT-13B over multiple tasks, GRZO improves average accuracy on Llama3-8B by $+3.0$ over MeZO at $23\%$ lower peak GPU memory; as a drop-in replacement for the MeZO core, it lifts sparse, low-rank, and quantized ZO variants by $+6.0$ on average.
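For readers unfamiliar with the baseline, the two-point zeroth-order estimate that MeZO-style methods build on can be sketched as below. This illustrates the single-direction estimator only. GRZO's per-example perturbations and group-relative normalization are not reproduced, and the function names and the toy quadratic loss are assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple


def two_point_zo_gradient(
    loss: Callable[[List[float]], float],
    theta: List[float],
    eps: float,
    rng: random.Random,
) -> Tuple[List[float], List[float]]:
    """Estimate the gradient of `loss` at `theta` from two forward passes.

    Returns the gradient estimate and the random direction z that was used.
    """
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Finite-difference estimate of the directional derivative along z.
    projected = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [projected * zi for zi in z], z


def quadratic(theta: List[float]) -> float:
    """Toy loss whose true gradient is 2 * theta."""
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
n_draws = 20_000
avg = [0.0] * len(theta)
for _ in range(n_draws):
    grad, _direction = two_point_zo_gradient(quadratic, theta, eps=1e-3, rng=rng)
    avg = [a + gi / n_draws for a, gi in zip(avg, grad)]
# The average of many single-direction estimates approaches 2 * theta.
print([round(a, 2) for a in avg])
```

A single draw gives a gradient estimate along one random direction, which is why its variance is high. The abstract's claim is that GRZO obtains as many directions as there are examples in the batch without extra forward passes.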

2606.02841 2026-06-03 cs.LG math.AT

Learning Coherent Representations: A Topological Approach to Interpretability

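Sigurd Gaukstad, Melvin Vaupel, Valdemar Kargård Olsen, Erik Hermansen, Benjamin Dunn

Affiliations: University of Oslo

AI summary: Proposes "coherence", a geometric constraint inspired by neural coding in the brain, and trains models with Coh, a Fréchet-variance-based objective, so that features respond to contiguous regions of sample space, improving the interpretability of the learned representations.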

Comments To appear in ICML 2026

Abstract

Deep neural networks learn representations where individual features often lack interpretable meaning; a single neuron may activate for scattered, unrelated inputs. We introduce coherence, a geometric property inspired by neural coding in the brain, where neurons like grid cells and head direction cells respond to contiguous regions of state space. A non-negative matrix is coherent if each row (sample) attends to geometrically clustered columns (features) and vice versa, and in addition every sample is well described by some feature and every feature is needed by some sample. We prove that coherent matrices induce a bounded interleaving between the Vietoris-Rips filtrations of samples and features, guaranteeing that both spaces share compatible topological structure. This geometric constraint facilitates interpretability. For example, if data lies on a circle, coherent features must tile that circle into contiguous arcs. We introduce Coh, a differentiable objective function based on Fréchet variance that enforces coherence during training. Unlike sparsity, which bounds how many samples a feature activates on, coherence bounds which samples, requiring geometric connectivity rather than only rarity. This yields not just interpretable features but an interpretable feature space. We validate Coh in an auto-encoder using synthetic and rotated MNIST datasets and in a token embedding of BERT using language data.

2606.02837 2026-06-03 cs.CL cs.AI

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

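Andrea Brunello, Cristian Curaba, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno

Affiliations: University of Udine

AI summary: Human inspection reveals a large number of formalization errors in the FOLIO and MALLS datasets. The paper proposes an LLM-based framework that guides human review, substantially reducing the amount of review required and improving dataset accuracy.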

Abstract

Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.

2606.02835 2026-06-03 cs.AI

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

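Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Affiliations: University of Trento; Toyota Motor Europe; Fondazione Bruno Kessler

AI summary: Proposes a prefix-level trajectory evaluation protocol that defines reasoning sufficiency in order to separate verbose overthinking (redundant but harmless) from harmful overthinking (which derails an already-correct trajectory), and finds that current models are limited not only by their reasoning ability but also by their inability to stop at the right time.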

Abstract

Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: "Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking-past-the-answer.
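The distinction between verbose and harmful overthinking can be made concrete with a small sketch. It assumes, purely for illustration, that a reasoning trace has already been cut into prefixes and that the answer the model would give after each prefix is available as a list. The function names and labels are hypothetical and do not come from the paper's code.

```python
from typing import List, Optional


def first_correct_prefix(prefix_answers: List[str], gold: str) -> Optional[int]:
    """Index of the earliest prefix whose answer is correct, or None."""
    for i, answer in enumerate(prefix_answers):
        if answer == gold:
            return i
    return None


def classify_trajectory(prefix_answers: List[str], gold: str) -> str:
    """Label a trace by what happens after it first reaches the right answer."""
    first = first_correct_prefix(prefix_answers, gold)
    if first is None:
        return "never_correct"
    if prefix_answers[-1] != gold:
        # The model had the answer and then reasoned its way out of it.
        return "harmful_overthinking"
    if first < len(prefix_answers) - 1:
        # Reasoning continued past the first correct prefix but ended correct.
        return "verbose_overthinking"
    return "sufficient"


print(classify_trajectory(["B", "A", "A", "C"], gold="A"))  # harmful_overthinking
print(classify_trajectory(["B", "A", "A", "A"], gold="A"))  # verbose_overthinking
```

In this toy form, stopping at the index returned by `first_correct_prefix` is an oracle: it needs the gold answer, so it measures how much accuracy is lost to continued reasoning rather than providing a deployable stopping rule.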

2606.02832 2026-06-03 cs.AI

An Exploration of Collision-based Enemy Morphology Generation

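Johor Jara Gonzalez, Matthew Guzdial

Affiliations: Alberta Machine Intelligence Institute (Amii); Department of Computing Science, University of Alberta

AI summary: Explores three novel approaches to generating enemy morphologies from player collision information and shows that they perform as well as or better than an evolutionary baseline adapted from robotics morphology work.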

Abstract

Despite a great deal of prior research into Procedural Content Generation (PCG), relatively little prior work has explored generating enemies for video games. In particular, there is almost no work on generating enemy morphologies, the basic body plan or collision information for in-game enemies, despite the existence of related morphology generation work in robotics. In this paper, we explore three different novel approaches to generate enemy morphologies based on player collision information. We found that each approach provides different strengths and weaknesses, but all had equivalent or better performance than an evolutionary baseline adapted from prior robotics morphology work.

2606.02830 2026-06-03 cs.LG math.OC

Mitigating Spurious Correlations with Memorization-Guided Dataset De-Biasing

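Arda Fazla, Abolfazl Hashemi

Affiliations: School of Electrical and Computer Engineering, Purdue University

AI summary: Proposes a two-stage sample scoring function that disentangles the learning dynamics of core and spurious features to identify and prioritize informative samples, allowing a standard ERM model to outperform existing debiasing techniques with as little as 10% of the original training data.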

Abstract

Real-world datasets often contain spurious correlations that are not causally related to the target label. When such correlations dominate the majority of training samples, models tend to rely on them, leading to misclassification of minority samples that do not exhibit the same spurious patterns. While a potential approach is to select subsets of data to better represent the minority samples, this may require access to group labels, which are typically unknown. Furthermore, as we demonstrate, widely used sample scoring functions in the invariant subset or coreset selection literature largely depend on spurious features and therefore fail to accurately capture the importance or difficulty of core, causally relevant features. Accordingly, we propose to mitigate spurious correlations by developing a two-stage sample scoring function that disentangles the learning dynamics of core and spurious features and evaluates their difficulty separately. Based on our proposed metric, we introduce a new algorithm to find and prioritize informative samples both with and without spurious correlations. Extensive experiments demonstrate that a standard ERM model trained on our selected samples achieves superior performance compared to state-of-the-art debiasing techniques, while requiring as little as 10% of the original training data.

2606.02823 2026-06-03 cs.LG

Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

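Chi-Wei Huang, Chia-Chi Tsai

Affiliations: National Cheng Kung University

AI summary: To address the poor performance of the standard W2 level set under W2A4/KV4 settings in rotated quantization pipelines, proposes Qift, a fixed no-zero W2 level set based on a zero-centered Gaussian source model. By redesigning the code-to-level mapping it is training-free, learned-codebook-free, group-grid-free, and zero-point-free, and it substantially improves perplexity and downstream accuracy.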

Comments 23 pages, 8 figures

Abstract

Two-bit weight quantization is attractive for memory-efficient LLM inference, but the standard W2 level set {-2,-1,0,+1} often collapses under aggressive W2A4/KV4 settings. We study the scalar level-set geometry of two-bit weights in a Hadamard-rotated quantization pipeline. Conventional asymmetric W2 substantially improves over the standard level set, indicating that W2A4 failure is not only a bit-width problem but also a reconstruction-level problem. Across all 224 linear modules in each of LLaMA-2-7B and LLaMA-3.1-8B, pretrained weights are already nearly zero-centered, while Hadamard rotation primarily Gaussianizes their standardized shape: excess kurtosis and Q-Q error drop by orders of magnitude. Based on this approximate zero-centered Gaussian-like source model, we propose Qift, a fixed no-zero W2 level set for rotated W2A4/KV4 inference. The main level set is {+/-0.5, +/-1.5}, equivalently {+/-1, +/-3} under a half-scale reparameterization; a power-of-two variant uses {+/-1, +/-4} for sign-and-shift decoded weight application. Qift redesigns the fixed two-bit code-to-level mapping and is training-free, learned-codebook-free, group-grid-free, and zero-point-free, retaining the standard per-channel scale. A scale-invariant ratio analysis identifies an effective inner/outer centroid ratio range of 0.25 to 0.33, explaining why mirror no-zero (MNZ), Lloyd, NF2, and PoT-MNZ perform well while {+/-1, +/-2} does not. On both models, the no-zero level sets consistently improve pure W2A4 perplexity, L-layer mixed W2/W4 perplexity, downstream accuracy, and GPTQ residual behavior over the standard W2 level set. At L=16 mixed precision, they substantially narrow the gap to W3A4 while keeping half of the transformer layers at two-bit precision, giving a simple, source-aware, and deployment-friendly alternative to more complex learned W2 codebooks.
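The level sets named in the abstract can be compared directly. The sketch below computes an inner/outer ratio for each set and applies nearest-level rounding with a per-channel scale. Reading the ratio as the smallest level magnitude over the largest, the nearest-level rounding rule, and the max-based scale are all assumptions for illustration, since the abstract does not specify how Qift chooses the scale or assigns codes.

```python
from typing import List, Sequence

QIFT_MAIN = (-1.5, -0.5, 0.5, 1.5)   # main no-zero level set
QIFT_POT = (-4.0, -1.0, 1.0, 4.0)    # power-of-two variant
REJECTED = (-2.0, -1.0, 1.0, 2.0)    # the {+/-1, +/-2} set the abstract says performs poorly


def inner_outer_ratio(levels: Sequence[float]) -> float:
    """Ratio of the smallest to the largest level magnitude."""
    magnitudes = sorted(abs(v) for v in levels)
    return magnitudes[0] / magnitudes[-1]


def quantize_channel(weights: List[float], levels: Sequence[float] = QIFT_MAIN) -> List[float]:
    """Round one output channel to the nearest scaled level.

    Assumption: the scale maps the largest |weight| onto the outer level.
    """
    outer = max(abs(v) for v in levels)
    # Fall back to a unit scale for an all-zero channel.
    scale = max(abs(w) for w in weights) / outer or 1.0
    return [scale * min(levels, key=lambda v: abs(w / scale - v)) for w in weights]


for name, levels in [("main", QIFT_MAIN), ("pot", QIFT_POT), ("rejected", REJECTED)]:
    print(name, round(inner_outer_ratio(levels), 3))
print(quantize_channel([0.1, -0.2, 1.4, -3.0]))
```

Under this reading the main set and the power-of-two variant sit at roughly 0.33 and 0.25, the two ends of the range the abstract reports as effective, while {+/-1, +/-2} sits at 0.5. Because no level equals zero, small weights are pushed to the inner levels rather than collapsed to zero.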

2606.02812 2026-06-03 cs.AI cs.CL

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

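Sihang Zeng, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Affiliations: University of Washington; Fred Hutch Cancer Center; Google

AI summary: Proposes Traj-Evolve, a self-evolving multi-agent system that combines an Experience Pool with multi-agent reinforcement learning. Through similar-patient retrieval and parametric optimization it outperforms 9 strong baselines on early lung cancer detection.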

Abstract

Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

2606.02809 2026-06-03 cs.CV

Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

自动化报告驱动的肿瘤学VQA基准:用于评估3D医学影像上的视觉-语言模型

Bo Liu, Hanxue Gu, Xiangru Li, Zheren Zhu, Jacob Ellison, Kang Wang, Janine M. Lupo, Yang Yang, Hui Lin

Affiliations * UCSF–UC Berkeley Joint Graduate Program in Bioengineering; Department of Radiology, UCSF; Department of Radiation Oncology, UCSF

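AI summary Proposes an automated pipeline that generates multiple-choice VQA datasets from private radiology reports and 3D oncology imaging, builds a contamination-controlled benchmark, evaluates six vision-language models, and finds that visual reliance varies by dataset.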

Details
AI Chinese abstract

评估医学影像上的视觉-语言模型(VLM)需要临床基础、可扩展且控制评估混淆的基准。现有的公共基准在规模上有限、需要手动标注,或可能泄露到VLM预训练语料中。我们提出一个自动化智能体驱动的管道,直接从配对的私有放射学报告和3D肿瘤影像生成多选VQA数据集,产生两种互补的问题类型:从临床医生定义的报告模式确定性导出的RADS风格问题,以及由LLM根据放射科医生发现生成并对照源报告验证的放射学报告衍生问题。应用于四个内部癌症队列,该管道产生一个实例污染控制的基准,无需每个问题的人工标注。对六个VLM的零样本评估显示没有主导模型,且所有单元均有显著提升空间。一项盲消融实验显示,视觉依赖高度特定于数据集:肝脏报告衍生问题确实需要图像,而肺CT基本上可以在没有图像的情况下解决——领先的闭源模型在盲测时在肺CT上的准确率超过其有视觉的准确率,这表明即使是私有临床数据也不能保证对视觉能力的污染控制读取。该管道作为开放智能体技能发布,用于内部重新部署。

English abstract

Evaluating vision-language models (VLMs) on medical images requires benchmarks that are clinically grounded, scalable, and controlled for evaluation confounds. Existing public benchmarks are limited in scale, manually annotated, or potentially leaked into VLM pretraining corpora. We present an automated agent-driven pipeline that generates multiple-choice VQA datasets directly from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style questions deterministically derived from clinician-defined reporting schemas, and radiology report-derived questions generated by an LLM from radiologist findings and verified against the source report. Applied to four in-house cancer cohorts, the pipeline yields an instance-contamination-controlled benchmark without per-question human annotation. Zero-shot evaluation of six VLMs reveals no dominant model and substantial headroom across all cells. A blind ablation reveals that visual reliance is highly dataset-specific: liver Report-derived questions genuinely require the image, while Lung CT is essentially solvable without it. The leading closed model exceeds its sighted accuracy on Lung CT when blinded, indicating that even private clinical data does not guarantee a contamination-controlled read of visual capability. The pipeline is released as an open agent skill for in-house redeployment.
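The blind ablation described in the abstract compares a model's accuracy with the image ("sighted") against its accuracy on the same questions without the image ("blinded"). A minimal sketch of that comparison follows. The function names and the toy answer lists are hypothetical and are not taken from the paper or its data; they only show how a per-dataset gap could be computed.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def visual_reliance_gap(sighted_preds, blind_preds, answers):
    """Sighted accuracy minus blinded accuracy on the same questions.

    A large positive gap suggests the questions need the image; a gap near
    zero or below zero suggests they can be answered from text alone.
    """
    return accuracy(sighted_preds, answers) - accuracy(blind_preds, answers)


# Made-up toy example, not results from the paper.
answers = ["A", "B", "C", "D"]
sighted = ["A", "B", "C", "A"]  # 3 of 4 correct
blind = ["A", "C", "D", "A"]    # 1 of 4 correct

gap = visual_reliance_gap(sighted, blind, answers)
```

A negative gap would correspond to the situation reported for Lung CT, where the blinded model scores higher than the sighted one.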