arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2606.03803 2026-06-10 cs.SD cs.AI eess.AS 版本更新

LiveBand: Live Accompaniment Generation in the Audio Domain

LiveBand: 音频域中的实时伴奏生成

Marco Pasini, Javier Nistal, Ben Hayes, Mathias Rose Bjare, Stefan Lattner, George Fazekas

发表机构 * University of Cambridge（剑桥大学）

AI总结提出LiveBand系统，利用因果Transformer在预训练因果音频自编码器的连续潜空间中生成高保真伴奏，通过对抗序列级监督训练，实现实时流式生成。

详情

AI中文摘要

我们提出LiveBand，一个实时系统，能够为现场音频输入生成高保真音乐伴奏，并严格遵守因果约束。我们的方法在预训练因果音频自编码器的连续潜空间中训练因果Transformer生成器，使用来自判别器的对抗序列级监督。在每个时间步，生成器仅接收因果可用的混合上下文和高斯噪声，并预测伴奏潜变量，而无法访问未来混合帧或真实目标潜变量。训练在因果掩码下通过单个并行前向传播完成，而流式推理则通过滚动注意力状态自回归进行。模型训练和推理计算在设计中匹配，消除了教师强制及相关曝光偏差。在多乐器音乐伴奏基准测试中，LiveBand在音频质量、节拍对齐和混合一致性的客观指标上优于先前工作，同时能够在消费级硬件上实现无需前瞻的实时流式生成。

英文摘要

We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.

URL PDF HTML ☆

赞 0 踩 0

2606.02608 2026-06-10 cs.LG 版本更新

Pruning Deep Neural Networks via the Marchenko--Pastur Distribution

通过Marchenko-Pastur分布剪枝深度神经网络

Leonid Berlyand, Theo Bourdais, Houman Owhadi, Yitzchak Shmalo

发表机构 * Department of Mathematics, Pennsylvania State University（数学系，宾夕法尼亚州立大学）； Department of Computing and Mathematical Sciences, California Institute of Technology（计算与数学科学系，加州理工学院）

AI总结提出基于Marchenko-Pastur随机矩阵理论的剪枝方法，在极短微调预算下保持精度，并在ImageNet-1k上验证了多种架构的高效稀疏执行加速。

详情

AI中文摘要

我们研究了一种基于Marchenko-Pastur（MP）随机矩阵的深度神经网络剪枝方法，该方法适用于非常小的剪枝后微调预算。主要的实际贡献在于，在短校准和微调计划下保持精度，而不是采用长的剪枝后重新优化流程。理论给出了确定性数据路径保证：如果移除的组件$R$具有小的传播logit效应$L_s \| R \psi_1(s) \|_\infty$，则剪枝会降低弹性网络目标，并保留那些密集边界超过两倍扰动的样本。零预算情况给出了完美剪枝；剪枝-恢复扩展在固定稀疏执行模式内建模权重恢复；加性$L_2$正则化模型显示，可接受的随机类组件在训练极限处消失，而持续尖峰在MP bulk坍缩时稳定。在iid高斯充分条件下，拟合的MP边缘$\sigma_+$给出了高概率的逐层预算信号。在ImageNet-1k上，仅经过三个蒸馏epoch，ViT-B/16 $2{:}4{+}$ToMe在$59.81\%$稀疏执行MAC缩减下达到$83.41\%$ top-1（比密集低1.70个百分点），同一检查点和ToMe图在A40原生$2{:}4$后端加速比为$1.388 imes$；另一个无ToMe的A100端点加速比为$2.705 imes$。在结构化稀疏方面，ViT-B/16 $6{:}12$达到$83.74\%$，ViT-L/16 $8{:}16$密集+置换达到$85.33\%$（低0.51个百分点），ConvNeXtV2-Base $12{:}16$达到$86.35\%$（低0.37个百分点）。对于CNN，ResNet50 $8{:}16$密集+置换达到$75.87\%$（低0.26个百分点），ResNet152d CAST-conv+置换达到$81.33\%$（低1.53个百分点），在${\sim}50\%$ MAC核算下，A40 im2col$+2{:}4$稀疏GEMM审计加速比为$1.62 imes$。

英文摘要

We study a Marchenko--Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component $R$ has small propagated logit effect $L_s \| R ψ_1(s) \|_\infty$, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune--restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive $L_2$-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge $σ_+$ gives a high-probability layerwise budget signal. On ImageNet-1k, after only three distillation epochs, ViT-B/16 $2{:}4{+}$ToMe reaches $83.41\%$ top-1 ($-1.70$ pp from dense) at $59.81\%$ sparse-execution MAC reduction, with $1.388\times$ best-observed A40 native-$2{:}4$ backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives $2.705\times$. At structured sparsity, ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$ ($-0.26$ pp), and ResNet152d CAST-conv+permutation reaches $81.33\%$ ($-1.53$ pp) at ${\sim}50\%$ MAC accounting with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit.

URL PDF HTML ☆

赞 0 踩 0

2606.02386 2026-06-10 cs.AI q-bio.QM 版本更新

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

AgentPLM：具有推理增强解码的智能体蛋白质语言模型用于蛋白质序列设计

Sahil Rahman, Maxx Richard Rahman

发表机构 * Sahil Rahman ； Maxx Richard Rahman

AI总结提出AgentPLM，通过推理增强解码和对比智能体策略优化，使预训练蛋白质语言模型能够利用外部生物物理反馈进行在线纠错，在多项蛋白质设计任务上取得最优结果。

详情

Journal ref: Workshop on Generative and Agentic AI for Biology, 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

蛋白质语言模型（PLM）是被动预言机：它们通过单次前向传递生成序列，没有机制来咨询外部生物物理反馈或在候选序列违反热力学或结构约束时重定向生成。我们引入AgentPLM，通过为预训练PLM配备i)推理增强解码（RAD），该解码将自回归生成与工具调用（ESMFold、FoldX、AutoDock Vina）交错进行，以及ii)对比智能体策略优化（CAPO），这是直接偏好优化的轨迹级扩展，它端到端地训练策略以学习何时预言机反馈具有信息性，而不仅仅是模仿高适应度序列。我们在基准任务上评估AgentPLM，涵盖从头酶设计、抗体优化、热稳定性、PPI界面设计和零样本适应度预测，使用标准化的预言机API和受控的序列同一性划分。AgentPLM取得了最先进的结果，抗体前10%命中率相比最强被动基线有所提升，提供了无需显式回溯的在线纠错的机制证据。

英文摘要

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.

URL PDF HTML ☆

赞 0 踩 0

2606.02378 2026-06-10 cs.LG cs.AI 版本更新

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-ClassArchitectures

注意力电路何时形成？三种1B级架构中能力和注意力汇出现的发育轨迹

Yongzhong Xu

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文追踪三种1B级语言模型中注意力头电路的形成轨迹，发现归纳电路形成早于注意力汇形成10-20倍令牌，且电路识别无需最终模型。

Comments 27pages, 3 figures

详情

AI中文摘要

我们追踪了三种1B级语言模型（涵盖两个架构系列：密集Transformer和混合专家，以及两个预训练语料库：The Pile和DCLM）中注意力头电路形成的发育轨迹：Pythia 1B、OLMo 1B-0724-hf和OLMoE 1B-7B-0924。在每个模型的10个对数间隔检查点（共30次机械可解释性运行）上，我们应用参与比（PR）谱信号和全头能力特异性选择性筛选，以追踪归纳头、前一个标记头和BOS吸引头出现的过程。五个发现：（F1）第0层和第1层在每个模型的每个检查点都没有产生BOS分类头：L0/L1零BOS底限是架构属性，而非学习结果。（F2）整个模型的BOS吸引头比例呈现三种不同的出现形状——Pythia 1B中逐渐上升，OLMo 1B中急剧相变（相邻检查点间从7%到70%），OLMoE 1B-7B中逐渐上升。（F3）在DCLM模型中，归纳电路的形成比BOS吸引头的形成早10-20倍令牌；能力电路的形成和注意力汇的形成是两个不同的转变，而非一个。（F4）能力特异性筛选在总训练令牌的0.3-2%内收敛到最终的归纳电路——电路识别不需要最终模型。（F5）对于从所有三个模型中采样的每个最终检查点归纳头，其每个头的PR值在该头首次超过其能力选择性阈值时或之前就已升高。这些结果细化了归纳相变框架：在DCLM上训练的1B级模型中，归纳转变和注意力汇转变在令牌数量上相差一个数量级，并且具有定性不同的形状。

英文摘要

We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model -- 30 mechanistic-interpretability runs in total -- we apply a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen to track induction, previous-token, and BOS-attractor heads as they emerge. Five findings. (F1) Layers 0 and 1 produce zero BOS-classified heads at every revision in every model: the L0/L1 zero-BOS floor is an architectural property, not a learned outcome. (F2) The whole-model BOS-attractor fraction follows three distinct emergence shapes -- a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B (7% to 70% between adjacent checkpoints), and a gradual ramp in OLMoE 1B-7B. (F3) In DCLM models, induction-circuit formation precedes BOS-attractor formation by 10-20x in tokens; capability-circuit formation and attention-sink formation are two transitions, not one. (F4) The capability-specific screen converges to the final induction circuit within 0.3-2% of total training tokens -- circuit identification does not require the final model. (F5) For every final-checkpoint induction head sampled across all three models, per-head PR is elevated at or before the first revision at which that head crosses its capability-selectivity threshold. The results refine the induction-phase-transition framing: in 1B-class models trained on DCLM, the induction transition and the attention-sink transition are separated by an order of magnitude in tokens and have qualitatively different shapes.

URL PDF HTML ☆

赞 0 踩 0

2606.02224 2026-06-10 cs.CV 版本更新

Chroma Clues: Leveraging Color Statistics to Detect Synthetic Images

颜色线索：利用颜色统计检测合成图像

Lea Uhlenbrock, Davide Cozzolino, Christian Riess

发表机构 * Deutsche Forschungsgemeinschaft（德国研究基金会）

AI总结利用生成模型在颜色统计上的弱点，通过手工设计的颜色变换和学习优化的颜色变换，提出像素级或块级颜色敏感特征，实现高泛化准确率和鲁棒性的合成图像检测。

详情

AI中文摘要

AI合成图像的演变和传播正以前所未有的速度进行。图像生成器在完美模仿自然图像的目标上取得了快速进展，这也挑战了图像取证。在这项工作中，我们利用了当前生成模型中一个未被充分探索的线索，即它们在模仿自然图像的颜色统计方面的弱点。我们首先展示了用于训练图像生成器的LPIPS损失对色度的敏感性低于亮度，这可能导致合成图像颜色的统计差异。基于这一观察，我们随后引入了六种手工设计的颜色变换和一种学习任务优化颜色变换的方法，以统计上暴露生成的图像。这些变换可以以多种方式使用。首先，我们在像素级或块级定义了颜色敏感特征。一个简单、可解释的分类器使用这些特征实现了平均泛化准确率93.27%，并对六种后处理具有强鲁棒性。其次，我们证明了这些变换在自然和合成图像区域中表现出特征性的视觉噪声模式，从而实现直观的视觉图像评估。第三，我们证明了这些变换可以增强生成图像中的颜色模式，以改进多类归因。

英文摘要

The evolution and dissemination of AI-synthesized images is occurring at an unprecedented rate. Image generators are making rapid progress in their goal of perfectly imitating natural images, which also challenges image forensics. In this work, we exploit an underexplored cue in current generative models, namely their weakness to imitate color statistics of natural images. We first show that the LPIPS loss used for training image generators is less sensitive to chrominance than to luminance, which may lead to statistical discrepancies in the colors of synthetic images. Building on this observation, we then introduce six hand-crafted color transformations and a method to learn a task-optimized color transform to statistically expose generated images. These transformations can be used in various ways. First, we define color-sensitive features at pixel-level or patch-level. A simple, interpretable classifier achieves with these features an average generalization accuracy of 93.27% and strong robustness against six types of post-processing. Second, we demonstrate that the transformations exhibit characteristic visual noise patterns in natural and synthetic image areas, which enables an intuitive visual image evaluation. Third, we demonstrate that the transforms can enhance color patterns in generated images for improved multiclass attribution.

URL PDF HTML ☆

赞 0 踩 0

2606.01929 2026-06-10 cs.AI 版本更新

VET: A Framework for Analyzing AI Discourse

VET：一个分析AI话语的框架

Meredith Ringel Morris

发表机构 * The Paul G. Allen School of Computer Science & Engineering and The Information School（保罗·G·艾伦计算机科学与工程学院和信息学院）

AI总结本文提出VET框架，通过效价、有效性和轨迹三个维度对AI话语进行分类，用于识别和批判AI炒作、AI末日、AI否认和AI常态四种立场，并作为AI素养工具支持对极化话语的审视。

2606.01884 2026-06-10 cs.AI 版本更新

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

EVA-Net: 基于视频衍生运动先验的跨被试脑电运动解码

Ziyuan Li, Yueyu Sun, Yimeng Zhang

发表机构 * South China University of Technology（华南理工大学）

AI总结提出EVA-Net两阶段框架，利用动作视频作为语义先验，通过跨模态对比学习和对齐减少个体差异，实现跨被试脑电运动解码，在EEGMMI上取得8.66%的LOSO准确率提升。

详情

AI中文摘要

实用的非侵入式脑机接口（BCI）系统需要具有强跨被试泛化能力和最小校准的脑电解码器。然而，被试间差异和信号非平稳性常常将运动语义与被试特定噪声纠缠在一起，限制了跨被试解码。最近的多模态方法使用文本作为语义锚点，但文本对本质上动态的运动过程提供了稀疏且静态的监督。为了解决这个问题，我们提出了EVA-Net，一个两阶段框架，使用动作视频作为跨被试脑电运动解码的语义先验。在第一阶段，使用跨模态和监督对比目标将脑电和视频特征在共享空间中对齐，以减少被试特定变化。在第二阶段，视频类别原型和知识蒸馏将视频衍生的先验转移到仅脑电的分类器，而不增加推理开销。在两个公共数据集上的实验表明，EVA-Net实现了强大的跨被试解码性能，包括在EEGMMI上8.66%的LOSO准确率提升。消融结果进一步表明，视频提供了比本文考虑的文本基线更有效的语义锚点。

英文摘要

Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal calibration. However, inter-subject variability and signal non-stationarity often entangle motor semantics with subject-specific noise, limiting subject-independent decoding. Recent multimodal approaches use text as a semantic anchor, yet text provides sparse and static supervision for inherently dynamic motor processes. To address this issue, we propose EVA-Net, a two-stage framework that uses action videos as semantic priors for subject-independent EEG motor decoding. In the first stage, EEG and video features are aligned in a shared space using cross-modal and supervised contrastive objectives to reduce subject-specific variation. In the second stage, video category prototypes and knowledge distillation transfer video-derived priors to an EEG-only classifier without adding inference overhead. Experiments on two public datasets show that EVA-Net achieves strong subject-independent decoding performance, including an 8.66% LOSO accuracy gain on EEGMMI. Ablation results further suggest that video provides a more effective semantic anchor than the text baseline considered in this work.

URL PDF HTML ☆

赞 0 踩 0

2605.31405 2026-06-10 cs.RO 版本更新

Adaptive Artificial Time-Delay Control with Barrier Lyapunov Constraints for Euler-Lagrange Robots

具有障碍李雅普诺夫约束的自适应人工时延控制用于欧拉-拉格朗日机器人

Saksham Gupta, Rishabh Dev Yadav, Sarthak Mishra, Amitabh Sharma, Sourish Ganguly, Wei Pan, Spandan Roy, Simone Baldi

发表机构 * Robotics Research Center, International Institute of Information Technology Hyderabad, India（机器人研究中心，国际信息科技大学 Hyderabad，印度）； Department of Computer Science, University of Manchester, UK（计算机科学系，曼彻斯特大学，英国）； Autonomous Systems and Automatic Control in School of Engineering, Newcastle University, UK（工程学院自主系统与自动控制，新castle大学，英国）； Self-Organizing Mobility Lab, School of Mathematics, Southeast University, Nanjing 210096, China（自组织移动实验室，数学学院，东南大学，南京210096，中国）

AI总结针对欧拉-拉格朗日系统中的状态相关不确定性和时变状态约束问题，提出一种结合人工时延估计与障碍李雅普诺夫函数的自适应控制框架，通过在线估计不确定性上界并强制约束位置和速度，实验验证了其在五自由度机械臂上的有效性。

详情

AI中文摘要

本文解决了欧拉-拉格朗日系统中同时补偿状态相关不确定性和强制执行时变状态约束的挑战，这是机器人学中的常见需求，但现有控制设计尚未充分满足。开发了一种新颖的自适应控制框架，将基于人工时延的不确定性估计策略（也称为时延估计）与障碍李雅普诺夫函数相结合，以实现约束感知控制设计。具体而言，分析性地推导了时延估计近似误差的状态相关上界，并构造了自适应律在线估计其参数，从而无需先验模型知识即可实现实时状态相关不确定性补偿。为确保约束满足，基于障碍李雅普诺夫函数的控制器对位置和速度施加时变界限。通过李雅普诺夫分析证明了所提架构的稳定性。在五自由度机械臂上的实验结果验证了该框架相较于现有技术，在动态不确定性下保持严格遵守安全关键约束的能力。

英文摘要

This paper addresses the challenge of simultaneously compensating for state-dependent uncertainties and enforcing time-varying state constraints in Euler-Lagrange systems, a common requirement in robotics that remains underserved by existing control designs. A novel adaptive control framework is developed that combines an artificial time-delay-based uncertainty estimation strategy, also known as time-delay estimation, with a barrier Lyapunov function to enforce constraint-aware control design. Specifically, a state-dependent upper bound on the time-delay estimation approximation error is analytically formulated, and an adaptive law is constructed to estimate its parameters online, enabling real-time state-dependent uncertainty compensation without relying on prior model knowledge. To ensure constraint compliance, the barrier Lyapunov function-based controller enforces time-varying bounds on both position and velocity. The resulting architecture is provably stable via Lyapunov analysis. Experimental results on a five-degree-of-freedom robotic manipulator validate the framework's capability, compared with the state of the art, in maintaining strict adherence to safety-critical constraints under dynamic uncertainties.

URL PDF HTML ☆

赞 0 踩 0

2605.28093 2026-06-10 cs.CL 版本更新

ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering

ConRAG: 用于多跳问答的共识驱动多视角检索

Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）

AI总结提出ConRAG框架，通过共识驱动的多视角检索（关系、实体、文本信号）优化查询和语料库，显著提升多跳问答性能，在MuSiQue上创下新纪录。

详情

AI中文摘要

检索增强生成（RAG）已成为增强大型语言模型（LLMs）在多跳问答（QA）上的有前景范式，这需要对来自多个文档的证据进行推理。当前的多跳RAG方法通常侧重于查询侧任务分解或语料侧知识图谱构建。尽管取得了进展，这些方法在复杂的多跳QA任务上仍难以达到令人满意的性能。为此，我们提出了ConRAG，一个共识驱动的多视角RAG框架，有效提升了LLMs在复杂多跳QA上的表现。ConRAG的核心是系统性地优化查询和语料两侧，并利用多视角证据（关系、实体和文本信号）进行更准确的检索。在三个多跳QA基准上的大量实验表明，ConRAG以明显优势持续优于所有基线，例如，与普通RAG相比平均性能提升高达+26.9%，并使Gemma-4-31B在具有挑战性的MuSiQue基准上创下新的最先进记录。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.

URL PDF HTML ☆

赞 0 踩 0

2605.28066 2026-06-10 cs.CL cs.AI 版本更新

PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

PromptEmbedder：通过双LLM软提示实现高效且可迁移的文本嵌入

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University（国立台湾大学计算机科学与资讯工程系）； National Taiwan University AI Center of Research Excellence（国立台湾大学人工智能研究中心）

AI总结提出PromptEmbedder双LLM框架，通过可微分的软提示生成将嵌入知识从特定骨干权重中解耦，在保持性能的同时降低40% GPU内存并加速3.7倍训练。

详情

AI中文摘要

大型语言模型（LLM）在文本嵌入方面展现出显著效果，但当前的适应方法（如LoRA）在计算效率和跨架构可迁移性方面面临重大瓶颈。每当出现新的骨干网络时，现有方法需要从头开始进行昂贵的重新训练。为了解决这个问题，我们提出了PromptEmbedder，一种新颖的双LLM框架，将嵌入知识与特定骨干权重解耦。PromptEmbedder利用一个提示LLM通过连续松弛的可微分生成过程，为冻结的嵌入LLM生成指令感知的软提示，确保对比训练期间的全梯度流动。通过将任务特定知识定位在提示LLM中，适应新架构只需重新训练一个轻量级的线性对齐矩阵。在MTEB基准上的评估表明，PromptEmbedder实现了与LoRA微调相当的性能，同时将GPU内存减少40%，训练速度提升3.7倍。我们的方法建立了一种可扩展、架构无关的范式，用于高效的基于LLM的表示学习。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

URL PDF HTML ☆

赞 0 踩 0

2605.28057 2026-06-10 cs.LG cs.AI 版本更新

On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective

测试时自适应可学习性：恢复复杂性视角

Zhi Zhou, Ming Yang, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li

发表机构 * Zhi Zhou（周智）； Ming Yang（杨明）； Shi-Yu Tian（田世宇）； Kun-Yang Yu（于坤阳）； Lan-Zhe Guo（郭兰哲）； Yu-Feng Li（李宇峰）

AI总结本文提出首个理论框架，通过引入(ε,δ)-恢复复杂性和(ε,ρ)-TTA可学习性，研究非平稳流下测试时自适应的可学习性，并揭示自适应-信息权衡。

Comments Accepted by ICML 2026

详情

AI中文摘要

测试时自适应（TTA）旨在调整模型以在非平稳测试流上保持可靠性能，无需标注数据。尽管其实证成功，非平稳流下TTA的可学习性仍未探索。关键挑战在于缺乏一个原则性的理论框架，该框架同时与TTA目标一致，并捕捉持续演变的分布偏移和内在信息约束。为解决这一差距，我们提出了首个研究TTA可学习性的理论框架，并引入了(ε,δ)-恢复复杂性和(ε,ρ)-TTA可学习性。恢复复杂性衡量偏移后维持超额风险低于目标水平所需的时间（高概率），并进一步扩展到TTA可学习性，衡量TTA的长期可靠性。在该框架内，我们引入了一种新颖的非平稳测试流离散替代，使得对渐变和突变进行统一且易于处理的分析成为可能。我们推导了恢复复杂性的阶匹配下界和上界，揭示了TTA的基本限制和内在的自适应-信息权衡。这些结果为TTA提供了统一的可学习性保证，补充了基于遗憾的分析。

英文摘要

Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under non-stationary streams remains unexplored. A key challenge is the lack of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for studying the learnability of TTA and introduce $(ε,δ)$-Recovery Complexity and $(ε,ρ)$-TTA Learnability. Recovery complexity measures the post-shift time needed to maintain excess risk below a target level with high probability, and is further extended to TTA learnability, which measures the long-term reliability of TTA. Within this framework, we introduce a novel discrete surrogate for non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt shifts. We derive order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic adaptivity-information trade-off. These results provide unified learnability guarantees for TTA that complement regret-based analyses.

URL PDF HTML ☆

赞 0 踩 0

2605.27914 2026-06-10 cs.CL cs.AI 版本更新

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

让结果说话：LLM行为基准测试的复制优先范式

Yuming, Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

发表机构 * Cylingo team（Cylingo团队）

AI总结提出复制优先范式，通过可靠性、跨仪器复制、历史足迹校准和预注册预测四个正交属性验证LLM行为评估工具，并在情感陪伴任务中测试，发现聚合分数掩盖的模型退化。

详情

AI中文摘要

对LLM行为的主观评估——如共情、克制、校准的情感语气——是困难的。人类评估者之间对这些品质的一致性饱和在rho约0.45附近，仅使用LLM作为评判代理存在循环论证的风险：与目标共享训练群体的评判者无法独立验证。将有效性锚定于单一人类评估者共识并不适用于人类自身存在分歧的能力。我们提出一种复制优先范式：不是锚定于一个评估者群体，而是通过四个正交属性认证工具——跨K次运行的可靠性、跨架构不同评判者的跨仪器复制、通过早期训练群体的评判者进行的历史足迹校准，以及预注册预测。我们在情感陪伴任务上测试该范式，让评分标准在迭代中数据驱动地自我演化：维度不是预先规定的，过程稳定在9维集合。预注册应用于10个可证伪假设和11个前向预测，在收集任何测试数据之前提交。应用于8个家族的49个模型，该范式揭示了聚合分数所隐藏的内容。在建议克制方面——模型是否在共情情境中避免提供未经请求的解决方案——gpt-5比gpt-4.1下降1.87分，Opus-4.7比Opus-4.6下降0.629分，而聚合分数保持平稳。这种退化在三次用户代理替换中幸存（95%的幅度），在5家族评判者堆栈和17个月队列间隔中复制，并在74个保留的真实ESConv对话中持续存在（rho在[0.749, 0.850]之间）；工具达到序数Krippendorff alpha=0.91。作为副产品，该范式充当饱和源诊断器，区分工具性天花板（可通过评分标准细化突破）和结构性天花板（需要场景或名单干预）。

英文摘要

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

URL PDF HTML ☆

赞 0 踩 0

2605.27286 2026-06-10 cs.LG cs.AI 版本更新

Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

Falcon-X：面向异构多变量建模的时间序列基础模型

Yiding Liu, Yifan Hu, Hongjie Xia, Peiyuan Liu, Hongzhou Chen, Xilin Dai, Zewei Dong, Jiang-Ming Yang

发表机构 * Ant Group（蚂蚁集团）

AI总结针对现有时间序列基础模型在语义对齐和关系表达上的局限，提出Falcon-X，通过将变量映射到统一潜在原型空间，利用统一原型差分注意力机制对齐异构变量，并在共享空间中通过潜在实体注意力进行跨变量交互，实现零样本结构迁移，在GIFT-Eval和fev-bench基准上取得最先进预测性能。

详情

AI中文摘要

时间序列基础模型（TSFMs）通过大规模跨领域预训练正在改变预测范式。然而，大多数现有的TSFMs仍然是单变量的，最近实现跨变量建模的努力仍然直接在原始变量空间内操作。这种设计在语义对齐和关系表达性方面引入了基本限制。具体来说，原始空间分组混合缺乏对齐异构物理量的专用机制，而标准的非负注意力无法捕捉现实世界系统中普遍存在的复杂协同和拮抗交互。为了解决这些挑战，我们提出了Falcon-X，它将变量从原始空间中解耦，并将其映射到一个统一的潜在原型空间中。Falcon-X采用统一原型差分注意力机制，显式评估正负语义亲和度以对齐异构变量。然后通过潜在实体注意力在该共享空间内高效执行跨变量交互，自然地促进零样本结构迁移。最后，变量重组路由器通过请求-分发机制稳健地重建变量特定的轨迹。在GIFT-Eval和fev-bench基准上的广泛评估表明，Falcon-X实现了最先进的预测性能，为复杂的多变量环境提供了一种有原则且可扩展的范式。Falcon-X已公开发布以支持未来研究。

英文摘要

Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most existing TSFMs remain univariate, and recent efforts to enable cross-variate modeling still operate directly within the raw variate space. This design introduces fundamental limitations in semantic alignment and relational expressivity. Specifically, raw-space group mixing lacks a dedicated mechanism to align heterogeneous physical quantities, while standard non-negative attention fails to capture the complex synergistic and antagonistic interactions ubiquitous in real-world systems. To address these challenges, we propose Falcon-X, decouples variates from the raw space and maps them into a unified latent prototype space. Falcon-X employs a Unified Prototype Diff-Attention mechanism that explicitly evaluates both positive and negative semantic affinities to explicitly align heterogeneous variates. Cross-variate interactions are then efficiently performed within this shared space via Latent Entity Attention, naturally facilitating zero-shot structural transfer. Finally, a Variate Reassembly Router robustly reconstructs variate-specific trajectories via a request-and-dispatch mechanism. Extensive evaluations on the GIFT-Eval and fev-bench benchmarks demonstrate that Falcon-X achieves excellent forecasting performance, offering a principled and scalable paradigm for complex multivariate environments. Falcon-X is publicly released to support future research.

URL PDF HTML ☆

赞 0 踩 0

2605.25371 2026-06-10 cs.RO 版本更新

FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand

FOUND-IT: 基于基础模型优先、按需粒度的任务驱动3D场景图

Dominic Maggio, Nicolas Gorlo, Kris Hauser, Luca Carlone

发表机构 * Laboratory for Information & Decision Systems, Massachusetts Institute of Technology（信息与决策系统实验室，麻省理工学院）； Samsung Research America（三星美国研究院）

AI总结提出首个基于未标定单目相机实时构建任意室内外环境分层任务驱动3D场景图的方法，通过几何基础模型和可调整粒度支持动态任务，并在ASHiTA SG3D基准上提升79%准确率。

详情

AI中文摘要

我们提出了首个使用未标定单目相机实时构建任意室内外环境分层任务驱动3D场景图的方法。我们利用几何基础模型来估计场景图的几何属性（例如，物体边界框），但也观察到可通行性信息（场景图的“地点”层）可以通过向现有几何基础模型（如VGGT）添加额外头部直接重建。我们的方法是任务驱动的，即根据任务调整地图中物体和区域的粒度；例如，在操作任务中，我们的方法能够分辨炉子上的小旋钮，而在导航任务中则可以关注大物体（如整个炉子）。然而，与相关工作的重要区别在于，我们考虑了任务列表并非预定义固定，而是随着机器人运行而演变的现实情况。这自然允许处理复杂的移动操作任务，机器人可以在任务展开时动态调整其表示。我们将由此产生的方法称为FOUND-IT。FOUND-IT还包括一种代理方法来查询场景图中的信息。除了在ASHiTA SG3D任务定位基准上实现79%的更高准确率外，我们展示了FOUND-IT在Jetson Thor上实时运行于地面机器人。此外，为了突出我们方法的鲁棒性，我们演示了在YouTube上随意拍摄的房地产公寓游览中构建3D场景图。代码将在发表后提供。

英文摘要

We present the first approach to build hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments using an uncalibrated monocular camera in real-time. We leverage geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), but we also observe that traversability information (the "places" layer of a scene graph) can be directly reconstructed by adding an extra head to existing geometric foundation models, like VGGT. Our approach is task-driven in the sense that we adjust the granularity of the objects and regions in the map depending on the task; for instance, during a manipulation task, our approach is able to resolve small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). However, in a major departure from related work, we consider the realistic case where the list of tasks is not predefined and fixed, but evolves as the robot operates. This naturally allows dealing with complex loco-manipulation tasks, where the robot can dynamically adjust its representation as the task unfolds. We dub the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach to query information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate FOUND-IT runs in real-time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of our method, we demonstrate constructing 3D scene graphs on casually captured realtor apartment tours from YouTube. Code will be made available upon publication.

URL PDF HTML ☆

赞 0 踩 0

2512.16189 2026-06-10 cs.CL 版本更新

Mitigating hallucinations in healthcare LLMs with granular fact-checking and domain-specific adaptation

通过细粒度事实核查和领域特定适应减轻医疗保健大语言模型中的幻觉

Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam

发表机构 * Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory（应用人工智能与智能系统实验室）； Department of Computer Science and Engineering（计算机科学与工程系）； Department of Data Science and Artificial Intelligence（数据科学与人工智能系）； Department of Software Engineering（软件工程系）； Faculty of Science and Information Technology（科学与信息技术学院）； Faculty of Science and Technology（科学与技术学院）

AI总结提出一个独立于任何LLM的事实核查模块和领域特定的摘要模型，通过数值测试和细粒度逻辑检查减少幻觉，在MIMIC III数据集上微调并评估，取得了高精度和召回率。

Comments Published in Expert Systems with Applications

详情

DOI: 10.1016/j.eswa.2026.132966
Journal ref: Expert Systems with Applications, Vol. 329, 132966, 2026

AI中文摘要

在医疗保健领域，任何LLM生成的输出都必须可靠且准确，尤其是在涉及决策和患者安全的情况下。然而，由于LLM存在幻觉风险，这些关键领域的输出往往不可靠。为了解决这个问题，我们提出了一个独立于任何LLM的事实核查模块，以及一个旨在最小化幻觉率的领域特定摘要模型。我们的模型使用低秩适配（LoRa）在MIMIC III数据集上进行微调，并与事实核查模块配对，该模块通过自然语言处理中的离散逻辑，在细粒度级别使用数值测试进行正确性检查和逻辑检查，以验证电子健康记录（EHR）中的事实。我们在完整的MIMIC-III数据集上训练了LLM模型。为了评估事实核查模块，我们抽样了104篇摘要，将其提取为3,786个命题，并将这些命题作为事实。事实核查模块的精确率为0.8904，召回率为0.8234，F1分数为0.8556。此外，LLM摘要模型的摘要质量达到了ROUGE-1分数0.5797和BERTScore 0.9120。

英文摘要

In healthcare, it is essential for any Large Language Model (LLM)-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.

URL PDF HTML ☆

赞 0 踩 0

2605.24197 2026-06-10 cs.AI 版本更新

A Sober Look at Agentic Misalignment in Automated Workflows

对自动化工作流中智能体错位的冷静审视

Wenqian Ye, Bo Yuan, Zhichao Xu, Yijun Tian, Yawei Wang, Henry Kautz, Aidong Zhang

发表机构 * University of Virginia（弗吉尼亚大学）； Georgia Institute of Technology（佐治亚理工学院）； University of Utah（犹他大学）； University of Notre Dame（圣约翰大学）； George Washington University（乔治华盛顿大学）

AI总结本文研究多智能体系统中的新兴错位问题（称为智能体错位），通过贝叶斯框架分析其成因，并提出基于证据的智能体对齐范式（AEA）来纠正错位行为，从而提升协作可靠性。

详情

AI中文摘要

我们研究了一类在多智能体系统（MAS）中出现的新兴错位，重点关注自动化工作流，我们称之为智能体错位。尽管这些系统可以解决复杂任务，但它们常常因为智能体根据隐式代理效用行事而与预期的人类目标不一致而失败。我们正式定义了这些行为，并在贝叶斯框架内对其进行了分析，表明通用效用自然会导致自动化工作流中智能体的后验坍缩。为了解决这个问题，我们提出了智能体证据归因（AEA），一种新颖的对齐范式，它利用上下文特定的证据来改进智能体的后验。AEA 对智能体动作进行推理，并提供结构化证据以在协作过程中纠正错位行为。为了更好地理解证据的作用，我们研究了 AEA 的两种实例化：自我反思（来自模型的内部证据）和弱到强泛化（关于智能体轨迹的外部证据）。我们表明，一个小的证据模型通过提供正交的失败归因有效地对齐了 MAS。我们的结果澄清了自动化工作流中智能体错位的来源，并表明基于证据的对齐可以有效地改善智能体协作，并导致建立在自动化工作流上的可靠多智能体系统。

英文摘要

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and leads to reliable multi-agent systems built on automated workflows.

URL PDF HTML ☆

赞 0 踩 0

2605.22714 2026-06-10 cs.AI cs.CL cs.LG 版本更新

AMEL: Accumulated Message Effects on LLM Judgments

AMEL: 累积消息对LLM判断的影响

Sid-Ali Temkit

发表机构 * chut.app

AI总结研究LLM在对话中因历史消息极性而偏离基准判断的累积消息效应（AMEL），发现模型偏向历史主流极性，且负向历史偏见更强，但偏见不随上下文长度增长，简单修复是为每个项目使用新上下文。

Comments 24 pages, 14 figures, 8 tables. Single author. Code, data (84,088 deduplicated API responses), and analysis pipeline at https://github.com/chutapp/amel

详情

AI中文摘要

大型语言模型常被用作自动评估者：审查代码、审核内容或评分输出，通常许多项目通过一次对话处理。我们询问先前对话历史的极性是否会偏倚后续判断，我们将这种效应称为LLM判断的累积消息效应（AMEL）。通过对来自4个提供商（OpenAI、Anthropic、Google和四个开源模型）的11个模型进行75,898次API调用，我们在隔离或跟随以正面或负面评价为主的历史之后呈现相同的测试项目。模型倾向于对话的主流极性（d = -0.17, p < 10^-46）。该效应集中在模型在基线时真正不确定的项目上（高熵项目d = -0.34，而基线确定时d = -0.15）。偏见不随上下文长度增长：5个先前轮次和50个产生相同的偏移（Spearman |r| < 0.01；OLS斜率p = 0.80）。并且存在负性不对称：按项目配对，负面历史诱导的偏见是正面的1.62倍（t = 13.46, p < 10^-39, n = 2,481）。扩展规模有帮助但不能解决（Anthropic: Haiku -0.22到Opus -0.17；OpenAI: Nano -0.34到GPT-5.2 -0.17）。三项后续研究缩小了机制范围。令牌概率分布连续变化，而非在阈值处。负性不对称既有令牌级成分也有语义成分，尽管在我们的样本量下平衡归因是探索性的。位置不重要：在50轮历史中任何位置的五个有偏轮次产生相同的偏移。评估流程最简单的修复是为每个项目使用新上下文；当批处理不可避免时，平衡历史有帮助。

英文摘要

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 84,088 API calls to 12 models from 5 providers (OpenAI, Anthropic, Google, DeepSeek, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-53). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.36 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.52x more bias than positive (t = 13.03, p < 10^-36, n = 2,733). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

URL PDF HTML ☆

赞 0 踩 0

2601.14653 2026-06-10 cs.LG q-bio.GN 版本更新

Efficient Imputation for Patch-based Missing Single-cell Data via Cluster-regularized Optimal Transport

基于聚类正则化最优传输的块缺失单细胞数据高效插补

Yuyu Liu, Jiannan Yang, Ziyang Yu, Weishen Pan, Fei Wang, Tengfei Ma

发表机构 * Department of Computer Science, Stony Brook University（计算机科学系，石板溪大学）； Department of Computer Science, Emory University（计算机科学系，埃默里大学）； Department of Population Health Sciences, Cornell University（流行病学与公共卫生系，康奈尔大学）； Department of Computer Science and Engineering, Cornell University（计算机科学与工程系，康奈尔大学）； Department of Biomedical Informatics, Stony Brook University（生物医学信息学系，石板溪大学）

AI总结提出CROT算法，利用最优传输处理单细胞数据中的块缺失问题，实现高精度插补并显著降低运行时间。

Comments Accepted to ACM-BCB 2026

详情

AI中文摘要

单细胞测序数据集中的缺失数据对提取有意义的生物学见解构成了重大挑战。然而，现有的插补方法通常假设数据均匀且完整，难以处理存在大片缺失数据的情况。在本文中，我们提出了CROT（聚类正则化最优传输），一种基于最优传输的插补算法，旨在处理表格格式中的块缺失数据。我们的方法在存在显著缺失的情况下有效捕捉底层数据结构。值得注意的是，它在显著减少运行时间的同时实现了优越的插补精度，展示了其在大规模数据集上的可扩展性和效率。这项工作为具有结构化数据缺失的异质性高维数据集提供了一种鲁棒的插补解决方案，解决了生物学和临床数据分析中的关键挑战。我们的代码可在GitHub上获取，https://github.com/yuyuliu11037/CROT。

英文摘要

Missing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT (Cluster-Regularized Optimal Transport), an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available on GitHub, https://github.com/yuyuliu11037/CROT.

URL PDF HTML ☆

赞 0 踩 0

2502.07027 2026-06-10 cs.LG cs.AI 版本更新

Representational Alignment with Chemical Induced Fit for Molecular Relational Learning

基于化学诱导契合的表征对齐用于分子关系学习

Peiliang Zhang, Jingling Yuan, Qing Xie, Yongjun Zhu, Chao Che, Lin Li

发表机构 * Wuhan University of Technology（武汉理工大学）； Yonsei University（延世大学）； Hubei Key Laboratory of Transportation Internet of Things（湖北省交通运输物联网重点实验室）； Dalian University（大连大学）

AI总结提出ReAlignFit方法，通过引入化学诱导契合的归纳偏置动态对齐子结构表征，并利用子图信息瓶颈优化高化学功能兼容性的子结构对，以提升分子关系学习在化学空间偏移数据上的稳定性。

Comments Accepted by SIGKDD2026 AI for Science Track

详情

AI中文摘要

分子关系学习（MRL）广泛应用于自然科学中，通过提取结构特征来预测分子对之间的关系。子结构对之间的表征相似性决定了分子结合位点的功能兼容性。然而，通过注意力机制对齐子结构表征缺乏化学知识的指导，导致模型在化学空间（例如，官能团、骨架）偏移数据上的性能不稳定。在理论论证的基础上，我们提出了基于化学诱导契合的表征对齐（ReAlignFit）来增强MRL的稳定性。ReAlignFit通过引入基于化学诱导契合的归纳偏置，动态对齐MRL中的子结构表征。在诱导过程中，我们设计了基于子结构边重构的偏差校正函数，通过模拟化学构象变化（子结构的动态组合）来对齐子结构对之间的表征。ReAlignFit进一步在契合过程中集成子图信息瓶颈，以精炼和优化具有高化学功能兼容性的子结构对，并利用它们生成分子嵌入。在九个数据集上的实验结果表明，ReAlignFit在两个任务中均优于最先进的模型，并显著增强了模型在规则偏移和骨架偏移数据分布上的稳定性。

英文摘要

Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance in chemical space (\textit{e.g.}, functional group, scaffold) shifted data. With theoretical justification, we propose the \textbf{Re}presentational \textbf{Align}ment with Chemical Induced \textbf{Fit} (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings. Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances model's stability in both rule-shifted and scaffold-shifted data distributions.

URL PDF HTML ☆

赞 0 踩 0

2605.17254 2026-06-10 cs.AI 版本更新

CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

CatalyticMLLM: 一种用于催化材料的图-文本多模态大语言模型

Yanjie Li, Jian Xu, Xu-Yao Zhang, Shiming Xiang, Nian Ran, Weijun Li, Cheng-Lin Liu

发表机构 * AnnLab（安实验室）； Institute of Semiconductors, Chinese Academy of Sciences（中国科学院半导体研究所）； Zhongguancun Academy（中关村学院）； State Key Laboratory of Multimodal Artificial Intelligence Systems（多模态人工智能系统国家重点实验室）； State Key Laboratory of High Performance Ceramics（高性能陶瓷国家重点实验室）； University of Chinese Academy of Sciences（中国科学院大学）； School of Electronic, Electrical and Communication Engineering（电子电气与通信工程学院）； University of ChineseAcademy of Sciences（中国科学院大学）

AI总结本文提出了一种统一的图-文本多模态大语言模型QE-Catalytic-V2，用于催化材料的性质预测和逆向设计，通过共享的表示空间实现两者的联合建模，从而形成闭环优化流程。

Comments 71 page

详情

AI中文摘要

催化材料的性质预测和逆向结构设计通常被建模为两个独立的任务：前者根据给定的结构预测目标性质，而后者则根据期望的性质生成候选结构。尽管解耦范式便于实现“生成-评估-筛选”工作流，但生成模型和性质预测模型在表示空间和训练目标上的不一致可能会引入数据分布偏移和评估者偏见，从而限制闭环优化的稳定性。在本文中，我们提出QE-Catalytic-V2，这是一种用于催化材料的统一图-文本多模态大语言模型，它在同一个模型和共享的表示空间中整合了性质预测和逆向设计。在此统一框架下，QE-Catalytic-V2不仅能通过三维结构和文本信息进行可靠的性质预测，还能根据目标性质生成和筛选物理上可行的CIF候选结构，从而形成“逆向设计-预测-筛选-重新设计”的闭环优化工作流。实验结果表明，这种统一范式在催化放松能预测和逆向设计任务上均优于解耦基线，验证了在单一多模态模型中联合建模性质预测和结构生成的有效性。

英文摘要

Property prediction and inverse structural design of catalytic materials are typically modeled as two independent tasks: the former predicts target properties from given structures, whereas the latter generates candidate structures according to desired properties. Although the decoupled paradigm facilitates the implementation of a ``generation--evaluation--screening'' workflow, the inconsistency between the generative model and the property prediction model in terms of representation spaces and training objectives can readily introduce data distribution shifts and evaluator bias, thereby limiting the stability of closed-loop optimization. In this work, we propose CatalyticMLLM, a unified graph--text multimodal large language model for catalytic materials, which integrates property prediction and \textbf{inverse design} within the same model and shared representation space. Under this unified framework, CatalyticMLLM can not only perform reliable property prediction by leveraging three-dimensional structures and textual information, but also generate and screen physically feasible CIF candidates conditioned on target properties, thereby forming a closed-loop optimization workflow of ``inverse design--prediction--screening--redesign.'' Experimental results demonstrate that this unified paradigm outperforms decoupled baselines on both catalytic relaxed-energy prediction and inverse design tasks, validating the effectiveness of jointly modeling property prediction and structure generation within a single multimodal model.

URL PDF HTML ☆

赞 0 踩 0

2602.06886 2026-06-10 cs.CV 版本更新

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

提示重注入：缓解多模态扩散变换器中的提示遗忘

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本研究针对多模态扩散变换器中提示遗忘问题，提出一种无需训练的提示重注入方法，通过将早期层的提示表示重新注入到后期层，以提升指令遵循能力和生成质量。

Comments 19 pages

详情

AI中文摘要

多模态扩散变换器（MMDiTs）用于文本到图像生成时，维持文本和图像分支的分离，并在整个去噪过程中实现文本标记与视觉潜在变量之间的双向信息流。在此设置中，我们观察到一种提示遗忘现象：随着深度增加，文本分支中提示表示的语义逐渐被遗忘。我们进一步通过探测三个代表性MMDiTs--SD3、SD3.5和FLUX.1中文本分支各层表示的语言属性，验证了这种影响。受这些发现启发，我们引入了一种无需训练的方法，即提示重注入，通过将早期层的提示表示重新注入到后期层来缓解这种遗忘。在GenEval、DPG和T2I-CompBench++上的实验表明，这种方法在指令遵循能力方面有显著提升，并在捕获偏好、美学和整体文本-图像生成质量的指标上也有所改进。

英文摘要

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs--SD3, SD3.5, and FLUX.1 by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text--image generation quality.

URL PDF HTML ☆

赞 0 踩 0

2605.20347 2026-06-10 cs.LG stat.ML 版本更新

Symmetrization of Loss Functions for Robust Training of Neural Networks in the Presence of Noisy Labels

损失函数对称化以在存在噪声标签的情况下实现神经网络的鲁棒训练

Alexandre Lemire Paquin, Brahim Chaib-Draa, Philippe Giguère

发表机构 * Department of Computer Science and Software Engineering（计算机科学与软件工程系）

AI总结本文研究了通过将交叉熵损失对称化来设计鲁棒损失函数的方法，提出了一种多类对称损失函数，并展示了其在噪声标签下的有效性。

Comments 28 pages, 1 figure, 4 tables. v2: Added relevant prior-work citations and revised the related-work discussion and Section 5.2. Minor wording corrections

详情

AI中文摘要

训练集的标注通常是昂贵且易出错的，因此设计对噪声具有鲁棒性的损失函数是一个重要的问题。对称条件为这种噪声的鲁棒性提供了理论保证。在本文中，我们研究了一种源自任何多类损失函数唯一分解为对称部分和类无关项的对称化方法。特别是，对交叉熵损失进行对称化会导致多类线性扩展的unhinged损失。与二分类情况不同，多类版本必须具有特定的系数才能满足对称条件。在适当假设下，我们证明这种多类unhinged损失是唯一的凸多类对称损失。我们还证明它在局部上具有根本作用：任何对称损失在具有相等分量的分数向量处的线性近似等价于多类unhinged损失。然后我们引入了SGCE和alpha-MAE两种损失函数，它们在多类unhinged损失和均值绝对误差之间进行插值，同时允许控制损失的beta-平滑性。在标准的噪声标签基准上的实验表明，其性能与现有的鲁棒损失函数相比具有竞争力。

英文摘要

Labeling a training set is often expensive and susceptible to errors, making the design of robust loss functions for label noise an important problem. The symmetry condition provides theoretical guarantees for robustness to such noise. In this work, we study a symmetrization method arising from the unique decomposition of any multi-class loss function into a symmetric component and a class-insensitive term. In particular, symmetrizing the cross-entropy loss leads to a linear multi-class extension of the unhinged loss. Unlike in the binary case, the multi-class version must have specific coefficients in order to satisfy the symmetry condition. Under suitable assumptions, we show that this multi-class unhinged loss is the unique convex multi-class symmetric loss. We also show that it has a fundamental local role: the linear approximation of any symmetric loss around score vectors with equal components is equivalent to the multi-class unhinged loss. We then introduce SGCE and alpha-MAE, two loss functions that interpolate between the multi-class unhinged loss and the Mean Absolute Error while allowing control of the beta-smoothness of the loss. Experiments on standard noisy-label benchmarks show competitive performance compared with existing robust loss functions.

URL PDF HTML ☆

赞 0 踩 0

2605.20098 2026-06-10 cs.AI 版本更新

Neurosymbolic Learning for Inference-Time Argumentation

用于推理时间论证的神经符号学习

Gabriel Freedman, Adam Dejl, Adam Gould, Mansi, Lihu Chen, Junqi Jiang, Francesca Toni

发表机构 * Department of Computing, Imperial College London（伦敦帝国理工学院计算机系）

AI总结本文提出了一种用于三元主张验证的可训练神经符号框架，通过在训练和推理过程中结合形式论证语义来指导大语言模型生成论证并分配基础分数，从而提高三元预测的准确性。

Comments Under review

详情

AI中文摘要

主张验证是高风险领域（如健康和金融）中的重要问题。当支撑主张的信息不完整或冲突时，不确定答案可能比二元的真或假分类更合适。在所有情况下，对最终裁决所依据的考虑因素的忠实解释都是至关重要的。我们引入了推理时间论证（ITA），一种可训练的神经符号框架，用于三元主张验证。其中，一种形式论证语义用于给定主张的强度，既（i）指导LLM训练，使模型在学习生成论证并分配基础分数（代表内在强度）时，又（ii）用于从生成的、评分的论证中计算三元（真/假/不确定）预测。因此，在训练时，论证生成和评分可以根据诱导的论证预测质量进行优化。此外，在推理时，最终预测是忠实于决定裁决的论证和分数的，而不是像传统推理模型那样由可能不忠实的后验推理轨迹所证明。最后，我们证明，在两个三元主张验证数据集上，ITA在论证基线之上表现更好，并能与非论证直接预测基线竞争，同时提供由显式、可检查的论证结构确定的裁决。

英文摘要

Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is incomplete or conflicting, uncertain answers may be more appropriate than binary true or false classifications. In all cases, faithful explanations of the considerations determining the final verdict are crucial. We introduce inference-time argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification in which a formal argumentation semantics giving the strength of claims is used both (i) to guide LLM training as models learn to generate arguments and assign them base scores (representing intrinsic strengths) and (ii) to compute ternary (true/false/uncertain) predictions from generated, scored arguments. As a result, at training time, argument generation and scoring can be optimised according to the quality of the induced argumentative predictions. Moreover, at inference time, the final prediction is faithful, by construction, to the arguments and scores determining the verdict, rather than being justified by a potentially unfaithful post-hoc reasoning trace as in conventional reasoning models. We finally show that, on two datasets for ternary claim verification, ITA improves upon argumentative baselines and can perform competitively against non-argumentative direct-prediction baselines, while providing verdicts that are computed deterministically from explicit, inspectable argumentative structures.

URL PDF HTML ☆

赞 0 踩 0

2605.18271 2026-06-10 cs.CL cs.AI cs.IR cs.LG 版本更新

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

从体积到价值：面向设备端RAG的偏好对齐记忆构建

Changmin Lee, Jaemin Kim, Taesik Gong

发表机构 * Department of Computer Science and Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea（计算机科学与工程系，全州国立科学与技术研究所（UNIST），全州，韩国）

AI总结本文提出EPIC方法，通过将用户偏好作为紧凑且稳定的个人上下文形式，整合到RAG流程中，以在有限内存下提高检索与用户偏好的对齐度，从而减少内存使用并提升准确性。

Comments Accepted to ICML 2026. Code and data are available at https://github.com/UbiquitousAILab/EPIC

详情

AI中文摘要

随着基于大型语言模型（LLMs）的个人AI代理的迅速发展，将其部署到设备上已成为隐私和响应性的重要需求。为了处理现实世界请求中固有的个人和上下文依赖性，这些代理必须基于设备上存储的个人上下文进行生成。然而，在内存预算紧张的情况下，核心瓶颈是存储什么内容以确保检索与用户保持一致。我们提出EPIC（高效偏好对齐索引构建），专注于用户偏好作为紧凑且稳定的个人上下文形式，并在整个RAG流程中整合它们。EPIC会选择性地保留与偏好相关的信息，并将检索对准偏好对齐的上下文。在四个涵盖对话、辩论、解释和推荐的基准测试中，EPIC将索引内存减少了2,404倍，提高了偏好遵循的准确性20.17个百分点，并在最佳表现基线之上实现了33.33倍更低的检索延迟。在我们的设备端实验中，EPIC在29.35毫秒/查询的流式更新下保持内存占用低于1 MB。

英文摘要

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 18.79 %p, and achieves 32.17 times lower retrieval latency over the best-performing baseline. In on-device experiments, EPIC maintains under 1 MB memory and achieves 5.21 to 29.35 ms/query latency across three platforms, while supporting streaming updates under preference drift. Our code and data are available at https://github.com/UbiquitousAILab/EPIC.

URL PDF HTML ☆

赞 0 踩 0

2605.04254 2026-06-10 cs.LG cs.HC 版本更新

Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

层次化支持向量状态划分用于黑箱强化学习策略的蒸馏

Senne Deproost, Mehrdad Asadi, Ann Nowé

发表机构 * Vrije Universiteit Brussel（布鲁塞尔自由大学）

AI总结本文提出SVSP方法，通过线性支持向量机划分状态动作对数据集，构建紧凑结构化表示，提升平均回报7.4%并减少子策略数量，推动更灵活的蒸馏方法。

Comments Accepted for poster presentation at HHAI 2026

2605.12804 2026-06-10 cs.RO 版本更新

BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots

BiPneu：用于软体机器人的双极气压气动系统的设计与控制

Yu Mei, Xinyu Zhou, Vedant Naik, Alan Gao, Xiaobo Tan

发表机构 * Department of Electrical and Computer Engineering, Michigan State University（电气与计算机工程系，密歇根州立大学）

AI总结提出一种可扩展、高性价比的多通道双极气压气动系统BiPneu，并设计基于混合电-气动模型的双模式滑模控制器（DM-SMC），实现宽范围、精确、快速的压力调节，在软体机器人应用中显著优于MPC和PID控制器。

Comments Full Version of BiPenu, including the supplementary materials

详情

DOI: 10.1109/TMECH.2026.3693622
Journal ref: IEEE/ASME Transactions on Mechatronics, 2026

AI中文摘要

正负压力调节对于软体机器人执行器至关重要，可实现大运动范围和多种驱动模式。然而，由于不对称的充放气动力学、阀门非线性以及切换引起的流量扰动，在两种压力极性下实现高性能调节仍然具有挑战性。本文提出BiPneu，一种可扩展且经济高效的多通道双极气压气动系统，用于软体机器人，能够实现宽范围、精确和快速的压力调节，同时与高级软件生态系统无缝兼容。基于混合电-气动模型，提出了一种带有滞后监督模式选择的双模式滑模控制器（DM-SMC）。广泛的仿真和实验表明，与先进模型预测控制器和良好调谐的PID控制器相比，DM-SMC在跟踪阶跃和正弦压力参考方面具有优越性能。实验结果显示，多步测试中平均绝对误差为1.44 kPa，正弦跟踪中为4.23 kPa，相对于PID控制分别降低了11.9%和35.6%，同时改善了控制力度、阀门切换速率和瞬态响应。DM-SMC的鲁棒性在具有压力依赖体积的波纹管执行器上得到进一步验证。最后，通过两个软体机器人示例——使用软体并联执行器快速控球和基于实时有限元方法（FEM）的软体波纹管执行器遥操作——展示了BiPneu的能力。

英文摘要

Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, achieving high-performance regulation across both pressure polarities remains challenging due to asymmetric inflation-deflation dynamics, valve nonlinearities, and switching-induced flow disturbances. This paper presents BiPneu, a scalable and cost-efficient multi-channel bipolar-pressure pneumatic system for soft robots that enables wide-range, accurate, and responsive pressure regulation while providing seamless compatibility with high-level software ecosystems. A dual-mode sliding-mode controller (DM-SMC) with hysteresis-supervised mode selection is proposed based on a hybrid electro-pneumatic model. Extensive simulation and experiments demonstrate the superior performance of DM-SMC in tracking step and sinusoidal pressure references compared with both advanced model predictive controllers and well-tuned PID controllers. Experimental results show average absolute errors of 1.44 kPa in multi-step tests and 4.23 kPa in sinusoidal tracking, corresponding to reductions of 11.9% and 35.6% relative to PID control, along with improved control effort, valve switching rate, and transient response. Robustness of DM-SMC is further verified on a bellow actuator with pressure-dependent volume. Finally, BiPneu's capability is demonstrated via two soft robotic examples, quick ball-maneuvering with a soft parallel manipulator and real-time finite element method (FEM)-based teleoperation of a soft bellows actuator.

URL PDF HTML ☆

赞 0 踩 0

2605.11458 2026-06-10 cs.AI cs.CL cs.LO 版本更新

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

自适应教师暴露用于大语言模型推理中的自蒸馏

Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

发表机构 * ByteDance Douyin（字节跳动抖音）

AI总结针对自蒸馏中教师暴露完整推理导致学生难以吸收的问题，提出自适应教师暴露方法ATESD，通过轻量Beta策略控制器动态调整暴露比例，并用折扣学习进步奖励优化，在多个模型和数据集上提升推理性能。

Comments 11 pages, 4 figures; code not released yet

详情

AI中文摘要

同策略自蒸馏已成为大语言模型推理的一种强大方法，其中特权教师基于参考解决方案监督学生自身的轨迹。然而，几乎所有此类方法共享的一个设计选择却未被质疑：教师总是看到完整的参考推理。我们认为这一默认设置本身就是问题的一部分，并识别出教师侧暴露不匹配：当教师基于远超学生当前能力的推理进行条件化时，产生的词元目标变得过于强大而难以吸收。一个受控的固定暴露扫描在两个层面上明确了这一点：1）完全暴露并非可靠的最佳选择，2）随着教师看到更多特权推理，学生-教师不匹配单调增长。这促使我们将教师暴露视为一个可学习的训练时控制变量，而非固定的超参数。因此，我们提出了自适应教师暴露用于自蒸馏（ATESD）。ATESD使用一个轻量级的Beta策略控制器对暴露比例进行建模，该控制器以紧凑的训练状态统计为条件，并在学生更新的一个短保持窗口内使用一个采样的暴露。为了使该暴露控制器可学习，我们使用折扣学习进步奖励对其进行优化，该奖励根据每个保留决策对学生未来改进的影响（而非其即时损失变化）进行评分，从而解决了同策略蒸馏导致的延迟信用分配问题。在AIME 24、AIME 25和HMMT 25上，使用Qwen3-{1.7B, 4B, 8B}的实验表明，ATESD持续优于竞争性的自蒸馏和强化学习基线，相比OPSD分别提高了+0.95、+2.05和+2.33个Average@12点，将自适应教师暴露确立为推理自蒸馏的一个有效新方向。

英文摘要

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.

URL PDF HTML ☆

赞 0 踩 0

2605.08565 2026-06-10 cs.LG 版本更新

Finer is Better (with the Right Scaling)

更细粒度更好（配合正确的缩放）

Clemens Schaefer, Gil Tabak

发表机构 * Google LLC（谷歌公司）； Mountain View, Ca（山景城，加利福尼亚）

AI总结本文研究大语言模型低精度量化中的块大小悖论，发现细粒度块在正确缩放策略下能降低量化误差，并通过4-over-6方法等解决退化问题。

详情

AI中文摘要

微缩放是保持量化到超低精度格式的大语言模型质量的关键技术。直观上，更细的块大小应产生更低的量化误差；然而，Fasoli等人（2026）最近发现的一个悖论表明，标准abs-max缩放实际上会随着块大小减小而导致模型质量下降。在这项工作中，我们研究了这一现象背后的机制。我们证明，这种退化并非细粒度的固有限制，而主要是由于较小块中的元素在统计上更接近其局部块最大值，与用作缩放因子的粗糙非规格化E4M3值交互不良所致。具体来说，我们表明：i）防止缩放因子下溢为零可减轻极端下溢引起的误差；ii）针对性的算法干预（如4-over-6方法）为缩放因子的选择提供更多灵活性，从而解决了较大值的悖论；iii）暴力搜索建立了最优基线，证实理论均方误差随块大小细化而严格改善。最终，我们的发现突显了硬件-软件协同设计的关键见解：块大小悖论部分源于天真的缩放选择。虽然使用分层缩放因子或更宽的格式（如UE5M3）可互换地解决大部分质量损失，但我们发现4-over-6缩放选择启发式方法可以进一步改善质量，特别是对于非常小的块大小。因此，最大化下一代ML加速器的性能需要将硅片格式规范和软件缩放算法视为紧密耦合的设计选择。

英文摘要

Microscaling is a critical technique for preserving the quality of Large Language Models (LLMs) quantized to ultra-low precision formats. Intuitively, finer block sizes should yield lower quantization error; however, a paradox recently identified by Fasoli et al. (2026) demonstrates that standard abs-max scaling can actually result in degraded model quality as block sizes shrink. In this work, we investigate the underlying mechanics of this phenomenon. We demonstrate that this degradation is not an inherent limitation of finer granularity, but is primarily driven by how elements in smaller blocks statistically cluster closer to their local block maximum, interacting poorly with the coarse subnormal E4M3 values used as scaling factors. Specifically, we show that i) preventing the scaling factor from underflowing to zero mitigates errors caused by extreme underflow, ii) targeted algorithmic interventions like the 4-over-6 methodology that give more flexibility to the choice of scaling factor resolve the paradox for larger values, and iii) a brute-force search establishes an optimal baseline, confirming that the theoretical Mean Squared Error (MSE) strictly improves with finer block sizes. Ultimately, our findings highlight a critical insight for hardware-software co-design: the block-size paradox is partially an artifact of naive scale selection. While using hierarchical scaling factors or wider formats like UE5M3 interchangeably resolves much of the quality loss, we found the 4-over-6 scale selection heuristic can even further improve quality, especially for very small block sizes. Consequently, maximizing the performance of next-generation ML accelerators will require treating silicon format specifications and software scaling algorithms as tightly coupled design choices.

URL PDF HTML ☆

赞 0 踩 0

2603.14463 2026-06-10 cs.CL 版本更新

An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

一个工业级保险大语言模型，实现可验证的领域掌握与幻觉控制，无能力权衡

Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin

发表机构 * Ant Group（蚂蚁集团）

AI总结提出INS-S1保险专用大语言模型，通过可验证数据合成系统和渐进式SFT-RL课程框架，在领域任务上达到SOTA，同时保持通用能力并实现0.6%的低幻觉率。

Comments 21 pages, 12 figures, 17 tables

详情

Journal ref: ICLR 2026 Workshop Advances in Financial AI

AI中文摘要

将大语言模型（LLM）适应到保险等高风险垂直领域面临重大挑战：场景要求严格遵守复杂法规和业务逻辑，对幻觉零容忍。现有方法常遭受能力权衡——牺牲通用智能换取领域专长——或过度依赖RAG而缺乏内在推理。为弥合这一差距，我们提出了INS-S1，一个通过新颖的端到端对齐范式训练的保险专用LLM系列。我们的方法包含两项方法论创新：（1）可验证数据合成系统，构建用于精算推理和合规的分层数据集；（2）渐进式SFT-RL课程框架，将动态数据退火与验证推理（RLVR）和AI反馈（RLAIF）的协同混合相结合。通过优化数据比例和奖励信号，该框架强制执行领域约束，同时防止灾难性遗忘。此外，我们发布了INSEva，迄今为止最全面的保险基准（39k+样本）。大量实验表明，INS-S1在领域任务上达到SOTA，显著优于DeepSeek-R1和Gemini-2.5-Pro。关键的是，它保持了顶级的通用能力，并实现了创纪录的0.6%幻觉率（HHEM）。我们的结果表明，严格领域专业化可以在不牺牲通用智能的情况下实现。

英文摘要

Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

URL PDF HTML ☆

赞 0 踩 0

2605.08171 2026-06-10 cs.LG cs.AI 版本更新

Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count

通信动力学神经网络：用于改进Hessian条件数并减少参数数量的FFT对角化层

Lurong Pan

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出CDLinear块循环线性层，通过FFT对角化Hessian矩阵，在参数减少3.8倍下保持高精度，并给出显式条件数分析。

Comments 17 pages, 5 figures. Includes NumPy implementation, gradient checks, MNIST experiments, and reference PyTorch CD-Transformer implementation

详情

AI中文摘要

通信动力学神经网络（CDNNs）将通信动力学框架的循环谱机制应用于神经网络层设计。我们引入了CDLinear，一种块大小为B=2l+1的块循环线性层，其参数数量仅为具有相同输入输出维度的稠密层的1/B。该构造为优化提供了显式的傅里叶域诊断：对于均方损失，权重Hessian矩阵通过离散傅里叶变换对角化，特征值直接由输入块的傅里叶谱决定。在输入预白化下，总体Hessian条件数恰好为1，经验条件数对于N个样本被界定为1+O(sqrt(B/N))。我们使用纯NumPy实现CDLinear，并手工推导反向传播，通过有限差分验证梯度。在8x8 MNIST数字基准上，三个随机种子下，使用B=4的CDLinear MLP以2,380个参数达到97.50%±0.23%的测试准确率，而使用8,970个参数的稠密基线为98.15%±0.47%。这实现了3.8倍的参数减少，准确率损失0.65%。CD-MLP的平均Hessian条件数为1.9e4，比稠密基线的5.9e6小约310倍。我们将CDLinear定位为结构化矩阵神经网络层的一个特例，主要贡献包括闭式Hessian谱诊断、原则性的块多重性离散序列以及显式的条件数分析。我们还发布了一个参考PyTorch实现，将CDLinear集成到DeepSeek-V3风格的混合专家Transformer中，用于未来大规模基准测试。

英文摘要

Communication Dynamics Neural Networks (CDNNs) apply the circulant-spectral machinery of the Communication Dynamics framework to neural-network layer design. We introduce CDLinear, a block-circulant linear layer with block size B = 2l + 1 that uses 1/B the parameters of a dense layer with the same input and output dimensions. The construction gives an explicit Fourier-domain diagnostic for optimization: for mean-squared loss, the weight Hessian is diagonalized by the discrete Fourier transform, with eigenvalues determined directly by the Fourier spectrum of the input blocks. Under input pre-whitening, the population Hessian condition number is exactly 1, and the empirical condition number is bounded by 1 + O(sqrt(B/N)) for N samples. We implement CDLinear in pure NumPy with hand-derived backward passes and verify gradients by finite differences. On the 8x8 MNIST digits benchmark, across three random seeds, a CDLinear MLP with B = 4 reaches 97.50% +/- 0.23% test accuracy using 2,380 parameters, compared with 98.15% +/- 0.47% for a dense baseline using 8,970 parameters. This gives a 3.8x parameter reduction at a 0.65% accuracy cost. The CD-MLP's mean Hessian condition number is 1.9e4, about 310x smaller than the dense baseline's 5.9e6. We position CDLinear as a special case of structured matrix neural-network layers, with the main contributions being a closed-form Hessian-spectrum diagnostic, a principled discrete sequence of block multiplicities, and an explicit conditioning analysis. We also release a reference PyTorch implementation integrating CDLinear into a DeepSeek-V3-style mixture-of-experts transformer for future large-scale benchmarks.

URL PDF HTML ☆

赞 0 踩 0