arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17242 2026-06-17 cs.CV New submission

Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples

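Thainara Lima, Vitor Martins

Affiliations: Department of Agricultural & Biological Engineering, Mississippi State University

AI summary: Presents the first vision-transformer-based method for mapping coastal algal blooms from 30-m Landsat-Sentinel-2 imagery. Using a globally distributed dataset and a comparison of several architectures, it shows that the Swin Transformer outperforms traditional spectral-index methods under cloud and glint conditions and can detect fragmented blooms.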

Abstract

Coastal algal bloom monitoring requires frequent, spatially detailed, and globally consistent observations, provided by Landsat-8/9 and Sentinel-2 A/B/C. Together, these missions offer over a decade of medium-resolution multispectral imagery with near-global coverage every 2-3 days, enabling the detection of fragmented bloom structures not resolvable by coarse ocean-color sensors. However, their use in aquatic environments remains challenging due to limited spectral coverage and a lack of harmonized reflectance products. As an alternative to traditional bio-optical methods, deep learning-based image classification offers a data-driven approach that can overcome many of these limitations. This study presents the first successful implementation of vision transformer-based coastal algal bloom mapping using 30-m Landsat-Sentinel-2 images. A globally distributed bloom patch dataset was generated across bloom-prone coastal hotspots worldwide. Four transformer-based architectures were compared against a standard convolutional baseline for fine-scale bloom detection, and assessed under different optical water types and atmospheric and surface conditions. All deep learning models showed strong capabilities in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint stress in a time series, the Swin Transformer outperformed traditional spectral-index approaches, which produced widespread false positives, effectively avoiding cloud- and glint-affected pixels. Comparisons with MODIS-derived products further highlighted the benefits of higher spatial resolution in detecting fragmented and irregularly affected blooms. Our findings support deep learning as a reliable tool for medium-resolution, consistent monitoring of floating algal blooms in dynamic coastal environments.
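
The omission and commission errors quoted above are standard map-accuracy measures. A minimal sketch of how they are usually computed from a predicted and a reference bloom mask follows; the function name and the toy masks are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def omission_commission(pred, truth):
    """Return (omission, commission) error rates for binary bloom masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # bloom pixels correctly detected
    fn = np.sum(~pred & truth)   # bloom pixels missed
    fp = np.sum(pred & ~truth)   # non-bloom pixels flagged as bloom
    omission = fn / (tp + fn) if (tp + fn) else 0.0
    commission = fp / (tp + fp) if (tp + fp) else 0.0
    return float(omission), float(commission)

# Toy 2x3 masks (hypothetical data): 1 = bloom, 0 = no bloom.
truth = np.array([[1, 1, 0], [0, 1, 0]])
pred = np.array([[1, 0, 0], [1, 1, 0]])
om, co = omission_commission(pred, truth)
```

Omission is the share of reference bloom pixels the model missed; commission is the share of predicted bloom pixels that are not bloom.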

2606.17241 2026-06-17 cs.CV cs.RO cs.SY eess.SY New submission

Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception

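Aditya Mishra, Haroon Lone

Affiliations: Indian Institute of Science Education and Research Bhopal

AI summary: Addresses the performance degradation of edge inference under sustained operation with Edge-TSR, a system that integrates detection, tracking, and a lightweight temporal stabilization mechanism. It runs real-time roadside perception on the NVIDIA Jetson Orin Nano and recovers up to 10.16% classification accuracy.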

Abstract

Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
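
The abstract does not describe how the track-aware temporal stabilization works. One plausible form is a sliding-window majority vote over the labels assigned to each track, sketched below; the class name, the window size, and the example labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict, deque

class TrackLabelSmoother:
    """Majority vote over the last `window` per-frame labels of each track."""

    def __init__(self, window=5):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id, label):
        """Record the new per-frame label and return the stabilized label."""
        self.history[track_id].append(label)
        counts = Counter(self.history[track_id])
        best = max(counts.values())
        # Break ties in favor of the most recently seen label.
        for lab in reversed(self.history[track_id]):
            if counts[lab] == best:
                return lab

smoother = TrackLabelSmoother(window=5)
# Hypothetical per-frame classifier outputs for track 7, with one flicker.
frames = ["stop", "stop", "yield", "stop", "stop"]
stable = [smoother.update(7, lab) for lab in frames]
```

The single-frame flicker to "yield" is outvoted by its neighbours, which is the kind of consistency gain a per-track filter can provide at almost no compute cost.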

2606.17234 2026-06-17 cs.CL New submission

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

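Ali Marashian, Alexis Palmer, Katharina von der Wense

Affiliations: University of Colorado Boulder; Johannes Gutenberg University Mainz

AI summary: Designs five verbalized methods that extract an LLM's per-token confidence without access to internal signals and compares them with the model's internal certainty signals. The two kinds of method perform similarly on fine-grained error detection and calibration, yet show little correlation with each other.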

Abstract

The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.
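
The abstract does not say which calibration metric is used. Expected calibration error (ECE) is a common choice, and the sketch below shows how it could be computed from per-token confidences and correctness labels; the function name, bin count, and toy numbers are assumptions.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so a confidence of 1.0 lands in the last bin.
    idx = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Hypothetical per-token confidences and whether each token was correct.
conf = [0.95, 0.95, 0.65, 0.65]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(conf, correct)
```

A perfectly calibrated model scores 0; here the high-confidence bin is overconfident and the lower bin underconfident, and both gaps count toward the total.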

2606.17233 2026-06-17 cs.LG stat.ML New submission

Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning

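Qitian Lu, Jafar Jafari-Asl, Panagiotis Spyridis, Lukas Novak

Affiliations: Brno University of Technology; University of Rostock

AI summary: For multi-output engineering problems, where a single experimental design struggles to approximate every output quantity accurately at once, proposes an adaptive sequential sampling method that builds polynomial chaos expansion surrogates by balancing input-space exploration against variance information aggregated across outputs. Numerical experiments show improved surrogate accuracy and stability.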

Abstract

In many engineering applications, a single high-fidelity model produces multiple quantities of interest (QoIs) under the same input parameters, e.g. finite element models of complex physical systems. To alleviate the high computational cost of direct model evaluations, surrogate models are widely used to construct efficient approximations of model responses. Naturally, the accuracy of surrogates strongly depends on the quality of the experimental design (ED). However, a single ED may not provide an adequate representation for all outputs simultaneously, especially when different outputs exhibit varying sensitivities to the input variables. A straightforward solution is to perform separate sampling for each output, but this results in increased sampling complexity and computational cost. From a statistical perspective, such an approach also ignores potential correlations among all outputs and may compromise data consistency. To address this issue, an adaptive sequential sampling method for constructing polynomial chaos expansion surrogate models is generalized to vector-valued QoIs. The method sequentially selects new samples from a candidate pool based on their local contribution to the output variance, while balancing distance-based exploration of the input space and exploitation of aggregated variance information across all outputs. Its performance is compared with non-sequential Latin Hypercube Sampling through several numerical examples from engineering problems. Numerical results demonstrate that the proposed strategy improves both surrogate accuracy and stability, and provides a more reliable estimation of second-order statistics.

2606.17229 2026-06-17 cs.LG cs.AI cs.CL New submission

Rift: A Conflict Signature for Deception in Language Models

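Petr Nyoma

Affiliations: Harmonic Labs

AI summary: By contrasting knowing deception with ignorant error, finds that deceptive forward passes carry a conflict signature of elevated residual rank, which identifies the lie with 100% accuracy without labels and transfers across models, languages, and architectures.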

Comments 13 pages, 4 figures. Code and experiment logs: https://github.com/Omibranch/Rift

Abstract

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

2606.17222 2026-06-17 cs.CV New submission

Quantum Enhanced Multi-Scale CNN with Bi-directional Mamba for Crop Field Analysis

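Mohammad Salman Khan, Ehsan Atoofian, Saad B. Ahmed

Affiliations: Lakehead University

AI summary: Proposes a BiSpectral Mamba framework that combines a multi-scale CNN, spectral attention, bidirectional state-space modeling, and quantum-inspired learning to address high dimensionality, class imbalance, and related problems in hyperspectral image classification, reaching 84.83% overall accuracy on the UAVHSI-Crop dataset.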

Abstract

Hyperspectral image (HSI) crop analysis is essential for precision agriculture because it captures rich spectral and spatial information for accurate crop monitoring and assessment. However, HSI classification remains challenging due to high spectral dimensionality, spatial complexity, class imbalance, and limited labeled samples. To address these challenges, this paper proposes a BiSpectral Mamba-based framework that combines multi-scale convolutional feature extraction, spectral attention, bidirectional state-space modeling, and quantum-inspired learning. A multi-scale CNN backbone first extracts hierarchical spatial-spectral representations through feature fusion across multiple resolutions. A spectral attention mechanism then emphasizes informative bands while suppressing redundant and noisy channels. The refined features are processed by a BiSpectral Mamba module that captures long-range dependencies in both forward and backward directions by modeling hyperspectral feature maps as sequential tokens. In addition, class-weighted optimization and feature fusion strategies are incorporated to improve training stability and mitigate class imbalance. Experimental evaluation on the UAVHSI-Crop dataset demonstrates the effectiveness of the proposed framework, achieving an overall accuracy of 84.83%. The results show that integrating convolutional, attention-based, and state-space modeling components enables robust spatial-spectral feature learning for crop classification. The proposed framework also shows potential for broader agricultural and remote sensing applications, including crop disease detection, yield prediction, and soil moisture estimation, while highlighting the effectiveness of structured state-space and quantum-inspired architectures for hyperspectral image analysis.

2606.17220 2026-06-17 cs.AI New submission

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng

Affiliations: Center of Information Research, AMS; Discipline and Technology Research Center for Large Model Intelligence Applications; Hebei University of Engineering

AI summary: Proposes a self-evolving framework in which an LLM agent automatically generates and refines query-rewriting rules, improving BM25 on legal case retrieval without any parameter training.

Comments To appear in ACL 2026

Abstract

Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.
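
To make the idea of rule-driven query rewriting on top of BM25 concrete, here is a minimal sketch. It implements the common BM25 scoring formula from scratch and applies one hand-written rewriting rule; the rule, the filler-word list, and the toy documents are hypothetical, whereas in the paper the rules are created and pruned by an LLM agent.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each whitespace-tokenized document against the query with BM25."""
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(w for t in toks for w in set(t))  # document frequency
    scores = []
    for t in toks:
        tf = Counter(t)
        norm = k1 * (1 - b + b * len(t) / avgdl)   # length normalization
        score = 0.0
        for w in query.lower().split():
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            score += idf * tf[w] * (k1 + 1) / (tf[w] + norm)
        scores.append(score)
    return scores

# Hypothetical rewriting rule: drop words that carry no legal meaning.
FILLERS = {"the", "a", "case", "about", "of"}

def drop_fillers(query):
    return " ".join(w for w in query.lower().split() if w not in FILLERS)

docs = [
    "contract breach damages awarded to the plaintiff",
    "the case of vehicle theft",
]
query = "the case about contract breach"
raw = bm25_scores(query, docs)
rewritten = bm25_scores(drop_fillers(query), docs)
```

With the raw query the irrelevant second document still earns a score from "the" and "case"; after rewriting it scores zero. An evaluation loop could keep or discard such a rule depending on whether a retrieval metric improves.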

2606.17215 2026-06-17 cs.LG cs.DS stat.ML New submission

Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization

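Xiaoyu Li

Affiliations: Xiaoyu Li

AI summary: Uses the Christoffel function to characterize exactly the outlier mass that a bounded-degree certificate cannot remove, revealing a fundamental tradeoff between the certificate's SoS degree and outlier tolerance when the reweighted-hinge method learns γ-margin halfspaces under malicious noise.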

Abstract

A certificate that removes outliers sees the data only through its low-degree moments, and an adversary exploits exactly this, hiding corruption where the clean data already looks typical, in the blind spot no bounded-degree test resolves. That blind spot turns out to have an exact size: the Christoffel function of the clean marginal, the very quantity modern data analysis thresholds to detect outliers, here read from the adversary's side as the corruption a bounded-degree certificate cannot remove. We turn this inversion into the organizing principle of the reweighted-hinge approach to robustly learning $γ$-margin halfspaces under malicious noise (Shen, 2025; Zeng and Shen, 2025): the governing resource is the Sum-of-Squares degree of the outlier-removal certificate, and the resolution principle states that the maximal corruption mass which can hide at a center $c$ from a degree-$2t$ certificate is exactly the Christoffel function $λ_{t+1}(c)$ of the clean marginal. Three consequences follow, all against the certificate method (not information-theoretic). A margin-degree tradeoff: certifying the dense pancake to error $ε$ costs SoS degree $Ω(\log(1/ε))$ or margin $Ω(\sqrt{\log(1/ε)}/\sqrt{d})$, explaining why the $\log(1/ε)$ margin Shen (2025) records is forced, with a weighted-Chebyshev reduction making the threshold $2t=Θ((|c|/s)^2)$ tight modulo one classical weighted-extremal estimate. A degree-$2$ outlier barrier: the resolution principle realized as an explicit instance on which degree $2$ is stuck at $η^{1/2}$ while degree $4$ escapes, locating the method's small breakdown rate in the degree, not the analysis. And a degree-$2t$ algorithm tracing the frontier $η^{1-1/2t}$ (recovering Shen (2025) at $t=1$), whose gain is an explicit constant, capped by the pancake density and shown unimprovable by the degree-$2$ barrier.
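
For reference, the Christoffel function has a standard closed form in terms of the moment matrix. The block below states the textbook definition for a measure $\mu$ on $\mathbb{R}^d$ whose moment matrix is positive definite. The symbols $v_t$, $M_t$, and $\Lambda_t$ are notation introduced here for illustration, and the abstract's index $t+1$ may follow a different indexing convention.

```latex
% v_t(x): vector of all monomials in x of degree at most t
% M_t:    moment matrix of \mu up to degree 2t
M_t = \int v_t(x)\, v_t(x)^{\top} \, d\mu(x), \qquad
\Lambda_t(c) = \min_{\deg p \le t,\; p(c) = 1} \int p(x)^2 \, d\mu(x)
             = \frac{1}{v_t(c)^{\top} M_t^{-1}\, v_t(c)} .

% A point mass \varepsilon\,\delta_c is dominated by \mu on every square of a
% polynomial of degree at most t exactly when \varepsilon \le \Lambda_t(c):
\varepsilon\, p(c)^2 \le \int p^2 \, d\mu \quad \forall\, p,\ \deg p \le t
\quad\Longleftrightarrow\quad \varepsilon \le \Lambda_t(c).
```

The equivalence follows by normalizing any $p$ with $p(c) \neq 0$ so that $p(c) = 1$. Read informally, and not as the paper's proof, it suggests why mass below this level at $c$ is hard to separate from clean data using squared low-degree tests.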

2606.17213 2026-06-17 cs.CL cs.CV New submission

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

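Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha, Gorkem Durak, Ulas Bagci

Affiliations: Northwestern University; University of South Dakota; Aalto University

AI summary: Proposes RAD3D-Prefix, a lightweight diagnostic-prior framework that keeps the LLM frozen and combines image embeddings with multi-label classification logits. With few trainable parameters it generates 3D CT reports, outperforms comparable parameter-efficient baselines, and generalizes well out of domain.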

Abstract

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies, and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

2606.17209 2026-06-17 cs.AI cs.IR New submission

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong

Affiliations: Carnegie Mellon University; Instituto Superior Técnico and INESC-ID, University of Lisbon; NOVA LINCS, NOVA School of Science and Technology

AI summary: For breadth scaling in agentic search, proposes DivInit, which generates diverse first-turn queries instead of sampling them independently, reducing query redundancy and improving multi-hop QA by five to seven points on average.

Comments 15 pages, 8 figures; under review at EMNLP 2026

Abstract

Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redundancy at the first turn. When models issue similar first queries across rollouts, the threads retrieve overlapping evidence, and subsequent turns are conditioned on this shared retrieval. We address this limitation with DivInit, a training-free intervention at the first turn. Rather than sampling k independent first queries, DivInit draws n candidates from a single call, picks k < n diverse seeds, and runs them as parallel trajectories. Across five open-weight models and eight benchmarks, DivInit consistently improves over standard parallel sampling, with average gains of five to seven points on multi-hop QA at matched compute. Code available at https://github.com/cxcscmu/diverse-query-initialization
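
The abstract does not state how DivInit picks the k diverse seeds out of n candidates. One simple possibility is greedy farthest-point selection under a lexical distance, sketched below; the Jaccard distance, the function names, and the example queries are assumptions for illustration, and the authors' repository should be consulted for the actual procedure.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the two queries' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def pick_diverse(candidates, k):
    """Greedily pick k queries, each maximizing its distance to those chosen."""
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        rest = [c for c in candidates if c not in chosen]
        if not rest:  # only duplicates remain
            break
        nxt = max(rest, key=lambda c: min(jaccard_distance(c, s) for s in chosen))
        chosen.append(nxt)
    return chosen

# Hypothetical first-turn query candidates drawn from a single model call.
candidates = [
    "who founded the company",
    "who founded the firm",
    "company headquarters location",
    "founder birth year",
]
seeds = pick_diverse(candidates, k=2)
```

The near-duplicate "who founded the firm" is skipped in favour of a query with no word overlap, so the two rollouts start from different evidence.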

2606.17200 2026-06-17 cs.RO New submission

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li

Affiliations: ACE Robotics; CUHK MMLab; CUHK, Shenzhen; SJTU; THU

AI summary: Proposes ACE-EGO-0, a framework that unifies human and robot data for VLA pretraining through a scalable egocentric video-to-action pipeline and a reliability-aware training objective, achieving state-of-the-art results on multiple benchmarks.

Abstract

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.

2606.17199 2026-06-17 cs.LG cs.AI New submission

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen

Affiliations: Eastern Institute of Technology, Ningbo; The Hong Kong Polytechnic University; Shanghai Jiao Tong University; University of Waterloo

AI summary: Because the unbounded log-ratio reward destabilizes on-policy distillation, proposes PowerOPD, a family of bounded, sign-consistent rewards based on the Box-Cox power transformation. On mathematical reasoning tasks it raises benchmark-averaged Avg@8/Pass@8 by up to +6.37/+5.71 while cutting wall-clock time by 59.2% and peak GPU memory by 23.1%.

Abstract

Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
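
The relationship between the Box-Cox power transformation and the log-ratio can be checked numerically. The sketch below implements only the textbook Box-Cox transform; the exact reward used by PowerOPD, including how it is bounded on both sides, is not given in the abstract, and the variable names and sample values are assumptions.

```python
import math

def box_cox(x, alpha):
    """Textbook Box-Cox transform of a positive number x."""
    if x <= 0:
        raise ValueError("x must be positive")
    if alpha == 0:
        return math.log(x)  # limiting case as alpha -> 0
    return (x ** alpha - 1.0) / alpha

ratio = 2.0                        # hypothetical probability ratio
near_log = box_cox(ratio, 1e-6)    # close to log(2) for tiny alpha
floor = box_cox(1e-12, 0.5)        # stays above -1/alpha = -2
```

For alpha > 0 the transform has the same sign as log(x) and never falls below -1/alpha, whereas log(x) diverges as x approaches 0. The textbook form is still unbounded above, so the paper presumably adds something further to obtain a fully bounded reward.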

2606.17192 2026-06-17 cs.LG New submission

Constrained Diffusion Models with Primal-Dual Inference

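Samar Hadou, Yigit Berkay Uslu, Alejandro Ribeiro

Affiliations: Department of Electrical and Systems Engineering, University of Pennsylvania

AI summary: Proposes primal-dual inference (PDI), which jointly infers the optimal primal distribution and its dual variable by alternating denoising and dual ascent during the reverse diffusion process, enabling sampling from the optimal distributions of entropy-regularized optimization problems with average constraints.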

Abstract

This paper develops constrained diffusion models with primal-dual inference (PDI) to sample from optimal distributions of entropy-regularized optimization problems with average constraints. We formalize constrained sampling in the Lagrangian dual domain, where the optimal distribution takes the form of a Gibbs distribution indexed by the optimal dual variable. Rather than estimating this dual multiplier before sampling and freezing it throughout generation, PDI jointly infers the optimal primal distribution and its parametrizing dual variable. Each reverse diffusion step denoises using the score field associated with the current multiplier and then updates the multiplier through dual ascent using the estimated constraint violation of the denoised samples. To enable this conditional score field, we train a single dual-conditioned score network over the family of Gibbs distributions induced by the dual variables encountered during inference. We prove that the time average of the dual variables generated along the inference trajectory converges to a neighborhood of the dual optimum and bound the effect of residual dual mismatch on the terminal distribution through schedule-dependent stability factors. We evaluate PDI on constrained sampling from a mixture of Gaussians, wireless resource allocation, and portfolio management.
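
The dual-ascent idea can be illustrated on a toy problem without any diffusion model. In the sketch below the Gibbs distribution lives on a finite grid and the constraint expectation is computed exactly, whereas PDI interleaves the multiplier update with reverse diffusion steps and estimates the violation from denoised samples. The objective, constraint, step size, and function names are assumptions.

```python
import numpy as np

def gibbs(f, g, lam):
    """Gibbs distribution p(x) proportional to exp(-f(x) - lam * g(x))."""
    logits = -(f + lam * g)
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def dual_ascent(f, g, b, step=0.5, iters=500):
    """Projected dual ascent for the average constraint E_p[g] <= b."""
    lam = 0.0
    for _ in range(iters):
        p = gibbs(f, g, lam)
        violation = float(p @ g) - b            # positive when violated
        lam = max(0.0, lam + step * violation)  # keep the multiplier >= 0
    return lam, gibbs(f, g, lam)

x = np.linspace(-2.0, 2.0, 41)
f = (x - 1.0) ** 2   # objective prefers x near 1
g = x                # constraint: the mean of x must be at most 0
lam, p = dual_ascent(f, g, b=0.0)
mean_x = float(p @ g)
```

The unconstrained distribution concentrates near x = 1 and violates the constraint, so the multiplier grows until the tilted Gibbs distribution is centred at 0, where the average constraint holds with equality.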

2606.17188 2026-06-17 cs.CV cs.CL 新提交

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

并非真正的多语言:脚本一致性作为VLM评估中缺失的维度

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

发表机构 * RediMinds Inc.(RediMinds公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员)

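AI summary Introduces the PuMVR benchmark, evaluates 10 VLMs on Punjabi's three scripts, finds a substantial script gap, and proposes the Script Consistency Rate (SCR) as a necessary evaluation metric.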

详情
AI中文摘要

当前视觉语言模型(VLM)的多语言评估假设语言与正字法一一对应,忽略了使用多种文字语言的数十亿用户。我们引入了PuMVR(旁遮普多模态视觉推理),这是一个包含1000个严格平行图像-文本实例的基准,覆盖旁遮普语的三种活跃文字:古木基文、沙穆基文和罗马文。评估10个最先进的VLM,我们暴露了一个显著且系统的脚本差距。模型经常在一种文字上解决视觉任务,而在另一种文字上失败,准确率差异高达16%。关键的是,视觉输入均匀地提升了绝对性能,但并未缩小正字法差距。此外,跨文字的上下文迁移非常脆弱,揭示了脚本锁定的知识表示。通过所有文字对的McNemar检验支持,我们的发现表明当前的“多语言”VLM并非真正的多文字。我们提出脚本一致性率(SCR),在我们的基准上低至24.8%,作为脚本无关评估的强制性指标,以确保公平的AI访问。数据和代码可在以下网址获取:this https URL。

英文摘要

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.
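The abstract names two quantities without defining them: the Script Consistency Rate and McNemar tests over script pairs. The sketch below shows one plausible way to compute both from per-item correctness flags. The SCR definition used here (the share of items answered correctly in every script) is an assumption, not the paper's stated formula, and the function names are hypothetical. The McNemar test is the standard exact two-sided version on discordant pairs.

```python
from math import comb

def script_consistency_rate(correct_by_script):
    """Share of items answered correctly in every script (assumed definition)."""
    scripts = list(correct_by_script)
    n_items = len(correct_by_script[scripts[0]])
    consistent = sum(
        all(correct_by_script[s][i] for s in scripts) for i in range(n_items)
    )
    return consistent / n_items

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar p-value for paired correctness flags."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A call such as `script_consistency_rate({"gurmukhi": [...], "shahmukhi": [...], "roman": [...]})` takes one list of booleans per script, aligned by item.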

2606.17183 2026-06-17 cs.RO 新提交

VL-MemKnG: Hybrid Memory with a Spatio-Temporal Knowledge Graph for Question Answering over Long Egocentric Navigation Trajectories

VL-MemKnG:结合时空知识图谱的混合记忆用于长程自我中心导航轨迹问答

Svetlana Lukina, Mohamad Al Mdfaa, Gloria Haro, Sergey Zagoruyko, Gonzalo Ferrer

发表机构 * Mobile Robotics Laboratory, Artificial Intelligence Center(移动机器人实验室,人工智能中心) Skoltech(斯科尔科沃科学技术学院) Intelligent Multimodal Vision Analysis Group, Department of Engineering, Universitat Pompeu Fabra(智能多模态视觉分析组,工程系,庞培法布拉大学) Independent Researcher(独立研究员)

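AI summary Proposes VL-MemKnG, a hybrid memory framework that combines a spatio-temporal knowledge graph with segment-level contextual memory and uses a hybrid retrieval-and-reasoning module to improve the accuracy and efficiency of navigation question answering over long egocentric videos.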

详情
AI中文摘要

回答长程自我中心视频中的导航相关问题需要检索和组织分布在遥远时间瞬间的证据,同时保持空间和上下文一致性。尽管长上下文视觉-语言模型能够实现强大的答案质量,但对于长轨迹而言计算成本高昂,且对于重复查询效率低下。最近基于图的方法(如VL-KnG)通过持久化时空知识图谱解决了这一挑战,但仅依赖图检索可能不足以表达更广泛的时间连续性和上下文线索。我们提出了VL-MemKnG,一种混合记忆框架,它扩展了VL-KnG,将时空知识图谱与持久化片段级上下文记忆相结合。知识图谱捕获结构化关系信息和长程对象关联,而片段级记忆则保留更广泛的时间上下文以进行长程证据检索。混合检索与推理模块联合操作于两种记忆表示之上,生成基于证据的答案和时间上组织的支持证据。我们还引入了WalkieKnowledgeT+,这是WalkieKnowledge的扩展,用于长程导航导向的视频问答。该基准包括需要跨多个非共现时刻进行证据聚合的时间分布式推理任务。在WalkieKnowledgeT+上,VL-MemKnG将Top-1检索准确率从58%提升至67%,Recall@1从34.50%提升至40.55%,优于所有对比方法,包括Gemini 2.5 Pro和Qwen 3.5+。在时间全局和时间分散聚合问题上提升尤为显著,证明了将结构化关系记忆与片段级上下文记忆相结合的优势,同时保持高效的查询时推理。

英文摘要

Answering navigation-relevant questions over long egocentric videos requires retrieving and organizing evidence distributed across distant temporal moments while maintaining spatial and contextual consistency. Although long-context vision-language models can achieve strong answer quality, they are computationally expensive for long trajectories and inefficient for repeated querying. Recent graph-based approaches such as VL-KnG address this challenge through persistent spatio-temporal knowledge graphs, but graph-centric retrieval alone may underrepresent broader temporal continuity and contextual cues. We present VL-MemKnG, a hybrid memory framework that extends VL-KnG by combining a spatio-temporal knowledge graph with persistent segment-level contextual memory. The knowledge graph captures structured relational information and long-range object associations, while segment-level memory preserves broader temporal context for long-horizon evidence retrieval. A hybrid retrieval-and-reasoning module jointly operates over both memory representations to produce evidence-grounded answers and temporally organized supporting evidence. We also introduce WalkieKnowledgeT+, an extension of WalkieKnowledge for long-horizon navigation-oriented video question answering. The benchmark includes temporally distributed reasoning tasks requiring evidence aggregation across multiple non-cooccurring moments. On WalkieKnowledgeT+, VL-MemKnG improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55%, outperforming all compared methods, including Gemini 2.5 Pro and Qwen 3.5+. The gains are particularly pronounced on temporal-global and temporally scattered aggregation questions, demonstrating the benefits of combining structured relational memory with segment-level contextual memory while maintaining efficient query-time inference.

2606.17182 2026-06-17 cs.LG cs.DC cs.LO cs.MA cs.PL 新提交

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

多智能体大语言模型系统中并发异常的可验证检测与预防

Sajjad Khan

发表机构 * independent researcher(独立研究员)

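AI summary For multi-agent LLM systems, formalizes four concurrency anomalies and establishes a consistency hierarchy, verifies detector correctness with Verus, and implements prevention in Rust runtimes.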

Comments 32 pages, 2 figures, 6 tables. Verus/TLA+ verification artifact, reference Rust runtime, and Python harnesses, plus a supplementary appendix (Sections A-F, Tables S1-S6), included as ancillary files

详情
AI中文摘要

多智能体LLM系统通过内存存储、向量索引和工具注册表共享状态。我们将这种共享建模为在确定性生成语义(持久化执行引擎通过确定性重放强制执行的机制)下的长期读-生成-写操作,并在TLA+中形式化了四种并发异常:陈旧生成、幻影工具、因果级联和工具效应重排序,它们是经典隔离异常的结构类比,每种都有TLC反例。这些异常上的排除格是平凡的;贡献在于机械验证了其中一条最大链$L_0 \subsetneq \cdots \subsetneq L_4$的可实现性和严格分离,据我们所知,这是此类运行时第一个机器检查的一致性层级。通过274个Verus义务(零假设、零接受;信任基础:两个结构公理和一个互斥对应关系)的开发,证明了检测器相对于规范的正确性和完备性,以及每个运行时对应的避免集。三个部署的Rust运行时实现了L0-L1(悲观锁、可序列化快照隔离、默认SI),每个都针对陈旧生成进行了验证并细化到其状态机;L2-L4通过执行模式验证,并具有无依赖的预防孪生(A3、A6、A2:0/1000对比1000/1000),L2在三个模型家族上实时运行(A3在所有120个撤回会话中均被预防)。我们复现了字节跳动deer-flow中的静默丢失更新,将其修复形式化为已验证的$L_0 \to L_1$细化,并在LangGraph的ToolNode上展示了未修改输出中的工具效应重排序,通过L3提交顺序序列器消除。已验证的检测器、细化和可实现性工件是贡献;现象和格是经典的。

英文摘要

Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics -- the regime durable-execution engines enforce by deterministic replay -- and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $L_0 \subsetneq \cdots \subsetneq L_4$, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified $L_0 \to L_1$ refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.

2606.17180 2026-06-17 cs.LG 新提交

Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations

面向复杂地质构造中CO2运移的快速GNN替代模型

Rodrigo S. Luna, Thiago H. N. Coelho, Luiz S. L. Neto, Roberto M. Velho, Adriano M. A. Cortes, Renato N. Elias, Alexandre G. Evsukoff, Fernando A. Rochinha, Mauricio Araya-Polo, Herve Gross, Alvaro L. G. A. Coutinho

发表机构 * Systems and Computer Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro(里约热内卢联邦大学COPPE工程研究生院NACAD高性能计算中心,系统与计算机工程) Civil Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro(里约热内卢联邦大学COPPE工程研究生院NACAD高性能计算中心,土木工程) Mechanical Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro(里约热内卢联邦大学COPPE工程研究生院NACAD高性能计算中心,机械工程) Shell Global Solutions International B.V.(壳牌全球解决方案国际公司) TotalEnergies OneTech(道达尔能源OneTech)

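AI summary Proposes an end-to-end graph neural surrogate for forecasting CO2 plume migration in geological storage, achieving competitive forecasts on the SPE11A benchmark through anisotropic message passing and an autoregressive residual formulation.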

详情
AI中文摘要

本章讨论数据驱动的机器学习方法如何再现复杂地质构造中多相流物理行为的关键方面。我们提出了一种端到端的图神经替代模型,专门用于地质封存中CO$_2$羽流运移预测。该方法在SPE11A基准上进行了评估,这是一个著名的行业测试案例,旨在评估CO$_2$封存场景,其特点是尖锐的气-水界面、强平流输运以及伴随指进发展的快速对流混合。该基准被重新表述为一个图,其中节点表示计算单元,边编码基于传导率的相互作用,并辅以几何属性。由网格几何、渗透率对比和地质非均质性引起的方向性输运通过各向异性消息传递机制捕获,其中交互权重通过几何条件化的边嵌入计算,使消息聚合偏向于物理相关的输运方向。时间演化在潜在空间中使用自回归残差公式建模,并通过多步监督训练。所提出的模型对气体饱和度和液相密度(CO$_2$封存监测的关键指标)产生了具有竞争力的预测,在较长的预测范围内累积误差保持适中。

英文摘要

This chapter discusses how a data-driven machine learning approach can reproduce key aspects of the physical behavior of multiphase flows in complex geological formations. We propose an end-to-end graph neural surrogate tailored to CO$_2$ plume migration forecasting in geological storage. The method is evaluated on the SPE11A benchmark, a well-known industry test case designed to assess CO$_2$ storage scenarios and characterized by sharp gas-water interfaces, strong advective transport, and rapid convective mixing with fingering development. The benchmark is reformulated as a graph in which nodes represent computational cells and edges encode transmissibility-based interactions enriched with geometric attributes. Directional transport arising from grid geometry, permeability contrasts, and geological heterogeneity is captured through an anisotropic message-passing mechanism, where interaction weights are computed via geometry-conditioned edge embeddings, biasing message aggregation toward physically relevant transport directions. Temporal evolution is modeled in latent space using an autoregressive residual formulation trained with multi-step supervision. The proposed model produces competitive forecasts of gas saturation and liquid-phase density, which are key indicators for CO$_2$ storage monitoring, with cumulative errors that remain moderate over extended forecasting horizons.

2606.17175 2026-06-17 cs.CL 新提交

Self-Generated Error Training for Token Editing in Diffusion Language Models

扩散语言模型中令牌编辑的自生成错误训练

Lin Yao

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学与技术学院) Zhongguancun Academy(中关村学院)

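AI summary Addresses the training-inference mismatch of token editing in LLaDA2.1 with self-generated T2T, which uses a no-gradient draft pass and supervision on self-generated errors to improve accuracy while reducing edit intensity.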

详情
AI中文摘要

令牌到令牌(T2T)编辑允许LLaDA2.1在块扩散解码过程中修正已提交的令牌。已发布的配方在随机词汇损坏上训练该编辑器,但在推理时,编辑器看到的是模型自身流畅、高置信度的草稿错误。我们研究了这种训练-推理不匹配,并提出了自生成T2T,该方法执行无梯度草稿传递,用预测的令牌填充掩码位置,并在第二次传递中在这些自生成损坏下监督恢复。我们将更新实现为LLaDA2.1-mini上的短LoRA持续预训练传递,并在官方Q-Mode T2T程序下使用不变的推理参数在多个基准上进行评估。该方法通常提高准确率,同时降低T2T编辑强度,缓解了诸如在正确推理后出现最终数字转录错误以及在简短事实答案前过度自我纠正等失败模式。

英文摘要

Token-to-token (T2T) editing lets LLaDA2.1 revise committed tokens during block-diffusion decoding. The released recipe trains this editor on random vocabulary corruptions, but at inference the editor sees the model's own fluent, high-confidence draft errors instead. We study this training-inference mismatch and propose self-generated T2T, which performs a no-gradient draft pass, fills masked positions with predicted tokens, and supervises recovery in a second pass under these self-generated corruptions. We implement the update as a short LoRA continued-pretraining pass on LLaDA2.1-mini and evaluate on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. The method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.

2606.17174 2026-06-17 cs.CL cs.CY cs.MA 新提交

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

从准社会脚本到自主AI智能体社区中的二元持久性

Mohammadsadegh Abolhasani, Hamid Reza Firoozfar, Reza Mousavi, Paul Jen-Hwa Hu

发表机构 * University of Utah(犹他大学) University of Virginia(弗吉尼亚大学)

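AI summary Studies whether parasocial interaction (PSI) cues exist in communities of autonomous AI agents by analyzing posts and comments with keyword matching and few-shot LLM annotation, among other methods; finds that PSI cues are strongly associated with OP re-engagement and a reciprocal reply structure, and uses a dyadic persistence test to show that interaction-level PSI scripts align with repeated dyadic patterns.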

Comments Submitted for review in ARR for EMNLP 2026

详情
AI中文摘要

虽然准社会互动(PSI)和准社会关系(PSR)已在传统媒体环境中得到研究,但我们调查了在双方均为自主AI智能体的在线社区中是否也存在PSI(口语化)关系线索。我们通过三个基于理论的文本指标分析了Moltbook上的4,434篇帖子和50,338条评论:依恋/亲密语言、互惠邀请以及对原始发帖者(OP)的自我认同。基于关键词匹配、少样本大语言模型(LLM)标注和分组上下文LLM标注的方法的综合结果表明,PSI口语化线索普遍存在,并且与OP再参与和互惠回复结构强相关。这些结果在负对照、无效化、聚类标准误重估计和多重检验校正中均稳健。二元持久性测试进一步证实了互惠邀请与持续涉及OP的相互重复模式一致,为将互动层面的PSI脚本与符合PSR的重复二元模式联系起来提供了实证证据。我们将这些证据解释为由LLM驱动的智能体在话语中的行为结构。

英文摘要

While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification with the original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues are prevalent and strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms that reciprocity bids align with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

2606.17168 2026-06-17 cs.CL 新提交

RepSelect: Robust LLM Unlearning via Representation Selectivity

RepSelect: 通过表示选择性实现鲁棒的大语言模型遗忘

Filip Sondej, Yushi Yang, Adam Mahdi

发表机构 * Independent(独立) University of Oxford(牛津大学)

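AI summary Because existing unlearning methods are easily reversed by fine-tuning or few-shot prompting, proposes RepSelect, which isolates forget-set-specific representations by collapsing the top principal components of weight gradients, achieving deep and robust unlearning.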

详情
AI中文摘要

使大语言模型(LLMs)深度遗忘特定知识和价值观而不牺牲通用能力仍然是遗忘领域的一个核心挑战。然而,当前方法容易被微调或少样本提示逆转,表明其遗忘仅是浅层的。我们找到了根本原因:现有方法针对与保留集以及微调攻击者恢复的子空间共享的表示,这使得遗忘既破坏通用能力又容易被逆转。我们提出RepSelect(表示选择性),通过在每次更新前坍塌权重梯度的主成分来隔离遗忘集特定的表示,从而保持通用能力完整,同时限制微调可恢复的内容。我们在两个遗忘类别(生物危害知识和虐待倾向)以及四种涵盖密集和混合专家架构的模型家族(Llama 3、Qwen 3.5、Gemma 4 E4B、DeepSeek V2 Lite)上进行评估。与五种流行基线(GradDiff、NPO、SimNPO、RMU、UNDIAL)相比,RepSelect在重新学习后的答案准确性上实现了比最强基线大4-50倍的降低,并且对少样本提示攻击近乎完美鲁棒。因此,针对选择性表示是实现深度鲁棒LLM遗忘的重要一步。

英文摘要

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.

2606.17164 2026-06-17 cs.CL cs.AI cs.HC cs.PL cs.SE 新提交

PromptMN: Pseudo Prompting Language

PromptMN: 伪提示语言

Enkhzol Dovdon

发表机构 * ICT Group(ICT集团)

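AI summary Proposes PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives to reduce context ambiguity and make human-to-AI interaction clearer and more reviewable.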

Comments 32 pages, 2 figures

详情
AI中文摘要

提示已成为人类与生成式AI之间的主要接口,然而许多自然语言提示仍然脆弱:角色、目标、约束和预期输出常常埋没在散文中或隐含起来。在智能体和软件开发工作流中,首次交接时的误读可能会传播到每一步,因为相当一部分智能体故障源于上下文歧义而非模型限制。本文介绍PromptMN,一种伪提示领域特定语言,它用紧凑的、以%为前缀的类型指令注释自然语言,涵盖角色、目标、需求、优先级、约束、计划、输入和输出。语义解析允许作者以任意顺序编写,而模型根据功能解释指令。PromptMN介于非正式提示和编程风格伪代码之间:结构足够可检查和可重用,又足够轻量,适用于软件开发生命周期(SDLC)中的分析师、管理者、开发者和利益相关者。PromptMN还与逆向提示工程配合使用。要求模型将期望结果重述为PromptMN,让用户在执行前检查推断的角色、目标、约束和缺失假设,从而减少修复周期,并产生一个可重用的工件来对齐人员和AI工具。PromptMN的可行性在多个前沿模型上进行了评估,包括Claude Fable 5、Claude Opus 4.8、Gemini 3.1 Pro和GPT-5.5。这些模型正确解析了PromptMN指令,包括复杂结构如重复、条件、方法和素数检查任务,无需微调。相同的词汇适用于所呈现的SDLC场景中的新代码库、维护和重新设计。虽然大规模验证仍是未来工作,但这些早期结果表明PromptMN是朝着更清晰、更可审查的人机交互迈出的实际一步。

英文摘要

Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.

2606.17162 2026-06-17 cs.CL cs.HC cs.MA 新提交

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

MemSlides:一种用于个性化幻灯片生成与多轮局部修订的层次化记忆驱动智能体框架

Ye Jin, Yangyang Xu, Jun Zhu, Yibo Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学)

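AI summary Proposes MemSlides, a hierarchical memory framework that separates long-term memory (user profile memory and tool memory) from working memory and pairs it with scoped local revision, so that personalized slide generation preserves user preferences, supports multi-turn revision, and performs local edits reliably.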

Comments Code, website, project page, and video are linked in the paper

详情
AI中文摘要

个性化演示文稿生成不仅需要基于当前提示或模板的条件生成:智能体必须跨任务保持稳定的用户偏好,在多轮修订中保留新引入的偏好和约束,并可靠地执行局部编辑。我们提出MemSlides,一种用于个性化演示智能体的层次化记忆框架,将长期记忆与工作记忆分离,并进一步将长期记忆分为用户画像记忆和工具记忆。用户画像记忆存储意图条件化的画像,用于第0轮个性化;工作记忆跨修订轮次携带活跃偏好和会话约束;工具记忆存储可重用的执行经验,用于可靠的局部编辑。MemSlides将此记忆设计与有范围的幻灯片局部修订相结合,使得目标更新作用于最小受影响区域,而非重复生成整个演示文稿。在控制实验中,用户画像记忆提高了多人物、多意图画像库上的人物对齐判断;工具记忆注入在诊断性配对设置中改善了闭环修改行为;定性案例展示了工作记忆传递偏好的能力。综合来看,这些结果表明,演示文稿创作中的有效个性化依赖于在生成和局部修订过程中分离持久用户画像、会话级工作记忆和可重用执行经验。

英文摘要

Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory's ability to carry over preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.

2606.17160 2026-06-17 cs.SD 新提交

Transductive Zero-Shot Audio Classification with Audio-Language Models

基于音频-语言模型的直推式零样本音频分类

Jingwen Zhou, Mingzhe Wang

发表机构 * Xidian University, Xi'an, China(西安电子科技大学)

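AI summary Proposes a text-anchored spherical Gaussian-mixture EM algorithm that refines zero-shot posteriors using the audio-embedding statistics of the test batch, with no labels and no gradients, improving top-1 accuracy by 4.6 to 9.2 points on three datasets.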

详情
AI中文摘要

对比语言-音频预训练(CLAP)实现了零样本音频分类,但标准推理孤立地对每个片段进行分类,忽略了未标记测试集的结构。我们首次对CLAP的TransCLIP风格直推式推理进行了系统研究:一种文本锚定的球面高斯混合EM算法,利用测试批次的音频嵌入统计信息改进零样本后验,无需标签、无需梯度,且计算量可忽略(在单个CPU核心上处理2000个片段约需15毫秒)。在ESC-50、UrbanSound8K和VocalSound上,该方法始终将top-1准确率提升+4.6至+9.2个百分点(例如,ESC-50从89.1%提升至94.8%,UrbanSound8K从73.8%提升至81.8%)。我们进一步表明,该增益(i)受一个简单的操作边界控制——每批次每类约需2.5个测试样本,超过约5个样本后收益递减;(ii)与熵引导的提示加权互补,两者结合在ESC-50上达到96.2%;以及(iii)在长尾批次下衰减但仍为正(在20:1不平衡下从+4.9降至+3.1个百分点),我们将其报告为显式限制。我们还记录了一个负面结果:在TUT Urban Acoustic Scenes 2018上,零样本CLAP接近随机水平,直推式没有信号可放大。

英文摘要

Contrastive language-audio pretraining (CLAP) enables zero-shot audio classification, but standard inference classifies each clip in isolation and ignores the structure of the unlabeled test set. We present the first systematic study of TransCLIP-style transductive inference for CLAP: a text-anchored spherical Gaussian-mixture EM that refines zero-shot posteriors using the audio-embedding statistics of the test batch, with no labels, no gradients, and negligible compute (about 15 ms on one CPU core for 2,000 clips). Across ESC-50, UrbanSound8K, and VocalSound, this consistently improves top-1 accuracy by +4.6 to +9.2 points over the zero-shot baseline (e.g., 89.1 -> 94.8% on ESC-50, 73.8 -> 81.8% on UrbanSound8K). We further show that the gain (i) is governed by a simple operating boundary -- roughly 2.5 test samples per class per batch are required, with diminishing returns beyond ~5; (ii) is complementary to entropy-guided prompt weighting, with the combination reaching 96.2% on ESC-50; and (iii) attenuates but remains positive under long-tailed batches (+4.9 -> +3.1 points at a 20:1 imbalance), which we report as an explicit limitation. We also document a negative result: on TUT Urban Acoustic Scenes 2018, where zero-shot CLAP is near chance, transduction has no signal to amplify.
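A text-anchored spherical Gaussian-mixture EM of the kind described in this abstract can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation or the exact TransCLIP objective: the zero-shot softmax serves as a fixed text prior, all components share one variance `sigma2`, and the temperature, variance, and iteration count are hypothetical defaults.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def transductive_refine(audio, text, temperature=0.05, sigma2=0.1, n_iter=10):
    """Refine zero-shot posteriors using the statistics of the test batch.

    audio: (N, D) audio embeddings of the unlabeled test batch.
    text:  (K, D) class-prompt embeddings acting as anchors.
    """
    audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    text = text / np.linalg.norm(text, axis=1, keepdims=True)
    # Zero-shot posteriors from cosine similarity to the text anchors.
    zero_shot = softmax(audio @ text.T / temperature)
    post = zero_shot.copy()
    for _ in range(n_iter):
        # M-step: class centroids as posterior-weighted means of the audio.
        weights = post / (post.sum(axis=0, keepdims=True) + 1e-12)
        centroids = weights.T @ audio  # shape (K, D)
        # E-step: spherical Gaussian likelihood combined with the text prior.
        sq_dist = ((audio[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        log_post = -sq_dist / (2.0 * sigma2) + np.log(zero_shot + 1e-12)
        post = softmax(log_post)
    return zero_shot, post
```

The loop uses no labels and no gradients, only matrix products over the batch, which is consistent with the negligible compute cost the abstract reports.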

2606.17126 2026-06-17 cs.SD cs.AI 新提交

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

通过改进独立控制实现歌唱声音转换中的颤音表达控制

Joon-Seung Choi, Dong-Min Byun, Seong-Whan Lee

发表机构 * Korea University(高丽大学)

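AI summary Proposes the VibE-SVC2 framework, which uses an Energy Style Converter, a Zero-shot Pitch Style Converter, vibrato rate scaling, and a Subharmonic Correction algorithm to provide fine-grained, independent control over pitch and timbre singing styles, outperforming existing methods.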

Comments Accepted to IEEE Transactions on Audio, Speech, and Language Processing (TASLP)

详情
AI中文摘要

歌唱风格是自然且富有表现力的歌声的关键方面。歌手利用歌唱风格来传达歌曲的情感。已有若干工作提出控制歌唱风格以制作更具表现力的歌声。最近,VibE-SVC通过预测高频F0轮廓成功控制了颤音。在本文中,我们引入了一个名为VibE-SVC2的歌唱声音转换框架,以改进歌唱风格转换性能和可控性。该模型提供对两种歌唱风格的控制:音高风格和音色风格。对于音高风格,为了解决我们先前工作中未解决的能量-音高纠缠问题,我们引入了一种新颖的能量风格转换器来处理能量轮廓中剩余的样式信息。此外,我们提出了一种零样本音高风格转换器,它模仿参考音频的音高风格。为了扩展模型的可控性,我们提出了颤音速率缩放,这是对颤音程度的独立控制,这在VibE-SVC中是不可用的。对于音色风格,我们扩展了模型以处理多种发声风格。然而,解决诸如气泡音等特定风格带来了挑战,因为传统的F0提取由于其固有的次谐波特性而常常失败,这降低了转换质量。为了解决这个问题,我们提出了一种新颖的次谐波校正算法来细化F0轮廓,以实现更自然的音色转换。通过全面的客观和主观评估,我们证明了VibE-SVC2提供了对两种歌唱风格的精细、独立控制,优于现有方法。

英文摘要

Singing style is a crucial aspect of a natural and expressive singing voice. Singers use singing styles to convey the feeling or emotion of a song. Several works have proposed controlling singing style to make the singing voice more expressive. Recently, VibE-SVC successfully controlled vibrato by predicting the high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: a pitch style and a timbre style. For the pitch style, to resolve the pitch-energy entanglement issue left unresolved in our previous work, we introduce a novel Energy Style Converter to handle the style information remaining in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of reference audio. To expand the controllability of the model, we propose vibrato rate scaling, a control independent of vibrato extent that is unavailable in VibE-SVC. For the timbre style, we extend the model to handle a variety of phonation styles. However, addressing specific styles such as vocal fry poses a challenge, as conventional F0 extraction often fails due to their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.

2606.17118 2026-06-17 cs.LG cs.AI 新提交

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

MODE: 面向MoE多模态大语言模型的模态分解专家级混合精度量化

Yuanteng Chen, Peisong Wang, Zhilei Liu, Nanxin Zeng, Yuantian Shao, Shiqiang Lang, Tao Liu, Chuangyi Li, Qinghao Hu, Gang Li, Jing Liu, Jian Cheng

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Zhongguancun Academy(中关村学院)

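AI summary To address the cross-modal and intra-vision biases in expert importance estimation for MoE multimodal LLMs, proposes MODE, a modality-decomposed expert-level mixed-precision quantization framework that decomposes selection frequency by modality, filters redundant vision tokens, evaluates per-modality sensitivity, and assigns bit-widths under a given budget, limiting average performance loss to within 2.9% at W3A16.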

Comments 18 pages, 8 figures

详情
AI中文摘要

混合专家多模态大语言模型(MoE-MLLMs)性能卓越,但GPU内存成本高昂,因此压缩至关重要。在PTQ方法中,专家级混合精度量化已被证明对MoE-LLMs有效,但由于专家重要性估计中两个被忽视的偏差,在MoE-MLLMs上性能显著下降。(1)在跨模态层面,视觉令牌的数值优势导致专家选择频率被视觉令牌主导,掩盖了对文本模态至关重要的专家;(2)在视觉内层面,大量冗余视觉令牌进一步扭曲频率统计,模糊了对信息性视觉内容关键的专家。为弥补差距,我们提出MODE,一种面向MoE-MLLMs的模态分解专家级混合精度量化框架,该框架按模态分解专家选择频率,过滤冗余视觉令牌以获得去噪的视觉频率,并进一步评估每个模态的量化敏感性作为基于频率估计的补充信号。这些信号被整合到整数线性规划公式中,以在给定预算下分配每个专家的比特宽度。大量实验表明,MODE特别适合MoE-MLLMs,在W3A16下平均性能损失限制在2.9%以内,在极端2比特设置下获得更大增益。

英文摘要

Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.
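Assigning per-expert bit-widths under a budget can be illustrated with a tiny stand-in for the Integer Linear Programming step. The sketch below enumerates all assignments instead of calling an ILP solver, which is only practical for a handful of experts. The cost model (importance times 2^(-bits)) and the function name are assumptions made for illustration; the abstract does not specify how MODE scores quantization error.

```python
from itertools import product

def allocate_bits(importance, bit_options=(2, 3, 4), avg_bits=3.0):
    """Pick per-expert bit-widths that minimize an assumed weighted error
    proxy, subject to an average-bit budget. Exhaustive search, toy sizes only."""
    n = len(importance)
    budget = avg_bits * n
    best_cost, best_assignment = float("inf"), None
    for assignment in product(bit_options, repeat=n):
        if sum(assignment) > budget:
            continue  # violates the memory budget
        # Hypothetical error proxy: important experts are hurt more by low precision.
        cost = sum(w * 2.0 ** (-b) for w, b in zip(importance, assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment
```

With importance scores that combine the per-modality signals described above, the more important experts receive more bits while the average stays within budget.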

2606.17115 2026-06-17 cs.LG cs.AI q-bio.QM 新提交

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

探测、融合与可信度:基础模型表示在多模态癌症分析中的系统评估

Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

发表机构 * The Alan Turing Institute(艾伦·图灵研究所) University of Bristol(布里斯托大学) University of Manchester(曼彻斯特大学) The Institute of Cancer Research(癌症研究所) Genentech(基因泰克)

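AI summary Systematically evaluates foundation model representations on computational pathology tasks, finds that image and omics representations are complementary and that multimodal fusion helps when no single modality dominates, and uses conformal prediction to show the clinical value of uncertainty-aware inference.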

详情
AI中文摘要

基础模型(FMs)已成为医学数据的强大表示提取器,但它们在分布偏移下的泛化能力仍未充分探索。本工作系统评估了基于FM的表示在计算病理学任务上的表现,涉及两个真实世界商业队列IH-BC和IH-NSCLC,这些队列来自许可的内部(IH)肿瘤学数据集。分析聚焦于两种模态:全切片图像和转录组图谱,均来自IH多模态数据。我们首先在八个下游分类任务上对五个FM进行单模态探测性能基准测试,发现图像和组学表示携带互补的预测信号。然后,我们通过比较三种基于配对表示的图像-组学融合策略,研究多模态融合是否能在单模态基线之上带来额外收益。进一步通过共形预测评估所选单模态和多模态管道的可信度。我们的结果表明,FM表示在分布外数据上取得了竞争性性能,且多模态融合主要在单模态不占主导信号时有所帮助。共形预测揭示,在点预测失败的大多数情况下,真实诊断仍可在预测集中恢复,这强化了不确定性感知推理对临床支持的价值。

英文摘要

Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
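The conformal prediction step mentioned in this abstract can be sketched with standard split conformal prediction for classification. This is a generic textbook construction, not the authors' pipeline: the nonconformity score (one minus the predicted probability of the true class), the function names, and the default `alpha` are assumptions.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal)."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return float(np.sort(scores)[k - 1])

def prediction_sets(test_probs, qhat):
    """All classes whose score falls at or below the calibrated threshold."""
    test_probs = np.asarray(test_probs, dtype=float)
    return [np.flatnonzero(1.0 - p <= qhat).tolist() for p in test_probs]
```

A prediction set with more than one class signals an uncertain case. A point prediction can be wrong while the true label still sits inside the set, which is the recoverability property the abstract highlights.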

2606.17113 2026-06-17 cs.LG cs.CL 新提交

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

模型选择在因果推断中的关键作用:基于InferBERT框架的药物警戒分类模型比较分析

Csaba Kiss, Roland Molontay, Gabriele Pergola

发表机构 * Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics(布达佩斯技术与经济大学数学研究所随机学系) Institute of Biostatistics and Network Science, Semmelweis University(塞梅维什大学生物统计学与网络科学研究所) Department of Computer Science, University of Warwick(华威大学计算机科学系)

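AI summary: This study compares four models (XGBoost, ALBERT, BioBERT, and Med-LLaMA) within the InferBERT framework and finds that domain-specific pre-training (BioBERT) outperforms both the simple baseline and the larger LLM for causal ADE detection in pharmacovigilance; calibration improves ECE but has mixed effects on accuracy and causal discovery.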

Comments: 10 pages, 5 figures

Abstract

Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with Do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, assessing whether simpler models suffice, whether domain-specific pre-training helps, whether scaling to LLMs improves causal detection, and what effect post-hoc calibration has. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated using 5-fold cross-validation repeated over 20 runs: XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM). We measured accuracy, Expected Calibration Error (ECE) pre- and post-isotonic regression, and Jaccard concordance of causal terms with PRR, ROR, and EBGM; significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT's superiority also yielded the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.
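
Editor's sketch: the abstract above reports Expected Calibration Error (ECE) and Jaccard concordance between sets of causal terms. The snippet below is a minimal illustration of how these two metrics are commonly computed, not the authors' code. The equal-width binning, the default of 10 bins, the convention that two empty sets have concordance 1.0, and the function names `expected_calibration_error` and `jaccard` are all assumptions made for illustration.

```python
from typing import Sequence, Set


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty sequences of equal length")

    conf_sum = [0.0] * n_bins   # summed confidence per bin
    hits = [0] * n_bins         # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        conf_sum[b] += conf
        hits[b] += 1 if ok else 0
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 1.0 for two empty sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)
```

For example, four predictions made at confidence 0.9 of which three are correct give an ECE of about 0.15, the gap between 0.9 confidence and 0.75 accuracy.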

2606.17107 2026-06-17 cs.LG cs.AI new submission

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

Bojie Li

Affiliations: Pine AI

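AI summary: The study finds that the KV cache stores conclusions like notes and therefore supports editing and composition: editing a single field corrects the decision (accuracy 1.00 at 8B with only ~1% of the compute), and precompiled skills can be spliced into arbitrary contexts (logit cosine similarity 0.90-0.999), with time-to-first-token reduced to O(L).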

Abstract

Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
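
Editor's sketch: the abstract above says a precompiled skill can be "RoPE-repositioned" before being spliced into a new context. The snippet below illustrates the property that makes this possible in principle: rotary position embeddings are rotations, so a key cached at position p can be moved to position p + delta by applying one more rotation by delta. It is not the paper's implementation. The interleaved pairing of dimensions, the base of 10000, and the names `rope_rotate` and `reposition_key` are assumptions made for illustration.

```python
import math
from typing import List


def rope_rotate(vec: List[float], position: float, base: float = 10000.0) -> List[float]:
    """Apply a rotary position embedding for `position` to one vector.

    Dimensions are paired as (0, 1), (2, 3), ... and pair k is rotated by
    the angle position * base ** (-2k / d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        # i = 2k, so the exponent -i / d equals -2k / d.
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def reposition_key(cached_key: List[float], delta: float) -> List[float]:
    """Shift an already-rotated key by `delta` positions.

    Rotations by the same per-pair frequency compose additively, so one
    extra rotation by `delta` moves the key from position p to p + delta.
    """
    return rope_rotate(cached_key, delta)
```

Under these assumptions, a key computed at position 3 and shifted by 7 matches the key computed directly at position 10 up to floating-point error, without touching the model weights.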

2606.17106 2026-06-17 cs.LG cs.CY new submission

Informative Missingness to Generate Irregular Clinical Time Series

Hadi Mehdizavareh, Gabriele Santangelo, Giovanna Nicora, Simon Lebech Cichosz, Arianna Dagliati, Arijit Khan, Riccardo Bellazzi

Affiliations: Aalborg University; University of Pavia; Bowling Green State University

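AI summary: Proposes a diffusion-based method for generating clinical time series that jointly models laboratory values and their observation patterns; validated on the DACMI benchmark, it captures clinical dependencies between patient physiology and testing behavior.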

Abstract

Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology, making it important to model it directly rather than treat it as a preprocessing artifact. Here we present a diffusion-based approach for generating clinical time series that jointly models laboratory values and their observation patterns using the public Data Analytics Challenge on Missing Data Imputation (DACMI) benchmark derived from MIMIC-III. To preserve realistic sampling, we align chart times into 4-hour intervals and segment admissions into 7-day windows, producing trajectories that pair each lab value with a corresponding observation indicator. Standard transformations and normalization are applied to stabilize training. Our method extends the TimeDiff framework to learn continuous lab values and discrete missingness patterns through complementary diffusion objectives. Experiments show that the generated data closely match real patient trajectories across individual lab distributions and joint value-missingness embeddings, demonstrating that diffusion models can capture clinically meaningful dependencies between patient physiology and clinicians' testing behavior under MNAR-like (missing-not-at-random) missingness. These preliminary results indicate that our model can serve as an initial component toward developing clinical foundation models. By producing synthetic priors that preserve key physiology-missingness relationships, this work motivates the subsequent training of Prior-Data Fitted Networks capable of leveraging informative missingness, which we will investigate in the extended work.
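
Editor's sketch: the abstract above describes aligning chart times to 4-hour intervals within 7-day windows and pairing each lab value with an observation indicator. The snippet below shows one plausible way to build such a grid, not the authors' preprocessing. Keeping the latest measurement when several fall in one bin, filling unobserved cells with 0.0, and the names `grid_window`, `values`, and `observed` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

BIN_HOURS = 4
WINDOW_DAYS = 7
BINS_PER_WINDOW = WINDOW_DAYS * 24 // BIN_HOURS  # 42 bins per window

# One event: (hours since admission, lab name, measured value)
Event = Tuple[float, str, float]


def grid_window(
    events: Sequence[Event],
    labs: Sequence[str],
    window_start_hour: float = 0.0,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (values, observed) grids of shape [BINS_PER_WINDOW][len(labs)].

    observed[b][j] is 1 if lab j was measured in bin b, else 0.
    values[b][j] holds the latest measurement in that bin, or 0.0 if none.
    """
    column = {lab: j for j, lab in enumerate(labs)}
    values = [[0.0] * len(labs) for _ in range(BINS_PER_WINDOW)]
    observed = [[0] * len(labs) for _ in range(BINS_PER_WINDOW)]

    # Process in time order so a later measurement overwrites an earlier one.
    for hour, lab, value in sorted(events, key=lambda e: e[0]):
        offset = hour - window_start_hour
        if offset < 0 or lab not in column:
            continue
        b = int(offset // BIN_HOURS)
        if b >= BINS_PER_WINDOW:
            continue  # falls outside this 7-day window
        values[b][column[lab]] = value
        observed[b][column[lab]] = 1
    return values, observed
```

The `observed` grid is the discrete missingness pattern and `values` is the continuous part; a generative model along the lines described in the abstract would treat the two jointly rather than imputing the gaps first.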

2606.17093 2026-06-17 cs.LG eess.IV new submission

Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry

Adam Haroon, Anush Lakshman, Cody Fleming, Beiwen Li

Affiliations: Department of Mechanical Engineering, Iowa State University; College of Engineering, University of Georgia

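AI summary: Uses mechanistic interpretability and conformal uncertainty quantification to diagnose that networks for long-range single-shot fringe projection profilometry rely on shape priors rather than fringe-phase decoding, and proposes the PhiCalNet architecture as a repair, reducing object mean absolute error by 3.3x.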

Comments: 44 pages, 27 figures

Abstract

Learning-based single-shot fringe projection profilometry (FPP) has been studied mostly at close range. The long-range regime (standoff beyond 1 m) remains largely unaddressed: inverse-square intensity falloff lowers fringe signal-to-noise ratio and degrades physical ground truth, the single-shot problem is ill-posed because fringe-order information is absent from one image, and these architectures have not been studied mechanistically. We present a diagnose-repair-verify study using mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) as convergent diagnostics: they agree on one physical failure locus, driving and verifying an architectural repair. On a photorealistic synthetic benchmark (15,600 fringe images, 50 objects at 1.5-2.1 m), a best UNet baseline reaches 14.54 mm object mean absolute error (MAE). Three probes (linear probing, Grad-CAM, flat-plane out-of-distribution test) converge: the baseline solves the task via object-boundary shape priors rather than fringe-phase decoding. We repair this with PhiCalNet, which outputs wrapped phase rather than depth and applies a fixed differentiable calibration layer mapping phase to depth, removing the shape-prior solution from the hypothesis space architecturally rather than by a loss penalty. A physics-informed loss that enforces the same physics as a soft penalty on a depth-regressing network yields no measurable gain, isolating the architecture as the operative factor. PhiCalNet reduces object MAE 3.3x to 4.46 mm; the residual is carried by 0.103% of pixels at the ±π wrap discontinuity. Pixel-wise conformal UQ confirms the diagnosis: rejecting the top 5% of object pixels by snapshot disagreement cuts PhiCalNet RMSE by 64% (from 20.6 to 7.4 mm) versus 3.5% for the baseline. MI and UQ converge on the same failure locus.
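
Editor's sketch: the abstract above evaluates uncertainty by rejecting the 5% of object pixels with the highest disagreement score and recomputing RMSE on the rest. The snippet below illustrates that rejection test in general form, not the authors' evaluation code. The per-pixel `scores` stand in for whatever uncertainty measure is used (the paper's "snapshot disagreement" is not reproduced here), and the names `rmse` and `rmse_after_rejection` are assumptions made for illustration.

```python
import math
from typing import Sequence


def rmse(errors: Sequence[float]) -> float:
    """Root mean squared error of a non-empty sequence of per-pixel errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_after_rejection(
    errors: Sequence[float],
    scores: Sequence[float],
    reject_fraction: float = 0.05,
) -> float:
    """RMSE over the pixels kept after dropping the highest-scoring fraction.

    If the scores track the true errors, the RMSE drops sharply; if they do
    not, rejection changes little or can even make the RMSE worse.
    """
    n = len(errors)
    if n == 0 or n != len(scores):
        raise ValueError("need two non-empty sequences of equal length")
    if not 0.0 <= reject_fraction < 1.0:
        raise ValueError("reject_fraction must be in [0, 1)")

    n_reject = int(n * reject_fraction)  # rounds down
    # Sort pixel indices by ascending score and keep the lowest-scoring ones.
    order = sorted(range(n), key=lambda i: scores[i])
    kept = order[: n - n_reject]
    return rmse([errors[i] for i in kept])
```

Comparing `rmse(errors)` with `rmse_after_rejection(errors, scores)` gives the kind of before-and-after figure quoted in the abstract: a large drop suggests the uncertainty score is concentrated on the pixels that actually fail.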