arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.11166 2026-06-10 stat.OT cs.AI new submission

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

Affiliations: New York University

AI summary: A new benchmark that requires writing code to complete a data analysis task finds that a frontier LLM trails human experts in average performance, variance, and error magnitude, challenging claims that LLMs have reached human-expert level.

Abstract

Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high-stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.
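
To illustrate why variance and error magnitude matter alongside averages, the following minimal Python sketch summarizes two sets of absolute errors that share the same mean. The function name `summarize_errors` and every number are hypothetical; they are not the paper's metrics or data.

```python
from statistics import mean, pvariance

def summarize_errors(errors):
    """Return the mean, variance, and worst case of the absolute errors."""
    abs_errors = [abs(e) for e in errors]
    return {
        "mean": mean(abs_errors),
        "variance": pvariance(abs_errors),
        "max": max(abs_errors),
    }

# Hypothetical error values, for illustration only (not data from the paper).
steady = [0.9, 1.0, 1.1, 1.0]    # consistently small errors
erratic = [0.0, 0.0, 0.0, 4.0]   # mostly perfect, one large miss

steady_summary = summarize_errors(steady)
erratic_summary = summarize_errors(erratic)
```

Both summaries report the same mean error, but the second has a much larger variance and worst-case error, which an average-only benchmark would not reveal.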

2606.11156 2026-06-10 stat.ML cs.LG new submission

Itô maps for any-step SDEs

Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo, Jakiw Pidstrigach

Affiliations: Harvard University; University of Oxford; Kempner Institute

AI summary: Proposes the Itô map, an any-step stochastic flow map that predicts future states in a single forward pass, enabling exact distillation of stochastic dynamics and supporting inference-time control and posterior sampling.

Abstract

Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation procedure for stochastic dynamics. We introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass. The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples. Empirically, Itô maps produce diverse, conditionally valid endpoint samples from fixed intermediate states and support strong steering performance on synthetic and image-generation benchmarks. These results establish any-step SDE integration as a useful primitive for posterior sampling and stochastic control.
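
As a rough illustration of the object being distilled, the sketch below integrates an SDE with the Euler-Maruyama scheme as a deterministic function of the starting state and a fixed Brownian path. This multi-step loop is the kind of computation an any-step map would replace with one forward pass; the dynamics, step size, and function names are assumptions for illustration, not the paper's setup.

```python
import math
import random

def brownian_increments(n_steps, dt, rng):
    """Sample the increments of a Brownian path on a uniform time grid."""
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]

def integrate_sde(x0, increments, dt, drift, diffusion):
    """Euler-Maruyama: a deterministic function of the start state and the path."""
    x = x0
    for dw in increments:
        x = x + drift(x) * dt + diffusion(x) * dw
    return x

# Hypothetical Ornstein-Uhlenbeck dynamics dX = -X dt + 0.5 dW.
drift = lambda x: -x
diffusion = lambda x: 0.5
dt, n_steps = 0.01, 100

path = brownian_increments(n_steps, dt, random.Random(0))
endpoint = integrate_sde(1.0, path, dt, drift, diffusion)
# Reusing the same path reproduces the endpoint; a new path gives a new sample.
endpoint_again = integrate_sde(1.0, path, dt, drift, diffusion)
```

Holding the intermediate state fixed and resampling the path yields different endpoints, which is the sense in which such a map gives access to diverse conditional samples.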

2606.11125 2026-06-10 eess.SP cs.LG new submission

DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals

Yidan Shen, Neville Mathew, Maham Rahimi, Deependra Dhakal, George Zouridakis, Xin Fu, Renjie Hu

Affiliations: University of California, San Diego

AI summary: Proposes a Transformer-based network for cuffless blood pressure estimation from PPG signals that incorporates demographic information through FiLM-style feature modulation and adds an auxiliary morphology head to steer the model toward waveform morphology related to arterial stiffness, achieving an MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP on the PulseDB dataset.

Abstract

Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signal, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.
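
FiLM-style modulation, as mentioned above, scales and shifts feature channels with parameters predicted from side information. The sketch below shows that operation on a single feature vector in plain Python; the layer shapes, weights, and the two demographic inputs are hypothetical, and the paper's placement inside attention and feed-forward sublayers is not reproduced here.

```python
def linear(weights, bias, inputs):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def film(features, demographics, gamma_layer, beta_layer):
    """Scale and shift each feature channel using demographic covariates."""
    gamma = linear(*gamma_layer, demographics)
    beta = linear(*beta_layer, demographics)
    return [g * h + b for h, g, b in zip(features, gamma, beta)]

# Hypothetical values: two feature channels, two normalized covariates.
features = [0.5, -1.0]
demographics = [0.3, 1.0]
gamma_layer = ([[1.0, 0.0], [0.0, 2.0]], [1.0, 0.0])  # (weights, bias)
beta_layer = ([[0.0, 0.0], [0.0, 0.5]], [0.1, 0.0])

modulated = film(features, demographics, gamma_layer, beta_layer)
```

Because the scale and shift depend on the covariates, the same waveform features are transformed differently for different subjects, in contrast to late fusion where covariates are only appended near the output.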

2606.11044 2026-06-10 stat.ML cs.LG new submission

Generalized Conformal Predictive Systems Under Distributional Shifts

Jef Jonkers, Johanna Ziegel

Affiliations: IDLab, Department of Electronics and Information Systems, Ghent University; Seminar for Statistics, ETH Zurich, Zurich, Switzerland

AI summary: To handle distributional shift, extends generalized conformal predictive systems by encoding the shift through observation-specific permutation weights, yielding shift-aware predictive systems, and introduces weight-uncertainty boxes to construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees.

Comments: 27 pages, 10 figures

Abstract

Conformal predictive systems (CPS) output calibrated bands of CDFs under exchangeability. We extend generalized CPS to non-exchangeable settings by encoding distributional shifts through observation-specific permutation weights. This yields shift-aware predictive systems that remain valid whenever the test point is, conditionally on the unordered sample, a weighted draw from the observed atoms. Since such weights are typically estimated, we introduce weight-uncertainty boxes and construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees. We derive efficient computation for conformity-measure CPS, conformal binning, and conformal isotonic distributional regression. Experiments under covariate shift and feedback-driven biomolecular design show calibrated predictive bands that widen under stronger shifts and tighten as sample size increases.
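
The phrase "a weighted draw from the observed atoms" can be made concrete with a weighted empirical CDF. The sketch below is only that basic building block under assumed weights; it is not the paper's CPS construction and carries none of its guarantees.

```python
def weighted_cdf(values, weights, y):
    """Probability mass at or below y when atom i is drawn with weight w_i."""
    total = sum(weights)
    mass = sum(w for v, w in zip(values, weights) if v <= y)
    return mass / total

# Hypothetical observed atoms and weights.
values = [1.0, 2.0, 3.0]
exchangeable = weighted_cdf(values, [1, 1, 1], 2.0)  # equal weights
shifted = weighted_cdf(values, [1, 1, 2], 2.0)       # shift favors the largest atom
```

With equal weights this reduces to the ordinary empirical CDF; unequal weights move probability mass toward the atoms that the shift makes more likely.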

2606.10972 2026-06-10 eess.AS cs.AI new submission

Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

Ipek Sen, Ozgur Ozdemir, Elena Battini Sonmez

Affiliations: Dept. Electrical and Electronics Engineering, Istanbul Bilgi University, Turkey; Dept. Computer Engineering, Istanbul Bilgi University, Turkey

AI summary: Optimizes 2D input representations (MFCC, log-mel spectrogram, VAR model) and sub-phase feature fusion strategies (direct concatenation, GRU, GRU with attention) for differentiating asthma from COPD with CNN- and GRU-based networks, reaching a best F1-score of 0.877.

Abstract

This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.
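
One way to fix the temporal dimension without trimming or zero-padding is to let the window length scale with the recording length so that every cycle yields the same number of frames. The sketch below is an assumed reading of "adaptive-length windowing" (roughly half-overlapping windows, at least two frames); the paper's exact rule is not given in the abstract.

```python
def adaptive_frames(n_samples, n_frames):
    """Return n_frames (start, end) windows that span a signal of n_samples."""
    win = 2 * n_samples // (n_frames + 1)  # window length grows with the signal
    span = n_samples - win                 # range available for window starts
    starts = [round(i * span / (n_frames - 1)) for i in range(n_frames)]
    return [(s, s + win) for s in starts]

# Two hypothetical respiratory cycles of different durations (in samples).
short_cycle = adaptive_frames(650, 64)
long_cycle = adaptive_frames(1300, 64)
```

Both cycles produce 64 frames, so a spectral feature computed per frame gives a matrix with a fixed time dimension regardless of cycle duration.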

2606.10906 2026-06-10 stat.ML cs.AI cs.LG new submission

Human-AI Teaming Through the Lens of Calibration

Eric Nalisnick, Chi Zhang, Sophia Qian, Yixin Wang

Affiliations: Department of Computer Science, Johns Hopkins University; Department of Statistics, University of Michigan

AI summary: Analyzes human-AI teaming models through the lens of statistical calibration, finding that combination methods do not preserve the human's degree of calibration, while delegation methods shift the calibration burden onto the rejector meta-model, a demand that becomes unattainable when the human relies on information the system cannot observe.

Comments: 19 pages, 5 figures (including appendix)

Abstract

We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human -- both of which are calibrated with respect to some partitioning of the feature space -- and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or model. We show via theoretical and empirical results that existing methods for combination do not preserve the human's degree of calibration. Methods for delegation (by the very act of delegation) preserve calibration of the downstream predictors but shift the burden onto the rejector meta-model that decides who predicts. The rejector must be calibrated finely enough to locate where each member is superior, a demand that grows with the human's expertise and becomes unattainable when the human relies on information the system cannot observe.
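
Calibration with respect to a partition means that, within each cell, the average predicted probability matches the observed frequency of the positive class. The sketch below computes the per-cell gap for a binary predictor; the cell labels and numbers are hypothetical and the function is not taken from the paper.

```python
from collections import defaultdict

def calibration_gaps(cells, probs, labels):
    """Per cell: |mean predicted probability - observed positive rate|."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum of probs, sum of labels, count]
    for cell, p, y in zip(cells, probs, labels):
        entry = sums[cell]
        entry[0] += p
        entry[1] += y
        entry[2] += 1
    return {cell: abs(sp / n - sy / n) for cell, (sp, sy, n) in sums.items()}

# Hypothetical predictions on a feature space split into cells "A" and "B".
cells = ["A", "A", "B", "B"]
probs = [0.5, 0.5, 0.9, 0.9]
labels = [1, 0, 1, 1]
gaps = calibration_gaps(cells, probs, labels)
```

A predictor is calibrated on this partition when every gap is zero; a finer partition makes the requirement stricter, which relates to the demand placed on the rejector described above.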

2606.10889 2026-06-10 q-bio.NC cs.LG new submission

Sleep EEG Signal Criticality as a Non-Invasive Predictor of Cognitive Decline in Dementia

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

Affiliations: Institute of Cybernetics and Human Informatics, Polish Academy of Sciences

AI summary: Quantifies sleep EEG signal criticality with multifractal detrended fluctuation analysis, finding that cognitively healthy individuals lie closer to an optimally critical state and that DFA exponents in the dementia group shift toward 1.0, which suggests that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms and could support early screening.

Comments: 4 pages, 2 figures, accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026

Abstract

Early detection of neurodegeneration remains a critical clinical challenge. This study investigates whether sleep EEG signal criticality, quantified via Multifractal Detrended Fluctuation Analysis (MFDFA), serves as a non-invasive biomarker for future cognitive decline. We analyzed longitudinal data from the National Sleep Research Resource (NSRR) Study of Osteoporotic Fractures (SOF) cohort, comparing baseline sleep EEG dynamics between women who remained cognitively normal and those who later progressed to dementia-related impairment ($\text{3MS} < 78$). Our results reveal significant group-level differences in Hurst exponent $H(q)$ distributions, particularly during non-REM stages N2 and N3. Cognitively healthy individuals exhibited signal dynamics significantly closer to an optimally critical state across all electrode locations ($p \leqslant 0.001$), supporting the Brain Criticality Hypothesis. Supervised UMAP projections confirmed clear spatial separation between groups throughout the overnight sleep architecture. The dementia group demonstrated a shift in DFA exponents toward $1.0$, suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms. These findings highlight the potential for MFDFA-derived measures to be integrated into automated, sleep-based screening tools, enabling earlier preventative interventions during the prodromal window of dementia.
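
For readers unfamiliar with the DFA exponent, the sketch below implements plain first-order detrended fluctuation analysis, which corresponds to the $q = 2$ case of MFDFA. It is a simplified illustration with assumed scales and synthetic noise, not the study's pipeline, and it omits the multifractal ($q$-dependent) generalization.

```python
import math
import random

def _linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def dfa_exponent(signal, scales):
    """Slope of log F(s) versus log s, with linear detrending per window."""
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for v in signal:                 # integrate the mean-subtracted signal
        total += v - mean
        profile.append(total)
    log_s, log_f = [], []
    for s in scales:
        t = list(range(s))
        sq = []
        for start in range(0, len(profile) - s + 1, s):
            seg = profile[start:start + s]
            a, b = _linear_fit(t, seg)   # local linear trend of this window
            sq.append(sum((y - (a * i + b)) ** 2 for i, y in zip(t, seg)) / s)
        log_s.append(math.log(s))
        log_f.append(0.5 * math.log(sum(sq) / len(sq)))  # log of RMS fluctuation
    return _linear_fit(log_s, log_f)[0]

# Synthetic uncorrelated noise; its exponent is expected to lie near 0.5.
rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(2048)]
alpha_white = dfa_exponent(white, [8, 16, 32, 64, 128])
```

Uncorrelated noise gives an exponent near 0.5 and a random walk gives one near 1.5, so a value close to 1.0, as reported for the dementia group, sits between these two reference cases.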

2606.10781 2026-06-10 eess.AS cs.CL new submission

Recovering the Zipfian Distribution in Unsupervised Term Discovery

Danel Slabbert, Simon Malan, Herman Kamper

Affiliations: Het Jan Marais Fonds

AI summary: To address the overly uniform distributions that centre-based clustering produces in unsupervised term discovery, revisits graph-based clustering, which substantially outperforms K-means and related methods on three languages and recovers a lexical distribution closer to Zipfian.

Abstract

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
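
A simple way to check how Zipf-like a discovered lexicon is: sort cluster sizes by rank and fit a line to log frequency against log rank. A slope near -1 indicates a Zipfian shape, while a slope near 0 indicates uniform cluster sizes. The sketch below is an assumed diagnostic with made-up cluster sizes; the paper's evaluation metrics may differ. It needs at least two clusters.

```python
import math
from collections import Counter

def zipf_slope(labels):
    """Least-squares slope of log(cluster size) against log(rank)."""
    counts = sorted(Counter(labels).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def labels_from_counts(counts):
    """Expand cluster sizes into one cluster label per segment."""
    return [k for k, c in enumerate(counts) for _ in range(c)]

# Hypothetical cluster sizes: proportional to 1/rank versus all equal.
zipf_like = zipf_slope(labels_from_counts([120, 60, 40, 30]))
uniform = zipf_slope(labels_from_counts([60, 60, 60, 60]))
```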

2606.10770 2026-06-10 stat.ME cs.LG new submission

Correcting Variable Importance Scored by Random Forests

Guancheng Zhou, Haiping Xu, Jason Liu, Donghui Yan

Affiliations: Computer and Information Science; Mathematics and Data Science, University of Massachusetts, Dartmouth, MA; The Rivers School, Weston, MA

AI summary: To address the effect of correlations among variables on random forest variable importance, proposes correcting it by grouping variables according to their conditional correlations; experiments show that both computationally efficient options yield sensible corrections.

Comments: 22 pages, 10 figures

Abstract

Variable importance produced by Random Forests (RF) is widely used in statistical data analysis and plays an important role in tasks such as model interpretation, model selection and diagnosis, and cost-bounded learning. However, the calculation of variable importance in RF does not take into account the correlations among variables: variables that are correlated with many other variables tend to receive a lower importance index, or to be completely masked (i.e., with an importance index near zero) by other strongly correlated variables. To prevent unwanted correlated variables from influencing the calculation of variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options: one groups variables individually and then separates the variable of interest from all correlated variables, while the other uses clustering to group variables according to their pairwise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.
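
One common way to estimate a correlation conditional on the response is the linear partial correlation: regress each variable on the response and correlate the residuals. The sketch below uses that estimator as a stand-in, since the abstract does not say how the conditional correlations are computed; the data are constructed by hand for illustration.

```python
import math

def _center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    a, b = _center(a), _center(b)
    num = sum(x * y for x, y in zip(a, b))
    return num / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def residual(x, y):
    """Residual of x after a least-squares linear fit on y."""
    xc, yc = _center(x), _center(y)
    slope = sum(p * q for p, q in zip(xc, yc)) / sum(q * q for q in yc)
    return [p - slope * q for p, q in zip(xc, yc)]

def conditional_correlation(x1, x2, y):
    """Linear partial correlation of x1 and x2 given y."""
    return pearson(residual(x1, y), residual(x2, y))

# Hypothetical data: x1 and x2 are each 3*y plus noise terms that are
# uncorrelated with y and with each other.
y = [-1, -1, 1, 1]
x1 = [-2, -4, 4, 2]
x2 = [-2, -4, 2, 4]
marginal = pearson(x1, x2)
conditional = conditional_correlation(x1, x2, y)
```

Here the two variables look strongly correlated marginally only because both track the response; once the response is conditioned on, the correlation vanishes, so a grouping based on conditional correlation would not place them together.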

2606.10738 2026-06-10 eess.AS cs.AI new submission

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao

Affiliations: Zhejiang University; Tencent Hunyuan

AI summary: Proposes Spatial-Omni, which uses an SO-Encoder to inject First-Order Ambisonics spatial audio into existing Omni LLMs, adding spatial audio understanding in a lightweight way and outperforming existing models on the newly constructed SO-Bench benchmark.

Abstract

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.
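
To show what spatial cues a First-Order Ambisonics signal carries beyond a monaural one, the sketch below encodes a single mono sample arriving from a given direction into four channels. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D gains); the convention used in the SO-Dataset is not stated in the abstract.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into (W, Y, Z, X), assuming ACN order and SN3D gains."""
    az = math.radians(azimuth_deg)   # counterclockwise from straight ahead
    el = math.radians(elevation_deg)
    w = sample                                # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    x = sample * math.cos(az) * math.cos(el)  # front-back
    return (w, y, z, x)

front = encode_foa(1.0, 0.0, 0.0)
left = encode_foa(1.0, 90.0, 0.0)
```

The W channel alone is what a monaural model sees; the direction of arrival is recoverable only from the relative gains on the X, Y, and Z channels.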

2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG new submission

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

Affiliations: Center Algoritmi / LASI, University of Minho, Braga, Portugal; Institute for Artificial Intelligence in Medicine, University Medicine Essen, Essen, Germany; Institute of Medical Informatics / Dept. of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Germany; Faculty of Computer Science, University of Duisburg-Essen, Essen, Germany

AI summary: Proposes ++nnU-Net, which augments data through image registration, generating warped images before preprocessing and training, and improves the Dice coefficient by up to about 22% across five 2D datasets.

Comments: 7 pages, 1 figure, 2 tables

Abstract

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentations. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git
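
For reference, the Dice Similarity Coefficient used to report these gains is twice the overlap of two masks divided by the sum of their sizes. The sketch below computes it for flattened binary masks; treating two empty masks as a perfect match is a convention chosen here, not something stated in the abstract.

```python
def dice(mask_a, mask_b):
    """Dice coefficient of two equal-length binary masks (flattened)."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if size == 0:
        return 1.0  # both masks empty: counted as a perfect match (assumed convention)
    return 2.0 * overlap / size

# Hypothetical four-pixel prediction and reference masks.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```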

2606.10673 2026-06-10 stat.OT cs.LG new submission

ClusBench: The Clustering Benchmark Data Resource You've All Been Waiting For (?)

David P. Hofmeyr

Affiliations: School of Mathematical Sciences, Lancaster University

AI summary: By fitting flexible non-parametric distributions, generates close to 3000 synthetic data sets from more than 200 publicly available data sets for large-scale evaluation of clustering methods while retaining the nuance of real-world data.

Abstract

Although some very common test beds exist for assessing the performance of clustering methods, large scale benchmarking is typically limited to relatively simplistic simulation set-ups. Here we describe the production and curation of close to 3000 synthetic data sets, derived from more than 200 publicly available data sets; the majority of which arose from real-world applications. By fitting a flexible non-parametric distribution to each base data set we are able to retain much of the nuance in real-world data which is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from https://github.com/DavidHofmeyr/ClusBench.
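
The idea of fitting a non-parametric distribution to a base data set and then sampling a larger synthetic one can be sketched with a smoothed bootstrap, which is equivalent to sampling from a Gaussian kernel density estimate. This is a stand-in chosen for brevity; the abstract does not specify the distribution ClusBench fits, and the bandwidth and data below are hypothetical.

```python
import random

def smoothed_bootstrap(data, n_samples, bandwidth, rng):
    """Sample from a Gaussian kernel density estimate fitted to `data`."""
    samples = []
    for _ in range(n_samples):
        base = rng.choice(data)  # pick one observed point at random
        # Jitter every coordinate with Gaussian noise of the given bandwidth.
        samples.append([v + rng.gauss(0.0, bandwidth) for v in base])
    return samples

# Hypothetical two-dimensional base data set with two visible groups.
base_data = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.2], [5.1, 4.9]]
synthetic = smoothed_bootstrap(base_data, 12, 0.05, random.Random(0))
```

The synthetic set is three times the size of the base set while keeping its two-group structure, which mirrors the stated goal of producing data sets larger than their sources.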

2606.10631 2026-06-10 econ.GN cs.CR q-fin.EC new submission

From Transactions to Records: Reconceptualizing Blockchain Systems through a Lifecycle Lens

Tom Barbereau, Ruggero Montalto, Christian Beyer

AI summary: Drawing on the records management principles of ISO 15489-1:2016, proposes a seven-stage lifecycle model for blockchain data, applies it to Bitcoin, a fungible token, and a non-fungible token, and argues that blockchain systems are not merely transactional infrastructures but record management systems with distinctive characteristics.

Abstract

Current blockchain research and analytics tend to prioritize observable on-chain transactions, obscuring the processes through which cryptocurrencies are created, publicised, retained, and disposed of. In response, this paper considers distributed ledger technologies from records management principles in ISO 15489-1:2016. Setting off by specifying the parallels -- that is transactions as "records", crypto-asset units as "information assets", and blockchains as "aggregations" -- we introduce a seven-stage lifecycle for blockchain data. We apply the framework to Bitcoin, a fungible token, and a non-fungible token. On this basis, we argue that blockchain systems are not merely transactional infrastructures but record management systems with distinctive characteristics. We discuss how the on-chain/off-chain boundary and privacy-enhancing technologies can complicate lifecycle visibility, with particular relevance for crypto-crime research and investigation. As a meta-level framework, the lifecycle perspective enables positioning existing research, decomposing legal, regulatory, technological, and operational challenges by stage, and informing lifecycle-aware approaches to blockchain governance, analytics, and regulation.

2606.10454 2026-06-10 eess.AS cs.SD new submission

Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR

Mohan Shi, Kaiyuan Zhang, Zilai Wang, Natarajan Balaji Shankar, Eray Eren, Abeer Alwan

Affiliations: University of California, Los Angeles, USA

AI summary: Proposes a Mixture-of-Experts Speech-LLM that combines classifier-based domain routing, Mixture-of-Projectors and Mixture-of-LoRAs modules, and an entropy-aware routing mechanism for unified child-adult ASR across diverse environments and age groups, with consistent improvements on public child corpora.

Comments: Accepted to Interspeech 2026

Abstract

While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
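
A possible reading of entropy-aware routing: when the domain classifier is uncertain, its output distribution has high entropy, and more weight is given to a shared expert. The sketch below implements one such rule, using normalized entropy as the shared-expert weight. The mixing rule, function names, and logits are assumptions for illustration; the abstract does not describe the exact mechanism.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(domain_logits):
    """Return (per-domain expert weights, shared-expert weight)."""
    probs = softmax(domain_logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Normalized entropy in [0, 1]: 0 = fully confident, 1 = uniform.
    shared = min(1.0, entropy / math.log(len(probs)))
    experts = [(1.0 - shared) * p for p in probs]
    return experts, shared

# Hypothetical classifier outputs over three domains.
confident_experts, confident_shared = route([20.0, 0.0, 0.0])
unsure_experts, unsure_shared = route([0.0, 0.0, 0.0])
```

A confident classification sends nearly all weight to one domain expert, while an input near a domain boundary falls back on the shared expert.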

2606.10361 2026-06-10 stat.ML cs.LG new submission

Near-Exponential Convergence Rates for kNN Classification based on Boltzmann Margin

Luyuan Yang, Shayan Shafaei, Chao Lan

Affiliations: School of Computer Science, University of Oklahoma

AI summary: Proposes the Boltzmann margin condition, which sits between the Tsybakov and Massart margins, and establishes the first near-exponential convergence rates for kNN classification.

Comments: Conference on Uncertainty in Artificial Intelligence (UAI)

Abstract

Convergence-rate analysis for classifiers is often conducted under either Tsybakov margin or Massart margin. The former is a relatively weak condition that typically yields polynomial rates, while the latter is substantially stronger but can guarantee exponential rates. In this paper, we introduce a new condition, called Boltzmann margin, that bridges the gap between these two regimes. It is weaker than Massart margin, generally stronger than Tsybakov margin, and can imply many of their properties under suitable conditions. We apply Boltzmann margin to the analysis of kNN classifiers and establish the first near-exponential convergence rates for kNN classification. We also present extensions of the main results and provide numerical evidence supporting the main theoretical implications.
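
For orientation, the two classical conditions named above are usually stated as follows for binary classification. These are the standard textbook forms; the paper's notation may differ, and the definition of the Boltzmann margin itself is not given in the abstract, so it is not reproduced here.

```latex
% \eta(x) = P(Y = 1 \mid X = x) is the regression function.
% Tsybakov margin (exponent \alpha \ge 0, constants C > 0 and t_0 > 0):
\[
  P_X\bigl( \lvert \eta(X) - \tfrac{1}{2} \rvert \le t \bigr) \le C\, t^{\alpha}
  \quad \text{for all } t \in (0, t_0].
\]
% Massart margin (margin h \in (0, 1/2]):
\[
  \lvert \eta(X) - \tfrac{1}{2} \rvert \ge h \quad \text{almost surely.}
\]
```

The Tsybakov condition only limits how much probability mass lies near the decision boundary, whereas the Massart condition keeps $\eta$ a fixed distance away from $1/2$ everywhere, which is why it is the stronger of the two.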

2606.10317 2026-06-10 eess.AS cs.SD new submission

SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space

Tomoya Tanabu, Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu

Affiliations: The University of Tokyo, Japan

AI summary: Proposes SSL-GMMVC, which models source-target features in self-supervised speech space with a Gaussian mixture model and performs interpretable voice conversion through posterior-weighted affine transforms, improving speaker similarity while keeping intelligibility and naturalness comparable.

Comments: Accepted to Interspeech 2026

Abstract

We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.
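
The phrase "posterior-weighted sum of affine transforms" can be illustrated in one dimension: each mixture component owns an affine map, and the converted value blends those maps by the posterior probability of each component given the input. The sketch below uses hypothetical scalar components; the paper works with high-dimensional self-supervised features and learned parameters.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def convert(x, components):
    """components: list of (weight, mean, std, slope, offset) tuples."""
    # Unnormalized posterior of each component given the source feature x.
    likes = [w * gaussian_pdf(x, m, s) for w, m, s, _, _ in components]
    total = sum(likes)
    # Blend each component's affine map by its posterior probability.
    return sum((like / total) * (a * x + b)
               for like, (_, _, _, a, b) in zip(likes, components))

# Hypothetical two-component model with well-separated source regions.
components = [
    (0.5, -5.0, 1.0, 2.0, 0.0),   # region around -5 uses y = 2x
    (0.5, 5.0, 1.0, -1.0, 1.0),   # region around +5 uses y = -x + 1
]
converted = convert(-5.0, components)
```

Near the centre of a component the output follows that component's affine map almost exactly, which is what makes the transformation locally linear and open to inspection.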

2606.10280 2026-06-10 eess.IV cs.CV new submission

Overlapped Wavelet Diffusion for Low-Light Image Enhancement

Fen Peng, Taizo Suzuki, Seisuke Kyochi

AI summary: Proposes OWDiff, an overlapped wavelet diffusion framework that removes blocking artifacts with an overlapped wavelet transform and adds a low-frequency-guided high-frequency enhancement block to recover detail, outperforming existing methods on the LOLv1 and LOLv2-real datasets.

Comments: Advance published in IEICE Transactions on Information and Systems. DOI: 10.1587/transinf.2026PCP0006. Code: https://github.com/FinnPeg/Overlapped-Wavelet-Diffusion

Journal ref: IEICE Transactions on Information and Systems, Advance online publication, 2026

Abstract

In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.

2606.10238 2026-06-10 q-bio.NC cs.AI 新提交

Hyperbolic Neural Population Geometry Benefits Computation

双曲神经群体几何结构有益于计算

Dennis Wu, Yi-Chun Hung, Braden Yuille, James E. Fitzgerald, Han Liu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出海马体群体活动诱导双曲几何的理论框架,证明现代Hopfield网络更新规则计算最小均方误差估计,并引入双曲空间中的新联想记忆模型,其容量显著优于现有模型。

Comments Accepted at ICML 2026, 37 pages, 5 figures

详情
AI中文摘要

神经群体几何结构影响下游计算。最近神经生物学的实验发现表明,海马体中的群体活动具有双曲结构。本文为这一现象提供了理论框架。首先,我们提出了一种海马体调谐曲线的合理构造,该构造在统计上诱导双曲几何。接着,我们通过证明现代Hopfield网络更新规则计算最小均方误差(MMSE)估计,建立了神经解码与联想记忆之间的联系。最后,我们引入了一个在双曲空间中定义的新型联想记忆模型,其容量显著大于领先模型。我们的结果表明,动物将空间信息编码为潜在的双曲认知地图,从而提高了记忆容量和解码精度。

英文摘要

Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.
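The link between the Modern Hopfield Network update and MMSE estimation can be checked numerically in a simple setting. The sketch below assumes a uniform prior over stored patterns of equal norm and isotropic Gaussian observation noise with variance `sigma2`; under those assumptions the posterior mean equals the softmax-weighted update with inverse temperature `beta = 1 / sigma2`. The paper's exact assumptions may differ, and all function names here are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean(weights, patterns):
    """Weighted average of pattern vectors."""
    dim = len(patterns[0])
    return [sum(w * p[d] for w, p in zip(weights, patterns)) for d in range(dim)]

def hopfield_update(patterns, query, beta):
    """One Modern Hopfield step: softmax over scaled dot products, then average."""
    scores = [beta * sum(pd * qd for pd, qd in zip(p, query)) for p in patterns]
    return weighted_mean(softmax(scores), patterns)

def posterior_mean(patterns, observation, sigma2):
    """MMSE estimate under a uniform prior over patterns and Gaussian noise."""
    log_post = [
        -sum((od - pd) ** 2 for od, pd in zip(observation, p)) / (2.0 * sigma2)
        for p in patterns
    ]
    return weighted_mean(softmax(log_post), patterns)

if __name__ == "__main__":
    patterns = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # equal-norm stored patterns
    noisy = [0.9, 0.2]
    print(hopfield_update(patterns, noisy, beta=4.0))
    print(posterior_mean(patterns, noisy, sigma2=0.25))
```

The two printed vectors agree because, for equal-norm patterns, the squared-distance terms differ from the scaled dot products only by constants that cancel inside the softmax.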

2606.10233 2026-06-10 eess.AS cs.LG cs.SD 新提交

ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling

ANCHOR: 自回归非侵入式分块有序细化用于联合多分辨率语音质量建模

Zhuoyan Tao, Jiatong Shi, Hye-jin Shim, Shinji Watanabe

发表机构 * University of Southern California, USA(美国南加州大学) Carnegie Mellon University, USA(美国卡内基梅隆大学)

AI总结 提出ANCHOR模型,将增量语音质量评估重构为多分辨率自回归任务,通过双分辨率令牌和分辨率感知层次实现分块到整句的粗到细细化,在部分输入下显著降低误差,并揭示感知质量的时域积累机制。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

虽然语音质量通常是在完整话语上评估的,但流式和生成系统需要从部分音频中进行增量估计。现有的预测器假设完整的上下文,在受前缀约束的输入上性能下降。扩展ARECHO,我们提出ANCHOR,将增量评估重新表述为多分辨率自回归任务。它使用双分辨率令牌和分辨率感知层次结构在单个解码器中建模分块级和话语级质量,实现从粗到细的细化。实验表明,在部分输入下具有显著的鲁棒性,包括在2秒前缀上PLCMOS误差减少48%。收敛性分析揭示了4-6秒的有效感知上下文范围。压力测试进一步隔离了局部损坏下的结构化外推偏差。结果表明,层次监督改进了增量预测,并阐明了感知质量如何随时间累积。

英文摘要

While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.

2606.10187 2026-06-10 stat.ML cs.LG 新提交

Decision-Calibrated Conformal Uncertainty for Pacing Decisions in Streaming Advertising

面向流式广告中节奏控制的决策校准共形不确定性

Prashant Shekhar, Caroline Howard

发表机构 * Department of Mathematics, Embry-Riddle Aeronautical University(数学系,埃姆伯里-瑞德航空大学)

AI总结 提出一种决策校准共形框架,通过衡量预测误差对实际部署策略的最大影响来校准不确定性,理论证明该分数是保护所有可部署节奏控制策略的最小有效不确定性度量,并在公开数据集上显著降低不确定性半径。

详情
AI中文摘要

我们开发了一个决策校准的共形框架,用于流式广告中的节奏控制决策。节奏控制依赖于不确定的未来库存、需求压力、增量响应和会员体验负载。该框架不是校准通用的预测残差,而是通过预测误差对实际可能部署的策略的最大影响来衡量预测误差。主要定理表明,所提出的分数是统一保护所有可部署节奏控制策略的最小有效不确定性度量。几何上,它是有符号策略敏感性集的支持函数。分裂共形校准为该分数提供了有限样本覆盖。一个高维分离定理表明,传统的残差校准可能因支付干扰库存维度而任意保守,而一个鲁棒的节奏控制结果结合了库存、响应和体验不确定性。在基于Criteo Uplift和KuaiRand数据集构建的公开数据校准节奏控制回放中,传统共形节奏控制仍然未解决,在Criteo上残差半径高达7236.7,在KuaiRand上为4629.4。采用所提出的决策校准方法,不确定性半径分别降至18.4和278.6,并为价值、交付、预算和会员负载设置了单独的边际。在Criteo上,所提出的方法证明了比点预测基线更不激进的节奏控制策略,并将保留的任何违规率从16.7%降至3.3%,且预算和会员负载违规为零。在KuaiRand上,选择仍未解决。简而言之,本文确立了预测、响应估计和会员体验模型应根据它们是否缩小节奏控制决策使用的不确定性来判断,因为这会导致自信且不过度保守的决策。

英文摘要

We develop a decision-calibrated conformal framework for pacing decisions in streaming advertising. Pacing depends on uncertain future inventory, demand pressure, incremental response, and member-experience load. Instead of calibrating a generic forecast residual, the framework measures forecast error by its largest impact on the policies that could actually be deployed. The main theorem shows that the proposed score is the smallest valid uncertainty measure that uniformly protects all deployable pacing policies. Geometrically, it is the support function of the signed policy sensitivity set. Split conformal calibration gives finite-sample coverage for this score. A high-dimensional separation theorem shows that traditional residual calibration can be arbitrarily more conservative by paying for nuisance inventory dimensions, and a robust pacing result combines inventory, response, and experience uncertainty. On public-data-calibrated pacing replays built from Criteo Uplift and KuaiRand datasets, traditional conformal pacing remains unresolved with high residual radii of 7236.7 on Criteo and 4629.4 on KuaiRand. With the proposed decision calibration approach, the uncertainty radii are reduced to 18.4 and 278.6 respectively, with separate margins for value, delivery, budget, and member load. On Criteo, the proposed method certifies a less aggressive pacing policy than the point-forecast baseline, and reduces held-out any-violation rate from 16.7% to 3.3%, with zero budget and member-load violations. On KuaiRand, the choice remains unresolved. In a nutshell, the paper establishes that forecasts, response estimates, and member-experience models should be judged by whether they shrink the uncertainty that the pacing decision uses, as this leads to confident decisions that are not overly conservative.
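The two ingredients named in the abstract, a support-function score and split conformal calibration, can be sketched as follows. This is an illustrative sketch under stated assumptions: the sensitivity set is taken to be a small finite list of vectors, and `support_score`, `conformal_radius`, and the example numbers are hypothetical rather than the paper's definitions.

```python
import math

def support_score(error, sensitivities):
    """Support function of a finite sensitivity set, evaluated at the forecast error:
    the largest inner product between the error and any sensitivity vector."""
    return max(sum(s * e for s, e in zip(vec, error)) for vec in sensitivities)

def conformal_radius(calibration_scores, alpha):
    """Split conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.
    Returns infinity when the calibration set is too small for the requested level."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]

if __name__ == "__main__":
    # Assumed example: deployable policies react only to the second dimension.
    sensitivities = [[0.0, 1.0], [0.0, -1.0]]
    error = [5.0, 0.1]
    print("decision score:", support_score(error, sensitivities))
    print("residual norm :", math.hypot(*error))
    print("radius        :", conformal_radius([0.3, 0.1, 0.7, 0.2, 0.5, 0.4, 0.6], 0.25))
```

In the assumed example the residual norm is dominated by a dimension no policy reacts to, while the decision score ignores it. That is the intuition behind the abstract's claim that residual calibration can be far more conservative than decision calibration.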

2606.10125 2026-06-10 stat.ML cs.DB cs.LG 新提交

Robust Active Learning for Few-Shot Example Selection in Text-to-SQL

鲁棒主动学习用于文本到SQL中的少样本示例选择

Arash Pourhabib

发表机构 * NVIDIA

AI总结 针对文本到SQL中少样本示例选择,提出一种鲁棒主动学习方法,通过分层贪婪算法最大化异方差互信息目标,在嵌入流形上实现常数因子近似保证,显著减少标注成本。

Comments 31 pages, 4 figures, 5 tables

详情
AI中文摘要

少样本示例检索是将大型语言模型(LLM)应用于特定领域文本到SQL系统的主要范式。然而,标注示例库的质量直接决定系统准确性,且专家标注成本高昂。我们将这些示例的主动选择形式化为一个在语义查询嵌入的内在低维流形上的约束实验设计问题。与标准主动学习框架不同,我们的设置引入了三个关键挑战:依赖于查询的可变标注可靠性(异方差性)、跨语义主题的空间多样性严格要求(划分拟阵约束),以及嵌入空间真实协方差结构未知的固有现实(模型误设)。为了解决这些问题,我们提出了一种分层贪婪算法,该算法最大化异方差互信息目标。我们证明该目标在内在流形上保持子模性和近似单调性,从而得到理论上的常数因子近似保证。我们建立了一个谱界,表明当假设的替代核与真实数据生成过程存在偏差时,该近似保证会优雅地退化,而非灾难性地崩溃。实验结果表明,所提出的策略显著减少了标注工作量,同时保持了较高的文本到SQL检索准确性。

英文摘要

Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning frameworks, our setting introduces three critical challenges: varying, query-dependent annotation reliability (heteroscedasticity), strict requirements for spatial diversity across semantic topics (partition matroid constraints), and the inherent reality that the true covariance structure of the embedding space is unknown (misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, yielding a theoretical constant-factor approximation guarantee. We establish a spectral bound demonstrating that this approximation guarantee degrades gracefully, rather than catastrophically, when the assumed surrogate kernel diverges from the true underlying data-generating process. Empirical results demonstrate that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.
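A stratified greedy selection of this kind can be sketched over a finite candidate pool. The sketch below makes several assumptions that are not confirmed by the abstract: it uses the Gaussian-process information gain `0.5 * log(1 + variance / noise)` as the mutual-information objective, represents heteroscedasticity as a per-candidate noise variance, and encodes the partition matroid as a per-topic quota. All names (`stratified_greedy`, `kernel`, `noise`, `strata`, `quota`) are hypothetical.

```python
import math

def stratified_greedy(kernel, noise, strata, quota, budget):
    """Greedily pick candidates with the largest information gain,
    never exceeding the quota of any stratum (partition matroid)."""
    n = len(kernel)
    cov = [row[:] for row in kernel]  # posterior covariance, updated after each pick
    used = {}
    selected = []
    for _ in range(budget):
        best, best_gain = None, -math.inf
        for j in range(n):
            if j in selected or used.get(strata[j], 0) >= quota[strata[j]]:
                continue
            # Marginal gain of a noisy label at j given the labels chosen so far.
            gain = 0.5 * math.log(1.0 + cov[j][j] / noise[j])
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break  # every stratum is full or the pool is exhausted
        # Rank-one posterior update for a noisy observation at `best`.
        col = [cov[i][best] for i in range(n)]
        denom = cov[best][best] + noise[best]
        for i in range(n):
            for k in range(n):
                cov[i][k] -= col[i] * col[k] / denom
        selected.append(best)
        used[strata[best]] = used.get(strata[best], 0) + 1
    return selected

if __name__ == "__main__":
    kernel = [
        [1.0, 0.9, 0.0, 0.0],
        [0.9, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.9],
        [0.0, 0.0, 0.9, 1.0],
    ]
    noise = [0.1, 0.5, 0.5, 0.1]      # lower value = more reliable annotation
    strata = ["a", "a", "b", "b"]     # semantic topic of each candidate
    print(stratified_greedy(kernel, noise, strata, {"a": 1, "b": 1}, budget=3))
```

With one slot per topic, the sketch picks the most reliable candidate in each topic and stops once both quotas are filled, even though the budget would allow a third label.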

2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD 新提交

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

DeRA-MOS:通过解耦列表排序和模态对齐优化文本到音乐评估

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

发表机构 * E.SUN Financial Holding Co., Ltd.(玉山金融控股股份有限公司) United Link Co., Ltd.(联合链接有限公司) Institute of Information Science, Academia Sinica(中央研究院资讯科学研究所) Department of Computer Science and Information Engineering, National Taiwan Normal University(台湾师范大学计算机科学与信息工程系)

AI总结 提出DeRA-MOS解耦优化框架,通过批感知列表排序损失和分数锚定模态对齐损失,分别优化音乐印象和文本对齐的排名指标,在MusicEval上显著提升评估性能。

Comments Accepted to IEEE Signal Processing Letters (SPL)

详情
AI中文摘要

评估文本到音乐(TTM)系统仍然昂贵,因为音乐印象(MI)和文本对齐(TA)分数依赖于人类平均意见分数(MOS)。大多数自动MOS估计器采用逐点回归或分布分类训练。这些目标不直接优化基于排名的指标,并且为跨模态一致性提供较弱的几何约束。为了解决这些问题,我们提出了DeRA-MOS,一种用于TTM评估的解耦优化框架。对于MI,我们引入了一种批感知列表排序损失,该损失对每个小批量内的相对顺序进行建模,并更好地与基于Spearman秩相关系数(SRCC)的评估对齐。对于TA,我们引入了一种分数锚定的模态对齐损失,将人类分数映射到目标音频-文本相似度,并在融合前正则化潜在空间。通过有效缓解逐点训练不匹配和模态漂移,MusicEval上的实验表明,我们的解耦框架在MI和TA排名指标上均取得了显著改进,为大规模TTM评估建立了稳健的范式。

英文摘要

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
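To make the contrast with point-wise regression concrete, here is a sketch of a listwise loss computed over one mini-batch. The abstract does not give the exact form of the batch-aware loss, so the sketch substitutes the well-known ListNet top-one cross-entropy as a stand-in; `listwise_loss` and the example scores are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(predicted, human_scores):
    """ListNet-style loss: cross-entropy between the softmax of the human scores
    and the softmax of the predictions, computed over the whole mini-batch."""
    target = softmax(human_scores)
    model = softmax(predicted)
    return -sum(t * math.log(m) for t, m in zip(target, model))

if __name__ == "__main__":
    mos = [2.0, 3.5, 4.5]  # assumed human MOS labels for three clips
    print(listwise_loss([2.1, 3.4, 4.6], mos))  # order preserved
    print(listwise_loss([4.6, 3.4, 2.1], mos))  # order reversed
```

The loss depends on how the items in the batch relate to one another, so a prediction that reverses the order is penalised more heavily than one that preserves it. This is closer to what a rank correlation such as SRCC rewards than a per-item squared error.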

2606.09953 2026-06-10 eess.IV cs.AI cs.LG 新提交

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

深度切片插值用于减少头部CT的穿平面各向异性和噪声

Luis Cortés Ferre, Miguel A. Gutiérrez-Naranjo, Marcin Balcerzyk

发表机构 * Department of Computer Science and Artificial Intelligence, University of Seville(塞维利亚大学计算机科学与人工智能系) Bioaraba Health Research Institute(Bioaraba健康研究所) IKERBASQUE, Basque Foundation of Science(巴斯克科学基金会)

AI总结 提出一种深度学习系统,通过相邻轴向切片对合成中间CT切片,将有效穿平面间距减半,同时实现隐式降噪,在结构指标上优于经典插值和视频帧插值方法。

详情
AI中文摘要

头部计算机断层扫描(CT)通常使用亚毫米级的面内分辨率,但穿平面间距为2-5毫米,造成显著的各向异性,这会降低多平面重建、血肿体积估计等体积测量以及假设近似各向同性体素的后续算法的性能。我们提出一个深度学习系统,从相邻轴向切片对合成中间CT切片,将有效穿平面间距减半。该系统改善三维可视化,同时产生固有降噪的输出,在一次推理中实现两个互补优势。为构建可靠系统,我们系统评估像素级损失(均方误差MSE和平均绝对误差L1)、结构相似性损失(结构相似性指数SSIM及其多尺度变体MS-SSIM)以及混合组合。在保留测试集上,所有收敛模型在所有结构指标上均优于经典插值基线和预训练视频帧插值方法(RIFE、FILM),其中MS-SSIM+L1提供最强平衡性能。我们还记录了SSIM族损失中的训练不稳定性并识别部分补救措施:标准数值修复消除了主要失败模式,但在较小批量大小下留下残余发散。所有结果均报告患者级自助法置信区间和配对统计检验。作为示例,我们将系统应用于来自Virgen del Rocío大学医院的分布外头部CT序列:模型合成中间切片,并在真实切片上表现出我们理论分析预测的隐式降噪特征,这在单个外部病例中支持了插值质量和隐式降噪并不局限于训练分布。

英文摘要

Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.

2606.09944 2026-06-10 econ.GN cs.AI q-fin.EC 新提交

GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring

GAGI:一种用于分布感知宏观经济福利监测的基尼调整人均GDP指数

Sivasathivel Kandasamy

发表机构 * Independent Researcher(独立研究者)

AI总结 提出GAGI指数,通过基尼系数和价格水平调整人均GDP,以监测福利分配效应,应用于G7国家发现福利增长与GDP增长持续偏离。

详情
AI中文摘要

人均GDP是政府机构追踪经济繁荣和经济事件后果的默认视角,但它忽视了生活繁荣的两个首要决定因素:收入/财富分配和通胀影响。不平等调整的收入衡量指标本身并不新鲜,但宏观经济监测工具包中具体缺失的不是福利概念,而是一个可操作的监测触发指标:一个足够简洁、可每年从公开数据计算、无需建模假设即可审计、且标准化以便于理解年度间和国家间变化(监管机构需要据此采取行动)的统计量。我们构建了这样一个工具,即基尼调整人均GDP指数(GAGI):一种可复现、可公开计算的公式,通过不平等调整因子(1-G)和价格水平重新调整各国人均GDP,并以2010年为基准标准化。GAGI是一个通用福利指数,并非特定于AI自动化,适用于任何需要追踪福利调整后繁荣的场景。将GAGI应用于2010-2026年的G7经济体,我们发现福利调整后的繁荣与总体GDP增长持续且日益偏离,这种偏离在2022年后急剧扩大,时间上与COVID后遗症和生成式AI部署加速相吻合,尽管仅凭此证据尚不能证明因果关系。我们认为GAGI是基于GDP监测的必要补充:任何仅追踪总产出的宏观经济监测工具都会系统性地忽略自动化可能造成的分配损害,即使报告的增长依然强劲。

英文摘要

GDP per capita is the default lens through which governing bodies track economic prosperity and the consequences of economic events, yet it is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact. Inequality-adjusted income measures are themselves not new, but what is missing from the macroeconomic monitoring toolkit specifically is not a welfare concept but an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change (the quantity a regulator needs to act on) is legible. We assemble such an instrument, the Gini-Adjusted GDP per Capita Index (GAGI): a reproducible, publicly computable formulation that rescales each country's GDP per capita by its inequality-adjustment factor (1-G) and its price level, normalised to a 2010 baseline. GAGI is a general-purpose welfare index, not inherently specific to AI automation, applicable wherever welfare-adjusted prosperity needs tracking. Applying GAGI to the G7 economies over 2010-2026, we show that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth, and that the divergence widens sharply after 2022, temporally coincident with (though not, on this evidence alone, demonstrated to be caused by) the after-effects of COVID and the acceleration of generative-AI deployment. We argue that GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.
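The verbal definition of the index can be turned into a short calculation. The abstract does not spell out the exact formula, so the sketch below assumes that the price-level adjustment is a division by a price index and that the baseline year is scaled to 100; the function names and every number in the example are hypothetical, not data from the paper.

```python
def adjusted_income(gdp_per_capita, gini, price_level):
    """GDP per capita scaled by the inequality factor (1 - G) and deflated by the price level."""
    return gdp_per_capita * (1.0 - gini) / price_level

def gagi(year_values, baseline_values):
    """Index of adjusted income relative to the baseline year (assumed scale: baseline = 100)."""
    return 100.0 * adjusted_income(**year_values) / adjusted_income(**baseline_values)

if __name__ == "__main__":
    # Made-up inputs for a single country, for illustration only.
    baseline_2010 = {"gdp_per_capita": 40000.0, "gini": 0.30, "price_level": 1.00}
    later_year = {"gdp_per_capita": 50000.0, "gini": 0.35, "price_level": 1.25}
    print("headline GDP index:", 100.0 * 50000.0 / 40000.0)
    print("GAGI              :", gagi(later_year, baseline_2010))
```

In this made-up case headline GDP per capita rises by a quarter while the adjusted index falls below its baseline, which illustrates the kind of divergence the abstract reports for the G7.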

2606.09941 2026-06-10 stat.AP cs.LG stat.OT 新提交

Stochastic weather generators for high-frequency wind vector time series

高频风矢量时间序列的随机天气生成器

Mingshi Cui, Kevin Eng, Justin T. Greene, Zern Ke, Abolfazl Sodagartojgi, Zhiqiu Xia, Gemma E. Moran, Michael L. Stein

发表机构 * Department of Statistics, Rutgers University(统计学系,罗格斯大学)

AI总结 针对分钟级风矢量时间序列,开发基于时间矢量量化变分自编码器的机器学习模型,生成逼真序列,捕捉昼夜变化但极端风速分布匹配不足。

详情
AI中文摘要

地表风速在分钟尺度上变化显著,因此有必要研究其在此精细时间尺度上的变化。为最小化季节性影响,本文限定于六月,基于俄克拉荷马州拉蒙特站点超过30年的分钟级高质量测量数据,开发了一系列用于生成真实地表风矢量时间序列的机器学习模型。此类生成器可作为多种学科模型的输入,特别是风能领域,同时也适用于野火蔓延和航空等。数据显示风速和风向均存在复杂的昼夜结构,标准时间序列模型难以捕捉,因此我们考虑多种机器学习方法,基于时间矢量量化变分自编码器构建随机风生成器。我们考虑一次生成一天的数据,以及基于前一天风况生成一天的风矢量。我们还研究了在生成器中纳入离散天气状态变量的方法。我们使用多种正式和非正式方法评估生成器。其中最佳生成器能够捕捉观测数据中的许多(但非全部)复杂特征。特别地,我们的最佳方法准确模拟了风波动性的昼夜变化,但在匹配观测到的极端风速分布方面存在困难。

英文摘要

Surface winds can vary substantially from one minute to the next, so there is scope for studying their variation on this fine time scale. Restricting to the month of June to minimize seasonality, this work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high quality measurements at the minute time scale. Such a generator could be used as an input into models from a range of disciplines, notably for wind energy, but also wildfire spread and aviation, among others. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day's worth of data at a time and generating a day of wind vectors conditional on the previous day's winds. We also study methods for incorporating a discrete weather state variable in the generator. We evaluate the generators using a wide range of formal and informal methods. The best of these generators can capture many but not all of the complex features present in the observational data. In particular, the best of our approaches accurately mimic diurnal changes in wind volatility but struggle to match the observed distribution of extreme wind speeds.

2606.09893 2026-06-10 eess.IV cs.AI cs.LG 新提交

Tractogram foundation model

TractFM:纤维束图基础模型

Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

发表机构 * The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou(脑机智能国家重点实验室,浙江大学,杭州) Department of Radiology, Brigham and Women’s Hospital, Mass General Brigham, Boston(放射科,布莱根妇女医院,麻省总医院布莱根医疗系统,波士顿) Harvard Medical School, Boston(哈佛医学院,波士顿) Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin(医学工程与转化医学研究院,天津大学,天津) School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu(信息与通信工程学院,电子科技大学,成都) Psychiatry Neuroimaging Laboratory, Brigham and Women’s Hospital, Mass General Brigham, Boston(精神科神经影像实验室,布莱根妇女医院,麻省总医院布莱根医疗系统,波士顿) Department of Psychiatry, Center for Morphometric Analysis, Massachusetts General Hospital, Boston(精神科,形态分析中心,麻省总医院,波士顿)

AI总结 提出TractFM基础模型,直接从全脑纤维束集学习可复用表示,结合局部纤维编码器和置换等变纤维束编码器,通过密集解剖束分割预训练,实现纤维束级和受试者级任务的迁移。

详情
AI中文摘要

扩散MRI(dMRI)纤维束成像是在活体人脑中绘制白质通路的唯一非侵入性方法。它将每个大脑表示为一个纤维束图:一个大型、无序的三维流线集合,包含局部流线几何和全脑解剖组织的信息。这种结构使纤维束图成为表示学习的自然但具有挑战性的目标。现有方法将流线分类和受试者级预测视为独立问题:流线分类器关注几何模式,而受试者级预测通常依赖于手工特征。因此,当前方法无法学习连接流线解剖与全脑受试者间变异的可复用表示。本文介绍TractFM,一个纤维束图基础模型,直接从全脑纤维束集学习可复用表示。TractFM结合了局部流线编码器和置换等变纤维束编码器,使得一个受试者的所有流线能够在单次前向传递中共同上下文化。在密集解剖束分割(即给单个流线分配解剖标签)上的预训练产生了两种互补表示:用于束分割的上下文化流线级嵌入和用于下游受试者表型预测的紧凑受试者级描述符。在三种纤维束成像算法和五个dMRI数据集上,TractFM迁移到流线级和受试者级任务。其冻结表示实现了准确的束分割,并在独立数据集上预测年龄和性别。这些结果表明,全脑几何上下文(一次性学习)可以泛化到纤维束成像流程、数据集和预测任务中。

英文摘要

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.
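Permutation equivariance, the property attributed to the tractogram encoder, means that reordering the input streamlines reorders the outputs in the same way and changes nothing else. The sketch below shows the property on a generic DeepSets-style layer in which each item is combined with the mean over the whole set. It is not TractFM's architecture, which the abstract does not specify, and `equivariant_layer` and its weights are hypothetical.

```python
def equivariant_layer(features, self_weight, context_weight):
    """Mix each item's features with the set-wide mean.
    The mean ignores ordering, so permuting the inputs permutes the outputs identically."""
    count = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / count for d in range(dim)]
    return [
        [self_weight * f[d] + context_weight * mean[d] for d in range(dim)]
        for f in features
    ]

if __name__ == "__main__":
    # Three toy "streamline embeddings" (assumed values).
    streamlines = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    order = [2, 0, 1]
    out = equivariant_layer(streamlines, 0.5, 2.0)
    out_shuffled = equivariant_layer([streamlines[i] for i in order], 0.5, 2.0)
    print(out_shuffled == [out[i] for i in order])
```

Since a tractogram is an unordered set, a layer with this property gives every streamline access to whole-set context without making the result depend on the arbitrary order in which streamlines are stored.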

2606.11189 2026-06-10 cs.LG cs.AI cs.CL 新提交

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

通过目标分布设计审视监督微调的统一视角

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

发表机构 * University of California, Los Angeles (UCLA)(加州大学洛杉矶分校) Arena

AI总结 本文重新解读监督微调为目标分布设计,提出Q-target框架,将监督分解为对观测token的依赖强度与替代token的概率分配,并基于此提出Target-SFT方法,在多个推理任务中优于现有方法。

详情
AI中文摘要

监督微调(SFT)通常最大化示范轨迹中每个token的似然。然而,观测到的token可能非唯一、有噪声或与模型先验不一致。严格拟合这种one-hot目标可能不是最优的,尤其是当预训练模型编码了丰富的知识先验时。在这项工作中,我们将SFT重新解释为目标分布设计:不仅研究损失目标,还分析损失驱动模型匹配的token级目标。我们引入Q-target框架,将SFT监督分解为两个明确的选择:(1) 对观测token的依赖强度,以及(2) 如何将剩余概率质量分配给替代token。这一视角将许多现有的SFT变体统一为目标分布Q的隐式选择。基于这一观点,我们提出Target-SFT,直接从期望的目标分布构建训练目标。该方法在十个推理数据集-模型设置中一致优于现有方法,展示了这种基于目标的方法的有效性。总体而言,我们的公式揭示了SFT训练更基本的设计原则,并为SFT目标开辟了更广阔的搜索空间。

英文摘要

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT which constructs the training objective directly from the desired target distribution. This method consistently outperforms across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
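The two design choices in the Q-target view can be illustrated with a small sketch. It assumes a three-token vocabulary and a target of the form "mass `alpha` on the observed token, the remainder spread over alternatives in proportion to given weights". The names `q_target` and `cross_entropy` and all numbers are illustrative; this is not the paper's Target-SFT objective.

```python
import math

def q_target(observed, alpha, alternative_weights):
    """Build a token-level target: `alpha` on the observed token, and the remaining
    1 - alpha spread over the other tokens in proportion to their (positive) weights."""
    others = {t: w for t, w in alternative_weights.items() if t != observed}
    total = sum(others.values())
    target = {t: (1.0 - alpha) * w / total for t, w in others.items()}
    target[observed] = alpha
    return target

def cross_entropy(target, model_probs):
    """Cross-entropy between the target distribution and the model's prediction."""
    return -sum(q * math.log(model_probs[t]) for t, q in target.items() if q > 0.0)

if __name__ == "__main__":
    model_probs = {"a": 0.5, "b": 0.3, "c": 0.2}  # assumed model prediction
    uniform = {"a": 1.0, "b": 1.0, "c": 1.0}

    one_hot = q_target("a", 1.0, uniform)          # standard SFT target
    smoothed = q_target("a", 0.9, uniform)         # label smoothing
    prior_led = q_target("a", 0.9, model_probs)    # remainder follows the model prior

    for name, q in [("one-hot", one_hot), ("smoothed", smoothed), ("prior", prior_led)]:
        print(name, q, cross_entropy(q, model_probs))
```

Setting `alpha` to 1 recovers the usual one-hot likelihood objective, uniform alternative weights give label smoothing, and using the model's own probabilities as weights is one way to keep the leftover mass aligned with the pretrained prior.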

2606.11188 2026-06-10 cs.CV 新提交

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

ARM: 一种具有统一离散表示的自回归大型多模态模型

Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang

发表机构 * Shanghai Key Lab of Intelligent Information Processing, Fudan University(复旦大学上海智能信息处理重点实验室) School of Computer Science, Fudan University(复旦大学计算机科学技术学院) Shanghai Collaborative Innovation Center of Intelligent Visual Computing(上海智能视觉计算协同创新中心) Youtu Lab, Tencent(腾讯优图实验室) Meta AI Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出ARM模型,通过离散语义视觉分词器将图像映射为紧凑token序列,结合自回归建模和强化学习,统一实现图像理解、生成和编辑,并提升任务性能与跨任务协同。

Comments technical report

详情
AI中文摘要

本文介绍了ARM,一种基于离散表示的自回归模型,它在下一个token预测框架内统一了图像理解、生成和编辑。ARM建立在三项工作之上:首先,我们训练了一个离散语义视觉分词器,将图像映射为紧凑的token序列。我们的分词器通过多个目标进行监督,这些目标共同促进语义可辨别性、语言对齐和忠实重建,从而在共享潜在空间中支持多样化的任务。在此基础上,我们在大规模文本和图像token序列上训练了一个7B自回归模型,无缝地发展了视觉-语言感知和生成能力。最后,为了进一步改善文本到图像生成和指令引导编辑的偏好对齐行为,ARM应用强化学习(RL)来优化任务级目标,如视觉质量、指令遵循和编辑一致性。令人惊讶的是,结果表明RL不仅显著提高了目标任务的性能(例如,将WISE总体从0.50提升到0.56,GEdit-Bench-EN G_O从5.75提升到6.68),而且还诱导了文本到图像生成和编辑之间的跨任务协同。总的来说,这些发现表明,自回归建模在与强大的表示和偏好优化相结合时,可作为多模态智能的可扩展基础。代码:https://github.com/wdrink/ARM。

英文摘要

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.

2606.11187 2026-06-10 cs.CV 新提交

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Next Forcing: 基于多块预测的因果世界建模

Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu

发表机构 * Robbyant HUST(华中科技大学) HKUST(香港科技大学) HKUST (GZ)(香港科技大学(广州))

AI总结 提出Next Forcing框架,通过多块预测训练目标加速视频生成模型收敛、提升精度并实现推理加速,在多个基准上取得最优结果。

Comments Project page: https://gangweix.github.io/next-forcing/

详情
AI中文摘要

自回归视频生成已成为世界动作模型(WAMs)的强大范式。然而,现有方法存在训练收敛慢和收敛精度有限的问题,尤其是在高帧率下,因为训练监督仅限于当前块,缺乏关于未来动态的明确信号;此外,由于迭代视频去噪,推理速度也较慢。在本文中,我们提出Next Forcing,一种用于因果世界建模的多块预测(MCP)框架,可实现更快的训练、更高的精度和加速推理。受大语言模型中多token预测的启发,Next Forcing引入了MCP训练目标,通过轻量级辅助MCP模块增强主模型,以同时去噪多个未来时间范围(next$^1$、next$^2$、next$^3$块)的视频块。这些MCP模块在预测深度上形成因果链,其中从主模型多个层融合的中间特征被用于预测未来动态,使得近期预测能够为远期预测提供信息,并向主模型提供密集的多尺度时间监督。在训练中,MCP模块显著加速收敛并提高收敛精度,尤其是在高帧率下:在50 fps下,Next Forcing在5k训练步数上比LingBot-VA相对提升93.1%,收敛速度提升2.3倍,并在RoboTwin基准(Clean/Random上94.1/93.5%)上建立了新的最先进结果。在推理时,MCP模块可以保留以并行预测当前块和下一个视频块,实现2倍推理加速。Next Forcing还在PhyWorld(评估视频生成中物理规律遵循的基准)上展示了显著改进,并在通用视频预训练上实现了超过50%的FVD降低。

英文摘要

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.

2606.11186 2026-06-10 cs.CV 新提交

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

AnyMod-LLVE: 模态无关推理的低光照视频增强

Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu, Wenqi Shao, Ying Fu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AMNet统一多模态框架,通过空间-频谱双门控转换器学习辅助模态与RGB输入的对应关系,支持推理时任意模态组合,解决低光照视频增强中辅助模态缺失问题。

Comments Accepted at ICML 2026; Project page and code: https://lhfgghc.github.io/LLVE-AMNet

详情
AI中文摘要

低光照视频增强(LLVE)由于低照度条件下严重的信息退化仍然是一项具有挑战性的任务。最近的多模态方法通过引入辅助模态(如事件流和红外图像)显著提升了增强性能。然而,这些方法通常假设推理时这些模态可用,这在现实场景中往往不可行。为了解决这个问题,在本工作中,我们提出了AMNet,一个统一的LLVE多模态框架,以支持灵活的模态无关推理,其中辅助模态可能不可用。为了解决模态缺失问题,我们引入了一个空间-频谱双门控转换器,学习辅助模态与RGB输入之间的对应关系,生成隐式辅助表示以支持鲁棒增强。此外,为了充分促进跨模态对应学习,我们基于仅RGB数据集和合成辅助模态进行了大规模多模态预训练。大量实验表明,AMNet能够处理任意推理时的模态组合,并在模态缺失条件下展现出优越的LLVE性能。代码和模型可在项目页面上获取。

英文摘要

Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. However, these methods typically assume the availability of these modalities at inference, which is often not feasible in real-world scenarios. To solve this problem, in this work, we propose AMNet, a unified multimodal framework for LLVE, to support flexible modality-agnostic inference, where auxiliary modalities may be unavailable. To address the issue of modality absence, we introduce a Spatial-Spectral Dual-Gated Translator that learns the correspondence between auxiliary modalities and RGB inputs, producing implicit auxiliary representations to support the robust enhancement. Additionally, to fully facilitate the learning of cross-modal correspondence, we conduct large-scale multimodal pretraining based on the RGB-only dataset with synthetic auxiliary modalities. Extensive experiments demonstrate that AMNet could handle arbitrary inference-time modality combinations and exhibits superior performance for LLVE under modality absence conditions. Code and models are available on the project page.