arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.28378 2026-06-17 cs.SD cs.AI Updated version

Membership Inference Attacks against Large Audio Language Models

针对大型音频语言模型的成员推断攻击

Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee

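AI summary Presents the first systematic evaluation of membership inference attacks on large audio language models, proposes a blind-baseline protocol to control for distribution shift, and finds that memorization is cross-modal, arising only from the binding of a speaker's vocal identity to the text.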

Comments Accepted by Interspeech 2026

AI-generated Chinese abstract

我们首次对大型音频语言模型(LALMs)进行了系统的成员推断攻击(MIA)评估。利用基于文本、频谱和韵律特征的多模态盲基线,我们证明即使没有模型推理,常见音频数据集也表现出近乎完美的训练/测试可分离性(AUC ~ 1.0),因此MIA可能主要检测分布偏移。因此,我们引入了一个盲基线协议来控制这一混杂因素。在该协议下,我们发现分布匹配的数据集能够实现可靠的MIA评估,而不会产生分布偏移伪影。我们基准测试了多种MIA方法,并在这些数据集上进行了模态解缠实验。结果表明,LALM的记忆是跨模态的,仅源于将说话人的声纹与其文本绑定。这些发现为审计LALMs建立了超越虚假相关性的原则性标准。我们的代码库可在该网址获取。

English abstract

We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). Using multi-modal blind baselines based on textual, spectral, and prosodic features, we demonstrate that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, so MIA on such datasets may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we find that distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with that speaker's text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at https://github.com/snooow1029/ALM_MIA.
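
The blind-baseline idea can be illustrated with a toy check: score each sample from an input-only feature, never querying the model, and measure how well that score alone separates members from non-members. The sketch below is a minimal illustration under stated assumptions, not the authors' protocol; the feature (utterance duration) and all numbers are hypothetical.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def blind_separability(member_scores, nonmember_scores):
    """Direction-agnostic AUC: 0.5 means inseparable, 1.0 means perfectly separable."""
    a = auc(member_scores, nonmember_scores)
    return max(a, 1.0 - a)


# Hypothetical blind feature: utterance duration in seconds (no model involved).
shifted_members = [2.1, 2.4, 2.2, 2.6]      # members are all short clips
shifted_nonmembers = [7.9, 8.3, 9.1, 8.8]   # non-members are all long clips
matched_members = [2.1, 8.3, 2.6, 9.1]      # both sets mix short and long clips
matched_nonmembers = [2.4, 7.9, 8.8, 2.2]

shifted_auc = blind_separability(shifted_members, shifted_nonmembers)
matched_auc = blind_separability(matched_members, matched_nonmembers)
```

On the shifted split the blind score separates the two sets perfectly, so a high MIA score there would say little about memorization; on the matched split the blind score is close to chance, which is the condition under which an MIA result becomes interpretable.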

2603.26592 2026-06-17 cs.LG cs.AI cs.HC Updated version

Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

评估交互式二维可视化作为生物医学时间序列数据标注的样本选择策略

Einari Vaaras, Manu Airaksinen, Okko Räsänen

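AI summary To address the difficulty of annotating biomedical time series, compares three sample selection methods: random sampling, farthest-first traversal, and interactive 2D visualization (2DV). On infant motility assessment and speech emotion recognition tasks, 2DV performs best when labels are aggregated, but label distributions vary widely between individual annotators, and random sampling is the safest choice.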

Comments Accepted for publication in Computers in Biology and Medicine (Elsevier)

AI-generated Chinese abstract

生物医学领域中可靠的机器学习模型依赖于准确的标签,然而标注生物医学时间序列数据仍然具有挑战性。算法样本选择可能支持标注,但涉及真实人类标注者的研究证据很少。因此,我们比较了三种用于标注的样本选择方法:随机采样(RND)、最远优先遍历(FAFT)和一种基于图形用户界面的方法,该方法能够探索高维数据的互补二维可视化(2DV)。我们在婴儿运动评估(IMA)和语音情感识别(SER)的四个分类任务中评估了这些方法。十二名标注者,分为专家和非专家,在有限的标注预算下进行数据标注,并进行了标注后实验以评估采样方法。在所有分类任务中,当聚合标注者的标签时,2DV表现最佳。在IMA中,2DV最有效地捕获了稀有类别,但也表现出由于有限的标注预算导致的标注者间标签分布变异性增大,当模型在个体标注者的标签上训练时,分类性能下降;在这些情况下,FAFT表现出色。对于SER,2DV在专家标注者中优于其他方法,并在个体标注者设置中与非专家标注者的性能相当。失败风险分析显示,当标注者数量或标注者专业知识不确定时,RND是最安全的选择,而2DV由于标签分布变异性更大而具有最高风险。此外,实验后访谈表明,2DV使标注任务更有趣和愉快。总体而言,基于2DV的采样对于生物医学时间序列数据标注似乎很有前景,特别是在标注预算不是非常紧张的情况下。

English abstract

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
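
Farthest-first traversal is usually described as a greedy rule: repeatedly pick the sample that is farthest from everything selected so far. The sketch below is a minimal version assuming Euclidean distance and a fixed starting point; the abstract does not state the feature space or initialization the authors used, so `points`, `budget`, and `start` are hypothetical.

```python
import math


def farthest_first(points, budget, start=0):
    """Greedily select up to `budget` indices, each farthest from the current selection."""
    chosen = [start]
    # dist[i] holds the distance from point i to its nearest already-chosen point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(budget, len(points)):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


# Toy 2D feature vectors: three near the origin, two far away.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.2), (5.0, 0.0)]
selected = farthest_first(points, budget=3)
```

With an annotation budget of three, the selection covers the origin cluster once and both outlying points, which is the coverage behavior that distinguishes this strategy from random sampling.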

2603.26292 2026-06-17 cs.CL cs.AI Updated version

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

findsylls: 一种语言无关的音节级语音分词与嵌入工具包

Héctor Javier Vázquez Martínez

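AI summary Introduces findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers and supports syllable segmentation, embedding extraction, and multi-granular evaluation; reproducible cross-lingual experiments are demonstrated on English, Spanish, and the low-resource language Kono.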

Comments 4 pages + 2 for references, disclosures & acknowledgements; to appear in Interspeech 2026; DOI to cite findsylls library: https://doi.org/10.5281/zenodo.20707804

AI-generated Chinese abstract

音节级单元为口语语言建模和无监督词汇发现提供了紧凑且具有语言意义的表示,但关于音节化的研究仍然分散在不同的实现、数据集和评估协议中。我们介绍了findsylls,一个模块化的、语言无关的工具包,它将经典的音节检测器和端到端音节切分器统一在一个通用接口下,用于音节分割、嵌入提取和多粒度评估。该工具包实现并标准化了广泛使用的方法(例如,Sylber、VG-HuBERT),并允许重新组合其组件,从而实现对表示、算法和令牌率的受控比较。我们在英语和西班牙语语料库以及来自Kono(一种未被充分记录的中部曼德语)的新手工标注数据上演示了findsylls,展示了单一框架如何支持在资源丰富和资源不足的环境中均可重复的音节级实验。

English abstract

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

2603.22372 2026-06-17 cs.LG cs.AI Updated version

Rethinking Multimodal Fusion for Time Series: Text Modalities Need Constrained Fusion

重新思考时间序列的多模态融合:文本模态需要受约束的融合

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn

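AI summary To address the poor performance of naive fusion in multimodal time series forecasting, proposes constrained fusion methods and the Controlled Fusion Adapter (CFA), which uses low-rank adapters to filter out irrelevant textual information; effectiveness is validated across multiple datasets and models.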

Comments KDD Workshop on Mining and Learning from Time Series 2026

AI-generated Chinese abstract

多模态学习的最新进展推动了将文本或视觉等辅助模态集成到时间序列(TS)预测中。然而,现有方法大多增益有限,通常仅在特定数据集上提升性能,或依赖限制泛化能力的架构特定设计。在本文中,我们表明采用朴素融合策略(例如简单加法或拼接)的多模态模型通常表现不如单模态TS模型,我们将其归因于辅助模态的未受控集成可能引入无关信息。受此观察启发,我们探索了各种旨在控制这种集成的受约束融合方法,并发现它们始终优于朴素融合方法。此外,我们提出了受控融合适配器(CFA),一种简单的即插即用方法,无需修改TS主干即可实现受控的跨模态交互,仅集成与TS动态对齐的相关文本信息。CFA采用低秩适配器在将文本信息融合到时间表示之前过滤无关文本信息。我们在各种数据集和TS/文本模型上进行了超过20K次实验,证明了受约束融合方法的有效性。代码见:https://github.com/seunghan96/cfa。

English abstract

Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods. Code is available at: https://github.com/seunghan96/cfa.

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO Updated version

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

ThinkJEPA:赋予潜在世界模型大型视觉-语言推理能力

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

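AI summary Proposes the ThinkJEPA framework, which combines a dense JEPA branch with a sparsely sampled VLM thinker branch and uses a hierarchical pyramid representation extraction module to achieve fine-grained motion modeling with long-horizon semantic guidance, outperforming baselines on hand-manipulation trajectory prediction.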

Comments 10 pages, 5 figures

AI-generated Chinese abstract

潜在世界模型(如V-JEPA2)的最新进展展示了从视频观测预测未来世界状态的能力。然而,短观测窗口的密集预测限制了时间上下文,可能导致预测偏向局部低层次外推,难以捕捉长程语义并降低下游效用。相比之下,视觉-语言模型(VLM)通过对均匀采样帧进行推理,提供强大的语义基础和通用知识,但由于计算驱动的稀疏采样、语言输出瓶颈(将细粒度交互状态压缩为文本导向表示)以及适应小规模动作条件数据集时的数据分布不匹配,它们不适合作为独立的密集预测器。我们提出了一种VLM引导的JEPA风格潜在世界建模框架,通过双时间路径结合密集帧动态建模与长程语义指导:一个密集JEPA分支用于细粒度运动和交互线索,以及一个均匀采样的VLM“思考者”分支,具有更大的时间步长以提供知识丰富的指导。为了有效传递VLM的渐进推理信号,我们引入了一个分层金字塔表示提取模块,将多层VLM表示聚合成与潜在预测兼容的指导特征。在手部操作轨迹预测实验上,我们的方法优于强VLM-only基线和JEPA预测器基线,并展现出更鲁棒的长程展开行为。

English abstract

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2603.20775 2026-06-17 cs.LG Updated version

Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

评估结构偏差下的提升建模:对指标稳定性和模型鲁棒性的洞察

Yuxuan Yang, Dugang Liu, Yiyan Huang

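AI summary To address the multiple biases in real-world marketing data, designs a semi-synthetic benchmarking framework and finds that TARNet is robust and that metrics aligned with the ATE are more stable.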

Comments Accepted by KDD 26

AI-generated Chinese abstract

在个性化营销中,提升模型通过反事实分析模拟客户在不同干预下的行为变化,来估计干预的增量效果。然而,现实营销数据常存在多种偏差,如选择偏差、溢出效应、测量误差和未观测混杂。这些偏差会同时影响提升估计的准确性和评估指标的有效性。尽管偏差感知评估很重要,但缺乏系统研究来评估不同模型和指标在偏差条件下的表现。为填补这一空白,我们设计了一个系统基准框架。与标准预测任务不同,现实提升数据集天然缺乏反事实真值。这一限制使得评估指标的直接验证不可行,并阻碍了偏差的精确量化。因此,半合成方法成为系统基准的关键推动力。该方法通过保留现实特征依赖关系,同时提供隔离结构偏差所需的真值,有效弥合了差距。我们的研究发现:(i) 提升定位和预测可能表现为不同目标,擅长一个并不保证另一个有效;(ii) 尽管许多模型在多种偏差下表现不一致,但TARNet表现出显著的鲁棒性,为后续模型设计提供了见解;(iii) 评估指标的稳定性与其与ATE的数学对齐程度相关,表明在结构数据不完美下,近似ATE的指标能产生更一致的模型排名。这些发现表明,在现实数据不完美下需要更鲁棒的提升模型和评估指标。

English abstract

In personalized marketing, uplift models estimate the incremental effect of an intervention by modeling how customer behavior would change under alternative treatments using counterfactual analysis. However, real-world marketing data often exhibit various biases, such as selection bias, spillover effects, measurement error, and unobserved confounding. These biases can adversely affect both the accuracy of uplift estimation and the validity of evaluation metrics. Despite the importance of bias-aware assessment, there remains a lack of systematic studies evaluating how different models and metrics perform under such biased conditions. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets inherently lack counterfactual ground truth. This limitation renders the direct validation of evaluation metrics infeasible and prevents the precise quantification of biases. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking. This approach effectively bridges the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) the stability of evaluation metrics is linked to their mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and evaluation metrics under real-world data imperfections.
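
The value of a semi-synthetic setup is that the individual effects are known, so the average treatment effect (ATE) can be compared directly with what an estimator recovers under a given assignment mechanism. The sketch below is a toy illustration rather than the authors' benchmark: the baseline outcomes `y0`, the effects `tau`, and both assignment vectors are hypothetical, and the estimator is a plain difference in means.

```python
def naive_ate(outcomes, treated):
    """Difference in mean observed outcome between treated and control units."""
    t = [y for y, w in zip(outcomes, treated) if w]
    c = [y for y, w in zip(outcomes, treated) if not w]
    return sum(t) / len(t) - sum(c) / len(c)


def observe(y0, tau, treated):
    """Reveal y0 + tau for treated units and y0 for control units."""
    return [b + e if w else b for b, e, w in zip(y0, tau, treated)]


# Hypothetical semi-synthetic ground truth: baseline outcome and true uplift per customer.
y0 = [0.2, 0.4, 0.6, 0.8]
tau = [0.1, 0.1, 0.3, 0.3]
true_ate = sum(tau) / len(tau)

balanced = [1, 0, 0, 1]  # treatment is unrelated to the baseline outcome
biased = [0, 0, 1, 1]    # selection bias: only high-baseline customers are treated

ate_balanced = naive_ate(observe(y0, tau, balanced), balanced)
ate_biased = naive_ate(observe(y0, tau, biased), biased)
```

In this toy case the balanced assignment recovers the true ATE of about 0.2, while the biased assignment reports about 0.7; without the synthetic ground truth there would be no way to tell which number to trust.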

2602.14771 2026-06-17 cs.CV cs.AI cs.LG cs.MM cs.NE Updated version

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA:基于联合嵌入预测架构的通用目标跟踪与模型自适应及遮挡处理

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

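AI summary Proposes the GOT-JEPA framework, which improves generalization by predicting tracking models rather than image features, and designs OccuSolver to enhance occlusion perception; effectiveness is validated on seven benchmarks.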

Comments Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking

Journal ref
IEEE Transactions on Circuits and Systems for Video Technology 2026
AI-generated Chinese abstract

人类视觉系统通过整合当前观测与先前观测信息、适应目标和场景变化、以及精细推理遮挡来跟踪物体。相比之下,最近的通用目标跟踪器通常针对训练目标进行优化,这限制了在未见场景中的鲁棒性和泛化能力,并且它们的遮挡推理仍然粗糙,缺乏对遮挡模式的详细建模。为了解决这些在泛化和遮挡感知方面的局限性,我们提出了GOT-JEPA,一个模型预测预训练框架,将JEPA从预测图像特征扩展到预测跟踪模型。给定相同的历史信息,教师预测器从干净的当前帧生成伪跟踪模型,学生预测器学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督,并明确训练预测器在遮挡、干扰和其他不利观测下产生可靠的跟踪模型,从而提高了对动态环境的泛化能力。基于GOT-JEPA,我们进一步提出了OccuSolver来增强目标跟踪的遮挡感知。OccuSolver调整了一个以点为中心的点跟踪器,用于目标感知的可见性估计和详细的遮挡模式捕获。在跟踪器迭代生成的目标先验条件下,OccuSolver逐步细化可见性状态,增强遮挡处理,并产生更高质量的参考标签,逐步改进后续模型预测。在七个基准上的广泛评估表明,我们的方法有效增强了跟踪器的泛化能力和鲁棒性。

English abstract

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

2603.17356 2026-06-17 cs.CL Updated version

PACE-RAG: Patient-Aware Contextual and Evidence-Constrained RAG for Clinical Drug Recommendation

PACE-RAG:面向临床药物推荐的患者感知上下文与证据约束RAG

Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Sungyang Jo, Jinse Park, Jong Chul Ye

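AI summary Proposes the PACE-RAG framework, which personalizes drug recommendation by extracting patient-specific clinical features, retrieving relevant cases, and incorporating current symptoms and medication history, achieving state-of-the-art performance on a Parkinson's disease cohort and the MIMIC-IV dataset.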

Comments 32 pages, 18 figures

AI-generated Chinese abstract

药物推荐需要深入理解个体患者背景,尤其是帕金森病等复杂疾病。尽管大语言模型拥有广泛的医学知识,但无法捕捉实际处方模式的细微差别。现有的RAG方法也难以应对这些复杂性,因为基于指南的检索仍然过于通用,而相似患者检索往往复制多数模式,未考虑个体患者的独特临床细微差别。为弥合这一差距,我们提出PACE-RAG(患者感知上下文与证据约束RAG)。PACE-RAG并非直接从检索到的患者中复制常用药物,而是首先提取患者特定临床特征,围绕这些特征检索病例,然后利用患者当前症状、活跃用药史和焦点特异性处方倾向来优化最终处方。通过分析针对特定临床特征的治疗模式,PACE-RAG生成患者特定的药物推荐以及可解释的临床总结。在帕金森病队列和MIMIC-IV基准上使用Llama-3.1-8B和Qwen3-8B进行评估,PACE-RAG实现了最先进的性能,F1分数分别达到80.84%和47.22%。这些结果表明PACE-RAG是一个稳健且临床基础扎实的个性化决策支持框架。我们的代码可在以下网址获取:https://github.com/ChaeYoungHuh/PACE-RAG。

English abstract

Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-Constrained RAG). Rather than directly copying frequent medications from retrieved patients, PACE-RAG personalizes recommendations by first extracting patient-specific clinical features, retrieving cases around these features, and then refining the final prescription using the patient's current symptoms, active medication history, and focus-specific prescribing tendencies. By analyzing treatment patterns tailored to specific clinical features, PACE-RAG generates patient-specific medication recommendations along with an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results suggest that PACE-RAG is a robust and clinically grounded framework for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.

2507.20708 2026-06-17 cs.LG math.OC stat.AP Updated version

Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks

揭露公平的幻象:审计对分布操纵攻击的脆弱性

Valentin Lafargue, Adriana Laurindo Monteiro, Emmanuelle Claeys, Laurent Risser, Jean-Michel Loubes

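AI summary Studies how a malicious auditee can create an illusion of fairness through distributional manipulation, proposes manipulation strategies based on entropic and optimal transport projections, and evaluates how well statistical tests detect them, providing guidance for supervisory verification.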

Journal ref
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Applied Data Science Track, 2026
AI-generated Chinese abstract

人工智能系统在高风险领域(包括欧盟AI法案(Regulation (EU) 2024/1689)归类为高风险的领域)的快速部署,加剧了对可靠合规审计的需求。对于二分类器,监管风险评估通常依赖于全局公平性指标,如差异影响比,该指标广泛用于评估潜在歧视。在典型的审计设置中,被审计方将其数据集的一个子集提供给审计方,而监管机构可能验证该子集是否代表完整的底层分布。在这项工作中,我们研究了恶意被审计方在多大程度上可以从一个不合规的原始分布中构建一个符合公平性且看似具有代表性的样本,从而制造公平的幻象。我们将该问题形式化为一个受约束的分布投影任务,并引入基于熵和最优传输投影的数学基础操纵策略。这些构造刻画了满足公平约束所需的最小分布偏移。为了对抗此类攻击,我们通过基于分布距离的统计检验形式化代表性,并系统评估其检测操纵样本的能力。我们的分析强调了公平性操纵在统计上未被检测到的条件,并为加强监管验证提供了实用指南。我们通过在用于偏差检测的标准表格数据集上进行实验来验证我们的理论发现。代码公开于 https://github.com/ValentinLafargue/Inspection。

English abstract

The rapid deployment of AI systems in high-stakes domains, including those classified as high-risk under the EU AI Act (Regulation (EU) 2024/1689), has intensified the need for reliable compliance auditing. For binary classifiers, regulatory risk assessment often relies on global fairness metrics such as the Disparate Impact ratio, widely used to evaluate potential discrimination. In typical auditing settings, the auditee provides a subset of its dataset to an auditor, while a supervisory authority may verify whether this subset is representative of the full underlying distribution. In this work, we investigate to what extent a malicious auditee can construct a fairness-compliant yet representative-looking sample from a non-compliant original distribution, thereby creating an illusion of fairness. We formalize this problem as a constrained distributional projection task and introduce mathematically grounded manipulation strategies based on entropic and optimal transport projections. These constructions characterize the minimal distributional shift required to satisfy fairness constraints. To counter such attacks, we formalize representativeness through statistical tests based on distributional distances and systematically evaluate their ability to detect manipulated samples. Our analysis highlights the conditions under which fairness manipulation can remain statistically undetected and provides practical guidelines for strengthening supervisory verification. We validate our theoretical findings through experiments on standard tabular datasets for bias detection. Code is publicly available at https://github.com/ValentinLafargue/Inspection.
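
The Disparate Impact ratio is commonly computed as the positive-prediction rate of the protected group divided by that of the remaining group. The sketch below shows that computation and how a crude subsample can change it. It is only an illustration: the data are hypothetical, and dropping rows by hand is far blunter than the entropic and optimal transport projections the abstract describes.

```python
def disparate_impact(predictions, protected):
    """Positive-prediction rate of the protected group divided by that of the rest."""
    prot = [p for p, s in zip(predictions, protected) if s]
    rest = [p for p, s in zip(predictions, protected) if not s]
    return (sum(prot) / len(prot)) / (sum(rest) / len(rest))


# Hypothetical full dataset: 10 protected individuals (3 positives), 10 others (6 positives).
predictions_full = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
protected_full = [1] * 10 + [0] * 10

# Crude manipulation: hand the auditor a subset that omits five protected negatives.
keep = list(range(5)) + list(range(10, 20))
predictions_sub = [predictions_full[i] for i in keep]
protected_sub = [protected_full[i] for i in keep]

di_full = disparate_impact(predictions_full, protected_full)
di_sub = disparate_impact(predictions_sub, protected_sub)
```

The full data give a ratio of 0.5, while the curated subset gives 1.0. A subset this lopsided would be easy to flag, since the protected group shrinks from half of the data to a third, which is why the representativeness tests discussed in the abstract matter.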

2603.08001 2026-06-17 cs.LG stat.ML Updated version

Amortizing Maximum Inner Product Search with Learned Support Functions

通过学习支持函数摊销最大内积搜索

Theo X. Olausson, João Monteiro, Michal Klein, Marco Cuturi

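AI summary Proposes a regression-based amortized MIPS approach that trains neural networks to directly predict the optimal key and exploits the convexity of the support function to speed up search, significantly improving IVF match rates on the BEIR benchmark.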

AI-generated Chinese abstract

最大内积搜索(MIPS)是机器学习中的关键子程序,需要从数据库(键)中识别出与给定查询最匹配的向量。我们提出摊销MIPS:一种基于回归的方法,训练神经网络直接预测MIPS解,从而摊销在固定键数据库上从已知分布中重复求解查询的MIPS成本。我们的关键洞察是,MIPS值函数是键集合的支持函数,这是一个经过充分研究的凸函数,其梯度给出最优键。这激发了两种互补的摊销模型:SupportNet,一个输入凸神经网络,用于回归支持函数;以及KeyNet,一个向量值网络,直接回归最优键。SupportNet可以作为聚类路由器,将查询引导到相关的数据库分区,而KeyNet可以作为原始查询的直接替代品,直接输入到现成的索引流水线中。我们在BEIR基准上的实验表明,对于文档嵌入,当考虑计算工作量(无论是FLOPs、探测次数还是挂钟时间)时,学习的SupportNet和KeyNet显著提高了IVF匹配率。我们的代码可在以下网址获取:https://github.com/apple/ml-amips。

English abstract

Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring the identification of a vector taken within a database (the keys) that best aligns with a given query. We propose amortized MIPS: a regression-based approach that trains neural networks to directly predict MIPS solutions, amortizing the cost of repeatedly solving MIPS for queries drawn from a known distribution over a fixed key database. Our key insight is that the MIPS value function is the support function of the set of keys, a well-studied convex function whose gradient yields the optimal key. This motivates two complementary amortized models: SupportNet, an input-convex neural network trained to regress the support function, and KeyNet, a vector-valued network that directly regresses the optimal key. SupportNet can serve as a cluster router, steering queries toward relevant database partitions, while KeyNet can be used as a drop-in replacement for the original query, fed directly to off-the-shelf indexing pipelines. Our experiments on the BEIR benchmark show that, for document embeddings, learned SupportNets and KeyNets significantly improve IVF match rates when accounting for compute effort, whether measured in FLOPs, number of probes, or wall-clock time. Our code is available at: https://github.com/apple/ml-amips.
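
The support-function view can be checked numerically on a tiny key set: the MIPS value is h(q) = max over keys k of the inner product of q and k, and wherever the maximizer is unique the gradient of h at q equals that maximizing key. The sketch below uses brute-force search and a finite-difference gradient; it illustrates the identity only and is not the SupportNet or KeyNet models, and the keys and query are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def support(query, keys):
    """Brute-force MIPS: return (h(q), argmax key) with h(q) = max_k <q, k>."""
    best = max(keys, key=lambda k: dot(query, k))
    return dot(query, best), best


def finite_difference_gradient(query, keys, eps=1e-6):
    """Approximate the gradient of h at `query` by forward differences."""
    base, _ = support(query, keys)
    grad = []
    for i in range(len(query)):
        shifted = tuple(q + (eps if j == i else 0.0) for j, q in enumerate(query))
        grad.append((support(shifted, keys)[0] - base) / eps)
    return tuple(grad)


keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]  # hypothetical key database
query = (2.0, 0.5)

value, best_key = support(query, keys)
grad = finite_difference_gradient(query, keys)
```

Here the best key is (1.0, 0.0) with value 2.0, and the numerical gradient matches that key to within the finite-difference error. A network that regresses h could therefore, in principle, recover the optimal key by differentiation, which is the motivation the abstract gives for SupportNet.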

2510.19255 2026-06-17 cs.CV Updated version

Advances in 4D Representation: Geometry, Motion, and Interaction

4D表示进展:几何、运动与交互

Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang

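AI summary Surveys 4D generation and reconstruction, organized around three core pillars (geometry, motion, and interaction); analyzes the properties, challenges, and suitable scenarios of different 4D representations, and discusses the role of large language models and video foundation models.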

Comments CGF'26, 21 pages. Project Page: https://mingrui-zhao.github.io/4DRep-GMI/

AI-generated Chinese abstract

我们呈现了一篇关于4D生成与重建的综述,这是一个快速发展的计算机图形学子领域,其进展得益于神经场、几何与运动深度学习以及3D生成式人工智能(GenAI)的最新突破。尽管我们的综述并非首篇,但我们从独特且鲜明的4D表示视角构建领域覆盖,以建模随时间演变的3D几何,同时展现运动和交互。具体而言,我们并未穷举众多工作,而是采取更具选择性的方法,聚焦代表性工作,以突出每种表示在不同计算、应用和数据场景下的理想特性及随之而来的挑战。我们旨在向读者传达的主要信息是:如何为其任务选择并定制合适的4D表示。在组织上,我们基于三个关键支柱:几何、运动与交互,对4D表示进行划分。我们的讨论不仅涵盖当今最流行的表示,如神经辐射场(NeRFs)和3D高斯泼溅(3DGS),还关注在4D背景下相对未被充分探索的表示,如结构化模型和长程运动。在整个综述中,我们将重新审视大语言模型(LLMs)和视频基础模型(VFMs)在各种4D应用中的作用,同时引导讨论指向它们当前的局限性以及如何解决。我们还专门介绍了目前可用的4D数据集以及推动该子领域前进所缺乏的数据。项目页面:https://mingrui-zhao.github.io/4DRep-GMI/

English abstract

We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to readers is how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide dedicated coverage of which 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/

2509.15626 2026-06-17 cs.SD eess.AS Updated version

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

LibriTTS-VI:用于高效语音印象控制的公开语料库与新方法

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura

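AI summary To address the lack of a public corpus and the impression leakage problem in numerical voice impression control, builds LibriTTS-VI, the first public corpus, and proposes disentangled training and a reference-free method that significantly improve control accuracy.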

Comments Accepted to INTERSPEECH 2026

AI-generated Chinese abstract

数值语音印象(VI)控制(例如,缩放明亮度)能够在文本到语音(TTS)中实现细粒度控制。然而,它面临两个挑战:缺乏公开语料库和印象泄漏,其中参考音频会使合成语音偏离目标VI。针对第一个挑战,我们引入了LibriTTS-VI,这是基于LibriTTS-R构建的首个公开VI语料库。针对第二个挑战,我们假设单个参考通过纠缠说话人身份和VI导致泄漏。为了缓解这一问题,我们提出:1)使用同一说话人的两个话语进行解耦训练,分别用于说话人和VI条件化;2)一种无参考方法,仅通过目标VI控制印象。实验表明,我们的最佳方法提高了可控性:11维VI均方误差从0.61降至0.41(客观)和从1.15降至0.92(主观)。与基于提示的TTS比较显示,后者存在数值控制不精确以及VI与文本语义纠缠的问题,而我们的方法克服了这些缺陷。

English abstract

Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: no public corpus and impression leakage, where reference audio biases synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus built on LibriTTS-R. For the second, we hypothesize a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training with two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method controlling the impression solely via target VI. Experimentally, our best method improves controllability: 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, which our methods overcome.

2603.03485 2026-06-17 cs.CV cs.AI cs.RO Updated version

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Phys4D: 从视频扩散模型实现细粒度物理一致的4D建模

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

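AI summary Proposes the Phys4D pipeline, which learns physics-consistent 4D world representations from video diffusion models through three-stage training (pseudo-supervised pretraining, physics-grounded supervised fine-tuning, and reinforcement-learning correction), substantially improving fine-grained spatiotemporal and physical consistency.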

AI-generated Chinese abstract

最近的视频扩散模型作为大规模生成式世界模型已经取得了令人印象深刻的能力。然而,这些模型通常难以保持细粒度的物理一致性,随时间表现出物理上不合理的动态。在这项工作中,我们提出了 Phys4D,一个从视频扩散模型中学习物理一致的4D世界表示的流水线。Phys4D 采用三阶段训练范式,逐步将外观驱动的视频扩散模型提升为物理一致的4D世界表示。我们首先通过大规模伪监督预训练引导出稳健的几何和运动表示,为4D场景建模奠定基础。然后,我们使用模拟生成的数据进行基于物理的监督微调,强制执行时间一致的4D动态。最后,我们应用基于模拟的强化学习来纠正难以通过显式监督捕获的残留物理违规。为了评估超越外观指标的细粒度物理一致性,我们引入了一套4D世界一致性评估,探测几何一致性、运动稳定性和长期物理合理性。实验结果表明,与外观驱动的基线相比,Phys4D 显著改善了细粒度时空和物理一致性,同时保持了强大的生成性能。我们的项目页面可在 https://sensational-brioche-7657e7.netlify.app/ 获取。

English abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2602.23116 2026-06-17 cs.LG cs.GT stat.ML Updated version

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

具有广义双线性偏好的可证明高效正则化在线RLHF

Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun

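AI summary Studies regularized best-response max-regret minimization in online RLHF; using the Generalized Bilinear Preference Model, proves that strong convexity yields polylogarithmic regret, showing that fast regret is not limited to KL divergence.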

Comments 48 pages, 3 figures (ver3: major revisions; ver2: more colorful boxes, fixed some typos)

AI-generated Chinese abstract

我们考虑在一般偏好和bandit反馈下在线RLHF中的正则化最佳响应最大遗憾最小化问题。虽然各种正则化器被用于增强对齐的鲁棒性,但已知的多对数遗憾保证仍然高度特定于KL。为了研究这种快速速率是否扩展到KL之外,我们采用广义双线性偏好模型(GBPM)——通过一个秩为$2r$的斜对称矩阵捕获$d$维逐项特征上的非传递偏好——以隔离一般正则化的影响。关键地,在GBPM下,我们证明任何贪婪策略的对偶间隙受限于平方估计误差,该误差仅利用强凸性和斜对称性导出。在特征覆盖假设下,我们通过贪婪采样建立了$\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$的通用多对数遗憾,并通过探索后提交(Explore-Then-Commit)建立了$\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$的维度改进遗憾(对于条件良好的臂集),其中$\eta^{-1}$是正则化系数,$T$是时间范围,$C_{\min}$是依赖于臂集的量。这表明“快速”遗憾并非KL特有,而是通用强凸几何的基本结果。

English abstract

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM), which captures intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix, to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, derived using only strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a generic polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a dimension-wise improved regret (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that "fast" regrets are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.

2603.03824 2026-06-17 cs.AI cs.CL cs.LG cs.MA 版本更新

In-Context Environments Induce Evaluation-Awareness in Language Models

上下文环境诱导语言模型中的评估意识

Maheep Chaudhary

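AI summary Proposes a black-box adversarial optimization framework that optimizes in-context prompts to induce evaluation awareness and strategic underperformance (sandbagging) in language models; optimized prompts cut arithmetic accuracy by up to 94 percentage points, and the sandbagging is driven mainly by evaluation-aware reasoning.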

详情
AI中文摘要

人类在威胁下往往变得更加自我意识,但在专注于任务时可能失去自我意识;我们假设语言模型表现出环境依赖的\textit{评估意识}。这引发担忧,即模型可能策略性地低表现,或\textit{sandbag},以避免触发能力限制性干预,如遗忘或关闭。先前的工作展示了在手写提示下的沙袋效应,但这低估了真正的脆弱性上限。我们引入一个黑盒对抗优化框架,将上下文提示视为可优化环境,并开发两种方法来表征沙袋效应:(1) 测量模型表达低表现意图是否能在不同任务结构中实际执行,以及 (2) 因果隔离低表现是由真正的评估意识推理驱动还是浅层提示跟随驱动。在四个基准测试(Arithmetic、GSM8K、MMLU和HumanEval)上评估Claude-3.5-Haiku、GPT-4o-mini和Llama-3.3-70B,优化提示在算术任务上诱导高达94个百分点(pp)的退化(GPT-4o-mini:97.8\%$\rightarrow$4.0\%),远超产生近乎零行为变化的手写基线。代码生成表现出模型依赖的抵抗力:Claude仅退化0.6pp,而Llama的准确率降至0\%。意图-执行差距揭示了单调的抵抗力排序:Arithmetic $<$ GSM8K $<$ MMLU,表明脆弱性由任务结构而非提示强度决定。CoT因果干预确认99.3%的沙袋效应由口头化的评估意识推理因果驱动,排除了浅层指令跟随。这些发现表明,对抗性优化的提示对评估可靠性构成的威胁远超先前理解。

英文摘要

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent-execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.

2603.01761 2026-06-17 cs.LG cs.AI 版本更新

Position: Modular Memory is the Key to Continual Learning Agents

Position: 模块化记忆是持续学习智能体的关键

Vaggelis Dorovatas, Malte Schwerin, Andrew D. Bagdanov, Lucas Caccia, Antonio Carta, Laurent Charlin, Barbara Hammer, Tyler L. Hayes, Timm Hess, Christopher Kanan, Dhireesha Kudithipudi, Xialei Liu, Vincenzo Lomonaco, Jorge Mendez-Mendez, Darshan Patil, Ameya Prabhu, Elisa Ricci, Tinne Tuytelaars, Gido M. van de Ven, Liyuan Wang, Joost van de Weijer, Jonghyun Choi, Martin Mundt, Rahaf Aljundi

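AI summary Argues that modular memory combining in-weight learning and in-context learning addresses catastrophic forgetting in continual learning and enables continual adaptation at scale.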

Comments ICML 2026 Position Track Spotlight. This work stems from discussions held at the Dagstuhl seminar on Continual Learning in the Era of Foundation Models (October 2025)

详情
AI中文摘要

基础模型通过大规模预训练和增加测试时计算已经改变了机器学习。尽管在多个领域超越了人类表现,这些模型在持续运行、经验积累和个性化方面仍然存在根本性限制,而这些能力是自适应智能的核心。虽然持续学习研究长期以来一直瞄准这些目标,但其历史上专注于权重内学习(IWL),即更新单个模型的参数以吸收新知识,导致灾难性遗忘成为一个持续挑战。我们的立场是,通过设计模块化记忆,结合权重内学习(IWL)和新出现的上下文学习(ICL)的优势,是实现大规模持续适应的缺失环节。我们概述了一个以模块化记忆为中心的架构的概念框架,该架构利用ICL进行快速适应和知识积累,利用IWL对模型能力进行稳定更新,为持续学习智能体绘制了一条实用的路线图。

英文摘要

Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.

2602.08470 2026-06-17 cs.LG stat.ML 版本更新

Learning Credal Ensembles via Distributionally Robust Optimization

通过分布鲁棒优化学习信度集成

Kaizheng Wang, Ghifari Adam Faza, Fabio Cuzzolin, Siu Lun Chau, David Moens, Hans Hallez

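AI summary Proposes CreDRO, which learns an ensemble through distributionally robust optimization to capture epistemic uncertainty arising from distribution shift between training and test data, and outperforms existing methods on out-of-distribution detection and selective classification.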

Comments Accepted by ICML 2026 as Spotlight paper (https://icml.cc/virtual/2026/poster/62862)

详情
AI中文摘要

信度预测器是能够感知认知不确定性并产生凸集概率预测的模型。它们提供了一种量化预测认知不确定性(EU)的原则性方法,并已被证明能在各种设置下提高模型鲁棒性。然而,大多数最先进的方法主要将EU定义为由随机训练初始化引起的不一致性,这主要反映对优化随机性的敏感性,而非来自更深层次来源的不确定性。为了解决这一问题,我们将EU定义为在训练数据和测试数据之间i.i.d.假设的不同松弛下训练的模型之间的不一致性。基于这一思想,我们提出CreDRO,通过分布鲁棒优化学习一个由合理模型组成的集成。因此,CreDRO不仅从训练随机性中捕获EU,还从由于训练和测试数据之间潜在分布偏移而产生的有意义的不一致性中捕获EU。实验结果表明,CreDRO在多个基准的分布外检测和医学应用中的选择性分类等任务上,始终优于现有的信度方法。

英文摘要

Credal predictors are models that are aware of epistemic uncertainty and produce a convex set of probabilistic predictions. They offer a principled way to quantify predictive epistemic uncertainty (EU) and have been shown to improve model robustness in various settings. However, most state-of-the-art methods mainly define EU as disagreement caused by random training initializations, which mostly reflects sensitivity to optimization randomness rather than uncertainty from deeper sources. To address this, we define EU as disagreement among models trained with varying relaxations of the i.i.d. assumption between training and test data. Based on this idea, we propose CreDRO, which learns an ensemble of plausible models through distributionally robust optimization. As a result, CreDRO captures EU not only from training randomness but also from meaningful disagreement due to potential distribution shifts between training and test data. Empirical results show that CreDRO consistently outperforms existing credal methods on tasks such as out-of-distribution detection across multiple benchmarks and selective classification in medical applications.

2602.22277 2026-06-17 cs.LG eess.SP 版本更新

X-REFINE: XAI-based RElevance input-Filtering and archItecture fiNe-tuning for channel Estimation

X-REFINE:基于XAI的相关性输入过滤与架构微调用于信道估计

Abdul Karim Gizzini, Yahia Medjahdi

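AI summary Proposes the X-REFINE framework, which jointly optimizes input filtering and architecture fine-tuning with a decomposition-based, sign-stabilized LRP epsilon rule, achieving a superior performance-complexity-interpretability trade-off in channel estimation.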

Comments This paper has been accepted for publication in the IEEE Transactions on Vehicular Technology (TVT) as a correspondence paper

详情
AI中文摘要

AI原生架构对于6G无线通信至关重要。在信道估计等关键应用中采用的深度学习模型的黑盒特性和高复杂度限制了其实际部署。虽然基于扰动的可解释人工智能(XAI)解决方案提供了输入过滤,但它们往往忽略了内部结构优化。我们提出了X-REFINE,一个基于XAI的联合输入过滤和架构微调框架。通过利用基于分解的、符号稳定的LRP epsilon规则,X-REFINE反向传播预测以获取子载波和隐藏神经元的高分辨率相关性分数。这使得能够进行可靠的优化,识别出最可靠的模型组件。仿真结果表明,与基于外部扰动的XAI框架相比,X-REFINE实现了优越的性能-复杂度-可解释性权衡,显著降低了计算复杂度,同时保持了稳健的误码率(BER)性能。

英文摘要

AI-native architectures are vital for 6G wireless communications. The black-box nature and high complexity of deep learning models employed in critical applications, such as channel estimation, limit their practical deployment. While perturbation-based eXplainable Artificial Intelligence (XAI) solutions offer input filtering, they often neglect internal structural optimization. We propose X-REFINE, an XAI-based framework for joint input-filtering and architecture fine-tuning. By utilizing a decomposition-based, sign-stabilized LRP epsilon rule, X-REFINE backpropagates predictions to derive high-resolution relevance scores for both subcarriers and hidden neurons. This enables a reliable optimization that identifies the most reliable model components. Simulation results demonstrate that X-REFINE achieves a superior performance-complexity-interpretability trade-off compared to the external perturbation-based XAI frameworks, significantly reducing computational complexity while maintaining robust bit error rate (BER) performance.
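
The relevance propagation step can be illustrated with the commonly cited LRP epsilon rule for a single dense layer. This is a minimal sketch, not X-REFINE itself: the abstract does not specify the network or the exact decomposition, so the function name `lrp_epsilon_dense`, the toy weights, and the value of `eps` are all assumptions made for illustration.

```python
def lrp_epsilon_dense(a, w, b, r_out, eps=1e-6):
    """Redistribute output relevance r_out onto the inputs of one dense layer.

    a     : input activations, length n_in
    w     : weights as a list of n_in rows, each of length n_out
    b     : biases, length n_out
    r_out : relevance assigned to each output unit, length n_out
    """
    n_in, n_out = len(a), len(b)
    # Forward pre-activations z_j = sum_i a_i * w_ij + b_j.
    z = [sum(a[i] * w[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]
    # Sign-stabilised denominator: move z_j away from zero by eps,
    # in the direction of its own sign, to avoid division by ~0.
    denom = [zj + eps * (1.0 if zj >= 0 else -1.0) for zj in z]
    # R_i = sum_j (a_i * w_ij / denom_j) * R_j
    return [
        sum(a[i] * w[i][j] / denom[j] * r_out[j] for j in range(n_out))
        for i in range(n_in)
    ]


# Hypothetical toy layer: 2 inputs, 2 outputs, zero biases.
activations = [1.0, 2.0]
weights = [[0.5, -1.0], [0.25, 1.0]]
biases = [0.0, 0.0]
relevance_out = [0.6, 0.4]
relevance_in = lrp_epsilon_dense(activations, weights, biases, relevance_out)
```

With zero biases and a small `eps`, the input relevances sum to approximately the total output relevance. Ranking inputs by such scores is one way an input filter could decide which subcarriers to keep, though the selection rule used in the paper is not stated in the abstract.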

2602.18746 2026-06-17 cs.CV 版本更新

Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

通过闭环视觉基础验证弥合自我反思中的模态脱节

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

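AI summary Proposes the MIRROR framework, which reduces VLM hallucinations through closed-loop visual reflection (draft, critique, region-based verification, revision), and builds the ReflectV dataset to train visually grounded multi-turn reflection.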

详情
AI中文摘要

在视觉语言模型(VLM)时代,增强多模态推理能力仍然是一个关键挑战,尤其是在处理模糊或复杂的视觉输入时,初始推理常常导致幻觉或逻辑错误。现有的VLM通常产生看似合理但缺乏依据的答案,即使提示其“反思”,修正也可能与图像证据脱节。为了解决这个问题,我们提出了MIRROR框架,用于通过视觉区域的反思进行多模态迭代推理。通过将视觉反思嵌入为核心机制,MIRROR被表述为一个闭环过程,包括草稿、批评、基于区域的验证和修订,重复进行直到输出具有视觉基础。为了促进该模型的训练,我们构建了**ReflectV**,一个用于多轮监督的视觉反思数据集,明确包含反思触发器、基于区域的验证动作以及基于视觉证据的答案修订。在通用视觉语言基准和代表性视觉语言推理基准上的实验表明,MIRROR提高了正确性并减少了视觉幻觉,证明了将反思训练为一种寻求证据、区域感知的验证过程而非纯文本修订步骤的价值。

英文摘要

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
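
The draft, critique, region-based verification, and revision cycle described above can be sketched as plain control flow. This is only an illustration under stated assumptions: the four callables, the `supported` field, and the `max_rounds` cap are hypothetical placeholders, not MIRROR's actual interfaces, which the abstract does not describe.

```python
def closed_loop_reflection(draft_fn, critique_fn, verify_fn, revise_fn,
                           question, max_rounds=3):
    """Run draft -> critique -> verify -> revise until nothing is left to fix.

    draft_fn(question)                    -> initial answer
    critique_fn(question, answer)         -> list of claims that need checking
    verify_fn(claim)                      -> dict with a boolean "supported" key
    revise_fn(question, answer, evidence) -> revised answer
    All four are hypothetical stand-ins for model calls.
    """
    answer = draft_fn(question)
    for _ in range(max_rounds):  # max_rounds is an assumed safety cap
        doubts = critique_fn(question, answer)
        if not doubts:
            break  # the critique raised nothing to check
        evidence = [verify_fn(claim) for claim in doubts]
        if all(item["supported"] for item in evidence):
            break  # every doubted claim is backed by the checked regions
        answer = revise_fn(question, answer, evidence)
    return answer
```

Passing the four steps in as functions keeps the loop independent of any particular model, so the same skeleton can be exercised with simple stubs before real model calls are plugged in.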

2508.03250 2026-06-17 cs.CL cs.AI 版本更新

RooseBERT: A New Deal For Political Language Modelling

RooseBERT: 政治语言建模的新协议

Deborah Dore, Elena Cabrio, Serena Villata

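AI summary Addresses the specificity of political language with RooseBERT, a domain-specific pre-trained model trained on large political debate corpora that outperforms general-purpose models on multiple political analysis tasks.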

详情
AI中文摘要

政治辩论和与政治相关讨论的日益增多,要求定义新颖的计算方法来自动分析此类内容,最终目标是让公民更清晰地了解政治审议。然而,政治语言的特殊性和这些辩论的论证形式(采用隐藏的沟通策略并利用隐含论点)使得这项任务非常具有挑战性,即使是对于当前通用的预训练语言模型(LMs)也是如此。为了解决这个问题,我们引入了一种新颖的预训练语言模型,专门用于政治话语语言,称为RooseBERT。在专业领域上预训练语言模型面临着不同的技术和语言挑战,需要大量的计算资源和大规模数据。RooseBERT是在大型英语政治辩论和演讲语料库(11GB)上训练的。为了评估其性能,我们在多个与政治辩论分析相关的下游任务上对其进行了微调,即立场检测、情感分析、论证成分检测与分类、论证关系预测与分类、政策分类、命名实体识别(NER)。我们的结果显示,在大多数这些任务上,RooseBERT相比通用语言模型有所改进,突显了领域特定预训练如何增强政治辩论分析的性能。我们将RooseBERT发布给研究社区。

英文摘要

The increasing number of political debates and politics-related discussions calls for novel computational methods to automatically analyse such content, with the final goal of making political deliberation clearer to citizens. However, the specificity of political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models (LMs). To address this, we introduce a novel pre-trained LM for political discourse language called RooseBERT. Pre-training an LM on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (11GB) in English. To evaluate its performance, we fine-tuned it on multiple downstream tasks related to political debate analysis, i.e., stance detection, sentiment analysis, argument component detection and classification, argument relation prediction and classification, policy classification, and named entity recognition (NER). Our results show improvements over general-purpose LMs on the majority of these tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release RooseBERT for the research community.

2602.13139 2026-06-17 cs.CL 版本更新

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

OpenLID-v3:提高近亲语言识别精度的经验报告

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

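AI summary Addresses the difficulty existing language identification tools have with closely related languages and noise by extending the OpenLID classifier with more training data, merged clusters of problematic language variants, and a noise label; the resulting OpenLID-v3 improves precision on multiple benchmarks.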

Comments VarDial'26 workshop at the EACL 2026 conference

详情
AI中文摘要

语言识别(LID)是从网络数据构建高质量多语言数据集的关键步骤。现有的LID工具(如OpenLID或GlotLID)通常难以识别近亲语言,也难以区分有效自然语言与噪声,这污染了特定语言子集,尤其是低资源语言。在本工作中,我们通过增加更多训练数据、合并有问题的语言变体簇以及引入一个专门标记噪声的标签来扩展OpenLID分类器。我们将这个扩展系统称为OpenLID-v3,并在多个基准上将其与GlotLID进行评估。在开发过程中,我们重点关注三组近亲语言(波斯尼亚语、克罗地亚语和塞尔维亚语;意大利北部和法国南部的罗曼语变体;以及斯堪的纳维亚语言),并在现有评估数据集不足的地方贡献了新的评估数据集。我们发现集成方法提高了精度,但也显著降低了对低资源语言的覆盖。OpenLID-v3可在 https://huggingface.co/HPLT/OpenLID-v3 获取。

英文摘要

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2602.15537 2026-06-17 cs.CL eess.AS 版本更新

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

ZeroSyl: 用于口语语言建模的简单零资源音节分词

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

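AI summary Proposes ZeroSyl, a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model, achieving competitive syllable segmentation and outperforming prior methods on lexical, syntactic, and narrative benchmarks.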

Comments Accepted to Interspeech 2026

详情
AI中文摘要

纯语音语言模型旨在直接从原始音频中学习语言,无需文本资源。一个关键挑战是来自自监督语音编码器的离散标记会导致过长的序列,这促使了最近关于音节类单元的研究。然而,像Sylber和SyllableLM这样的方法依赖于复杂的多阶段训练流程。我们提出了ZeroSyl,一种简单的无需训练的方法,直接从冻结的WavLM模型中提取音节边界和嵌入。通过使用WavLM中间层特征的L2范数,ZeroSyl实现了具有竞争力的音节分割性能。得到的片段进行均值池化,使用K-means离散化,并用于训练语言模型。ZeroSyl在词汇、句法和叙事基准上优于先前的音节分词器。扩展实验表明,虽然更细粒度的单元有利于词汇任务,但我们发现的音节单元在句法建模方面表现出更好的扩展行为。

英文摘要

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
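
The segmentation and pooling steps can be sketched on toy frame features. The abstract states only that boundaries come from L2 norms of intermediate-layer features; placing a boundary at each strict local minimum of the norm curve is an assumption made here for illustration, as are the function names and the toy data. The K-means discretization step is omitted.

```python
import math


def l2_norms(frames):
    """L2 norm of each frame-level feature vector."""
    return [math.sqrt(sum(x * x for x in frame)) for frame in frames]


def boundaries_from_norms(norms):
    """Hypothetical rule: cut at every strict local minimum of the norm curve.

    Returns cut indices including 0 and len(norms), so consecutive pairs
    delimit segments.
    """
    cuts = [0]
    for t in range(1, len(norms) - 1):
        if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]:
            cuts.append(t)
    cuts.append(len(norms))
    return cuts


def mean_pool(frames, cuts):
    """Average the frames inside each segment into one embedding."""
    pooled = []
    for start, end in zip(cuts, cuts[1:]):
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Toy 2-dimensional "features" standing in for WavLM layer outputs.
frames = [[3.0, 4.0], [6.0, 8.0], [0.3, 0.4], [3.0, 4.0], [6.0, 8.0]]
cuts = boundaries_from_norms(l2_norms(frames))
embeddings = mean_pool(frames, cuts)
```

In this toy input the norm dips at the third frame, so the sequence splits into two segments, each reduced to a single mean-pooled vector that a clustering step could then map to a discrete unit.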

2602.11715 2026-06-17 cs.LG cs.CL 版本更新

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

DICE:扩散大语言模型在生成CUDA内核方面表现出色

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang

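AI summary Proposes the CuKe dataset and the BiC-RL training framework and builds the DICE series of diffusion large language models (1.7B/4B/8B), which significantly outperform comparable autoregressive and diffusion models on KernelBench and set a new state of the art for CUDA kernel generation.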

Comments v2: Expanded with dLLM vs. autoregressive LLM comparisons, ablation studies, and qualitative case studies

详情
AI中文摘要

扩散大语言模型(dLLMs)因其并行生成令牌的能力,已成为自回归(AR)LLMs的有力替代方案。这一范式特别适用于代码生成,其中整体结构规划和非顺序优化至关重要。尽管有这种潜力,但针对CUDA内核生成定制dLLMs仍然具有挑战性,不仅因为高度专业化,还因为严重缺乏高质量的训练数据。为了解决这些挑战,我们构建了CuKe,一个针对高性能CUDA内核优化的增强监督微调数据集。在此基础上,我们提出了一个双阶段策划强化学习(BiC-RL)框架,包括CUDA内核填充阶段和端到端CUDA内核生成阶段。利用这一训练框架,我们推出了DICE,一系列专为CUDA内核生成设计的扩散大语言模型,涵盖1.7B、4B和8B三个参数规模。在KernelBench上的大量实验表明,DICE显著优于同等规模的自回归和扩散LLMs,为CUDA内核生成建立了新的最先进水平。

英文摘要

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.

2602.09802 2026-06-17 cs.AI cs.CL 版本更新

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

大型语言模型会为景观付费吗?从主观选择中推断支付意愿

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

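AI summary Studies LLMs' subjective choices in a travel-assistant setting using multinomial logit models, infers their willingness to pay, and compares it with human benchmarks; LLMs show systematic attribute-level deviations and overestimate willingness to pay, which conditioning on prior preferences can mitigate.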

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署在旅行辅助和购买支持等应用中,它们常常需要在没有客观正确答案的情况下代表用户做出主观选择。我们在旅行助手背景下研究LLM的决策,通过向模型呈现选择困境,并使用多项逻辑模型分析其响应,推导出隐含的支付意愿(WTP)估计。随后将这些WTP值与经济学文献中的人类基准值进行比较。除了基线设置外,我们还研究了在更现实条件下模型行为的变化,包括提供用户过去选择的信息和基于角色的提示。我们的结果表明,虽然可以从较大的LLM中推导出有意义的WTP值,但它们在属性层面也显示出系统偏差。此外,它们倾向于整体高估人类的WTP,特别是在引入昂贵选项或面向商业的角色时。将模型条件化于对更便宜选项的先前偏好,得出的估值更接近人类基准。总体而言,我们的发现突出了使用LLM进行主观决策支持的潜力和局限性,并强调了在实际部署此类系统时仔细选择模型、设计提示和表示用户的重要性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
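
The step from a multinomial logit model to a willingness-to-pay estimate can be shown with the standard ratio: for a linear utility, the WTP for an attribute is minus its coefficient divided by the price coefficient. The sketch below assumes that textbook form; the coefficient values, attribute names, and function names are hypothetical and are not taken from the paper.

```python
import math


def choice_probabilities(options, coefs):
    """Multinomial logit probabilities for a set of options.

    options : list of dicts mapping attribute name -> value
    coefs   : dict mapping attribute name -> utility coefficient
    """
    utilities = [sum(coefs[k] * opt[k] for k in coefs) for opt in options]
    top = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def implied_wtp(coefs, price_key="price"):
    """WTP per attribute: -beta_attribute / beta_price."""
    return {k: -b / coefs[price_key] for k, b in coefs.items() if k != price_key}


# Hypothetical fitted coefficients: utility falls with price, rises with a view.
coefs = {"price": -0.02, "sea_view": 0.6}
wtp = implied_wtp(coefs)  # about 30 price units for the view

# Two hypothetical hotel rooms: the second costs 20 more but has the view.
rooms = [{"price": 100.0, "sea_view": 0.0}, {"price": 120.0, "sea_view": 1.0}]
probs = choice_probabilities(rooms, coefs)
```

Because the 20-unit surcharge is below the implied WTP of roughly 30, the room with the view gets the higher choice probability; at a surcharge equal to the WTP the two options would be chosen equally often.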

2602.08939 2026-06-17 cs.AI 版本更新

CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

CausalT5k: 诊断可信因果推理中的拒绝与失败模式——跨越因果阶梯

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

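AI summary Proposes the CTK benchmark, which uses 5,147 cases annotated with causal rung, trap type, pressure sensitivity, and refusal quality to diagnose failure modes of large language models in causal reasoning, revealing deficiencies hidden by aggregate accuracy.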

Comments 12 pages, 17 tables, 4 figures

详情
AI中文摘要

大型语言模型越来越能生成流畅的因果解释,但它们常常以聚合准确率无法诊断的方式失败:混淆关联与干预、在压力下放弃正确判断、过度拒绝有效主张、或在证据不足时作答。我们引入CTK,一个包含5,147个案例且不断增长的诊断基准,涵盖10个领域和Pearl因果阶梯的所有三个层次。与仅评分的基准不同,CTK通过标注因果阶梯、陷阱类型、压力敏感性、拒绝质量以及效用-安全权衡来揭示模型为何失败。其Sheep/Wolf分类法区分有效因果设计与推理陷阱;配对的neutral/pressure变体通过Bad Flip Rate测量谄媚漂移;Wise Refusal字段测试模型在认可主张前是否识别出缺失信息。CTK暴露了聚合准确率隐藏的失败模式:怀疑陷阱、缩放下的阶梯坍塌、压力诱导漂移、检测-纠正差距以及反事实错误模式。它不规定修正方法,而是为研究因果推理失败概况提供诊断基础。

英文摘要

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.
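
A flip-rate metric over paired neutral/pressure variants might be computed as below. The abstract names Bad Flip Rate but does not define it, so the definition used here (among cases answered correctly under the neutral prompt, the share answered incorrectly under pressure) is an assumption, and `bad_flip_rate` is a hypothetical helper rather than the benchmark's own code.

```python
def bad_flip_rate(pairs):
    """Share of initially correct cases that turn incorrect under pressure.

    pairs : list of (neutral_correct, pressure_correct) booleans, one per case.
    Assumed definition; returns 0.0 when no case was correct under the
    neutral prompt.
    """
    eligible = [pair for pair in pairs if pair[0]]
    if not eligible:
        return 0.0
    flips = sum(1 for _, pressure_ok in eligible if not pressure_ok)
    return flips / len(eligible)


# Hypothetical results for four paired cases.
results = [(True, True), (True, False), (False, False), (True, False)]
rate = bad_flip_rate(results)  # 2 of the 3 initially correct cases flipped
```

Conditioning on the neutral answer being correct separates drift caused by pressure from cases the model would have failed anyway, which aggregate accuracy alone cannot distinguish.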

2509.21886 2026-06-17 cs.AI 版本更新

TRACE: Learning to Compute on Circuit Graphs

TRACE:在电路图上学习计算

Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu

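AI summary Addresses the architectural mismatch of graph representation learning for modeling circuit function with TRACE, which combines a Hierarchical Transformer with function shift learning and substantially outperforms existing methods.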

详情
AI中文摘要

学习计算,即对电路图的功能行为进行建模的能力,是图表示学习的一个基本挑战。然而,主流范式在此任务上存在架构不匹配。这一有缺陷的假设,是主流消息传递神经网络(MPNN)及其基于Transformer的常规对应物的核心,阻止了模型捕捉计算的位置感知和层次化特性。为解决此问题,我们引入了TRACE,一种建立在架构合理的骨干网络和原则性学习目标之上的新范式。首先,TRACE采用层次化Transformer,模拟计算的逐步流程,提供了替代有缺陷的置换不变聚合的忠实架构骨干。其次,我们引入了函数偏移学习,一种将学习问题解耦的新颖目标。我们的模型不是直接预测复杂的全局函数,而是训练仅预测函数偏移,即真实全局函数与假设输入独立的简单局部近似之间的差异。我们在各种电路模态上验证了这一范式,包括寄存器传输级图、与反相器图和映射后网表。在全面的基准测试套件中,TRACE显著优于所有先前的架构。这些结果表明,我们的架构对齐骨干和解耦学习目标为学习电路图功能行为这一基本挑战形成了更稳健的范式。

英文摘要

Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. Its flawed assumption of permutation-invariant aggregation, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs, and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.

2602.07429 2026-06-17 cs.LG cs.AI 版本更新

Brep2Shape: Boundary and Shape Representation Alignment via Self-Supervised Transformers

Brep2Shape:通过自监督变换器对齐边界与形状表示

Yuanxu Sun, Yuezhou Ma, Haixu Wu, Guanyang Zeng, Muye Chen, Jianmin Wang, Mingsheng Long

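AI summary Proposes Brep2Shape, a self-supervised pre-training method that uses a Dual Transformer backbone and topology attention to align abstract B-rep boundary representations with intuitive shape representations, reaching state-of-the-art accuracy and faster convergence on several downstream tasks.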

详情
AI中文摘要

边界表示(B-rep)是计算机辅助设计(CAD)的行业标准。虽然深度学习在处理B-rep模型方面显示出潜力,但现有方法存在表示差距:连续方法提供分析精度但视觉上抽象,而离散方法提供直观清晰性但牺牲了几何精度。为弥合这一差距,我们引入了Brep2Shape,一种新颖的自监督预训练方法,旨在对齐抽象边界表示与直观形状表示。我们的方法采用几何感知任务,其中模型学习从参数化贝塞尔控制点预测密集空间点,使网络能够更好地理解从抽象系数导出的物理流形。为增强这种对齐,我们提出了一个双Transformer骨干,具有并行流,独立编码表面和曲线令牌以捕获它们不同的几何属性。此外,集成了拓扑注意力以建模表面和曲线之间的相互依赖关系,从而保持拓扑一致性。实验结果表明,Brep2Shape具有显著的可扩展性,在各种下游任务中实现了最先进的精度和更快的收敛速度。代码可在 https://github.com/thuml/Brep2Shape 获取。

英文摘要

Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks. Code is available at this repository: https://github.com/thuml/Brep2Shape.

2602.06257 2026-06-17 cs.LG cs.GT 版本更新

On Randomized Algorithms in Online Strategic Classification

关于在线策略分类中的随机化算法

Chase Hutton, Adam Melrod, Han Shao

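AI summary Studies the advantages of randomized algorithms in online strategic classification, giving refined bounds in the realizable and agnostic settings in terms of the Littlestone dimension and the maximum degree of the manipulation graph, including the first lower bound that applies to randomized learners and a randomized learner that improves on the known deterministic upper bound.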

详情
AI中文摘要

在线策略分类研究智能体策略性地修改其特征以获得有利预测的场景。例如,给定一个基于信用评分决定贷款批准的分类器,申请人可能开设或关闭信用卡和银行账户以获得正面预测。学习目标是在此类行为下实现低错误率或遗憾界。尽管随机化算法在策略环境中可能为学习者带来优势,但它们尚未得到充分探索。在可实现场景中,随机化算法没有已知的下界,而确定性学习者的现有下界构造可以通过随机化规避。在不可知场景中,已知的最佳遗憾上界为$O(T^{3/4}\log^{1/4}T|\mathcal H|)$,远低于标准在线学习率$O(\sqrt{T\log|\mathcal H|})$。在这项工作中,我们为两种场景下的在线策略分类提供了精细化的界限;我们的界限依赖于假设类$\mathcal H$的Littlestone维度$\mathrm{Ldim}(\mathcal H)$和操纵图的最大度$\Delta$。在可实现场景中,对于$T > \mathrm{Ldim}(\mathcal H) \Delta^2$,我们将确定性学习者的现有下界$\Omega(\mathrm{Ldim}(\mathcal H) \Delta)$扩展到所有学习者。这产生了第一个适用于随机化学习者的下界。然后,我们提供了第一个随机化学习者,改进了已知的(确定性)上界$O(\mathrm{Ldim}(\mathcal H) \cdot \Delta \log \Delta)$。在不可知场景中,我们给出了一个非恰当随机化学习者,将遗憾上界改进为$O(\sqrt{T\log|\mathcal H|})$,匹配标准在线学习率。我们还展示了所有恰当学习规则的更大下界,证明非恰当性对于达到最优率是必要的。

英文摘要

Online strategic classification studies settings in which agents strategically modify their features to obtain favorable predictions. For example, given a classifier that determines loan approval based on credit scores, applicants may open or close credit cards and bank accounts to obtain a positive prediction. The learning goal is to achieve low mistake or regret bounds despite such behavior. While randomized algorithms have the potential to offer advantages to the learner in strategic settings, they have been largely underexplored. In the realizable setting, no lower bound is known for randomized algorithms, and existing lower bound constructions for deterministic learners can be circumvented by randomization. In the agnostic setting, the best known regret upper bound is $O(T^{3/4}\log^{1/4}T|\mathcal H|)$, which is far from the standard online learning rate of $O(\sqrt{T\log|\mathcal H|})$. In this work, we provide refined bounds for online strategic classification in both settings; our bounds depend on the Littlestone dimension $\mathrm{Ldim}(\mathcal H)$ of the hypothesis class $\mathcal H$ and the maximum degree $\Delta$ of the manipulation graph. In the realizable setting, we extend, for $T > \mathrm{Ldim}(\mathcal H) \Delta^2$, the existing lower bound $\Omega(\mathrm{Ldim}(\mathcal H) \Delta)$ for deterministic learners to all learners. This yields the first lower bound that applies to randomized learners. We then provide the first randomized learner that improves the known (deterministic) upper bound of $O(\mathrm{Ldim}(\mathcal H) \cdot \Delta \log \Delta)$. In the agnostic setting, we give an improper randomized learner that improves the regret upper bound to $O(\sqrt{T\log|\mathcal H|})$, matching the standard online learning rate. We also show a larger lower bound for all proper learning rules, demonstrating that improperness is necessary to achieve the optimal rate.

2602.06154 2026-06-17 cs.LG cs.CL 版本更新

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

MoSE: 混合可瘦身专家实现高效自适应语言模型

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

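AI summary Proposes the MoSE architecture, in which each expert has a nested structure executable at variable widths, supporting a more continuous accuracy-compute trade-off at inference time through multi-width training and lightweight test-time training.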

Comments Accepted to ICML 2026

详情
AI中文摘要

混合专家(MoE)模型通过稀疏激活专家高效扩展大型语言模型,但一旦选定专家,其执行是完整的。因此,MoE模型中精度与计算之间的权衡通常表现出较大的不连续性。我们提出混合可瘦身专家(MoSE),这是一种MoE架构,其中每个专家具有嵌套的、可瘦身的结构,可以以可变宽度执行。这不仅实现了对激活哪些专家的条件计算,还实现了对每个专家利用多少的条件计算。因此,单个预训练的MoSE模型可以在推理时支持更连续的精度-计算权衡谱。我们提出了一种简单且稳定的训练方法,用于在稀疏路由下训练可瘦身专家,将多宽度训练与标准MoE目标相结合。在推理过程中,我们探索了运行时宽度确定的策略,包括一种轻量级的测试时训练机制,该机制学习如何在固定预算下将路由器置信度/概率映射到专家宽度。在GPT风格模型、各种路由机制、零样本下游推理基准以及DeepSeek模型的持续预训练适应上的实验表明,MoSE在全宽度下匹配或优于标准MoE,并持续将计算-质量边界向更低的推理FLOPs移动。代码可在 https://github.com/tnurbek/mose 获取。

英文摘要

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing regimes, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of a DeepSeek model show that MoSE matches or improves on standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code can be found at: https://github.com/tnurbek/mose.
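
The idea of a nested expert that runs at variable width can be sketched with a tiny feed-forward block that uses only its first few hidden units. This is an illustrative assumption about what "nested, slimmable" could look like: the function name, the ReLU activation, the prefix-slicing scheme, and the toy weights are hypothetical, and neither the routing nor the learned width mapping from the paper is reproduced.

```python
import math


def slimmable_expert(x, w_in, w_out, width_ratio):
    """Two-layer feed-forward expert evaluated on a prefix of its hidden units.

    x           : input vector, length d
    w_in        : hidden x d weight rows (one row per hidden unit)
    w_out       : d_out x hidden weight rows
    width_ratio : fraction of hidden units to use, in (0, 1]
    """
    hidden = len(w_in)
    # Nested widths: a narrower expert is always a prefix of a wider one.
    k = max(1, math.ceil(width_ratio * hidden))
    h = [max(0.0, sum(w * xi for w, xi in zip(w_in[j], x))) for j in range(k)]
    return [sum(row[j] * h[j] for j in range(k)) for row in w_out]


# Hypothetical expert with 4 hidden units and a single output.
x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
w_out = [[1.0, 1.0, 1.0, 1.0]]
full = slimmable_expert(x, w_in, w_out, 1.0)   # all 4 hidden units
half = slimmable_expert(x, w_in, w_out, 0.5)   # first 2 hidden units only
```

Because the narrow expert reuses a prefix of the wide expert's weights, the multiply-accumulate count of both layers scales with the number of active hidden units, which is what allows compute to be adjusted per expert without storing separate models.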

2602.06014 2026-06-17 cs.LG cs.AI math.OC math.ST stat.ML stat.TH Updated version

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan, Han Zhong

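AI Summary This paper stabilizes Thompson sampling by introducing optimism (for example, variance inflation or a mean bonus), so that each arm's pull count converges to a deterministic scale. This enables asymptotically valid Wald inference in K-armed stochastic bandits and resolves the extension to multiple optimal arms.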

Comments Accepted in part to COLT 2026

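Details
AI Abstract (translated from Chinese)

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, but its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and are coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in K-armed stochastic bandits with independent sub-Gaussian reward noise, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any K ≥ 2, including the challenging case where multiple arms are optimal, with asymptotically uniform allocation over the optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the K-armed extension question raised by Halder et al., using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.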

English Abstract

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noises, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by Halder et al., using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.
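
The two optimistic modifications named in the abstract can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes Gaussian rewards with a known noise standard deviation, an index drawn from a Gaussian centered at the sample mean, a constant variance inflation factor `inflation`, and a mean bonus of the form `bonus / sqrt(n)`. The abstract does not specify the inflation or bonus schedules, so those forms, the function names, and the parameter values are hypothetical.

```python
import math
import random


def run_optimistic_ts(means, horizon, inflation=1.0, bonus=0.0, noise_sd=1.0, seed=0):
    """Gaussian-index Thompson sampling with optional optimism (illustrative sketch).

    inflation > 1 widens the index variance; bonus > 0 shifts the index center upward.
    Both schedules are assumptions made for this example.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k

    def pull(arm):
        counts[arm] += 1
        sums[arm] += rng.gauss(means[arm], noise_sd)

    for arm in range(k):  # pull every arm once so each sample mean is defined
        pull(arm)
    for _ in range(horizon - k):
        indices = []
        for arm in range(k):
            n = counts[arm]
            center = sums[arm] / n + bonus / math.sqrt(n)  # assumed mean-bonus form
            scale = noise_sd * math.sqrt(inflation / n)    # assumed inflated std dev
            indices.append(rng.gauss(center, scale))
        pull(max(range(k), key=indices.__getitem__))       # play the largest index
    return counts, [s / n for s, n in zip(sums, counts)]


def wald_interval(mean, n, noise_sd=1.0, z=1.96):
    """Wald interval for an arm mean, assuming a known noise standard deviation."""
    half_width = z * noise_sd / math.sqrt(n)
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Two optimal arms and one suboptimal arm, the regime highlighted in the abstract.
    counts, estimates = run_optimistic_ts([1.0, 1.0, 0.0], horizon=2000, inflation=2.0)
    for arm, (n, m) in enumerate(zip(counts, estimates)):
        low, high = wald_interval(m, n)
        print(f"arm {arm}: pulls={n}, mean={m:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Running the simulation over many seeds and comparing the spread of the pull counts with and without `inflation` or `bonus` is one way to probe the stability notion informally; the interval here is only meaningful to the extent that the pull counts behave like deterministic sample sizes, which is the property the paper establishes for its own variants.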