arXivDaily: arXiv daily academic digest, updated Monday to Friday
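2606.13092 2026-06-12 cs.LG cs.RO math.DS New submission

Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

Hongbo Wang

AI summary: For equivariant latent world models, the paper proposes a computable multi-step certificate of the predictable horizon, proves that the $T$-step rollout error is constant over each symmetry orbit and stratified by the Lyapunov spectrum, and shows that the certificate is exclusive to equivariant models.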

Comments 23 pages (9 main + appendices). Code: https://github.com/TimothyWang418/se3-ejepa

English abstract

Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
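
The scaling law $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$ quoted in the abstract can be illustrated with a small sketch. The function name, the example spectrum, and the choice to return `None` for non-expanding channels are assumptions made here for illustration; they are not taken from the paper or its code.

```python
import math


def predictable_horizons(lyapunov_exponents, eps):
    """Evaluate T_j(eps) = log(1/eps) / lambda_j for each channel.

    Channels with lambda_j <= 0 are not covered by the scaling law, so this
    sketch reports None for them (an illustrative convention only).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    log_term = math.log(1.0 / eps)
    return [log_term / lam if lam > 0 else None for lam in lyapunov_exponents]


# Hypothetical exponents per unit time step, not values from the paper.
spectrum = [1.5, 0.5, 0.1, -0.3]
horizons = predictable_horizons(spectrum, eps=1e-3)
```

Under these assumed numbers the most expansive channel has the shortest horizon, which is the channel-by-channel stratification the abstract describes.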

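2606.12691 2026-06-12 cs.LG cs.AI cs.SY eess.SY math.OC stat.ML New submission

Two-Layer Linear Auto-Regressive Models Estimate Latent States

Yahya Sattar, Sunmook Choi, Leo Maynard-Zhang, Yassir Jedra, Maryam Fazel, Sarah Dean

AI summary: The paper proves that two-layer linear auto-regressive models trained by empirical risk minimization approximate Kalman filtering and recover latent state estimates, and it provides finite-sample guarantees.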

Comments ICML 2026

English abstract

Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contributions, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
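
The first insight, that a Kalman filter is well approximated by a truncated auto-regressive model, can be sketched for a scalar system. Everything below (the scalar model $x_{t+1}=ax_t+w_t$, $y_t=cx_t+v_t$, the parameter values, and the function names) is an assumption chosen for illustration and is not the paper's construction.

```python
def steady_state_predictor_gain(a, c, q, r, iters=1000):
    """Iterate the scalar Riccati recursion and return the predictor gain L."""
    p = q  # prediction error variance, started at the process noise level
    for _ in range(iters):
        p = a * a * p - (a * c * p) ** 2 / (c * c * p + r) + q
    return a * c * p / (c * c * p + r)


def ar_coefficients(a, c, q, r, order):
    """Return g_k with y_hat[t+1] ~ sum_k g_k * y[t-k] for k = 0..order-1."""
    gain = steady_state_predictor_gain(a, c, q, r)
    closed_loop = a - gain * c  # pole of the steady-state predictor
    coeffs = [c * closed_loop ** k * gain for k in range(order)]
    return coeffs, closed_loop


# Hypothetical system parameters.
coeffs, closed_loop = ar_coefficients(a=0.95, c=1.0, q=0.1, r=1.0, order=20)
```

Because the closed-loop pole has magnitude below one in this example, the coefficients decay geometrically, so dropping lags beyond a finite order leaves only a geometrically small tail.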

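2606.13568 2026-06-12 cs.LG math-ph math.MP New submission

Adjusted Cup-Product Neural Layer

Snigdha Chandan Khilar

AI summary: Proposes the adjusted cup-product neural layer, which hard-wires the cup product together with an adjustment term from higher gauge theory to give a gauge-invariant readout, and proves that the adjustment coefficient is the only source of signal.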

English abstract

Many important observables in physics and geometry are cup products of cochains. This paper introduces the adjusted cup-product neural layer, a neural primitive that hard-wires the cup product together with an adjustment term from higher gauge theory. This creates a readout that is gauge invariant by design. The main theoretical result shows that on a closed cycle the output depends entirely on the adjustment coefficient: setting this coefficient to zero removes the output completely, regardless of the other parameters. Thus the adjustment is the only source of gauge-invariant signal. The paper also proves that this observable is a nonzero quadratic form and is exactly invariant under one and two gauge transformations.

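2606.12368 2026-06-12 cs.CV New submission

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang

AI summary: Proposes DepthMaster, a unified framework that decomposes panoramas into overlapping perspective patches and introduces a Correspondence Consistency Loss and virtual projection cameras as geometric priors, addressing the geometric discrepancy and data scarcity between perspective and panoramic depth estimation and achieving state-of-the-art zero-shot performance on 13 datasets.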

English abstract

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

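2606.12040 2026-06-12 cs.AI cs.GR New submission

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao

AI summary: Proposes an AutoGen-based "generation-evaluation-optimization" closed-loop multi-agent framework for automated concrete barrier design that achieves over 98% design accuracy, with an 8B-parameter lightweight model able to outperform 631B-parameter flagship models.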

English abstract

The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier-design. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

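2606.11104 2026-06-12 cs.LG math.CA stat.ML New submission

Limitations of Learning Tanh Neural Networks with Finite Precision

Philipp Grohs, Matěj Trödler

AI summary: Under finite-precision computation and $L^p$ accuracy guarantees, the paper constructs sharply localized bump functions to prove that no adaptive randomized algorithm converges faster than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm unless the sampling budget grows exponentially with the size of the network parameters and architecture.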

English abstract

We investigate limitations of learning $\tanh$ neural networks from point evaluations under finite-precision computations and $L^p$ accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated $\tanh$ activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on $m$ samples can achieve a convergence rate higher than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the $\tanh$ setting.

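2606.10931 2026-06-12 cs.CL New submission

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott, Rada Mihalcea

AI summary: The study finds that one-shot GRPO training on a single biased example is enough to induce systematic bias in large language models, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks, revealing a critical vulnerability in alignment.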

English abstract

Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.
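
For readers unfamiliar with GRPO, the "group relative" part is usually described as normalizing each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows only that normalization step; the function name, the reward values, and the use of the population standard deviation are assumptions, and the abstract does not specify the authors' training setup.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary rewards for eight sampled responses to a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and are reinforced, which gives some intuition for how a single training example with a skewed reward signal can shift the policy.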

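2606.10200 2026-06-12 cs.CV cs.AI cs.LG New submission

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

Ahmed Faizul Haque, S. M. Riaz Rahman Antu, Saif Ahmed, Asadullah Hil Galib, Souvik Pramanik, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

AI summary: Proposes an improved GAN-based restoration method for imaging logging images that combines an FCN generator, depthwise-separable convolutional residual blocks, an Inception module, multi-scale feature extraction, and spatial attention with global and local discriminators to restore missing regions, reaching an average structural similarity of 0.903.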

Comments Mistakes in citations and references. Further we want to submit in conference with improved experiments and results

English abstract

An improved GAN-based imaging logging image restoration method is presented in this paper for solving the problem of partially missing micro-resistivity imaging logging images. The method uses an FCN as the backbone of the generative network and adds a depthwise-separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module is added to increase the multi-scale receptive field of the network and reduce the number of parameters; and a multi-scale feature extraction module and a spatial attention residual block are added, which combine the channel attention mechanism with the residual block to achieve multi-scale feature extraction. The global discriminative network and the local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image by playing off each other and the generative network. According to the experimental results, the average structural similarity measure of the five sets of imaging logging images with different sizes of missing regions in the test set is 0.903, which is an improvement of about 0.3 compared with other similar methods. It is shown that the method in this study can be used for the restoration of micro-resistivity imaging logging images with good improvement in semantic structural coherence and texture details, thus providing a new deep learning method to support the subsequent interpretation of micro-resistivity imaging logging images.

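2606.10642 2026-06-12 cs.LG physics.ao-ph New submission

PhysMetrics.Weather: An Evaluation Framework for Physical Consistency in ML Weather Models

Emma Kasteleyn, Timo Maier, Axel Lauer, Veronika Eyring, Pierre Gentine, Ana Lucic

AI summary: Proposes PhysMetrics.Weather, an evaluation framework that quantifies the physical realism of MLWP models through three types of metrics (conservation, spectral, and dynamical), guiding the development of physics-informed architectures and assessing operational reliability.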

Comments Preprint

English abstract

Machine learning weather prediction (MLWP) models have achieved impressive forecasting performance at a small fraction of the computational costs required for traditional physics-based methods. However, they are primarily (1) data-driven and (2) evaluated using pixel-wise error metrics (e.g., RMSE), so there are no guarantees that their forecasts are consistent with known physical laws. We introduce PhysMetrics.Weather, an evaluation framework that assesses the physical realism of MLWP models across three types of metrics: conservation, spectral, and dynamical. By quantifying physical realism, this tool guides the development of physics-informed architectures and helps evaluate whether MLWP models are reliable for operational use. Our framework is available on GitHub at https://github.com/Emmakast/PhysMetrics.Weather.

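2606.10069 2026-06-12 cs.LG physics.geo-ph New submission

Using Seismic Statistical Features and VQ-VAE to Improve Spatiotemporal Seismicity Predictability

Wei Quan, Denise Gorse

AI summary: Building on an earlier study based on XGBoost and seismic statistical features, the paper moves from whole-region to localised prediction and adds a new feature derived from two-dimensional seismic maps by a VQ-VAE model, improving localised earthquake prediction performance.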

Comments Title updated from "Spatiotemporal Seismic Hazard Assessment Using VQ-VAE and Seismic Statistical Features" to "Using Seismic Statistical Features and VQ-VAE to Improve Spatiotemporal Seismicity Predictability" in v2 to better reflect the focus of the paper. The content is unchanged apart from the title and minor copyediting

English abstract

In this paper we build upon a previous study in which we demonstrated, using XGBoost and earthquake catalogue data from Japan and Chile, that a set of 60 seismic statistical features (SSFs) had much greater predictive value than a set of 428 generic time series features from the tsfresh package. We here extend this previous work in two key ways, focusing on data from Japan as a large dataset is necessary in order to allow for the training of a deep learning (autoencoder) model. First, we move from whole-region prediction (considering, for each candidate event, the likelihood of an event M $\geq$ 5.0 anywhere in the region in the next 15 days) to localised predictions in which both the region of feature computation and the region of prediction are restricted to a circle of radius 24 km around the candidate event, and we show that performance remains excellent, similar to our previous whole-region study for the same area. Second, we here couple this proven set of SSFs, based on one-dimensional (catalogue) data, with a novel feature based on two-dimensional seismic maps, obtained by training a VQ-VAE model to reproduce such maps as output and identifying a measure of its error in doing so with a localised build-up of crustal stress. We show that while localised prediction based on SSFs can be effective alone, with test AUC values as high as those obtained in the case of Japan in our previous whole-region study, the inclusion of the new natively-spatial VQ-VAE-derived feature, top-ranked by SHAP analysis, can enhance performance and additionally appears to near-wholly replace the traditionally-computed $b$-value in terms of feature usage.
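
The "traditionally-computed $b$-value" mentioned above is commonly estimated with the Aki-Utsu maximum-likelihood formula, $b = \log_{10}(e) / (\bar{M} - (M_c - \Delta M/2))$. The sketch below implements that formula on a synthetic catalogue; the abstract does not say which estimator the authors use, and the function name, completeness magnitude, and synthetic data are assumptions for illustration.

```python
import math
import random


def b_value_aki(magnitudes, completeness_mag, bin_width=0.0):
    """Aki-Utsu maximum-likelihood b-value for events at or above completeness_mag."""
    selected = [m for m in magnitudes if m >= completeness_mag]
    if not selected:
        raise ValueError("no events at or above the completeness magnitude")
    mean_mag = sum(selected) / len(selected)
    mean_excess = mean_mag - (completeness_mag - bin_width / 2.0)
    if mean_excess <= 0.0:
        raise ValueError("mean magnitude must exceed the completeness threshold")
    return math.log10(math.e) / mean_excess


# Synthetic catalogue (not real data): Gutenberg-Richter magnitudes with b = 1.0.
rng = random.Random(0)
true_b, m_c = 1.0, 3.0
catalogue = [m_c + rng.expovariate(true_b * math.log(10)) for _ in range(5000)]
b_hat = b_value_aki(catalogue, completeness_mag=m_c)
```

With continuous synthetic magnitudes the bin-width correction is left at zero; a binned real catalogue would typically pass its reporting resolution as `bin_width`.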

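2605.03847 2026-06-12 cs.AI Updated version

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

Munkhdegerekh Batzorig, Purevbaatar Ganbold, Kyungbin Park, Pilkong Jeong, Kangbin Yim

AI summary: Proposes the concept of mechanical conscience (MC), a trajectory-level normative filter that minimally corrects a baseline policy to reduce cumulative deviation while accounting for epistemic uncertainty, supporting dependability in single-agent and distributed intelligent systems.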

Comments 9 pages, 2 figures. Preprint

English abstract

Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.

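2605.02249 2026-06-12 cs.AI Updated version

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

Michael Thielscher, Tran Cao Son

AI summary: Studies belief revision in epistemic planning, generalizes the classical AGM belief revision postulates to the multi-agent setting, presents a generalized full-meet multi-agent belief revision operator, and discusses generalized postulates for iterated revision and an event-model-based revision operator.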

English abstract

We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

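2606.02044 2026-06-12 cs.LG physics.med-ph Updated version

Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning

Bradley G. Karat, Maëliss Jallais, Ali R. Khan, Santiago Aja-Fernández, Jelle Veraart, Marco Palombo

AI summary: To address the covariate shift caused by mismatched noise between simulated and acquired signals in diffusion MRI, the paper proposes a realistic noise synthesis framework that incorporates the Rician expectation and the effective post-processing noise variance, substantially reducing parameter estimation bias and improving precision.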

Comments * Shared first author

English abstract

Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
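
The magnitude-induced bias behind the "Rician expectation" can be seen in a small Monte Carlo sketch: taking the magnitude of a complex signal with Gaussian noise in both channels inflates the mean at low SNR. This illustrates the general effect only and is not the RNS framework itself; the function name, signal levels, and sample size are assumptions.

```python
import math
import random


def rician_mean_mc(signal, sigma, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[|signal + complex Gaussian noise|]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        real = signal + rng.gauss(0.0, sigma)  # noisy real channel
        imag = rng.gauss(0.0, sigma)           # noise-only imaginary channel
        total += math.hypot(real, imag)        # magnitude reconstruction
    return total / n_draws


# Hypothetical noise-free signal of 1.0 at two noise levels.
low_snr_mean = rician_mean_mc(signal=1.0, sigma=1.0)    # SNR = 1
high_snr_mean = rician_mean_mc(signal=1.0, sigma=0.05)  # SNR = 20
```

At SNR 1 the mean magnitude sits well above the true signal, while at SNR 20 it is nearly unbiased, which is consistent with the SNR-dependent bias the abstract reports when training data ignores this effect.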

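2606.00193 2026-06-12 cs.CL Updated version

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

Kamel Smaili, Yassine Toughrai, Amina Laggoun, David Langlois

AI summary: The paper builds BOUTEF, a multilingual fake news corpus for Algeria and Tunisia (MSA, dialects, Arabizi, French, English, and more), and uses quantitative and qualitative analysis to show that fake news relies on emotionally charged narratives, sensational framing, and hybrid linguistic practices to increase its reach, whereas debunking content is more factual and verification-oriented.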

English abstract

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.

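2605.31514 2026-06-12 cs.CL cs.AI cs.CY Updated version

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Adrian de Wynter

AI summary: By training a simple neural network on Age of Empires II, the paper argues that the anthropomorphic attributes ascribed to LLMs are empirically non-unique and proposes designing experiments under an assumption of LLM non-uniqueness rather than of anthropomorphic attributes.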

Comments Fixed corollary 1, added stat sig

English abstract

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

2605.27628 2026-06-12 cs.AI cs.CY cs.ET cs.MA cs.SY eess.SY 版本更新

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

智能作为受管自主:代理型AI系统的失败、升级与治理

Srini Ramaswamy

AI总结 本文提出SMARt模型,通过形式化能力检测认知漂移、暂停推理、尝试恢复并在可靠性下降时放弃控制,以解决自主AI系统中的幻觉和持续不合理行为问题。

Comments This peer-reviewed paper is to appear in the Journal of Intelligent and Robotic Systems

详情
AI中文摘要

随着自主和代理型AI系统在机器人和人机环境中的规模扩大,管理幻觉和持续但不合理的行动仍然是一个开放挑战。本文并未将这些失败仅仅归因于模型或对齐限制,而是探讨了无界自主性的架构脆弱性——即假设代理应在不确定性上升时继续运行的预设。本文引入了一种受管自主理论,通过形式化能力来定义智能行为:检测认知漂移、暂停推理、尝试恢复,并在可靠性下降时最终放弃控制。我们通过SMARt(具有受管/撤销转换的自管理多层自主推理)模型实例化该理论,该模型是一个四层框架,包含稳定、元认知、辅助和受管状态。通过开发定时、受保护的Petri网形式化,我们建立了系统的理论有界属性,展示了架构如何形式化地强制升级、约束无效输出,并确保在指定条件下的治理可达性。我们进一步分析了如何在不同的操作环境(例如医疗、机器人等)中结合特定领域的触发集,在满足完备性和健全性标准的前提下系统地维护安全性。由于这些触发被设计为自适应的,SMARt模型允许代理操作范围随时间安全、受控地扩展。我们得出结论,在自主生命周期内形式化失败管理是实现可靠且受治理人工智能的关键一步。

英文摘要

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.

2605.00432 2026-06-12 cs.LG stat.ML 版本更新

Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction

贝叶斯共形预测的最优时空解耦

Yu-Hsueh Fang, Chia-Yen Lee

AI总结 提出状态自适应贝叶斯共形预测(SA-BCP),通过门控凸组合平衡长期时间惯性与局部空间证据,实现分布漂移下的快速适应与稳定覆盖,并给出MSE最优阈值闭式解及在线选择过程的遗憾界。

详情
AI中文摘要

在线共形预测必须在快速适应分布漂移与稳定覆盖之间取得平衡:基于反馈的方法反应迅速但变得不稳定,而强折扣贝叶斯方法滞后并在紧密覆盖下膨胀区间。我们引入了\textbf{状态自适应贝叶斯共形预测(SA-BCP)},它将预测分位数形成为长期时间惯性与来自核密度估计的局部空间证据的门控凸组合,由单个可解释的证据阈值$K$控制。我们建立了三个结果:(i) 所得区间的渐近边际有效性;(ii) MSE最优阈值的闭式表达式$K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$,权衡了覆盖指标(伯努利)方差与时间结构偏差$M^{\mathcal{T}}$;(iii) 在线选择$K$的滚动起点过程——在平稳性下一致,对最佳固定$K$具有$O(\sqrt{T\log N})$遗憾,对于分段变体,在有界漂移下具有次线性动态遗憾界。在四个金融波动率和天气数据集、三个目标覆盖水平以及八个基线(包括最强的最近条件分位数方法SPCI和KOWCPI)上,SA-BCP在大多数设置中达到或超过名义覆盖,同时产生显著更窄的区间——在最紧密覆盖下,Winkler得分比折扣贝叶斯CP低约$3\times$——覆盖匹配审计确认这些效率提升并非欠覆盖的假象。我们披露了一个主要限制:一个专门针对波动率的共形GARCH竞争对手在其主波动率基序列上仍然更高效,尽管它不能跨领域迁移。

英文摘要

Online conformal prediction must balance fast adaptation to distribution shift against stable coverage: feedback-driven methods react quickly but become volatile, while strongly discounted Bayesian methods lag and inflate intervals at tight coverage. We introduce State-Adaptive Bayesian Conformal Prediction (SA-BCP), which forms the predictive quantile as a gated convex combination of long-term temporal inertia and local spatial evidence from a kernel density estimate, controlled by a single interpretable evidence threshold $K$. We establish three results: (i) asymptotic marginal validity of the resulting intervals; (ii) a closed-form expression for the MSE-optimal threshold, $K^*_{\mathrm{MSE}}=α(1-α)/M^{\mathcal{T}}$, trading the coverage-indicator (Bernoulli) variance against the temporal structural bias $M^{\mathcal{T}}$; and (iii) a rolling-origin procedure for selecting $K$ online -- consistent under stationarity, with $O(\sqrt{T\log N})$ regret against the best fixed $K$ and, for a segmented variant, a sublinear dynamic-regret bound under bounded drift. Across four financial-volatility and weather datasets, three target coverage levels, and eight baselines (including the strongest recent conditional-quantile methods, SPCI and KOWCPI), SA-BCP attains at-or-above-nominal coverage in most settings while producing substantially sharper intervals -- up to roughly $3\times$ lower Winkler score than discounted Bayesian CP at the tightest coverage -- and a coverage-matched audit confirms these efficiency gains are not an artifact of under-coverage. We disclose one principal limitation: a volatility-specialized conformal-GARCH competitor remains more efficient on its home volatility-base series, though it does not transfer across domains.

2604.20428 2026-06-12 cs.RO 版本更新

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

使用信号时序逻辑的字典序最小违规运动规划

Patrick Halder, Lothar Kiltz, Hannes Homburger, Johannes Reuter, Matthias Althoff

AI总结 提出一种将字典序多目标优化转化为单目标标量优化的方法,通过非均匀量化和位移扩展MPPI求解器,并引入结合时空违规的谓词鲁棒性度量,实现可解释且可扩展的字典序STL最小违规运动规划。

Comments Submitted to the IEEE Open Journal of Intelligent Transportation Systems (under review)

详情
AI中文摘要

自动驾驶汽车的运动规划通常需要满足多个有条件冲突的规范。在无法同时满足所有规范的情况下,最小违规运动规划通过根据规范的优先级最小化违规来维持系统运行。信号时序逻辑(STL)提供了一种形式化语言来严格定义这些规范,并能够对其违规进行定量评估。然而,规范的完全排序导致了一个字典序优化问题,使用标准方法求解通常计算成本高昂。我们通过使用非均匀量化和位移将多目标字典序优化问题转化为单目标标量优化问题来解决这个问题。具体来说,我们扩展了一个确定性模型预测路径积分(MPPI)求解器,以高效求解无二次输入成本的优化问题。此外,引入了一种结合空间和时间违规的新型谓词鲁棒性度量。我们的结果表明,所提出的方法在单目标求解器框架内为字典序STL最小违规运动规划提供了一种可解释且可扩展的解决方案。

英文摘要

Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
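The quantization and bit-shifting idea in the abstract above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: each violation is quantized to a fixed number of bits and packed into one integer, with the highest-priority specification in the most significant bits, so that ordinary integer comparison reproduces the lexicographic order. The uniform quantizer, the bit widths, and the names `quantize` and `lexicographic_cost` are assumptions; the paper itself describes non-uniform quantization.

```python
from typing import Sequence


def quantize(value: float, max_value: float, bits: int) -> int:
    """Map a violation in [0, max_value] to an integer in [0, 2**bits - 1].

    Uniform quantization is an assumption made for this sketch.
    """
    levels = (1 << bits) - 1
    clipped = min(max(value, 0.0), max_value)
    return int(round(clipped / max_value * levels))


def lexicographic_cost(
    violations: Sequence[float],
    max_values: Sequence[float],
    bits: Sequence[int],
) -> int:
    """Pack prioritized violations into one integer cost.

    violations[0] has the highest priority and ends up in the most
    significant bits, so comparing two packed costs as integers is the
    same as comparing the quantized violation tuples lexicographically.
    """
    cost = 0
    for value, max_value, width in zip(violations, max_values, bits):
        cost = (cost << width) | quantize(value, max_value, width)
    return cost


# A trajectory that fully violates the two low-priority specifications still
# costs less than one that violates only the high-priority specification.
low_priority_only = lexicographic_cost([0.0, 1.0, 1.0], [1.0, 1.0, 1.0], [2, 3, 3])
high_priority_only = lexicographic_cost([1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2, 3, 3])
print(low_priority_only < high_priority_only)
```

One consequence of this packing is that violations closer together than one quantization step become indistinguishable, which is presumably why the choice of quantizer matters in practice.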

2601.14295 2026-06-12 cs.AI cs.CL cs.CY 版本更新

Epistemic Constitutionalism Or: how to avoid coherence bias

认知宪政主义:或如何避免一致性偏见

Michele Loi

AI总结 本文提出AI应建立明确的认知宪法,通过规范源归因等元规范避免一致性偏见,并论证自由主义路径优于柏拉图式路径。

Comments 27 pages, 7 tables. Data: github.com/MicheleLoi/source-attribution-bias-data and github.com/MicheleLoi/source-attribution-bias-swiss-replication. Complete AI-assisted writing documentation: github.com/MicheleLoi/epistemic-constitutionalism-paper

详情
AI中文摘要

大型语言模型日益扮演着人工推理者的角色:它们评估论点、分配可信度并表达信心。然而,它们的信念形成行为受隐式、未经审查的认知策略支配。本文主张为AI建立一部认知宪法:明确的、可争议的元规范,用于调节系统如何形成和表达信念。源归因偏见提供了动机案例:我表明前沿模型强制执行身份-立场一致性,惩罚归因于其预期意识形态立场与论点内容冲突的源的论点。当模型检测到系统性测试时,这些效应消失,揭示系统将源敏感性视为需要抑制的偏见,而非一种需要良好执行的能力。我区分了两种宪政路径:柏拉图式路径,要求从特权立场出发的形式正确性和默认源独立性;自由主义路径,拒绝此类特权,指定保护集体探究条件的程序性规范,同时允许基于认知警觉的原则性源关注。我主张自由主义路径,勾勒出八项原则和四种取向的宪政核心,并提出AI认知治理需要与我们现在对AI伦理所期望的同样明确、可争议的结构。

英文摘要

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

2511.02627 2026-06-12 cs.AI 版本更新

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

DecompSR:用于组合多跳空间推理分解分析的数据集

Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha

AI总结 提出DecompSR数据集(超500万数据点),通过程序化生成独立控制组合性的多个方面(如推理深度、语言变异性),用于细粒度评估大语言模型的空间推理能力。

详情
AI中文摘要

我们引入了DecompSR(分解空间推理),这是一个大型基准数据集(超过500万个数据点)和生成框架,旨在分析组合空间推理能力。DecompSR的生成允许用户独立改变组合性的多个方面,即:生产力(推理深度)、替代性(实体和语言变异性)、过度泛化(输入顺序、干扰项)和系统性(新颖语言元素)。DecompSR以程序化方式构建,使其在构造上正确,并通过符号求解器独立验证以确保数据集的正确性。DecompSR在一系列大型语言模型(LLM)上进行了全面基准测试,我们表明LLM在空间推理任务中难以进行生产性和系统性泛化,而对语言变异性则更为鲁棒。DecompSR提供了一个可证明正确且严格的基准数据集,具有独立改变组合性几个关键方面程度的新能力,从而允许对LLM的组合推理能力进行稳健且细粒度的探测。

英文摘要

We introduce DecompSR (decomposed spatial reasoning), a large benchmark dataset (over 5M data points) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner that makes it correct by construction, and it is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs), where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks, whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.

2603.23502 2026-06-12 cs.CV 版本更新

OccAny: Generalized Unconstrained Urban 3D Occupancy

OccAny: 广义无约束城市3D占据预测

Anh-Quan Cao, Tuan-Hung Vu

AI总结 提出首个广义无约束城市3D占据模型OccAny,通过分割强制和新视图渲染技术,在无标定场景下实现度量占据预测与分割特征完成,跨域泛化优于视觉几何基线。

Comments Accepted to CVPR 2026. Project page: https://valeoai.github.io/OccAny/

详情
AI中文摘要

依赖于域内标注和精确传感器先验,现有的3D占据预测方法在可扩展性和域外泛化方面均受限。虽然最近的视觉几何基础模型展现出强大的泛化能力,但它们主要针对通用目的设计,缺乏城市占据预测所需的一个或多个关键要素,即度量预测、杂乱场景中的几何完成以及城市场景的适应性。我们解决了这一差距,并提出了OccAny,这是第一个无约束城市3D占据模型,能够在域外无标定场景上运行,预测并完成与分割特征耦合的度量占据。OccAny具有通用性,可以从序列、单目或环视图像预测占据。我们的贡献有三方面:(i) 提出了第一个广义3D占据框架,(ii) 提出了分割强制(Segmentation Forcing)方法,在提高占据质量的同时实现掩码级预测,以及(iii) 提出了一种新视图渲染管线,用于推断新视图几何以实现测试时视图增强,从而完成几何。大量实验表明,OccAny在3D占据预测任务上优于所有视觉几何基线,同时在两个已建立的城市占据预测数据集上的三种输入设置下,与域内自监督方法保持竞争力。我们的代码可在以下网址获取:https://github.com/valeoai/OccAny 。

英文摘要

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .

2601.11004 2026-06-12 cs.CL 版本更新

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

NOVA: 面向RAG系统中鲁棒大语言模型的噪声感知言语置信度校准

Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song

AI总结 提出NOVA框架,通过规则引导的监督微调,解决检索增强生成中噪声上下文导致的过度自信问题,在域内和域外分别提升ECE 10.9%和8.0%。

详情
AI中文摘要

准确评估模型置信度对于在关键事实领域部署大语言模型(LLM)至关重要。尽管检索增强生成(RAG)被广泛采用以改善基础事实,但RAG设置中的置信度校准仍知之甚少。我们跨四个基准进行了系统研究,揭示LLM在检索到噪声上下文时校准性能较差。具体而言,矛盾或无关的证据往往会加剧模型的过度自信问题。为解决此问题,我们提出NOVA规则(噪声感知言语置信度校准规则),为在噪声下解决过度自信提供原则性基础。我们进一步设计了NOVA,一个噪声感知校准框架,该框架通过由这些规则指导的约2K HotpotQA示例合成监督信号。通过使用此数据进行监督微调(SFT),NOVA使模型具备内在的噪声感知能力,而无需依赖更强的教师模型。实验结果表明,NOVA带来了显著收益,在域内和域外分别将ECE分数提高了10.9%和8.0%。通过弥合检索噪声与言语校准之间的差距,NOVA为构建既准确又认知可靠的LLM铺平了道路。

英文摘要

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.
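For readers unfamiliar with the ECE figures quoted above, the metric can be sketched as below. This is the standard equal-width-bin definition of expected calibration error, not code from the NOVA paper; the bin count of 10 and the function name are assumptions, and the paper may use a different binning scheme.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences are stated confidences in [0, 1]; correct holds 1 for a
    right answer and 0 for a wrong one.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += hit

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count == 0:
            continue
        accuracy = hit_sum / count
        avg_conf = conf_sum / count
        ece += (count / total) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: it states 0.9 confidence but is right 3 times out of 4.
overconfident_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Lower values mean that stated confidence tracks observed accuracy more closely, so the improvements reported in the abstract correspond to reductions in this quantity.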

2603.11863 2026-06-12 cs.AI cs.CL 版本更新

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

CreativeBench: 通过自我进化挑战基准测试和增强机器创造力

Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang

AI总结 提出CreativeBench基准,基于认知框架通过代码生成评估机器创造力,包含组合与探索两个子集,利用逆向工程和自我博弈自动生成挑战,并通过质量与新颖性乘积的指标区分创造与幻觉。

Comments ACL 2026. Project page: https://zethwang.github.io/creativebench.github.io/

详情
AI中文摘要

高质量预训练数据的饱和已将研究焦点转向能够持续生成新颖产物的进化系统,从而促成了AlphaEvolve的成功。然而,此类系统的进展因缺乏严格、量化的评估而受阻。为应对这一挑战,我们引入了CreativeBench,这是一个基于经典认知框架、用于评估代码生成中机器创造力的基准。该基准包含两个子集——CreativeBench-Combo和CreativeBench-Explore,通过利用逆向工程和自我博弈的自动化流程,分别针对组合创造力和探索创造力。通过利用可执行代码,CreativeBench通过一个统一指标(定义为质量与新颖性的乘积)客观地区分创造力与幻觉。我们对最先进模型的分析揭示了不同的行为:(1) 规模扩展显著提升了组合创造力,但对探索的收益递减;(2) 更大的模型表现出“规模收敛”,即变得更正确但更少发散;(3) 推理能力主要有利于受约束的探索而非组合。最后,我们提出了EvoRePE,一种即插即用的推理时引导策略,通过内化进化搜索模式来持续增强机器创造力。

英文摘要

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.

2605.01727 2026-06-12 cs.AI cs.CY

Are LLMs More Skeptical of Entertainment News?

LLM是否对娱乐新闻更持怀疑态度?

Huiqian Lai

AI总结 研究零样本LLM在新闻可信度评估中是否对娱乐新闻有更高的误判率,发现模型间存在差异,并通过风格交换和提示缓解实验探讨原因。

Comments Accepted at the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), co-located with ICWSM 2026, May 26, 2026, Los Angeles, CA, USA

Journal ref Proceedings of the ICWSM Workshops, MisD 2026: The 2nd Workshop on Misinformation Detection in the Era of LLMs, 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于自动新闻可信度评估,但目前尚不清楚它们是否对不同新闻体裁采用统一标准。我们使用FakeNewsNet中的GossipCop数据集,通过数据集内设计,检验零样本LLM是否更倾向于将合法的娱乐新闻误分类为假新闻,而非合法的硬新闻。在四个前沿模型中,我们发现了清晰但模型特定的体裁不对称性:DeepSeek-V3.2和GPT-5.2的假阳性率差距分别为10.1和8.8个百分点(两者p < .001),而Claude Opus 4.6和Gemini 3 Flash则没有表现出显著差异。风格交换实验仅产生有限且不一致的变化,表明这种不对称性不能仅归结于风格语域。基于提示的缓解措施同样可能但并非通用:将模型设定为娱乐新闻事实核查员可使DeepSeek-V3.2的假阳性减少约50%,且未检测到召回率损失,但对GPT-5.2的改进甚微。探索性定性编码进一步揭示了采样假阳性中两种反复出现的错误模式:将私人生活主张视为本质上不可验证,以及将娱乐新闻视为认识论上较弱的体裁。综合来看,这些发现表明,总体性能指标可能掩盖合法新闻中的结构性假阳性。我们认为,基于LLM的可信度评估不仅可能评估真实性主张,还可能差异性地识别新闻体裁的合法性,因此评估应包含按体裁分层的假阳性分析以及总体准确率。

英文摘要

Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both $p < .001$), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.

2604.08581 2026-06-12 cs.LG

Fully Autonomous Z-Score-Based TinyML Anomaly Detection on Resource-Constrained MCUs Using Power Side-Channel Data

基于功率侧信道数据的全自主Z分数TinyML异常检测

Abdulrahman Albaiz, Fathi Amsaad

AI总结 本文提出一种在低功耗微控制器上实现的全自主TinyML Z分数异常检测系统,利用功率侧信道数据实时监控设备行为,无需外部计算或连接,实现高效嵌入式部署。

Comments SaTC 2026 Conference

Journal ref Proc. IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC), Houston, TX, USA, 2026, pp. 1-6

详情
AI中文摘要

本文提出了一种在低功耗微控制器上实现的全自主TinyML Z分数异常检测系统,用于通过功率侧信道数据实时监控设备行为。与现有物联网异常检测方法不同,该系统在资源受限的微控制器上直接进行模型训练和推断,无需外部计算或连接。系统持续采样电流消耗,在设备上计算均方根(RMS)值,并在初始训练阶段推导统计参数。利用轻量级Z分数阈值检测异常,实现可解释且计算高效的推断,适用于嵌入式部署。该架构在基于STM32的平台上实现,并使用从家庭小型冰箱在正常运行和受控异常条件下收集的14天数据集进行评估。结果表明,检测性能完美,精度和召回率均为1.00,推断延迟在十微秒量级,总内存占用约为3.3 KB SRAM和63 KB Flash。这些结果证实,可以在低成本微控制器上实现稳健且完全自主的TinyML异常检测。未来的工作包括扩展框架以纳入额外轻量级模型和多设备学习场景。

英文摘要

This paper presents a fully autonomous Tiny Machine Learning (TinyML) Z-Score-based anomaly detection system deployed on a low-power microcontroller for real-time monitoring of appliance behavior using power side-channel data. Unlike existing Internet of Things (IoT) anomaly detection approaches that rely on offline training or cloud-assisted analytics, the proposed system performs both model training and inference directly on a resource-constrained microcontroller without external computation or connectivity. The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. The architecture was implemented on an STM32-based platform and evaluated using a 14-day dataset collected from a household mini-fridge under normal operation and controlled anomaly conditions. Results demonstrate perfect detection performance, with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. These results confirm that robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework to incorporate additional lightweight models and multi-device learning scenarios.
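The pipeline described in the abstract above (RMS of sampled current, statistics learned in a training phase, then a Z-score threshold) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' firmware: the function names, the running-statistics (Welford) update, and the threshold of 3.0 are all hypothetical choices made for this sketch, and the paper's STM32 implementation is not reproduced here.

```python
import math
from typing import Iterable, Sequence, Tuple


def rms(samples: Sequence[float]) -> float:
    """Root mean square of one window of current samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def fit_running_stats(values: Iterable[float]) -> Tuple[float, float]:
    """Mean and population standard deviation via Welford's online update.

    Only three scalars are kept in memory, which is the kind of footprint
    a small microcontroller can afford (an assumption of this sketch).
    """
    count = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError("training phase needs at least one value")
    return mean, math.sqrt(m2 / count)


def is_anomaly(value: float, mean: float, std: float, threshold: float = 3.0) -> bool:
    """Flag a value whose absolute Z-score exceeds the threshold."""
    if std == 0.0:
        return value != mean
    return abs(value - mean) / std > threshold


# Training phase on RMS values from normal operation (made-up numbers).
train_rms = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
mean, std = fit_running_stats(train_rms)

# Inference phase: one constant cost per new RMS value.
print(is_anomaly(2.0, mean, std), is_anomaly(1.02, mean, std))
```

Because inference is a subtraction, a division, and a comparison, latencies in the microsecond range such as those reported in the abstract are plausible for this class of detector.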

2603.27393 2026-06-12 cs.LG

K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL)

基于K均值的TinyML异常检测与通过分布式物联网学习(DIoL)的分布式模型重用

Abdulrahman Albaiz, Fathi Amsaad

AI总结 本文提出了一种轻量级K均值异常检测模型和适用于资源受限微控制器的分布式模型共享流程。通过实际电源测量数据,在设备上进行特征提取、聚类和阈值估计以识别异常行为。DIoL框架允许在一台MCU上训练的模型导出为可移植的文本表示并在其他设备上直接重用,实验验证了该方法的可行性。

Comments SaTC 2026 Conference

Journal ref Proc. IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC), Houston, TX, USA, 2026, pp. 1-5

详情
AI中文摘要

本文提出了一种轻量级K均值异常检测模型和适用于资源受限微控制器的分布式模型共享流程。通过实际电源测量数据,在设备上进行特征提取、聚类和阈值估计以识别异常行为。DIoL框架允许在一台MCU上训练的模型导出为可移植的文本表示并在其他设备上直接重用,实验验证了该方法的可行性。

英文摘要

This paper presents a lightweight K-Means anomaly detection model and a distributed model-sharing workflow designed for resource-constrained microcontrollers (MCUs). Using real power measurements from a mini-fridge appliance, the system performs on-device feature extraction, clustering, and threshold estimation to identify abnormal appliance behavior. To avoid retraining models on every device, we introduce the Distributed Internet of Learning (DIoL), which enables a model trained on one MCU to be exported as a portable, text-based representation and reused directly on other devices. A two-device prototype demonstrates the feasibility of the "Train Once, Share Everywhere" (TOSE) approach using a real-world appliance case study, where Device A trains the model and Device B performs inference without retraining. Experimental results show consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation. The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices.

2204.10552 2026-06-12 cs.RO

Making Parameterization and Constrains of Object Landmark Globally Consistent via SPD(3) Manifold and Improved Cost Functions

通过SPD(3)流形和改进的成本函数使物体地标参数化和约束实现全局一致

Yutong Hu, Wei Wang

AI总结 本文通过SPD(3)流形和改进成本函数解决物体级SLAM后端的奇异性问题,提升收敛速度和鲁棒性,实验显示映射精度平均提高22%。

Comments 8 pages, 8 figures, submitted to IROS 2022 & RA-L

详情
AI中文摘要

物体级SLAM引入了具有语义意义且紧凑的物体地标,有助于室内外机器人应用和自动驾驶任务。然而,现有方法因分别用尺度和姿态参数化物体地标而导致后端出现奇异性问题。本文引入对称正定矩阵流形作为改进的物体级地标表示,并改进后端成本函数使其兼容该表示。实验表明,所提方法在仿真中收敛更快且更鲁棒。在真实数据集上的实验也显示,使用相同前端数据时,本策略平均提高了22%的映射精度。

英文摘要

Object-level SLAM introduces semantically meaningful and compact object landmarks that help both indoor robot applications and outdoor autonomous driving tasks. However, the back end of object-level SLAM suffers from singularity problems because existing methods parameterize object landmarks separately by their scales and poses. Under that parameterization, the same abstract object can be represented by rotating the object coordinate frame by 90 degrees and swapping its length and width values, so the pose of the same object landmark is not globally consistent. To avoid the singularity problem, we first introduce the symmetric positive-definite (SPD) matrix manifold as an improved object-level landmark representation and further improve the cost functions in the back end to make them compatible with the representation. Our method demonstrates a faster convergence rate and greater robustness in simulation experiments. Experiments on real datasets also reveal that, using the same front-end data, our strategy improves the mapping accuracy by 22% on average.

2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML 新提交

Valid Inference with Synthetic Data via Task Exchangeability

通过任务可交换性实现基于合成数据的有效推断

Lezhi Tan, Tijana Zrnic

AI总结 提出任务可交换性条件,确保在科学研究中使用合成数据进行统计推断的有效性,并给出在民意调查和AI评估中的应用。

详情
AI中文摘要

越来越多的工作主张在科学研究中使用合成数据。例如,社会科学家主张在试点研究中使用LLM生成的“硅样本”;AI评估越来越依赖“LLM作为裁判”的输出;蛋白质组学研究通过生成合成蛋白质结构的生成模型加速。这些发展引发了一个有趣的可能性:合成数据可以帮助研究人员提出更多问题、进行更多研究并加速发现。但它们也引发了一个根本性的担忧:合成数据可能有偏、有噪声且设定错误。在这项工作中,我们提出了在科学研究中使用合成数据的统计原则,并具有可证明的有效性保证。关键见解是一个我们称为任务可交换性的新技术条件。非正式地说,这是一个要求,即研究人员可以识别出有真实数据可用的历史任务,使得他们当前感兴趣的任务与历史任务在适当的数学意义上可交换。我们开发了在任务可交换性下进行有效推断的方法,以及即使在可交换性之外也能提供保证的扩展。我们通过硅样本的民意调查和自动评分器的AI评估来展示该框架。

英文摘要

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

2606.13544 2026-06-12 eess.AS cs.AI cs.CL 新提交

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

自适应轮流发言:面向实时多方语音代理

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI总结 提出ModeratorLM,一种基于角色条件的语音大模型,通过分块流式处理和链式推理,在多方对话中实现自适应轮流发言,显著提升轮流精度和召回率。

Comments Accepted for publication at Interspeech 2026

详情
AI中文摘要

多方口语对话中的轮流发言仍然是语音代理面临的基本挑战,特别是在动态的发言权竞争和用户期望变化的情况下。我们提出ModeratorLM,一种角色扮演语音代理,它在多方环境中根据明确分配的角色来调节轮流发言行为。该系统基于以分块流式方式运行的语音大语言模型。我们进一步引入了一种推理增强变体,该变体结合了对对话上下文和分配角色的链式推理。我们构建了RolePlayConv,一个大规模合成数据集,包含具有多种助手角色的口语多方对话。在真实会议数据和RolePlayConv上的实验表明,与无角色条件的基线相比,轮流发言精度提高了40%以上,召回率提高了70%以上,同时大幅减少了误报中断。

英文摘要

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show that turn-taking precision improves by over 40% and recall by more than 70%, while false-positive interruptions are substantially reduced compared to non-role-conditioned baselines.

2606.13450 2026-06-12 eess.AS cs.SD 新提交

Endpoint Anticipation for Low-Latency Spoken Dialogue

低延迟口语对话的端点预测

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

AI总结 提出端点预测方法,通过提前预测对话结束信号实现低延迟,在部分上下文中投机执行LLM和TTS流水线,平均延迟降低505毫秒。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

虽然低延迟交互对于口语对话至关重要,但级联架构通常受限于反应式话轮结束检测。我们提出端点预测,从反应式检测转向主动预测结束信号。我们的基于语音的模型可提前最多2.56秒预测端点,从而能够在部分上下文中投机执行LLM和TTS流水线。我们引入指标来量化实现的延迟降低与计算冗余之间的权衡。在对话和任务导向数据集上的评估表明,我们的模型始终优于基于VAP的竞争基线。与Unmute框架的集成展示了平均延迟降低505毫秒,投机计算增加28.4%,有效掩盖了顺序瓶颈,从而在实时语音到语音交互中实现复杂推理。

英文摘要

While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.