arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.19522 2026-06-19 cs.AI New submission

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

Ethan Elio Meidinger, Seowung Leem, Zeyun Zhao, Ruogu Fang

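Affiliations University of Virginia; J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida

AI summary Proposes a differentiable, continuous phenotypic-similarity weighting function that replaces discrete grouping, learning cross-modal alignment and phenotypic structure end to end within contrastive learning to improve AD risk prediction.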

Comments Accepted for publication at MICCAI 2026

Abstract

The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline. Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer's disease (AD). A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning. We propose a continuous formulation of phenotypic structure within contrastive learning. Rather than assigning samples to fixed clusters, we model inter-subject similarity as a differentiable weighting function derived from intra-modality embedding similarities in both retinal images and risk profiles. These weights define soft multi-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk. We further introduce a soft-target contrastive objective that jointly learns cross-modal alignment and phenotypic structure in an end-to-end manner. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group-based contrastive learning and standard vision-language baselines. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population-scale neurodegenerative risk modeling from multi-modal retinal and clinical data.
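
The soft multi-positive idea in this abstract can be illustrated with a small standard-library Python sketch: soft targets are built from intra-modality similarities and the image-to-text logits are scored against them with a cross-entropy. The averaging of the two similarity matrices, the temperatures `tau` and `tau_w`, and every function name here are assumptions made for illustration; the abstract does not specify the paper's aggregation operator or objective, and a real implementation would use an autodiff framework so that gradients also flow through the weights.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_targets(img, txt, tau_w=0.5):
    # Hypothetical aggregation: average the two intra-modality cosine
    # similarities (inputs must be unit-normalised), then softmax each row
    # into a weight distribution over the batch.
    n = len(img)
    return [
        softmax([0.5 * (dot(img[i], img[j]) + dot(txt[i], txt[j])) / tau_w
                 for j in range(n)])
        for i in range(n)
    ]

def soft_contrastive_loss(img, txt, tau=0.1, tau_w=0.5):
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    n = len(img)
    targets = soft_targets(img, txt, tau_w)
    loss = 0.0
    for i in range(n):
        # Image-to-text probabilities for subject i against every text in the batch.
        probs = softmax([dot(img[i], txt[j]) / tau for j in range(n)])
        # Cross-entropy against graded targets instead of a one-hot label.
        loss -= sum(t * math.log(p) for t, p in zip(targets[i], probs))
    return loss / n
```

With hard grouping each target row would be uniform over a fixed group and zero elsewhere; here every subject receives a graded weight, which is the contrast the abstract draws.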

2606.19512 2026-06-19 cs.RO cs.SY eess.SY New submission

Proprioceptive Invariant State Estimation for Humanoid Robots on Non-Inertial Ground

Falak Mandali, Zijian He, Yan Gu

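Affiliations Purdue University

AI summary Proposes a proprioception-only InEKF that uses foot-mounted IMUs and kinematic constraints for real-time state estimation of humanoid robots on non-inertial ground, with a 96% speedup in convergence rate and an 80% reduction in position error.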

Abstract

This paper presents an invariant extended Kalman filtering (InEKF) approach for real-time state estimation of humanoid robots operating on non-inertial ground using only onboard proprioceptive sensing. The proposed approach estimates the robot's base position and velocity relative to the moving ground frame without requiring direct measurements of ground motion or externally mounted sensors. By exploiting kinematic constraints at the stance foot through foot-mounted IMUs, the filter accounts for ground-induced nonlinearities in the process and measurement models while remaining fully proprioceptive. The estimator is formulated to admit a right-invariant measurement model, enabling favorable error dynamics under large initial uncertainties. Observability analysis establishes conditions under which the robot's relative base position and velocity are observable with respect to the non-inertial ground frame. Experiments with the Digit humanoid robot standing and squatting atop a swaying and pitching ground showcase a 96% speedup in convergence rate and an 80% reduction in position estimate errors over existing InEKFs. Walking experiments on a uni-axially rotating ground achieve an average estimation error of less than 9 cm for an initial error of up to 1 m.

2606.19509 2026-06-19 cs.AI New submission

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava

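Affiliations Centific AI Research

AI summary Studies the epistemic uncertainty of large language models on structured clinical data. Cross-model attribution divergence analysis shows that verbalized confidence is vacuous and that an inverse difficulty effect exists; the authors propose an attribution-divergence-based calibration method that improves accuracy and reduces calibration error without training.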

Comments Accepted at EIML@ICML 2026

Abstract

Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.
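
For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. The sketch below uses 10 equal-width bins, which is a common convention and an assumption here; the abstract does not state which binning scheme the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece
```

A model that always reports a confidence of 0.9 while being right half the time lands entirely in one bin and scores an ECE of 0.4, which is the kind of near-constant, uninformative confidence the first finding describes.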

2606.19504 2026-06-19 cs.RO cs.SY eess.SY New submission

Simulating Robotic Locomotion in Sand: Resistive Force Theory in an Open-Source Physics Engine

Ryan Walker Brown, Laura K. Treers, Kathryn A. Daltorio

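Affiliations Case Western Reserve University; University of Vermont

AI summary Integrates 3D Granular Resistive Force Theory (3D RFT) into the MuJoCo physics engine to simulate walking on sand, verifies the effects of foot shape, speed, and loading on locomotion, and predicts walking distance and sinkage in hexapod robot experiments to within 20%.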

Comments 12 pages, 7 figures

Abstract

Recent advancements in Resistive Force Theory (RFT) enable approximation of ground reaction forces for locomotion in sand without the computational expense of modeling interactions with individual grains. However, these tools have been absent in 3D physics engines commonly used for robot simulation. We explore whether resistive force approximations are sufficient, when integrated with standard dynamics calculations, to provide a stable substrate for a freely walking robot. To determine this, we implement 3D Granular Resistive Force Theory (3D RFT) in a physics simulation engine, MuJoCo. We verify simulations in multiple scenarios to demonstrate that key trends due to end effector shape, speed, and loading are preserved. Our implementation predicts walking distance and foot sinkage of a 12-degree-of-freedom hexapod robot within 20% of experiments in sand. While RFT has inherent approximations, the open-source tool described here has the potential to help develop new and improved robot designs to traverse granular media substrates.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM New submission

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

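Affiliations University of Edinburgh; University of Glasgow; University of Cambridge

AI summary Addresses the tendency of LLM agents to raise false alarms in DeFi supervision with DeXposure-Claw, a system that forecasts exposure networks using a graph time-series foundation model and combines deterministic monitors with confidence gates to emit auditable supervisory tickets. It also introduces DeXposure-Bench, a six-axis evaluation benchmark; experiments support the system's effectiveness.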

Abstract

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19496 2026-06-19 cs.LG New submission

Calibrating Generative Models to Feature Distributions with MMD Finetuning

Nathaniel L. Diamant, Brian L. Trippe

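Affiliations Stanford University

AI summary Proposes kCGM, which minimizes the maximum mean discrepancy (MMD) between generated and target feature distributions with KL regularization, calibrating a generative model's feature distribution without sacrificing validity; it applies to several kinds of generative models.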

Abstract

Generative models can produce individually plausible samples while deviating substantially from a target set in the distribution of key features. For example, a model pretrained on broad drug-like chemical space may generate molecules whose molecular features differ from those of a therapeutic class of interest, such as known antibiotics. Correcting such distributional miscalibration is challenging: direct finetuning on the target set can overfit and does not control which features are matched. To fill this gap, we introduce kernel Calibrating Generative Models (kCGM). kCGM minimizes a maximum mean discrepancy (MMD) between generated and target feature distributions using an unbiased score-function estimator, with KL regularization to remain close to the pretrained model. On a target set of 174 antibiotics, direct finetuning sacrifices chemical validity for feature-distribution matching, whereas kCGM improves target feature matching while increasing validity. We further demonstrate kCGM in protein and DNA generation tasks, showing it can adapt autoregressive, continuous-space diffusion, and discrete diffusion models using only feature-level supervision. Code is available at https://github.com/smithhenryd/cgm.
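
The quantity kCGM minimizes, the squared maximum mean discrepancy between two sets of feature vectors, has a standard unbiased sample estimate. The sketch below computes it with a Gaussian (RBF) kernel; the kernel choice, the `bandwidth` value, and the function names are assumptions for illustration. It shows only the discrepancy itself, not the paper's score-function gradient estimator or its KL regularization.

```python
import math

def rbf(a, b, bandwidth=1.0):
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples xs and ys (each needs >= 2 points)."""
    m, n = len(xs), len(ys)
    # Within-sample terms exclude the diagonal i == j, which removes the bias.
    kxx = sum(rbf(xs[i], xs[j], bandwidth) for i in range(m) for j in range(m) if i != j)
    kyy = sum(rbf(ys[i], ys[j], bandwidth) for i in range(n) for j in range(n) if i != j)
    kxy = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2.0 * kxy / (m * n)
```

Because the estimator is unbiased, it can come out slightly negative when the two samples are drawn from nearly the same distribution; values near zero indicate matched feature distributions.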

2606.19495 2026-06-19 cs.CV New submission

LooseControlVideo: Directorial Video Control using Spatial Blocking

Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

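Affiliations Adobe Research

AI summary Proposes LooseControlVideo, a framework that uses sparse, oriented 3D boxes as a "blocking" proxy for intuitive layout and trajectory control of multi-object scenes in text-to-video generation, significantly outperforming existing 2D-box and flow-based methods.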

Comments Project page at https://shariqfarooq123.github.io/LooseControlVideo/

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.

2606.19494 2026-06-19 cs.AI New submission

Hidden Anchors in Multi-Agent LLM Deliberation

Apurba Pokharel, Ram Dantu

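Affiliations University of North Texas

AI summary Models multi-agent LLM deliberation as a closed-loop dynamical system in which each agent holds a hidden internal belief (an anchor), explains how deliberation can move beyond the convex hull of the initial beliefs, and predicts model behavior from the recovered anchors.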

Comments 13 pages, 6 figures, 7 tables

Abstract

Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin-Johnsen capture, and by our own internal belief, which they do not. We model multi-agent deliberation as a closed-loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent's confidence in the correct answer can climb past where any agent started, escaping the space (convex hull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held-out runs (generalizes) gives a simple test for when a model is truly driven by such an anchor. Across three open-weight model families this is a spectrum, not all-or-nothing. The anchors' influence is about equally strong across families, but the families differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed-loop model.
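
The hull-escape behaviour can be reproduced with a toy opinion-dynamics simulation. The update below is a Friedkin-Johnsen-style rule in which the pull toward each agent's initial opinion is replaced by a pull toward a separate anchor value; this substitution, the scalar opinions, the shared `susceptibility` parameter, and the function name are assumptions for illustration, since the abstract does not give the paper's exact closed-loop model.

```python
def deliberate(x0, anchors, W, susceptibility, rounds=50):
    """Iterate x_i <- s * sum_j W[i][j] * x_j + (1 - s) * anchor_i.

    W is a row-stochastic influence matrix; s = 1 recovers DeGroot averaging.
    """
    x = list(x0)
    n = len(x)
    for _ in range(rounds):
        x = [
            susceptibility * sum(W[i][j] * x[j] for j in range(n))
            + (1.0 - susceptibility) * anchors[i]
            for i in range(n)
        ]
    return x
```

With `susceptibility=1.0` every update is a weighted average of current opinions, so the result can never leave the range spanned by the initial opinions. With anchors placed outside that range and `susceptibility < 1`, the opinions are drawn past every starting value, which is the escape the abstract describes.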

2606.19491 2026-06-19 cs.LG stat.ML New submission

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

Tejas Pradeep Shirodkar, P. J. Narayanan

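Affiliations IIIT, Hyderabad

AI summary Shows that LayerNorm's inverse-scale direction is an exact algebraic kernel of the post-final-norm centred activation covariance, so a dead direction can be read from the parameters alone with no forward or backward pass; validated on 14 pretrained models.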

Comments 34 pages, 7 figures, 6 tables. Empirical companion to arXiv:2606.05957

Abstract

Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network's parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction $γ^{-1}/\|γ^{-1}\|$ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on $14$ pretrained transformers ($9$ LayerNorm, $5$ RMSNorm; $160$M-$35$B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on $9/9$ LayerNorm models, and is correctly absent on $5/5$ RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by ${\sim}10^3\times$ and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream's smallest singular value is preserved block-to-block on $13/14$ transformers measured on their own input distribution, the one exception (Gemma$4$-$31$B) a genuine dead direction the same read pinpoints; and the kernel direction's presence classifies a transformer's normalisation from the parameters alone.
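
The kernel claim can be checked numerically in a few lines. LayerNorm outputs $y_i = \gamma_i z_i + \beta_i$, where the normalised activations $z$ sum to zero because of the mean subtraction, so $\sum_i y_i/\gamma_i = \sum_i \beta_i/\gamma_i$ is the same for every input. The sketch below is a plain-Python check under that standard LayerNorm definition; the function names and the Gaussian test inputs are illustrative assumptions, not the authors' code.

```python
import math
import random

def layer_norm(x, gamma, beta, eps=1e-5):
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def inverse_scale_direction(gamma):
    inv = [1.0 / g for g in gamma]
    norm = math.sqrt(sum(v * v for v in inv))
    return [v / norm for v in inv]

def projection_spread(gamma, beta, n_samples=100, seed=0):
    """Range of the LayerNorm output projected onto gamma^{-1}/||gamma^{-1}||."""
    rng = random.Random(seed)
    v = inverse_scale_direction(gamma)
    projections = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 3.0) for _ in gamma]
        y = layer_norm(x, gamma, beta)
        projections.append(sum(vi * yi for vi, yi in zip(v, y)))
    return max(projections) - min(projections)
```

A spread of zero (up to floating-point error) means the output has no variance along the inverse-scale direction, so that direction lies in the kernel of the centred covariance. RMSNorm skips the mean subtraction, so the normalised activations no longer sum to zero and the argument does not go through, consistent with the direction being absent on the RMSNorm models.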

2606.19489 2026-06-19 cs.LG cs.AI New submission

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

Ya Wang, Adrian Paschke

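Affiliations Fraunhofer Institute for Open Communication Systems; Freie Universität Berlin

AI summary Proposes Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical concept decision tree that progressively narrows the prediction scope to reduce information leakage, improving interpretability while maintaining predictive performance.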

Journal ref Transactions on Machine Learning Research, 2/2026

Abstract

Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially mitigating information leakage by reducing effective concept usage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning with hierarchical class structures.

2606.19483 2026-06-19 cs.CV New submission

LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation

Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem, Randall Balestriero

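Affiliations Brown University; Rice University

AI summary Proposes LEAP, a training curriculum that adaptively selects the teacher's intermediate feature maps as progressive targets to accelerate knowledge distillation into student ViTs, improving ImageNet-100 accuracy by 12.24% and saving 25.1% of training FLOPs.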

Abstract

Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap: the student struggles to imitate the teacher's complex feature maps because of its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement over the baseline. On ImageNet-1K, LEAP achieves +3.84% and +7.75% improvements for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early stopping for teacher inference during the initial stages of training. Code is available at https://github.com/KevinZ0217/LEAP

2606.19481 2026-06-19 cs.LG New submission

Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning

Thomas Frost, Steve Harris

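Affiliations Institute of Health Informatics; University College London

AI summary Addresses the poor generalisability caused by discretising electronic health records with Insulin4RL, an offline reinforcement learning dataset built from real clinical trajectories, containing more than 375,000 decisions across 12,209 patients, for evaluating models under realistic sampling assumptions.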

Comments Under submission

Abstract

Offline reinforcement learning (ORL) offers the potential to improve the quality of clinical decision-making using historical electronic health record (EHR) data. Current training and evaluative practices in this field rely heavily on EHR datasets that have been temporally discretised into fixed, regular time intervals. Discretisation creates fictional representations of complex clinical scenarios and compromises the generalisability of retrospective model evaluations. In this paper, we introduce Insulin4RL, a healthcare ORL dataset featuring naturally irregular inputs and actions from real clinical trajectories. Derived from MIMIC-IV, Insulin4RL comprises over 375,000 labelled decisions across 12,209 patients requiring insulin infusion titration in the Intensive Care Unit. The dataset can thus be used for research into ORL model performance under realistic clinical sampling assumptions. We provide a description of the dataset's structure and characteristics, baseline performance metrics using model-free offline reinforcement learning, and a standardised evaluation protocol using fitted Q-evaluation. We conclude with suggested areas for future research that could be addressed using this resource.

2606.19476 2026-06-19 cs.LG cs.AI New submission

Can In-Context Learning Support Intrinsic Curiosity?

Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

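Affiliations Google – Paradigms of Intelligence Team; Google DeepMind

AI summary Investigates using the in-context learning ability of sequence models as immediate, update-free world models to remove the gradient-descent bottleneck of traditional intrinsic curiosity methods, and proves that in non-temporal settings the resulting rewards asymptotically converge to the true learning progress.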

Abstract

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

2606.19475 2026-06-19 cs.AI cs.CL New submission

Diffusion Language Models: An Experimental Analysis

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

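Affiliations University of Modena and Reggio Emilia; University of Pisa

AI summary Systematically compares eight diffusion language models on reasoning, coding, translation, and other tasks, analyzes how inference factors such as denoising steps and context length affect performance and efficiency, and reveals the trade-offs of diffusion language models across tasks and budgets.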

Abstract

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19469 2026-06-19 cs.AI cs.SE 新提交

Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023

衡量课程在主题覆盖、能力与认知深度上的一致性:应用于CS2013和CS2023的纵向框架

Sherzod Turaev, Mary John, Saja Aldabet, Mamoun Awad, Nazar Zaki, Khaled Shuaib

发表机构 * United Arab Emirates University(阿联酋大学) Abu Dhabi Polytechnic(阿布扎比理工学院)

AI总结 提出一种人机协同流程,通过语义检索与人工确认,纵向衡量计算机科学课程对CS2013和CS2023指南的覆盖情况,发现课程覆盖稳定但新指南对认知深度要求更高。

Comments 24 pages, 5 figures, 8 tables

详情
AI中文摘要

本科计算机科学教育受约每十年修订一次的国际课程指南指导,但各项目缺乏可靠且可重复的方法来衡量其对当前指南的覆盖程度,以及当指南重组时覆盖情况如何变化。我们通过一个人机协同流程解决此问题,该流程衡量项目对外部知识体系的覆盖情况,并纵向应用于一个经认证的计算机科学学士学位项目,对照计算机科学课程2013(CS2013)和2023(CS2023)。该流程将项目和每个指南表示为结构化语料库,通过语义检索生成候选课程-知识单元匹配,并在明确的覆盖定义下通过人工判断确认。在七个基准检索器中,倒数秩融合集成最强,而知名长上下文模型表现不如小型句子模型,因此必须衡量检索器的选择。两个映射由独立第二评分者验证(CS2023的Cohen's kappa为0.64,CS2013为0.69)。该项目覆盖CS2023的49.7%和CS2013的50.9%的知识单元,十年间几乎恒定。将相同的检索-确认设计扩展到能力表述和认知深度,显示项目在每个指南下对约88%的覆盖单元表述了能力,但在CS2023下对76%的现有单元以推荐深度交付,而在CS2013下为95%,这一差距反映了新指南提高了期望,而非项目本身。纵向比较将持久的结构性差距(并行与分布式计算、编程语言基础、系统基础)——这些差距在两种指南和ABET下均未覆盖——与反映标准演变的差异区分开来。该工具可重用,并可向作者索取。

英文摘要

Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to measure how completely they cover the current guidelines and how that coverage shifts when the guidelines are restructured. We address this with a human-in-the-loop pipeline that measures a program's coverage of an external body of knowledge, applied longitudinally to one accredited BSc in Computer Science against Computer Science Curricula 2013 (CS2013) and 2023 (CS2023). The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition. Of seven benchmarked retrievers, a reciprocal-rank-fusion ensemble was strongest, and a reputed long-context model underperformed a small sentence model, so retriever choice must be measured. Both maps were validated by an independent second rater (Cohen's kappa 0.64 for CS2023, 0.69 for CS2013). The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline's raised expectations, not the program. The longitudinal comparison separates persistent structural gaps (parallel and distributed computing, foundations of programming languages, systems fundamentals), uncovered against both guidelines and ABET, from differences that reflect the standard's evolution. The instrument is reusable and available from the authors on request.
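
The abstract names two standard building blocks: reciprocal rank fusion for combining retrievers, and Cohen's kappa for inter-rater agreement. The following is a minimal sketch of both, not the authors' instrument. The knowledge-unit ids, the rater labels and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into one list, best first.

    Each item scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. k=60 is a commonly used default and is an
    assumption here, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    # Undefined when expected == 1 (both raters always use one label).
    return (observed - expected) / (1 - expected)


# Hypothetical candidate knowledge units from three retrievers for one course.
fused = reciprocal_rank_fusion([
    ["KU-A", "KU-B", "KU-C"],
    ["KU-B", "KU-A", "KU-D"],
    ["KU-B", "KU-C", "KU-A"],
])

# Hypothetical covered (1) / not covered (0) judgments from two raters.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In a retrieve-then-confirm setup of this kind, the fused list would only propose candidates; the coverage decision itself stays with the human raters.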

2606.19468 2026-06-19 cs.CL 新提交

Characterizing Narrative Content in Web-scale LLM Pretraining Data

网络规模LLM预训练数据中的叙事内容特征化

Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

发表机构 * University of Colorado Boulder(科罗拉多大学波尔德分校) ETH Zürich(苏黎世联邦理工学院) McGill University(麦吉尔大学)

AI总结 首次细粒度研究LLM预训练语料库Dolma的叙事特征,提出涵盖三个核心叙事元素(能动性、场景、事件)的框架,构建NarraBERT模型并发布NarraDolma数据集,揭示叙事结构在异构数据中可测量且分布不均。

Comments 8 pages of main content, 28 total pages. 30 figures

详情
AI中文摘要

尽管叙事是人类交流的基本模式,但网络规模LLM预训练语料库的叙事组成仍然很大程度上未被探索。我们首次对Dolma(一个3万亿词元的开放预训练语料库)中的叙事特征进行了细粒度研究。借鉴叙事理论,我们设计了一个框架,涵盖三个核心叙事元素(能动性、场景和事件),并将其操作化为11个可解释维度。在采样并标注了400个多样化的段落之后,我们微调并验证了NarraBERT,一个基于RoBERTa的细粒度叙事预测模型。我们将NarraBERT应用于300万个段落,生成了新数据集NarraDolma。我们发现:(i) 叙事结构在极度异构的数据中是可大规模测量的;(ii) 我们揭示了网络文本背后连续的多维叙事结构;(iii) 叙事质量在预训练来源和主题之间分布不均,而当前的策展实践既未测量也未考虑这一点。我们的框架、数据集和分析为理解LLM预训练数据中叙事质量的分布以及研究数据组成如何影响叙事推理任务提供了基础。我们公开发布了NarraDolma和NarraBERT。

英文摘要

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19464 2026-06-19 cs.AI cs.MA 新提交

Deontic Policies for Runtime Governance of Agentic AI Systems

面向自主AI系统运行时治理的道义策略

Anupam Joshi, Tim Finin, Karuna Pande Joshi, Lalana Kagal

发表机构 * CSEE Department UMBC Baltimore, MD, USA Center for AI UMBC Baltimore, MD, USA Information Systems Department UMBC Baltimore, MD, USA CSAIL MIT Cambridge, MA, USA

AI总结 针对大语言模型驱动的自主AI系统在安全、隐私和合规方面的治理挑战,提出AgenticRei框架,利用基于Rei的道义策略语言(OWL表示)在运行时通过逻辑引擎强制执行义务、豁免、冲突解决等治理约束,并兼容A2AS等标准。

Comments 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services which is part of the IEEE Conference on Web Services

详情
AI中文摘要

由大语言模型驱动的自主AI系统引入了一类新的安全、隐私和合规挑战:能够调用工具、操作数据、安装软件并与跨组织边界对等代理协调的代理,不仅必须通过身份验证和访问控制来约束,还必须通过企业治理的完整结构来约束。这包括指定代理被允许和禁止做什么,它们在特定操作后必须做什么(例如,通知CISO),在什么条件下可以免除一项持续义务,以及当策略冲突时哪些规则优先。这个治理问题超出了当前策略引擎的能力范围。诸如XACML、Rego和Cedar等系统仅处理此治理结构的允许/禁止子集。它们不提供义务生命周期管理、元策略冲突解决、在特定情况下免除义务的豁免,以及通常在医疗、网络安全或数据隐私等应用中发现的领域类层次结构的本体推理。我们提出了AgenticRei,它实现了关键的治理需求,如义务、豁免、策略冲突解决和策略推理,以及基本的允许/禁止约束。我们使用基于Rei框架的道义策略语言,表示为OWL(Web本体语言),并由完全在LLM外部的高性能逻辑引擎在运行时评估。同一管道同时管理代理的工具调用和代理间消息。我们通过示例表明,道义策略捕获了当前生产引擎大多无法表达的安全和隐私治理约束。我们的方法自然地与A2AS等行业标准框架兼容。

英文摘要

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolution, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.

2606.19460 2026-06-19 cs.CV cs.AI cs.LG 新提交

Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

使用整流流变换器扩展胸部X光片的生成式基础模型

Fabio De Sousa Ribeiro, Emma A. M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker

发表机构 * Imperial College London(帝国理工学院) Causality in Healthcare AI Hub(医疗AI因果关系中心) University of Edinburgh(爱丁堡大学) Cleveland Clinic London(克利夫兰诊所伦敦) Department of Perioperative Medicine, CHU Clermont-Ferrand(克莱蒙费朗大学医院围手术期医学科) Department of Medicine, Massachusetts General Hospital(麻省总医院医学部) Broad Institute of MIT and Harvard(麻省理工学院与哈佛大学博德研究所)

AI总结 提出首个十亿参数级胸部X光片生成基础模型,通过整流流变换器实现高保真可控合成,显著提升合成图像与真实图像的不可区分性。

Comments Project page: https://RadiT-project.github.io

详情
AI中文摘要

我们引入了首个从零开始在十亿参数规模上训练的胸部X光片合成生成基础模型。现有的放射学AI模型通常在不同患者亚群、机构和采集设置下泛化能力差,导致实际临床效用有限。可控、高保真的胸部X光片合成是多样化临床数据集和评估诊断模型鲁棒性的有前景途径。因此,我们提出了迄今为止最大的胸部X光片专用生成基础模型,拥有超过13亿参数,在包含120万张X光片和临床专家指导元数据的精选异质数据集上训练了1.6万亿个token。我们的模型支持跨多个人口统计亚组、采集视图和十多种病理的可控X光片生成和编辑。此外,我们显著推进了X光片合成保真度的最新技术,生成的图像对临床专家而言与真实X光片无法区分。

英文摘要

We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 新提交

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出3D-DLP模型,通过自监督学习将场景级RGB-D或体素观测分解为3D潜在粒子,每个粒子编码解耦属性,实现可解释的逐粒子分割图,并支持场景操控和下游机器人操作。

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

详情
AI中文摘要

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息、或依赖内存占用大且缺乏物体中心结构的稠密3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:https://eubooks3003.github.io/3d-dlp。

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19419 2026-06-19 cs.RO cs.AI 新提交

Playful Agentic Robot Learning

趣味性智能体机器人学习

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

发表机构 * University of California, Berkeley(加州大学伯克利分校) Impossible Research

AI总结 提出RATs框架,让机器人通过自主探索学习可复用技能,在LIBERO-PRO和MolmoSpaces上分别提升20.6和17.0个百分点。

Comments Project page: https://playful-rats.github.io/

详情
AI中文摘要

当前的智能体机器人系统可以编写可执行的代码即策略(Code-as-Policy)程序、观察反馈并在多次尝试中修正行为,但它们仍然主要是任务驱动的:可复用技能仅在收到明确指令后才会获得。我们研究趣味性智能体机器人学习,其中具身编码代理在下游任务到来之前,将自主导向的趣味性探索作为持续技能学习阶段。我们引入RATs(Robotics Agent Teams),即专为趣味性阶段技能获取设计的机器人代理团队。在趣味性阶段,RATs提出新颖且可学习的探索性任务,规划并执行机器人代码策略,验证中间进展,诊断失败,借助密集的步骤级反馈进行重试,并将成功执行提炼到持久的代码技能库中。在测试时,代理从该冻结库中重用相关技能以帮助解决新任务。在LIBERO-PRO和MolmoSpaces上的实验表明,与无趣味性和随机趣味性基线相比,通过趣味性探索学到的技能提升了保留的下游任务表现,相对于CaP-Agent0在LIBERO-PRO和MolmoSpaces上分别提升20.6和17.0个百分点。此外,只需将学到的技能检索到上下文中,即可将其插入其他推理时代码即策略代理,在不微调基础模型的情况下,使RoboSuite和真实世界迁移分别提升8.9和8.8个百分点。

英文摘要

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

2606.19416 2026-06-19 cs.LG 新提交

MortarBench: Evaluating Mortgage Loan Origination Agents

MortarBench: 评估抵押贷款发起代理

Matthew Toles, Yunan Lu, Manav Munjal, Bojun Liu, Yuanhao Deng, Stephanie Selig, Derek Rindner, Cheng Li, Zhou Yu

发表机构 * Columbia University(哥伦比亚大学) Tidalwave

AI总结 提出MortarBench基准,通过金融数据合成与变异管道生成覆盖边缘案例的示例,评估大语言模型在贷款发起任务中的表现,发现模型准确率低且存在偏见,并引入CRIT校准框架提升准确率至80.5%。

详情
AI中文摘要

贷款发起是贷方创建新贷款的过程,从申请和承保到批准和融资。该过程在评估申请人的资格和风险水平方面起着关键作用。最近,尽管缺乏任何公开基准,公司已开始使用抵押贷款代理来增强人类贷款官员。为填补这一空白,我们提出了MortarBench,一个贷款发起代理基准。MortarBench使用金融数据合成和变异管道生成具有广泛边缘案例覆盖的示例,这些示例匹配真实世界的分布和问题。我们发现最先进的大语言模型(LLM)表现不佳,闭源模型最多达到77.1%的精确匹配准确率。我们还发现LLM对与非英语名字相关的外国性存在系统性偏见。注意到这些弱点,我们引入了CRIT,一个置信度校准框架。我们的方法将准确率提高到80.5%,同时改善了风险管理导向并减少了偏见。

英文摘要

Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an applicant. Recently, firms have begun using mortgage loan agents to augment human loan officers, despite a lack of any public benchmark. To fill this gap, we present MortarBench, a loan origination agent benchmark. MortarBench uses a financial data synthesis and mutation pipeline to generate examples with broad edge case coverage that match real-world distributions and questions. We find that state-of-the-art large language models (LLMs) perform poorly, with closed-source models achieving at most 77.1% exact match accuracy. We also discover systematic biases in LLM perception of foreignness related to non-English names. Noting these weaknesses, we introduce CRIT, a confidence calibration framework. Our method increases accuracy to 80.5% while improving risk management steering and reducing bias.

2606.19413 2026-06-19 cs.LG 新提交

Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting

文本真的有用吗?揭示并解决多模态时间序列预测中的文本坍缩问题

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

发表机构 * Applied Artificial Intelligence Initiative(应用人工智能计划)

AI总结 针对多模态时间序列预测中文本分支被忽视导致“文本坍缩”的问题,提出REST-TS方法,通过让文本分支专门预测数值主干无法解释的残差,强制其提取真实内容,实现最先进性能。

详情
AI中文摘要

多模态时间序列预测将数值序列与领域相关的文本报告配对,有望将世界知识注入预测流程。然而,我们揭示了现有框架中的一个关键失败模式,称为文本坍缩:文本分支收敛到与内容无关的变换,无论输入描述如何,都贡献可忽略的判别信号。我们认为文本坍缩是时间序列预测中基本不对称性的结果:数值输入与输出强自相关,使得数值主干天生占主导地位,而文本分支尽管携带互补且通常关键的信息,却未被充分利用,导致其系统性欠利用。为解决此问题,我们提出REST-TS(时间序列中文本的残差独占监督),将不对称性转化为设计原则:数值主干产生其独立的数值预测,而文本分支被独占监督以预测残差的结构化组成部分,即数值无法解释的预测差距。由于没有数值路径可以减少这些损失,文本分支必须从输入描述中提取真实内容。在多样化的现实领域和主干架构上的评估表明,REST-TS实现了最先进的性能,并一致地显示出比现有框架更高的文本分支利用率,提供了强有力的经验证据,表明对文本分支进行残差监督迫使其从输入中提取真实内容。

英文摘要

Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing frameworks that we term text collapse: the text branch converges to a content-independent transformation, contributing negligible discriminative signal regardless of the input description. We argue that text collapse is a consequence of a fundamental asymmetry in time series forecasting: the numerical input is strongly autocorrelated with the output, making the numerical backbone inherently dominant, while the text branch, despite carrying complementary and often critical information, is insufficiently utilized, leading to its systematic underexploitation. To address this, we propose REST-TS (Residual-Exclusive Supervision for Text in Time Series), which turns the asymmetry into a design principle: the numerical backbone produces its own independent numerical forecast, and the text branch is exclusively supervised to predict the structured components of the residual, the prediction gap that numbers cannot explain. Because no numerical pathway can reduce these losses, the text branch must extract genuine content from the input description. Evaluated across diverse real-world domains and backbone architectures, REST-TS achieves state-of-the-art performance and consistently demonstrates greater text-branch utilization than existing frameworks, providing strong empirical evidence that supervising the text branch on the residual compels it to extract genuine content from the input.
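
The residual-supervision idea can be illustrated with a few lines of arithmetic. This is a hedged sketch, not the REST-TS implementation: it uses the raw residual as the text-branch target, whereas the abstract speaks of "structured components" of the residual whose exact form is not given here. All function names and numbers are hypothetical.

```python
def residual_targets(actual, numeric_forecast):
    """Training target for the text branch: what the numeric backbone missed."""
    return [y - y_num for y, y_num in zip(actual, numeric_forecast)]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combine(numeric_forecast, text_residual):
    """Final forecast: numeric forecast plus the text branch's residual estimate."""
    return [y_num + r for y_num, r in zip(numeric_forecast, text_residual)]


# Hypothetical values: the numeric backbone misses a jump that only the
# accompanying text report could explain.
actual = [10.0, 12.0, 15.0]
numeric_forecast = [10.0, 11.0, 12.0]

targets = residual_targets(actual, numeric_forecast)

# A text branch that ignores its input and always outputs zero pays the
# full residual loss, which is the pressure the abstract describes.
collapsed_loss = mse([0.0, 0.0, 0.0], targets)
```

In a real training loop the numeric forecast would be treated as a constant when computing the text-branch loss, so that only the text pathway can reduce it; plain Python cannot show that gradient-level detail.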

2606.19412 2026-06-19 cs.LG 新提交

Spectral Retrieval-Augmented Time-Series Forecasting

频谱检索增强的时间序列预测

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

发表机构 * Applied Artificial Intelligence Initiative(应用人工智能倡议) Deakin University(迪肯大学)

AI总结 提出SpecReTF方法,通过将时间序列转换为窗口化频率表示并采用结合幅度和相位的相似性度量,以及指数移动平均加权方案,解决了现有检索方法在频谱盲区和时间近因上的局限性,提升了非平稳时间序列预测的准确性。

详情
AI中文摘要

时间序列预测利用历史模式来预测未来值,但传统方法在处理复杂、非平稳模式时面临挑战,这些模式在训练期间难以记忆。检索增强方法通过检索相似历史模式来增强预测,已成为有前景的解决方案。然而,现有检索方法存在两个基本局限性:频谱盲区,即忽略了捕捉潜在周期结构的关键频域特征;以及时间近因,即对所有历史数据一视同仁,而不强调最近、更相关的模式。在本文中,我们提出SpecReTF,一种新颖的检索方法,通过将时间序列转换为窗口化频率表示,并使用结合幅度和相位信息的组合度量来衡量相似性,从而解决这些问题。为了平衡近因和历史上下文,我们应用指数移动平均加权方案,强调最近的窗口。在基准数据集上的大量实验表明,SpecReTF优于时域检索方法,在多样化的非平稳时间序列上实现了卓越的预测准确性。

英文摘要

Time series forecasting leverages historical patterns to predict future values, but traditional methods face challenges when dealing with complex, non-stationary patterns that are difficult to memorize during training. Retrieval-augmented approaches have emerged as promising solutions by retrieving similar historical patterns to enhance predictions. However, existing retrieval methods suffer from two fundamental limitations: spectral blindness, which overlooks critical frequency-domain characteristics that capture underlying periodic structures, and temporal recency, which treats all historical data equally without emphasizing recent, more relevant patterns. In this paper, we propose SpecReTF, a novel retrieval method that addresses these issues by converting time series into windowed frequency representations, measuring similarity with a combined metric that captures both amplitude and phase information. To balance recency and historical context, we apply an exponential moving average weighting scheme that emphasizes recent windows. Extensive experiments on benchmark datasets demonstrate that SpecReTF outperforms time-domain retrieval methods, achieving superior forecasting accuracy across diverse, non-stationary time series.
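
To make the retrieval idea concrete, here is a minimal sketch of a frequency-domain similarity that blends amplitude and phase, plus exponentially decaying recency weights. It is an assumption-laden illustration rather than the SpecReTF metric: the blend weight `alpha`, the amplitude-weighted phase term and the `decay` constant are all hypothetical choices.

```python
import cmath
import math


def dft(window):
    """Naive discrete Fourier transform of a real-valued window (non-negative bins)."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(window))
        for f in range(n // 2 + 1)
    ]


def spectral_similarity(a, b, alpha=0.5):
    """Blend amplitude similarity and phase agreement; higher means more similar."""
    fa, fb = dft(a), dft(b)
    amp_a = [abs(c) for c in fa]
    amp_b = [abs(c) for c in fb]
    # Cosine similarity between the two amplitude spectra.
    dot = sum(x * y for x, y in zip(amp_a, amp_b))
    norm = math.sqrt(sum(x * x for x in amp_a)) * math.sqrt(sum(y * y for y in amp_b))
    amp_sim = dot / norm if norm > 0 else 0.0
    # Phase agreement: amplitude-weighted mean cosine of the phase difference.
    weights = [x * y for x, y in zip(amp_a, amp_b)]
    total = sum(weights)
    phase_sim = (
        sum(
            w * math.cos(cmath.phase(ca) - cmath.phase(cb))
            for w, ca, cb in zip(weights, fa, fb)
        ) / total
        if total > 0
        else 0.0
    )
    return alpha * amp_sim + (1 - alpha) * phase_sim


def recency_weights(num_windows, decay=0.9):
    """Exponentially decaying weights, oldest window first; the newest gets 1.0."""
    return [decay ** (num_windows - 1 - i) for i in range(num_windows)]


# Two windows with the same period but a quarter-cycle phase shift.
sine = [math.sin(2 * math.pi * t / 8) for t in range(8)]
cosine = [math.cos(2 * math.pi * t / 8) for t in range(8)]
shifted_score = spectral_similarity(sine, cosine)
```

With these choices the sine and cosine windows have matching amplitude spectra but disagree in phase, so an amplitude-only metric would treat them as identical while the blended score does not.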

2606.19411 2026-06-19 cs.LG 新提交

Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection

通过NEPv的谱DPP:用于多样性感知数据选择的行列式点过程MAP的可扩展连续松弛

Richard Yi Da Xu

发表机构 * Hong Kong Baptist University(香港浸会大学) TadReamk Limited(TadReamk有限公司)

AI总结 提出将NP难的DPP-MAP选择问题转化为Stiefel流形上的连续优化,通过非线性特征值问题(NEPv)的自洽场迭代实现近线性时间求解,适用于大规模数据选择。

详情
AI中文摘要

从海量候选池中选择一个小的、多样化的、高质量的子集,是现代机器学习中反复出现的基本操作,例如用于训练和微调大型模型的数据整理和核心集选择、主动学习批次获取、上下文学习的提示和示例选择、检索多样化以及实验设计。行列式点过程(DPP)为此任务提供了有原则的、良好校准的多样性概念,但其MAP目标(选择大小为$k$的子集$S$以最大化$\log\det(L_S)$)是NP难的,并且标准的贪心和采样算法在候选集大小$n$上具有超线性复杂度。这种成本恰恰在多样性最重要的以数据为中心的场景中高得难以承受,其中$n$的范围从数百万到数十亿的候选示例、特征或嵌入。我们将DPP-MAP重新表述为Stiefel流形上的连续优化问题,并证明其一阶最优性条件构成一个先前未研究形式的具有特征向量依赖性的非线性特征值问题(NEPv)。该NEPv允许自洽场(SCF)迭代,具有基于谱间隙的局部收缩保证,从而提供了一个有原则的迭代求解器,其中多样性目标驱动一个特征向量依赖的算子。由此产生的算法仅需要与核的矩阵-向量乘积,运行时间为$O\big((ndk+nk^2)\,t\big)$,其中迭代次数$t$很小,在$n$上接近线性,并直接与机器学习中常见的低秩和特征映射核集成。本文重点介绍松弛、求解器和扩展分析;完整的真实数据基准测试留给计划中的实证研究。

英文摘要

Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning: data curation and coreset selection for training and fine-tuning large models, active-learning batch acquisition, prompt and exemplar selection for in-context learning, retrieval diversification, and experimental design. Determinantal Point Processes (DPPs) give a principled, well-calibrated notion of diversity for this task, but their MAP objective, which picks a size-$k$ subset $S$ maximizing $\log\det(L_S)$, is NP-hard, and the standard greedy and sampling algorithms scale superlinearly in the ground-set size $n$. This cost is prohibitive precisely in the data-centric regime where diversity matters most, where $n$ ranges over millions to billions of candidate examples, features, or embeddings. We recast DPP-MAP as a continuous optimization problem over the Stiefel manifold, and show that its first-order optimality conditions form a Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv) of a previously unstudied form. This NEPv admits a self-consistent field (SCF) iteration with a spectral-gap-based local contraction guarantee, giving a principled iterative solver where the diversity objective drives an eigenvector-dependent operator. The resulting algorithm requires only matrix-vector products with the kernel and runs in time $O\big((ndk+nk^2)\,t\big)$ for a small number of iterations $t$, scaling near-linearly in $n$ and integrating directly with low-rank and feature-map kernels common in ML. This paper focuses on the relaxation, solver, and scaling analysis; full real-data benchmarking is left to a planned empirical study.
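
For readers unfamiliar with the objective, the sketch below evaluates $\log\det(L_S)$ and runs the naive greedy baseline that the abstract contrasts against. It does not implement the paper's Stiefel-manifold relaxation or its SCF solver. The 3x3 kernel is a made-up example in which items 0 and 1 are near-duplicates.

```python
import math


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return float("-inf")  # not positive definite
                lower[i][j] = math.sqrt(s)
                total += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = s / lower[j][j]
    return total


def submatrix(kernel, subset):
    """Principal submatrix L_S for the index list `subset`."""
    return [[kernel[i][j] for j in subset] for i in subset]


def greedy_dpp_map(kernel, k):
    """Greedy baseline: repeatedly add the item that maximizes log det(L_S)."""
    selected = []
    for _ in range(k):
        best_item, best_value = None, float("-inf")
        for i in range(len(kernel)):
            if i in selected:
                continue
            value = log_det(submatrix(kernel, selected + [i]))
            if value > best_value:
                best_item, best_value = i, value
        if best_item is None:
            break
        selected.append(best_item)
    return selected


# Hypothetical similarity kernel: items 0 and 1 are near-duplicates.
kernel = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
chosen = greedy_dpp_map(kernel, k=2)
```

Each greedy step here recomputes a determinant for every remaining candidate, which is what makes this naive form impractical at the pool sizes the abstract has in mind.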

2606.19408 2026-06-19 cs.LG cs.RO 新提交

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

FlexLAM: 解决潜在动作学习中的瓶颈权衡

Takanori Yoshimoto, Yang Hu, Naruya Kondo, Tatsuya Matsushima

发表机构 * University of Tsukuba(筑波大学) The University of Tokyo(东京大学)

AI总结 针对潜在动作模型中固定容量瓶颈导致的权衡问题,提出FlexLAM,通过嵌套dropout实现变长潜在动作,在不增加架构或损失的情况下,在稀缺标签和低回报任务中优于固定容量模型,并支持推理时调整令牌预算。

详情
AI中文摘要

潜在动作为无动作视频与下游决策提供了紧凑接口,但现有潜在动作模型(LAM)强制每个转换通过固定容量瓶颈。我们识别出一个瓶颈权衡:过于紧凑的编码可能丢弃动作对齐所需的转换线索,而过于松散的编码则保留了额外的转换变化,当对齐标签稀缺或分布狭窄时必须解决这些变化。FlexLAM用通过嵌套dropout训练的变长潜在动作取代固定容量,产生前缀有效编码,首先捕获紧凑的转换结构,仅在需要时添加细节,无需新架构或损失。在标准稀缺标签监督下和低回报单任务对齐压力测试中,单个FlexLAM在每个评估的令牌预算下匹配或超越单独训练的固定容量LAM,表明FlexLAM不仅在推理时可调整,而且在相同令牌预算下学习了更好的潜在动作接口。同一模型支持推理时令牌预算调整而无需重新训练,并且FlexLAM改善了Ego4D转换重建。这些结果表明,变长潜在动作是对潜在动作模型、潜在动作世界模型和视频预训练动作接口中固定容量瓶颈的无架构、即插即用升级。

英文摘要

Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
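
Nested dropout, the training device named above, can be sketched as sampling a prefix length and masking every latent token after it. This is an illustrative sketch only: the uniform prefix distribution, the token shapes and the function names are assumptions, not details from FlexLAM.

```python
import random


def nested_dropout_mask(num_tokens, rng):
    """Sample a prefix length and return a 0/1 mask that keeps only that prefix."""
    keep = rng.randint(1, num_tokens)  # inclusive on both ends
    return [1 if i < keep else 0 for i in range(num_tokens)]


def apply_mask(tokens, mask):
    """Zero out the dropped tail so a decoder only sees the kept prefix."""
    return [[value * m for value in token] for token, m in zip(tokens, mask)]


def truncate(tokens, budget):
    """Inference-time budget: keep the first `budget` latent tokens."""
    return tokens[:budget]


rng = random.Random(0)
latent_action = [[0.3, -1.2], [0.8, 0.1], [-0.5, 0.9], [1.1, 0.4]]

# Training: a random prefix survives, so early tokens must carry the most
# important transition information.
masked = apply_mask(latent_action, nested_dropout_mask(len(latent_action), rng))

# Inference: the same code can be cut to any budget without retraining.
short_code = truncate(latent_action, budget=2)
```

Because every training step can cut the code at a different point, any prefix has to be decodable on its own, which is the "prefix-valid" property the abstract refers to.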

2606.19404 2026-06-19 cs.LG cs.CL 新提交

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

推理的热力学特征:用于大型语言模型幻觉检测的自由能和谱形因子诊断

Salim Khazem

发表机构 * Talan Research & Innovation Center(Talan研究与创新中心)

AI总结 提出自由能签名(Fes)作为谱描述符,将注意力拉普拉斯视为哈密顿量并提取热力学势和随机矩阵理论谱形因子,用于检测LLM幻觉,无需训练即可实现高AUROC。

详情
AI中文摘要

大型语言模型(LLM)中的幻觉检测对部署至关重要,近期研究表明注意力导出的图拉普拉斯谱携带关于推理质量的强信号。然而,先前的谱诊断仅通过少数特征值或手工选取的标量来总结拉普拉斯谱,忽略了其大部分结构。我们提出自由能签名(Fes),一种谱描述符,将每层的注意力拉普拉斯视为哈密顿量,并提取其热力学势(配分函数、自由能、谱熵、热容)以及随机矩阵理论(RMT)谱形因子。我们证明了三个结果:(i)Fes在注意力扰动下的Lipschitz稳定性;(ii)一个表达性结果,表明Fes丰富了有限谱摘要,并在明确的规则性和网格分辨率假设下逼近矩导出的谱泛函;(iii)基于Fes构建的无训练检测器AUROC的有限样本PAC界。实验上,在六个开源LLM和六个基准测试中,基于Fes描述符的轻量级探测在注意力谱基线中实现了最强的平均AUROC,相比LapEig平均提高+6.5 AUROC点,相比GoR-4平均提高+2.4点,且无需更新底层LLM。在完全无监督设置下,RMT偏差得分达到平均AUROC 0.71,提供了一个无标签但较弱的检测器。互补的RMT分析表明,正确生成表现出更接近Wigner-Dyson的谱统计,而幻觉表现出更接近Poisson的统计。匿名代码和配置在补充材料中提供。

英文摘要

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by +6.5 AUROC points and over GoR-4 by +2.4 points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
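
The thermodynamic quantities listed above have textbook definitions once a spectrum is treated as a set of energy levels. The sketch below computes them from a list of eigenvalues under those standard definitions. Whether Fes uses exactly these normalizations, and which inverse temperatures `beta` it evaluates, are assumptions; the spectral form factor is omitted.

```python
import math


def thermodynamic_summary(eigenvalues, beta=1.0):
    """Canonical-ensemble quantities for energy levels `eigenvalues` at inverse temperature beta."""
    weights = [math.exp(-beta * lam) for lam in eigenvalues]
    partition = sum(weights)                       # Z = sum_i exp(-beta * lambda_i)
    probs = [w / partition for w in weights]       # Boltzmann distribution
    free_energy = -math.log(partition) / beta      # F = -(1/beta) * ln Z
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    mean_energy = sum(p * lam for p, lam in zip(probs, eigenvalues))
    mean_energy_sq = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    heat_capacity = beta ** 2 * (mean_energy_sq - mean_energy ** 2)
    return {
        "partition": partition,
        "free_energy": free_energy,
        "entropy": entropy,
        "mean_energy": mean_energy,
        "heat_capacity": heat_capacity,
    }


# Hypothetical Laplacian eigenvalues for one attention layer.
summary = thermodynamic_summary([0.0, 1.0, 2.0], beta=0.5)
```

A useful sanity check on any such implementation is the identity F = <E> - S / beta, which holds for the values returned here.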

2606.19399 2026-06-19 cs.LG cs.AI cs.LO cs.PL 新提交

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

VERITAS:验证器引导的零样本形式定理证明搜索

Manish Acharya, Zhenyu Liao, Yueke Zhang, Kevin Leach, Yu Huang, Yifan Zhang

发表机构 * Department of Computer Science, Vanderbilt University(范德堡大学计算机科学系) Amazon(亚马逊)

AI总结 提出VERITAS框架,通过两阶段协议(Best-of-N采样+批评引导MCTS)利用验证器反馈进行零样本定理证明,在miniF2F上达40.6%准确率,并发布组合学基准VERITAS-CombiBench。

详情
AI中文摘要

基于LLM的形式化证明器通常将丰富的验证器信号(语法错误、类型不匹配、部分目标进展)压缩为二进制的通过/失败位。我们提出VERITAS,一个零样本框架,通过两阶段协议将每个验证器信号路由回证明搜索:首先进行Best-of-N采样,然后进行批评引导的MCTS遍历,该遍历将第一阶段失败作为显式负例吸收。该协议保留其第一阶段扫描解决的每个定理,因此第二阶段额外的解决可归因于反馈驱动的探索。VERITAS在miniF2F上达到40.6%(相比之下,独立运行的Best-of-5为36.9%,Portfolio为26.2%),在VERITAS-CombiBench上达到7.3%,这是一个我们发布的55个定理的组合学基准,在该基准上Best-of-5(1.8%)低于Portfolio(3.6%),暴露了当必须从验证器反馈中迭代恢复正确的引理名称时,无指导的采样会带来损害。工件可在GitHub上获取。

英文摘要

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

2606.19398 2026-06-19 cs.SD eess.AS eess.SP 新提交

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

S-JEPA:用于自监督语音表示学习的软聚类锚点

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

发表机构 * Carnegie Mellon University(卡内基梅隆大学) New York University(纽约大学) James Silberrad Brown Center for AI(詹姆斯·西尔伯拉德·布朗人工智能中心) Columbia University(哥伦比亚大学) Northeastern University(东北大学) Stanford University(斯坦福大学) Amazon GenAI(亚马逊生成式人工智能)

AI总结 提出S-JEPA,通过KL散度匹配高斯混合模型的软后验概率训练编码器-预测器对,无需离线重聚类或教师蒸馏,在SUPERB协议下以低于90M参数取得最低WER,并建立新的帕累托前沿。

详情
AI中文摘要

自监督语音编码器主要通过预测掩蔽位置处的离散硬聚类ID进行训练,这种方法会坍缩类别边界处的声学模糊性,并需要在迭代之间中断训练以对整个语料库进行重聚类。我们提出S-JEPA,一种JEPA风格的编码器-预测器对,通过KL散度训练以匹配掩蔽位置处高斯混合模型的软后验概率。训练作为连续优化轨迹分两个阶段进行:首先在MFCC特征上使用固定GMM,然后在编码器特征上使用在线GMM,输入层从无标签信号中自适应选择,从而消除了离线重聚类步骤以及手动选择聚类所在Transformer层的问题。在SUPERB协议下,S-JEPA在评估的低于90M参数的自监督方法中实现了最低的词错误率(WER),并在大约一半参数量的情况下在情感识别任务上与HuBERT-Base相当,无需离线重聚类或教师蒸馏即建立了新的帕累托前沿。对预测器在保留语音上的每帧熵的分析揭示了双峰分布,其中相当一部分帧的熵接近完美两聚类平局的熵,这直接经验性地证明了软目标目标保留了硬目标会坍缩的声学模糊性。代码可在以下网址获取:https://github.com/gioannides/s-jepa。

英文摘要

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.
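
The soft-target objective can be sketched in a few lines: compute Gaussian-mixture posteriors for a frame, then score the predictor's distribution against them with KL divergence. This is a one-dimensional toy under stated assumptions, not the S-JEPA training code. The mixture parameters, the scalar "feature" and the logits are all hypothetical.

```python
import math


def gmm_posteriors(x, weights, means, variances):
    """Soft posterior p(k | x) under a 1-D Gaussian mixture."""
    log_joint = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(value - top) for value in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def softmax(logits):
    """Convert predictor logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping zero-probability target entries."""
    return sum(
        t * math.log(t / max(q, eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# A frame exactly between two equally weighted clusters: a hard label would
# pick one arbitrarily, while the soft target keeps the tie.
soft_target = gmm_posteriors(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
loss_confident = kl_divergence(soft_target, softmax([2.0, -2.0]))
loss_matched = kl_divergence(soft_target, softmax([0.0, 0.0]))
```

The tie case has entropy ln 2, the "perfect two-cluster tie" reference point mentioned in the abstract's entropy analysis.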

2606.19383 2026-06-19 cs.RO cs.CV New submission

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

Affiliations: University of Stuttgart; IMPRS-IS; Sapienza University of Rome; Google; MIT; University of Freiburg; UTN; University of Montreal; Mila; TU Munich

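AI summary: A unified survey of 3D scene graphs (3DSGs) covering their construction, applications, and evaluation. It analyzes existing modeling choices and open challenges, with the aim of supporting robust deployment.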

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.
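As a rough illustration of the modeling choices the abstract lists (node and edge attributes, hierarchical structure), a 3DSG can be sketched as a small attributed graph. This is only one possible layout under assumed conventions: the class names, the layer names, and the "contains" relation used for the hierarchy are hypothetical and are not taken from the survey.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str        # hypothetical layer name, e.g. "room" or "object"
    label: str        # semantic attribute
    position: tuple   # (x, y, z) in meters: the geometric grounding

@dataclass
class Edge:
    source: str
    target: str
    relation: str     # "contains" encodes hierarchy; others are spatial relations

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source, target, relation):
        # Refuse edges whose endpoints are not in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, relation))

    def children(self, node_id):
        """Nodes directly below node_id in the hierarchy."""
        return [e.target for e in self.edges
                if e.source == node_id and e.relation == "contains"]

graph = SceneGraph()
graph.add_node(Node("kitchen", "room", "kitchen", (2.0, 3.0, 0.0)))
graph.add_node(Node("table", "object", "table", (2.5, 3.2, 0.4)))
graph.add_node(Node("mug", "object", "mug", (2.5, 3.2, 0.8)))
graph.add_edge("kitchen", "table", "contains")   # hierarchical edge
graph.add_edge("kitchen", "mug", "contains")     # hierarchical edge
graph.add_edge("mug", "table", "on")             # spatial relation between objects
print(graph.children("kitchen"))
```

Keeping the hierarchy as ordinary typed edges, rather than a separate tree, is a design choice made here for brevity; dynamic scenes or affordances would need further attributes that this sketch omits.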

2606.19381 2026-06-19 cs.SD cs.AI New submission

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Yue Heng Yeo, Haoyang Li, Yizhou Peng, Shreyas Gopal, Hexin Liu, Leibny Paola Garcia-Perera, Hardik B. Sailor, Jeremy H. M. Wong, Eng Siong Chng

Affiliations: College of Computing and Data Science, Nanyang Technological University; Google DeepMind

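AI summary: To address the scarcity of high-quality text-speech pairs for code-switching ASR, proposes a code-mixing guided preference-learning framework that uses the Code Mixing Index (CMI) to improve the code-switching fidelity of synthetic speech. Fine-tuning Whisper Large on the SEAME corpus reduces the Mixed Error Rate from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets.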

Comments Accepted to Interspeech 2026

Abstract

Code-switching (CS) automatic speech recognition (ASR) remains challenging due to the limited availability of high-quality CS text-speech pairs for training. Although synthetic data augmentation via text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.
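For readers unfamiliar with the Code Mixing Index, the sketch below computes the commonly used utterance-level form, CMI = 100 × (1 − max_i w_i / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the token count of language i. Whether the paper uses exactly this variant is an assumption, and the function name, the tag strings, and the example utterance are hypothetical.

```python
from collections import Counter

def code_mixing_index(lang_tags, neutral="other"):
    """Utterance-level CMI from per-token language tags.

    Returns 0.0 for monolingual utterances and for utterances that contain
    only language-independent tokens.
    """
    n = len(lang_tags)
    u = sum(1 for tag in lang_tags if tag == neutral)
    if n == u:
        return 0.0
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    dominant = max(counts.values())            # tokens in the dominant language
    return 100.0 * (1.0 - dominant / (n - u))

# Hypothetical tagged utterance: three Mandarin tokens, two English, one neutral.
mixed = ["zh", "zh", "zh", "en", "en", "other"]
monolingual = ["zh", "zh", "zh"]
print(code_mixing_index(mixed), code_mixing_index(monolingual))
```

A higher value means the utterance is more evenly split between languages, which is why a score of this kind can act as a signal for how much code-switching a synthetic utterance actually contains.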