arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06722 2026-06-08 cs.LG New submission

Flatland: The Adventures of Gradient Descent with Large Step Sizes

Leonardo Galli, Curtis Fox, Wiebke Bartolomaeus, Mark Schmidt, Holger Rauhut

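Affiliations: University of British Columbia; Munich Center for Machine Learning; Canada CIFAR AI Chair (Amii)

AI summary: For neural-network objectives that are not globally $L$-smooth, the paper proposes a definition of large step sizes that requires only local Lipschitz continuity of the gradient, designs adaptive first-order methods that take large steps and stay at the edge of stability throughout training, and finds that entering globally flat regions too early can slow convergence and hurt generalization.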

Comments: Accepted for the International Conference on Machine Learning (ICML 2026)

Abstract

The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to answer the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of "large" step sizes that requires only local Lipschitz (or even Hölder) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the Hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.
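
As a rough illustration of the threshold of 2 mentioned in the abstract, the sketch below runs plain GD on a one-dimensional quadratic, where the sharpness is simply the curvature. It is not the paper's adaptive method: the toy objective, the function name `gd_on_quadratic`, and the chosen step sizes are all assumptions made for illustration.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, num_steps=50):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return the final |x|.

    Toy objective (an assumption, not taken from the paper): its Hessian is
    the constant `sharpness`, so step_size * sharpness is the quantity that
    is compared against the threshold 2.
    """
    x = x0
    for _ in range(num_steps):
        grad = sharpness * x        # f'(x)
        x = x - step_size * grad    # same as x <- (1 - step_size * sharpness) * x
    return abs(x)


if __name__ == "__main__":
    sharpness = 4.0
    for step_size in (0.4, 0.5, 0.6):
        product = step_size * sharpness
        final = gd_on_quadratic(sharpness, step_size)
        print(f"step_size * sharpness = {product:.1f} -> final |x| = {final:.3e}")
```

On this toy objective the iterates shrink only while the product stays below 2, cycle at exactly 2, and blow up above it, so the above-threshold behaviour reported in the abstract necessarily relies on the loss not being quadratic.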

2606.06721 2026-06-08 cs.RO cs.AI New submission

SCOUT: Semantic scene COverage via Uncertainty-guided Traversal

Junyu Mao, Sara Ayoubi, Vishnu D. Sharma, Ilija Hadžić, Matthew Andrews

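Affiliations: Nokia Bell Labs, France; Nokia Bell Labs, Murray Hill, NJ, USA; Imperial College London; Locus Robotics

AI summary: Proposes SCOUT, a framework that closes the loop between uncertainty-guided traversal planning and probabilistic scene graph construction so that a robot actively explores and progressively understands its environment, treating semantic scene completeness as an operational objective.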

Comments: 2026 ICRA Workshop on Uncertainty in Open World Robotics

Abstract

Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as on, inside, belong, and next to. These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.
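
The trade-off the abstract describes for the traversal planner (semantic certainty gain and coverage gain against travel cost) could be sketched as a simple scoring rule. The abstract does not give the planner's actual objective, so the weighted sum, the weights, and the names `Viewpoint`, `score`, and `select_viewpoint` below are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Viewpoint:
    name: str
    semantic_gain: float   # expected semantic certainty gain
    coverage_gain: float   # expected geometric coverage gain
    travel_cost: float     # cost of reaching the viewpoint


def score(vp, w_sem=1.0, w_cov=1.0, w_cost=0.5):
    """Hypothetical linear utility; the weights are illustrative assumptions."""
    return w_sem * vp.semantic_gain + w_cov * vp.coverage_gain - w_cost * vp.travel_cost


def select_viewpoint(candidates, **weights):
    """Return the candidate viewpoint with the highest utility."""
    return max(candidates, key=lambda vp: score(vp, **weights))


if __name__ == "__main__":
    candidates = [
        Viewpoint("revisit_ambiguous_object", semantic_gain=0.8, coverage_gain=0.1, travel_cost=0.6),
        Viewpoint("explore_unseen_space", semantic_gain=0.1, coverage_gain=0.7, travel_cost=1.0),
    ]
    print(select_viewpoint(candidates).name)
    print(select_viewpoint(candidates, w_cov=2.0).name)
```

Raising the coverage weight in this toy setup shifts the choice from revisiting an ambiguous object to expanding into unseen space, which mirrors the two behaviours the abstract mentions.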

2606.06718 2026-06-08 cs.LG cs.AI cs.SY eess.SY New submission

MSAIC-Net: A Multi-Scale Attention and Imbalance-Aware Contrastive Network for ECG-Based Myocardial Substrate Abnormality Detection

Canyu Lei, Fenglin Zhang, Derek Bivona, Cristiane Singulane, Jonathan Pan, Kenneth Bilchick, Amit R. Patel, Jianxin Xie

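Affiliations: University of Virginia

AI summary: Proposes MSAIC-Net, a multi-scale attention-enhanced convolutional network that combines parallel atrous convolutions for multi-scale feature extraction, channel-attention reweighting, imbalance-aware contrastive learning, and lead-wise permutation importance, improving accuracy and interpretability for myocardial scar and myocardial infarction detection on the low-data UVA cohort and the large-scale PTB-XL dataset.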

Abstract

Myocardial substrate abnormalities, such as myocardial scar and myocardial infarction (MI), are associated with adverse cardiovascular outcomes. Electrocardiography (ECG) provides a low-cost and widely available tool for detecting these abnormalities, but ECG-based detection remains challenging due to heterogeneous lead-dependent manifestations, high-dimensional multi-lead signals, class imbalance, and the limited interpretability of deep learning models. We propose a multi-scale attention-enhanced convolutional network (MSAIC-Net) for ECG-based myocardial substrate abnormality detection. MSAIC-Net employs parallel atrous convolutional branches to extract ECG features across multiple temporal receptive fields, enabling the model to capture both local and longer-range temporal patterns. Channel attention is then used to adaptively reweight informative lead-wise and feature-channel representations. To address class imbalance and improve feature separability, we introduce a novel imbalance-aware supervised contrastive learning strategy that encourages samples from the same class to form compact representations while increasing separation between abnormal and normal samples. Lead-wise permutation importance is further incorporated to quantify the contribution of each ECG lead and improve model interpretability. The proposed method was evaluated on two complementary datasets: a low-data institutional cohort from the University of Virginia (UVA) Health System for myocardial scar classification and the large-scale public PTB-XL dataset from PhysioNet for MI identification. Experimental results show that MSAIC-Net outperforms baseline models, with particularly pronounced improvements in the low-data UVA cohort. Overall, the proposed framework provides an effective and interpretable approach for ECG-based detection of myocardial substrate abnormalities.
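
Lead-wise permutation importance can be sketched generically as follows. The abstract does not say how the permutation is carried out, so this assumes the standard recipe (shuffle one lead across samples and measure the accuracy drop); the function names and the scalar per-lead features are illustrative assumptions, not the authors' implementation.

```python
import random


def accuracy(model, samples, labels):
    """Fraction of samples for which model(sample) equals the label."""
    correct = sum(1 for x, y in zip(samples, labels) if model(x) == y)
    return correct / len(labels)


def lead_permutation_importance(model, samples, labels, num_leads, num_repeats=20, seed=0):
    """Mean accuracy drop when a single lead is shuffled across samples.

    samples[i][lead] holds one lead's feature (a scalar here for brevity; a
    real ECG lead would be a time series). `model` is any callable that maps
    a sample to a predicted label.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, samples, labels)
    importances = []
    for lead in range(num_leads):
        drops = []
        for _ in range(num_repeats):
            column = [s[lead] for s in samples]
            rng.shuffle(column)  # break the link between this lead and the labels
            permuted = [s[:lead] + [v] + s[lead + 1:] for s, v in zip(samples, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / num_repeats)
    return importances
```

A lead the model ignores gets an importance of exactly zero under this scheme, while a lead the prediction depends on shows a positive average drop.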

2606.06717 2026-06-08 cs.LG cs.AI q-bio.BM q-bio.QM New submission

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

Saket Reddy, Shiwei Liu

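Affiliations: University of Illinois - Urbana-Champaign

AI summary: Proposes ShallowBench, a benchmark of 5,780 shallow-pocket targets for evaluating generative drug design models on low-concavity interfaces, revealing that existing models yield weaker predicted binding affinity on them.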

Abstract

While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.

2606.06715 2026-06-08 cs.CL cs.AI cs.LG New submission

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

Upasana Chatterjee

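Affiliations: Columbia University

AI summary: Studies the causal effect of topic sentiment on perceived political ideology by comparing human and LLM annotations; only fine-tuned GPT-4o-mini produces significant causal effects, which the author attributes to shortcut learning.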

Comments: Accepted to ACL SRW 2026

Abstract

We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and fine-tuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment-ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.

2606.06714 2026-06-08 cs.CV New submission

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin

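Affiliations: Brown University; Harvard University

AI summary: Studies vision-language models (VLMs) on slant-from-texture perception; zero-shot and in-context prompting both produce anchoring failures in which only a few discrete angles are predicted, and supervised fine-tuning partially mitigates the failure but residual anchoring remains, pointing to a representation-to-output language interface that cannot express graded values.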

Abstract

Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.

2606.06712 2026-06-08 cs.CL cs.AI New submission

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

Affiliations: Department of Computer Science and Engineering, Texas A&M University; Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston; Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes the On-Policy Diffusion Language Model (OPDLM), which converts autoregressive models into diffusion language models via on-policy distillation, addressing distribution shift and train-inference mismatch and achieving strong performance with 15x to 7,000x fewer training tokens.

Abstract

We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

2606.06708 2026-06-08 cs.CL New submission

Signal-Driven Observation for Long-Horizon Web Agents

Shubham Gaur, Ian Lane

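Affiliations: University of Cambridge

AI summary: Proposes Signal-Driven Observation (SDO), in which a dedicated sub-call reads the full DOM but returns only task-relevant elements and is re-invoked only when a lightweight signal detector fires, addressing context degradation in long-horizon web agents.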

Comments: 10 pages, 1 figure

Abstract

Web agents operating over long horizons ingest raw DOM and accessibility trees (routinely tens of thousands of tokens) at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an architectural mistake. Drawing on the insight from Recursive Language Models that querying a document outperforms reading it wholesale, we propose Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and is re-invoked only when a lightweight signal detector fires, triggered by URL transitions, newly visible interactive elements, action failures, or exogenous browser events. We outline the open problems SDO introduces and call on the community to treat observation compression as a core architectural decision in web agent design.
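
The four triggers listed in the abstract suggest a detector along the following lines. This is a minimal sketch under stated assumptions: `BrowserState`, `fired_signals`, and `maybe_observe` are hypothetical names, and the abstract does not specify how the browser state is represented or how the signals are actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserState:
    url: str
    interactive_ids: frozenset        # ids of currently visible interactive elements
    last_action_failed: bool = False
    external_events: tuple = ()       # browser events not caused by the agent


def fired_signals(prev, curr):
    """Return the names of the signals that fired between two browser states."""
    signals = []
    if curr.url != prev.url:
        signals.append("url_transition")
    if curr.interactive_ids - prev.interactive_ids:
        signals.append("new_interactive_elements")
    if curr.last_action_failed:
        signals.append("action_failure")
    if curr.external_events:
        signals.append("exogenous_event")
    return signals


def maybe_observe(prev, curr, observe, cached):
    """Re-run the expensive observation sub-call only if some signal fired."""
    return observe(curr) if fired_signals(prev, curr) else cached
```

In this sketch the agent keeps acting on the cached, compressed observation until the detector fires, which is what decouples observation frequency from action frequency.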

2606.06704 2026-06-08 cs.RO New submission

Optimal Control Approach for Non-prehensile Ball Juggling Using a 7-DoF Manipulator

Joel Ramadani, Vasilije Rakčević, Riddhiman Laha, Arne Sachtler, Valentin Le Mesle, Achim J. Lilienthal, Sami Haddadin

Affiliations: Technical University of Munich; Mohamed Bin Zayed University of Artificial Intelligence; German Aerospace Center (DLR), Institute of Robotics and Mechatronics

AI summary: Proposes a model-based two-stage optimal control framework for non-prehensile ball juggling with a tool on a 7-DoF manipulator, generating periodic juggling trajectories and using offline computation to enable real-time error correction.

Comments: 8 pages, accepted at ICRA 2026

Abstract

Non-prehensile object manipulation skills are important for real-world robot interactions, enabling highly dynamic tasks such as balancing a glass on a tray or the controlled sliding of items on a table. Among such tasks, those characterised by high-speed manipulation requirements and general sensitivity of the resulting hybrid dynamics are particularly hard to accomplish. Within these, juggling can be seen as a highly challenging maneuver to be solved. The key to robotic juggling is achieving dynamic stabilisation of an underactuated object. Since the object does not possess the ability of self-correction, its stability is entirely dependent on the forces applied to it. This creates a system that is sensitive to control inputs, where timing is critical to continuously counteract deviations and maintain the desired behavior. We develop a systematic method to control a 7-degree-of-freedom manipulator performing non-prehensile ball juggling with a tool. Our primary contribution is a model-based framework for generating juggling trajectories and stabilizing a periodic juggling motion for this hybrid system. The framework incorporates a two-stage optimal control approach to compute the underlying feasible motion patterns required for stable juggling. Offline-computed trajectories are then organised to enable real-time error correction without solving optimal control problems online. We demonstrate the effectiveness of the resulting controller by first evaluating its performance in a simulation environment and then performing an experiment using a Franka Emika Panda robot.

2606.06696 2026-06-08 cs.CV cs.AI New submission

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy

Affiliations: Stanford University; University of Wisconsin–Madison; Instituto Tecnológico de Monterrey; Monash University; University of Cambridge; Shanghai Jiao Tong University; Shandong University

AI summary: Proposes the MMBU benchmark, covering 35 submodalities, to systematically evaluate the visual perception and generalization of VLMs in biomedicine through classification, grounding, and detection tasks, finding that high accuracy can mask perception deficiencies.

Abstract

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

2606.06695 2026-06-08 cs.CV New submission

S23DR 2026 Winning Solution

Jan Skvrna, Miroslav Purkrabek, Lukas Neumann

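Affiliations: Visual Recognition Group; Czech Technical University in Prague

AI summary: Proposes a 3D wireframe reconstruction method based on conditional sets and a flow-matching DiT, with a global coarse prediction, local refinement, and a multi-sample consensus step, achieving a leading HSS = 0.654 in the S23DR 2026 challenge.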

Abstract

This text presents the winning solution to the S23DR 2026 challenge for structured 3D wireframe reconstruction from sparse SfM, fitted depth, and semantic segmentations. The method treats vertices as a conditional set and denoises 64 vertex tokens with a flow-matching DiT conditioned on Perceiver-style scene tokens. A global pass predicts the coarse structure, a hull-cropped second pass refines it, and a small multi-sample consensus step keeps the stochastic sampler well behaved. The final system ranked first on the private leaderboard, achieving HSS = 0.654.

2606.06694 2026-06-08 cs.LG cs.AI cs.CY New submission

The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search

Hana Samad, Trung Lam, Christoph Mügge-Durum, Michael Akinwumi

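Affiliations: National Fair Housing Institute

AI summary: A behavioral audit of housing recommendations from seven LLMs across four U.S. cities finds that racial steering is an emergent behavior of the model's interpretive license rather than a static property, and that the city is not a neutral testing unit.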

Comments: 13 pages with supplemental tables and figures, AIES '26 Submission

Abstract

Large language models (LLMs) are rapidly assuming an intermediary role in housing search through the integration of listing platforms within conversational interfaces, mediating access to information, search, and recommendations within urban settings. We expand on prior work on racial steering in LLMs by conducting a behavioral audit of seven open-weight and closed-source LLMs across four U.S. cities, testing location recommendations across three iterative prompting conditions that progressively add lifestyle preference context and reflect fair housing paired-testing methodologies. We find that steering is an emergent behavior of the model's interpretive license rather than primarily a static property. Steering results from the interaction of a user's identity, preference articulation, and the spatial logic that a model has internalized about learned representations of place, preference, and opportunity in a given city, and how different types of users relate to it. While steering was present, it was not uniform in direction or magnitude across evaluated conditions. Preference-conditioned testing often increased or reconfigured the number of models that exhibited steering behaviors relative to baseline conditions, suggesting that LLMs may interpret what the same housing preference means differently depending on the racial identity of the user. Our findings also demonstrate that the city is not a neutral testing unit for LLM evaluation in place-based sectors, and results from one local market cannot be assumed to generalize to another. Local and domain expertise will be required in the housing sector to ensure that legal and institutional commitments to fair housing are not undermined while adopting AI tools that mediate spatial access.

2606.06690 2026-06-08 cs.CV New submission

RPC-GS: Gaussian Splatting with native RPC Rendering for Satellite Imagery

Valentin Wagner, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens

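Affiliations: Fraunhofer Institute of Optronics, System Technologies and Image Exploitation

AI summary: Proposes RPC-GS, the first Gaussian Splatting framework that natively uses the RPC model, avoiding approximation errors by projecting Gaussian means and covariances directly through it and achieving the lowest reconstruction error on satellite benchmark datasets.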

Abstract

We present RPC-GS, the first Gaussian Splatting framework for satellite imagery that operates natively with Rational Polynomial Camera (RPC) models. The RPC model is the de facto standard for representing the complex imaging geometry of modern pushbroom satellite sensors. To simplify rendering, prior satellite Gaussian Splatting methods replace the RPC model with perspective or affine camera approximations, leading to geometric errors during reconstruction. RPC-GS avoids these approximations by projecting Gaussian means and covariances directly through the RPC model during the splatting process. We embed the RPC model in a chain of carefully selected geo-coordinate transformations representing a mapping from splatting-suitable scene coordinates to image coordinates. To map the Gaussian covariance matrices, we derive a numerically robust Jacobian-based covariance projection for the (partially nonlinear) coordinate transformations. Since RPCs lack an explicit notion of camera depth, we integrate a metric ray-based depth formulation. We benchmark RPC, perspective, and affine camera models in a unified framework, with our native RPC renderer consistently achieving the lowest reconstruction error on leading satellite benchmark datasets, improving mean altitude error over perspective and affine approximations by 29.6% and 63.8% on DFC2019, and by 9.9% and 37.9% on IARPA2016. We release our code to support future research of Gaussian Splatting in the satellite imaging domain.
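
The general idea behind a Jacobian-based covariance projection is first-order propagation, $\Sigma' = J \Sigma J^\top$, where $J$ is the Jacobian of the projection at the Gaussian mean. The sketch below illustrates only that generic idea with a finite-difference Jacobian; it is not the paper's numerically robust derivation, and `toy_projection` is a made-up rational map standing in for a real RPC model.

```python
def toy_projection(p):
    """Hypothetical nonlinear 3D -> 2D map (a stand-in, not a real RPC model)."""
    x, y, z = p
    denom = 1.0 + 0.1 * z
    return [(2.0 * x + 0.5 * y) / denom, (1.5 * y - 0.2 * x * z) / denom]


def numerical_jacobian(f, point, eps=1e-6):
    """Central-difference Jacobian of f at `point` (rows: outputs, cols: inputs)."""
    out_dim = len(f(point))
    jac = [[0.0] * len(point) for _ in range(out_dim)]
    for j in range(len(point)):
        plus, minus = list(point), list(point)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(out_dim):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def project_covariance(jac, cov):
    """First-order covariance propagation: returns J * cov * J^T."""
    rows, n = len(jac), len(cov)
    j_cov = [[sum(jac[i][k] * cov[k][l] for k in range(n)) for l in range(n)]
             for i in range(rows)]
    return [[sum(j_cov[i][l] * jac[m][l] for l in range(n)) for m in range(rows)]
            for i in range(rows)]


if __name__ == "__main__":
    mean = [0.5, -0.3, 2.0]
    cov = [[0.04, 0.01, 0.0], [0.01, 0.09, 0.0], [0.0, 0.0, 0.25]]
    print(project_covariance(numerical_jacobian(toy_projection, mean), cov))
```

An analytic Jacobian would normally be preferred over finite differences in a differentiable renderer; the numerical version is used here only to keep the example short.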

2606.06687 2026-06-08 cs.LG cs.DC cs.NI cs.SY eess.SY New submission

Towards Serverless Semi-Decentralized Federated Learning with Heterogeneous Optimizers

Su Wang, Mung Chiang, H. Vincent Poor

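Affiliations: Department of Electrical and Computer Engineering, Princeton University; Department of Electrical and Computer Engineering, Purdue University

AI summary: Proposes serverless semi-decentralized federated learning (SSD-FL), which forms clusters through a lightweight D2D initialization and optimizes clustering using effective loss functions and the Cheeger inequality, improving convergence speed and communication efficiency.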

Comments: Under review at IEEE/ACM Transactions on Networking

Abstract

We investigate cluster formation, involving the number and composition of clusters, in decentralized federated learning (FL) with heterogeneous machine learning (ML) optimizers. While clustering in centralized FL has enabled scalability and resource savings, its value and development in fully decentralized environments have yet to be explored. Optimizing cluster formation in such environments is challenging, especially due to the complex coupling between network graph structures, local data heterogeneity, and different local ML model optimizers. To address these challenges, we propose serverless semi-decentralized FL (SSD-FL), a methodology requiring no persistent server infrastructure. In SSD-FL, cluster formation occurs via a lightweight, one-time device-to-device (D2D) initialization phase, after which actual ML model training (alongside consensus and convergence processes) is fully serverless. Functionally, SSD-FL segments global rounds into intra-cluster and inter-cluster regimes, ensuring global convergence and consensus through novel "effective loss functions" that integrate device-specific ML optimizers with network graph-based regularization. Next, SSD-FL leverages the consensus gap via the Cheeger inequality to develop an iterative clustering algorithm evaluated against our derived convergence and consensus bounds, which incorporate a unique scoring metric to quantify data and optimizer heterogeneity across devices. Finally, experimental evaluation against three categories of decentralized FL methodologies validates that SSD-FL improves both convergence speeds and communication efficiency across various network graphs, datasets, and local optimizer regimes.

2606.06686 2026-06-08 cs.RO cs.DS New submission

On the Hardness of Optimal Motion on Trees

Tzvika Geft

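Affiliations: Rutgers University

AI summary: Proves that labeled and 2-colored Multi-Agent Path Finding (MAPF) on trees is NP-hard under the distance, makespan, and flowtime objectives, resolving the long-open complexity of the classical Pebble Motion problem.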

Abstract

This paper presents a simple framework that settles the complexity of Multi-Agent Path Finding (MAPF) on trees across standard objectives (distance, makespan, and flowtime) for both labeled and colored variants. In MAPF, agents occupy the vertices of a graph and must move to target vertices without collisions while optimizing a given objective. In the labeled case, the agents are distinct and have respective targets; in the colored case, agents of the same color are interchangeable. While many MAPF variants are known to be intractable, several basic cases on trees have remained open. We prove NP-hardness on trees for both labeled and 2-colored MAPF under all three objectives. In particular, we resolve the classical Pebble Motion problem, where one pebble moves at a time to an adjacent empty vertex and the goal is to minimize the total number of moves. Despite being one of the most basic discrete motion models, its complexity on trees had remained open for several decades. Moreover, for colored Pebble Motion, we give the first hardness result on any graph class, already with two colors, which is tight. All of these results are established through the hardness of Stack Rearrangement, itself posed as an open problem, which asks to optimally rearrange items stored in stacks, and which we also prove to be NP-hard. Notably, the connection to stacks yields hardness already on very simple trees (subdivided stars) across all problems. Together, these results reveal a common tractability barrier that permeates several fundamental motion models, thereby unifying and strengthening prior hardness results.
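
The Pebble Motion rules stated in the abstract (one pebble moves at a time to an adjacent empty vertex, and the cost is the total number of moves) can be made concrete with a small checker. This sketch only validates a given move sequence and counts its moves; it does not search for an optimal sequence, and the function name and data layout are illustrative assumptions.

```python
def apply_moves(adjacency, positions, moves):
    """Validate and apply a sequence of pebble moves.

    adjacency: dict vertex -> set of neighbouring vertices
    positions: dict pebble -> vertex currently occupied
    moves:     list of (pebble, target_vertex)
    Returns (final_positions, number_of_moves); raises ValueError on an illegal move.
    """
    positions = dict(positions)
    occupied = set(positions.values())
    for pebble, target in moves:
        source = positions[pebble]
        if target not in adjacency[source]:
            raise ValueError(f"{target} is not adjacent to {source}")
        if target in occupied:
            raise ValueError(f"{target} is already occupied")
        occupied.remove(source)
        occupied.add(target)
        positions[pebble] = target
    return positions, len(moves)


if __name__ == "__main__":
    # A star with centre "c" and three leaves; pebbles A and B swap leaves.
    star = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
    start = {"A": "a", "B": "b"}
    swap = [("A", "c"), ("A", "d"), ("B", "c"), ("B", "a"), ("A", "c"), ("A", "b")]
    print(apply_moves(star, start, swap))
```

Checking a proposed sequence like this is easy; the hardness result in the abstract concerns finding a sequence with the fewest moves.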

2606.06685 2026-06-08 cs.CV cs.GR New submission

RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

RigPAPR:基于固定视角视频的静态神经点云绑定动画

Shichong Peng, Yanshu Zhang, Ke Li

Affiliations: APEX Lab; School of Computing Science; Simon Fraser University

AI summary: Proposes RigPAPR, which drives a static neural point cloud with direct linear blend skinning, without a mesh proxy or pose-dependent correction; it reduces joint-boundary artifacts on synthetic and real data and improves novel-view PSNR by 3+ dB.

Comments: An overview video is available at https://youtu.be/up3BwRHYWG8

Details
AI Chinese abstract

静态神经点云重建从姿态图像中高保真地捕捉主体。给定这样的重建,我们的目标是使其动画化,以跟随主体的单目固定视角驱动视频(无论是捕获的还是由图像到视频生成产生的),并恢复一个绑定的、可重新姿态的3D资产。现有方法通过直接线性混合蒙皮或网格代理来变形高斯溅射,两者在关节连接处都容易出现伪影,即使有逐基元的校正。我们将伪影追溯到表示:每个溅射携带一个在规范姿态中校准的个体形状,以与其邻居拼接。在刚性LBS下,每个溅射随其骨骼移动但不能弯曲,因此规范拼接在关节边界处断裂成间隙和尖峰。邻近注意力点渲染则没有逐基元的形状;每个像素在渲染时从变形基元的位置重新组合,因此表面自然地随关节运动重新形成。我们提出RigPAPR,它自动绑定静态PAPR点云,并通过单个固定视角视频在直接LBS下驱动它,无需网格代理、姿态依赖校正或类别模板。在合成主体上,RigPAPR在有监督视角下匹配最强基线,在新视角下超过基于网格和高斯溅射的基线3+dB PSNR,并在合成和真实主体上生成更干净的关节边界渲染。

English abstract

Static neural point reconstructions capture a subject at high fidelity from posed images. Given such a reconstruction, we aim to animate it to follow a monocular fixed-viewpoint driving video of the subject, whether captured or produced by image-to-video (I2V) generation, and to recover a rigged, re-posable 3D asset. Existing methods deform Gaussian splats through direct linear blend skinning (LBS) or mesh proxies, both of which are prone to joint-boundary artifacts under articulation, even with per-primitive corrections. We trace the artifact to the representation: each splat carries an individual shape calibrated in the canonical pose to tile with its neighbours. Under rigid LBS, each splat moves with its bone but cannot bend, so the canonical tiling breaks at joint boundaries into gaps and spikes. Proximity attention point rendering (PAPR) instead carries no per-primitive shape; each pixel is recomposed at render time from the deformed primitives' positions, so the surface re-forms naturally with the articulation. We present RigPAPR, which auto-rigs a static PAPR cloud and drives it under direct LBS from a single fixed-viewpoint video, without mesh proxy, pose-dependent correction, or category template. On synthetic subjects, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+dB PSNR, with cleaner joint-boundary renderings of both synthetic and real subjects.
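Editor's sketch (not from the paper): the abstract's argument rests on how linear blend skinning (LBS) moves a primitive, so a 2D toy version may help. A point is transformed by each bone's rigid transform and the results are averaged with skinning weights. The two bones, the weights, and the helper names `rigid_2d` and `lbs` are all hypothetical; RigPAPR's actual rigging and rendering are not shown here.

```python
import math

def rigid_2d(angle, tx, ty):
    """Return a 2D rigid transform as a function mapping (x, y) -> (x', y')."""
    c, s = math.cos(angle), math.sin(angle)
    def apply(p):
        x, y = p
        return (c * x - s * y + tx, s * x + c * y + ty)
    return apply

def lbs(point, weights, bone_transforms):
    """Linear blend skinning: weighted sum of the point as moved by each bone."""
    assert abs(sum(weights) - 1.0) < 1e-9, "skinning weights should sum to 1"
    moved = [T(point) for T in bone_transforms]
    x = sum(w * m[0] for w, m in zip(weights, moved))
    y = sum(w * m[1] for w, m in zip(weights, moved))
    return (x, y)

# Two hypothetical bones: one stays fixed, one rotates 90 degrees about the origin.
bones = [rigid_2d(0.0, 0.0, 0.0), rigid_2d(math.pi / 2, 0.0, 0.0)]
rigid_point = lbs((1.0, 0.0), [0.0, 1.0], bones)  # follows the second bone only
joint_point = lbs((1.0, 0.0), [0.5, 0.5], bones)  # blended, as near a joint
```

Only positions are blended in this toy. A primitive that also carries its own fixed shape would move rigidly with one bone, which is the situation the abstract blames for gaps and spikes at joint boundaries.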

2606.06684 2026-06-08 cs.CV New submission

Adaptive Band Selection for Hyperspectral Classification with Spatially Disjoint Evaluation

面向空间分离评估的高光谱分类自适应波段选择

Ikram El-Hajri, Ouassim Karrakchou, Alejandro Mousist

Affiliations: International University of Rabat, Rabat, Morocco; Thales Alenia Space, Spain

AI summary: Proposes SGBR-HC, which uses a supervised spectral ranking to initialize trainable sparse gates and lets training determine the number of bands; under spatially disjoint evaluation it achieves the highest mean overall accuracy and Cohen's kappa with about 20 bands.

Comments: 6 pages, 2 figures, 3 tables

Details
AI Chinese abstract

基于可微选择器的高光谱波段选择方法可能对初始化和提取最终离散子集敏感,而预设的波段数量限制了灵活性。我们提出SGBR-HC(光谱组波段排序与硬混凝土初始化),一种两阶段方法,使用监督光谱排序来初始化可训练稀疏门,而不是将排序视为固定选择规则,让所选波段的数量由训练决定。第一阶段通过类别可分性和光谱多样性对训练像素的候选波段进行评分;该排序为第二阶段的门控逻辑值提供种子,第二阶段将稀疏门与空间分类器联合训练。在帕维亚大学和休斯顿2013数据集上进行空间分离评估,并通过在所选波段上重新训练新分类器进行验证,SGBR-HC以大约20个波段实现了最高的平均总体精度和Cohen's kappa。跳过第一阶段导致帕维亚大学的OA下降8.84个百分点,休斯顿2013下降22.15个百分点,证实了排序先验的作用。随机像素分割使帕维亚大学的OA膨胀30.56个百分点,强调了空间泄漏作为关键评估混淆因素。

English abstract

Hyperspectral band selection methods based on differentiable selectors can be sensitive to initialization and to extracting a final discrete subset, while prescribed band counts limit flexibility. We propose SGBR-HC (Spectral-Group Band Ranking with Hard-Concrete initialization), a two-stage method that uses a supervised spectral ranking to initialize trainable sparse gates rather than treating ranking as a fixed selection rule, letting the number of selected bands be determined by training. Stage-1 scores candidate bands from training pixels by class discriminability and spectral diversity; this ranking seeds the gate logits for Stage-2, which trains the sparse gates jointly with a spatial classifier. Under spatially disjoint evaluation on Pavia University and Houston 2013, verified by retraining a fresh classifier on the selected bands, SGBR-HC achieves the highest mean overall accuracy and Cohen's kappa with approximately twenty bands. Bypassing Stage-1 degrades OA by 8.84 pp on Pavia University and 22.15 pp on Houston 2013, confirming the ranking prior's role. Random pixel splits inflate OA on Pavia University by 30.56 pp, underscoring spatial leakage as a critical evaluation confound.
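Editor's sketch (not from the paper): the two metrics the abstract reports, overall accuracy (OA) and Cohen's kappa, can both be read off a confusion matrix. The 2x2 matrix below is made-up toy data, and the function name is hypothetical.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n  # overall accuracy
    row_tot = [sum(confusion[i]) for i in range(k)]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

toy_confusion = [[40, 10],
                 [5, 45]]
oa, kappa = overall_accuracy_and_kappa(toy_confusion)
```

Kappa discounts the agreement a classifier would reach by chance, which is why it is usually reported alongside OA when class sizes are unbalanced.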

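2606.06679 2026-06-08 cs.CL cs.AI cs.CY New submission

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

HKJudge:用于解释法院认定事实、推理过程和裁决结果的法律话语标注语料库

Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

Affiliations: City University of Hong Kong; University of Chinese Academy of Sciences

AI summary: Introduces HKJudge, the first sentence-level expert-annotated legal discourse dataset, covering criminal judgments from all levels of Hong Kong's courts, with a two-tier discourse schema (26 rhetorical roles and 3 sentencing elements) and benchmark evaluations of BERT-based models and LLMs.

Details
AI Chinese abstract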

法院判决是法律实践和法理学的核心,然而香港判决的话语分析由于缺乏专家标注语料库而受到限制。我们引入了香港判决话语数据集(HKJudge),这是首个句子级专家标注的法律话语语料库。HKJudge包含香港法院层级所有五个级别的刑事判决,共计约29万句子和650万词元,由法律语言学专家完全标注。我们设计了一个双层话语模式,捕捉法院认定的事实、推理过程以及裁决结果。在句子层面,每个句子被分配26种修辞角色之一。在跨度层面,句子进一步标注了三个判刑要素(指控、监禁刑期、罚款)。十位法律语言学标注者进行了标注,标注者间一致性为κ=0.8。我们在HKJudge上定义了两个任务,称为修辞角色分类和法律要素提取,并提供了四种基于BERT的模型、两种开源LLM(在零样本和微调设置下)以及四种商业LLM在这两个任务上的首次基准评估。我们的工作展示了句子级话语标注对于建模香港判决结构的价值,并为未来法律判决预测研究提供了丰富的数据基础。HKJudge数据集和代码可在以下网址获取:https://github.com/xuanxixi/HKJudge。

English abstract

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.

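2606.06674 2026-06-08 cs.CL cs.CY New submission

What Do People Actually Want From AI? Mapping Preference Plurality

人们真正希望从AI中得到什么?偏好多元性映射

Julia Sepúlveda Coelho, Scott A. Hale

Affiliations: Oxford Internet Institute, University of Oxford; Meedan

AI summary: Analysing 1,500 open-ended responses across 75 countries, the paper finds that people want different things from AI: most values are requested by only a minority, the same word (e.g. "truthfulness") carries divergent meanings, and some capabilities are contested, exposing fundamental flaws in how current RLHF aggregates preferences.

Comments: Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

Details
AI Chinese abstract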

大型语言模型(LLMs)通常通过基于人类反馈的强化学习(RLHF)进行微调,以与人们的偏好和价值观对齐。然而,这种方法存在已知局限性:它聚合了冲突的偏好,通常依赖于不具有代表性的样本,并且仅使用二元比较。通过分析来自PRISM数据集跨越75个国家的1500份开放式回答,我们考察了人们真正希望从AI系统中得到什么,并揭示了当前方法的具体失败。我们发现不同的人想要不同的东西:大多数价值观被不到四分之一的受访者要求,真实性是唯一的例外,占49%。此外,相同的词语隐藏着不同的含义:当人们描述他们所说的“真实性”时,他们揭示了不同的、可能不相容的认识论基础,因为有些人要求有来源的主张,有些人要求专家意见,甚至有些人要求不受欢迎的观点。某些能力,即模型的行为有多像人类,以及某些特征,如AI护栏,是完全有争议的,有些人渴望它们,而另一些人则拒绝它们。我们还发现,人们经常使用上下文区分(AI“默认”应该做什么与“如果被要求”应该做什么),这是二元比较无法捕捉的。这些发现暴露了当前对齐实践中的根本问题。当49%的人要求真实性但以不同方式定义时,这不太可能被单个奖励模型捕捉到。尽管用户明确要求准确性,但在资金充足的模型中持续存在高幻觉率,这表明当前方法未能识别实际偏好。本文揭示了当前被扁平化为通用偏好模型的情境化、有争议、不完美的信号,这种做法被其他人描述为认识论暴力。

English abstract

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

2606.06673 2026-06-08 cs.LG New submission

Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

不确定性感知的LLM引导策略塑形用于稀疏奖励强化学习

Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh

Affiliations: USD AI Research Lab

AI summary: Proposes the ULPS framework, which combines a calibrated large language model with uncertainty estimation: a BERT model fine-tuned on A* trajectories supplies action suggestions, and an entropy-based mechanism balances LLM guidance against the PPO policy, improving success rate, reward efficiency, and sample complexity on the MiniGridUnlockPickup benchmark.

Comments: Accepted to the 2026 IEEE Conference on Artificial Intelligence (IEEE CAI). 6 pages, 3 figures. Code available at: https://github.com/USD-AI-ResearchLab/uncertainty-aware-llm-rl

Details
AI Chinese abstract

稀疏奖励和异构任务序列仍然是强化学习(RL)中的持续挑战,常常导致收敛缓慢、泛化能力弱和探索效率低下。我们提出不确定性感知的LLM引导策略塑形(ULPS),这是一个新颖的框架,将校准的大语言模型(LLM)集成到RL训练循环中,以提供结构化、不确定性调制的行为引导。ULPS采用基于A*的预言机来合成最优符号轨迹,用于微调基于BERT的语言模型。在训练过程中,该模型提供动作建议,其影响取决于通过蒙特卡洛(MC)dropout估计的认知不确定性。基于熵的混合机制自适应地平衡LLM引导和学习到的策略(通过近端策略优化,PPO),使智能体能够优先考虑可靠先验,同时保持适应性。我们在MiniGridUnlockPickup基准上评估ULPS,并观察到在成功率、奖励效率和样本复杂度上,相对于无引导、未校准和标准RL基线的一致改进。ULPS在微调后执行准确率提高了9%以上,需要更少的环境交互,并获得了更高的奖励AUC。我们的结果表明,集成符号A*轨迹、预训练语言先验和不确定性感知控制,为稀疏奖励领域中的多任务强化学习提供了一种原则性且有效的方法,并具有扩展到部分可观察和多智能体设置的潜力。

English abstract

Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Guided Policy Shaping (ULPS), a novel framework that integrates a calibrated Large Language Model (LLM) into the RL training loop to provide structured, uncertainty-modulated behavioral guidance. ULPS employs an A*-based oracle to synthesize optimal symbolic trajectories, which are used to fine-tune a BERT-based language model. During training, this model supplies action suggestions whose influence is conditioned on epistemic uncertainty estimated via Monte Carlo (MC) dropout. An entropy-based blending mechanism adaptively balances LLM guidance and the learned policy (via Proximal Policy Optimization, PPO), allowing the agent to prioritize reliable priors while preserving adaptability. We evaluate ULPS on the MiniGridUnlockPickup benchmark and observe consistent improvements in success rate, reward efficiency, and sample complexity over unguided, uncalibrated, and standard RL baselines. ULPS achieves more than 9% improvement in execution accuracy after fine-tuning, requires fewer environment interactions, and yields higher reward AUC. Our results demonstrate that integrating symbolic A* trajectories, pretrained language priors, and uncertainty-aware control offers a principled and effective approach to multi-task reinforcement learning in sparse-reward domains, with potential extensibility to partially observable and multi-agent settings.
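Editor's sketch (not from the paper): one plausible reading of "MC dropout plus entropy-based blending" is to average the action distributions from several dropout-enabled forward passes, measure the entropy of that average, and trust the language model less as the entropy grows. The abstract does not give the blending rule, so the formula `trust = 1 - H / log(K)` and all function names below are assumptions. The entropy of the averaged distribution is also a simplification, since it mixes epistemic and aleatoric uncertainty.

```python
import math

def mean_distribution(samples):
    """Average action probabilities over stochastic (dropout-enabled) forward passes."""
    k = len(samples[0])
    return [sum(s[a] for s in samples) / len(samples) for a in range(k)]

def normalized_entropy(p):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(q * math.log(q) for q in p if q > 0.0)
    return h / math.log(len(p))

def blend(llm_samples, policy_probs):
    """Mix LLM suggestions with the learned policy; high entropy means low trust."""
    llm_mean = mean_distribution(llm_samples)
    trust = 1.0 - normalized_entropy(llm_mean)
    mixed = [trust * l + (1.0 - trust) * p for l, p in zip(llm_mean, policy_probs)]
    return trust, mixed

policy = [0.2, 0.3, 0.5]
# Passes agree: the LLM suggestion dominates.
trust_hi, mixed_hi = blend([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]], policy)
# Passes disagree completely: the agent falls back to its own policy.
trust_lo, mixed_lo = blend([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], policy)
```

Because both inputs are probability distributions and the mixing weight lies in [0, 1], the blended output is again a valid distribution.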

2606.06671 2026-06-08 cs.CV New submission

JA-SIREN: Deterministic Initialization for Sinusoidal Networks via Spectral Matching

JA-SIREN:通过频谱匹配实现正弦网络的确定性初始化

Mohammed Alsakabi, Kejia Hu, John M. Dolan, Ozan K. Tonguz

Affiliations: Department of Electrical and Computer Engineering, College of Engineering; The Robotics Institute, School of Computer Science; Carnegie Mellon University

AI summary: Proposes JA-SIREN, a deterministic initialization scheme that uses the discrete sine transform and the Jacobi-Anger expansion to analytically match the network's initial spectrum to the target signal, removing randomness; on the Kodak dataset it reaches a mean PSNR of 67.18 dB, 21.30 dB above the best baseline.

Details
AI Chinese abstract

现有的隐式神经表示(INR)方法受随机初始化影响,无法保证跨运行的一致性或高质量性能,图像回归中的变化超过2.5 dB(78%)。这种变化对结果可重复性至关重要的科学计算和模拟来说是有问题的。为了解决这个问题,我们提出了Jacobi-Anger正弦表示网络(JA-SIREN),一种基于经典频谱分析的正弦网络确定性初始化方案。通过计算目标信号的离散正弦变换(DST)并利用Jacobi-Anger展开,我们为两层正弦MLP推导出闭式权重,该权重解析地将网络的初始频谱响应与目标信号匹配,无需随机种子或额外的超参数调整。在Kodak数据集上,JA-SIREN实现了67.18 dB的平均PSNR,比最佳基线提高了21.30 dB。这是以零运行间方差实现的,证实了频谱信息初始化是正弦INR中比随机初始化更有效且可重复的替代方案。

English abstract

Existing implicit neural representation (INR) approaches suffer from stochastic initialization that does not guarantee consistent or high-quality performance across runs, with variations reaching more than 2.5 dB (78%) in image regression. This variation is problematic for scientific computing and simulation, where result reproducibility is crucial. To address this problem, we present Jacobi-Anger Sinusoidal Representation Network (JA-SIREN), a deterministic initialization scheme for sinusoidal networks grounded in classical spectral analysis. By computing the Discrete Sine Transform (DST) of the target signal and leveraging the Jacobi-Anger expansion, we derive closed-form weights for a two-layer sinusoidal MLP that analytically match the network's initial spectral response to the target signal, requiring no random seed or additional hyperparameter tuning. On the Kodak dataset, JA-SIREN achieves a mean PSNR of 67.18 dB, a 21.30 dB improvement over the best baseline. This is achieved with zero run-to-run variance, confirming that spectrally-informed initialization is a more effective and reproducible alternative to stochastic initialization for sinusoidal INRs.
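Editor's sketch (not from the paper): the Jacobi-Anger expansion is what lets a sine of a sine, as in a two-layer sinusoidal MLP, be written as a sum of pure harmonics: sin(z sin θ) = 2 Σ_{k≥0} J_{2k+1}(z) sin((2k+1)θ). The check below verifies this identity numerically with the standard library only. The Bessel functions are evaluated from their integral representation with a trapezoid rule, and the values of `z` and `theta` are arbitrary. How JA-SIREN turns the identity into closed-form weights is not reproduced here.

```python
import math

def bessel_j(n, z, steps=2000):
    """Integer-order Bessel function: J_n(z) = (1/pi) * int_0^pi cos(n*t - z*sin(t)) dt."""
    h = math.pi / steps
    # Trapezoid endpoints: the integrand is cos(0) at t=0 and cos(n*pi) at t=pi.
    total = 0.5 * (1.0 + math.cos(n * math.pi))
    for i in range(1, steps):
        t = i * h
        total += math.cos(n * t - z * math.sin(t))
    return total * h / math.pi

def sine_of_sine_series(z, theta, terms=12):
    """Truncated Jacobi-Anger series for sin(z * sin(theta)); only odd harmonics appear."""
    return 2.0 * sum(
        bessel_j(2 * k + 1, z) * math.sin((2 * k + 1) * theta) for k in range(terms)
    )

z, theta = 1.5, 0.7
exact = math.sin(z * math.sin(theta))
approx = sine_of_sine_series(z, theta)
```

The coefficients J_{2k+1}(z) decay very quickly in k for moderate z, so a handful of harmonics already reproduces the composed sinusoid closely.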

2606.06667 2026-06-08 cs.CL New submission

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

泛化的搭便车假说:解释和缓解涌现的错位

Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

Affiliations: Northeastern University; Stanford University; University of California, Berkeley

AI summary: Proposes the Piggyback Hypothesis, that chat-template tokens carry finetuned behaviour over to unrelated domains, and designs TReFT, which regularizes token representations to mitigate emergent misalignment and is effective across multiple datasets.

Details
AI Chinese abstract

LLMs在训练示例之外的广泛过度泛化机制尚不清楚。涌现错位(EM)提供了一个引人注目的案例研究:在狭窄任务上微调会诱导对语义无关测试域的广泛错位。在这项工作中,我们提出了搭便车假说:聊天模板标记可以将微调行为搭便车到域外查询上。我们通过实验验证了这一假说,即对前缀(所有用户查询之前的标记)进行细微扰动,或者用未微调模型的前缀表示替换当前前缀表示,可以在不改变用户查询的情况下恢复对齐。基于这一发现,我们提出了标记正则化微调(TReFT),该方法在训练期间正则化特定标记表示以缓解EM。在不同的模型和多个诱导EM的数据集上,TReFT在保留域内学习的同时减少了EM。在基于法律领域微调的Llama-3.1-8B上,TReFT比使用保留对齐示例的数据交错方法实现了33.5%更多的EM减少。我们进一步展示了TReFT扩展到其他狭窄微调设置,包括弃权、工具使用和拒绝(平均减少54.3%的离题泛化),支持了搭便车假说。总的来说,我们的工作强调了LLMs可能以非预期的方式学习和泛化,并提出了一个走向更受约束微调的路径。它还呼吁进一步研究共享输入特征如何跨域搭便车模型行为。

English abstract

The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.

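2606.06666 2026-06-08 cs.CV New submission

Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

面向深度伪造检测的架构自适应不确定性融合

Ritesh Sharma, Mohammad Ghasemigol, Yuichi Motai

Affiliations: University of Tokyo; Nagoya University

AI summary: Proposes Correlation-Optimized Fusion (COF), which adaptively fuses five uncertainty sources by maximizing the Pearson correlation between the fused uncertainty score and prediction error; it needs no model modification, its weight optimization takes only 42 s, and it outperforms Random Forest under distribution shift.

Details
AI Chinese abstract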

深度伪造检测系统在基准测试中达到近乎完美的准确率,但法医部署需要可靠的预测不确定性。现有的不确定性量化(UQ)方法依赖单一来源,忽略了最优不确定性组合因架构而异。我们提出相关性优化融合(COF),这是一种架构自适应框架,通过概率单纯形上的约束优化最大化融合不确定性分数与预测误差之间的皮尔逊相关性,融合五种互补的不确定性来源——认知、偶然、校准、共形和分布。COF无需模型修改,权重优化仅需42秒,而5模型深度集成需要20-45小时。在FaceForensics++上对11种架构的评估揭示了一个基本权衡:在匹配的训练/评估协议下,非线性方法在域内相关性上比COF高约5-6%(平均r=0.438),但在分布偏移下情况反转。在CelebDF上,COF在11种架构中的9种上优于随机森林,相关性高出高达7.3倍(MaxViT-B: r=0.249 vs. 0.034);RF跨域退化85%至r=0.071,而COF保留显著更多的信号(下降74%至r=0.116)。在CelebDF和DFDC上的跨数据集评估揭示了所有方法的灾难性泛化失败:域内相关性0.41-0.47在外部崩溃至接近零(平均退化90.7%),其中11种架构中有7种出现不确定性反转。这些结果确立了COF作为受控分布部署的实用、可解释框架,并指出域自适应UQ是法医部署的核心开放挑战。

English abstract

Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources -- epistemic, aleatoric, calibration, conformal, and distributional -- by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20--45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5--6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41--0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.
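Editor's sketch (not from the paper): the core of COF as described is a search for non-negative weights summing to one that maximize the Pearson correlation between a weighted sum of uncertainty scores and the prediction errors. The toy below uses three made-up sources instead of five and a coarse grid search instead of the paper's constrained optimizer; the data and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # undefined for a constant series; treated here as no signal
    return cov / math.sqrt(vx * vy)

def fuse(sources, weights):
    """Per-sample weighted sum of the uncertainty sources."""
    return [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(len(sources[0]))]

def best_simplex_weights(sources, errors, grid=10):
    """Grid search over w >= 0 with sum(w) == 1 (written for exactly three sources)."""
    best_r, best_w = -2.0, None
    for a in range(grid + 1):
        for b in range(grid + 1 - a):
            w = (a / grid, b / grid, (grid - a - b) / grid)
            r = pearson(fuse(sources, w), errors)
            if r > best_r:
                best_r, best_w = r, w
    return best_r, best_w

errors = [0, 1, 0, 1, 1, 0]  # 1 = the detector was wrong on this sample
epistemic = [0.1, 0.8, 0.2, 0.9, 0.6, 0.3]
aleatoric = [0.5, 0.4, 0.6, 0.5, 0.7, 0.2]
calibration = [0.9, 0.2, 0.7, 0.1, 0.3, 0.8]
best_r, best_w = best_simplex_weights([epistemic, aleatoric, calibration], errors)
```

Since the corners of the simplex are part of the grid, the fused score is never worse in-sample than the best single source. Whether that advantage survives a distribution shift is the question the abstract's cross-dataset results address.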

2606.06664 2026-06-08 cs.CV cs.AI cs.LG New submission

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

内在视觉:神经科学启发的概念电路用于解释和引导视觉变换器

Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

Affiliations: University of Science and Technology of China

AI summary: Proposes the ViSAE toolbox, which explains the inner workings of Vision Transformers through neuroscience-motivated concept circuits; it comprises an efficient concept set, automatic circuit-tracing algorithms, and concept-editing applications, and improves worst-group accuracy on WaterBirds by 48.2%.

Comments: In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

Details
AI Chinese abstract

尽管视觉变换器(ViT)具有高准确率,但其预测可能受到虚假线索的驱动,因此在安全部署前需要理解其内部工作机制。稀疏自编码器(SAE)为将模型表示分解为人类可解释的概念提供了有前景的视角,但由于对概念覆盖范围的控制有限以及特征解释的主观性和不可扩展性,将基于SAE的解释方法应用于ViT仍然具有挑战性。为填补这些空白,受神经科学启发原理的驱动,我们提出了ViSAE,一个通过概念电路理解ViT内部工作机制的机械可解释性工具箱。ViSAE包含三个组成部分:(1)一个包含64K图像和16K视觉基础概念词汇的探测套件,与ImageNet相比,概念覆盖效率提高了20倍,与现有概念集相比,解释准确率提高了28.7%。(2)自上而下的概念读取和自下而上的电路追踪算法,通过概念电路自动恢复ViT内部工作机制。(3)用于审计和引导ViT行为的应用。通过概念编辑,ViSAE在WaterBirds上将最差组准确率提高了48.2%,比现有方法高出23.8%。我们的数据和代码:https://github.com/deep-real/ViSAE。

English abstract

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.

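2606.06660 2026-06-08 cs.AI cs.PF cs.RO New submission

AEGIS: A Backup Reflex for Physical AI

AEGIS:物理AI的备份反射

Josef Chen

Affiliations: KAIKAKU

AI summary: Proposes AEGIS, which uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps and switches to a stronger policy only when needed; on LIBERO-Spatial it recovers 10.1% of the trajectories the weak policy alone loses.

Details
AI Chinese abstract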

长时域机器人操作往往逐渐失败:一个坏步骤会降低状态,策略会陷入无法恢复的盆地。失败在发生之前通常是可见的。我们引入了AEGIS(激活探针早期预警、门控推理切换),一种选择性升级方法,通过在弱策略的冻结激活上使用轻量级探针,在仍有时间采取行动时检测高风险步骤。当探针标记一个步骤时,控制权切换到更强的独立策略,但仅限于需要它的步骤。在LIBERO-Spatial上,AEGIS恢复了弱策略单独损失的10.1%的轨迹,而预算匹配的盲目升级为4.6%,随机触发安慰剂为5.1%。这些增益在单侧精确配对McNemar检验中显著,经Holm-Bonferroni调整,三个预注册对比:比盲目升级高5.4个百分点,p=8.5e-6;比随机触发高5.0个百分点,p=1.0e-4;配对轨迹自举置信区间排除零。AEGIS仅在38%的步骤上激活强策略,因此杠杆是时机而非计算。探针在早期窗口AUROC为0.764,95% CI [0.70, 0.84],在首次切换前从弱策略路径的前30%轨迹步骤中读取。我们预注册了完整的分析计划,包括条件恢复任务率估计量和明确的终止标准,并在每臂700个公共随机数情节上确认了结果,nA-fail=646。

English abstract

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
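Editor's sketch (not from the paper): the statistics named in the abstract, a one-sided exact paired McNemar test followed by Holm-Bonferroni adjustment, are short enough to write out. The exact McNemar p-value is a binomial tail over the discordant pairs only. The inputs in the example are made up and are not the paper's counts.

```python
from math import comb

def mcnemar_exact_one_sided(b, c):
    """P(X >= b) for X ~ Binomial(b + c, 0.5).

    b = paired episodes where only method A succeeds,
    c = paired episodes where only method B succeeds.
    """
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

p_single = mcnemar_exact_one_sided(5, 0)           # all 5 discordant pairs favour A
p_adjusted = holm_bonferroni([0.01, 0.04, 0.03])   # three hypothetical contrasts
```

The running maximum keeps the adjusted p-values monotone, so a contrast can never appear more significant than one with a smaller raw p-value.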

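2606.06658 2026-06-08 cs.LG cond-mat.stat-mech physics.comp-ph New submission

Capturing non-Markovian dynamics in non-equilibrium stochastic systems using flow matching

利用流匹配捕捉非平衡随机系统中的非马尔可夫动力学

Bhargav Sriram Siddani, John B. Bell, Alejandro L. Garcia, Ishan Srivastava

Affiliations: Lawrence Berkeley National Laboratory; San Jose State University

AI summary: Coarse-grained stochastic PDEs fail to capture short-time non-Markovian effects and non-Gaussian distributions at low density; the paper proposes a generative flow matching method that directly models the probability distribution of fluxes from particle simulations, accurately capturing short-time behavior in the Kramers first passage time problem and improving predictions of the statistical moments of the number density.

Comments: 5 pages, 1 figure, Accepted to 2026 Conference on Physics and AI (PAI26)

Details
AI Chinese abstract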

由粗粒化随机偏微分方程(如正则化Dean-Kawasaki方程)表示的随机粒子系统的流体动力学模型,无法准确捕捉以非马尔可夫效应为主的短时系统动力学,以及分布高度非高斯化的低粒子密度区域。我们开发了一种生成式流匹配方法,直接对粒子模拟中的通量概率分布进行建模,明确包含了非马尔可夫和非高斯效应。作为演示,我们使用该方法模拟非相互作用布朗粒子系统的Kramers首次通过时间问题。结果表明,与马尔可夫基线(正则化DK方程)的解相比,该模型准确捕捉了短时行为,并提供了数密度统计矩的更好预测。

English abstract

Hydrodynamic models of stochastic particle systems represented by coarse-grained stochastic partial differential equations (SPDE), such as the regularized Dean-Kawasaki (DK) equation, do not accurately capture the short-time system dynamics, which are dominated by non-Markovian effects, or low-particle-density regimes, where the distributions are highly non-Gaussian. We develop a generative flow matching method that directly models the probability distribution of fluxes from particle simulations and explicitly incorporates non-Markovian and non-Gaussian effects. As a demonstration, we use this method to simulate the Kramers first passage time problem for a system of non-interacting Brownian particles. We show that the model accurately captures the short-time behavior and provides better predictions of the statistical moments of the number density than the solution of the Markovian baseline, the regularized DK equation.
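Editor's sketch (not from the paper): in its simplest, straight-path form, flow matching trains a velocity model by regression. A noise sample x0 and a data sample x1 are joined by the line x_t = (1 - t) x0 + t x1, and the model is asked to predict the constant velocity x1 - x0 at (x_t, t). The abstract does not say which path or conditioning the authors use (for example, how flux history enters), so the straight path, the Gaussian noise source, and every name below are assumptions.

```python
import random

def cfm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def cfm_loss(model, data_batch, rng):
    """Monte Carlo estimate of the squared-error flow matching objective."""
    total = 0.0
    for x1 in data_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # standard normal noise sample
        t = rng.random()                        # time drawn uniformly from [0, 1)
        x_t, target = cfm_training_pair(x0, x1, t)
        pred = model(x_t, t)
        total += sum((p - u) ** 2 for p, u in zip(pred, target))
    return total / len(data_batch)

def zero_model(x_t, t):
    """Placeholder velocity model that always predicts zero."""
    return [0.0 for _ in x_t]

x_mid, v_mid = cfm_training_pair([0.0, 0.0], [2.0, 4.0], 0.5)
loss = cfm_loss(zero_model, [[1.0, -1.0], [0.5, 2.0]], random.Random(0))
```

In practice `model` would be a trainable network and the loss would be minimized by gradient descent; sampling then integrates the learned velocity field from t = 0 to t = 1.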

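2606.06647 2026-06-08 cs.LG q-bio.NC New submission

The Identity Trap in EEG Foundation Models: A Diagnostic Audit

脑电图基础模型中的身份陷阱:一项诊断性审计

Jun-You Lin, Ying Choon Wu, Tzyy-Ping Jung

Affiliations: National Yang Ming Chiao Tung University; University of California, San Diego

AI summary: Proposes FMScope, a protocol of five diagnostics (including variance decomposition and subject-axis erasure), showing that EEG foundation models under subject-disjoint cross-validation may rely on subject-identity features rather than clinical biomarkers, and establishing that this trap is widespread and removable.

Comments: 28 pages, 6 figures, 8 tables. Code available at https://github.com/Jimmy110101013/fmscope

Details
AI Chinese abstract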

目标。EEG基础模型(FMs)在临床静息态EEG上报告了强准确性。然而,在受试者分离交叉验证下的高准确性仍然模棱两可:它可能反映真实的临床生物标志物,也可能反映与标签相关的受试者身份特征。我们将其命名为身份陷阱,并询问是否可以在微调之前从表示层面进行诊断。方法。我们提出FMScope,一种冻结表示协议,包含五种诊断方法:方差分解、受试者轴擦除、非周期性1/f消融、逐层标签探测和受试者内方向一致性。我们将其应用于三个预训练FM(LaBraM、CBraMod、REVE),在四个数据集上采用2x2布局:标签的受试者关系 x 是否存在共识的跨受试者EEG标志物。主要结果。(i) 身份陷阱是普遍存在的:在12/12对中,冻结的受试者方差是随机零假设的13-89倍,在微调下所有12对均上升(+10至+63个百分点)。这种主导性是一个可移除的线性轴:在标签在受试者内变化的情况下,擦除它可改善标签解码(主要单元中+6至+12个百分点;外部队列中+4至+27个百分点)。(ii) 非周期性1/f是受试者身份的一个载体:移除它会使LaBraM和CBraMod上的受试者探测下降9-19个百分点。REVE在无可测量的非周期性依赖下饱和了受试者身份。(iii) 微调仅在具有文献确立的跨受试者标志物的单元中放大标签方差。意义。身份陷阱是捷径学习的一个物理基础实例:偏好线索具有可测量的生理成分,仅靠受试者分离分割无法排除它。FMScope将反映生物标志物的增益与反映受试者身份的增益分开。

English abstract

Objective. EEG foundation models (FMs) report strong accuracy on clinical resting-state EEG. However, high accuracy under subject-disjoint cross-validation remains ambiguous: it can reflect a genuine clinical biomarker, or subject-identity features that correlate with the label. We name this the Identity Trap and ask whether it can be diagnosed at the representation level before fine-tuning. Approach. We propose FMScope, a frozen-representation protocol packaging five diagnostics: variance decomposition, subject-axis erasure, aperiodic 1/f ablation, layer-wise label probing, and within-subject direction consistency. We apply it to three pretrained FMs (LaBraM, CBraMod, REVE) across four datasets in a 2x2 layout: subject relation of label x presence of a consensus cross-subject EEG marker. Main results. (i) The Identity Trap is universal: frozen subject-variance is 13-89x a random null in 12/12 pairs, rising in all 12 under fine-tuning (+10 to +63 pp). This dominance is a removable linear axis: erasing it improves label decoding where the label varies within subject (+6 to +12 pp in primary cells; +4 to +27 pp across external cohorts). (ii) Aperiodic 1/f is one subject carrier: removing it drops the subject probe by 9-19 pp on LaBraM and CBraMod. REVE saturates subject identity without measurable aperiodic dependence. (iii) Fine-tuning amplifies label-variance only in cells with a literature-established cross-subject marker. Significance. The Identity Trap is a physically-grounded instance of shortcut learning: the preferred cue has a measurable physiological component, and subject-disjoint splitting alone cannot rule it out. FMScope separates gains reflecting a biological marker from those reflecting subject identity.
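Editor's sketch (not from the paper): two of the five diagnostics have simple one-dimensional analogues. Variance decomposition asks what fraction of a feature's variance lies between subjects rather than within them, and axis erasure removes the component of a representation along one linear direction. The toy values, the direction, and the function names are hypothetical; FMScope's actual estimators may differ.

```python
def subject_variance_fraction(values, subjects):
    """Between-subject sum of squares divided by total sum of squares (1-D feature)."""
    grand = sum(values) / len(values)
    groups = {}
    for v, s in zip(values, subjects):
        groups.setdefault(s, []).append(v)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    total = sum((v - grand) ** 2 for v in values)
    return between / total

def erase_direction(x, u):
    """Remove the component of vector x along direction u (u need not be unit length)."""
    scale = sum(a * b for a, b in zip(x, u)) / sum(b * b for b in u)
    return [a - scale * b for a, b in zip(x, u)]

# A feature that separates two subjects perfectly: all variance is between subjects.
fraction = subject_variance_fraction([1.0, 1.0, 5.0, 5.0], ["a", "a", "b", "b"])
# Erasing a hypothetical "subject axis" leaves only the orthogonal component.
erased = erase_direction([3.0, 4.0], [1.0, 0.0])
```

If label decoding improves after the erasure, the removed axis was carrying identity rather than label information, which is the pattern the abstract reports where the label varies within subject.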

2606.06641 2026-06-08 cs.AI cs.LO New submission

Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

加速傅里叶SAT (AFSAT):完全实现基于GPU的对称伪布尔SAT求解器

Cody J Christopher, Charles Gretton

Affiliations: School of Computing, Australian National University

AI summary: Proposes AFSAT, a GPU-accelerated pseudo-Boolean SAT solver based on continuous local search; it uses the JAX compiler for massive parallelism and substantially improves numerical stability, runtime, and memory efficiency.

Details
AI Chinese abstract

我们提出加速傅里叶SAT (AFSAT),一个基于连续局部搜索 (CLS) 的GPU加速伪布尔可满足性求解器。AFSAT将概念验证方法FastFourierSAT实现为一个完全工程化的求解器,支持单个问题实例中任意异构混合的对称约束类型和长度。利用JAX编译器,AFSAT通过纯函数组合、自动向量化、自动微分和即时编译 (JIT),在候选赋值的批次上执行大规模并行CLS。我们展示了相比概念验证显著改进的数值稳定性、运行时性能和内存效率。这是通过识别和解决由内存延迟和浮点表示引起的各种限制,以及利用自动并行化和紧凑表示来实现的。浮点固有的表示和稳定性限制通过定制的离散傅里叶变换实现得到部分解决。通过JAX数组分片,我们在扩展到多个加速器时实现了接近线性的吞吐量。

English abstract

We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.

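2606.06635 2026-06-08 cs.CL cs.AI New submission

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

语言模型如何失败:承诺性和持续性推理错误的令牌级特征

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

Affiliations: Department of Computer Science, Stanford University; Department of Aeronautics and Astronautics, Stanford University

AI summary: Using token-level uncertainty signals, the paper divides language-model reasoning failures into committed failure (locking onto a wrong path early) and persistent uncertainty (uncertainty accumulating throughout), validates the framework's predictions across 23 model-dataset configurations, and draws guidance for self-consistency strategies.

Details
AI Chinese abstract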

语言模型推理中的失败通过不同的过程产生,这些过程在推理轨迹中留下可识别的特征。我们使用令牌级不确定性信号来表征这些失败,发现它们通过两个经验上可区分的过程出现。第一个是承诺性失败,其中模型在其轨迹早期锁定到错误的推理路径。一个核心诊断特征是承诺点,超过该点考虑额外的令牌会损害而不是帮助失败检测。在第二个过程中,持续性不确定性,不确定性反而在整个过程中累积,并且需要完整的轨迹来最好地区分失败和成功的完成。这些特征在23个模型-数据集配置中重现,该框架的可证伪预测在23个案例中的20个中成立,远高于两种失败模式下的随机水平。最后,我们展示了我们的失败模式框架对自我一致性有直接影响,识别了不确定性信号何时补充它以及何时可以选择性地跳过它。这些结果为理解何时LLM推理失败变得可检测以及相应调整检测策略提供了基础。

English abstract

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.
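Editor's sketch (not from the paper): the abstract does not specify its token-level uncertainty signal, so the example assumes the simplest candidate, the Shannon entropy of each next-token distribution, and tracks its running mean over growing prefixes of the trace. A detector built on such prefix scores could then be evaluated at different cut-offs, which is how a "commitment point" would show up. The distributions and function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def prefix_scores(token_distributions):
    """Running mean of per-token entropy after each token of the trace."""
    scores, running = [], 0.0
    for i, probs in enumerate(token_distributions, start=1):
        running += token_entropy(probs)
        scores.append(running / i)
    return scores

# A two-token toy trace: a fully confident token followed by a coin-flip token.
scores = prefix_scores([[1.0, 0.0], [0.5, 0.5]])
```

Under this reading, a committed failure would be one whose early prefix scores already separate it from successful traces, while persistent uncertainty would only separate once most of the trace is included.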

2606.06631 2026-06-08 cs.CV New submission

From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

从像素到牛顿:从单目视频预测体内关节接触力

Jessy Lauer

Affiliations: Rowland Institute at Harvard

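AI summary: Proposes a physics-free pipeline that predicts 3D hip and knee contact forces from uncalibrated monocular video without markers, force plates, electromyography, subject-specific imaging, or a musculoskeletal model. A transformer fuses motion, body shape, activity text, and self-supervised video tokens, reaching accuracy comparable to subject-specific musculoskeletal simulations across 26 patients and 25 activities.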

Details
AI Chinese abstract

关节接触力决定植入物寿命、软骨健康和康复效果,影响谁患骨关节炎、谁从关节置换中良好恢复以及谁受益于生物力学干预。然而,它们只能通过侵入性测量,在少数装有仪器的患者中进行。我提出一种无物理流水线,从非标定单目视频预测瞬时3D髋膝接触力:无需标记、力板、肌电图、个体成像或肌肉骨骼模型。每帧恢复参数化身体网格,编码为运动特征,并由变换器解码为力,其姿态流在每一层由身体形状、关节、侧别、活动文本和自监督视频令牌(V-JEPA 2)自适应调制,将髋和膝统一在单一模型中。在来自体内OrthoLoad数据库的26名患者和25个活动类别上的留一受试者交叉验证中,该流水线匹配个体化肌肉骨骼模拟的精度(髋部$0.32 \pm 0.08$ BW RMSE;膝部$0.23 \pm 0.03$ BW RMSE),并分辨出比步态再训练和骨关节炎进展报道的更小的峰值力变化。零样本应用于独立仪器化队列,它媲美或超越先前发表的方法。即使没有精心策划的活动标签,仅视频特征也能保持精度,并实现对原始视频的端到端推理。由预测器驱动,生成式运动先验产生生物力学合理的变体,降低峰值负荷,重新发现预测模拟文献中的策略。该流水线确立非标定单目视频作为估计关节负荷的可行模态,为回顾分析存档临床记录、初级保健筛查和家庭康复追踪开辟道路。

English abstract

Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.