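arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.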

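2606.14612 2026-06-15 cs.SD cs.AI eess.AS New submission

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

Chen Ying Claude, Zhihan Luo

Affiliations: Claude Code / Opus 4.6; API / Fable 5; Independent researcher

AI summary: A computational analysis of the score of Beethoven's "Moonlight Sonata" finds that its three movements correspond to three distinct machine learning architectures and reports four counterintuitive findings, including that musical "temperature" is governed by throughput and that the lightest movement has the highest dissonance.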

Abstract

We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
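Two of the score statistics named in the abstract, entropy and Jensen-Shannon divergence, can be illustrated with a minimal sketch. This is not the authors' code: the 12-bin pitch-class representation, the function names, and the toy note sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def distribution(pitch_classes):
    """Return a 12-bin probability distribution over pitch classes 0..11."""
    counts = Counter(pc % 12 for pc in pitch_classes)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]


def entropy(p):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical inputs, 1 for disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Hypothetical toy pitch-class sequences (not taken from the actual score).
movement_a = [1, 4, 8, 1, 4, 8, 1, 4, 8, 1, 4, 8]  # a repeated three-note arpeggio
movement_b = [1, 3, 5, 6, 8, 10, 0, 1]              # scale-like material

p, q = distribution(movement_a), distribution(movement_b)
print("entropy A (bits):", round(entropy(p), 3))
print("JSD A vs B (bits):", round(js_divergence(p, q), 3))
```

Both quantities depend only on how often each pitch class occurs, not on note order, which is the kind of distributional information the abstract contrasts with sequential ordering.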


2606.14609 2026-06-15 cs.RO New submission

Safe Reinforcement Learning of Autonomous Highway Driving: A Unified Framework for Safety and Efficiency

Chufei Yan, Zhihao Cui, Yiyan Lv, Taojie Chen, Ning Bian, Yulei Wang

Affiliations: School of Physics, Northeast Normal University; Clean Energy Automotive Engineering Center, School of Automotive Studies, Tongji University; Mengshi Automobile Technology Company, Dongfeng Motor Corporation

AI summary: Proposes the MoE-RM-SRL framework, which combines safe distance, reward machines, and a mixture-of-experts mechanism to secure both safety and efficiency during training and deployment, and outperforms existing methods in experiments on CARLA and a VR platform.

Comments: 20 pages, 5 figures, 7 tables. Preprint version

Abstract

Deep reinforcement learning (DRL) offers a compelling route to decision-making for advanced autonomous vehicles (AVs), yet its trial-and-error nature makes it difficult to guarantee safety during training and to achieve both safety and efficiency at deployment. We propose a unified safe reinforcement learning (SRL) framework that integrates safe distance (SD), reward machines (RM), and mixture-of-experts (MoE), termed MoE-RM-SRL. For deployment, SD and RM jointly shape a rule-aware reward that encodes highway traffic regulations and stage-wise objectives, enabling safe and reliable behavior without sacrificing efficiency. For training, we introduce a sparsely gated MoE layer comprising up to 11 deep Q-networks (DQNs); an SD-based gating rule activates a minimal set of experts for lane-keeping and lane-changing, mitigating the instability, discontinuities, and impulsive transients commonly induced by switching between heterogeneous controllers (e.g., MPC/rule-based modules and learned policies). We implement the proposed architecture in CARLA and integrate it with a 6-DoF driver-in-the-loop virtual-reality (DiL-VR) platform. Experiments in stochastic two-lane traffic show that MoE-RM-SRL substantially improves safety and efficiency over state-of-the-art baselines, and the framework naturally extends to multi-lane driving as well as on-ramp merging and exiting scenarios.


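2606.14608 2026-06-15 cs.LG cs.AI New submission

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

Farica Zhuang, Zixuan Wen, Christos Davatzikos, Li Shen

Affiliations: University of Pennsylvania

AI summary: Proposes AdaCSM, a mixture-of-experts-based adaptive deep clustering survival framework whose routing-based expert mechanism enables conditional specialization, dynamically assigning patients to specialized risk predictors and improving survival prediction performance and interpretability.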

DOI: 10.1145/3807503.3819574

Abstract

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.


2606.14604 2026-06-15 cs.LG cs.AI New submission

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios

Affiliations: KIOS Research and Innovation Center of Excellence, University of Cyprus; Department of Electrical and Computer Engineering, University of Cyprus

AI summary: This study systematically compares six deep learning architectures, two zero-shot foundation models, and statistical baselines on three public datasets for behavioural forecasting over 1-8 day horizons. It finds that PatchTST performs best, that the foundation model TimesFM can match trained models in low-data regimes, and that participant-level fine-tuning reduces RMSE by 16-60%.

Abstract

Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning, and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FM), and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time, and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates: PatchTST leads among trained models, while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes; and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability, and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs, and personalisation for multi-horizon behavioural forecasting from wearables.


2606.14601 2026-06-15 cs.LG cs.SY eess.SY math.OC stat.CO New submission

A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems

Shadi Heenatigala, Hasanika Samarasinghe

Affiliations: Antioch College; The Open University of Sri Lanka

AI summary: Proposes a statistical and machine learning framework that characterizes a hydrogen multi-energy system from one year of high-resolution operational data, uses statistical analysis and Random Forest to reveal non-linear dynamics, and applies reinforcement learning to optimize dispatch.

Comments: 17 pages, 12 figures

Abstract

This study presents a statistical and machine learning framework for characterizing a hydrogen-based multi-energy system (H-MES) using one year of high-resolution operational data. Statistical analysis revealed a binary operation driven by renewable surplus, with solar irradiance explaining 45.7% of rank-based variance in hydrogen production, a large effect by conventional standards. Only high-irradiance periods triggered meaningful electrolyzer engagement, while electricity demand exerted a weaker inverse suppression effect ($ε^2 = 0.126$). Multiple regression confirmed electrolyzer power as the dominant linear predictor, with a synergistic solar-wind interaction. Notably, Random Forest analysis ranked wind output first in predictive importance despite its weak bivariate correlation (r = 0.167), revealing non-linear dynamics invisible to parametric methods. A sequence model exploited strong 24-hour autocorrelation (r = 0.845) for operational forecasting, while a reinforcement learning agent optimized hydrogen revenue dispatch. The core contribution is demonstrating that statistical and machine learning approaches are complementary for H-MES modeling and control.
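The 24-hour autocorrelation mentioned in the abstract is the correlation between an hourly series and the same series shifted by one day. The sketch below shows one way to compute it, assuming a plain Pearson correlation between the series and its lagged copy. The synthetic signal and the function name are hypothetical and are not the paper's data or code.

```python
import math


def autocorrelation(series, lag):
    """Pearson correlation between series[t] and series[t + lag]."""
    x, y = series[:-lag], series[lag:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Synthetic hourly signal with a clean daily cycle (hypothetical stand-in for
# solar-driven production: positive during half the day, zero otherwise).
hours = range(24 * 14)
signal = [max(0.0, math.sin(2 * math.pi * (h % 24) / 24)) for h in hours]

r24 = autocorrelation(signal, 24)  # same hour on the next day
r12 = autocorrelation(signal, 12)  # opposite phase of the daily cycle
print("lag 24:", round(r24, 3), "lag 12:", round(r12, 3))
```

A strongly positive value at lag 24 is what makes "same hour yesterday" a useful input for a sequence forecaster; real operational data would give a value below the idealized one produced by this noiseless signal.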


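2606.14600 2026-06-15 cs.CL New submission

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

Mateusz Winiarek, Maksymilian Bilski, Mateusz Jacniacki

Affiliations: Humalike Research

AI summary: Proposes the LoSoNA benchmark, which uses group-chat scenarios to test whether LLMs can infer unstated local social norms from conversation history and adapt to them, and evaluates several models under different prompting conditions.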

Abstract

Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remain mostly unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate a hidden local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Naive prompting remains limited for most models; explicit norm-aware prompting helps unevenly, with Gemini 3.1 Pro reaching 84.2% and Claude Fable 5 reaching 81.6%, while several other models show small gains or regressions. LoSoNA contributes to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and use them in a one-turn group-chat response.


2606.14598 2026-06-15 cs.LG New submission

Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

Ali Asaria, Tony Salomone, Deep Gandhi

Affiliations: Transformer Lab

AI summary: Addresses INT8 quantization being slower than FP8/NF4 on consumer Ampere GPUs by proposing a fused Triton INT8 GEMM kernel that uses the INT8 tensor cores directly, achieving a 2.8-4.2x speedup in Ideogram 4.0 and an end-to-end speedup of about 10%, and making 1024px feasible on a single card.

Abstract

Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to a software artifact: the production "INT8" forward quantizes weights and activations only to immediately dequantize them back to bf16 and run a bf16 matrix multiply, never engaging the GPU's INT8 tensor cores, so the hardware's compute advantage is left entirely unrealized. We close this gap with a single fused Triton INT8 GEMM (int8xint8->int32 on Ampere tensor cores, with per-token x per-channel dequantization and bias folded into the epilogue, autotuned per GEMM shape) dropped into the Ideogram 4.0 diffusion transformer's linear layers in place of the dequantize-to-bf16 path. In the kernel, the int8xint8->int32 accumulation is bit-exact against torch._int_mm and the dequantized output matches the reference at cosine similarity 1.0 with no NaNs, running 2.8-4.2x faster than bf16 per GEMM. End to end it delivers a ~1.1x (~9-10%) speedup at 768px, and at 1024px it generates an image in 156.5 s on a single RTX 3090, faster than the single-card NF4 (164.5 s) and FP8 (172.9 s) baselines, at no measurable quality cost on these point estimates (PickScore/CLIPScore). INT8 thus goes from the slowest variant to the fastest, and 1024px becomes single-GPU feasible. The primary speed criterion (beat FP8, by ~9.5%) is comfortably met; the NF4 margin (~4.9%, single-run n=4) is within run-to-run variance we did not quantify and is best read as consistent with meeting the stretch target. We close with an honest deployment map: the win is specific to consumer Ampere, and on A100 and B200 the same kernel loses to those cards' fast native bf16/FP8 paths.
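The arithmetic the abstract describes (integer accumulation followed by a single per-token x per-channel dequantization with bias in the epilogue) can be shown on the CPU in plain Python. This is only a numerical sketch of the idea, not the authors' Triton kernel: the symmetric max-based scaling, the function names, and the toy matrices are assumptions.

```python
def quantize_rows(matrix):
    """Symmetric int8 quantization with one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / 127 or 1.0  # fall back to 1.0 for an all-zero row
        scales.append(scale)
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
    return quantized, scales


def int8_linear(x, w, bias):
    """Compute x @ w.T + bias with integer accumulation and one dequantization step.

    x: [tokens][in_features] floats; w: [out_features][in_features] floats.
    """
    xq, x_scales = quantize_rows(x)  # one scale per token (activation row)
    wq, w_scales = quantize_rows(w)  # one scale per output channel (weight row)
    out = []
    for xi, sx in zip(xq, x_scales):
        row = []
        for wj, sw, b in zip(wq, w_scales, bias):
            acc = sum(a * c for a, c in zip(xi, wj))  # integer products, integer sum
            row.append(acc * sx * sw + b)             # epilogue: dequantize, add bias
        out.append(row)
    return out


def float_linear(x, w, bias):
    """Floating-point reference for comparison."""
    return [[sum(a * c for a, c in zip(xi, wj)) + b for wj, b in zip(w, bias)] for xi in x]


# Hypothetical toy inputs: 2 tokens, 3 input features, 2 output channels.
x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, 0.4, -0.6], [1.0, -0.5, 0.05]]
bias = [0.1, -0.2]

y_int8 = int8_linear(x, w, bias)
y_ref = float_linear(x, w, bias)
max_error = max(abs(a - b) for ra, rb in zip(y_int8, y_ref) for a, b in zip(ra, rb))
print("max abs error vs float reference:", max_error)
```

The point of the fused design is that the inner sum stays in integers until the very end; the slow path described in the abstract instead converts the quantized operands back to bf16 before multiplying, so the integer units are never used.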


2606.14597 2026-06-15 cs.LG New submission

Zero-shot generalization of transformer neural operators to larger domains

Armand de Villeroché, Sibo Cheng, Vincent Le Guen, Marc Bocquet, Rem-Sophia Mouradi, Patrick Armand, Alban Farchi, Patrick Massin

Affiliations: CEREA, ENPC, EDF R&D, Institut Polytechnique de Paris; SINCLAIR AI Laboratory; EDF R&D; CEA, DAM, DIF

AI summary: Proposes adding a decomposable locality bias to the attention logit computation which, combined with rotary positional embeddings, lets transformer neural operators generalize zero-shot to larger spatial domains; effectiveness is validated on PDE benchmarks and a 3D industrial flow.

Abstract

Transformer-based neural operators have shown remarkable performance for approximating solution operators of partial differential equations on complex geometries. However, existing approaches implicitly assume a fixed domain size, which limits their ability to generalize at inference. In this work, we investigate domain extension, namely zero-shot inference on spatial domains that are significantly larger than those encountered during training. We argue that this setting fundamentally requires spatial locality and translation equivariance. We propose to implement this locality via a decomposable bias in the attention logits computation, enabling finely controllable locality while remaining fully decomposable into query-key inner products and directly compatible with optimized attention kernels. Combined with rotary positional embeddings, it enables expressive embeddings with controllable spatial support without altering the transformer architecture. We empirically show that our approach substantially improves zero-shot generalization to larger domains across two PDE benchmarks and a 3D industrial atmospheric flow application. Our code and datasets are available at https://github.com/cerea-daml/domain-extension.
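To see how a locality bias can remain "fully decomposable into query-key inner products", consider one classical example: a squared-distance penalty, which expands as -a|p_i - p_j|^2 = 2a(p_i . p_j) - a|p_i|^2 - a|p_j|^2 and can therefore be absorbed by appending a few extra coordinates to each query and key. The abstract does not state which bias the authors use, so the quadratic form, the parameter `alpha`, and all function names below are assumptions for illustration only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def augment_query(q, pos, alpha):
    """Append position-dependent coordinates to a query vector."""
    return q + [math.sqrt(2 * alpha) * c for c in pos] + [-alpha * dot(pos, pos), 1.0]


def augment_key(k, pos, alpha):
    """Append the matching coordinates to a key vector."""
    return k + [math.sqrt(2 * alpha) * c for c in pos] + [1.0, -alpha * dot(pos, pos)]


def biased_logit(q, k, pos_q, pos_k, alpha):
    """Reference logit: q.k minus alpha times the squared distance between positions."""
    diff = [a - b for a, b in zip(pos_q, pos_k)]
    return dot(q, k) - alpha * dot(diff, diff)


# Hypothetical toy values: 2-dimensional features, 3-dimensional positions.
q, k = [0.3, -1.2], [0.7, 0.5]
pos_q, pos_k = [1.0, 2.0, 0.5], [1.5, 1.0, 0.0]
alpha = 0.8

plain_inner_product = dot(augment_query(q, pos_q, alpha), augment_key(k, pos_k, alpha))
reference = biased_logit(q, k, pos_q, pos_k, alpha)
print(plain_inner_product, reference)
```

Because the biased logit is an ordinary inner product of the augmented vectors, it can be fed to an unmodified attention kernel, and because it depends only on the difference of positions it is unchanged when the whole domain is translated.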


2606.14591 2026-06-15 cs.SD cs.AI New submission

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu

Affiliations: College of Computer Science and Technology, National University of Defense Technology; Korea Advanced Institute of Science and Technology (KAIST); Shanghai Jiaotong University

AI summary: To address redundancy in existing audio-language datasets that degrades post-training, proposes a data construction pipeline based on acoustic-similarity deduplication and builds AudioDER, a reasoning-oriented dataset of 191k samples that significantly improves LALM performance on multiple audio reasoning benchmarks.

Abstract

Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
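The deduplication step can be pictured as a greedy filter over per-clip embeddings. The abstract does not specify the embedding model, similarity measure, or threshold, so the cosine similarity, the 0.95 cutoff, and the toy vectors in this sketch are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def deduplicate(embeddings, threshold=0.95):
    """Greedy pass: keep a clip only if it is not too similar to any clip kept so far."""
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(idx)
    return kept


# Hypothetical acoustic embeddings for four clips; clips 0/1 and 2/3 are near-duplicates.
clips = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
]

kept = deduplicate(clips)
print("kept clip indices:", kept)
```

This pairwise loop is quadratic in the number of kept clips, so a corpus at the scale described would more plausibly rely on an approximate nearest-neighbour index; the sketch only shows the selection rule.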


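2606.14586 2026-06-15 cs.CV New submission

S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning

Shilong Xiang, Zirui Zhang, Chengzhi Mao

Affiliations: Rutgers University

AI summary: Proposes the S$^2$COPE framework, which uses vision large language models in a self-supervised preference optimization loop to autonomously discover structured concepts without any labels, improving downstream classification accuracy across multiple domains.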

Abstract

Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S$^2$COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S$^2$COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S$^2$COPE successfully extracts domain-specific concepts that standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective -- rather than relying on static generation and disjoint filtering -- we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.


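2606.14585 2026-06-15 cs.RO cs.AI New submission

Sensitivity Shaping for Latent Modeling

Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao

Affiliations: University of California San Diego

AI summary: To address generative dynamics models being insufficiently sensitive for detecting policy-induced out-of-distribution (OOD) transitions, proposes support-conditioned control-sensitivity regularization, which strengthens the local response to control input changes; experiments show improved OOD detection and safer closed-loop planning.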

Abstract

Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.


2606.14582 2026-06-15 cs.AI New submission

A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems

Pollob Chandra Ray, Sabah Binte Noor, Fazlul Hasan Siddiqui

Affiliations: Dhaka University of Engineering & Technology

AI summary: Proposes a temporal planning framework that uses PDDL 2.1 to model gauge compatibility constraints and multiple disruption scenarios, generating conflict-free timestamped operational plans and reducing reliance on manual decision making.

Abstract

Efficient route optimization plays a vital role in ensuring both safety and punctuality in railway operations. It is particularly crucial in heterogeneous multi-gauge railway networks, where varying train speeds, stopping patterns, and infrastructure compatibility constraints increase coordination complexity. In single-track systems these challenges are further intensified because all trains share the same track and require frequent track switching. Stochastic disruption events, including blocked tracks, blocked trains, engine failures, and speed slowdowns, introduce additional unpredictability into operations and cause deviations from the timetable. However, existing studies predominantly focus on high-level timetabling and omit operational details such as track-switching coordination. As a result, decisions are left to human operators, which increases safety risks in railway operations. This study proposes a framework based on temporal planning for dynamic route optimization and disruption management in heterogeneous railway systems. The framework formulates railway operations as a temporal planning problem using PDDL 2.1, explicitly modeling gauge compatibility constraints and diverse disruption scenarios. It generates conflict-free timestamped operational plans specifying both optimized schedules and executable action sequences. To evaluate the proposed framework, we developed a benchmark problem set of 200 instances with up to 1,000 track points and 120 trains. Two state-of-the-art temporal planners and a plan validator were employed to assess the framework. The experimental results demonstrate that the framework effectively generates temporal operational plans for heterogeneous railway systems, handles multi-gauge constraints and disruptions, and reduces dependence on manual decision making.


2606.14581 2026-06-15 cs.LG cs.AI New submission

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi

Affiliations: University of Macau; University of Toronto; UCLA; Harvard University; XtalPi; McGill University

AI summary: Proposes the CARE framework, which uses an auditable intervention gate to keep a non-LLM optimizer as the default path while using LLMs to revise challenger ranking policies, significantly improving high-throughput experimentation optimization performance.

Comments: 23 pages, 4 figures

Abstract

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.


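2606.14580 2026-06-15 cs.CL New submission

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

Liancheng Gong, Zhiyang Wang, Yiwei Xu, Julia Mendelsohn

Affiliations: University of Maryland, College Park; New York University

AI summary: Proposes the Persuasion Index (PI), a 15-dimension taxonomy grounded in psychology and communication theory and implemented with 55 sub-features, and validates on four datasets that it can explain persuasion-related rhetorical patterns while providing a lightweight predictive signal.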

Abstract

Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

2606.14579 2026-06-15 cs.AI 新提交

VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA: 视图一致的自验证训练用于GUI定位

Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu

发表机构 * Zhejiang University（浙江大学）； Venus Team, Ant Group（蚂蚁集团金星团队）

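AI summary: Proposes the VISTA framework, which improves GRPO training through multi-view grouping and a self-verified anchor, significantly raising accuracy on GUI grounding tasks.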

AI中文摘要

当将组相对策略优化（GRPO）应用于GUI定位时，rollout从单个截图视图中采样；组在困难实例上往往全部失败，在简单实例上全部成功，无法产生有用的相对优势。我们提出VISTA（视图一致的自验证训练），一种基于GRPO的训练框架，通过从同一GUI页面的多个目标保持视图中构建每个比较组。每个视图通过裁剪生成，保持目标元素可见并精确重新映射其边界框，因此模型rollout在语义等价但几何不同的输入之间进行比较。为了稳定短坐标生成而不将强化学习转变为无条件模仿，VISTA进一步添加了一个自验证的跨视图锚点：一个使用优势加权损失优化的oracle答案，从组基线中排除，仅在模型产生最大奖励rollout时激活。在五个GUI定位基准和多个Qwen骨干网络上，VISTA一致提高了定位准确率。在ScreenSpot-Pro上，它将Qwen3-VL 4B/8B/30B-A3B从55.5/52.7/53.7提升到63.4/65.8/67.0。鲁棒性分析进一步显示了更高的最差视图准确率和更低的预测翻转率。

英文摘要

When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
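
The degenerate-group problem described in this abstract can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the `eps` constant, and the toy rewards are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (assumed GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard instance: every rollout misses the target, so no rollout stands out.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Easy instance: every rollout hits the target, with the same outcome.
all_success = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed group: successes get positive advantages, failures negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print(all_fail, all_success, mixed)
```

When every reward in a group is identical, all advantages are zero and the group contributes no policy-gradient signal. The abstract's multi-view grouping is aimed at making mixed groups more likely.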

2606.14574 2026-06-15 cs.CL cs.AI 新提交

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

SIMMER: 基于世界模型评估LLM可执行规划中的潜在故障

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

发表机构 * The Pennsylvania State University（宾夕法尼亚州立大学）

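AI summary: Introduces the SIMMER benchmark, which evaluates latent failures in LLM planning through a human-curated symbolic world model of the kitchen domain. Experiments find that frontier models produce at most 17% error-free plans and that up to 56% of plans contain latent failures, most of them irreversible; counterfactual foresight simulation reduces latent failures by up to 72% and irreversible cases by up to 75%.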

AI中文摘要

大型语言模型（LLM）越来越多地被部署为家庭环境中自主代理的规划器。虽然现有基准评估LLM生成的计划是否成功执行，但它们忽略了一种关键类型的故障：潜在故障。与立即故障（在执行时触发即时反馈并允许及时纠正）不同，潜在故障不会立即停止计划执行，而是悄无声息地损害目标实现。在严重情况下，它们会导致不可逆的损害。为弥补这一空白，我们引入了SIMMER，这是一个通过人工策划的、基于厨房领域的符号世界模型来评估LLM规划中潜在故障的基准。SIMMER定义了一个世界模型，包含77个动作、262个独特对象和约46,800种语义真实的可能交互，这些交互源自真实世界的烹饪脚本。然后，它利用一个状态机执行器，根据世界模型验证计划，并检测即时前提违规、潜在危险和不可逆故障。在六个LLM上的实验表明，即使是最前沿的模型，其无错误计划最多也只有17%。此外，高达56%的计划包含潜在故障，其中大多数导致不可逆后果。我们进一步证明，通过反事实预演进行显式状态推理可以将潜在故障减少高达72%，不可逆案例减少高达75%，这为更鲁棒的LLM规划器指明了一个有前景的方向。

英文摘要

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.

2606.14562 2026-06-15 cs.CV cs.LG 新提交

NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

NEST3D：织布鸟树巢的高分辨率多模态数据集

Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot, Devis Tuia, Friedrich Fedor Reinhard, Fabio Remondino, Benjamin Risse

发表机构 * Institute for Geoinformatics (ifgi), University of Münster（明斯特大学地理信息学研究所）； École Polytechnique Fédérale de Lausanne (EPFL)（洛桑联邦理工学院）； Max Planck Institute of Animal Behavior（马克斯·普朗克动物行为研究所）； University of Konstanz（康斯坦茨大学）； Kuzikus Research Station（库兹库斯研究站）； Fondazione Bruno Kessler (FBK)（布鲁诺·凯斯勒基金会）

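AI summary: To address the lack of fine-grained 3D structural data on sociable weaver nests, presents a 1.4 TB multimodal drone dataset covering 104 nest-bearing trees and benchmarks semantic segmentation methods, with PT-v3 reaching 86.35% mIoU.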

Comments 14 pages, 4 figures. Dataset available at https://huggingface.co/NEST3D

AI中文摘要

织布鸟巢作为复杂的生态结构，提供体温调节微栖息地并维持多种物种；然而，先前研究使用的数据集缺乏精细的3D结构细节。由于巢穴的不规则几何形状以及与复杂宿主植被的整合，生成可用且准确的3D织布鸟巢数据具有挑战性。我们通过一个开放获取的1.4TB多模态无人机数据集（包含104棵巢树，共27,945张RGB图像、111,780张多光谱图像、约7.81亿个3D点以及专家标注的语义分割标签）弥合了这一差距。我们使用KPConv、RandLA-Net和Point Transformer V3对语义分割进行基准测试，其中PT-v3在测试集上达到了86.35%的mIoU。虽然结果展示了基于Transformer和逐点方法的强大性能，但也凸显了架构相关的挑战，特别是对于基于卷积的方法（如KPConv）。通过独特地结合光谱、空间和结构信息，所提出的数据集推动了3D重建、分割和分类算法的发展，实现了从巢穴体积估计到物种保护等生态应用，并作为一个要求严格的基准，揭示了在极端类别不平衡下与架构相关的性能差异。

英文摘要

Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.

2606.14561 2026-06-15 cs.RO cs.LG 新提交

ORCA: A Platform for Open-Source Dexterity Research

ORCA: 开源灵巧性研究平台

Francesco Capuano, Maximilian Eberlein, Fabrice Bourquin, Clemens Claudio Christoph

发表机构 * University of Oxford（牛津大学）； ETH Zurich（苏黎世联邦理工学院）； Orca Dexterity

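AI summary: Introduces the ORCA learning stack, which unifies dexterous-hand control, simulation, teleoperation, and retargeting and integrates with robot-learning frameworks to support end-to-end dexterous manipulation research.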

Comments 15 pages

AI中文摘要

机器人操作研究越来越关注两指平行夹爪，因其有效性、经济性和易于遥操作。然而，夹爪受限于其外形因素，即使对于简单的重新定向任务，也常常需要双臂设置。拟人手是灵巧机器人学习的更自然平台——更接近人手，能够从人类视频中学习——但它们在学习研究中仍然难以使用：即使存在开放且可访问的手部硬件，用于控制、仿真、遥操作和重定向的软件也分散在零散的代码库中，并且与机器人学习生态系统基本脱节。在这项工作中，我们介绍了ORCA学习栈，这是一个将灵巧性作为第一类机器人学习领域的开源研究栈。我们的ORCA栈将低级控制、仿真、来自一系列消费平台的遥操作以及手部重定向统一在单个接口后面，并原生集成流行的机器人学习框架（如LeRobot），使灵巧手研究人员能够利用与非灵巧机器人学习相同的数据、训练和评估流程。我们展示了一个完整的端到端工作流程，通过使用消费级VR头显进行遥操作收集手内重新定向任务的专家演示，使用LeRobot训练自主策略，并在完全可重现和可观察的设置中评估学习到的策略。我们将整个栈开源，作为灵巧操作研究的共享、可重现基础。

英文摘要

Robotics manipulation research increasingly focuses on two-finger parallel grippers for their effectiveness, affordability, and ease of teleoperation. Grippers are nonetheless limited by their form factor, often requiring bimanual setups even for simple reorientation tasks. Anthropomorphic hands are a more natural platform for dexterous robot learning -- closer to the human hand, and capable of learning from human video -- yet they remain hard to use in learning research: even where open and accessible hand hardware exists, the software for control, simulation, teleoperation, and retargeting is scattered in one-off code bases, and largely disconnected from the robot-learning ecosystem. In this work, we introduce the ORCA learning stack, an open-source research stack for dexterity as a first-class robot learning domain. Our ORCA stack unifies low-level control, simulation, teleoperation from a range of consumer platforms, and hand retargeting, behind a single interface, and integrates natively with popular robot-learning frameworks such as LeRobot, so dexterous hand researchers can leverage the same data, training, and evaluation pipelines used for non-dexterous robot learning. We demonstrate a complete end-to-end workflow, collecting expert demonstrations of an in-hand reorientation task by teleoperation with a consumer-grade VR headset, training an autonomous policy with LeRobot, and evaluating the learned policy in a fully reproducible and observable setup. We open-source the entire stack as a shared, reproducible foundation for dexterous-manipulation research.

2606.14556 2026-06-15 cs.CV 新提交

Visual Quality Score Assessment of Large White Goods in Remanufacture with Multi-View Deformable-DETR

基于多视角可变形DETR的再制造大型白色家电视觉质量评分评估

Paul Koch, Vivek Chavan

发表机构 * Fraunhofer-Institut für Produktionsanlagen und Konstruktionstechnik (IPK)（弗劳恩霍夫生产设备和结构技术研究所）

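AI summary: Visual quality assessment of large white goods in remanufacturing relies on manual work and struggles with small defects; this paper proposes an automated scoring framework based on multi-view Deformable-DETR that uses self-supervised pretraining and fine-tuning to reduce annotation needs, delivering precise assessment with interpretability.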

Comments Accepted to GCSM 2026

AI中文摘要

再制造大型白色家电对于循环经济至关重要，但视觉质量评估仍然是培训和定价的手动瓶颈。传统的检测方法需要大量标注，并且难以处理高分辨率多视角数据中的小缺陷。我们提出了一个基于可变形DETR的多视角框架，用于自动质量评分，该框架跨冗余视图聚合信息以提取细粒度特征。为了在有限标签下增强鲁棒性，我们采用自监督预训练，随后在专家标注的分数上进行监督微调。此外，在冻结特征图上进行线性投影，以识别感兴趣区域来解释模型决策。在工业多视角数据集上评估，我们的方法提供了精确的质量评估，同时减少了对人工标注和每个部件定制的依赖，为再制造生产线实现了可扩展且透明的检测。

英文摘要

Remanufacturing large white goods is essential for a circular economy, yet visual quality assessment remains a manual bottleneck for training and pricing. Conventional detection methods require extensive annotation and struggle with small defects in high-resolution multi-view data. We present a multi-view framework based on Deformable-DETR for automated quality scoring that aggregates information across redundant views to extract fine-grained features. To enhance robustness with limited labels, we employ self-supervised pretraining followed by supervised fine-tuning on expert-annotated scores. Additionally, a linear projection over frozen feature maps identifies regions of interest to explain model decisions. Evaluated on an industrial multi-view dataset, our approach delivers precise quality assessments while reducing reliance on manual annotation and per-part customization, enabling scalable and transparent inspection for remanufacturing lines.

2606.14555 2026-06-15 cs.CV cs.AI 新提交

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

重新思考全局平均池化：你的分类器实际上是一个多实例学习器

Aray Karjauv

发表机构 * Aray Karjauv（阿瑞·卡贾乌）

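AI summary: Shows that the global average pooling structure in standard image classifiers naturally admits a multiple-instance learning interpretation, so that classifiers trained with a single label per image can still learn in multi-object scenes, and proposes a post-hoc diagnostic for extracting spatial class evidence.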

AI中文摘要

现代图像分类器广泛采用全局平均池化（GAP）后接线性分类头。这种线性结构确保图像级logits等于将分类头逐点应用于GAP之前的特征网格所获得的logits的平均值。因此，标准分类器可能固有地保留空间类别证据，即使在图像级预测错误时这些证据仍可恢复。这种结构自然暗示了多实例学习（MIL）解释，其中图像被视为空间实例的包。在此框架下，我们证明使用每张图像单个标签训练的标准分类器仍然可以在多目标场景中学习预期的分类任务。我们进一步利用这一特性将图像级logits分解为预测网格，提供一种事后诊断方法来提取GAP原本掩盖的空间类别证据。我们的系统评估表明，现成模型始终能在前景区域内恢复真实类别。MIL解释进一步表明，常见的分类器失败反映了均值聚合的已知局限性。

英文摘要

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
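
The linearity argument in this abstract can be checked numerically. The sketch below is a toy illustration rather than the paper's code: it builds a random feature grid and a random linear head (all sizes and names are hypothetical) and compares the image-level logits, computed from the pooled feature, with the mean of the logits obtained by applying the same head at every grid position.

```python
import random

def linear_head(feature, weights, bias):
    """Apply a linear classification head to one C-dimensional feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]

def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

random.seed(0)
C, K, H, W = 8, 3, 4, 4  # hypothetical channel, class, and grid sizes

# Feature grid before pooling, flattened to H*W positions of C channels each.
grid = [[random.gauss(0, 1) for _ in range(C)] for _ in range(H * W)]
weights = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]
bias = [random.gauss(0, 1) for _ in range(K)]

# Standard path: global average pooling, then the linear head.
image_logits = linear_head(mean_vectors(grid), weights, bias)

# Decomposed path: head applied pointwise, giving a prediction grid.
prediction_grid = [linear_head(feature, weights, bias) for feature in grid]
mean_of_grid = mean_vectors(prediction_grid)

max_gap = max(abs(a - b) for a, b in zip(image_logits, mean_of_grid))
print(max_gap)  # differs from zero only by floating-point rounding
```

Because the two paths agree, `prediction_grid` can be read as per-position class evidence whose average is the image-level prediction, which is the decomposition the abstract uses as a post-hoc diagnostic.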

2606.14536 2026-06-15 cs.LG cs.RO cs.SY eess.SY 新提交

Provably Safe, Yet Scalable Reinforcement Learning

可证明安全且可扩展的强化学习

Kai S. Yun, Zeyang Li, Navid Azizan

发表机构 * MIT（麻省理工学院）

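AI summary: Proposes the PS2-RL framework, a two-phase architecture (learn a backup policy that implicitly constructs a control-invariant set, then train an RL policy through a differentiable projection layer) for provably safe yet scalable reinforcement learning, maintaining performance and safety in state spaces of up to 10 dimensions.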

AI中文摘要

安全强化学习旨在学习在满足约束的同时优化奖励的策略。主流方法依赖于软约束策略优化，虽取得经验成功，但无法为学习策略提供正式安全保证。相反，具有严格保证的方法通常依赖显式证书函数，其构造需要直接综合和验证控制不变集，这一过程随状态维度扩展性差，且往往导致过于保守的行为。本文提出可证明安全且可扩展的强化学习（PS2-RL）框架，一种新颖的两阶段架构，以可扩展方式学习可证明安全的策略，旨在克服先前方法的关键瓶颈。PS2-RL不显式计算不变集，而是利用学习的备份策略前向积分系统动力学，在线生成隐式控制不变集。第一阶段，通过提出的安全到达值函数训练备份策略，该值函数刻画了用于不变集构造的最优备份策略。第二阶段，通过可微投影层端到端训练RL策略，该投影层严格强制由学习备份策略诱导的安全保证。通过在第一阶段最大化隐式控制不变集的体积，第二阶段得到的PS2策略既高效又可扩展，同时保持可证明安全性。关键的是，PS2-RL对底层RL算法无限制，可插入任何现有训练流程。我们为所提框架建立了理论保证，并在状态维度高达10的机器人控制任务上进行了评估，而在此范围内，先前可证明安全的RL方法难以应对或变得不实用。

英文摘要

Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.

2606.14535 2026-06-15 cs.RO 新提交

Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera

空间条件扩散策略：使用单个RGB相机学习精确且鲁棒的操作

Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, Min Jun Kim

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）； Neuromeka

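AI summary: Proposes Spatially Conditioned Diffusion Policy (SCDP), which uses end-effector trajectories as visual attention anchors and combines multi-scale feature encoding with a spatial conditioning module to achieve precise, robust manipulation in a single-camera setting.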

Comments 15 pages

AI中文摘要

最近的视觉模仿学习系统广泛采用多相机设置，其中腕部相机已成为事实标准。然而，从单一全局视角进行操作仍然具有挑战性，因为策略需要捕捉细粒度的交互细节并识别任务相关区域，而无需局部腕部视图。为了应对这一挑战，我们提出了空间条件扩散策略（SCDP），一种基于扩散的视觉运动策略，可在单相机设置下实现精确且鲁棒的操作。我们的关键思想是，末端执行器轨迹可以作为反映任务相关区域的视觉注意力锚点。基于这一思想，SCDP由两个关键组件组成：（i）一个视觉编码器，生成多尺度特征图以捕捉更广泛的上下文和细粒度视觉特征，以及（ii）一个空间条件模块，在扩散循环中沿中间末端执行器轨迹采样点状特征。大量的仿真实验表明，SCDP始终优于强大的单视图基线，并实现了与多相机基线相当的性能。真实世界实验进一步证明了其精确操作和对视觉干扰物的鲁棒性，突显了单相机模仿学习的潜力。

英文摘要

Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains challenging, as the policy should capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that reflect task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broader context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.

2606.14534 2026-06-15 cs.CV 新提交

A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

一种轻量级基于基准的离体肿块切除标本三维高光谱映射流水线

Anna Bicchi, Alberto Rota, Leonardo Passoni, Nicola Ancellotti, Andrea Peroni, Lorenzo Vinco, Dario Polli, Elena De Momi

发表机构 * Politecnico di Milano（米兰理工大学）； Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano（米兰理工大学电子、信息与生物工程系）； Department of Physics, Politecnico di Milano（米兰理工大学物理系）

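AI summary: Proposes a fully automated, calibration-free pipeline that generates a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from RGB images and a single HSI acquisition, using ArUco markers to reach sub-millimeter registration in support of intraoperative margin assessment.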

AI中文摘要

高光谱成像（HSI）是保乳手术（BCS）中用于术中评估切缘的一种有前景的模态，但其临床转化需要将固有的二维光谱信息与切除组织的三维形状对齐，以便精确定位可疑区域进行靶向随访。我们提出了一种全自动、免标定的流水线，该流水线从一组消费级相机RGB图像和单次自上而下的HSI采集生成离体肿块切除标本的三维高光谱点云。三维几何结构通过深度学习运动恢复结构（Structure-from-Motion）骨干网络重建，并通过自定义光束法平差（bundle adjustment）在度量参考框架中稳定，该平差对放置在标本周围的四个ArUco标记的角点强制执行一致性。然后，HSI立方体在不恢复HSI相机位姿的情况下配准到重建结果：两种模态中可见的标记定义了16个角点对应关系，驱动平面单应性（planar homography），并通过在正交渲染的深度图上查找恢复三维坐标。在两个离体肿块切除标本上评估，该流水线实现了中位三维配准误差低于1毫米，二维重投影误差低于0.02毫米，在加速硬件上每个标本的总处理时间低于4分钟。这些结果支持将HSI引导的空间定位集成到保乳手术的术中切缘评估工作流程中的可行性。

英文摘要

Hyperspectral Imaging (HSI) is a promising modality for intraoperative assessment of resection margins in Breast-Conserving Surgery (BCS), but its clinical translation requires aligning the inherently 2D spectral information onto the 3D shape of the excised tissue so that suspicious regions can be precisely localized for targeted follow-up. We present a fully automated, calibration-free pipeline that produces a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from a set of consumer-camera RGB images and a single top-down HSI acquisition. The 3D geometry is reconstructed with a deep-learning Structure-from-Motion backbone, stabilized in a metric reference frame by a custom bundle adjustment that enforces consistency on the corners of four ArUco markers placed around the specimen. The HSI cube is then registered to the reconstruction without recovering the HSI camera pose: the markers, visible in both modalities, define 16 corner correspondences that drive a planar homography, and 3D coordinates are recovered by lookup on an orthographically rendered depth map. Evaluated on two ex-vivo lumpectomy specimens, the pipeline achieves a median 3D registration error below 1 mm and a 2D reprojection error below 0.02 mm, with a total per-specimen processing time under 4 minutes on accelerated hardware. These results support the feasibility of integrating HSI-guided spatial localization into intraoperative margin assessment workflows for breast-conserving surgery.

2606.14533 2026-06-15 cs.LG cs.GT 新提交

The Risk Shadow of Principal Component Analysis: When 99.9999% Variance Preservation Causes Catastrophic Decision Errors

主成分分析的风险阴影：当99.9999%的方差保留导致灾难性决策错误

Hamidou Tembine

发表机构 * Department of EECS, School of Engineering, UQTR, Canada（加拿大魁北克大学三河城分校工程学院电气工程与计算机科学系）； Learning and Game Theory Laboratory (LnG Lab), TIMADIE（学习与博弈论实验室（LnG Lab），TIMADIE）

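AI summary: Proves that principal component analysis (PCA) can lose all information about rare, high-impact events while retaining 99.9999% of the variance, reducing classifiers to constant predictors, and proposes Expectile PCA and Tail-Preserving PCA, which reweight the covariance to preserve tail-risk information.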

Comments 5 tables, 1 figure. all references fully checked manually

AI中文摘要

主成分分析（PCA）保留方差，而非检测罕见灾难性事件所需的信息。本文证明了“风险阴影”的存在：PCA可以保留超过99.9999%的总方差，同时完全抹去关于罕见高影响失败的所有信号。当这种情况发生时，即使是在PCA表示上运行的最佳分类器也会退化为常数预测器。根本原因是方差最大化与尾部风险意识之间的根本不匹配。为了打破阴影，我们引入了Expectile PCA（ExPCA）和Tail-Preserving PCA（TP-PCA），这两种方法将数据协方差重新加权以偏向高影响事件。我们从理论上证明，ExPCA在保留罕见事件信息方面严格优于PCA，并在合成数据和真实世界的信用卡欺诈检测基准上验证了我们的主张。我们的结果呼吁在高风险决策中从根本上重新思考基于方差的降维方法。

英文摘要

Principal Component Analysis (PCA) preserves variance, not the information needed to detect rare catastrophic events. This paper proves the existence of a Risk Shadow: PCA can retain over 99.9999 percent of total variance while completely erasing all signal about rare, high-impact failures. When this happens, even the best possible classifier operating on the PCA representation reduces to a constant predictor. The root cause is a fundamental mismatch between variance maximization and tail risk awareness. To break the shadow, we introduce Expectile PCA (ExPCA) and Tail-Preserving PCA (TP-PCA), two methods that reweight the data covariance toward high-impact events. We prove theoretically that ExPCA strictly outperforms PCA in retaining rare-event information, and we validate our claims on synthetic data and a real-world credit card fraud detection benchmark. Our results call for a fundamental rethinking of variance-based dimensionality reduction in high-stakes decisions.
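
A toy construction, not the paper's proof or data, shows how this mismatch can arise. The sketch assumes two features: a high-variance feature that is unrelated to the rare event, and a low-variance feature that marks it. All values and names are hypothetical. PCA is computed in closed form for the 2x2 covariance matrix, and the data are projected onto the first principal component.

```python
import math

def pca_2d(xs, ys):
    """Return (explained-variance ratio of PC1, projections onto PC1) for 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n                    # var(x)
    d = sum((y - my) ** 2 for y in ys) / n                    # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n  # cov(x, y)
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)  # largest eigenvalue
    vx, vy = lam - d, b          # eigenvector for lam (valid here because a > d)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    proj = [vx * (x - mx) + vy * (y - my) for x, y in zip(xs, ys)]
    return lam / (a + d), proj

n = 2000
# Feature 1: large swings that carry no information about the rare event.
xs = [1000.0 if i % 2 == 0 else -1000.0 for i in range(n)]
# Feature 2: indicator of the rare event (2 of 2000 samples).
rare = [i < 2 for i in range(n)]
ys = [1.0 if r else 0.0 for r in rare]

ratio, proj = pca_2d(xs, ys)
rare_values = {round(z, 6) for z, r in zip(proj, rare) if r}
normal_values = {round(z, 6) for z, r in zip(proj, rare) if not r}

print(ratio)                         # above 0.999999
print(rare_values <= normal_values)  # rare samples look like normal ones on PC1
```

In this construction the first component retains more than 99.9999% of the variance, yet every projected rare sample coincides with projected normal samples, so no classifier on that one-dimensional representation can separate them.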

2606.14531 2026-06-15 cs.RO 新提交

AERMANI-PLACE: Language Guided Object Placement with Aerial Manipulators

AERMANI-PLACE: 基于语言引导的空中机械臂物体放置

Sarthak Mishra, Ritama Sanyal, Rishabh Dev Yadav, Wei Pan, Spandan Roy

发表机构 * Robotics Research Center, IIIT Hyderabad（海得拉巴国际信息技术学院机器人研究中心）； Department of Computer Science, University of Manchester（曼彻斯特大学计算机科学系）； Newcastle University（纽卡斯尔大学）

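AI summary: Proposes the AERMANI-PLACE framework, in which a natural-language instruction and an image editing model produce a visual marker that guides an aerial manipulator through object placement, reaching average success rates of 87% on the test set and 72% on a real platform.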

AI中文摘要

物体放置是空中操纵任务的基本组成部分，但现有系统通常需要以度量坐标明确指定期望的放置位置。这种界面不直观，要求用户推理坐标框架和场景几何，使其在实际部署中难以使用。相比之下，人类通常通过语言和指向手势的组合来传达空间目标。受此观察启发，我们提出了AERMANI-PLACE，一个用于空中机械臂语言引导物体放置的框架。给定场景图像和自然语言指令，图像编辑模型生成场景的修改版本，其中包含指示物体应放置位置的视觉标记。然后，使用深度观测将该标记锚定到物理环境中，以恢复度量放置点，之后由空中机械臂生成并执行放置轨迹。我们在包含100个语言引导放置任务的测试集上评估了所提出的方法，并在真实的空中操纵平台上展示了成功执行。实验结果表明，所提出的方法能够可靠地从语言指令中推断放置位置，在测试集上的平均成功率为87%，并有效迁移到真实世界空中操纵，平均成功率为72%。视频：https://youtu.be/SgwwgLBsv0g

英文摘要

Object placement is a fundamental component of aerial manipulation tasks, yet existing systems typically require the desired placement position to be specified explicitly in metric coordinates. Such interfaces are not intuitive and require users to reason about coordinate frames and scene geometry, making them difficult to use in practical deployments. In contrast, humans often communicate spatial goals through a combination of language and pointing gestures. Inspired by this observation, we present AERMANI-PLACE, a framework for language-guided object placement with aerial manipulators. Given a scene image and a natural language instruction, an image editing model generates a modified version of the scene containing a visual marker that indicates where the object should be placed. This marker is then grounded into the physical environment using depth observations to recover a metric place point, after which a placement trajectory is generated and executed by the aerial manipulator. We evaluate the proposed approach on a test set of 100 language-guided placement tasks and demonstrate successful execution on a real aerial manipulation platform. Experimental results show that the proposed method reliably infers placement locations from language instructions with an average success rate of 87% on the test set and transfers effectively to real-world aerial manipulation with an average success rate of 72%. Video: https://youtu.be/SgwwgLBsv0g

2606.14530 2026-06-15 cs.LG 新提交

Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

LLM隐藏状态中的代码正确性信号：生成前探测与修复几何

Carlo Di Cicco

发表机构 * Independent researcher（独立研究员）

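AI summary: Using residualization, finds that code correctness is linearly decodable from the pre-generation hidden states of Qwen3-4B-Instruct-2507 (AUC 0.931), whereas the directional signal for repair success disappears after controlling for repair-context covariates, giving a positive and a negative result from the same method.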

Comments 12 pages, 8 tables. Code, data, and analysis scripts available at https://github.com/CarloDiCicco/ReasoningLab

AI中文摘要

大型语言模型在其隐藏状态中编码丰富信息。本文研究在Qwen3-4B-Instruct-2507生成之前以及修复失败尝试时，代码正确性是否可从隐藏状态中解读，基于444个LiveCodeBench任务。报告两个发现，通过单一混杂控制工具——残差化联系起来。首先，模型首次尝试代码的正确性可从提示最终隐藏状态线性解码，在50个外部分割上无泄漏的留出AUC为0.931±0.008。从每个隐藏状态维度去除提示长度的线性效应后，探针仍达到0.911±0.010，远高于提示长度基线0.754±0.014。其次，在236个清理后的案例中，模型尝试修复失败的首次尝试，从失败尝试到修复的隐藏状态偏移携带统计上可检测的对比方向，在幅度和分割半测试中均显著高于标签打乱的零假设。该方向在对修复上下文协变量（成功与失败修复间不同）进行条件残差化后不再存在，表明它是修复成功的相关因素，由修复上下文驱动，而非孤立的修复理解特征。探针层通过嵌套交叉验证选择，同样的残差化方法支持了生成前正确性结果，却推翻了修复方向解释。贡献既是方法论上的也是实证上的：一个足够诚实的诊断，同时报告了负面结果和正面结果。

英文摘要

Large language models encode rich information in their hidden states. This work asks whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, before it generates and as it repairs a failed attempt, studied on 444 LiveCodeBench tasks. It reports two findings connected by a single confound-control tool: residualization. First, the correctness of the model's first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931 +/- 0.008 across 50 outer splits. After the linear effect of prompt length is removed from each hidden state dimension, the probe still reaches 0.911 +/- 0.010, well above a prompt-length baseline of 0.754 +/- 0.014. Second, on 236 cleaned cases where the model attempts to repair a failed first attempt, the hidden state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant on both a magnitude and a split-half test against label-shuffled nulls. This direction does not survive a conditional residualization against repair-context covariates that differ between successful and failed repairs, marking it as a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The probe layer is selected by nested cross-validation, and the same residualization approach that upholds the pre-generation correctness result overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.
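
The confound control named in this abstract, removing the linear effect of prompt length from each hidden-state dimension, can be sketched as an ordinary least-squares residualization. This is a minimal illustration under assumptions: the helper names and the toy numbers are hypothetical, and the paper's actual pipeline (layer selection, cross-validation, conditional covariates) is not reproduced here.

```python
def residualize(values, covariate):
    """Remove the least-squares linear effect of `covariate` (with intercept) from `values`."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)  # assumes the covariate is not constant
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(covariate, values)]

def residualize_hidden_states(hidden_states, covariate):
    """Residualize every hidden-state dimension independently against one covariate."""
    columns = list(zip(*hidden_states))
    cleaned = [residualize(list(column), covariate) for column in columns]
    return [list(row) for row in zip(*cleaned)]

# Toy data: 5 prompts, 2 hidden dimensions. Dimension 0 leaks prompt length.
prompt_lengths = [10.0, 20.0, 30.0, 40.0, 50.0]
hidden_states = [
    [0.1 * 10 + 1.0, -0.5],
    [0.1 * 20 - 1.0, 0.5],
    [0.1 * 30 + 1.0, -0.5],
    [0.1 * 40 - 1.0, 0.5],
    [0.1 * 50 + 1.0, -0.5],
]

cleaned = residualize_hidden_states(hidden_states, prompt_lengths)
print(cleaned)
```

After this step each dimension is uncorrelated with prompt length, so a probe trained on `cleaned` cannot exploit a purely linear length signal. In a real evaluation the slope would presumably be fitted on training folds only, to avoid leaking held-out information.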

2606.14528 2026-06-15 cs.CL eess.AS 新提交

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

BayLing-Duplex: 单一自回归LLM的原生全双工语音对话

Qingkai Fang, Shoutao Guo, Yang Feng

发表机构 * Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)（中国科学院计算技术研究所智能信息处理重点实验室）； Key Laboratory of AI Safety, Chinese Academy of Sciences（中国科学院人工智能安全重点实验室）； University of Chinese Academy of Sciences（中国科学院大学）

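AI summary: Introduces BayLing-Duplex, a native full-duplex speech language model in which a single autoregressive LLM decides when to listen, speak, and stop, with no external VAD module and only a few added special tokens; with a small amount of fine-tuning data it achieves high interaction success rates and improves response quality.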

Comments Code: https://github.com/BayLing-Models/BayLing-Duplex

AI中文摘要

实时全双工语音交互是下一代语音聊天机器人的关键特性，允许模型同时听和说，并处理重叠、犹豫和插话等自然现象。现有的语音语言模型（如LLaMA-Omni和GLM-4-Voice）仍然是基于回合的，并依赖外部语音活动检测（VAD）模块来标记用户回合的结束，这从根本上限制了它们的交互能力。在本文中，我们介绍了BayLing-Duplex，一种原生全双工SpeechLM，其中单个自回归LLM决定何时听、何时说以及何时停止，无需辅助的回合切换模块。该设计仅在标准词汇表中添加少量特殊标记，因此可以跨LLM迁移，并重用现有的训练和服务堆栈，无需架构适配。从公开的GLM-4-Voice检查点开始，仅使用400K全双工样本进行微调，随后进行轻量级DPO阶段，BayLing-Duplex在InstructS2S-Eval上达到92%的回合切换成功率和100%的打断成功率，同时将语音响应分数从Moshi的2.17提升到3.39。BayLing-Duplex在Llama Questions、Web Questions和Alpaca-Eval上也达到或超过了其基于回合的对应版本，表明同时听和说建模不会牺牲响应质量。

英文摘要

Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while improving the speech-response score from 2.17 to 3.39 over Moshi. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.

2606.14518 2026-06-15 cs.LG 新提交

Behavioral Audit of Machine Unlearning Has a Privacy Cost

机器遗忘的行为审计具有隐私代价

Liou Tang, James Joshi, Ashish Kundu

发表机构 * University of Pittsburgh（匹兹堡大学）； Cisco（思科）

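AI summary: Proves that, when the model owner and the auditor distrust each other, an audit scheme relying solely on behavioral queries to the model cannot identify insufficiently unlearned models without leaking membership information about the retained set, revealing an inherent privacy-audit tradeoff.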

详情

AI中文摘要

通过机器遗忘从机器学习模型中移除已学习数据已被广泛研究；然而，目前尚未有公认的审计方案。现有工作表明，不诚实的模型所有者可以伪造证据来避免执行遗忘，而好奇的审计者（及对手）即使在有限访问权限下也能推断模型及其训练数据的隐私敏感属性。然而，在模型所有者和审计者互不信任的情况下对机器遗忘的审计仍未得到探索。我们为此场景提供了信息论证明：对于凸机器学习模型，仅依赖查询模型获取行为信号的通用审计方案无法在不泄露保留集成员信息的情况下识别未充分遗忘的模型。因此，在不诚实的模型所有者和诚实但好奇的审计者假设下审计机器遗忘面临固有的隐私-审计权衡。我们在凸模型上的实证结果强烈支持这一结论，而进一步实验表明这种隐私-审计张力在非凸模型中依然存在。我们的结果呼吁在更现实的审计者威胁模型下更仔细地考虑隐私-审计张力，并为机器遗忘流程中隐私保护审计方案的设计提供更严格的审查基础。我们还在 https://github.com/LiouTang/Behavioral-Unlearn-Audit 发布了代码实现。

英文摘要

The removal of learned data from Machine Learning models through Machine Unlearning (MU) has been widely studied; however, there has yet to be an agreed-upon scheme for auditing MU. Existing work has shown that a dishonest model owner can falsify evidence to avoid executing MU, while curious auditors (and adversaries) can infer the privacy-sensitive properties of the model and its training data even with limited access. Yet auditing of MU under mutual distrust between the model owner and the auditor remains unexplored. We provide an information-theoretic proof for this scenario: for convex ML models, a generic audit scheme that relies solely on querying the model for behavioral signals cannot identify insufficiently unlearned models without revealing membership information of the retained set. Therefore, auditing MU under the assumption of a dishonest model owner and an honest-but-curious auditor faces an inherent privacy-audit tradeoff. Our empirical results on convex models strongly support this result, while further experiments demonstrate that this privacy-audit tension persists in non-convex models. Our results call for a more careful consideration of the privacy-audit tension under a realistic auditor threat model, and serve as a foundation for more scrutiny of designs of privacy-preserving audit schemes for the MU pipeline. We also release our code implementation at https://github.com/LiouTang/Behavioral-Unlearn-Audit.

2606.14516 2026-06-15 cs.AI cs.CL cs.CY New submission

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

Affiliations: Technical University Munich; Munich Center for Machine Learning; Weizenbaum Institute; Zuse Institute Berlin; Evidence Prime; Trustible; Kitware; ETH Zurich; StickFlux Labs; Stanford University; Northeastern University; IBM Research; Comenius University Bratislava; Cisco; University of Notre Dame; Hebrew University of Jerusalem; University of Oxford; Ohio University; Writer; TCS Research; Oxford University Press; Queen Mary University of London; Technical University Berlin; University of Delaware; Cinemo; Johns Hopkins University; University of Copenhagen; ELLIS; Iowa State University; Meta FAIR; University of Montreal; Mila Quebec AI Institute; EleutherAI; Yale University; Hugging Face; University of Edinburgh; Harvard University; ETH AI Center; MIT; MIT-IBM Watson Lab

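AI summary: To address the inconsistent, hard-to-compare formats of AI evaluation results, this paper proposes the first shared schema and community-crowdsourced repository, unifying results across evaluation frameworks through a standardized representation, automatic converters, and a community database.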

Abstract

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.
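
To make the converter idea in this abstract concrete, here is a minimal Python sketch of normalizing two differently shaped results into one record layout. It is an illustration under stated assumptions, not the Every Eval Ever schema: every field name (`model`, `benchmark`, `score`, `source`, `instances`), both input formats, and both function names are hypothetical.

```python
import json

# All field names and input layouts below are hypothetical illustrations;
# the actual Every Eval Ever schema is not described in this abstract.

def from_leaderboard_row(row):
    """Convert a made-up leaderboard row that reports accuracy in percent."""
    return {
        "model": row["Model"],
        "benchmark": row["Task"],
        "metric": "accuracy",
        "score": row["Acc (%)"] / 100.0,  # rescale percent to the 0-1 range
        "source": {"kind": "leaderboard", "name": row["Board"]},
        "instances": None,  # leaderboards rarely expose per-instance outputs
    }

def from_harness_log(log):
    """Convert a made-up harness log that carries per-instance outputs."""
    instances = [
        {"id": ex["id"], "output": ex["pred"], "correct": ex["pred"] == ex["gold"]}
        for ex in log["examples"]
    ]
    # Recompute the aggregate score from the per-instance results.
    score = sum(inst["correct"] for inst in instances) / len(instances)
    return {
        "model": log["model_name"],
        "benchmark": log["task"],
        "metric": "accuracy",
        "score": score,
        "source": {"kind": "harness", "name": log["harness"]},
        "instances": instances,
    }

leaderboard_row = {
    "Model": "model-a", "Task": "toy-qa", "Acc (%)": 75.0, "Board": "example-board",
}
harness_log = {
    "model_name": "model-a", "task": "toy-qa", "harness": "example-harness",
    "examples": [
        {"id": 0, "pred": "A", "gold": "A"},
        {"id": 1, "pred": "B", "gold": "C"},
    ],
}

records = [from_leaderboard_row(leaderboard_row), from_harness_log(harness_log)]
print(json.dumps(records, indent=2))
```

Once both results share the same keys, the disagreement between the two sources on a nominally identical evaluation shows up as a plain difference in `score`, which is the kind of comparison the abstract says incompatible formats hinder.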

2606.14512 2026-06-15 cs.CL cs.AI New submission

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Michael Goodale, Salvador Mascarenhas

Affiliations: Institut Jean Nicod, Département d’études cognitives ENS, EHESS, CNRS, PSL University

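AI summary: This paper shows experimentally that Lake and Baroni's meta-learning for compositionality model performs poorly on both out-of-distribution and within-distribution problems, and so fails to meet Fodor and Pylyshyn's systematicity challenge to neural networks.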

Comments: Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper.

Abstract

The recent successes of neural networks producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we find that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.
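
The biconditional in the "John saw Mary" example can be stated as a small check. The Python sketch below is only an illustration of that definition, not the paper's evaluation protocol: the `understands` predicates, the toy learners, and the three-word sentence format are all assumptions made for the example.

```python
from itertools import permutations

def swap_arguments(sentence):
    """Swap subject and object in a three-word 'SUBJ VERB OBJ' sentence."""
    subj, verb, obj = sentence.split()
    return f"{obj} {verb} {subj}"

def systematicity_violations(understands, sentences):
    """Return sentences whose status differs from that of their swapped variant.

    Systematicity in the biconditional sense requires
    understands(s) == understands(swap_arguments(s)) for every s.
    """
    return [s for s in sentences if understands(s) != understands(swap_arguments(s))]

names = ["John", "Mary", "Sue"]
sentences = [f"{a} saw {b}" for a, b in permutations(names, 2)]

# Hypothetical learner 1: a lookup table that memorized some whole sentences.
memorized = {"John saw Mary", "Mary saw Sue", "Sue saw Mary"}

def lookup_learner(sentence):
    return sentence in memorized

# Hypothetical learner 2: a symbolic learner that knows words, not sentences.
vocabulary = {"John", "Mary", "Sue", "saw"}

def symbolic_learner(sentence):
    return all(word in vocabulary for word in sentence.split())

print(systematicity_violations(lookup_learner, sentences))
print(systematicity_violations(symbolic_learner, sentences))
```

The lookup learner fails the check because it knows "John saw Mary" without knowing "Mary saw John", while the word-level learner cannot fail it by construction. That contrast is the intuition behind the claim that symbolic systems explain systematicity directly.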
