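arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.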

2606.05165 2026-06-04 cs.LG cs.CL

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye, Amirali Abdullah, Bernhard Schölkopf, Zhijing Jin

Affiliations: Jinesis AI Lab, University of Toronto & Vector Institute; Max Planck Institute for Intelligent Systems, Tübingen, Germany; Thoughtworks; Martian; ELLIS Institute, Tübingen, Germany; EuroSafeAI

AI summary: Proposes STRIDE, a framework that casts training data attribution as a sparse recovery problem in the spirit of compressive sensing. Lightweight "steering operators" in activation space mimic the influence of data subsets, giving efficient and accurate attribution for LLM pre-training.

Comments project page: https://stride-tda.github.io/

Abstract

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude ($13\times$) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.
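
The sparse-recovery framing in the abstract can be illustrated with a toy sketch. It assumes that the shift a subset causes in a test prediction is the sum of the influences of the examples it contains, and it uses synthetic measurements in place of learned steering operators. The recovery step here is plain matching pursuit, a stand-in chosen for brevity rather than the decomposition the authors use; `matching_pursuit`, `masks`, and `shifts` are hypothetical names.

```python
import math


def matching_pursuit(masks, shifts, n_iters):
    """Greedy sparse recovery of per-example influences (illustrative only).

    masks[j][i] is 1 if training example i belongs to subset j, else 0.
    shifts[j] is the measured change in a test prediction caused by subset j,
    modelled here (an assumption) as the sum of its members' influences.
    """
    n_subsets, n_examples = len(masks), len(masks[0])
    columns = [[masks[j][i] for j in range(n_subsets)] for i in range(n_examples)]
    sq_norms = [sum(v * v for v in col) for col in columns]
    influence = [0.0] * n_examples
    residual = list(shifts)
    for _ in range(n_iters):
        best, best_score = None, 1e-12
        for i, col in enumerate(columns):
            if sq_norms[i] == 0:
                continue  # example never sampled, so it cannot be identified
            corr = sum(c * r for c, r in zip(col, residual))
            score = abs(corr) / math.sqrt(sq_norms[i])
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break  # residual is already fully explained
        corr = sum(c * r for c, r in zip(columns[best], residual))
        coef = corr / sq_norms[best]
        influence[best] += coef
        residual = [r - coef * c for r, c in zip(residual, columns[best])]
    return influence


# Four subset perturbations over six training examples:
# fewer measurements than unknowns, as in compressive sensing.
masks = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
true_influence = [0.0, 0.0, 0.0, 0.0, 2.0, 0.0]  # synthetic: only example 4 matters
shifts = [sum(m * w for m, w in zip(row, true_influence)) for row in masks]
recovered = matching_pursuit(masks, shifts, n_iters=2)
```

With a single influential example and distinct subset-membership columns, the greedy step picks the right example exactly. Denser influence patterns generally need more subsets or a stronger solver.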

2606.05162 2026-06-04 cs.CV

Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text

Jaeyeong Kim, Ines Kim, Jahyeok Koo, Seungryong Kim

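Affiliations: KAIST AI Project

AI summary: Proposes T2Mo, a feed-forward framework that generates controllable dynamic 3D shapes conditioned on 3D trajectories and text. A shape-grounded trajectory embedding handles arbitrarily configured trajectories, so the motion follows the trajectories precisely in space while staying globally consistent with the text semantics.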

Comments Project page: https://cvlab-kaist.github.io/T2Mo/

Abstract

We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.

2606.05161 2026-06-04 cs.SD cs.CL

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo, Xi Wu, Xiaocui Yang, Shi Feng, Yifei Zhang, Daling Wang

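Affiliations: Northeastern University, China; Shanghai Artificial Intelligence Laboratory, China

AI summary: Using same-audio counterfactual experiments, this paper finds that audio-language models in conflict tasks often ignore audio evidence because the text dominates. It proposes GACL, a training-free decoding rule that repairs arbitration reversals by interpolating between joint and same-audio scores, substantially improving faithfulness.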

Abstract

Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
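
A rough sketch of what a gated interpolation between joint and same-audio candidate scores might look like is given below. The abstract does not specify the gate or the interpolation weight, so both are assumptions here: the gate opens only when the two branches prefer different candidates, and `alpha` is a hypothetical weight. The function name `gated_interpolation` is likewise illustrative, not the authors' API.

```python
def gated_interpolation(joint_scores, same_audio_scores, alpha=0.5):
    """Return corrected candidate scores (illustrative sketch).

    joint_scores: candidate scores with audio and conflicting text both present.
    same_audio_scores: scores for the same audio with the conflicting text removed.
    alpha: weight toward the same-audio branch (hypothetical default).
    """
    top_joint = max(range(len(joint_scores)), key=joint_scores.__getitem__)
    top_audio = max(range(len(same_audio_scores)), key=same_audio_scores.__getitem__)
    if top_joint == top_audio:
        # Gate closed: both branches agree, so the joint scores are left untouched.
        return list(joint_scores)
    # Gate open: the branches disagree, so blend the two score vectors.
    return [
        (1 - alpha) * j + alpha * a
        for j, a in zip(joint_scores, same_audio_scores)
    ]


# Candidate 0 is audio-supported, candidate 1 is text-supported.
corrected = gated_interpolation([1.0, 2.0], [3.0, 0.5], alpha=0.5)
unchanged = gated_interpolation([2.0, 1.0], [3.0, 0.0], alpha=0.5)
```

In the first call the joint branch prefers the text-supported candidate while the same-audio branch prefers the audio-supported one, so the blended scores restore the audio-supported answer. In the second call the branches agree and nothing changes.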

2606.05160 2026-06-04 cs.RO

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Tianyi Xie, Haotian Zhang, Jinhyung Park, Zi Wang, Bowen Wen, Jiefeng Li, Xueting Li, Qingwei Ben, Haoyang Weng, Yufei Ye, David Minor, Tingwu Wang, Chenfanfu Jiang, Sanja Fidler, Jan Kautz, Linxi Fan, Yuke Zhu, Zhengyi Luo, Umar Iqbal, Ye Yuan

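Affiliations: NVIDIA; UCLA

AI summary: Proposes GRAIL, a fully virtual generation pipeline that uses 3D assets and priors from video foundation models to synthesize human-object interaction demonstrations without physical setups or teleoperation, enabling sim-to-real transfer of humanoid loco-manipulation policies.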

Comments Project page: https://research.nvidia.com/labs/dair/grail/

Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.

2606.05159 2026-06-04 cs.RO

X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation

Rachel Luo, Michael Watson, Apoorva Sharma, Heng Yang, Han Qi, Edward Schmerling, Sushant Veer, Boris Ivanovic, Marco Pavone

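Affiliations: NVIDIA Research; Harvard University; Stanford University

AI summary: Proposes X4Val, a framework that embeds multi-domain data and learns a transferable predictor, then combines it with a control-variates estimator to reduce variance without paired samples. It achieves up to 38.4% variance reduction on autonomous driving and robot manipulation tasks.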

Abstract

Rigorous evaluation of learning-based robotic systems is an essential prerequisite for deployment. However, real-world test data is expensive to gather; moreover, in a typical iterative development context, data gathered from the latest policy is necessarily limited in scale. This motivates evaluation methodologies that make use of heterogeneous data sources, including simulation, historical policy logs, and data collected from related platforms or environments. While such auxiliary data are abundant and inexpensive, they are generally not directly representative of real-world outcomes -- for example, performance in simulation may differ substantially from performance in the real world -- making their principled use for high-confidence performance estimation challenging. In this paper, we introduce X4Val, a general framework for variance-reduced real-world metric estimation in the presence of non-paired, multi-domain data. X4Val embeds samples from real and auxiliary domains into a shared representation space and learns a transferable predictor of real-world metrics; this learned predictor is then incorporated into a control-variates estimator, enabling variance reduction even when paired samples are unavailable. We provide theoretical analysis and empirical evaluations on autonomous driving and real-world robot manipulation tasks, domains across which X4Val achieves up to 38.4% variance reduction and demonstrates consistent improvements over strong baselines. These results show that non-paired, heterogeneous data can be leveraged to substantially improve the sample efficiency of rigorous robotic system validation.
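
The control-variates idea mentioned in the abstract can be sketched with a generic textbook estimator. This is not X4Val itself: the learned predictor and shared embedding are replaced by precomputed prediction lists, and the sketch assumes the predictor has the same expected value on the auxiliary pool as on the real-world distribution. All names (`control_variate_estimate`, `real_pred`, `aux_pred`) are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def control_variate_estimate(real_metric, real_pred, aux_pred, coef=None):
    """Variance-reduced estimate of the mean real-world metric (illustrative).

    real_metric: metric values observed on the few real-world samples.
    real_pred: predictor outputs on those same real-world samples.
    aux_pred: predictor outputs on a large, unpaired auxiliary pool.
    coef: control-variate coefficient; fitted by least squares on the real
          samples when omitted (a common simple choice, assumed here).
    """
    y_bar, g_bar, g_aux = mean(real_metric), mean(real_pred), mean(aux_pred)
    if coef is None:
        var = sum((g - g_bar) ** 2 for g in real_pred)
        cov = sum((y - y_bar) * (g - g_bar) for y, g in zip(real_metric, real_pred))
        coef = cov / var if var > 0 else 0.0  # uninformative predictor: no correction
    # Shift the plain mean by how far the small-sample predictor mean
    # drifts from its better-estimated auxiliary mean.
    return y_bar - coef * (g_bar - g_aux)


real_metric = [1.0, 0.0, 1.0, 1.0]
real_pred = [0.9, 0.1, 0.8, 0.6]
aux_pred = [0.5, 0.7, 0.9]
estimate = control_variate_estimate(real_metric, real_pred, aux_pred, coef=1.0)
```

The correction term has zero mean under the stated assumption, so the estimate stays unbiased while the correlation between the metric and the predictor removes part of the sampling noise.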

2606.05158 2026-06-04 cs.CL cs.AI cs.MA

Streaming Communication in Multi-Agent Reasoning

Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

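Affiliations: HKUST(GZ); Alibaba Group; ZJU; HKUST

AI summary: Proposes StreamMA, a streaming multi-agent reasoning system that reduces latency by streaming each reasoning step to downstream agents as soon as it is generated, which unexpectedly also improves effectiveness. It also gives the first closed-form joint analysis of the stream, serial, and single protocols.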

Comments project page: https://zhenyangcs.github.io/StreamMA-website/

Abstract

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
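
To make the latency argument concrete, here is a toy pipeline model for a chain of agents. It is a textbook pipelining sketch, not the paper's closed-form analysis, and it rests on two assumptions: every step takes the same time, and an agent can start its step s once it has finished its own step s-1 and the upstream agent has emitted step s. The function names are hypothetical.

```python
def serial_latency(n_agents, steps_per_agent, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return n_agents * steps_per_agent * step_time


def streaming_latency(n_agents, steps_per_agent, step_time=1.0):
    """Streaming: steps are forwarded downstream as soon as they are produced."""
    # finish[a][s] is the time at which agent a completes its step s.
    finish = [[0.0] * steps_per_agent for _ in range(n_agents)]
    for a in range(n_agents):
        for s in range(steps_per_agent):
            own_previous = finish[a][s - 1] if s > 0 else 0.0
            upstream = finish[a - 1][s] if a > 0 else 0.0
            finish[a][s] = max(own_previous, upstream) + step_time
    return finish[-1][-1]


serial = serial_latency(n_agents=3, steps_per_agent=4)
streamed = streaming_latency(n_agents=3, steps_per_agent=4)
speedup = serial / streamed
```

Under these assumptions the streamed latency is (steps + agents - 1) step times, so the speedup approaches the number of agents as the number of steps per agent grows.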

2606.05149 2026-06-04 cs.CV cs.LG eess.IV

An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Gandhimathi Padmanaban, Fred Feng

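Affiliations: Department of Electrical and Computer Engineering, University of California, Los Angeles, CA, USA

AI summary: Proposes a two-stage pipeline combining an RT-DETR detector with a fine-tuned ViT-Base/16 for six-category body-type classification, with a confidence-based abstention mechanism. It reaches accuracies of 0.94 and 0.89 on the in-distribution and out-of-distribution datasets, respectively.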

Comments 24 pages, 10 figures, venue TBD

Abstract

Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
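
The confidence-based abstention rule described in the abstract can be sketched as follows. The six class names and the 0.60 threshold come from the abstract; the function names and the use of raw logits as input are assumptions, and this is not the released pipeline code.

```python
import math

CLASSES = [
    "passenger car",
    "SUV",
    "pickup truck",
    "minivan",
    "large van",
    "commercial truck",
]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits, threshold=0.60):
    """Return (label, confidence); the label is 'unknown' below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        return "unknown", confidence  # abstain instead of guessing
    return CLASSES[best], confidence


confident_label, _ = classify_with_abstention([5.0, 0.0, 0.0, 0.0, 0.0, 0.0])
uncertain_label, _ = classify_with_abstention([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

Returning the confidence alongside the label lets downstream analysis separate abstentions from active misclassifications, which is the distinction the abstract draws for the minivan class.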

2606.05145 2026-06-04 cs.LG cs.AI cs.CL

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar, Eilif B. Muller

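Affiliations: Mila - Quebec AI Institute; Université de Montréal; Polytechnique Montréal; CHU Sainte-Justine

AI summary: Proposes identifying fixable failures from the distributional signature of failed reasoning traces rather than from their text, and designs a training-free routing rule that improves the effectiveness of test-time interventions.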

Abstract

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods ($84.3{\pm}4.3\%$ accuracy, $+20\%$ over a majority-class baseline), and support a training-free routing rule that lifts rescue by $+12.2\%$ on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.

URL PDF HTML ☆

赞 0 踩 0

2606.05143 2026-06-04 cs.RO

HORIZON: Recoverability-Governed Curriculum for Physical-Domain Scaling

Chenhao Bai, Liqin Lu, Kaijun Wang, Hui Chen, Jin-Chuan Shi, Yuyang Liu, Hao Chen, Chunhua Shen

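Affiliations: Zhejiang University, State Key Lab of CAD & CG; Zhejiang University of Technology

AI summary: To address the learnability of robot policies under physical-domain scaling, proposes HORIZON, a recoverability-based frontier curriculum that expands physical domains step by step through rollback and boundary refinement. Experiments reveal three regularities of physical-domain expansion.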

Comments 16 pages, 9 figures

Abstract

Scaling robust robot policies requires more than broader randomization, because physical-domain experience must remain organized and learnable throughout training. We study when a policy can benefit from harder physics and identify recoverability as a central constraint in on-policy physical-domain scaling. In on-policy training, new dynamics are useful only insofar as they remain close enough to the current policy to generate corrective on-policy data, rather than collapsing rollouts into unrecoverable failures. Using quadruped locomotion as a physically demanding benchmark for embodied generalization, we introduce HORIZON, a checkpointed frontier curriculum that expands physical domains only within the current policy's recoverable boundary. HORIZON uses rollback and boundary refinement to govern each expansion step, turning fixed randomization into a continual process of physical-domain growth. Experiments reveal three regularities of physical-domain expansion. First, direct domain widening is uneven across physical axes and often unlearnable without staged ordering. Second, domain composition is non-monotonic, and adding more domains beyond a compact core can dilute recoverable joint samples and reduce overall robustness. Third, offline distillation of isolated experts cannot substitute for the joint interaction generated by on-policy curriculum. Together, these results frame physical-domain generalization as a continual growth problem for embodied control, with recoverability as the organizing principle for on-policy expansion.

2606.05142 2026-06-04 cs.CV cs.AI

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Josef Bengtson, Yaroslava Lochman, Fredrik Kahl

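Affiliations: Chalmers University of Technology

AI summary: Proposes GeM-NR, a fast, flexible, training-free method for general multi-view-consistent nonrigid image editing via depth-map alignment, viewpoint projection, and conditional refinement, supporting substantial changes in geometry and appearance.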

Comments Project page: https://gem-nr.github.io/

Abstract

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.

2606.05139 2026-06-04 cs.LG

BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

Luca Thale-Bombien, Jan Ewald, Ralf König, Aaron Klein

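Affiliations: Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI); Leipzig University; ELLIS Institute

AI summary: For omics data produced by high-throughput sequencing, proposes BBOmix, the first open-source tabular benchmark for hyperparameter optimization of unsupervised representation learning, comprising 105,000 evaluations across four autoencoder architectures and seven multi-omics modalities.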

Abstract

The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sensitive to architectural choices and hyperparameters, and unsupervised optimization typically relies on reconstruction loss, which may be a poor proxy for downstream utility. Exhaustive hyperparameter optimization (HPO) is computationally expensive, leading researchers to frequently rely on suboptimal default configurations. To democratize access to large-scale unsupervised HPO research, we introduce $\textbf{BBOmix}$, the first open-source tabular benchmark for unsupervised representation learning on real-world biological data. Our benchmark includes 105,000 evaluations across four AE architectures and seven multi-omics modalities from the TCGA and SCHC datasets. We quantify the correlation between reconstruction loss and downstream task performance and provide an extensive evaluation of state-of-the-art single-fidelity, multi-fidelity, and transfer learning HPO methods, establishing a rigorous baseline for future research in unsupervised biological representation learning.

2606.05138 2026-06-04 cs.LG q-fin.ST

Generating Financial Time Series by Matching Random Convolutional Features

Konrad J. Mueller, Nikita Zozoulenko, Ben Wood, Thomas Cass, Lukas Gonon

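Affiliations: Imperial College London; JPMorgan Chase & Co.; University of St. Gallen

AI summary: Proposes SOCK (SOft Competing Kernels), a differentiable random convolutional feature map, and trains generators by matching random convolutional features of real and generated time series, outperforming signature and diffusion baselines on small-sample financial datasets.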

Abstract

Generating realistic financial time series is challenging as training data is often limited to a single historical path. With such scarce data, overfitting is hard to avoid, especially under adversarial training where a trained discriminator can memorize the training samples. To mitigate this, recent approaches train generators to minimize the discrepancy between untrained feature representations of real and generated time series. In these works, the feature maps are based on path signatures, which can fail to capture relevant time series properties at tractable truncation depths. In this work, we instead train generators by matching random convolutional features of real and generated time series. Existing random convolutional feature maps, such as Rocket and Hydra, have been shown to provide informative representations of real-world time series, but cannot supervise generative models because they are non-differentiable. We introduce SOCK (SOft Competing Kernels), a fully differentiable random convolutional feature map, suited to train generative time series models. We show that generators trained by matching random SOCK features consistently outperform signature and diffusion baselines across a wide range of small-sample financial datasets. We further demonstrate SOCK's expressiveness on two-sample hypothesis testing and time series classification tasks, where SOCK matches or outperforms existing unsupervised feature maps.

2606.05134 2026-06-04 cs.CL cs.LG

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

Yaseen M. Osman, Geoff V. Merrett, Stuart E. Middleton

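Affiliations: School of Electronics and Computer Science (ECS), University of Southampton

AI summary: Studies MLP activation-based deep active learning methods applied to in-context learning and finds that activation signals correlate only weakly with example quality or task performance, indicating that such methods are unsuitable for in-context learning.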

Comments 9 pages, 3 figures

Abstract

Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.

2606.05131 2026-06-04 cs.LG cs.NA math.DS math.NA math.OC math.SP

Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning

Kelan Gray, Finlay Brown, Nicolas Boullé, Matthew J. Colbrook

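Affiliations: Department of Mathematics, Imperial College London; Department of Applied Mathematics and Theoretical Physics, University of Cambridge

AI summary: Proposes DeepMDMD, which combines deep learning with multiplicative DMD and imposes the Koopman product rule as an algebraic constraint in latent space, learning compact and dynamically coherent dictionaries that give stable forecasts and reduced spectral pollution.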

Comments 26 pages, 11 figures

AI中文摘要

Koopman理论将非线性动力学转化为线性谱问题。然而，在计算中，一切都取决于一个困难的有限维选择：可观测量必须具有表现力，在动力学下几乎不变，并且理想情况下与复合运算兼容。深度Koopman方法学习灵活的坐标，而保结构方法在固定字典上强制执行算子恒等式。我们通过引入深度嵌入乘法动态模式分解（DeepMDMD）来结合这些思想，该方法学习潜空间及其划分，同时将Koopman乘积规则作为精确代数约束强制执行。训练在精确的乘法算子更新和可微的潜聚类步骤之间交替进行，后者促进Koopman封闭性。结果是在学习的潜细胞上得到一个有限转移映射。其非零谱位于单位圆上，其字典由动力学而非环境几何塑造，预测在潜坐标中进行，然后解码到物理空间。在哈密顿、混沌和流体示例中，DeepMDMD学习的字典比几何MDMD划分产生的字典更紧凑且动态一致。它减少了谱污染，揭示了更丰富的连续谱结构，并在严重噪声下提供稳定预测。在高维流中，包括158,624维圆柱尾流和噪声$Re=20,000$顶盖驱动空腔，它保持了相干结构和长时间谱统计，而状态空间MDMD则失败。这些结果提出了Koopman学习的实用规则：学习坐标，约束代数。

英文摘要

Koopman theory turns nonlinear dynamics into a linear spectral problem. In computation, however, everything depends on a hard finite-dimensional choice: the observables must be expressive, nearly invariant under the dynamics, and, ideally, compatible with composition. Deep Koopman methods learn flexible coordinates, whereas structure-preserving methods enforce operator identities on fixed dictionaries. We combine these ideas by introducing Deep Embedded Multiplicative Dynamic Mode Decomposition (DeepMDMD), a method that learns a latent space and a partition of it, while enforcing the Koopman product rule as an exact algebraic constraint. Training alternates between an exact multiplicative operator update and a differentiable latent-clustering step that promotes Koopman closure. The result is a finite transition map on learned latent cells. Its nonzero spectrum lies on the unit circle, its dictionary is shaped by the dynamics rather than by ambient geometry, and forecasts are made in latent coordinates before being decoded to physical space. Across Hamiltonian, chaotic, and fluid examples, DeepMDMD learns dictionaries that are far more compact and dynamically coherent than those produced by geometric MDMD partitions. It reduces spectral pollution, reveals richer continuous-spectrum structure, and gives stable forecasts under severe noise. In high-dimensional flows, including a 158,624-dimensional cylinder wake and a noisy $Re=20,000$ lid-driven cavity, it preserves coherent structures and long-time spectral statistics where state-space MDMD fails. These results suggest a practical rule for Koopman learning: learn the coordinates, constrain the algebra.
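Two properties stated in the abstract can be checked on a toy example, assuming the learned model reduces to a map `sigma` that sends each latent cell to exactly one cell (the cell map below is made up for illustration and is not taken from the paper). The Koopman operator of such a map is a composition operator, so it satisfies the product rule exactly, and its nonzero eigenvalues are roots of unity determined by the cycles of the map, which places them on the unit circle.

```python
def koopman(sigma, g):
    """Composition operator on cell-wise observables: (K g)(i) = g(sigma(i))."""
    return [g[sigma[i]] for i in range(len(sigma))]

def cycle_lengths(sigma):
    """Lengths of the periodic orbits of the cell map sigma.
    A cycle of length L contributes the L-th roots of unity to the spectrum;
    cells that never return (transients) contribute only zero eigenvalues."""
    seen, lengths = set(), []
    for start in range(len(sigma)):
        path, i = [], start
        while i not in seen and i not in path:
            path.append(i)
            i = sigma[i]
        if i in path:  # the walk closed a cycle that was not visited before
            lengths.append(len(path) - path.index(i))
        seen.update(path)
    return sorted(lengths)

# Hypothetical cell map: 0 -> 1 -> 2 -> 0 is a 3-cycle, cells 3 and 4 are transient.
sigma = [1, 2, 0, 0, 3]
f = [1, 2, 3, 4, 5]
g = [2, 0, 1, 1, 3]
fg = [a * b for a, b in zip(f, g)]

# Product rule: K(f * g) equals K(f) * K(g) pointwise.
lhs = koopman(sigma, fg)
rhs = [a * b for a, b in zip(koopman(sigma, f), koopman(sigma, g))]
```

Here `lhs == rhs` holds for any choice of `f` and `g`, which is the algebraic constraint the abstract refers to; a general least-squares DMD matrix carries no such guarantee.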

2606.05130 2026-06-04 cs.LG cs.AI

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

面向高效且基于证据的移动预测：基于LLM驱动的智能体

Linyao Chen, Qinlao Zhao, Zechen Li, Mingming Li, Likun Ni, Jinyu Chen, Yuhao Yao, Xuan Song, Noboru Koshizuka, Hiroki Kobayashi

发表机构 * The University of Tokyo（东京大学）； Huazhong University of Science and Technology（华中科技大学）； University of New South Wales, Sydney（新南威尔士大学（悉尼））； LocationMind Inc.（LocationMind公司）； Southern University of Science and Technology（南方科技大学）； Jilin University（吉林大学）

AI总结提出一种无需训练的LLM驱动智能体框架AgentMob，通过自适应证据收集机制解决移动预测中的模糊情况，在多个数据集上达到最优性能。

AI中文摘要

个体层面的移动预测是城市模拟、交通规划和政策分析的核心。监督序列模型实现了高精度，但需要任务特定训练且决策透明度有限。最近的基于LLM的方法提高了可解释性，但大多依赖静态提示和单次推理，限制了在移动信号弱或冲突时寻求额外证据的能力。我们提出AgentMob，一种无需训练的LLM驱动智能体框架，将下一位置预测建模为自适应证据控制的决策制定。AgentMob通过基于历史规律性的快速路径处理常规情况，而模糊情况则触发对近期轨迹、历史行为、停留-移动可能性和地理证据的迭代工具使用。在三个移动数据集上，AgentMob在无需训练的基于LLM的方法中实现了最强的整体性能，GPT-5.4在BW上达到71.42%的Acc@1，在YJMob100K上达到33.14%，在上海ISP上达到33.50%。在BW的非快速路径案例中，LLM控制器相比相同工具的统计基线将Acc@1从30.65%提高到48.62%，表明其主要优势在于通过自适应证据收集解决模糊预测。我们的代码可在 https://github.com/Unknown-zoo/AgentMob 获取。

英文摘要

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.
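The split between a fast path and an escalated path can be sketched as a simple gate. The rule below is an assumption for illustration: it treats a case as routine when one historical successor of the current location dominates, and the function name, the transition format, and the 0.8 threshold are hypothetical rather than taken from the AgentMob code.

```python
from collections import Counter

def predict_next(history, current, threshold=0.8):
    """Return (location, route).

    history: list of (from_location, to_location) transitions for one user.
    If one successor of `current` accounts for at least `threshold` of past
    transitions, answer directly; otherwise signal that more evidence
    (tool calls, in the paper's terms) is needed.
    """
    successors = Counter(dst for src, dst in history if src == current)
    total = sum(successors.values())
    if total:
        top, count = successors.most_common(1)[0]
        if count / total >= threshold:
            return top, "fast_path"
    return None, "needs_tools"

# Toy history: "home" is almost always followed by "work"; "work" is ambiguous.
history = [("home", "work")] * 9 + [("home", "gym")]
history += [("work", "home"), ("work", "cafe"), ("work", "gym")]
```

With this toy history, `predict_next(history, "home")` resolves on the fast path, while `predict_next(history, "work")` is handed off for further evidence gathering.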

2606.05122 2026-06-04 cs.CL

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

自我评估已然存在：用最少数据激发基础LLM中的潜在评判校准

XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang

发表机构 * National University of Singapore（新加坡国立大学）； Beijing University of Technology（北京工业大学）

AI总结本文提出自我评估激发（SEE）方法，通过少量数据（160个示例）结合校准耦合强化学习和掩码蒸馏，激发基础LLM中已有的预测外部评判者评分能力，在保持答案质量的同时显著提升校准性能。

AI中文摘要

大型语言模型越来越多地被其他模型评估，这引发了一个自然问题：模型能否预测评判者将如何对其自身输出进行评分？我们发现，这种能力在很大程度上已经存在于任何针对性训练之前：通过少量示例提示，基础模型已经能够预测外部评判者对开放式回答的多属性质量评分，在三个基准测试中显著高于随机水平。我们引入了自我评估激发（SEE）方法，该方法通过一个短周期来表面化这种潜在能力，该周期包括一个校准耦合的强化学习阶段（改进答案并预测评判者），随后是一个掩码蒸馏阶段（增强预测而不改变答案）。通过160个独特示例（比强化学习基线少约31倍），SEE在三个基准测试中改善了保留校准，同时保持了答案质量。激发的自我评估严格定位于模型自身的词元分布内，并且对于从未训练过的评判者保持稳定，这表明了一种可转移的质量概念，而非单一评判者的偏好。这些结果将评判者对齐的自我评估重新定义为激发问题而非获取问题。

英文摘要

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

2606.05121 2026-06-04 cs.SD cs.AI cs.CL cs.MM eess.AS

Audio Interaction Model

音频交互模型

Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

发表机构 * NTU（南洋理工大学）； NUS（新加坡国立大学）； CUHK（香港中文大学）

AI总结提出一种统一的在线大型音频语言模型Audio-Interaction，通过始终在线的感知-决策-响应循环实现实时音频交互，并构建了StreamAudio-2M数据集和Proactive-Sound-Bench基准，在保持主流音频任务性能的同时解锁了实时ASR、流式音频指令跟随和主动帮助等能力。

Comments Next generation of LALMs, work in progress

AI中文摘要

音频本质上是一种交互式模态，然而当今的大型音频语言模型（LALM）是离线的，而流式音频模型每个只处理单一任务，如流式ASR或语音聊天。现在是时候将它们统一为一个在线LALM：一个通过始终在线的感知-决策-响应循环，实时收听声音、环境和指令并即时反应的模型。我们将这种机制形式化为音频交互模型，并通过Audio-Interaction实现，这是一个统一的流式模型，在保留离线任务执行的同时，增加了在线通用音频指令跟随能力，从对话到全语音聊天，根据流语义决定何时响应。为此，我们提出了SoundFlow框架，该框架通过流原生数据构建、理解感知训练和异步低延迟推理，端到端地实例化感知-决策-响应循环，实现稳定的实时交互。我们进一步构建了StreamAudio-2M，一个包含260万项流式语料库，涵盖7种基本能力和28个子任务，以及用于评估主动音频干预的Proactive-Sound-Bench。在8个基准测试中，Audio-Interaction在主流音频任务上保持有竞争力的性能，同时解锁了离线LALM无法实现的能力，包括实时ASR、流式音频指令跟随和主动帮助。

英文摘要

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

2606.05116 2026-06-04 cs.LG

Graph Set Transformer

图集变换器

Jose E. Escrig Molina, Baoquan Chen, Daniel Probst

发表机构 * Bioinformatics Group, Wageningen University（瓦赫宁根大学生物信息学组）； Department of Physics, Technical University of Munich（慕尼黑工业大学物理系）

AI总结提出图集变换器（GST），通过层间交织节点级特征传播与跨图上下文建模，解决图集合学习任务中局部结构与集合上下文融合问题，在合成和真实基准上优于基线。

Comments 10 pages, 1 figure, conference

AI中文摘要

我们介绍了图集变换器（GST），一种用于在图集合上学习的神经网络架构，设计用于每个元素的预测依赖于集合范围的上下文以及局部结构的任务。现有架构，包括DeepSets和SetTransformer，需要来自单独GNN的预编码图嵌入，在特征提取和集合级上下文化之间造成瓶颈。相比之下，GST在每一层交织节点级特征传播和跨图上下文建模，通过门控机制融合两个信息层次。我们在一个旨在隔离集合条件结构推理的受控合成套件以及三个真实数据基准（包括逐原子反应中心识别、反应产率预测和图像分类）上评估了GST。在匹配参数预算下，GST在这些设置中表现优于基线。架构消融强烈表明，局部和集合上下文的交织对这一优势有显著贡献。

英文摘要

We introduce the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for tasks in which per-element predictions depend on set-wide context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded graph embeddings from a separate GNN, creating a bottleneck between feature extraction and set-level contextualisation. In contrast, GST interleaves node-level feature propagation and cross-graph contextual modelling at every layer, fusing the two levels of information through a gating mechanism. We evaluate GST on a controlled synthetic suite designed to isolate set-conditional structural reasoning and on three real-data benchmarks spanning per-atom reaction-centre identification, reaction yield prediction, and image classification. Under matched parameter budgets, GST performs better than the baselines across these settings. An architectural ablation strongly suggests that the interleaving of local and set context contributes substantially to this advantage.
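The idea of interleaving a local graph step with a set-level step inside one layer, then fusing them through a gate, can be sketched without any deep learning library. Everything below is a simplification made for illustration: mean pooling stands in for cross-graph attention, the gate is a fixed sigmoid with no learned weights, and the function names are hypothetical. It is not the GST architecture itself.

```python
import math

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def gst_like_layer(graphs):
    """One interleaved layer over a set of graphs.

    graphs: list of (features, adjacency), where features[i] is the vector of
    node i and adjacency[i] lists the neighbours of node i.
    Returns the updated features, graph by graph.
    """
    # Set-level context: average of every graph's mean node feature.
    context = mean([mean(feats) for feats, _ in graphs])
    updated = []
    for feats, adj in graphs:
        new_feats = []
        for i, h in enumerate(feats):
            # Local step: average the node with its neighbours.
            local = mean([h] + [feats[j] for j in adj[i]])
            # Gate in (0, 1): sigmoid of the dot product of local and context.
            score = sum(a * b for a, b in zip(local, context))
            gate = 1 / (1 + math.exp(-score))
            # Fuse the two levels of information.
            new_feats.append([gate * a + (1 - gate) * c
                              for a, c in zip(local, context)])
        updated.append(new_feats)
    return updated

# Two tiny graphs with one-dimensional node features.
graph_a = ([[1.0], [3.0]], [[1], [0]])   # two connected nodes
graph_b = ([[0.0]], [[]])                # a single isolated node
out = gst_like_layer([graph_a, graph_b])
```

The point of the sketch is that every node update already sees information from the other graphs in the set, instead of the set-level model receiving one frozen embedding per graph.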

2606.05115 2026-06-04 cs.CV cs.AI cs.CL

Continual Visual and Verbal Learning Through a Child's Egocentric Input

通过儿童自我中心输入进行持续的视觉与语言学习

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

发表机构 * Agentic Learning AI Lab, New York University（代理学习人工智能实验室，纽约大学）； Department of Psychology, Princeton University（心理学系，普林斯顿大学）

AI总结提出BabyCL持续多模态学习框架，在单一时间顺序处理SAYCam数据集，通过流式视觉表示学习和图像-文本对比目标，在SAYCam Labeled-S 4AFC基准上优于流式学习基线，缩小了与离线训练上限的差距。

Comments 15 pages, 4 figures

AI中文摘要

儿童从连续的、时间结构化的自我中心经验流中学习单词的含义。最近的研究表明，神经网络也可以从儿童的自我中心视频记录中学习单词-指代物映射，但它们会循环处理打乱的数据数百个周期，这与儿童实际接触环境的方式形成对比。我们引入了BabyCL，一个持续多模态学习框架，它以单一时间顺序处理SAYCam数据集，结合了流式视觉表示学习和图像-文本对比目标。BabyCL将流的多阶段时间分割与双回放缓冲区相结合，该缓冲区独立管理视觉和多模态历史，并在共享骨干网络上联合训练三个对比损失。在匹配的优化预算下，BabyCL在SAYCam Labeled-S 4AFC基准上优于流式学习基线，显著缩小了与离线训练上限的差距。消融实验表明，这些增益对在线时间分割窗口的长度和回放缓冲区的驱逐规则具有鲁棒性。总之，这些结果表明，在更接近儿童实际体验的训练条件下，有意义的单词-指代物映射可以出现。

英文摘要

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
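A dual replay buffer that keeps visual and multimodal histories separately could look like the sketch below. The class name, the capacities, and the first-in-first-out eviction rule are assumptions chosen for illustration; the abstract only states that the two histories are managed independently and that results are robust to the eviction rule.

```python
from collections import deque
import random

class DualReplayBuffer:
    """Two independent bounded histories filled during one chronological pass."""

    def __init__(self, visual_capacity, multimodal_capacity, seed=0):
        # deque(maxlen=...) silently evicts the oldest item when full (FIFO).
        self.visual = deque(maxlen=visual_capacity)          # frames only
        self.multimodal = deque(maxlen=multimodal_capacity)  # (frame, utterance)
        self.rng = random.Random(seed)

    def add(self, frame, utterance=None):
        """Every frame enters the visual history; only frames with an
        accompanying utterance enter the multimodal history."""
        self.visual.append(frame)
        if utterance is not None:
            self.multimodal.append((frame, utterance))

    def sample(self, k_visual, k_multimodal):
        """Draw replay items for the visual and image-text objectives."""
        v = self.rng.sample(list(self.visual), min(k_visual, len(self.visual)))
        m = self.rng.sample(list(self.multimodal),
                            min(k_multimodal, len(self.multimodal)))
        return v, m

# Stream five frames in order; only even-numbered frames come with speech.
buf = DualReplayBuffer(visual_capacity=3, multimodal_capacity=2)
for t in range(5):
    buf.add(frame=t, utterance=f"w{t}" if t % 2 == 0 else None)
```

After the loop the visual history holds the three most recent frames and the multimodal history holds the two most recent frame-utterance pairs, so the two histories cover different time spans of the stream.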

2606.05112 2026-06-04 cs.CL

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

评估大型语言模型在标准化病人案例中的动态临床决策能力

Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结本文提出MedSP1000基准，通过标准化病人案例模拟动态临床交互，评估LLM在信息收集、治疗计划和长期管理中的表现，发现当前模型在过程级评估中远未达到临床安全标准。

AI中文摘要

大型语言模型（LLMs）越来越多地被提议作为临床代理，然而静态的单轮基准无法捕捉模型在诊疗过程中如何动态地提供护理：收集信息、规划治疗以及跨连续患者状态调整长期管理。医学教育长期以来通过标准化病人（SPs）解决了类似的挑战：经过培训的演员一致地扮演临床案例，实现逼真的实践和客观的脚本化评估。在此，我们介绍MedSP1000，一个源自SP的交互式基准，用于临床代理评估，包括1,638个SP案例和24,602个轨迹级同行评审评分标准。MedSP1000将同行评审的SP教学案例转化为可执行场景，包含定义的SP案例脚本、临床环境上下文和人工验证的结构化评分标准。在每次模拟评估运行中，临床代理与患者代理和环境控制器闭环交互，其行为根据原始材料中指定的专家标准在整个诊疗过程中进行评分。将MedSP1000应用于一系列通用和医学专用LLMs，我们发现静态基准上的表现并不能可靠地转化为此类教育场景。表现最好的模型GPT-5.5仅完成了60.4%的专家定义评分项目，而最强的医学专用模型达到了40.0%；增加测试时计算量没有产生可测量的增益。这些结果表明，当前的LLMs，包括为医学调整的代理系统，尚未足够可靠以安全地整合到实际临床实践中。更广泛地说，MedSP1000展示了过程级、SP式评估如何揭示单轮基准无法捕捉的临床相关失败模式。

英文摘要

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

2606.05109 2026-06-04 cs.LG

RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities

RePercENT：将解耦表示学习扩展到两种模态之外

Vasiliki Rizou, Pascal Frossard, Dorina Thanou

发表机构 * EPFL（洛桑联邦理工学院）

AI总结提出RePercENT框架，通过多模态即插即用架构和联合优化目标，实现超过两种模态的可扩展成对解耦，无需联合预训练并降低计算复杂度。

AI中文摘要

为了充分利用多模态数据的潜力，我们需要超越当前最先进的对齐和融合方法，在不牺牲模态特定信息的情况下利用所有跨模态交互。学习解耦表示是识别隐藏在观测数据中的潜在共享和独特因素的一种原则性方法。然而，尽管多模态解耦是一个引人注目的范式，现有方法由于固有的可扩展性瓶颈，主要局限于两种模态。为了解决这个问题，我们提出了RePercENT，这是一个自监督框架，旨在超越这些限制，并解锁超过两种模态的可扩展成对解耦。通过多模态“即插即用”架构，我们的方法直接操作于预提取的嵌入，消除了对广泛联合预训练的需求，同时不对底层模态或基础模型骨干做出任何假设。此外，我们引入了一个联合优化目标，用于同时推导共享和独特组件，并提供了形式化的理论保证来表征我们解决方案的最优性。在多种模态和任务中，RePercENT成功恢复了解耦组件，同时保持了竞争性能并显著降低了计算复杂度。

英文摘要

To leverage the full potential of multimodal data, we need representations that go beyond the state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to identify the underlying shared and unique factors that are hidden in observational data. However, while multimodal disentanglement is a compelling paradigm, existing methods are largely confined to the two-modality regime due to their inherent scalability bottleneck. To address this, we propose RePercENT, a self-supervised framework designed to surpass these limitations and unlock scalable pairwise disentanglement beyond two modalities. Through a multimodal "plug-and-play" architecture, our approach operates directly on pre-extracted embeddings, eliminating the need for extensive joint pre-training while making no assumptions regarding the underlying modalities or foundation model backbones. Moreover, we introduce a joint optimization objective for simultaneously deriving the shared and unique components, and provide formal theoretical guarantees that characterize the optimality of our solution. Across diverse modalities and tasks, RePercENT successfully recovers disentangled components while maintaining competitive performance and significantly reducing computational complexity.

2606.05107 2026-06-04 cs.CV cs.AI

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

谁需要标签？利用已有的元数据适应视觉基础模型

Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie

发表机构 * Meta FAIR, Paris（Meta FAIR，巴黎）； LIGM, CNRS, Gustave Eiffel, ENPC, IP Paris（LIGM，CNRS，居斯塔夫·艾菲尔，ENPC，IP巴黎）； Columbia University, New York（哥伦比亚大学，纽约）

AI总结提出一种无标签方法FINO，利用元数据通过自监督学习将通用视觉基础模型适应到专业科学领域，无需任务标签且仅用轻量探针进行监督，在多个领域超越标准无监督和全监督适应方法。

AI中文摘要

我们提出一种无标签方法，将强大但通用的视觉基础模型适应到专业科学领域。标准的监督微调通常不适合这些场景：标签稀缺，且任务特定训练可能破坏模型的通用性和鲁棒性。我们转而利用元数据以自监督方式将表示适应到新领域。我们的方法FINO结合了标准的自监督目标与灵活的元数据指导，能够处理高度细粒度的离散元数据和连续元数据。它鼓励表示保留信息因子，同时抑制虚假因子。在亚细胞荧光显微镜、地球观测、野生动物监测和医学成像中，FINO始终优于标准的无监督域适应和全监督适应。它甚至超过了高度专业化的领域特定最先进方法，同时在骨干网络适应中不使用任何任务标签，仅使用轻量探针进行监督。

英文摘要

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

2606.05106 2026-06-04 cs.CL cs.AI cs.CY

Arithmetic Pedagogy for Language Models

语言模型的算术教学法

Andhika Bernard Lumbantobing, Hokky Situngkir

发表机构 * Bandung Fe Institute & Adjunct Science Fellow in InaAI（万隆Fe研究所及InaAI兼职科学研究员）； AI Research Center IT Del & Bandung Fe Institute（IT Del人工智能研究中心及万隆Fe研究所）

AI总结借鉴人类数学教学法，通过将GASING方法操作化为链式思维监督训练小规模GPT-2模型，使其在算术推理上达到高准确率并展现出联想式心算能力。

Comments 18 pages, 6 figures

AI中文摘要

我们研究人类数学教学法能否指导语言模型训练以实现算术推理。基于GASING方法——一种通过从左到右过程解决基本算术的印尼教学法，该过程与令牌生成的因果顺序一致——我们将每个操作操作化为一个计算过程，其执行轨迹序列化为自然语言的链式思维监督。使用仅下一个令牌预测目标（无强化学习或基于奖励的优化），从零开始训练一个带有音节-粘着TOBA分词器的小型GPT-2解码器（86M参数）。监控训练揭示了三个不同的学习阶段，机制分析——对链式思维信息图的注意力掩码干预、残差流探测和对数透镜检查——表明模型首先内化程序化路径，随后发展出联想式“心算”能力，无需显式逐步计算即可检索中间结果。训练后的模型在保留问题上达到超过80%的准确率，并与显著更大的语言模型相比具有竞争力，表明有针对性的、基于教学法的训练可以在小规模下产生强大且经济的算术能力。

英文摘要

We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method, an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation, we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses (attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection) show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.
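To make the notion of a left-to-right procedure whose execution trace becomes supervision text concrete, here is a minimal sketch for addition. It is a generic most-significant-digit-first scheme written for illustration; the actual GASING steps and the trace wording used by the authors are not given in the abstract and are not reproduced here.

```python
def left_to_right_add(a, b):
    """Add two non-negative integers place by place, most significant first.

    Returns (total, trace), where trace is a list of natural-language steps
    that could be joined into a chain-of-thought string.
    """
    width = max(len(str(a)), len(str(b)))
    digits_a, digits_b = str(a).zfill(width), str(b).zfill(width)
    total, trace = 0, []
    for pos, (x, y) in enumerate(zip(digits_a, digits_b)):
        place = 10 ** (width - 1 - pos)          # 100, 10, 1, ...
        part = (int(x) + int(y)) * place         # sum of this place value
        total += part
        trace.append(
            f"{int(x) * place} + {int(y) * place} = {part}; running total {total}"
        )
    return total, trace

total, trace = left_to_right_add(47, 38)
# trace[0] is "40 + 30 = 70; running total 70"
# trace[1] is "7 + 8 = 15; running total 85"
```

Because the leading digits are resolved first, the order of the steps matches the order in which a decoder emits tokens, which is the alignment the abstract highlights.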

2606.05103 2026-06-04 cs.LG astro-ph.IM cs.CV stat.ML

Identifying Gems from Roman RAPIDly

借助RAPID从Roman数据中快速识别瑰宝

Karan Gandhi, Ashish A. Mahabal, Jacob E. Jencson, Russ R. Laher, Ben Rusholme, Lin Yan, Ryan M. Lau, Schuyler D. Van Dyk, Mansi M. Kasliwal

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology, Gandhinagar, India（印度理工学院计算机科学与工程系）； Division of Physics, Mathematics, and Astronomy, California Institute of Technology, Pasadena, CA 91125, USA（加州理工学院物理、数学与天文学系）； Center for Data Driven Discovery, California Institute of Technology, Pasadena, CA 91125, USA（数据驱动发现中心）； IPAC, California Institute of Technology, 1200 E. California Blvd, Pasadena, CA 91125, USA（IPAC, 加州理工学院）； Caltech Optical Observatories, California Institute of Technology, Pasadena, CA 91125, USA（加州理工学院光学观测站）

AI总结针对Roman太空望远镜无真实数据的问题，提出机器学习模型RuBR和通用方法，用于在RAPID流水线中区分真实瞬变/变源与虚假检测，实验表明该方法在Roman时代具有鲁棒性。

Comments 15 pages, 10 figures, Submitted to the Publications of the Astronomical Society of the Pacific

AI中文摘要

南希·格雷斯·罗马太空望远镜（Roman）计划最早于2026年9月发射，将以前所未有的空间分辨率和节奏进行宽场红外成像巡天，从而发现数百万天文瞬变源。因此，有必要建立自动化的警报流水线，以便望远镜在发射后不久就能开始发现可靠的瞬变源和变源。然而，目前不存在真实的Roman数据，这使得开发此类流水线变得困难。在这项工作中，我们提出了一个机器学习模型$RuBR$和一种通用方法，用于在RAPID流水线中区分真实的瞬变和变源检测与虚假检测。具体而言，我们使用该方法提出了三个模型：$RuBR_{comb}$在本地注入和OpenUniverse2024瞬变源的组合数据上训练和测试，$RuBR_{loc}$在本地注入瞬变源上训练并在OpenUniverse2024瞬变源上测试，以及$RuBR_{DA}$将本地注入瞬变源与部分OpenUniverse2024瞬变源以域适应模式结合进行训练。这为在Roman任务早期阶段缺乏真实标签的情况下，将$RuBR_{comb}$模型适应真实观测的策略铺平了道路。尽管图像差分流水线仍在改进中，但我们的实验结果证明了所提出方法的有效性及其在Roman时代进行稳健真实-虚假分类的前景。

英文摘要

The Nancy Grace Roman Space Telescope (Roman), set for launch as early as September 2026, will conduct wide-field infrared imaging surveys with unprecedented spatial resolution and cadence, enabling the discovery of millions of astronomical transients. Hence, it is necessary to have automated pipelines for generating alerts in place so that the telescope can begin discovering reliable transients and variable objects soon after it is launched. However, no real Roman data currently exist, making the development of such pipelines difficult. In this work, we present a machine learning model $RuBR$ and a general methodology for distinguishing genuine transient and variable detections from spurious (bogus) detections within the RAPID pipeline. In particular, we present three models using this methodology: $RuBR_{comb}$ trained and tested on combined locally injected and OpenUniverse2024 transients, $RuBR_{loc}$ trained on locally injected transients and tested on OpenUniverse2024 transients, and $RuBR_{DA}$ that combines locally injected transients with a fraction of OpenUniverse2024 transients in domain-adaptation mode for training. This paves the way for strategies to adapt the $RuBR_{comb}$ model to real observations in the absence of any ground-truth labels during the early phases of the Roman mission. While the image differencing pipeline continues to be improved, our experimental results demonstrate the effectiveness of the proposed approach and its promise for robust real-bogus classification in the Roman era.

2606.05101 2026-06-04 cs.SD cs.LG

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

FoeGlass: 简单的上下文学习足以对音频深度伪造检测器进行红队测试

Sepehr Dehdashtian, Jacob H Seidman, Vishnu N Boddeti, Gaurav Bharaj

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出FoeGlass，一种基于大语言模型上下文学习的黑盒自动红队方法，通过生成音频样本发现深度伪造检测器的盲点，使检测器的假阴性率提高多达94%。

Comments Accepted at ICML 2026

AI中文摘要

音频深度伪造检测（ADD）模型对于对抗文本转语音（TTS）模型的恶意使用至关重要。评估和增强ADD模型需要开发覆盖生成音频空间并突出高错误区域的数据集。现有数据集开发策略面临两个挑战：（i）手动收集，以及（ii）低效发现ADD模型中的盲点。为应对这些挑战，我们提出FoeGlass，这是首个针对ADD的黑盒自动红队方法，能有效发现最先进深度伪造基准未充分探索的生成音频空间中的ADD失败模式。FoeGlass利用大语言模型的上下文学习能力探索TTS模型的输入空间，仅通过黑盒访问所有组件即可生成欺骗目标ADD的音频样本。通过使用基于多样性度量精心设计的上下文，FoeGlass缓解了自动红队系统中常见的模式崩溃问题。在多个开源ADD和TTS模型上的实证评估表明，与无条件采样基线和最近的欺骗数据集相比，FoeGlass生成的数据将假阴性率大幅提升高达94%，且无需人工监督。此外，我们证明FoeGlass生成的攻击在不同目标ADD之间具有可迁移性，展示了其在ADD系统自动红队中的广泛适用性和易用性。最后，在FoeGlass生成的样本上微调ADD模型显著增强了检测器的鲁棒性（提升高达41%）。

英文摘要

Audio deepfake detection (ADD) models are critical for countering the malicious use of text-to-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies face two challenges: (i) manual collection, and (ii) inefficient discovery of blind spots in the ADD models. To address these challenges, we propose FoeGlass, the first black-box automated red-teaming method for ADDs, which effectively discovers ADD failure modes in the space of generated audio underexplored by state-of-the-art deepfake benchmarks. FoeGlass uses the in-context learning capabilities of an LLM to explore the input space of a TTS model, generating audio samples that fool the target ADD using only black-box access to all components. By using a carefully designed context based on diversity measurements, FoeGlass mitigates the common problem of mode collapse in automated red-teaming systems. Empirical evaluations on several open-source ADD and TTS models demonstrate that data generated from FoeGlass substantially increases the detectors' false negative rates over unconditional sampling baselines and recent spoofing datasets by up to 94%, while requiring no manual supervision. Furthermore, we show that the attacks generated by FoeGlass are transferable across different target ADDs, demonstrating its broad applicability and ease of use for the automated red teaming of ADD systems. Finally, fine-tuning ADD models on FoeGlass-generated samples notably enhances the robustness of the detectors (by up to 41%).
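The headline metric here is the detector's false negative rate: the share of synthetic clips that the detector lets through as genuine. A higher value means a more successful red-teaming set. The helper below is a minimal sketch with hypothetical names and toy labels, not part of FoeGlass.

```python
def false_negative_rate(is_fake, flagged_fake):
    """Fraction of truly fake clips that the detector failed to flag.

    is_fake: ground-truth booleans, True for a generated clip.
    flagged_fake: detector decisions, True when the clip was flagged as fake.
    Assumes at least one clip is fake.
    """
    fakes = sum(1 for fake in is_fake if fake)
    misses = sum(1 for fake, flagged in zip(is_fake, flagged_fake)
                 if fake and not flagged)
    return misses / fakes

# Toy evaluation: four generated clips and one genuine clip.
is_fake = [True, True, True, True, False]
flagged = [True, False, False, True, False]
fnr = false_negative_rate(is_fake, flagged)
```

In this toy case the detector misses two of the four generated clips, so `fnr` is 0.5; the genuine clip does not enter the calculation.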

2606.05087 2026-06-04 cs.CL

Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models

轻动词还是实义动词？用于探究语言模型短语能力的极小对比数据集

Francesca Franzon, Nicolas Rosàs Gómez, Leo Wanner

发表机构 * Universitat Pompeu Fabra (UPF)（庞培法布拉大学）

AI总结通过构建极小对比数据集，探究语言模型在轻动词与实义动词用法上的区分能力，发现模型能在最小上下文中区分这两种用法，并表现出跨宾语类型的可分离模式。

2606.05085 2026-06-04 cs.CL cs.AI

Automatic Generation of Titles for Research Papers Using Language Models

使用语言模型自动生成研究论文标题

Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay

发表机构 * Jadavpur University（贾达普大学）； Indian Association for the Cultivation of Science（印度科学培养协会）

AI总结提出利用预训练语言模型和大语言模型从摘要生成论文标题的方法，通过微调PEGASUS-large在多个数据集上取得最优性能。

Comments 24 pages, 24 tables, 1 figure

AI中文摘要

研究论文的标题以清晰简洁的方式传达其主要思想，有时也包括结论。选择合适的标题通常具有挑战性，自动标题生成可以帮助作者完成此任务。在这项工作中，我们提出了一种使用开放权重预训练模型和大语言模型从摘要生成论文标题的技术。我们使用了CSPubSum和LREC-COLING-2024数据集，并引入了一个新数据集SpringerSSAT，该数据集来自社会科学领域的四个Springer期刊。此外，我们使用GPT-3.5-turbo在零样本设置下生成标题。模型性能通过ROUGE、METEOR、MoverScore、BERTScore和SciBERTScore指标进行评估。我们的实验表明，微调的PEGASUS-large在大多数指标上优于其他模型，包括微调的LLaMA-3-8B和零样本GPT-3.5-turbo。我们进一步证明ChatGPT可以生成有创意的论文标题。总体而言，AI生成的标题通常是恰当且可靠的。

英文摘要

The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.
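Of the metrics listed, ROUGE is the simplest to illustrate. The sketch below computes a ROUGE-1 style F1 score from unigram overlap between a generated title and the reference title. It is a simplification under stated assumptions: it lowercases and splits on whitespace only, whereas standard ROUGE implementations typically add their own tokenization and optional stemming, so the numbers will not match an official scorer exactly.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "Automatic Title Generation with Language Models",
    "Automatic Generation of Titles for Research Papers Using Language Models",
)
```

Four of the six candidate tokens appear among the ten reference tokens, giving precision 4/6, recall 4/10, and an F1 of 0.5. Overlap metrics like this reward shared wording, which is why the authors also report embedding-based scores such as BERTScore.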

2606.05080 2026-06-04 cs.AI cs.LG

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

AutoLab：前沿模型能否解决长周期自动研究与工程任务？

Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon, Hao Zhang, Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland, Zichen Chen

发表机构 * MIT（麻省理工学院）； Stanford University（斯坦福大学）； University of California, Berkeley（加州大学伯克利分校）； University of California, Los Angeles（加州大学洛杉矶分校）； University of California, San Diego（加州大学圣地亚哥分校）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）； University of Michigan（密歇根大学）； National University of Singapore（新加坡国立大学）； University of Tokyo（东京大学）

AI总结本文提出AutoLab基准，通过36个专家策划的长周期闭环优化任务评估前沿模型，发现持续迭代和利用经验反馈比初始尝试质量更重要。

Comments Code: https://github.com/autolabhq/autolab ; Website: https://autolab.moe/

AI中文摘要

科学和工程进步本质上是一个长周期迭代过程：提出更改、运行实验、测量结果并不断改进工件。然而，现有的前沿模型基准主要评估单轮响应或短周期智能体轨迹，未能捕捉在长时间跨度内持续迭代改进的挑战。为了解决这一差距，我们引入了AutoLab，一个用于超长周期闭环优化的新基准。AutoLab包含36个现实且由专家策划的任务，涵盖四个不同领域：系统优化、谜题与挑战、模型开发和CUDA内核优化。每个任务从一个正确但故意次优的基线开始，并挑战智能体在严格的挂钟预算内改进它。评估17个最先进模型的结果表明，成功的主要预测因素不是智能体初始尝试的质量，而是其持续进行基准测试、编辑和整合经验反馈的毅力。虽然claude-opus-4.6表现出强大的长周期优化能力，但大多数前沿模型，包括几个专有模型，要么过早终止，要么在预算内进展甚微。这些结果强调了时间意识和持续迭代在自主智能体中的重要性。我们开源了完整的基准、评估框架和任务工件，以加速研究真正有能力的长周期智能体。

英文摘要

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
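The task structure described above, a correct but slow baseline improved under a wall-clock budget, can be sketched as a loop that keeps proposing, measuring, and retaining only measured gains. The function and parameter names are hypothetical and the loop is not the AutoLab harness; an injectable `clock` is included only so the behaviour can be checked deterministically.

```python
import time

def optimize_within_budget(baseline, propose, benchmark, budget_s,
                           clock=time.monotonic):
    """Iterate until the wall-clock budget is spent.

    propose(artifact) returns a modified artifact; benchmark(artifact) returns
    a score where higher is better. Returns the best artifact and its score.
    """
    deadline = clock() + budget_s
    best, best_score = baseline, benchmark(baseline)
    while clock() < deadline:
        candidate = propose(best)
        score = benchmark(candidate)      # empirical feedback
        if score > best_score:            # keep only measured improvements
            best, best_score = candidate, score
    return best, best_score

# Deterministic demo: a fake clock that advances one "second" per call.
ticks = iter(range(100))
result = optimize_within_budget(
    baseline=0,
    propose=lambda x: x + 1,
    benchmark=lambda x: x,
    budget_s=5,
    clock=lambda: next(ticks),
)
```

With the fake clock the loop body runs four times before the deadline, so `result` is `(4, 4)`. An agent that stops after its first attempt would leave most of that budget unused, which is the failure mode the abstract reports for many models.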

2606.05079 2026-06-04 cs.CL cs.LG

Fast & Faithful Function Vectors

Minh An Pham, Anton Segeler, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin, Patrick Kahardipraja, Reduan Achtibat

Affiliations: GitHub; arXiv

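AI summary: This study improves the efficiency and accuracy of function vectors (FVs) by optimizing attention-head selection and distributed steering with gradient-based Layer-wise Relevance Propagation (LRP), enabling fast and faithful steering of large language models (LLMs).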

2606.05073 2026-06-04 cs.LG

Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness

Lixing Zhang, Yidong Ouyang, Weifu Li, Shixiang Zhu, Guang Cheng, Liyan Xie

Affiliations: University of Science and Technology of China

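AI summary: Proposes Diff-Joint, a diffusion framework that jointly models tabular data and a latent missingness mask. It alternates between conditional sampling and uncertainty-aware aggregation to separate meaningful missingness from missingness that should be imputed, enabling selective imputation.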

Abstract

Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.
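To make the selective-imputation idea concrete, here is a minimal sketch of one possible aggregation step for a single missing entry. It is not Diff-Joint's actual rule: the diffusion sampler is replaced by a precomputed list of samples, and the majority-vote threshold, the function name `aggregate_entry`, and its parameters are all assumptions made for illustration.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def aggregate_entry(
    samples: Sequence[Tuple[bool, float]],
    keep_threshold: float = 0.5,
) -> Tuple[Optional[float], float]:
    """Aggregate repeated samples for one missing entry (hypothetical helper).

    Each sample is (is_meaningfully_missing, sampled_value).
    Returns (imputed value, or None if the entry is preserved as missing,
    fraction of samples that voted "meaningfully missing").
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Fraction of samples whose mask says the entry is meaningfully missing.
    p_meaningful = mean(1.0 if flag else 0.0 for flag, _ in samples)
    # Values from samples that treated the entry as recoverable.
    values = [value for flag, value in samples if not flag]
    if p_meaningful > keep_threshold or not values:
        return None, p_meaningful  # preserve the entry as missing
    return mean(values), p_meaningful  # impute with the average sampled value


if __name__ == "__main__":
    print(aggregate_entry([(False, 2.0), (False, 4.0), (True, 0.0), (False, 3.0)]))
    print(aggregate_entry([(True, 0.0), (True, 0.0), (False, 1.0)]))
```

In an iterative scheme like the one the abstract describes, the returned fraction could serve as an uncertainty signal for deciding which entries to resample in the next round; that use is likewise an assumption of this sketch.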
