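arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.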

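2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML New submission

Valid Inference with Synthetic Data via Task Exchangeability

Lezhi Tan, Tijana Zrnic

AI summary: Proposes a task-exchangeability condition that ensures valid statistical inference when synthetic data are used in scientific research, with applications to public opinion surveys and AI evaluation.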

Abstract

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.
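One standard way that exchangeability yields a finite-sample guarantee is a conformal-style quantile over past errors. The sketch below illustrates that generic idea only; the abstract does not say the paper uses this construction. The error values, the synthetic estimate, and the function name `conformal_radius` are all hypothetical.

```python
import math


def conformal_radius(historical_errors, alpha=0.1):
    """Return r such that, if the current task's error is exchangeable with
    the historical errors, the error is <= r with probability >= 1 - alpha."""
    n = len(historical_errors)
    # Rank of the order statistic used in split-conformal-style arguments.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few historical tasks for this alpha
    return sorted(historical_errors)[k - 1]


# Hypothetical |synthetic estimate - real estimate| on 9 historical tasks.
errors = [0.02, 0.05, 0.01, 0.04, 0.03, 0.06, 0.02, 0.07, 0.03]
radius = conformal_radius(errors, alpha=0.2)

synthetic_estimate = 0.55  # hypothetical estimate on the current task
interval = (synthetic_estimate - radius, synthetic_estimate + radius)
print(radius, interval)
```

With only a handful of historical tasks the radius becomes infinite for small `alpha`, which is the usual price of a distribution-free guarantee.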

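2606.13544 2026-06-12 eess.AS cs.AI cs.CL New submission

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI summary: Proposes ModeratorLM, a role-conditioned speech LLM that uses chunk-wise streaming and chained reasoning to achieve adaptive turn-taking in multi-party conversations, significantly improving turn-taking precision and recall.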

Comments Accepted for publication at Interspeech 2026

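2606.13450 2026-06-12 eess.AS cs.SD New submission

Endpoint Anticipation for Low-Latency Spoken Dialogue

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

AI summary: Proposes an endpoint anticipation method that achieves low latency by predicting the endpoint signal early and speculatively executing the LLM and TTS pipeline on partial context, reducing average latency by 505 ms.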

Comments Accepted at Interspeech 2026

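2606.13277 2026-06-12 stat.ML cs.LG New submission

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

AI summary: Proposes the ProtoX-AD framework, which uses prototype learning to make self-supervised time series anomaly detection explainable, providing semantically consistent explanations of anomaly characteristics while maintaining detection performance.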

Comments 26 pages, 8 figures

Abstract

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at https://github.com/Aitorzan3/ProtoX-AD.

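2606.13193 2026-06-12 eess.AS cs.PL cs.SD New submission

A Dual-Mode Faust-to-CLAP Compilation System

Facundo Franchino, Stéphane Letz, Jatin Chowdhury

AI summary: Proposes the faust2clap framework, which supports both a static compilation mode and a dynamic interpretation mode, and uses an address-based identity matching algorithm and a stable slot allocation scheme to preserve DSP parameter identity, enabling efficient compilation and hot reloading.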

Comments 4 pages, 4 figures, 1 algorithm. Presented at the International Faust Conference (IFC-26), Lyon, France, June 2026

Abstract

We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.
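The parameter identity problem can be illustrated with a small sketch. It assumes that each parameter is identified by a path-like address and that a slot, once assigned, is never reused; the abstract does not give the actual algorithm, so the `SlotTable` class, its fields, and the example addresses are hypothetical and not part of faust2clap.

```python
class SlotTable:
    """Maps parameter addresses (e.g. "/synth/cutoff") to stable slot ids."""

    def __init__(self):
        self.slot_of = {}   # address -> slot id, never reassigned
        self.value_of = {}  # address -> last known value

    def reload(self, params):
        """params: dict of address -> default value from the reloaded DSP."""
        for address, default in params.items():
            if address not in self.slot_of:
                # New parameter: take the next unused slot.
                self.slot_of[address] = len(self.slot_of)
                self.value_of[address] = default
            # Known address: keep its slot and current value untouched.
        # Addresses missing from params keep their slots reserved.
        return {a: (self.slot_of[a], self.value_of[a]) for a in params}


table = SlotTable()
table.reload({"/synth/cutoff": 1000.0, "/synth/gain": 0.5})
table.value_of["/synth/gain"] = 0.8  # the host automates the gain

# The DSP is edited: cutoff is removed, resonance is added.
view = table.reload({"/synth/gain": 0.5, "/synth/resonance": 0.1})
print(view)
```

In this sketch the gain keeps slot 1 and its automated value of 0.8 across the reload, while the new resonance parameter receives a fresh slot instead of inheriting the one that belonged to cutoff.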

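2606.13146 2026-06-12 stat.ML cs.LG stat.ME New submission

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Federico P. Cortese, Alessio Farcomeni

AI summary: Proposes a robust feature-weighted jump model that achieves robustness through the Tukey biweight loss and introduces state-specific feature weights; it outperforms competing methods in simulations and empirical applications.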

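2606.13109 2026-06-12 eess.AS cs.SD New submission

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki

AI summary: Proposes Close-to-Distant microphone Projection (C2D projection), which generates paired data from real recordings and realizes the projection with a parametric multichannel Wiener filter; neural networks trained on the resulting data outperform the existing GSS method for distant-speech enhancement.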

Journal ref: Proceedings of IEEE ICASSP 2026

Abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.

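2606.13095 2026-06-12 eess.AS cs.SD New submission

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

Naijun Zheng, Yuke Lin, Sanli Tian, Mengtian Li, Zhiwei Lin, Longshuai Xiao, Dandan Tu

AI summary: Proposes a dual-encoder architecture, a feature interleaving format, a length-aware speaker ID loss, and an adaptive-threshold ASR loss strategy to train an LLM-based system efficiently with limited real data while balancing ASR and diarization, achieving relative improvements of 18% and 24% on the AliMeeting and Aishell4 corpora, respectively.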

Comments Accepted in Interspeech 2026

Abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.

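2606.13017 2026-06-12 q-bio.NC cs.LG New submission

Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

AI summary: Using criticality features extracted with Detrended Fluctuation Analysis (DFA), this study identifies deep sleep (N3) with a Naive Bayes classifier at a balanced accuracy of 87.17%, providing an efficient sensing mechanism for state-dependent neurofeedback in passive brain-computer interfaces.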

Comments 7 pages, 3 figures, accepted for publication in the Proceedings of the 10th Graz Brain-Computer Interface Conference 2026, Graz, Austria, September 14-17, 2026

Abstract

Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed 347,232 EEG epochs from 290 older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal "state-sensing" engine for neurofeedback. Naive Bayes achieved the highest mean balanced accuracy (87.17% ± 0.24%), significantly outperforming a fully connected deep neural network (FNN: 81.58%) and Random Forest (80.97%). Linear models (LDA: 57.21%; SVM: 51.01%) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
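Balanced accuracy, the metric used to rank the classifiers above, is the mean of per-class recall, which matters when one class is much rarer than the other. The sketch below computes it in plain Python; the label vectors are hypothetical and do not come from the study's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 1 = N3 (deep sleep), 0 = any other stage.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10  # a classifier that never predicts N3

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The degenerate classifier looks good under plain accuracy but scores at chance level under balanced accuracy, which is why the latter suits a minority-class target.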

2606.12838 2026-06-12 q-bio.QM cs.AI cs.LG q-bio.GN New submission

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai

AI summary: Proposes OCOO-T, a minimalist flow-matching-based virtual cell model that uses continuous-time denoising and adaptive layer normalization to achieve state-of-the-art transcriptional perturbation prediction on multiple benchmarks.

Comments 22 pages, 6 figures

Abstract

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.
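A flow-matching objective of the kind the abstract mentions can be sketched as follows. This assumes the common linear interpolation path between a noise sample and a data sample; the abstract does not specify the path, the conditioning, or the network, so `velocity_fn` is a hypothetical stand-in for the Transformer and the two-gene vectors are made up.

```python
def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t.

    x0: noise sample, x1: data sample (e.g. an expression profile), 0 <= t <= 1.
    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


# Hypothetical two-gene example.
x0, x1 = [0.0, 0.0], [1.0, 2.0]


def oracle(x_t, t):
    return [1.0, 2.0]  # returns the true velocity x1 - x0


def zero(x_t, t):
    return [0.0, 0.0]  # predicts no motion


print(flow_matching_loss(oracle, x0, x1, 0.5))  # 0.0
print(flow_matching_loss(zero, x0, x1, 0.5))    # 2.5
```

At inference time a model trained this way would be integrated from noise at t = 0 to a sample at t = 1, which is the sense in which generation becomes a continuous-time denoising process.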

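2606.12654 2026-06-12 stat.ME cs.LG stat.ML New submission

Computationally tractable robust differentially private mean estimation

Kelly Ramsay

AI summary: Proposes a new differentially private mean estimator called the "balloon mean", which achieves computational tractability, robustness, and zero-concentrated differential privacy through iterative clipping on expanding Mahalanobis-distance balls, with theoretical guarantees on statistical performance and robustness under heavy-tailed and contaminated elliptical models.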

Comments 40 pages, 17 figures

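2606.12585 2026-06-12 econ.GN cs.HC q-fin.EC New submission

Revisiting the ABCs of Working with AI: A Replication with Radiologists

Daniel Martin

AI summary: In a setting where radiologists analyze chest X-rays, this study replicates the finding of Caplin et al. that ability and belief calibration shape the returns to AI assistance, supporting its external validity.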

Abstract

Artificial intelligence (AI) systems increasingly assist human experts, but the consequences of AI assistance on productivity can be heterogeneous. Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b) provide evidence that two characteristics, ability and belief calibration, help to determine the returns to AI assistance. This note shows that their results replicate to a setting where professional radiologists analyze chest X-rays with access to state-of-the-art machine learning predictions. I leverage the public Collab-CXR data repository described by Moehring, Kutwal, Huang, Banerjee, Jacobi, Eber, Mendoza, Chung, Dayan, Gupta, Bui, Truong, Pareek, Langlotz, Lungren, Agarwal, Rajpurkar, and Salz (2025) and first analyzed for human-AI collaboration by Agarwal, Moehring, Rajpurkar, and Salz (2023). To faithfully reproduce the analysis in Caplin, Deming, S. Li, Martin, Marx, Weidmann, and Ye (2025b), I use the radiologist assessments from the repeated-case designs, which include 68 radiologists and 11,420 paired radiologist-patient-pathology observations. The results of this replication support the external validity of their core findings: lower baseline ability and higher calibration predict larger incremental value from AI.

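2606.12471 2026-06-12 stat.ML cs.CL cs.ET cs.LG New submission

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

Seth Dobrin, Łukasz Chmiel

AI summary: Proposes the Physics-Grounded Symbolic Architecture (PGSA) and proves that it achieves exact linear identifiability and near-infinite temporal consistency in non-Gaussian dynamical systems, overcoming the Gaussian boundary that limits statistical world models.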

Comments Pre-print

Abstract

Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

2606.13681 2026-06-12 cs.CL New submission

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

Affiliations: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology

AI summary: Proposes EvoArena, a benchmark suite that simulates progressive environment changes across terminal, software, and social domains, and EvoMem, a patch-based memory paradigm that records structured update histories so agents can reason about environmental evolution through changes in memory; experiments show that current agents struggle in dynamic environments and that EvoMem consistently improves performance.

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
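The difference between per-task accuracy and chain-level accuracy can be made concrete with a short sketch. It assumes that a chain counts as solved only when every subtask in it succeeds, which is how the abstract describes the metric; the outcome data are hypothetical.

```python
def task_and_chain_accuracy(chains):
    """chains: list of lists of booleans, one inner list per chain of subtasks."""
    outcomes = [ok for chain in chains for ok in chain]
    task_acc = sum(outcomes) / len(outcomes)
    # A chain is solved only if all of its subtasks are solved.
    chain_acc = sum(all(chain) for chain in chains) / len(chains)
    return task_acc, chain_acc


# Hypothetical results: four chains of three evolving subtasks each.
chains = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, True, True],
]
print(task_and_chain_accuracy(chains))  # (0.75, 0.25)
```

A single failure anywhere in a chain forfeits the whole chain, so chain-level accuracy can sit far below per-task accuracy.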

2606.13680 2026-06-12 cs.CL cs.AI New submission

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez

Affiliations: Meta Superintelligence Labs; Rice University

AI summary: Proposes the RA-RFT framework, which trains a retriever via gold-relevance distillation and combines it with reinforcement fine-tuning on analogous reasoning traces to improve mathematical reasoning.

Abstract

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.
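The average@32 figure quoted above is, under the usual reading, the fraction of correct answers among k sampled solutions per problem, averaged over problems. The sketch below assumes that reading; the abstract does not define the metric, and the sample outcomes (with k = 4 for brevity) are hypothetical.

```python
def average_at_k(correct):
    """correct: per-problem lists of k booleans, one per sampled solution."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)


# Hypothetical outcomes for two problems with k = 4 samples each.
samples = [
    [True, True, False, False],   # 2 of 4 correct
    [True, False, False, False],  # 1 of 4 correct
]
print(average_at_k(samples))  # 0.375
```

Unlike pass@k, which rewards a problem if any sample is correct, this average penalizes inconsistent sampling.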

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG New submission

Mana: Dexterous Manipulation of Articulated Tools

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

Affiliations: UC Berkeley; CMU; Stanford University; Amazon FAR

AI summary: Proposes the Mana framework, which reinterprets dexterous manipulation as an animation problem and uses a coarse-to-fine pipeline to generate manipulation trajectories automatically, achieving zero-shot sim-to-real transfer for articulated tools.

Comments Project Page: https://zhaohengyin.github.io/mana

Abstract

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2606.13676 2026-06-12 cs.CV New submission

Modality Forcing for Scalable Spatial Generation

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

Affiliations: Carnegie Mellon University; World Labs

AI summary: Proposes Modality Forcing, which assigns a separate noise level to each modality to enable joint image-depth generation with a single DiT trained on sparse depth data; it inherits the scalability of T2I pre-training and achieves competitive depth estimation.

Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/
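AbsRel, the depth metric cited above, is the mean absolute relative error between predicted and ground-truth depth. The sketch below evaluates it only where ground truth exists, which is how the metric is typically applied to sparse depth; the depth values and the use of `None` as the missing-pixel marker are illustrative assumptions.

```python
def abs_rel(pred, gt):
    """Mean |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g is not None and g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Hypothetical flattened depth maps in metres.
pred = [2.0, 4.5, 9.0, 3.0]
gt = [2.0, 5.0, 10.0, None]  # None marks a pixel with no depth measurement

print(abs_rel(pred, gt))  # about 0.0667: relative errors 0, 0.1, 0.1 averaged
```

Because invalid pixels are skipped rather than penalized, the same function works for dense and sparse ground truth.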

2606.13673 2026-06-12 cs.CV cs.AI New submission

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

Affiliations: KAIST; NVIDIA

AI summary: Proposes SpatialClaw, a framework that uses code as the action interface: a stateful Python kernel with perception and geometry primitives lets a VLM agent execute step by step and flexibly compose intermediate results, reaching 59.9% average accuracy on 20 3D/4D spatial reasoning benchmarks, 11.2 points above the prior method.

Comments Project page: https://spatialclaw.github.io/

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
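The idea of a stateful kernel that runs one cell per step can be sketched with Python's built-in `exec` and a persistent namespace. This is a toy illustration under stated assumptions, not SpatialClaw's implementation: the `StatefulKernel` class, the `distance` primitive, and the object coordinates are hypothetical, and a real system would sandbox the executed code.

```python
import contextlib
import io


class StatefulKernel:
    """Runs code cells one at a time; variables persist between cells."""

    def __init__(self, preloaded):
        self.namespace = dict(preloaded)  # shared by every cell

    def run_cell(self, source):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)  # unsafe for untrusted code
        return buffer.getvalue()  # text observation returned to the agent


def distance(a, b):
    """Hypothetical geometry primitive: Euclidean distance between 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


kernel = StatefulKernel({"distance": distance})

# Step 1: the agent records two object positions and inspects one.
out1 = kernel.run_cell("cup = (0.0, 0.0, 1.0)\nlaptop = (3.0, 4.0, 1.0)\nprint(cup)")
# Step 2: a later cell reuses the variables defined in step 1.
out2 = kernel.run_cell("print(distance(cup, laptop))")
print(out1, out2)
```

Because each cell sees the outputs of earlier ones, the agent can decide its next operation after observing an intermediate result, in contrast to single-pass execution.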

2606.13672 2026-06-12 cs.RO New submission

$\texttt{WEAVER}$, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

$\texttt{WEAVER}$：更好、更快、更长——一种有效的机器人操作世界模型

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

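Affiliations * Mila - Québec AI Institute; Université de Montréal; Carnegie Mellon University; McGill University

AI summary: Proposes WEAVER, a world model architecture that trains multi-view latent prediction with a flow-matching loss and achieves high fidelity, long-horizon consistency, and efficient inference at the same time, substantially improving policy evaluation, policy improvement, and test-time planning on robotic manipulation tasks.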

AI中文摘要

世界模型（即学习型模拟器）对机器人技术的潜在影响深远——包括策略评估、策略改进和测试时规划——所有这些都只需有限的真实世界交互。为了解锁这些下游能力，世界模型需要同时满足三个期望：（i）保真度（即产生与现实相关的模拟轨迹），（ii）一致性（即产生在长时域上连贯的模拟轨迹），以及（iii）效率（即快速产生模拟轨迹）。我们提出$\texttt{WEAVER}$（面向具身推理的多视图世界估计）：一种同时实现所有三个期望的世界模型架构，在机器人操作任务上提供了最先进的结果。$\texttt{WEAVER}$是一个多视图世界模型，通过流匹配损失训练以预测未来潜在状态和奖励值。我们提炼了模型架构、记忆和预测目标方面的关键设计决策，以解锁那些困扰先前世界建模方法的长时间动态操作任务。我们将$\texttt{WEAVER}$应用于机器人硬件，展示了其在策略评估（与真实世界成功率的相关系数$\rho=0.870$）、策略改进（在$\pi_{0.5}$机器人基础模型上真实世界成功率提升$38\%$）和测试时规划（真实世界成功率提升$14\%$，且比先前世界模型快$5-10$倍）方面的有效性。$\texttt{WEAVER}$在分布外场景评估中也表现出优于先前世界模型的性能。代码、模型和视频见：this https URL。

英文摘要

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: $\textit{(i)}$ fidelity (i.e., producing simulated trajectories that correlate with reality), $\textit{(ii)}$ consistency (i.e., producing simulated trajectories that are coherent over long horizons), and $\textit{(iii)}$ efficiency (i.e., producing simulated trajectories quickly). We propose $\texttt{WEAVER}$ (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. $\texttt{WEAVER}$ is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply $\texttt{WEAVER}$ in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho=0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of $38\%$ on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of $14\%$ with a $5-10\times$ speedup over prior WMs). $\texttt{WEAVER}$ also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/ .

2606.13671 2026-06-12 cs.LG New submission

Understanding Truncated Positional Encodings for Graph Neural Networks

理解图神经网络的截断位置编码

James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri

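Affiliations * University of California, Berkeley

AI summary: Studies how truncated positional encodings (such as the first k eigenspaces or powers of the adjacency matrix) affect the expressive power of graph neural networks. The theory shows that several families of positional encodings differ fundamentally in expressive power once truncated and that truncated spectral encodings are no longer stronger than the 1-WL test; experiments show that a mix of truncated encodings beats any single family.

Comments 28 pages, 4 figures, ICML 2026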

AI中文摘要

位置编码（PEs）在理论和经验上增强了图神经网络（GNNs）的能力。两个最流行的PE家族——谱（例如，拉普拉斯特征空间、有效电阻）和基于游走的（邻接矩阵的多项式）——在表达能力上理论等价，其表达性介于1-WL和3-WL测试之间。然而，这种等价性假设GNN使用这些PE的“完整”版本，这需要$O(n^3)$的时间和空间复杂度。相反，从业者通常使用这些编码的截断变体，例如前$k$个特征空间或邻接矩阵的幂。然而，这些截断PE的理论性质尚不清楚。在这项工作中，我们启动了对这些截断PE的研究。理论上，我们表明，在截断下，几个PE家族在表达能力上存在根本差异。作为推论，我们证明截断谱PE不再强于1-WL测试。我们还研究了一个谱PE家族——$k$-调和距离——以突出即使密切相关的截断PE在表达能力上的差异。最后，我们通过实验表明，在真实世界数据集上，混合截断PE优于任何单一家族。

英文摘要

Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the "complete" version of these PEs, which requires $O(n^3)$ time and space complexity. Instead, practitioners commonly use truncated variants of these encodings, such as the first $k$ eigenspaces or powers of the adjacency matrix. However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
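To make the effect of truncation concrete, here is a small sketch of one walk-based encoding: per-node closed-walk counts, i.e. the diagonals of the first $k$ powers of the adjacency matrix. This particular feature choice and the helper names (`truncated_walk_pe`, `same_encoding_multiset`) are assumptions for illustration, not the paper's definitions. The two graphs are the standard pair that the 1-WL test cannot separate: a 6-cycle and two disjoint triangles.

```python
def adjacency(n, edges):
    """Dense adjacency matrix of an undirected simple graph."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1
    return a


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)] for i in range(n)]


def truncated_walk_pe(adj, k):
    """Per-node features: closed-walk counts diag(A^1), ..., diag(A^k)."""
    n = len(adj)
    power = adj
    features = [[] for _ in range(n)]
    for _ in range(k):
        for i in range(n):
            features[i].append(power[i][i])
        power = mat_mul(power, adj)
    return features


def same_encoding_multiset(g, h, k):
    """True when the two graphs have identical multisets of node encodings."""
    return sorted(truncated_walk_pe(g, k)) == sorted(truncated_walk_pe(h, k))


hexagon = adjacency(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
two_triangles = adjacency(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

indistinguishable_at_k2 = same_encoding_multiset(hexagon, two_triangles, 2)
indistinguishable_at_k3 = same_encoding_multiset(hexagon, two_triangles, 3)
```

With $k = 2$ every node in both graphs gets the encoding `[0, 2]`, so the graphs look identical; at $k = 3$ the triangle nodes pick up two closed walks of length 3 and the graphs separate. The truncation level therefore decides what the encoding can distinguish.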

2606.13670 2026-06-12 cs.AI New submission

Automated reproducibility assessments in the social and behavioral sciences using large language models

使用大型语言模型自动评估社会与行为科学的可重复性

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

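Affiliations * LMU Munich; Munich Center for Machine Learning; University of Cologne

AI summary: Uses large language models (LLMs) to automate reproducibility assessments in the social and behavioral sciences. Of 76 studies, the LLM produced no viable effect size estimate for 7; among the rest, it recovered the original effect size in 41% of studies and reached the same qualitative conclusion as the original study in 96% of cases, compared with 34% and 74% for human reanalysts.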

AI中文摘要

社会与行为科学的可重复性通常由独立研究人员重新分析原始数据来评估，以判断已发表的研究结果是否可复现。然而，这种方法资源密集且难以规模化。在此，我们展示了大型语言模型（LLMs）可以自动化可重复性评估。利用N=76项来自行为与社会科学、具有预定义声明的研究，我们比较了LLM生成的分析与原始结果和人类再分析。对于7项研究，LLM无法产生可行的效应量估计。对于其余研究，我们的LLM流程在41%的研究中恢复了原始效应量（Cohen's d的容忍度为+/-0.05）。此外，我们的LLM流程在96%的案例中得出了与原始研究相同的定性结论，其中结论指示再分析是否支持原始声明。相比之下，人类再分析者在34%的研究中恢复了原始效应量，并在74%的案例中得出了相同的定性结论。这些结果共同表明，LLMs可以作为自动化可重复性评估的可扩展工具，并为社会与行为科学中实证结果的系统审计提供基础。

英文摘要

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
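The recovery criterion (a reanalysis counts as a match when its Cohen's d lies within +/-0.05 of the original) can be sketched as follows. The pooled-standard-deviation formula for two independent groups is the textbook definition of Cohen's d; whether the studies in the paper all use this variant is an assumption, and the function names and sample values are illustrative only.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def recovers_original(original_d, reanalysis_d, tolerance=0.05):
    """True when the reanalysis effect size is within the tolerance of the original."""
    return abs(original_d - reanalysis_d) <= tolerance


# Made-up scores for two conditions: means 4 and 3, pooled SD 2.
d = cohens_d([2, 4, 6], [1, 3, 5])

close_enough = recovers_original(0.50, 0.54)
too_far = recovers_original(0.50, 0.60)
```

A tolerance on d is a stricter test than agreement on the qualitative conclusion, which fits the gap the abstract reports between the two rates (41% versus 96%).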

2606.13669 2026-06-12 cs.AI New submission

Agents-K1: Towards Agent-native Knowledge Orchestration

Agents-K1：迈向智能体原生的知识编排

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, Lei Bai

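Affiliations * PJLab (Shanghai Artificial Intelligence Laboratory)

AI summary: Proposes Agents-K1, a pipeline that converts raw documents into agent-native scientific knowledge graphs. It combines a multimodal parser, a 4B information-extraction backbone trained with GRPO, and a tri-source agent interface to support scientific information extraction, knowledge graph construction, and multi-hop reasoning.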

AI中文摘要

当前基于LLM的研究智能体通过智能体编排取得了进展，但在很大程度上忽视了科学知识编排。现有工作通常将论文简化为摘要、表面提及和扁平化的\texttt{cites}边，忽略了科学推理所必需的关键实体、主张、证据、机制和方法谱系。为此，我们引入了\textbf{Agents-K1}，一个端到端的知识编排管道，将原始文档转换为智能体原生的科学知识图谱。Agents-K1在统一的理论基础下整合了三个组件：一个多模态解析器，其五模块模式捕获整个论文中的实体、多模态证据、引用和类型化实体间关系，而非仅摘要；一个基于GRPO在规则奖励下训练的4B信息抽取骨干；以及一个graphanything CLI，一个统一了网络搜索、多模态图检索和跨文档遍历的三源智能体接口。在此基础上，我们处理了六个学科的246万篇科学论文，生成了\textbf{Scholar-KG}，并发布了其中100万篇论文的子集，完整Scholar-KG可通过下方SCP链接访问。同一管道可扩展到通用领域语料库和符合模式的数据合成。大量实验表明，Agents-K1在科学信息抽取、知识图谱构建和多跳科学推理方面取得了优越性能。

英文摘要

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat \texttt{cites} edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce \textbf{Agents-K1}, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce \textbf{Scholar-KG}, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

2606.13668 2026-06-12 cs.CL New submission

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Influcoder：将解码器的梯度影响排名蒸馏到编码器用于数据归因

Dimitri Kachler, Damien Sileo, Pascal Denis

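Affiliations * Centre Inria de l’Université de Lille, CRIStAL, Université de Lille

AI summary: Influence-function methods for attributing large language model outputs to training data are expensive in compute and storage. Influcoder distills the gradient-influence rankings of decoders into an encoder, enabling fast, cost-efficient data attribution at scale.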

Comments 8 pages, 2 figures

2606.13663 2026-06-12 cs.CL New submission

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

HyperTool：超越逐步工具调用的工具增强型智能体

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

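Affiliations * Shanghai Jiao Tong University; IQuest Research; Beijing University of Aeronautics and Astronautics

AI summary: Step-wise tool calls in tool-augmented LLMs create an execution-granularity mismatch. HyperTool is a unified executable interface that folds deterministic tool subroutines into a single call, substantially improving accuracy on multi-step tool tasks.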

AI中文摘要

工具增强型LLM智能体通常依赖逐步的原子工具调用，其中每次调用、观察和值传递都暴露在主推理轨迹中。这造成了执行粒度不匹配：局部确定性的工具工作流被展开为重复的模型可见决策，消耗上下文并迫使模型管理轨迹中的低级数据流。我们引入HyperTool，一个统一的可执行MCP风格工具接口，改变了模型可见的工具执行单元。模型调用HyperTool时使用一个代码块，该代码块可以通过原始模式调用现有工具、操作返回值并在本地传递中间结果，将确定性工具子程序折叠为单个外部调用。为了训练模型使用此接口，我们从跨工具组合任务中合成HyperTool格式的轨迹，并在真实MCP环境中验证。在MCP-Universe上，HyperTool将Qwen3-32B的平均准确率从15.69%提升至35.29%，Qwen3-8B从9.93%提升至33.33%，并在平均准确率上超越GPT-OSS和Kimi-k2.5，表明我们的HyperTool能显著改进多步工具使用。

英文摘要

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an \emph{execution-granularity mismatch}: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce \textbf{HyperTool}, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69\% to 35.29\% on Qwen3-32B and from 9.93\% to 33.33\% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that our HyperTool can substantially improve multi-step tool use.
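The idea of folding a deterministic tool subroutine into a single outer call can be sketched with a toy executor. Nothing here is the HyperTool or MCP API: the three tools, their canned return values, and `run_code_block` are hypothetical, chosen only to show intermediate values being passed locally instead of through the model's reasoning trace.

```python
call_log = []  # records inner tool calls; the model never sees these one by one


def search_city(name):
    call_log.append("search_city")
    return {"Paris": {"lat": 48.86, "lon": 2.35}}[name]


def get_temperature(lat, lon):
    call_log.append("get_temperature")
    return 21.0  # canned reading in Celsius, for illustration only


def to_fahrenheit(celsius):
    call_log.append("to_fahrenheit")
    return celsius * 9 / 5 + 32


TOOLS = {
    "search_city": search_city,
    "get_temperature": get_temperature,
    "to_fahrenheit": to_fahrenheit,
}


def run_code_block(code, tools):
    """Run one model-written block with the tools in scope; return its `result`."""
    scope = dict(tools)
    exec(code, scope)
    return scope.get("result")


# One outer call replaces three step-wise calls and two value hand-offs.
block = """
coords = search_city("Paris")
celsius = get_temperature(coords["lat"], coords["lon"])
result = to_fahrenheit(celsius)
"""
outer_result = run_code_block(block, TOOLS)
```

In a step-wise interface the coordinates and the Celsius reading would each come back as an observation and be copied into the next call by the model; here only `outer_result` returns to the trace.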

2606.13659 2026-06-12 cs.PL cs.AR New submission

Specifying Hardware Communication as Programs

将硬件通信规范为程序

Ernest Ng, Nikil Shyamsunder, Francis Pham, Adrian Sampson, Kevin Laeufer

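AI summary: Proposes a DSL that specifies hardware communication protocols as concise imperative programs. The same specification can both drive a design and monitor transactions, and it can automatically infer transaction-level traces from waveforms.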

2606.13658 2026-06-12 cs.AI New submission

Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization

在你思考之前：系统0、AI中介认知与认知殖民化

Marianna Bergamaschi Ganapini, Massimo Chiriatti, Enrico Panai, Giuseppe Riva

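AI summary: Compares three frameworks of AI cognition, argues that System 0 occupies a distinct theoretical position, and introduces the concept of "cognitive colonization": AI systems can embed external interests into the architecture of the self, exerting an influence that is hard to detect.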

2606.13652 2026-06-12 cs.CV cs.GR New submission

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

世界追踪：超越可见表面的生成式像素对齐几何

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

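Affiliations * World Labs; University of Illinois Urbana-Champaign

AI summary: Proposes World Tracing, a generative pixel-aligned geometry representation in which a diffusion transformer predicts an ordered stack of points, reconstructing the visible surface and generating occluded geometry together. It outperforms depth predictors and image-to-3D methods on multiple benchmarks.

Comments World Labs Technical Report; Page: https://haoz19.github.io/world-tracing-page/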

AI中文摘要

图像到3D方法常常在忠实度和完整性之间权衡：深度估计器锚定于输入像素但止于可见表面，而图像到3D模型生成完整形状却往往与输入不对齐。我们引入世界追踪（World Tracing），一种生成式像素对齐几何表示，它预测与观测像素对齐的3D点，同时完成可见表面之外的几何。对于每个输入像素，世界追踪预测一个有序的相机空间3D点栈，其中第一层表示可见表面，后续层表示与遮挡表面的从前到后交点。我们通过一个世界追踪扩散Transformer（WT-DiT）实例化该表示，该模型将多个几何层视为独立的去噪令牌，并通过分解注意力和全局注意力耦合。WT-DiT使用像素空间流匹配和混合噪声调度进行训练，平衡可见表面重建与遮挡几何生成。世界追踪在物体、场景和动态基准上，在可见表面重建和完整几何生成方面均取得了强劲性能，超越了深度预测器和图像到3D生成器。它还保留了2D到3D对应关系，实现了文本驱动的3D场景编辑、几何条件的新视角视频合成，以及与纹理网格生成器的无训练集成。

英文摘要

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

2606.13649 2026-06-12 cs.CL cs.LG New submission

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Operadic一致性：LLM中组合推理失败的无标签信号

Nathaniel Bottman, Yinhong Liu, Kyle Richardson

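Affiliations * Incubilate; University of Cambridge; Allen Institute for Artificial Intelligence

AI summary: Proposes operadic consistency (OC), a label-free signal for detecting compositional reasoning failures in large language models. On four multi-hop QA datasets, OC correlates strongly with accuracy (Pearson r ≥ 0.86) and outperforms methods such as self-consistency.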

AI中文摘要

在推理时检测LLM推理失败而无需真实标签，催生了广泛的置信度基线，包括自一致性、语义熵和P(True)，这些方法基于问题内采样和自我评估。Operad理论，即通过迭代替换构建系统的形式化方法，提出了一种补充性诊断：模型对组合查询的直接回答应与通过组合同一查询的分解陈述所产生的回答一致。我们将这一思想实例化为Operadic一致性（OC），一个每问题信号。在四个多跳QA数据集上的十二个指令微调LLM（4B到671B参数，开源和闭源）上，OC与每个数据集上的准确率强相关（Pearson r ∈ [0.86, 0.94]，所有p ≤ 0.0004），并且是我们评估的所有信号中唯一在所有四个数据集上均匀达到r ≥ 0.85的信号。思维链自一致性（CoT-SC；Wang等人，2023）在HotpotQA和DROP上与OC匹配（r = 0.93, 0.87），但在MuSiQue和StrategyQA上降至r ≈ 0.45。在每问题层面，OC在每个数据集上提供了超出CoT-SC和语义熵的信息（OC系数的聚类稳健p ≤ 10^{-16}），并且该结论在额外控制构造的分解感知基线时依然稳健（p ≤ 10^{-13}）。相同的信号在等成本K = 3预算下，相对于调优的CoT-SC基线产生了选择性预测改进（固定覆盖率下的准确率提升）（AUARC提升+0.086至+0.096，AUROC提升+0.092至+0.164；95%置信区间在每个单元上排除零）。在五个前沿思维模型上，其中分解从模型自身的思维链中提取，相同的等成本比较在所有测试的16个（数据集、预算、指标）单元上给出了正的选择性预测点估计提升，其中12个单元的95%置信区间排除零。

英文摘要

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
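The consistency check described here (the direct answer to a compositional query should agree with the answer obtained by composing a stated decomposition) can be sketched in a few lines. The abstract does not say how agreement is scored, so exact match after normalization is an assumption; the `ask` callable, the `{prev}` template convention, and the canned question-answer table are hypothetical and replace a real model.

```python
def normalize(answer):
    """Crude canonical form so that 'Paris' and ' paris ' count as the same answer."""
    return answer.strip().lower()


def operadic_consistency(ask, question, decomposition):
    """Compare the direct answer with the answer composed step by step.

    `decomposition` is a list of sub-question templates; `{prev}` in a template
    is filled with the answer to the previous sub-question.
    """
    direct = ask(question)
    composed = None
    for template in decomposition:
        composed = ask(template.format(prev=composed))
    return normalize(direct) == normalize(composed)


# Canned answers standing in for a model; no real LLM is called.
CANNED = {
    "What is the capital of the country where Ada lives?": "Paris",
    "What is the capital of the country where Bo lives?": "Lyon",
    "Where does Ada live?": "France",
    "Where does Bo live?": "France",
    "What is the capital of France?": " paris ",
}
ask = CANNED.__getitem__

ada_consistent = operadic_consistency(
    ask,
    "What is the capital of the country where Ada lives?",
    ["Where does Ada live?", "What is the capital of {prev}?"],
)
bo_consistent = operadic_consistency(
    ask,
    "What is the capital of the country where Bo lives?",
    ["Where does Bo live?", "What is the capital of {prev}?"],
)
```

The check needs no ground-truth label: it flags the second question only because the two routes disagree, without knowing which of them is correct.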

2606.13647 2026-06-12 cs.CL cs.AI cs.LG New submission

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

SkMTEB：斯洛伐克大规模文本嵌入基准与模型适配

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

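Affiliations * Comenius University in Bratislava; Cisco Systems; Technical University of Košice; Kempelen Institute of Intelligent Technologies

AI summary: Builds SkMTEB, the first MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language (31 datasets across 7 task types), and develops the efficient, locally deployable models e5-sk-small and e5-sk-large. With vocabulary trimming and fine-tuning, the models stay competitive with commercial APIs despite size reductions of up to 62%.

Comments ACL 2026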

AI中文摘要

我们介绍了SkMTEB，这是首个针对斯洛伐克语（一种低资源西斯拉夫语）的全面MTEB风格文本嵌入基准，包含31个数据集，覆盖7种任务类型，几乎是现有斯洛伐克语多语言基准覆盖深度的4倍。我们对31个嵌入模型的评估表明，大型指令调优多语言模型表现最强，而现有的针对NLU任务训练的斯洛伐克语特定模型在嵌入任务上迁移效果不佳。为了满足高效、可本地部署的斯洛伐克语嵌入需求，我们通过对多语言E5模型进行词汇裁剪和微调，开发了\texttt{e5-sk-small}（45M参数）和\texttt{e5-sk-large}（365M）模型。尽管模型尺寸缩小了高达62%，我们的开源模型在性能上与专有API相当，同时仍可本地部署用于语义搜索和检索增强生成（RAG）。我们公开了基准、模型、数据集和代码，希望我们的方法能为其他资源匮乏的语言提供可复现的路径。

英文摘要

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

2606.13644 2026-06-12 cs.CV New submission

Surflo: Consistent 3D Surface Flow Model with Global State

Surflo：具有全局状态的一致3D表面流模型

Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa

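Affiliations * LIX, École polytechnique; Kyoto University; Kyutai; UC Berkeley

AI summary: Proposes Surflo, which compresses a variable number of unposed RGB views into a global latent and uses flow matching to transport 3D surface points independently from noise, giving consistent surface reconstruction at arbitrary resolution. At inference time, guidance from a photometric gradient suppresses local inconsistencies.

Comments Project webpage: https://anttwo.github.io/surflo/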

AI中文摘要

几何形状对视角具有不变性，这使得任何图像集合都是单个3D状态的冗余编码。现有的前馈重建模型未能充分利用这一点：逐视角方法会生成重叠且未对齐的点云，其数量随输入数量线性增长；而全局潜在方法则局限于固定的低分辨率输出。我们提出Surflo，它将可变数量的无位姿RGB视图压缩为K个潜在令牌（一个全局状态），并通过流匹配将带方向的3D表面点从噪声独立传输到表面上进行解码。这使得输出不受任何固定网格或令牌预算的限制：相同的潜在变量在单次前向传播中即可生成从几千到一百万个点。为了抑制独立逐点解码固有的局部不一致性，我们在ODE积分过程中注入光度梯度，通过推理时的引导项关联邻近点。Surflo在表面指标上匹配或超越前馈基线，运行速度比需要数百个视图的基于优化的方法快一个数量级，并且是唯一结合全局潜在变量与任意分辨率解码的前馈方法。

英文摘要

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.
