arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.14530 2026-06-15 cs.LG New submission

Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

Carlo Di Cicco

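Affiliations: Independent researcher

AI summary: Using residualization, this paper finds that code correctness is linearly decodable from the pre-generation hidden states of Qwen3-4B-Instruct (AUC 0.931), whereas the directional signal for repair success vanishes once repair-context covariates are controlled for, so the same method yields one positive and one negative result.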

Comments 12 pages, 8 tables. Code, data, and analysis scripts available at https://github.com/CarloDiCicco/ReasoningLab

Abstract

Large language models encode rich information in their hidden states. This work asks whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, before it generates and as it repairs a failed attempt, studied on 444 LiveCodeBench tasks. It reports two findings connected by a single confound-control tool: residualization. First, the correctness of the model's first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931 +/- 0.008 across 50 outer splits. After the linear effect of prompt length is removed from each hidden state dimension, the probe still reaches 0.911 +/- 0.010, well above a prompt-length baseline of 0.754 +/- 0.014. Second, on 236 cleaned cases where the model attempts to repair a failed first attempt, the hidden state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant on both a magnitude and a split-half test against label-shuffled nulls. This direction does not survive a conditional residualization against repair-context covariates that differ between successful and failed repairs, marking it as a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The probe layer is selected by nested cross-validation, and the same residualization approach that upholds the pre-generation correctness result overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.
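
The residualization step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the helper names `fit_residualizer` and `apply_residualizer` are hypothetical, ordinary least squares with an intercept stands in for whatever regression the author used, and the coefficients are meant to be fitted on training rows only so that held-out rows do not influence the fit.

```python
import numpy as np


def _design(covariates):
    """Stack an intercept column with the covariate column(s)."""
    cov = np.asarray(covariates, dtype=float)
    if cov.ndim == 1:
        cov = cov[:, None]
    return np.column_stack([np.ones(cov.shape[0]), cov])


def fit_residualizer(hidden, covariates):
    """Least-squares coefficients mapping covariates to each hidden dimension.

    hidden: (n_samples, n_dims) hidden states.
    covariates: (n_samples,) or (n_samples, n_covariates), e.g. prompt length.
    Hypothetical helper; fit it on the training split only.
    """
    hidden = np.asarray(hidden, dtype=float)
    beta, *_ = np.linalg.lstsq(_design(covariates), hidden, rcond=None)
    return beta


def apply_residualizer(hidden, covariates, beta):
    """Subtract the fitted linear covariate effect from every dimension."""
    return np.asarray(hidden, dtype=float) - _design(covariates) @ beta
```

A linear probe would then be trained on the residualized training states and evaluated on held-out states residualized with the same `beta`; if the probe still scores well above the covariate-only baseline, the signal is not explained by that covariate alone.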

2606.14528 2026-06-15 cs.CL eess.AS New submission

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

Qingkai Fang, Shoutao Guo, Yang Feng

Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes BayLing-Duplex, a native full-duplex speech language model in which a single autoregressive LLM decides when to listen, speak, and stop, with no external VAD module and only a few added special tokens; with a small amount of fine-tuning data it reaches high interaction success rates and improves response quality.

Comments Code: https://github.com/BayLing-Models/BayLing-Duplex

Abstract

Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while improving the speech-response score from 2.17 to 3.39 over Moshi. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.

2606.14518 2026-06-15 cs.LG New submission

Behavioral Audit of Machine Unlearning Has a Privacy Cost

Liou Tang, James Joshi, Ashish Kundu

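Affiliations: University of Pittsburgh; Cisco

AI summary: Proves that, when the model owner and the auditor distrust each other, an audit scheme relying solely on behavioral queries to the model cannot identify insufficiently unlearned models without leaking membership information about the retained set, revealing an inherent privacy-audit tradeoff.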

Abstract

The removal of learned data from Machine Learning models through Machine Unlearning (MU) has been widely studied; however, there has yet to be an agreed-upon scheme for auditing MU. Existing work has shown that a dishonest model owner can falsify evidence to avoid executing MU, while curious auditors (and adversaries) can infer the privacy-sensitive properties of the model and its training data even with limited access. Yet auditing of MU under mutual distrust between the model owner and the auditor remains unexplored. We provide an information-theoretic proof for this scenario: for convex ML models, a generic audit scheme that relies solely on querying the model for behavioral signals cannot identify insufficiently unlearned models without revealing membership information of the retained set. Therefore, auditing MU under the assumption of a dishonest model owner and an honest-but-curious auditor faces an inherent privacy-audit tradeoff. Our empirical results on convex models strongly support this result, while further experiments demonstrate that this privacy-audit tension persists in non-convex models. Our results call for a more careful consideration of the privacy-audit tension under a realistic auditor threat model, and serve as a foundation for more scrutiny of designs of privacy-preserving audit schemes for the MU pipeline. We also release our code implementation at https://github.com/LiouTang/Behavioral-Unlearn-Audit.

2606.14516 2026-06-15 cs.AI cs.CL cs.CY New submission

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

Affiliations: Technical University Munich; Munich Center for Machine Learning; Weizenbaum Institute; Zuse Institute Berlin; Evidence Prime; Trustible; Kitware; ETH Zurich; StickFlux Labs; Stanford University; Northeastern University; IBM Research; Comenius University Bratislava; Cisco; University of Notre Dame; Hebrew University of Jerusalem; University of Oxford; Ohio University; Writer; TCS Research; Oxford University Press; Queen Mary University of London; Technical University Berlin; University of Delaware; Cinemo; Johns Hopkins University; University of Copenhagen; ELLIS; Iowa State University; Meta FAIR; University of Montreal; Mila Quebec AI Institute; EleutherAI; Yale University; Hugging Face; University of Edinburgh; Harvard University; ETH AI Center; MIT; MIT-IBM Watson Lab

AI summary: To address the inconsistent, hard-to-compare formats of AI evaluation results, proposes the first shared schema and community-crowdsourced repository, unifying results across evaluation frameworks through a standardized representation, automatic converters, and a community database.

Abstract

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

2606.14512 2026-06-15 cs.CL cs.AI New submission

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Michael Goodale, Salvador Mascarenhas

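Affiliations: Institut Jean Nicod, Département d’études cognitives ENS, EHESS, CNRS, PSL University

AI summary: Shows experimentally that Lake and Baroni's meta-learning for compositionality model performs poorly on both out-of-distribution and within-distribution problems, and so does not meet Fodor and Pylyshyn's systematicity challenge to neural networks.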

Comments Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

Abstract

The recent successes of neural networks producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

2606.14510 2026-06-15 cs.LG New submission

PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion

Junming Zhang, Siyu Yi, Wei Ju, Zhonghui Gu

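Affiliations: College of Computer Science, Sichuan University; School of Mathematics, Sichuan University; School of Artificial Intelligence, Sichuan University; Lingang Laboratory

AI summary: Proposes PepALD, which combines autoregressive latent diffusion with chemical embeddings for de novo macrocyclic peptide design and uses preference optimization to improve affinity; it is reported to outperform baselines in generation quality and reward optimization.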

Comments 18 pages, 5 figures, 3 tables

Abstract

Macrocyclic peptides are promising therapeutic candidates for intracellular targets, but their design requires simultaneous control over non-natural monomer chemistry, ring topology, membrane permeability, and target binding. Existing SMILES- or HELM-string generative models either operate in long atom-level sequence spaces or treat monomers as symbolic tokens with limited chemical grounding. We introduce PepALD, an Autoregressive Latent Diffusion (ALD) foundation model for de novo macrocyclic peptide generation. The model represents HELM monomers with structured chemical embeddings, generates each residue through context-conditioned diffusion in a chemically informed latent space, predicts R-group-aware ring closures during autoregressive generation, and aligns the denoiser to affinity rewards using winner-protected diffusion-adapted preference optimization. In silico experiments demonstrate PepALD's generation quality and reward-optimization performance against representative peptide generation baselines.

2606.14507 2026-06-15 cs.AI New submission

Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

Chenyu Zhou, Qiliang Jiang, Boguang Pan

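Affiliations: School of Engineering, Institute of Science Tokyo; College of Control Science and Engineering, Zhejiang University; Graduate School of Information, Production and Systems, Waseda University

AI summary: Studies how dense coordinate-list fine-tuning affects the structured outputs of vision-language models (repetition, termination), finding that it produces a structure-bound, cross-family interference surface that can be measured and controlled via target-signal separation and structure-axis probes.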

Abstract

Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter F1@0.5 (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint (F1@0.3 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.
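
The duplicate metrics and the object-level repeat-stop mentioned in the abstract might look roughly like the sketch below. Everything here is an assumption made for illustration: the abstract does not define the metrics or the control, so the record format (a hashable tuple of label and box coordinates), the definitions in `duplicate_stats`, and the reading of "repeat-stop" as truncating the list at the first exact repeat are hypothetical.

```python
from collections import Counter


def duplicate_stats(records):
    """Return (duplicate rate, max repeat) for a list of hashable records.

    Assumed definitions: duplicate rate is the fraction of records that are
    exact copies of an earlier record; max repeat is the highest number of
    times any single record occurs.
    """
    if not records:
        return 0.0, 0
    counts = Counter(records)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(records), max(counts.values())


def repeat_stop(records):
    """Truncate the object list at the first exact repeated record."""
    seen = set()
    kept = []
    for record in records:
        if record in seen:
            break  # treat the first exact repeat as the end of the list
        seen.add(record)
        kept.append(record)
    return kept
```

Under these assumed definitions, a list that passes through `repeat_stop` always has duplicate rate 0.0 and max repeat 1, which matches the controlled endpoint the abstract reports.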

2606.14502 2026-06-15 cs.AI New submission

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu

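Affiliations: arXiv

AI summary: Describes a paradigm shift of LLMs from chatbots to digital colleagues along two dimensions, the cognitive core (Thinking LLMs) and tool-augmented task execution (OpenClaw-style workstation systems), enabling persistent work, state persistence, reusable skills, and self-improvement.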

Comments The paper is available on the project website: https://from-chatbot-to-digital-colleague.github.io/

Abstract

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

2606.14492 2026-06-15 cs.LG New submission

Recipe-Controlled Decoder Audit for Structural Knowledge-Graph Completion

Xihang Shan, Ye Luo

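Affiliations: School of Mathematical Sciences, Xiamen University; School of Informatics, Xiamen University

AI summary: Proposes a recipe-controlled decoder audit that swaps decoders to assess their effect on knowledge-graph completion performance, finds that decoder effects depend on recipe and provenance, and recommends a decoder x depth sweep before making encoder-level claims.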

Comments 11 pages, 5 figures. Code and artifacts: https://github.com/AndyShan11/kgc-decoder-audit

Abstract

We present a recipe-controlled decoder audit (RCDA) for structural transductive knowledge-graph completion (KGC). The audit asks a simple reporting question: before attributing gains to an encoder or training recipe, what changes when the decoder is swapped under the same recipe? Using ComplEx and DistMult as the primary controlled pair, with targeted RotatE/TransE spot-checks, we evaluate seven benchmarks. On five standard KGs, ComplEx-vs-DistMult differences are modest but consistent under our recipe (+0.005 to +0.012 MRR), whereas CompGCN-style encoder effects vary more by dataset. On small KGs, decoder effects become the main diagnostic: Kinship shows a stable ComplEx advantage of +0.143 MRR (6 seeds), while UMLS favours ComplEx by +0.022 MRR in a clean 6-seed server rerun but reverses in an earlier provenance variant. We therefore treat small-KG decoder choice as recipe- and provenance-sensitive rather than as a fixed dataset winner. We further show that decoder choice interacts with encoder depth on WN18RR, and that under our recipe L=0 ComplEx on YAGO3-10 reaches 0.6971 +/- 0.0048 MRR at d=128. The result is a compact audit protocol: report matched decoder rows, log small-KG provenance, and sweep decoder x depth before making encoder-level claims.
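
To make the decoder swap concrete, the two primary decoders in the abstract can be written as scoring functions over one (head, relation, tail) triple. The formulas are the standard published DistMult and ComplEx scores; the function names and the use of plain NumPy vectors are assumptions for illustration and are not taken from the authors' repository.

```python
import numpy as np


def distmult_score(head, relation, tail):
    """DistMult: trilinear product of real-valued embeddings."""
    return float(np.sum(head * relation * tail))


def complex_score(head, relation, tail):
    """ComplEx: real part of the trilinear product with a conjugated tail."""
    return float(np.real(np.sum(head * relation * np.conj(tail))))
```

DistMult scores (h, r, t) and (t, r, h) identically, so it cannot represent asymmetric relations, whereas ComplEx can; with zero imaginary parts ComplEx reduces to DistMult. That difference is one plausible reason a decoder swap under a fixed recipe can shift MRR on its own.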

2606.14476 2026-06-15 cs.AI cs.LG New submission

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Zhongyuan Wang, Pratyusha Vemuri

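Affiliations: raptorX.ai

AI summary: Tests whether LLM agents exercise judgment over GNN tools or defer blindly; agents adopt the GNN output wholesale 97.6-99.2% of the time, stronger backbones defer more, and selective-invocation designs are limited.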

Comments 9 pages, 2 figures. Under review at TMLR

Abstract

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6-99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve; at 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.
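
The two quantities the abstract leans on, agent-tool agreement and the per-node oracle, can be sketched as below. The function names and the array layout are assumptions for illustration; the oracle is read here as "count a node as correct if any available action predicts its label", which is one natural interpretation of the abstract rather than the authors' stated definition.

```python
import numpy as np


def agreement_rate(agent_pred, tool_pred):
    """Fraction of nodes where the agent's label equals the tool's label."""
    return float(np.mean(np.asarray(agent_pred) == np.asarray(tool_pred)))


def accuracy(pred, labels):
    """Plain classification accuracy."""
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))


def oracle_accuracy(action_preds, labels):
    """Accuracy of an oracle that picks, per node, any action that is correct.

    action_preds: (n_actions, n_nodes) predictions, one row per action
    (for example the GNN tool and a neighbour-label tool).
    """
    preds = np.asarray(action_preds)
    labels = np.asarray(labels)
    return float(np.mean((preds == labels).any(axis=0)))
```

The gap `oracle_accuracy(...) - accuracy(agent_pred, labels)` is then the headroom a perfect selective-invocation gate could recover; an agent whose `agreement_rate` with the tool is near 1.0 leaves that headroom untouched.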

2606.14475 2026-06-15 cs.CV New submission

Value-order Decomposition for Generalist Anomaly Detection

Miaoyun Zhao, Jing Chen, Miaoni Zhao, Qiang Zhang

Affiliations: Dalian University of Technology; Xi’an Chang’an Vanke City Primary School; Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education

AI summary: Proposes Value-order Decomposition (VOD), which disentangles and suppresses category-, defect-type-, and domain-specific information to achieve strong generalization in cross-domain anomaly detection.

Abstract

Industrial anomaly detection suffers from limited data, making cross-domain generalization particularly challenging. Generalist Anomaly Detection (GAD) aims to train a unified model on a source domain that can effectively detect anomalies in unseen target domains. In the initial semantic feature space, strong entanglement between anomalies and object categories or defect types hinders effective generalization across domains. Recent works address this issue by projecting features into a residual space; however, such methods primarily increase cross-domain overlap for normal features, while anomalous features remain specific to object categories, defect types and data domains, leading to poor alignment and generalization. To address this limitation, we propose Value-order Decomposition (VOD), a simple yet effective technique that bridges three types of generalization gaps across object categories, defect types (including real and synthetic defects), and data domains. VOD disentangles and suppresses object-category-, defect-type-, and domain-specific information, promoting alignment within normal and abnormal samples while preserving their separability, thereby enabling robust generalization across the three gaps. Leveraging the strong alignment between real and synthetic defects within the same object, we perform anomaly detection using only normal and synthetic-abnormal references, and effectively generalize to unseen real defect types. Experiments on diverse industrial and medical benchmarks demonstrate that our method, using a simple cut-and-paste anomaly simulation strategy, achieves strong generalization across the three gaps.

2606.14470 2026-06-15 cs.AI cs.CL cs.LG New submission

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

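Affiliations: QpiAI

AI summary: Proposes GitOfThoughts, a framework that stores an agent's reasoning tree as a git repository so that reasoning can be replayed, audited, and merged. Experiments show that no memory format reliably improves accuracy on novel problems; gains appear only when the retrieved case is highly similar to the current problem (>0.8), and they come from answer retrieval rather than method transfer.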

Comments 10 pages, 1 figure, 9 tables

Abstract

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.

2606.14466 2026-06-15 cs.SD cs.AI cs.LG New submission

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions

Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski

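Affiliations: University of Warsaw

AI summary: Proposes a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from classifications, showing that explanation heatmaps in audio deepfake detection can be systematically distorted while the predicted label stays unchanged.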

Comments Accepted to the ICML 2026 Workshop on Machine Learning for Audio: 5 pages, 4 figures

Abstract

This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications. We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By evaluating the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label. Full code available at: https://github.com/cncPomper/Audio-XAI

2606.14463 2026-06-15 cs.LG New submission

EM-NeSy: Expectation Maximization for Neurosymbolic Learning

Annegret Seibt, Luc De Raedt, Giuseppe Marra

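Affiliations: Department of Computer Science, KU Leuven

AI summary: Proposes EM-NeSy, a framework that treats probabilistic neurosymbolic learning as an instance of the Expectation-Maximization algorithm: symbol posteriors are computed by probabilistic inference and gradient updates flow only through the neural component, giving scalable and efficient approximate inference.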

Abstract

Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating the use of approximate inference. We propose EM-NeSy which casts probabilistic NeSy learning as an instance of the Expectation-Maximization (EM) algorithm. In the expectation step, we compute the posterior over the neurally predicted symbols conditioned on the label via probabilistic inference. In the maximization step, we update the neural parameters based on this posterior using gradient descent only through the neural component. This formulation unlocks the full potential of the EM algorithm for NeSy learning. It allows NeSy to extend naturally to approximate reasoning without any additional modifications or differentiability requirements of the symbolic component. Furthermore, it recovers the standard end-to-end gradient-based NeSy setting under exact inference. Our experimental results demonstrate the scalability and computational efficiency of EM-NeSy.
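The two EM steps can be illustrated on a toy problem. This is a sketch under stated assumptions, not the authors' implementation: free logits stand in for the neural network's outputs, a two-symbol XOR program stands in for the symbolic component, and exact enumeration stands in for probabilistic inference. The names `e_step`, `m_step`, and `xor` are hypothetical.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def e_step(probs, label, program):
    """Posterior over symbol assignments given the label, by exact enumeration."""
    posterior = {}
    for s in itertools.product([0, 1], repeat=len(probs)):
        if program(s) != label:
            continue  # assignment is inconsistent with the observed label
        w = 1.0
        for p, bit in zip(probs, s):
            w *= p if bit else (1.0 - p)
        posterior[s] = w
    z = sum(posterior.values())
    return {s: w / z for s, w in posterior.items()}

def m_step(logits, posterior, lr=0.5):
    """One gradient step on the expected negative log-likelihood w.r.t. the logits."""
    probs = [sigmoid(z) for z in logits]
    # Posterior marginal q_i = P(s_i = 1 | label); the posterior is held fixed here.
    marg = [sum(w for s, w in posterior.items() if s[i]) for i in range(len(logits))]
    # For Bernoulli symbols, the gradient of -E_q[log p(s)] w.r.t. logit i is p_i - q_i.
    return [z - lr * (p - q) for z, p, q in zip(logits, probs, marg)]

def xor(s):
    """Toy symbolic program (assumption): the label is the XOR of two symbols."""
    return s[0] ^ s[1]

logits = [0.5, -0.2]  # stand-in for neural outputs (assumption)
for _ in range(200):
    probs = [sigmoid(z) for z in logits]
    post = e_step(probs, label=1, program=xor)
    logits = m_step(logits, post)

probs = [sigmoid(z) for z in logits]
likelihood = probs[0] * (1 - probs[1]) + (1 - probs[0]) * probs[1]
```

In this toy setting the symbolic program is only queried for consistency with the label and is never differentiated. With exact enumeration, the update `p_i - q_i` equals the gradient of the negative log-likelihood of the label, which is consistent with the abstract's remark that the standard end-to-end setting is recovered under exact inference.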

2606.14460 2026-06-15 cs.CL cs.HC New submission

A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

Kehinde Temitayo Soetan

Affiliations * The Ohio State University

AI summary Audits representational bias in ClinicalBERT with two probing methods and finds that the model mostly amplifies, rather than inherits, statistical disparities in its training data.

Comments 17 pages, 4 tables, appendices A-E, preprint

Abstract

Transformer-based clinical language models are increasingly integrated into high-stakes clinical decision support pipelines, yet the computational mechanisms through which demographic associations encoded in medical documentation propagate into model probability distributions remain empirically underspecified. We present a systematic computational audit of representational bias in ClinicalBERT (Alsentzer et al., 2019), a BERT-based model pretrained on MIMIC-III discharge summaries, employing two complementary probing methodologies: Log Probability Bias Analysis (LPBA), which quantifies demographic descriptor-induced shifts in masked token probability distributions across behavioral and evaluative semantic categories, and Masked Language Model-based analysis (MLM), which probes internal representational structure for demographic agency attribution encoding across 98 real clinical sentence templates and eight intersectional race-gender combinations. Corpus frequency analysis operationalizes the distinction between statistical disparity and bias amplification by benchmarking model outputs against empirical term frequencies in the MIMIC-III training corpus. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing, providing direct empirical evidence that representational bias in ClinicalBERT operates predominantly through model-internal amplification rather than training data inheritance. Keywords: natural language processing, clinical documentation, algorithmic auditing, representational bias, health equity

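2606.14459 2026-06-15 cs.CL cs.AI cs.SD New submission

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter

Affiliations * Knowledge Technology, Department of Informatics, University of Hamburg, Germany

AI summary Introduces the MoDiCoL dataset, whose modular design separates linguistic content, speaker characteristics, and acoustic environments, together with a continual learning curriculum that simulates real-world distribution shifts; evaluates how robustness is acquired, transferred, and forgotten under three continual learning strategies.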

Comments Accepted at Interspeech 2026

Abstract

Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.

2606.14433 2026-06-15 cs.RO New submission

Kine2Go: Kinematic dataset for the Unitree Go2 robot with diverse gaits and motions

Władysław Pałucki, Paweł Siwak, Krzysztof Ciebiera, Marek Cygan

Affiliations * University of Warsaw

AI summary To lower the barrier to quadruped robot research, presents the Kine2Go dataset of 800 gait kinematics trajectories for the Unitree Go2 robot from 40 policies, obtained by training policies with reinforcement learning and collecting robust kinematic and motor-action data.

Comments 9 pages, 6 figures

Abstract

The recent popularity of robotics, combined with the steadily decreasing cost of robotic hardware, has lowered the entry barrier to robotics research and enabled rapid advancements in the field. One of the primary examples is the Unitree Go2 quadruped robot, which researchers often use in areas such as locomotion, navigation, and control. Many researchers combine the Go2 robot with techniques like imitation learning, reinforcement learning, and behavioral cloning to allow machine learning systems to take full control of the robot. At the same time, many of those techniques require demonstration data consisting of the robot's kinematic information and the actions applied to the motors. Obtaining such data is difficult, requires building complex pipelines, and can take significant time. To aid such efforts, we present Kine2Go, a dataset of 800 diverse gait kinematics trajectories for the Unitree Go2 robot, derived from 40 distinct policies. Our pipeline accepts data from various quadruped morphologies and translates them to a Go2-compatible format. We then use reinforcement learning to train policies that follow a given motion, and finally we gather data from those policies, which yields robust, perturbed kinematic data with corresponding motor-level actions.

2606.14421 2026-06-15 cs.RO cs.HC eess.SP New submission

ForestBack: Breadcrumb-Based Pedestrian Dead Reckoning for Infrastructure-Free Return Navigation

Aueaphum Aueawatthanaphisut, Chanakan Chaipan

Affiliations * University of Tokyo

AI summary Proposes ForestBack, a framework that uses breadcrumb-based pedestrian dead reckoning (PDR) to record a path and generate reverse guidance in environments without GPS or infrastructure; experiments show a 15.76% reduction in trajectory RMSE.

Comments 9 pages, 6 figures, 1 table, and 19 equations

Abstract

Reliable return navigation remains an important challenge in GPS-denied environments where external positioning infrastructure may be unavailable or unreliable. This paper presents ForestBack, an infrastructure-free pedestrian return navigation framework based on breadcrumb-based pedestrian dead reckoning (PDR). The system records a user's walking route as a sequence of reversible breadcrumb nodes and generates reverse-path guidance without requiring GPS, Wi-Fi, Bluetooth beacons, or pre-installed infrastructure. ForestBack integrates acceleration-based step detection, adaptive step-length estimation, magnetometer-assisted heading estimation, barometric-altitude correction, and bidirectional breadcrumb path reconstruction. The system was evaluated using an indoor obstacle-avoidance route with five checkpoints, where the user navigated around a central obstacle. A dataset of 36 walking trials and 42,474 time-series samples was used for evaluation, including IMU signals, magnetometer readings, barometric variables, turn-event labels, ground-truth trajectories, baseline PDR outputs, proposed ForestBack outputs, and power-related measurements. Experimental results show that ForestBack reduced the mean RMSE from 1.129 m to 0.965 m compared with traditional PDR, corresponding to a 15.76% improvement. The mean final-position error was reduced from 1.781 m to 1.388 m, while turn-event detection consistency reached approximately 99.90%. These results indicate that ForestBack improves trajectory reconstruction and route-preserving return guidance in obstacle-avoidance scenarios. The released dataset and analysis notebook support reproducibility and future benchmarking of infrastructure-free PDR-based return navigation systems.
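The breadcrumb idea can be sketched in a few lines. This is a simplified illustration, not the ForestBack implementation: it assumes step lengths and headings are already estimated, measures headings counterclockwise from the +x axis, and omits step detection, magnetometer fusion, and barometric correction. The function names are hypothetical.

```python
import math

def record_breadcrumbs(steps, start=(0.0, 0.0)):
    """Integrate (step_length_m, heading_rad) pairs into breadcrumb positions."""
    x, y = start
    crumbs = [(x, y)]
    for length, heading in steps:
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        crumbs.append((x, y))
    return crumbs

def return_guidance(crumbs):
    """Walk the breadcrumbs in reverse, giving (distance_m, heading_rad) per leg."""
    legs = []
    path = crumbs[::-1]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dx, dy = x1 - x0, y1 - y0
        legs.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return legs

# Two 0.7 m steps: one along +x, then one along +y.
crumbs = record_breadcrumbs([(0.7, 0.0), (0.7, math.pi / 2)])
legs = return_guidance(crumbs)
```

Because guidance is derived from stored positions rather than from live positioning, the return path retraces the recorded route leg by leg, which matches the route-preserving behavior the abstract describes.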

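2606.14420 2026-06-15 cs.CL New submission

Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake

Şevval Çakıcı

Affiliations * Koç University

AI summary Using a BERTurk classifier grounded in Lazarus and Folkman's coping theory, detects three coping styles (problem-focused, emotion-focused, meaning-making) in over one million Turkish-language tweets, reveals how they change across crisis phases, and finds that anger correlates most strongly with meaning-making.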

Comments 20 pages, 5 figures, 3 tables. To be submitted to Social Science Computer Review

Abstract

How do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman's (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.

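2606.14418 2026-06-15 cs.AI cs.LG cs.RO New submission

Causal Object-Centric Models for Planning with Monte Carlo Tree Search

Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov

Affiliations * MIRAI CogAILab

AI summary Proposes COMET, which combines an unsupervised object-centric encoder with a transformer world model and uses an action-slot fusion mechanism and object-causal attention for efficient planning, outperforming baselines on several benchmarks.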

Abstract

We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

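2606.14416 2026-06-15 cs.LG stat.ML New submission

Federated Learning for Feature Generalization with Convex Constraints

Dongwon Kim, Donghee Kim, Sung Kuk Shyn, Kwangsu Kim

AI summary To address the generalization problem caused by heterogeneous client data in federated learning, proposes FedCONST, which uses linear convex constraints to adaptively modulate update magnitudes and balance parameter learning, validates it with a gradient signal-to-noise ratio analysis, and achieves strong generalization across heterogeneous environments.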

Comments Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

Abstract

Federated learning (FL) often struggles with generalization due to heterogeneous client data. Local models are prone to overfitting their local data distributions, and even transferable features can be distorted during aggregation. To address these challenges, we propose FedCONST, an approach that adaptively modulates update magnitudes based on the parameter strength of the global model. This prevents over-emphasizing well-learned parameters while reinforcing underdeveloped ones. Specifically, FedCONST employs linear convex constraints to ensure training stability and preserve locally learned generalization capabilities during aggregation. A Gradient Signal to Noise Ratio (GSNR) analysis further validates the effectiveness of FedCONST in enhancing feature transferability and robustness. As a result, FedCONST effectively aligns local and global objectives, mitigating overfitting and promoting stronger generalization across diverse FL environments, achieving state-of-the-art performance.

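2606.14415 2026-06-15 cs.AI New submission

CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

Ayoub Belouadah, Sylvain Kubler, Yves Le Traon

Affiliations * University of Luxembourg

AI summary Proposes Constraint-Sensitive Policy Optimization (CSPO), which augments the primal objective with a local constraint-sensitivity correction to speed up safety recovery and reduce oscillations, achieving higher constrained returns on navigation and locomotion benchmarks.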

Comments Accepted as a Spotlight paper at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, leading to oscillatory behavior and prolonged safety violations. In this paper, we propose Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method that incorporates local constraint sensitivity into policy updates. CSPO augments the primal objective with a constraint-sensitive correction derived from the shortest signed distance to the safety boundary, enabling smarter recovery steps back to safety, compensating for delayed Lagrange multiplier updates, reducing oscillations near the boundary, and preserving the KKT solutions of the original constrained problem. Experiments on navigation and locomotion benchmarks demonstrate that CSPO achieves faster safety recovery and high reward preservation, resulting in higher constrained returns compared to state-of-the-art primal-dual and penalty-based methods.

2606.14409 2026-06-15 cs.RO cs.AI New submission

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang

Affiliations * University of Science and Technology of China; Tsinghua University

AI summary Presents HyVLA-0.5, an end-to-end robot learning stack covering data collection, model design, pre-training and fine-tuning, RL post-training, and real-world deployment, with the components working together.

Abstract

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

2606.14397 2026-06-15 cs.LG New submission

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel, Michal Zakrzewski, Sebastian Montagna, Damian Rynczak, Shreyansh Padarha, Kumail Alhamoud, Zihao Fu, William Lugoloobi, Kai Rawal, Hanna Yershova, Xander Davies, Taras Rumezhak, Guohao Li, Fazl Barez, Baoyuan Wu, Arkadiusz Drohomirecki, Yarin Gal, Chris Russell, Christopher Summerfield, Adam Mahdi, Volodymyr Karpiv, Philip Torr, Adel Bibi

Affiliations * University of Oxford; SoftServe; Massachusetts Institute of Technology; The Chinese University of Hong Kong, Shenzhen; UK AI Security Institute; Ukrainian Catholic University

AI summary Introduces the GauntletBench benchmark, which uses 100 vision-intensive tasks (20 in each of five applications) to evaluate underexplored agent capabilities such as temporal perception, graphical understanding, and 3D reasoning; the state-of-the-art agent reaches only a 19.1% success rate, far below the more than 80% achieved by non-expert humans.

Abstract

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions, resulting in saturated performance on modern agents and failing to probe their limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios, focusing on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning), across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline that comprises an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from achieving human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on our GauntletBench, highlighting the limitations in these overlooked capabilities and generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing the substantial gap between current agent capabilities and those required for complex real-world scenarios.

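2606.14391 2026-06-15 cs.CL cs.AI cs.SD New submission

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

Henri-Leon Kordt, Theresa Pekarek Rosin, Jae Hee Lee, Stefan Wermter

Affiliations * Knowledge Technology, Department of Informatics, University of Hamburg

AI summary To address the information loss caused by ASR systems omitting disfluencies, proposes an approach based on continual learning with explicit disfluency tokens: tokens are introduced into a pretrained model and training continues on further datasets, with an analysis of the trade-off between marker learning and ASR performance and of a cross-attention head mechanism shared across methods.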

Comments Accepted at Interspeech 2026

Abstract

Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

2606.14389 2026-06-15 cs.CV New submission

MooMIns -- Monocular 3D Reconstruction and Object Pose Estimation from Multiple Instances

Robert Langendörfer, Markus Hillemann, Markus Ulrich

Affiliations * Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology

AI summary Proposes MooMIns, which exploits the implicit multi-view geometry provided by multiple instances of the same object in a single image and inverts Gaussian splatting to achieve 3D reconstruction and 6D pose estimation, avoiding hallucinations caused by depth priors.

Abstract

Simultaneous 3D reconstruction and 6D object pose estimation from a single monocular image is an inherently ill-posed problem. In industrial settings, however, multiple instances of an object are often randomly arranged in bins, implicitly providing several views of the same object within a single image. We show that this implicit multi-view geometry can be exploited to simultaneously reconstruct the object in 3D and estimate the 6D pose of each visible object instance. We present MooMIns, a new Gaussian-splatting-based approach that inverts the original Gaussian splatting formulation: instead of rendering a single scene from multiple cameras, we render multiple object instances from a single camera. Our method is initialized with SAM3 instance segmentation masks and a modified Structure from Motion (SfM) pipeline. In contrast to learned monocular depth estimation, we perform true geometry-based reconstruction from image evidence, avoiding hallucinations caused by training data priors. We evaluate MooMIns on synthetic and real bin-picking scenarios, and demonstrate accurate reconstruction of previously unseen objects as well as reliable pose estimation of individual instances.

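2606.14388 2026-06-15 cs.LG New submission

A Low-Rank Subspace Analysis of LLM Interventions

Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu

Affiliations * University of Cambridge

AI summary Proposes a diagnostic framework that models behaviors as low-rank subspaces in activation space, finding that intervening on one behavior affects others asymmetrically and that the effects are related to subspace overlap and to the angle with the decision subspace.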

Comments Mechanistic Interpretability Workshop @ ICML 2026

Abstract

Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side effects, we introduce a diagnostic framework for analyzing interacting behaviors in LLMs. We model behaviors as low-rank subspaces in activation space, and study how the effects of interventions propagate across behaviors. Across multiple instruction-tuned models (7B-70B) and across refusal, jailbreak, and sycophancy settings, we find that different behaviors share internal representations, and that intervening on one behavior alters others in asymmetric ways. Some behaviors act as upstream control points whose interventions propagate broadly across other behaviors, while others remain more isolated. We relate these effects to two geometric quantities: (i) the overlap between behavior subspaces, measured as the average squared cosine of principal angles, and (ii) the angle between each behavior subspace and the decision subspace (capturing the model's final decision, e.g., refuse vs. comply). Empirically, intervention effects on other behaviors tend to be larger for behavior pairs with higher subspace overlap, and for source behaviors whose subspaces lie closer (smaller angle) to the decision subspace. These findings highlight a challenge for targeted behavior control: behaviors are difficult to modify independently, as interventions can propagate through shared representations and asymmetric interactions.
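The overlap measure in (i) can be computed without an explicit SVD, because the sum of squared principal-angle cosines between two subspaces equals the squared Frobenius norm of the product of their orthonormal bases. The sketch below assumes each subspace is given as a list of spanning vectors and that the average is taken over the smaller of the two dimensions; the abstract does not state the normalization, and the function names are hypothetical.

```python
import math

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > tol:  # skip linearly dependent directions
            basis.append([x / norm for x in w])
    return basis

def subspace_overlap(U, V):
    """Average squared cosine of the principal angles between span(U) and span(V)."""
    Qu, Qv = orthonormalize(U), orthonormalize(V)
    # Sum over all basis pairs of squared inner products = ||Qu^T Qv||_F^2,
    # which equals the sum of squared principal-angle cosines.
    total = sum(sum(x * y for x, y in zip(a, b)) ** 2 for a in Qu for b in Qv)
    return total / min(len(Qu), len(Qv))

e1, e2, e3, e4 = ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1])
partial = subspace_overlap([e1, e2], [e1, e3])                          # one shared direction
same = subspace_overlap([[1, 1, 0, 0], [1, -1, 0, 0]], [e1, e2])        # same span, other basis
disjoint = subspace_overlap([e1, e2], [e3, e4])                         # orthogonal subspaces
```

The value is 1 for identical subspaces and 0 for orthogonal ones, and it does not depend on which basis is used to describe each subspace.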

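2606.14386 2026-06-15 cs.LG cs.AI q-fin.PM New submission

Discovery under Hypothesis Redundancy: A Geometric Theory of Discovery Bottlenecks

Li Xia, Baoxun Wang

Affiliations * School of Economics and Management, Tsinghua University; Platform & Content Group, Tencent

AI summary Poses the Search Compression Hypothesis, which explains the advantage of hybrid discovery systems through three geometric conditions (spectral compression, orthogonal escape, and residual signal alignment); experiments show that novelty alone is insufficient and predictive alignment is required.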

Comments 23 pages, 1 figure, 27 tables

Abstract

Scientific discovery saturates when new hypotheses cease to provide independent information, even if the nominal hypothesis space remains large. We study hybrid discovery systems that combine structured local search with LLM-generated non-local proposals and pose the Search Compression Hypothesis: non-local exploration helps only when three geometric conditions co-occur: spectral compression, orthogonal escape from the explored span, and residual signal alignment with the target. We formalize these conditions, derive necessary conditions for hybrid advantage, and test the mechanism in controlled synthetic environments, large-scale A-share factor discovery, and symbolic-regression benchmarks; a public tabular operational sanity check tests the associated budget-allocation implication. Signal-planting and directed-versus-random experiments show that novelty alone is insufficient: random orthogonal jumps expand coverage but do not improve yield without predictive alignment. Across compression sweeps, real factor archives, and LLM-SRBench tasks, hybrid gains concentrate in weakly represented but target-bearing directions and vanish as the hypothesis space approaches full rank. The framework turns LLM-guided discovery from generic novelty search into a diagnostic procedure for deciding when directed non-local exploration is warranted.

2606.14380 2026-06-15 cs.CV New submission

FLaRA: Predicting Future Latent Representations for Accident Anticipation

Lorenzo Caselli, Tomaso Trinci, Tommaso Bianconcini, Simone Magistri, Leonardo Taccari, Francesco Sambo, Andrew D. Bagdanov

Affiliations * Department of Information Engineering, University of Florence; Verizon Connect

AI summary Proposes the FLaRA architecture, which anticipates accidents by predicting future latent representations and achieves state-of-the-art performance on Nexar and other datasets.

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Abstract

Anticipating traffic accidents from dashcam videos is a critical challenge in intelligent transportation systems. Existing methods typically map visual context directly to a collision probability without explicitly modeling the future evolution of the driving scene. In this paper we propose FLaRA (Predicting Future Latent Representations for Accident Anticipation), a novel predictive architecture that shifts this paradigm by forecasting future latent representations for accident anticipation. Building upon the Video Joint-Embedding Predictive Architecture (V-JEPA2), our model conditions a predictor network on observed context frames to predict the forthcoming latent features of the scene. A classifier then operates on these predicted future representations rather than only on past observations. To ensure these forecasts remain grounded in realistic future dynamics, we introduce a joint training objective that simultaneously optimizes an auxiliary feature-level reconstruction loss and a cross-entropy classification loss. Extensive evaluations on the Nexar dataset, alongside cross-domain validations on the DAD, DADA-2000, and DoTA benchmarks, demonstrate that our approach achieves state-of-the-art performance while maintaining realistic early warning capabilities.

2606.14375 2026-06-15 cs.RO cs.AI New submission

Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li

Affiliations: Ising AI; CUHK-Shenzhen (The Chinese University of Hong Kong, Shenzhen); PKU (Peking University)

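AI summary: Proposes Elastic Queries Reinforcement Learning (EQRL), which uses a lightweight latent-schedule adaptor to dynamically adjust a VLA model's inference steps and action chunk length, estimates state difficulty from critic ensemble disagreement, and lowers inference cost while preserving or improving task success.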

English abstract

Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
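The scheduling idea in the abstract can be sketched in a few lines. The abstract describes a learned adaptor, which this sketch replaces with a fixed threshold rule purely for illustration. The disagreement measure (population standard deviation), the threshold, the step and chunk values, and every function name below are assumptions. The `gamma ** k` bootstrap is one common reading of "chunk-dependent discounting", borrowed from standard macro-action RL, and is not confirmed by the abstract.

```python
import statistics


def difficulty(q_values):
    """State difficulty as the spread of critic-ensemble Q estimates (assumed measure)."""
    return statistics.pstdev(q_values)


def choose_schedule(q_values, threshold=0.5):
    """Hypothetical rule: hard states get more denoising steps and a shorter chunk."""
    if difficulty(q_values) > threshold:
        return {"denoise_steps": 10, "chunk_len": 4}
    return {"denoise_steps": 4, "chunk_len": 16}


def macro_target(rewards, next_value, gamma=0.99):
    """Bootstrapped target for a chunk of k = len(rewards) open-loop steps.

    Rewards inside the chunk are discounted step by step, and the value of
    the state reached afterwards is discounted by gamma ** k.
    """
    k = len(rewards)
    chunk_return = sum(gamma ** i * r for i, r in enumerate(rewards))
    return chunk_return + gamma ** k * next_value


def amortized_nfe(schedules):
    """Denoising evaluations per executed environment step across queries."""
    total_nfe = sum(s["denoise_steps"] for s in schedules)
    total_steps = sum(s["chunk_len"] for s in schedules)
    return total_nfe / total_steps


easy = choose_schedule([1.0, 1.0, 1.0])  # critics agree
hard = choose_schedule([0.0, 2.0])       # critics disagree
print(easy, hard, amortized_nfe([easy, hard]))
```

Under these assumed values, a long chunk on an easy state spreads a small denoising budget over many steps, which is how the amortized cost can fall even though hard states receive more computation per query.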