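arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.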

2606.13707 2026-06-15 cs.AI cs.CL cs.CV new submission

Orchestra-o1: Omnimodal Agent Orchestration

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

Affiliations: CUHK (The Chinese University of Hong Kong); LIGHTSPEED; PKU (Peking University); THU (Tsinghua University); Tongji University

AI summary: Proposes Orchestra-o1, an omnimodal agent orchestration framework whose unified orchestration mechanism enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. It surpasses the second-best approach by 10.3% accuracy on the OmniGAIA benchmark, and the DA-GRPO reinforcement learning method is introduced to train Orchestra-o1-8B, which reaches state-of-the-art performance among open-source omnimodal agents.

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

2606.13705 2026-06-15 cs.LG cs.AI new submission

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald

Affiliations: Edgerunner AI

AI summary: Finds that Gemma 4 models fall into repetition loops at rates as high as 95% on long factual enumeration tasks, localizes the cause to a small set of MLP neurons via per-layer ablation and per-neuron attribution, and removes the loops with static weight edits (as small as sign-inverting a single neuron). The edits cannot fully resolve the "doom loops" caused by missing knowledge.

Abstract

Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These "surgeries" can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.
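
To make the idea of a "single sign-inverted neuron" concrete, here is a minimal sketch on a toy two-layer MLP. It assumes the edit flips the sign of one hidden neuron's output column; the abstract does not say which weights are edited, and the names `mlp_forward`, `invert_neuron`, and the index `neuron = 3` are hypothetical.

```python
import numpy as np

def mlp_forward(x, w_up, w_down):
    """Toy two-layer MLP: ReLU(w_up @ x) projected back by w_down."""
    hidden = np.maximum(w_up @ x, 0.0)
    return w_down @ hidden

def invert_neuron(w_down, neuron_idx):
    """Return a copy of w_down with one hidden neuron's output sign flipped."""
    edited = w_down.copy()
    edited[:, neuron_idx] *= -1.0
    return edited

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8
w_up = rng.normal(size=(d_hidden, d_model))
w_down = rng.normal(size=(d_model, d_hidden))
x = rng.normal(size=d_model)

neuron = 3  # hypothetical index; the paper locates candidates by attribution
w_edit = invert_neuron(w_down, neuron)

hidden = np.maximum(w_up @ x, 0.0)
before = mlp_forward(x, w_up, w_down)
after = mlp_forward(x, w_up, w_edit)

# The static edit shifts the output by exactly -2x that neuron's contribution.
delta = after - before
expected = -2.0 * hidden[neuron] * w_down[:, neuron]
```

The edit is static: it touches one column of one weight matrix and needs no change to the inference code, which is what makes such "surgeries" cheap to apply.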

2606.13703 2026-06-15 cs.AI cs.GL cs.LO new submission

History of the Muddy Children Puzzle

Hans van Ditmarsch

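Affiliations: CNRS, France; IIT Kanpur, India

AI summary: Traces the origins of the muddy children puzzle over the past two centuries, and presents its variants along with a new hat puzzle involving self-reference.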

2606.13686 2026-06-15 cs.CL cs.CY new submission

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

Zijing Shi, Meng Fang, Ling Chen

Affiliations: AAII, University of Technology Sydney; University of Liverpool

AI summary: Proposes the WebDecept framework, which injects seven common deceptive interface patterns into e-commerce environments to test the safety of multimodal web agents, and finds that current agents are highly susceptible to deception and that prompt-based constraints are insufficient.

Comments: Accepted to ACL 2026

Abstract

As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.

2606.13685 2026-06-15 cs.CL cs.AI new submission

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

Abel Yagubyan

Affiliations: Independent Researcher

AI summary: Studies the unreliability of LLM-as-a-Judge under repeated evaluation, finding an average preference flip rate of 13.6% and a position bias, and recommends multi-trial aggregation and uncertainty reporting.

Comments: 24 pages, 7 figures

Abstract

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19-0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise-pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% (κ = 0.51), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
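
The flip-rate and majority-vote reasoning can be sketched as follows. This is a simplified illustration, not the paper's analysis: it assumes the flip rate is the share of trials that disagree with the majority verdict, and that trials are independent with a fixed agreement probability. The function names and the 40/10 verdict record are made up.

```python
from math import comb

def flip_rate(verdicts):
    """Fraction of trials that disagree with the majority verdict."""
    majority = max(set(verdicts), key=verdicts.count)
    return sum(v != majority for v in verdicts) / len(verdicts)

def majority_recovery_prob(p_agree, n_trials):
    """P(majority of n independent trials matches the reference), n odd."""
    need = n_trials // 2 + 1
    return sum(
        comb(n_trials, k) * p_agree**k * (1 - p_agree) ** (n_trials - k)
        for k in range(need, n_trials + 1)
    )

def trials_needed(p_agree, target=0.95, max_trials=99):
    """Smallest odd n whose majority vote reaches the target probability."""
    for n in range(1, max_trials + 1, 2):
        if majority_recovery_prob(p_agree, n) >= target:
            return n
    return None

verdicts = ["A"] * 40 + ["B"] * 10   # made-up 50-trial record for one question
rate = flip_rate(verdicts)           # 10 of 50 trials disagree with the majority
n_needed = trials_needed(1 - rate)   # odd trial count needed under independence
```

Under this independence model a 20% flip rate needs 7 trials to reach 95%. The paper's empirical reliability curves report larger counts, which is plausible if real trials vary more from question to question than a single fixed probability allows.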

2606.13683 2026-06-15 cs.AI cs.CL new submission

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

Affiliations: School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University; Pengcheng Laboratory

AI summary: Proposes UP-NRPA, an online framework that uses large language models and user portraits to adapt dialogue strategies in real time without offline reinforcement learning, achieving a 100% task success rate on collaborative and non-collaborative dialogue benchmarks and a 56.41% improvement in the sale-to-list ratio on negotiation tasks.

Abstract

To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving a 100% success rate in multiple dialogue tasks. In negotiation tasks in particular, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.

2606.13682 2026-06-15 cs.AI cs.LG new submission

A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

Faezeh Ardali, Mwembezi A. Nyelele, Gerald M. Knapp

Affiliations: Louisiana State University; University of Minnesota Duluth

AI summary: Proposes a scheduling policy based on a Transformer encoder-decoder architecture that takes only the processing-time matrix as input. After training on small Taillard instances it transfers directly to large problems from 40x40 to 100x100 and is competitive with classical dispatching rules.

Abstract

The open shop scheduling problem (OSSP) arises in many industrial and service settings but remains computationally challenging as the number of jobs and machines increases. While exact methods quickly become intractable, classical dispatching rules and metaheuristics may require substantial tuning to maintain solution quality at large scales. This study develops a Transformer-based scheduling policy for OSSP using an encoder-decoder architecture with multi-head attention. The model is trained on Taillard benchmark instances (4x4, 5x5, 7x7, and 10x10) using only the processing-time matrix as input and produces feasible schedules with makespans typically within 15-30% of best-known values. To evaluate scalability, the trained policy is applied without retraining to randomly generated instances from 40x40 to 100x100 and compared against classical dispatching heuristics, including SPT, LPT, MWKR, and EST. Across these large instances, the Transformer achieved average gaps of 12.89-15.12% relative to a standard lower bound. Compared with EST, the Transformer remained competitive, typically within a modest margin, while substantially outperforming SPT and LPT. These results indicate that a Transformer policy trained on small OSSP instances can generalize to substantially larger problems and provide a feature-light, learning-based alternative to classical dispatching rules.
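
For readers unfamiliar with the baselines, the sketch below shows a common lower bound for open shop instances and a simplified SPT/LPT-style list scheduler, both driven only by the processing-time matrix. It assumes a static priority ordering rather than the exact dispatching rules used in the paper; `lower_bound`, `list_schedule`, and the 2x2 matrix are illustrative.

```python
def lower_bound(p):
    """Standard OSSP bound: the busiest job or machine must finish its work."""
    job_loads = [sum(row) for row in p]
    machine_loads = [sum(col) for col in zip(*p)]
    return max(job_loads + machine_loads)

def list_schedule(p, longest_first=True):
    """Greedy list scheduling: take operations in LPT (or SPT) order and start
    each one as soon as both its job and its machine are free.
    Returns the makespan of the resulting feasible schedule."""
    ops = [(p[j][m], j, m) for j in range(len(p)) for m in range(len(p[0]))]
    ops.sort(reverse=longest_first)
    job_free = [0] * len(p)
    machine_free = [0] * len(p[0])
    for duration, j, m in ops:
        start = max(job_free[j], machine_free[m])
        job_free[j] = machine_free[m] = start + duration
    return max(job_free)

p = [[3, 2], [2, 4]]                 # p[j][m]: time of job j on machine m
lb = lower_bound(p)
makespan = list_schedule(p)          # LPT ordering
gap = (makespan - lb) / lb           # relative gap to the lower bound
```

The gap computed here is measured against a lower bound, as with the reported 12.89-15.12% figures, so it can overstate the distance to the true optimum.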

2606.13679 2026-06-15 cs.CV new submission

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

Affiliations: CUHK MMLab (Multimedia Laboratory, The Chinese University of Hong Kong); Meituan; CUHK IMIXR

AI summary: Proposes InterleaveThinker, the first multi-agent pipeline that gives existing image generators interleaved generation capability through planner and critic agents, and uses GRPO to reinforce single-step instruction correction, significantly improving generation performance.

Comments: Project Page: https://zhengdian1.github.io/InterleaveThinker-proj/ Code: https://github.com/zhengdian1/InterleaveThinker

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

2606.13675 2026-06-15 cs.RO new submission

Improving Robotic Generalist Policies via Flow Reversal Steering

Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine

Affiliations: Stanford University; UC Berkeley

AI summary: Proposes Flow Reversal Steering (FRS), which passes suboptimal actions through the flow policy in reverse to find their latent noise and maps them to the generalist policy's action modes, improving zero-shot control, behavioral cloning, and reinforcement learning.

Abstract

Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but "reasonable" actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions -- showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
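
The reverse pass can be illustrated with a toy flow. The sketch below assumes a hand-written velocity field in place of a trained flow-matching policy and uses plain Euler integration; it only shows that running the flow backward from an action yields a latent noise that the forward flow maps back to (approximately) the same action. How FRS then steers that noise toward generalist action modes is not specified in the abstract and is not shown. `MODE`, `velocity`, and `integrate` are hypothetical names.

```python
import numpy as np

MODE = np.array([1.0, -0.5])  # hypothetical action mode of the toy policy

def velocity(x, t):
    """Stand-in for a learned flow-matching velocity field (toy, not trained)."""
    return MODE - x

def integrate(x, t_start, t_end, steps=1000):
    """Euler integration of dx/dt = velocity(x, t).
    Runs backward in time when t_end < t_start."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

action = np.array([0.2, 0.3])            # a suboptimal but plausible action
noise = integrate(action, 1.0, 0.0)      # reverse pass: action -> latent noise
roundtrip = integrate(noise, 0.0, 1.0)   # forward pass: noise -> action
```

With Euler steps the round trip is only approximately exact; a finer step size or a higher-order solver tightens the match.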

2606.13662 2026-06-15 cs.AI cs.CL new submission

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

Affiliations: Department of Computer Science and Technology, Tsinghua University; Zhipu AI

AI summary: Proposes EurekAgent, an environment engineering framework designed along four dimensions (permissions, artifacts, budget, and human-in-the-loop), which sets new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, including results found with less than $11 in total API cost.

Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

2606.13657 2026-06-15 cs.LG new submission

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye

Affiliations: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; Amap, Alibaba Group

AI summary: Analyzes the sparsity and geometry of parameter updates in on-policy distillation (OPD), finding that updates are sparse and concentrated on small-weight coordinates, and validates the effectiveness of the sparse subnetwork.

Comments: Code is available at https://github.com/SydCS/OPD-Param-Analysis

Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe by combining two desirable ingredients: on-policy student trajectories and dense teacher supervision. However, how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and OPD use cases, our analysis yields two main findings. On sparsity, OPD updates are small and coordinate-sparse. They are distributed across layers, with the largest relative movement usually appearing in FFN modules. This sparse structure is operationally useful: training only the discovered subnetwork nearly recovers full-training performance. The sparse support does not remove the need for adaptive optimization: SGD, previously reported to be competitive in RLVR, underperforms AdamW in our OPD optimizer ablation, suggesting that dense teacher supervision preserves useful momentum structure and heterogeneous second-moment scales. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
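
The kinds of measurements described here can be sketched on a single weight matrix. The example below runs on synthetic data and assumes a simple absolute threshold for "unchanged" coordinates; the metric names, tolerance, and `top_k` are illustrative choices, not the paper's definitions.

```python
import numpy as np

def update_stats(w_src, w_new, tol=1e-6, top_k=4):
    """Summarize an update: coordinate sparsity, mean |source weight| on touched
    vs. all coordinates, numerical rank, and the share of squared singular
    values carried by the top_k directions."""
    delta = w_new - w_src
    touched = np.abs(delta) > tol
    s = np.linalg.svd(delta, compute_uv=False)
    return {
        "sparsity": 1.0 - touched.mean(),
        "src_mag_touched": np.abs(w_src[touched]).mean(),
        "src_mag_all": np.abs(w_src).mean(),
        "rank": int(np.linalg.matrix_rank(delta)),
        "topk_energy": float((s[:top_k] ** 2).sum() / (s ** 2).sum()),
    }

rng = np.random.default_rng(0)
w_src = rng.normal(size=(32, 32))
# Synthetic update: touch only the ~10% of coordinates with the smallest |w_src|.
threshold = np.quantile(np.abs(w_src), 0.10)
mask = np.abs(w_src) <= threshold
w_new = w_src + mask * rng.normal(scale=1e-2, size=w_src.shape)
stats = update_stats(w_src, w_new)
```

On real checkpoints the same function would be applied per layer, comparing `src_mag_touched` against `src_mag_all` to check whether updates land on near-zero source weights.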

2606.13626 2026-06-15 cs.SD cs.LG new submission

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

Dezhi Yu, Kyuil Lee, Yongkang Huang

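Affiliations: Stanford University

AI summary: Compares autoregressive LSTMs, latent-variable models, and generative adversarial networks for Bach-style piano music generation, finding that the autoregressive LSTM with attention produces the most coherent music, that vector quantization mitigates posterior collapse, and that adversarial approaches capture local pitch structure but are hard to train.

Comments: 11 pages, 13 figures. All authors contributed equally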

2606.13589 2026-06-15 cs.LG cs.AI new submission

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Meher Sai Preetam, Meher Bhaskar

Affiliations: Georgia Institute of Technology

AI summary: Proposes the SCSB framework, which jointly optimizes ensemble pruning and calibration over the probability simplex by minimizing the out-of-bag loss, introduces a concave quadratic penalty to resolve the L1-simplex paradox, and achieves up to 96% compression with improved calibration.

Comments: 6 pages, 3 tables

Abstract

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
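
The "L1-simplex paradox" is easy to check numerically: every weight vector on the probability simplex has L1 norm 1, so an L1 penalty cannot prefer sparse weights. The sketch below assumes the concave quadratic penalty takes the form 1 - ||w||², which is one natural choice; the abstract does not give the exact formula.

```python
import numpy as np

def l1_norm(w):
    """L1 norm; equals 1 for every point on the probability simplex."""
    return np.abs(w).sum()

def concave_penalty(w):
    """Assumed concave quadratic penalty 1 - ||w||_2^2:
    zero at simplex vertices, largest at the uniform weighting."""
    return 1.0 - np.dot(w, w)

uniform = np.full(4, 0.25)                  # standard bagging: equal votes
sparse = np.array([0.5, 0.5, 0.0, 0.0])     # two estimators pruned
vertex = np.array([1.0, 0.0, 0.0, 0.0])     # a single estimator kept

l1_values = [l1_norm(w) for w in (uniform, sparse, vertex)]
penalties = [concave_penalty(w) for w in (uniform, sparse, vertex)]
```

The L1 values are all 1, while the concave penalty falls from 0.75 to 0.5 to 0 as the weights concentrate, so minimizing it alongside the out-of-bag loss pushes weight toward fewer estimators.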

2606.13556 2026-06-15 cs.AI cs.HC q-bio.BM q-bio.GN q-bio.MN new submission

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

Aruna Dey, Suraj Biswas

Affiliations: Dots-In

AI summary: Proposes a Bayesian inference framework that uses genomic priors to address the cold-start problem in personalized health AI, separating the constitutional and environmental components of physiological signals through genomic anchoring and updating dynamically as data accumulate.

Comments: 24 pages, 8 figures, 3 tables. Conceptual framework paper. Updated version with revised section structure and formatting

Abstract

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
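
The three formulas in the abstract translate directly into code. The sketch below follows them term by term. The form of w(t) is not given in the abstract, so an exponential decay with a 14-day half-life is assumed purely for illustration, and all effect sizes and allele counts are made-up numbers.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i)."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))

def prior_weight(t_days, half_life_days=14.0):
    """w(t): assumed exponential decay; the abstract does not give its form."""
    return 0.5 ** (t_days / half_life_days)

def blended_set_point(g_genomic, p_bar, t_days):
    """G-hat_t = w(t) * G-hat_genomic + [1 - w(t)] * P-bar_t."""
    w = prior_weight(t_days)
    return w * g_genomic + (1.0 - w) * p_bar

def deviation(p, g_hat):
    """delta = P - G-hat: negative suggests suppression, positive enhancement."""
    return p - g_hat

# Made-up effect sizes and risk-allele counts for a hypothetical HRV prior.
g_hat = genomic_set_point(60.0, betas=[5.0, -2.5], allele_counts=[2, 1])

# The abstract's example: the same observed HRV of 55 ms under two priors.
delta_high_prior = deviation(55.0, 80.0)   # prior predicts 80 ms
delta_low_prior = deviation(55.0, 30.0)    # prior predicts 30 ms
```

The sign reversal between the two deviations is the point of the anchor: the same reading is interpreted relative to the individual's predicted set point rather than a population mean.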

2606.13464 2026-06-15 cs.CL cs.AI new submission

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China; Shenzhen Loop Area Institute (SLAI), China

AI summary: Proposes an ontology memory-augmented ASR correction framework that dynamically updates an ontology memory storing entities, terminology, variants, confusions, and semantic relations to handle contextual correction in long text-speech interleaved conversations, outperforming direct correction on the RAMC-Corr dataset.

Abstract

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

2606.13392 2026-06-15 cs.AI 新提交

MiniMax Sparse Attention

MiniMax 稀疏注意力

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Jinkai Hu, Jiayao Li, Rui Gao, Zekun Li, Songquan Zhu, Jingkai Zhou, Pengyu Zhao

发表机构 * MiniMax ； Peking University（北京大学）； NVIDIA（英伟达）； Zhejiang University（浙江大学）； Huazhong University of Science and Technology（华中科技大学）

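AI summary: Proposes MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention in which a lightweight Index Branch selects Top-k key-value blocks for efficient long-context processing; on a 109B model at 1M context it cuts per-token attention compute by 28.4x and delivers 14.2x prefill and 7.6x decoding speedups.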

Comments 30 pages, 14 figures

AI中文摘要

超长上下文能力对于前沿大语言模型变得不可或缺：智能体工作流、仓库级代码推理和持久记忆都要求模型共同关注数十万到数百万个 token，然而 softmax 注意力的二次成本使得这在部署规模上难以实现。我们引入了 MiniMax 稀疏注意力（MSA），一种基于分组查询注意力（GQA）构建的块级稀疏注意力。一个轻量级索引分支对键值块进行评分，并为每个 GQA 组独立选择 Top-k 子集，从而实现组特定的稀疏检索，同时保持高效的块级执行；主分支则仅对选中的块执行精确的块稀疏注意力。MSA 的设计遵循简单和可扩展的原则，经过精心简化，使其能够在一系列 GPU 上高效部署。为了将稀疏性转化为实际加速，我们与 MSA 协同设计了 GPU 执行路径，该路径使用无指数 Top-k 选择和 KV 外部稀疏注意力，以在块粒度访问下提高张量核心利用率。在一个具有原生多模态训练的 109B 参数模型上，MSA 的性能与 GQA 相当，同时在 1M 上下文下将每个 token 的注意力计算减少了 28.4 倍。结合我们协同设计的内核，MSA 在 H800 上实现了 14.2 倍的预填充和 7.6 倍的解码端到端加速。我们的推理内核可在以下网址获取：https://github.com/MiniMax-AI/MSA 。一个由 MSA 驱动的生产级原生多模态模型已在以下网址公开发布：https://huggingface.co/MiniMaxAI/MiniMax-M3 。

英文摘要

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
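The two-branch idea in the abstract (score key-value blocks, keep a Top-k subset, then run exact attention only over the kept blocks) can be sketched in a few lines. This is a minimal pure-Python illustration for a single query of a single GQA group, not the authors' kernel. The abstract does not say how the Index Branch scores a block, so the mean-pooled key summary and the names `mean_pool`, `select_blocks`, and `sparse_attention` are assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_pool(block):
    """Summarize one key block by its average key vector (assumed scoring proxy)."""
    dim = len(block[0])
    return [sum(key[d] for key in block) / len(block) for d in range(dim)]

def select_blocks(query, key_blocks, top_k):
    """Index-branch stand-in: score every block and keep the top_k block indices."""
    scores = [dot(query, mean_pool(block)) for block in key_blocks]
    ranked = sorted(range(len(key_blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, key_blocks, value_blocks, top_k):
    """Main-branch stand-in: exact softmax attention over the selected blocks only."""
    chosen = select_blocks(query, key_blocks, top_k)
    keys = [k for i in chosen for k in key_blocks[i]]
    values = [v for i in chosen for v in value_blocks[i]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
```

With `top_k` equal to the number of blocks the function reduces to dense attention, so the cost saving comes entirely from how few blocks survive selection. In a GQA setting, `select_blocks` would be called once per group so that each group retrieves its own subset.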

2606.13221 2026-06-15 cs.LG 新提交

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

从不确定判断到校准排名：用于LLM评估的共形Elo估计

Bora Kargi, David Salinas

发表机构 * ELLIS Institute Tübingen（ELLIS 蒂宾根研究所）； OpenEuroLLM

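AI summary: Proposes a two-level calibration method that combines local uncertainty propagation with global split conformal prediction, bringing LLM-as-a-judge Elo ratings within 17.9 Elo MAE of human-derived ratings and providing distribution-free prediction intervals.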

AI中文摘要

评估新的大型语言模型通常需要大规模且昂贵的人工标注。LLM作为评判者提供了一种更便宜的替代方案，但评判者评分存在系统误差（如位置偏差、自我偏好或不可传递性），这些误差可能导致最终排名严重失准。我们在两个互补层面上量化评判者与人类之间的分歧。在局部层面，我们通过将校准的获胜概率而非硬标签传播到Bradley-Terry过程中，从评判者自身的评分差异估计每场对战的不确定性。仅此一项就显著提高了Elo估计的准确性，在LMArena上对55个保留模型取平均时，LLM得出的评分与人类得出的评分之间的平均绝对误差为17.9 Elo。在全局层面，我们将分裂共形预测应用于LLM得出的与人类得出的Elo评分之间的残差差距，产生具有无分布边际覆盖保证的预测区间，从而解释了不可约的LLM-人类分歧。这两层结合产生了一个低成本的评估工具，为开发者提供校准的Elo估计和诚实的置信区间，而无需大规模人工标注。为促进可重复性，我们在 https://github.com/kargibora/SoftElo 发布代码。

英文摘要

Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors, such as position bias, self-preference, or intransitivity, that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
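The global level described above, split conformal prediction on the gap between LLM-derived and human-derived Elo, follows a standard recipe that can be sketched briefly. The code below is a generic illustration of that recipe, assuming absolute residuals as the conformity score; the function names `conformal_halfwidth` and `elo_interval` are hypothetical and are not taken from the released repository.

```python
import math

def conformal_halfwidth(calib_llm, calib_human, alpha):
    """Quantile of |human - llm| residuals on calibration models, with the
    finite-sample (n + 1) correction used by split conformal prediction."""
    residuals = sorted(abs(h - l) for l, h in zip(calib_llm, calib_human))
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration models to certify this coverage level.
        return math.inf
    return residuals[rank - 1]

def elo_interval(llm_elo, halfwidth):
    """Symmetric prediction interval around an LLM-derived Elo rating."""
    return (llm_elo - halfwidth, llm_elo + halfwidth)
```

For a new model, the interval `elo_interval(llm_elo, conformal_halfwidth(...))` then covers the human-derived rating with probability at least `1 - alpha`, marginally over exchangeable models, without distributional assumptions on the residuals.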

2606.13119 2026-06-15 cs.LG cs.AI cs.NE 新提交

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

MP3：面向时空预测的多周期模式预训练

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； Eindhoven University of Technology（埃因霍温理工大学）

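AI summary: Targets the temporal mirage problem caused by short-window inputs in spatio-temporal data and proposes MP3, a plug-and-play multi-period pattern pre-training plugin that improves existing STGNNs through multi-period temporal modeling, spatial modeling, and cross-period causal interaction.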

AI中文摘要

时空预测在交通、气候和能源等多个领域至关重要。城市时空数据表现出时间幻象：相似的短窗口输入具有不同的未来趋势，反之亦然。现有的时空图神经网络（STGNN）无法有效识别此类幻象。我们认为核心原因在于短窗口输入具有不完整的周期观测、异质的全局空间相关性和跨周期叠加因果性。为弥补这一差距，我们开发了一种新颖的多周期模式预训练（MP3），这是一种用于区分时间幻象的即插即用预训练插件。MP3提出了两项核心创新：（1）多周期模式学习旨在从长时间序列中学习多周期模式。具体地，多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模使用瓶颈投影和全局记忆库来高效捕获异质的全局空间关系。跨周期模式交互采用因果增强的Transformer来捕获不同周期模式之间的依赖关系。（2）该插件可以无缝集成到现有的STGNN骨干中，以增强其预测性能。在五个真实世界数据集（包括大规模数据集CA）上的五个STGNN基线实验验证了MP3的有效性、优越的可扩展性和强适应性，其在所有评估基线上带来了一致且稳健的性能提升。平均而言，MP3将MAE降低了4.7%，RMSE降低了5.0%。代码可在 https://github.com/YAN-outlook/MP3 获取。

英文摘要

Spatio-temporal forecasting is crucial in diverse fields such as transportation, climate, and energy. Urban spatio-temporal data exhibit temporal mirages: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason is that short-window inputs suffer from incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) Multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) The plugin integrates seamlessly into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including the large-scale dataset CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

2606.13054 2026-06-15 cs.LG cs.AI 新提交

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

TWLA：通过训练后量化实现大语言模型的三值权重和低位激活

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

发表机构 * University of Science and Technology of China（中国科学技术大学）

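AI summary: Proposes TWLA, a post-training quantization framework that achieves 1.58-bit weights and 4-bit activations, handles heavy-tailed activation distributions, and accelerates inference.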

Comments Accepted by ICML 2026

AI中文摘要

大型语言模型（LLMs）展现出卓越的通用语言处理能力，但其内存和计算成本阻碍了部署。三值化已成为一种有前景的压缩技术，可显著降低模型大小和推理复杂度。然而，现有方法难以处理重尾激活分布，因此将激活保持在高精度，从根本上限制了端到端推理加速。为克服这一限制，我们提出TWLA，一种后训练量化（PTQ）框架，在保持高精度的同时实现1.58位权重压缩和4位激活量化。TWLA包含三个组件：（1）欧几里得到流形非对称三值量化器（E2M-ATQ），通过从欧几里得初始化到流形重定位的两阶段优化，最小化权重三值化下的层输出误差；（2）Kronecker正交三模态整形（KOTMS），应用Kronecker结构正交旋转将权重重塑为三值友好的三模态分布，同时共享旋转统计上抑制激活异常值；（3）层间感知激活混合精度（ILA-AMP），在位分配中显式引入相邻层二阶交互成本，并联合优化由共享正交变换引起的激活量化增益的层间差异，防止少数弱层触发级联效应。大量实验表明，TWLA在W1.58A4下保持高精度，同时实现显著的推理加速。代码见 https://github.com/Kishon-zzx/TWLA 。

英文摘要

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at https://github.com/Kishon-zzx/TWLA.
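For readers unfamiliar with ternarization, the sketch below shows what "1.58-bit weights" means in practice: each weight is mapped to one of three codes, and log2(3) is roughly 1.58 bits. It uses a plain symmetric absmean rule as a stand-in baseline. It is not TWLA's asymmetric E2M-ATQ quantizer, whose details the abstract does not give, and the names `ternarize` and `dequantize` are hypothetical.

```python
import math

def ternarize(weights):
    """Map each weight to -1, 0, or +1 using a shared absmean scale
    (a simple baseline rule, assumed here for illustration)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate weights from ternary codes and the shared scale."""
    return [c * scale for c in codes]

# Information content of one three-valued symbol, the origin of "1.58-bit".
BITS_PER_TERNARY_WEIGHT = math.log2(3)
```

The rounding error of such a baseline is what methods like the one in the abstract aim to reduce, by optimizing the quantizer against layer-output error rather than against the weights alone.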

2606.12994 2026-06-15 cs.LG cs.CE 新提交

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

DeepJEB++: 基于基础模型驱动的二维潜空间增强的大规模三维工程数据集

Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee, Sunwoong Yang, Hyogu Jeong, Namwoo Kang

发表机构 * Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology（韩国科学技术院赵春植移动研究生院）； Department of Mechanical Engineering, Hanyang University（汉阳大学机械工程系）； Narnia Labs（纳尼亚实验室）

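AI summary: Proposes DeepJEB++, a framework that uses 2D latent-space augmentation and foundation models to expand a small seed set of jet engine bracket designs into a large simulation-labeled 3D dataset, a 40x expansion.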

Comments 16 pages, 14 figures. Submitted to ASME Journal of Mechanical Design

AI中文摘要

数据驱动的工程设计受到缺乏大规模三维数据集的限制，这些数据集需要将几何形状与基于物理的性能标签配对。特别是，现有的三维数据增强技术在保留微妙且多样的几何变化方面存在局限性，并且自动化后续的仿真标注过程仍然困难，因为边界条件取决于生成的几何形状。我们提出了DeepJEB++，一个基础模型驱动的数据增强框架，在资源受限的情况下将少量喷气发动机支架种子设计扩展为大规模、带仿真标签的三维数据集。我们的关键思想是在数据丰富的二维潜空间中进行增强，然后转移到三维。在第一阶段，我们在多视图渲染上微调预训练的二维潜扩散模型，并通过潜插值合成新视图，通过视觉语言模型（VLM）质量过滤器保留可制造的设计。在第二阶段，经过验证的图像通过领域适应的生成基础模型提升为三维网格。在第三阶段，一个自动化流水线识别每个网格上的载荷和螺栓接口，并分配有限元标签——质量、应力和位移——无需人工干预。我们沿着三个内在轴评估增强质量：可制造性、相对于SimJEB真实值的标签保真度以及分布一致性。从少于400个种子设计开始，DeepJEB++在每阶段使用单个GPU的情况下，生成了15,360个带仿真标签的三维支架——实现了40倍的扩展。该数据集将公开提供，以支持可复现的工程AI研究。

英文摘要

Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

2606.12941 2026-06-15 cs.CL 新提交

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

当上下文分片到达时的多轮推理：可扩展的分片与记忆增强强化学习

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo

发表机构 * The University of Melbourne（墨尔本大学）； Google Research Australia（谷歌澳大利亚研究院）

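AI summary: Addresses the 65% accuracy drop that fragmented information in multi-turn conversations causes in LLMs by training models to maintain a compact rolling memory rather than a growing history; introduces a low-cost sharding pipeline that converts single-turn QA into multi-turn fragmented episodes, and the trained memory-augmented policy substantially improves multi-turn accuracy and generalizes zero-shot to harder tasks.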

2606.12923 2026-06-15 cs.LG cs.AI cs.CL 新提交

Order Is Not Control: Driven-Dissipative Response Laws Across Artificial and Biological Systems

秩序并非控制：跨人工与生物系统的驱动-耗散响应定律

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

发表机构 * Australian Broadcasting Corporation（澳大利亚广播公司）

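AI summary: Argues that order is not control, proposes a receiver-gated response law, and identifies it across biological, LLM, adapter, and stochastic-operator panels, indicating that control is local and measurable.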

Comments 52 pages, 7 figures, updated title

AI中文摘要

AI对齐、可解释性、引导和神经扰动研究识别出诱导秩序的对象。我们认为秩序并非控制。控制需要接收器门控的响应定律：一个分母索引算子，将物质状态、动作/驱动、浴和接收器状态映射到响应位移、汇、努力和盆地投影。我们在生物、大语言模型、适配器和随机算子面板中识别出该定律。这些定律是局部的：干预可以被接纳、饱和、变号、泄漏或过驱动，取决于介质、浴、接收器状态、动作端口和比较器。当有限努力在相同分母下移动目标或结果读出类别，而损伤、无效/规避、无效格式、过驱动和不必要努力保持有界时，控制被分配。小鼠ALM、秀丽隐杆线虫和斑马鱼面板提供了物理响应算子证据，同时排除了坐标同一性和控制器结论。大语言模型面板展示了生成输出响应定律：在四种物质条件下，响应向量的分量符号预测准确率为72.8-73.7%，非零分量上提升至84.3-84.8%；留出观察者以93.6%和91.7%的准确率预测系统效应和目标/预言家族。宪法条件适配器将易感性重塑为制备介质，随机算子面板将测量机会与可部署行动策略分离。这给出了介观控制层面的驱动-耗散响应系统描述：驱动通过制备介质、浴和接收器作用，产生接纳运动、阻抗、汇或过驱动。证据支持局部接纳控制和可测量的随机响应算子，同时将可部署的预生成控制、隐藏/logit因果充分性、生物到LLM坐标同一性以及字面热力学量排除在范围之外。

英文摘要

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2606.12910 2026-06-15 cs.RO cs.AI cs.CV cs.SY eess.SY 新提交

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

边界框作为目标：通过神经符号规划实现语言条件抓取

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

发表机构 * University of California, Berkeley（加州大学伯克利分校）

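AI summary: Proposes GRASP, a framework that uses a pretrained VLM to translate natural-language queries into neuro-symbolic goal states grounded via bounding-box detection, enabling zero-shot tabletop manipulation without task-specific training.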

Comments Project website: https://allisonandreyev.github.io/grasp.github.io/

AI中文摘要

为了将机器人有效集成到家庭或工业环境中，机器必须实时适应自然语言提示。尽管视觉-语言模型（VLM）已在机器人任务与运动规划（TAMP）中实现零样本泛化，但当前最先进的方法通常计算量“沉重”或需要在数千个演示上进行大量训练。我们提出GRASP（基础推理与符号规划）框架，作为向开放词汇桌面操作迈进的一步。我们的方法利用预训练VLM将自然语言查询转化为神经符号目标状态，通过边界框检测管道在物理世界中接地。与依赖固定颜色列表或硬编码坐标的方法不同，GRASP使机器人能够解释诸如“顶层架子”之类的抽象空间概念，并在无需额外微调的情况下执行任务。我们在三个难度级别的90次真实机器人试验中实现了73.3%的总体成功率，无需任务特定训练。

英文摘要

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

2606.12881 2026-06-15 cs.CL cs.LG 新提交

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

面向聊天机器人微调的直接偏好优化：一项实证研究

Dezhi Yu, Yvonne Qiu, ShuoJia Fu

发表机构 * University of Science and Technology of China（中国科学技术大学）

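AI summary: An empirical study of Direct Preference Optimization (DPO) for chatbot fine-tuning, finding that it simplifies the training pipeline, improves computational efficiency, and performs competitively, but exhibits training instability.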

Comments 7 pages, 3 figures, 1 table. All authors contributed equally

2606.12817 2026-06-15 cs.AI 新提交

GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

Teach-and-Repeat: 从移动屏幕演示中准确提取操作知识以赋能GUI智能体

Yudong Zhang, Lei Hu, Daoyang Liu, Jiawei Liu, Yangfan Luo, Zhilin Gao, Zuojian Wang

发表机构 * Honor Device Co., Ltd（荣耀终端有限公司）； The Chinese University of Hong Kong（香港中文大学）

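AI summary: Proposes Teach VLM, which extracts keyframes from demonstration videos to generate operational knowledge, and builds a data flywheel to address training-data scarcity; it achieves state-of-the-art benchmark performance and raises the task success rate of downstream agents.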

Comments 20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Zuojian Wang, and Zhilin Gao are corresponding authors

AI中文摘要

理解移动设备上的数字世界正从静态UI感知转向动态动作理解。这种能力使模型能够将视觉状态转换转化为操作知识，定义为描述动作类型、目标UI元素、文本参数和执行顺序的简短自然语言句子。然而，由于跨应用的UI设计高度多样化和异构，现有视觉语言模型（VLM）难以准确推断这些底层操作。为弥补这一差距，我们引入了Teach VLM，这是一个核心模型，旨在通过从演示视频中提取和分析与操作相关的关键帧，将移动屏幕轨迹转化为逐步操作知识。为解决对齐训练数据稀缺的问题，我们开发了一个系统性的数据飞轮以实现可扩展的数据采集。我们进一步引入了一个新颖的中文移动屏幕教学基准用于细粒度评估。基于Teach VLM，我们提出了Teach-and-Repeat范式，其中生成的操作知识作为可解释的程序化参考，指导下游基于屏幕的执行智能体。大量评估表明，Teach VLM显著优于强VLM基线，在操作语义预测中达到了最先进的性能。此外，在Android World中的实验表明，我们的范式为下游智能体带来了持续的任务成功率提升。Teach VLM和Teach-and-Repeat范式共同提供了一条从原始演示到可复用任务自动化的实用路径。

英文摘要

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

2606.12733 2026-06-15 cs.LG 新提交

Let's Ask Gauss: Improved One-Run Privacy Auditing

让我们问高斯：改进的单次运行隐私审计

Adya Agrawal, Yu Wei, Jaspal Singh, Malik Magdon-Ismail, Vassilis Zikas

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Rensselaer Polytechnic Institute（伦斯勒理工学院）； Purdue University（普渡大学）

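AI summary: Proposes a differential-privacy auditing framework based on Gaussian asymptotic distributions that uses normalized sums of canary alignment signals in white-box DP-SGD to obtain tighter privacy lower bounds from a single training run.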

2606.12728 2026-06-15 cs.RO cs.CV cs.LG 新提交

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

EquiDexFlow: 基于接触的SE(3)-等变灵巧抓取生成流

Clinton Enwerem, John S. Baras, Calin Belta

发表机构 * Institute for Systems Research, University of Maryland, College Park（马里兰大学帕克分校系统研究所）

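AI summary: Proposes EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces; by projecting contacts onto the object surface and constraining forces to the Coulomb friction cone it yields physically stable grasps, achieving zero friction violations and the best composite score on the 16-DoF Allegro Hand.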

Comments 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io

AI中文摘要

大多数学习型灵巧抓取生成器将接触力降级为下游验证步骤，因此运动学上可行的姿态仍可能违反稳定物理抓取的条件。我们通过EquiDexFlow解决这一问题，这是一种SE(3)-等变流匹配模型，从物体点云联合预测腕部姿态、关节角度、指尖接触、表面法线和接触力。我们的架构通过构造将接触投影到物体表面并将力约束在库仑摩擦锥内，因此无需损失惩罚即可满足放置和摩擦合规性。我们证明了端到端SE(3)等变性，并在200次旋转上经验验证，腕部残差低于$0.04^\circ$且关节偏差严格为零。该模型在81个物体的8,100个力闭合抓取上训练，适用于16自由度Allegro手，在所有消融变体中实现了零摩擦违规、最佳综合分数和最低扳手残差。我们通过每指逆运动学将解码的指尖接触重新定位到16自由度LEAP手，我们的硬件可行优化将每个关节至少置于其执行器包络的5%以内，同时保持扳手平衡。在物理机器人上，重新定位的EquiDexFlow解码抓取在所有六个测试物体上完成了开环拾取和保持试验，每个非对称物体在标准姿态和$120^\circ$共旋转下均成功。视频、代码和检查点可在 https://equidexflow.github.io 获取。

英文摘要

Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
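The friction constraint mentioned above says that a contact force with normal component $f_n$ and tangential part $f_t$ must satisfy $\|f_t\| \le \mu f_n$. One common way to enforce this by construction is a Euclidean projection onto that cone, sketched below. The abstract does not specify which projection the model uses, so this is an assumed, generic version; the name `project_to_friction_cone` is hypothetical, `normal` is assumed to be a unit vector, and `mu` is assumed to be strictly positive.

```python
import math

def project_to_friction_cone(force, normal, mu):
    """Euclidean projection of a 3D force onto the cone ||f_t|| <= mu * f_n,
    where f_n is the component of the force along the unit contact normal."""
    f_n = sum(f * n for f, n in zip(force, normal))        # normal component
    f_t = [f - f_n * n for f, n in zip(force, normal)]     # tangential part
    t_norm = math.sqrt(sum(c * c for c in f_t))
    if t_norm <= mu * f_n:
        return list(force)                                 # already feasible
    if mu * t_norm + f_n <= 0:
        return [0.0, 0.0, 0.0]                             # closest point is the apex
    lam = (mu * t_norm + f_n) / (1.0 + mu * mu)            # projected normal magnitude
    # Rebuild the force on the cone boundary: normal part lam, tangential part mu * lam.
    return [lam * n + (mu * lam / t_norm) * t for n, t in zip(normal, f_t)]
```

Because every output lies inside the cone regardless of the input, a network that ends with such a layer needs no friction penalty in its loss, which matches the "by construction" framing in the abstract.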

2606.12476 2026-06-15 cs.LG cs.AI cs.CL 新提交

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

幻觉起始的快速检测：延迟界与学习型CUSUM统计量

Igor Itkin

发表机构 * Independent Researcher（独立研究员）

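AI summary: Formulates hallucination onset detection as a quickest change detection problem; using a first-order Markov model validated on RAGTruth, a learned CUSUM detects the onsets it catches in 11-13 tokens at a matched false-alarm rate, versus 31 for a linear baseline, exposing a delay structure that classification metrics conceal.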

Comments 16 pages, 1 figure. v2: added Discussion and Appendix; recall-honest framing; robustness analyses (k-NN divergence estimate, seed-averaged decomposition)

AI中文摘要

Token级幻觉检测器作为分类器进行评估，通过所有token的AUC，但流式监控器由其反应时间判断：从幻觉开始到警报之间的token数量。我们将幻觉起始检测表述为一个快速变化检测问题。在RAGTruth上验证的潜在忠实/幻觉状态的一阶马尔可夫模型，将任务置于经典变点理论中，并得出Lorden关于检测延迟的下界：在虚警率为0.01时约为1.3个token。然后我们证明，因果循环标注器充当了具有学习增量的CUSUM。在它捕获到的起始点中，它在11-13个token内完成检测，而线性每token基线为31个token；不过在该虚警预算下，每个检测器捕获的起始点都不足三分之一，计入召回的诚实延迟为56-66个token：低虚警率下的起始检测很困难。受控分解将速度优势主要归因于更好的每token得分，而非时间累积。Donsker-Varadhan型的信息率最优性定理解释了剩余的数量级差距：学习得分仅实现了特征携带散度的1/4.5，这一缺陷无法通过重新校准消除，其余部分为有限时域效应。分类指标掩盖了这种延迟结构；序列分析使其可测量。

英文摘要

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
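The CUSUM statistic referred to above accumulates a per-token increment, resets at zero, and raises an alarm when it crosses a threshold; detection delay is the gap between the true onset and that alarm. The sketch below shows the classical recursion with the increments supplied by the caller. In the paper the increment is learned, whereas here it is just a list of numbers, and the names `cusum_alarm` and `detection_delay` are hypothetical.

```python
def cusum_alarm(increments, threshold):
    """Return the first index at which the CUSUM statistic reaches the
    threshold, or None if it never does."""
    stat = 0.0
    for t, inc in enumerate(increments):
        # Clamp at zero so negative pre-change evidence cannot build up a debt.
        stat = max(0.0, stat + inc)
        if stat >= threshold:
            return t
    return None

def detection_delay(increments, threshold, onset):
    """Tokens between the onset index and the alarm; None when the onset
    is missed or the alarm fires before the onset (a false alarm)."""
    alarm = cusum_alarm(increments, threshold)
    if alarm is None or alarm < onset:
        return None
    return alarm - onset
```

Raising `threshold` lowers the false-alarm rate at the cost of a longer delay, which is the trade-off the delay bounds in the abstract quantify. Returning `None` for missed onsets keeps them visible instead of silently dropping them from the delay average.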

2606.12360 2026-06-15 cs.LG 新提交

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

后训练的解剖：利用可解释性表征数据并塑造学习信号

Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana

发表机构 * Stanford University（斯坦福大学）； Google Research（谷歌研究院）

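AI summary: Proposes an interpretability-based, data-centric post-training pipeline that forms statistical hypotheses about the latent concepts in preference data, enabling fine-grained feedback and reducing spurious correlations and undesirable behaviors.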

AI中文摘要

语言模型后训练是塑造模型行为的主要阶段，但它仍然主要涉及优化总结多样需求的标量奖励。这种抽象使从业者几乎无法了解数据实际教会了模型什么，导致模型学习虚假关联，并引发过度风格化和谄媚等不良行为。为了解决这个问题，我们提出：能否在优化之前检查偏好数据集，并在概念层面决定模型应该被允许学习哪些行为？受此启发，我们引入了一个以数据为中心的后训练流程，该流程使用可解释性协议来开发统计假设，以区分偏好和非偏好生成的潜在概念，使其明确以供细粒度用户反馈。基于这一观点，我们将几种基于可解释性的训练协议统一为通过特征或数据干预来塑造奖励的方式。实验上，我们表明我们的流程诊断了现有偏好数据中的不良信号，减轻了脱靶学习，并且还可以帮助放大或塑造期望的属性，如安全防护和模型个性。更广泛地说，我们的结果表明，可解释性可以将后训练从优化不透明的代理奖励转变为审计和塑造学习信号本身的过程。

英文摘要

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

2606.12349 2026-06-15 cs.RO cs.SY eess.SY 新提交

Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles

面向无人水面艇操纵性评估的海洋机器人Unity仿真器中可追溯虚拟海试

Paria Rezayan

发表机构 * School of Engineering and Built Environment, Sheffield Hallam University（谢菲尔德哈勒姆大学工程与建筑环境学院）

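AI summary: To address the difficulty of obtaining data for identifying USV hydrodynamic derivatives, establishes a standardised virtual sea trial framework in the MARUS simulator; automated TC/ZZ manoeuvre execution with a data acquisition and post-processing pipeline produces repeatable datasets aligned with IMO/ITTC metrics, and a case study demonstrates the framework.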

AI中文摘要

精确识别水动力导数对于无人水面艇（USV）的控制与导航至关重要，但物理海试的高保真操纵数据受成本和安全性限制。回转试验（TC）和Z形试验（ZZ）仍是IMO和ITTC评估程序的基础。本文扩展了海洋机器人Unity仿真器（MARUS），引入标准化虚拟海试框架，用于TC/ZZ机动的自动化执行和数据生成，包括可追溯的命令-执行日志记录、面向系统辨识（SI）的数据调理以及自动提取符合IMO/ITTC的操纵性指标。一个关键贡献是专用的TC/ZZ数据采集和后处理管道，提高了基于仿真的机动的可重复性和可审计性，同时生成适用于水动力导数辨识和数字孪生工作流的SI就绪数据集。另一个特点是差动推力转向的显式命令-执行分离，其中输入记录为有序的等效舵命令，而实际执行则记录为基于施加推力的执行级代理。案例研究结果表明了可重复且合规的机动行为。对于TC试验，左舷和右舷之间的归一化进距差异约为3.9%，战术直径差异约为4.6%至4.7%。对于ZZ试验，±10度和±20度机动下的第一和第二超越角超调量均保持在1度以下，满足IMO标准，而峰值偏航速率约为4.1至5.8度/秒。总体而言，该框架提供了一种可重复且可审计的虚拟海试工作流，用于生成符合IMO/ITTC的数据集，并支持系统辨识、水动力导数估计和数字孪生校准。

英文摘要

Accurate identification of hydrodynamic derivatives is essential for precise control and autonomous navigation of Unmanned Surface Vehicles (USVs). However, acquiring high-fidelity manoeuvring data from physical sea trials is often constrained by cost, safety, and environmental disturbances. Standard manoeuvring trials, particularly Turning Circle (TC) and Zig-Zag (ZZ), remain fundamental to IMO and ITTC assessment procedures because they provide comparable performance metrics reflective of underlying hydrodynamic behaviour. This paper extends the open-source Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres. The framework provides traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. The framework also provides explicit command-execution separation for differential-thrust steering, where manoeuvre inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case study results demonstrate repeatable and IMO-compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9% between port and starboard turns, while the tactical diameter differs by 4.6-4.7%. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/-10-degree and +/-20-degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 degrees/second.


Orchestra-o1: Omnimodal Agent Orchestration

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

History of the Muddy Children Puzzle

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Improving Robotic Generalist Policies via Flow Reversal Steering

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

MiniMax Sparse Attention

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

Order Is Not Control: Driven-Dissipative Response Laws Across Artificial and Biological Systems

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

Let's Ask Gauss: Improved One-Run Privacy Auditing

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles