arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13722 2026-06-15 cs.AI cs.MA New submission

YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

Jory He

Affiliations: Yeasier AI

AI summary: Proposes the YeasierAgent paradigm, which uses platform-agnostic interactive units and spatial multi-agent collaboration to build symbiotic agent-native applications quickly across platforms, unifying emotional companionship and tool execution.

Abstract

This paper introduces YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. It challenges the conventional device-coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross-platform construction of agent-native applications by utilizing platform-agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user-created worlds, and spatial multi-agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent-Native Applications, demonstrating a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments.

2606.13720 2026-06-15 cs.AI New submission

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

Elisabetta Rocchetti, Alfio Ferrara

Affiliations: Department of Computer Science, Università degli Studi di Milano

AI summary: Compares DiM and INLP for steering refusal in safety fine-tuned chat models; INLP counterfactual flipping matches DiM directional ablation, nullspace projection is weaker, and the two INLP interventions land in different regions of activation space.

Abstract

Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP), namely nullspace projection and counterfactual flipping, on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite, an intriguing distinction that warrants further investigation in future work.
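The two DiM interventions named in the abstract can be illustrated on toy data. The sketch below is not the authors' code: the array shapes, the synthetic clusters, and the function names are assumptions made for illustration, and the INLP interventions are not covered.

```python
import numpy as np

def dim_direction(harmful, harmless):
    """Unit difference-in-means direction between two sets of activations."""
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def directional_ablation(x, r):
    """Remove the component of each activation along the unit vector r."""
    return x - np.outer(x @ r, r)

def activation_addition(x, r, alpha):
    """Shift each activation by alpha along the unit vector r."""
    return x + alpha * r

# Toy activations (assumption): two Gaussian clusters that differ along axis 0.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(200, 16))
harmful = rng.normal(size=(200, 16)) + 3.0 * np.eye(16)[0]

r = dim_direction(harmful, harmless)
ablated = directional_ablation(harmful, r)       # projection onto r becomes zero
steered = activation_addition(harmless, r, 2.0)  # projection onto r grows by 2
```

Ablation zeroes the projection onto the direction, whereas addition shifts it by a fixed amount, so they are different operations on the same vector.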

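2606.13715 2026-06-15 cs.AI cs.CL cs.MA New submission

WorkBench Revisited: Workplace Agents Two Years On

Olly Styles

Affiliations: GitHub

AI summary: Re-evaluates agent progress on the WorkBench benchmark from 2024 to 2026, finding that frontier models have improved markedly in both capability and safety, while open-weight models have lowered the cost of reaching high performance.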

Comments 8 pages, 3 figures. Follow-up to arXiv:2405.00823

Abstract

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

2606.13714 2026-06-15 cs.CV New submission

TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen, Nghi D. Q. Bui, Anh Nguyen, Long Mai, Ngan Le

Affiliations: University of Arkansas, USA; Max Planck Research School for Intelligent Systems; Google Research, Google; University of Liverpool, UK; Adobe Research

AI summary: Proposes Temporal Slot Activation (TSA), which learns a per-slot, per-frame activation score to model the lifecycle of persistent slots, addressing the state drift and reconstruction interference caused by unconditional propagation and improving object decomposition and temporal identity preservation on multiple benchmarks.

Abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
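One way to picture the two uses of the activation score is the sketch below. The abstract does not give the exact formulas, so the convex-combination update and the `log(alpha)` bias are assumptions chosen for illustration, as are the function names and array shapes.

```python
import numpy as np

def gated_slot_update(prev_slots, candidate_slots, alpha):
    """Blend candidate and previous slot states; alpha near 0 keeps the old state.

    prev_slots, candidate_slots: (K, D) arrays; alpha: (K,) activations in (0, 1).
    Assumed form: s_t = alpha * candidate + (1 - alpha) * s_{t-1}.
    """
    a = alpha[:, None]
    return a * candidate_slots + (1.0 - a) * prev_slots

def biased_slot_softmax(logits, alpha, eps=1e-6):
    """Softmax over slots after adding an activation-dependent bias to the logits.

    logits: (N, K) decoder attention logits of N queries over K slots.
    Assumed bias: log(alpha), so inactive slots receive almost no attention.
    """
    biased = logits + np.log(alpha + eps)[None, :]
    biased = biased - biased.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(biased)
    return weights / weights.sum(axis=1, keepdims=True)

alpha = np.array([0.999, 0.001])  # slot 0 active, slot 1 inactive
prev = np.zeros((2, 3))
cand = np.ones((2, 3))
new_slots = gated_slot_update(prev, cand, alpha)   # slot 1 stays near its old state
attn = biased_slot_softmax(np.zeros((4, 2)), alpha)  # slot 1 gets little attention
```

With equal raw logits, the inactive slot ends up with well under 1% of the attention mass and its state barely moves, which matches the two effects the abstract attributes to the gating.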

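2606.13712 2026-06-15 cs.SD cs.CL New submission

Multimodal Speaker Identification in Classroom Environments

Michael L. Chrzan, Meghavarshini Krishnaswamy, Robert Gibboni, Katie Wetstone, Wei Ai, Jing Liu

Affiliations: University of Michigan; University of Pennsylvania; University of Maryland

AI summary: To address the low accuracy of acoustic-only models caused by classroom background noise and variable child speech, proposes a multimodal framework that fuses acoustic embeddings with LLM-derived semantic context, raising student identification accuracy from 39.0% to 50.3%, with 76.9% accuracy on long utterances and 99.3% accuracy on role discrimination.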

Comments 9 pages, 5 tables, 3 figures

Abstract

Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.

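2606.13707 2026-06-15 cs.AI cs.CL cs.CV New submission

Orchestra-o1: Omnimodal Agent Orchestration

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

Affiliations: CUHK; LIGHTSPEED; PKU; THU; Tongji University

AI summary: Proposes Orchestra-o1, an omnimodal agent orchestration framework whose unified orchestration mechanism enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution; it beats the second-best approach by 10.3% accuracy on the OmniGAIA benchmark, and the DA-GRPO reinforcement learning method is introduced to train Orchestra-o1-8B, which achieves the best performance among open-source omnimodal agents.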

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

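2606.13705 2026-06-15 cs.LG cs.AI New submission

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald

Affiliations: Edgerunner AI

AI summary: Finds that Gemma 4 models fall into repetition loops at rates as high as 95% on long factual enumeration tasks, localizes the cause to a small number of MLP neurons via per-layer ablation and per-neuron attribution, and removes the loops with static weight edits (as small as a single sign-inverted neuron), although the edits cannot resolve "doom loops" caused by missing knowledge.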

Abstract

Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These "surgeries" can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.
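To make "a single sign-inverted neuron" concrete, here is a minimal sketch on a toy two-layer MLP. It assumes that inverting a neuron means negating that neuron's row in the output projection; the abstract does not say which weights the authors edit, and the shapes, names, and ReLU activation are hypothetical.

```python
import numpy as np

def mlp(x, w_in, w_out):
    """Toy MLP block: ReLU hidden layer followed by a linear output projection."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def invert_neuron(w_out, idx):
    """Return a copy of w_out with the output row of hidden neuron idx negated."""
    edited = w_out.copy()
    edited[idx, :] *= -1.0
    return edited

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w_in = rng.normal(size=(4, 6))
w_out = rng.normal(size=(6, 3))

w_edited = invert_neuron(w_out, idx=2)  # static edit: no retraining involved
hidden = np.maximum(x @ w_in, 0.0)
delta = mlp(x, w_in, w_edited) - mlp(x, w_in, w_out)
```

The edit changes the block's output by exactly minus twice the contribution of the chosen neuron and leaves every other neuron untouched, which is why such an edit can be both tiny and targeted.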

2606.13703 2026-06-15 cs.AI cs.GL cs.LO New submission

History of the Muddy Children Puzzle

Hans van Ditmarsch

Affiliations: CNRS, France; IIT Kanpur, India

AI summary: Traces the origin of the Muddy Children Puzzle over the past two centuries, and presents its variations and a novel hats puzzle involving self-reference.

Abstract

The Muddy Children Puzzle is a puzzle about knowledge and ignorance that has inspired the development of epistemic logic. Who came up with it first? This is unclear. We trace the origin of the Muddy Children Puzzle through logical and literary publications over the past two centuries. The puzzle has inspired numerous variations, such as ones involving numbers or coloured hats. We also present a novel hats puzzle involving self-reference.

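2606.13686 2026-06-15 cs.CL cs.CY New submission

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

Zijing Shi, Meng Fang, Ling Chen

Affiliations: AAII, University of Technology Sydney; University of Liverpool

AI summary: Proposes the WebDecept framework, which injects seven common deceptive interface patterns into e-commerce environments to test the safety of multimodal web agents, finding that current agents are highly susceptible and that prompt-based constraints are insufficient.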

Comments Accepted to ACL 2026

Abstract

As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.

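2606.13685 2026-06-15 cs.CL cs.AI New submission

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

Abel Yagubyan

Affiliations: Independent Researcher

AI summary: Studies the unreliability of LLM-as-a-Judge under repeated evaluation, finding an average preference flip rate of 13.6% and a position bias, and recommends multi-trial aggregation and uncertainty reporting.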

Comments 24 pages, 7 figures

Abstract

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19-0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise-pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
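The reliability quantities in this abstract can be computed from raw verdicts with a few lines of code. The sketch below makes one assumption: the flip rate is taken as the share of trials that disagree with the majority verdict for a question, which may differ from the paper's exact definition. Cohen's kappa is the standard chance-corrected agreement statistic, and the verdict lists are invented for illustration.

```python
from collections import Counter

def majority(verdicts):
    """Most frequent verdict among repeated trials (ties go to the first seen)."""
    return Counter(verdicts).most_common(1)[0][0]

def flip_rate(verdicts):
    """Share of trials that disagree with the majority verdict (assumed definition)."""
    m = majority(verdicts)
    return sum(v != m for v in verdicts) / len(verdicts)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two judges' verdict lists."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

trials = ["A"] * 36 + ["B"] * 14   # 50 repeated pairwise trials for one question
judge_1 = ["A", "A", "B", "B"]     # majority verdicts of judge 1 on 4 questions
judge_2 = ["A", "B", "B", "B"]     # majority verdicts of judge 2 on the same questions

rate = flip_rate(trials)                 # 14 of 50 trials disagree with "A"
kappa = cohens_kappa(judge_1, judge_2)   # observed 0.75, expected 0.5
```

In the toy data the judges agree on 3 of 4 questions, yet kappa is only 0.5 because half of that agreement is expected by chance. The abstract's 76% agreement with kappa 0.51 reflects the same kind of gap.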

2606.13683 2026-06-15 cs.AI cs.CL New submission

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

Affiliations: School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University; Pengcheng Laboratory

AI summary: Proposes UP-NRPA, an online framework that uses large language models and user portraits to adapt dialogue strategies in real time without offline reinforcement learning, achieving a 100% success rate on multiple tasks in collaborative and non-collaborative dialogue benchmarks and a 56.41% increase in the sale-to-list ratio on negotiation tasks.

Abstract

Current dialogue policy planning methods struggle to adapt dynamically to diverse user characteristics. To address this, the paper proposes User Portrait based Nested Rollout Policy Adaptation (UP-NRPA), an online framework built on Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA customizes dialogue strategies dynamically through an adaptive mechanism. It does so by leveraging real-time user feedback alongside the personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. On collaborative and non-collaborative dialogue benchmarks, UP-NRPA achieved a 100% success rate in multiple dialogue tasks. In negotiation tasks in particular, the sale-to-list ratio (SL) increased by 56.41%. These results indicate that UP-NRPA can adapt to diverse user needs without requiring a training mechanism.

2606.13682 2026-06-15 cs.AI cs.LG New submission

A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

Faezeh Ardali, Mwembezi A. Nyelele, Gerald M. Knapp

Affiliations: Louisiana State University; University of Minnesota Duluth

AI summary: Proposes a scheduling policy based on a Transformer encoder-decoder architecture that takes only the processing-time matrix as input; trained on small Taillard instances, it generalizes directly to large 40x40 to 100x100 problems and is competitive with classical dispatching rules.

Abstract

The open shop scheduling problem (OSSP) arises in many industrial and service settings but remains computationally challenging as the number of jobs and machines increases. While exact methods quickly become intractable, classical dispatching rules and metaheuristics may require substantial tuning to maintain solution quality at large scales. This study develops a Transformer-based scheduling policy for OSSP using an encoder-decoder architecture with multi-head attention. The model is trained on Taillard benchmark instances (4x4, 5x5, 7x7, and 10x10) using only the processing-time matrix as input and produces feasible schedules with makespans typically within 15-30% of best-known values. To evaluate scalability, the trained policy is applied without retraining to randomly generated instances from 40x40 to 100x100 and compared against classical dispatching heuristics, including SPT, LPT, MWKR, and EST. Across these large instances, the Transformer achieved average gaps of 12.89-15.12% relative to a standard lower bound. Compared with EST, the Transformer remained competitive, typically within a modest margin, while substantially outperforming SPT and LPT. These results indicate that a Transformer policy trained on small OSSP instances can generalize to substantially larger problems and provide a feature-light, learning-based alternative to classical dispatching rules.
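The gap to a lower bound reported above can be reproduced for any schedule with a short script. The sketch below uses the common open-shop bound (the larger of the longest job and the busiest machine) and a simple earliest-start greedy rule as the baseline. Both are stand-ins: the abstract does not spell out which lower bound or which heuristic implementations the authors use, and the instance is invented.

```python
def lower_bound(p):
    """Common open-shop bound: max over job totals and machine totals of p[j][k]."""
    job_load = max(sum(row) for row in p)
    machine_load = max(sum(col) for col in zip(*p))
    return max(job_load, machine_load)

def greedy_earliest_start(p, longest_first=True):
    """Greedy dispatching: schedule the operation that can start earliest.

    Ties are broken by processing time (longest first by default).
    Returns the makespan of the resulting feasible schedule.
    """
    n, m = len(p), len(p[0])
    job_free = [0] * n      # time at which each job is next available
    mach_free = [0] * m     # time at which each machine is next available
    remaining = {(j, k) for j in range(n) for k in range(m)}
    while remaining:
        def key(op):
            j, k = op
            start = max(job_free[j], mach_free[k])
            tie = -p[j][k] if longest_first else p[j][k]
            return (start, tie, j, k)
        j, k = min(remaining, key=key)
        end = max(job_free[j], mach_free[k]) + p[j][k]
        job_free[j] = mach_free[k] = end
        remaining.remove((j, k))
    return max(job_free)

def gap_percent(makespan, bound):
    """Relative gap of a makespan to the lower bound, in percent."""
    return 100.0 * (makespan - bound) / bound

times = [[3, 2], [2, 3]]   # toy 2x2 processing-time matrix (jobs x machines)
bound = lower_bound(times)
makespan = greedy_earliest_start(times)
gap = gap_percent(makespan, bound)
```

On this toy instance the greedy schedule meets the bound, so the gap is zero. On large random instances the same `gap_percent` measure is what allows a learned policy and the dispatching rules to be compared without knowing the optimum.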

2606.13679 2026-06-15 cs.CV New submission

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

Affiliations: CUHK MMLab; Meituan; CUHK IMIXR

AI summary: Proposes InterleaveThinker, the first multi-agent pipeline that gives existing image generators interleaved generation capabilities through planner and critic agents, and uses GRPO to reinforce single-step instruction correction, significantly improving generation performance.

Comments Project Page: https://zhengdian1.github.io/InterleaveThinker-proj/ Code: https://github.com/zhengdian1/InterleaveThinker

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

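2606.13675 2026-06-15 cs.RO New submission

Improving Robotic Generalist Policies via Flow Reversal Steering

Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine

Affiliations: Stanford University; UC Berkeley

AI summary: Proposes Flow Reversal Steering (FRS), which passes suboptimal actions through the flow policy in reverse to find their latent noises and maps them to the generalist policy's action modes, improving zero-shot control, behavioral cloning, and reinforcement learning.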

Abstract

Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but "reasonable" actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions, showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
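The idea of passing an action through a flow policy in reverse can be shown with a toy ODE. In the sketch below a hand-written velocity field stands in for the learned network and plain Euler integration stands in for the sampler; both are assumptions, as are all names, and the real method operates on a trained flow-matching policy.

```python
import numpy as np

def integrate(x, velocity, t0, t1, steps=1000):
    """Euler integration of dx/dt = velocity(x, t) from t0 to t1 (t1 < t0 runs backwards)."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

# Toy velocity field (assumption): pulls samples toward a fixed target action.
target = np.array([1.0, -2.0])

def velocity(x, t):
    return target - x

noise = np.array([0.3, 0.7])
action = integrate(noise, velocity, 0.0, 1.0)      # forward pass: noise -> action
recovered = integrate(action, velocity, 1.0, 0.0)  # reverse pass: action -> latent noise
```

Running the same field backwards in time approximately recovers the starting noise. This invertibility is what allows a given action to be assigned a latent noise in the first place.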

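2606.13662 2026-06-15 cs.AI cs.CL New submission

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

Affiliations: Department of Computer Science and Technology, Tsinghua University; Zhipu AI

AI summary: Proposes EurekAgent, an environment engineering framework that engineers the environment along four dimensions (permissions, artifacts, budget, and human-in-the-loop) and sets new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, including 26-circle packing results found at under $11 in total API cost.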

Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

2606.13657 2026-06-15 cs.LG 新提交

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

密集监督,稀疏更新:论策略蒸馏的稀疏性与几何结构

Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) National Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) Amap, Alibaba Group(阿里巴巴集团高德地图)

AI总结 本文分析策略蒸馏(OPD)中参数更新的稀疏性和几何特性,发现更新稀疏且集中于小权重坐标,并验证了稀疏子网络的有效性。

Comments Code is available at https://github.com/SydCS/OPD-Param-Analysis

详情
AI中文摘要

策略蒸馏(\textsc{OPD})最近成为一种重要的后训练方法,因为它结合了两个理想的要素:策略学生轨迹和密集教师监督,但这种混合如何改变模型参数仍不清楚。在多个语言和视觉-语言模型对及用例中,我们的分析得出两个主要发现。关于稀疏性,\textsc{OPD}风格的更新小且坐标稀疏。它们分布在各层,通常以前馈网络(FFN)为主。这种稀疏结构在操作上有用:仅训练发现的子网络几乎能恢复完整\textsc{OPD}的性能。然而,在我们的优化器消融实验中,诱导稀疏性的SGD优化器表现不如AdamW,可能是因为密集教师监督保留了异质的坐标梯度尺度,而AdamW的自适应缩放仍然有用。关于几何结构,更新在数值上是满秩的,但谱集中;它们主要位于源权重的奇异子空间之外,并且不成比例地落在源权重接近零的坐标上。这些发现表明,密集教师监督并不会使\textsc{OPD}变成普通的密集参数重写;相反,\textsc{OPD}保留了策略后训练的重要几何特征。

英文摘要

On-policy distillation (\textsc{OPD}) has recently become a prominent post-training recipe by combining two desirable ingredients: on-policy student trajectories and dense teacher supervision. However, how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and \textsc{OPD} use cases, our analysis yields two main findings. On sparsity, \textsc{OPD} updates are small and coordinate-sparse. They are distributed across layers, with the largest relative movement usually appearing in FFN modules. This sparse structure is operationally useful: training only the discovered subnetwork nearly recovers full-training performance. The sparse support does not remove the need for adaptive optimization: SGD, previously reported to be competitive in \textsc{RLVR}, underperforms AdamW in our \textsc{OPD} optimizer ablation, suggesting that dense teacher supervision preserves useful momentum structure and heterogeneous second-moment scales. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn \textsc{OPD} into ordinary dense parameter rewriting; instead, \textsc{OPD} retains important geometric signatures of on-policy post-training.
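
The two measurements the abstract describes (coordinate sparsity of the update, and how much of the update lands on near-zero source weights) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `update_stats`, the change tolerance `tol`, and the "small weight" threshold `small` are all assumptions chosen for illustration.

```python
def update_stats(source, updated, tol=1e-6, small=0.05):
    """Summarise a parameter update given flat lists of weights.

    Returns (sparsity, small_share):
      sparsity    - fraction of coordinates whose change is at most `tol`
      small_share - among changed coordinates, the fraction whose source
                    weight has magnitude below `small`
    Both thresholds are hypothetical; the paper may define them differently.
    """
    changed = [i for i, (s, u) in enumerate(zip(source, updated))
               if abs(u - s) > tol]
    sparsity = 1.0 - len(changed) / len(source)
    if not changed:
        return sparsity, 0.0
    on_small = sum(1 for i in changed if abs(source[i]) < small)
    return sparsity, on_small / len(changed)


# Toy example: only the two near-zero coordinates move.
source = [0.5, -0.2, 0.0, 1.0, 0.01, -0.8]
updated = [0.5, -0.2, 0.002, 1.0, 0.012, -0.8]
sparsity, small_share = update_stats(source, updated)
```

In this toy case four of six coordinates are untouched, and every changed coordinate has a near-zero source weight, which is the qualitative pattern the abstract reports.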

2606.13626 2026-06-15 cs.SD cs.LG 新提交

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

巴赫风格符号音乐的生成建模:自回归、潜变量和对抗方法的比较研究

Dezhi Yu, Kyuil Lee, Yongkang Huang

发表机构 * Stanford University(斯坦福大学)

AI总结 比较自回归LSTM、潜变量模型和生成对抗网络在巴赫风格钢琴音乐生成中的表现,发现带注意力的自回归LSTM生成音乐最连贯,向量量化缓解后验塌陷,对抗方法捕捉局部音高但训练困难。

Comments 11 pages, 13 figures. All authors contributed equally

详情
AI中文摘要

我们使用共享的MIDI语料库和三个模型家族研究巴赫风格符号钢琴音乐的生成建模:带注意力的自回归LSTM、包括循环VAE和向量量化VAE的潜变量模型,以及生成对抗网络。我们比较它们对复调音符序列建模、学习有用潜在表示以及生成风格连贯作品的能力。实验表明,带注意力的自回归LSTM生成最音乐连贯的样本,而向量量化有助于缓解后验塌陷,并产生比传统循环VAE更结构化的输出。对抗方法捕捉局部音高模式,但训练困难且对巴赫风格的泛化可靠性较低。这些结果突出了自回归、潜变量和对抗方法在符号音乐生成中的相对优势和失败模式。

英文摘要

We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

2606.13589 2026-06-15 cs.LG cs.AI 新提交

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

单纯形约束的稀疏装袋:集成学习中从均匀先验到稀疏后验的转变

Meher Sai Preetam, Meher Bhaskar

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出SCSB框架,通过最小化袋外损失在概率单纯形上联合优化集成剪枝与校准,引入凹二次惩罚解决L1单纯形悖论,实现高达96%的压缩并提升校准性能。

Comments 6 pages, 3 tables

详情
AI中文摘要

我们提出单纯形约束的稀疏装袋(SCSB),一个用于基于自助法的装袋集成后训练压缩和概率校准的数学严格框架。标准装袋集成(如随机森林、装袋SVM和装袋神经网络)赋予所有组成估计器均匀的投票权。然而,这种朴素的均匀先验忽略了基估计器不同的局部能力,并导致模型过度自信。我们将集成剪枝和校准表述为在概率单纯形上的联合优化问题,通过最小化袋外(OOB)损失。为了诱导稀疏性,我们通过引入凹二次惩罚来解决理论上的“L1单纯形悖论”——即L1范数在单纯形上为常数且无法剪枝的数学现实。SCSB是模型无关的,实现了高达96%的集成压缩,带来线性推理加速和优越的概率校准(降低期望校准误差),同时保持或提升泛化精度。

英文摘要

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
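
The "L1-simplex paradox" mentioned above can be checked numerically: every weight vector on the probability simplex has L1 norm 1, so an L1 penalty cannot distinguish sparse from dense weightings, whereas a concave quadratic penalty is lowest at the vertices. The sketch below assumes the penalty has the form -λ‖w‖², which is one natural concave quadratic; the paper's exact penalty may differ, and the function names are hypothetical.

```python
def l1_norm(w):
    """L1 norm of a weight vector."""
    return sum(abs(x) for x in w)


def concave_penalty(w, lam=1.0):
    """Assumed concave quadratic penalty: -lam * ||w||_2^2.

    On the simplex this is minimised at a vertex (one estimator gets all
    the weight) and maximised at the uniform vector.
    """
    return -lam * sum(x * x for x in w)


# Three points on the 4-dimensional probability simplex.
uniform = [0.25, 0.25, 0.25, 0.25]   # standard bagging: equal votes
mixed = [0.7, 0.2, 0.1, 0.0]         # partially pruned ensemble
vertex = [1.0, 0.0, 0.0, 0.0]        # fully pruned to one estimator

for name, w in [("uniform", uniform), ("mixed", mixed), ("vertex", vertex)]:
    print(name, round(l1_norm(w), 12), round(concave_penalty(w), 4))
```

The L1 column is the same for all three vectors, while the penalty column decreases as weight concentrates on fewer estimators, which is why adding such a term to the OOB loss can push weights to exactly zero.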

2606.13556 2026-06-15 cs.AI cs.HC q-bio.BM q-bio.GN q-bio.MN 新提交

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

是你还是你的环境?一种用于基因组锚定的个性化生理解释的贝叶斯推理框架

Aruna Dey, Suraj Biswas

发表机构 * Dots-In

AI总结 提出一种贝叶斯推理框架,利用基因组先验解决个性化健康AI的冷启动问题,通过基因组锚定分离生理信号的体质与环境成分,并随数据积累动态更新。

Comments 24 pages, 8 figures, 3 tables. Conceptual framework paper. Updated version with revised section structure and formatting

详情
AI中文摘要

个性化健康AI系统面临一个根本性的冷启动问题:用于生理解释的机器学习模型需要数周的个人行为数据,才能区分体质变异与环境引起的偏差。我们提出一种基于因果推断和贝叶斯先验设计的解决方案。个体的基因组图谱作为外源性遗传锚点——一个领域信息化的个性化先验,在受孕时固定,不受反向因果影响,且在收集任何行为观测之前即可获得。该锚点初始化个体生理设定点G-hat = mu + sum(beta_i * g_i)上的贝叶斯信念状态,其中beta_i是GWAS衍生的效应大小,g_i是风险等位基因计数。每次传入的生理测量P产生一个非体质偏差delta = P - G-hat,将可归因于环境和状态的部分与体质固定的基线分离。随着行为数据的积累,先验根据G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t衰减,从基因组主导过渡到经验基线主导的推理。同一个观测到的HRV 55 ms,对于先验预测80 ms的人产生抑制假设,而对于先验预测30 ms的人产生增强假设——没有个性化锚点,这种反转是不可能的。我们在六个生理领域开发了这一架构,根据证据强度对基因组先验进行分级,区分稳健复制的锚点(FTO、FADS1/2、FKBP5)和有争议的候选基因(SLC6A4、MAOA、DRD2)。我们讨论了关联、孟德尔随机化和个体因果推断之间的推理边界,并定义了部署的四个约束:证据分级的先验、动态衰减、祖先匹配的效应大小以及归因而非确定性输出。

英文摘要

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
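
The three formulas in the abstract (the genomic set point, the non-constitutional deviation, and the decaying blend) translate directly into a short sketch. The abstract does not specify the form of w(t), so the exponential decay and its time constant `tau_days` below are assumptions, as are all function names and the example effect sizes.

```python
import math


def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i), as stated in the abstract."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def deviation(p, g_hat):
    """delta = P - G-hat: the part of a measurement not explained by the prior."""
    return p - g_hat


def decay_weight(t_days, tau_days=30.0):
    """ASSUMPTION: w(t) = exp(-t / tau). The abstract leaves w(t) unspecified."""
    return math.exp(-t_days / tau_days)


def blended_set_point(t_days, g_hat_genomic, p_bar_t, tau_days=30.0):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t."""
    w = decay_weight(t_days, tau_days)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t


# The abstract's HRV example: the same 55 ms reading, two different priors.
delta_a = deviation(55.0, 80.0)   # negative: suppression hypothesis
delta_b = deviation(55.0, 30.0)   # positive: enhancement hypothesis
```

With no behavioral data (t = 0) the blend equals the genomic prior; as t grows, the weight shifts toward the running empirical mean P-bar_t, matching the transition the abstract describes.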

2606.13464 2026-06-15 cs.CL cs.AI 新提交

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

本体记忆增强的ASR校正用于长文本-语音交错对话

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China(哈尔滨工业大学(深圳)计算与智能研究所) Shenzhen Loop Area Institute (SLAI), China(深圳环域研究所)

AI总结 提出本体记忆增强的ASR校正框架,通过动态更新本体记忆存储实体、术语、变体、混淆和语义关系,解决长文本-语音交错对话中的上下文校正问题,在RAMC-Corr数据集上优于直接校正。

详情
AI中文摘要

自动语音识别(ASR)校正传统上集中于孤立的话语或短局部上下文。然而,随着文本和语音在长交互中越来越交错,ASR校正需要对话级别的上下文证据。现有的ASR校正方法通常依赖于当前假设或拼接原始对话历史。在此类上下文中,稀疏的校正证据可能难以在冗余和噪声中定位。针对这些挑战,我们提出了一种本体记忆增强的ASR校正框架,用于长文本-语音交错对话。该框架将先前的交互历史组织成动态可更新的本体记忆,其中实体、术语、表面变体、潜在ASR混淆和语义关系作为可检索节点存储,用于上下文基础的校正。为了评估这一设置,我们构建了RAMC-Corr,一个源自MAGIC-RAMC的数据集,用于具有基础上下文的长距离ASR校正。在RAMC-Corr上的实验表明,我们的方法在10个配对骨干-设置组合中的9个上优于直接校正,并鼓励对上下文相关的ASR错误进行更具选择性和证据基础的校正。

英文摘要

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

2606.13392 2026-06-15 cs.AI 新提交

MiniMax Sparse Attention

MiniMax 稀疏注意力

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Jinkai Hu, Jiayao Li, Rui Gao, Zekun Li, Songquan Zhu, Jingkai Zhou, Pengyu Zhao

发表机构 * MiniMax Peking University(北京大学) NVIDIA(英伟达) Zhejiang University(浙江大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出 MiniMax 稀疏注意力(MSA),一种基于分组查询注意力的块级稀疏注意力机制,通过轻量索引分支选择 Top-k 键值块,实现高效长上下文处理,在 109B 模型上以 1M 上下文减少 28.4 倍注意力计算,并带来 14.2 倍预填充和 7.6 倍解码加速。

Comments 30 pages, 14 figures

详情
AI中文摘要

超长上下文能力对于前沿大语言模型变得不可或缺:智能体工作流、仓库级代码推理和持久记忆都要求模型共同关注数十万到数百万个 token,然而 softmax 注意力的二次成本使得这在部署规模上难以实现。我们引入了 MiniMax 稀疏注意力(MSA),一种基于分组查询注意力(GQA)构建的块级稀疏注意力。一个轻量级索引分支对键值块进行评分,并为每个 GQA 组独立选择 Top-k 子集,从而实现组特定的稀疏检索,同时保持高效的块级执行;主分支则仅对选中的块执行精确的块稀疏注意力。MSA 的设计遵循简单和可扩展的原则,经过精心简化,使其能够在一系列 GPU 上高效部署。为了将稀疏性转化为实际加速,我们与 MSA 协同设计了 GPU 执行路径,该路径使用无指数 Top-k 选择和 KV 外部稀疏注意力,以在块粒度访问下提高张量核心利用率。在一个具有原生多模态训练的 109B 参数模型上,MSA 的性能与 GQA 相当,同时在 1M 上下文下将每个 token 的注意力计算减少了 28.4 倍。结合我们协同设计的内核,MSA 在 H800 上实现了 14.2 倍的预填充和 7.6 倍的解码端到端加速。我们的推理内核可在以下网址获取:https://github.com/MiniMax-AI/MSA。一个由 MSA 驱动的生产级原生多模态模型已在以下网址公开发布:https://huggingface.co/MiniMaxAI/MiniMax-M3。

英文摘要

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
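
The two-branch idea above (score key-value blocks, keep a Top-k subset per GQA group, then run exact attention over only those blocks) can be sketched in plain Python for a single query. This is an illustration of the general technique, not the released MSA kernel: the block scores are passed in directly rather than produced by a learned Index Branch, and both function names are hypothetical.

```python
import math


def select_topk_blocks(block_scores, k):
    """Return the indices of the k highest-scoring KV blocks, ascending."""
    ranked = sorted(range(len(block_scores)),
                    key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:k])


def block_sparse_attention(q, keys, values, block_size, selected_blocks):
    """Exact softmax attention for one query over the selected blocks only."""
    # Expand block indices into token positions, clipping the final block.
    idx = [t for b in selected_blocks
           for t in range(b * block_size,
                          min((b + 1) * block_size, len(keys)))]
    scale = math.sqrt(len(q))
    logits = [sum(qi * ki for qi, ki in zip(q, keys[t])) / scale for t in idx]
    m = max(logits)                      # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, idx)) / z
            for d in range(dim)]


# Each GQA group gets its own scores, so each may pick different blocks.
group_scores = {0: [0.1, 0.9, 0.3, 0.7], 1: [0.8, 0.2, 0.6, 0.1]}
selection = {g: select_topk_blocks(s, k=2) for g, s in group_scores.items()}
```

Because attention is computed over `k * block_size` tokens instead of the full sequence, per-query cost no longer grows with context length once k is fixed, which is the source of the compute reduction the abstract reports.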

2606.13221 2026-06-15 cs.LG 新提交

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

从不确定判断到校准排名:用于LLM评估的共形Elo估计

Bora Kargi, David Salinas

发表机构 * ELLIS Institute Tübingen(ELLIS 蒂宾根研究所) OpenEuroLLM

AI总结 提出一种两层次校准方法,通过局部不确定性传播和全局共形预测,将LLM-as-a-judge的Elo评分误差降至17.9 MAE,并提供无分布假设的置信区间。

详情
AI中文摘要

评估新的大型语言模型通常需要大规模且昂贵的人工标注。LLM作为评判者提供了一种更便宜的替代方案,但评判者评分存在系统误差——如位置偏差、自我偏好或不可传递性——这些误差可能导致最终排名严重失准。我们在两个互补层面上量化评判者与人类之间的分歧。在局部层面,我们通过将校准的获胜概率而非硬标签传播到Bradley-Terry过程中,从评判者自身的评分差异估计每场对战的不确定性。仅此一项就显著提高了Elo估计的准确性,在LMArena上对55个保留模型取平均时,LLM得出的评分与人类得出的评分之间的平均绝对误差为17.9 Elo。在全局层面,我们将分裂共形预测应用于LLM得出的与人类得出的Elo评分之间的残差差距,产生具有无分布边际覆盖保证的预测区间,从而解释了不可约的LLM-人类分歧。这两层结合产生了一个低成本的评估工具,为开发者提供校准的Elo估计和诚实的置信区间,而无需大规模人工标注。为促进可重复性,我们在 https://github.com/kargibora/SoftElo 发布代码。

英文摘要

Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
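
The global layer described above is standard split conformal prediction, which can be sketched in a few lines. This is a generic illustration rather than the SoftElo code: it assumes the nonconformity score is the absolute gap between LLM-derived and human-derived Elo on a calibration set of models, and the function names and example residuals are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n, no finite interval reaches the requested coverage.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(residuals)[rank - 1]


def elo_interval(llm_elo, q):
    """Symmetric prediction interval around an LLM-derived Elo rating."""
    return (llm_elo - q, llm_elo + q)


# Hypothetical |LLM Elo - human Elo| gaps on nine calibration models.
calibration_gaps = [5, 12, 8, 30, 17, 9, 21, 14, 3]
q = conformal_quantile(calibration_gaps, alpha=0.5)
low, high = elo_interval(1200, q)
```

Under exchangeability of calibration and test models, an interval built this way covers the human-derived rating with probability at least 1 - alpha, without any assumption on the distribution of the gaps.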

2606.13119 2026-06-15 cs.LG cs.AI cs.NE 新提交

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

MP3:面向时空预测的多周期模式预训练

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) Eindhoven University of Technology(埃因霍温理工大学)

AI总结 针对时空数据中短窗口输入导致的时间幻象问题,提出多周期模式预训练插件MP3,通过多周期时间建模、空间建模和跨周期因果交互,提升现有STGNN的预测性能。

详情
AI中文摘要

时空预测在交通、气候和能源等多个领域至关重要。城市时空数据表现出时间幻象:相似的短窗口输入具有不同的未来趋势,反之亦然。现有的时空图神经网络(STGNN)无法有效识别此类幻象。我们认为核心原因在于短窗口输入具有不完整的周期观测、异质的全局空间相关性和跨周期叠加因果性。为弥补这一差距,我们开发了一种新颖的多周期模式预训练(MP3),这是一种用于区分时间幻象的即插即用预训练插件。MP3提出了两项核心创新:(1)多周期模式学习旨在从长时间序列中学习多周期模式。具体地,多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模使用瓶颈投影和全局记忆库来高效捕获异质的全局空间关系。跨周期模式交互采用因果增强的Transformer来捕获不同周期模式之间的依赖关系。(2)该插件可以无缝集成到现有的STGNN骨干中,以增强其预测性能。在五个真实世界数据集(包括大规模数据集CA)上的五个STGNN基线实验验证了MP3的有效性、优越的可扩展性和强适应性,其在所有评估基线上带来了一致且稳健的性能提升。平均而言,MP3将MAE降低了4.7%,RMSE降低了5.0%。代码可在 https://github.com/YAN-outlook/MP3 获取。

英文摘要

Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirages: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason is that short-window inputs suffer from incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) Multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) The plugin integrates seamlessly into existing STGNN backbones to strengthen their forecasting performance. Experiments with five STGNN baselines on five real-world datasets (including the large-scale dataset CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

2606.13054 2026-06-15 cs.LG cs.AI 新提交

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

TWLA:通过训练后量化实现大语言模型的三值权重和低位激活

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TWLA框架,通过后训练量化实现1.58位权重和4位激活,解决激活分布长尾问题,加速推理。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)展现出卓越的通用语言处理能力,但其内存和计算成本阻碍了部署。三值化已成为一种有前景的压缩技术,可显著降低模型大小和推理复杂度。然而,现有方法难以处理重尾激活分布,因此将激活保持在高精度,从根本上限制了端到端推理加速。为克服这一限制,我们提出TWLA,一种后训练量化(PTQ)框架,在保持高精度的同时实现1.58位权重压缩和4位激活量化。TWLA包含三个组件:(1)欧几里得到流形非对称三值量化器(E2M-ATQ),通过从欧几里得初始化到流形重定位的两阶段优化,最小化权重三值化下的层输出误差;(2)Kronecker正交三模态整形(KOTMS),应用Kronecker结构正交旋转将权重重塑为三值友好的三模态分布,同时共享旋转统计上抑制激活异常值;(3)层间感知激活混合精度(ILA-AMP),在位分配中显式引入相邻层二阶交互成本,并联合优化由共享正交变换引起的激活量化增益的层间差异,防止少数弱层触发级联效应。大量实验表明,TWLA在W1.58A4下保持高精度,同时实现显著的推理加速。代码见 https://github.com/Kishon-zzx/TWLA。

英文摘要

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at https://github.com/Kishon-zzx/TWLA.

2606.12994 2026-06-15 cs.LG cs.CE 新提交

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

DeepJEB++: 基于基础模型驱动的二维潜空间增强的大规模三维工程数据集

Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee, Sunwoong Yang, Hyogu Jeong, Namwoo Kang

发表机构 * Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology(韩国科学技术院赵春植移动研究生院) Department of Mechanical Engineering, Hanyang University(汉阳大学机械工程系) Narnia Labs(纳尼亚实验室)

AI总结 提出DeepJEB++框架,通过二维潜空间增强和基础模型,将少量喷气发动机支架种子设计扩展为大规模带仿真标签的三维数据集,实现40倍扩展。

Comments 16 pages, 14 figures. Submitted to ASME Journal of Mechanical Design

详情
AI中文摘要

数据驱动的工程设计受到缺乏大规模三维数据集的限制,这些数据集需要将几何形状与基于物理的性能标签配对。特别是,现有的三维数据增强技术在保留微妙且多样的几何变化方面存在局限性,并且自动化后续的仿真标注过程仍然困难,因为边界条件取决于生成的几何形状。我们提出了DeepJEB++,一个基础模型驱动的数据增强框架,在资源受限的情况下将少量喷气发动机支架种子设计扩展为大规模、带仿真标签的三维数据集。我们的关键思想是在数据丰富的二维潜空间中进行增强,然后转移到三维。在第一阶段,我们在多视图渲染上微调预训练的二维潜扩散模型,并通过潜插值合成新视图,通过视觉语言模型(VLM)质量过滤器保留可制造的设计。在第二阶段,经过验证的图像通过领域适应的生成基础模型提升为三维网格。在第三阶段,一个自动化流水线识别每个网格上的载荷和螺栓接口,并分配有限元标签——质量、应力和位移——无需人工干预。我们沿着三个内在轴评估增强质量:可制造性、相对于SimJEB真实值的标签保真度以及分布一致性。从少于400个种子设计开始,DeepJEB++在每阶段使用单个GPU的情况下,生成了15,360个带仿真标签的三维支架——实现了40倍的扩展。该数据集将公开提供,以支持可复现的工程AI研究。

英文摘要

Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

2606.12941 2026-06-15 cs.CL 新提交

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

当上下文分片到达时的多轮推理:可扩展的分片与记忆增强强化学习

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo

发表机构 * The University of Melbourne(墨尔本大学) Google Research Australia(谷歌澳大利亚研究院)

AI总结 针对多轮对话中信息碎片化导致LLM准确率下降65%的问题,提出通过训练模型维护紧凑滚动记忆而非增长历史来缓解,并引入低成本分片流水线将单轮QA转换为多轮碎片化情节,训练的记忆增强策略显著提升多轮准确率并零样本泛化到更难任务。

详情
AI中文摘要

当用户在多个对话轮次中透露任务关键信息时,尽管上下文完全可用,LLM的准确率下降高达65%。我们表明,这种“迷失在对话中”的退化可以通过训练模型维护紧凑的滚动记忆而不是关注增长的历史来大幅缓解。为了使这种训练可扩展,我们引入了一个低成本的分片流水线,将单轮QA数据集转换为多轮碎片化信息情节,消除了数小时手动标注的需求。仅在分片的GSM8K上训练,我们的记忆增强策略显著提高了多轮准确率,并零样本泛化到更难的数学和域外长上下文QA。此外,即使在测试时给定完整历史,记忆训练模型也优于全历史基线,这表明学习压缩比单独的全上下文暴露能诱导更稳健的增量推理。

英文摘要

When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.

2606.12923 2026-06-15 cs.LG cs.AI cs.CL 新提交

Order Is Not Control: Driven-Dissipative Response Laws Across Artificial and Biological Systems

秩序并非控制

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

发表机构 * Australian Broadcasting Corporation(澳大利亚广播公司)

AI总结 本文论证秩序不等于控制,提出接收器门控响应定律,并在生物、大语言模型、适配器和随机算子面板中验证,表明控制是局部的、可测量的。

Comments 52 pages, 7 figures, updated title

详情
AI中文摘要

AI对齐、可解释性、引导和神经扰动研究识别出诱导秩序的对象。我们认为秩序并非控制。控制需要接收器门控的响应定律:一个分母索引算子,将物质状态、动作/驱动、浴和接收器状态映射到响应位移、汇、努力和盆地投影。我们在生物、大语言模型、适配器和随机算子面板中识别出该定律。这些定律是局部的:干预可以被接纳、饱和、变号、泄漏或过驱动,取决于介质、浴、接收器状态、动作端口和比较器。当有限努力在相同分母下移动目标或结果读出类别,而损伤、无效/规避、无效格式、过驱动和不必要努力保持有界时,控制被分配。小鼠ALM、秀丽隐杆线虫和斑马鱼面板提供了物理响应算子证据,同时排除了坐标同一性和控制器结论。大语言模型面板展示了生成输出响应定律:在四种物质条件下,响应向量的分量符号预测准确率为72.8-73.7%,非零分量上提升至84.3-84.8%;留出观察者以93.6%和91.7%的准确率预测系统效应和目标/预言家族。宪法条件适配器将易感性重塑为制备介质,随机算子面板将测量机会与可部署行动策略分离。这给出了介观控制层面的驱动-耗散响应系统描述:驱动通过制备介质、浴和接收器作用,产生接纳运动、阻抗、汇或过驱动。证据支持局部接纳控制和可测量的随机响应算子,同时将可部署的预生成控制、隐藏/logit因果充分性、生物到LLM坐标同一性以及字面热力学量排除在范围之外。

英文摘要

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2606.12910 2026-06-15 cs.RO cs.AI cs.CV cs.SY eess.SY 新提交

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

边界框作为目标:通过神经符号规划实现语言条件抓取

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GRASP框架,利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测实现零样本桌面操作,无需任务特定训练。

Comments Project website: https://allisonandreyev.github.io/grasp.github.io/

详情
AI中文摘要

为了将机器人有效集成到家庭或工业环境中,机器必须实时适应自然语言提示。尽管视觉-语言模型(VLM)已在机器人任务与运动规划(TAMP)中实现零样本泛化,但当前最先进的方法通常计算量“沉重”或需要在数千个演示上进行大量训练。我们提出GRASP(基础推理与符号规划)框架,作为向开放词汇桌面操作迈进的一步。我们的方法利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测管道在物理世界中接地。与依赖固定颜色列表或硬编码坐标的方法不同,GRASP使机器人能够解释诸如“顶层架子”之类的抽象空间概念,并在无需额外微调的情况下执行任务。我们在三个难度级别的90次真实机器人试验中实现了73.3%的总体成功率,无需任务特定训练。

英文摘要

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

2606.12881 2026-06-15 cs.CL cs.LG 新提交

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

面向聊天机器人微调的直接偏好优化:一项实证研究

Dezhi Yu, Yvonne Qiu, ShuoJia Fu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文实证研究直接偏好优化(DPO)在聊天机器人微调中的应用,表明其简化训练流程、提升计算效率且性能有竞争力,但存在训练不稳定性。

Comments 7 pages, 3 figures, 1 table. All authors contributed equally

详情
AI中文摘要

我们提出了一种使用直接偏好优化(DPO)微调大型语言模型的方法,这是一种强化学习技术。我们的实验结果表明,DPO简化了训练流程,提高了计算效率,并实现了有竞争力的性能。使用BLEU、ROUGE和余弦相似度指标的评估表明,模型有效学习并收敛,尽管需要进一步研究以解决观察到的训练不稳定性。

英文摘要

We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.

2606.12817 2026-06-15 cs.AI 新提交

GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

GUITrans2Act:利用视觉语言模型从移动GUI交互中理解用户操作行为

Yudong Zhang, Lei Hu, Daoyang Liu, Jiawei Liu, Yangfan Luo, Zhilin Gao, Zuojian Wang

发表机构 * Honor Device Co., Ltd(荣耀终端有限公司) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出Teach VLM模型,通过从演示视频中提取关键帧生成操作知识,并构建数据飞轮解决训练数据稀缺问题;在基准测试中达到最优性能,并提升下游智能体的任务成功率。

Comments 20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Zuojian Wang, and Zhilin Gao are corresponding authors

详情
AI中文摘要

理解移动设备上的数字世界正从静态UI感知转向动态动作理解。这种能力使模型能够将视觉状态转换转化为操作知识,定义为描述动作类型、目标UI元素、文本参数和执行顺序的简短自然语言句子。然而,由于跨应用的UI设计高度多样化和异构,现有视觉语言模型(VLM)难以准确推断这些底层操作。为弥补这一差距,我们引入了Teach VLM,这是一个核心模型,旨在通过从演示视频中提取和分析与操作相关的关键帧,将移动屏幕轨迹转化为逐步操作知识。为解决对齐训练数据稀缺的问题,我们开发了一个系统性的数据飞轮以实现可扩展的数据采集。我们进一步引入了一个新颖的中文移动屏幕教学基准用于细粒度评估。基于Teach VLM,我们提出了Teach-and-Repeat范式,其中生成的操作知识作为可解释的程序化参考,指导下游基于屏幕的执行智能体。大量评估表明,Teach VLM显著优于强VLM基线,在操作语义预测中达到了最先进的性能。此外,在Android World中的实验表明,我们的范式为下游智能体带来了持续的任务成功率提升。Teach VLM和Teach-and-Repeat范式共同提供了一条从原始演示到可复用任务自动化的实用路径。

英文摘要

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.