arXivDaily arXiv daily academic digest, updated Monday to Friday
2606.14703 2026-06-15 cs.CV cs.CL cs.LG New submission

Gaze Heads: How VLMs Look at What They Describe

注视头:视觉语言模型如何观察它们所描述的内容

Rohit Gandikota, David Bau

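Affiliations * Northeastern University

AI summary Identifies a set of "gaze heads" in the language backbone of vision-language models whose attention tracks the image region currently being described; intervening on these heads gives precise control over what the model describes, with 83.1% accuracy.

AI-generated Chinese abstract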

视觉语言模型在内部如何解决描述图像的任务远非显而易见。我们发现模型为此发展出一种特定机制:其语言模型骨干中的一小部分注意力头(我们称之为注视头),其注意力跟踪模型当前正在描述的图像区域。我们通过简单的相关性得分从几次前向传播中发现了它们,使用连环漫画作为受控测试平台,其中叙事顺序在空间上展开。这些注视头不仅跟踪正在描述的图像标记:将它们的注意力重定向到所选区域会强制视觉语言模型描述该区域。对前100个注视头(少于所有头的9%)进行单次注意力掩码干预,以83.1%的准确率将模型的答案引导到任何选定的漫画面板,而对随机头进行相同干预则无法重定向答案,并且对所有头进行干预会破坏生成。相同的杠杆还扩展到连续控制:在生成过程中切换注视目标会使模型在几个标记内结束当前面板描述并转向新面板。在漫画之外,相同的干预将答案重定向到自然COCO图像中的选定区域。该机制进一步在2B到32B参数的模型大小以及其他视觉语言模型架构中重复出现,尽管一些冻结编码器系列没有显示可比较的头集。更广泛地说,这表明通过机制分析识别的目标编辑可以作为实用的推理时杠杆来引导多模态模型行为,而无需任何重新训练。我们的代码、交互式演示和数据集可在以下网址获取:此 https URL

English abstract

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
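
As a rough illustration of the kind of correlation score the abstract mentions, the sketch below ranks heads by how well their attention mass on each panel tracks a 0/1 indicator of the panel being described. The data layout, the names `gaze_scores` and `top_k_heads`, and the choice of Pearson correlation are assumptions made for illustration; the abstract does not spell out the authors' exact score.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def gaze_scores(attn, panel_of_image_token, described_panel):
    """Hypothetical per-head gaze score.

    attn[h][t][i]: attention of head h from generated token t to image token i.
    panel_of_image_token[i]: panel id of image token i.
    described_panel[t]: panel id that generated token t is describing.
    """
    panels = sorted(set(panel_of_image_token))
    scores = []
    for head in attn:
        mass, target = [], []
        for t, row in enumerate(head):
            for p in panels:
                # Attention mass this head puts on panel p at step t.
                mass.append(sum(a for a, q in zip(row, panel_of_image_token) if q == p))
                # 1.0 if panel p is the one being described at step t.
                target.append(1.0 if described_panel[t] == p else 0.0)
        scores.append(pearson(mass, target))
    return scores

def top_k_heads(scores, k):
    """Indices of the k highest-scoring heads, best first."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]
```

A head that always looks at the described panel scores close to 1, a head that looks elsewhere scores close to -1, and a head whose attention is unrelated to the narration scores near 0.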

2606.14702 2026-06-15 cs.CV New submission

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

OmniVideo-100K:通过结构化脚本和证据链进行音视频推理的数据集

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

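Affiliations * Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary Introduces the OmniVideo-100K dataset, which uses entity-anchored video scripting and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA; fine-tuned models improve markedly on several benchmarks.

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

AI-generated Chinese abstract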

当前的音视频问答(QA)自动化流水线通常采用“视频-字幕-QA”范式。然而,这些方法通常将视频分割成短片段,并为音频和视觉模态生成独立的描述。这种解耦处理切断了声音与其视觉来源之间的固有关联,而独立的片段处理常常导致同一实体在不同片段中的描述不一致。此外,将长文本理解和QA合成耦合到单一步骤中,往往将模型限制在局部事件上,生成的问答缺乏长期时间连接和深度跨模态推理。为了解决这些问题,我们提出了一种自动化数据引擎,包含两种机制:(1)**实体锚定视频脚本**将视频转换为结构化脚本,包括摘要、主要实体列表和逐片段的音视频描述。实体列表作为全局先验,确保跨片段引用一致性并重建音视频关联。(2)**线索引导的QA生成**提示模型首先从脚本中挖掘跨片段、多模态线索,然后基于这些高价值线索生成QA对。利用这一流水线,我们构建了指令微调数据集**OmniVideo-100K**和人工验证的测试集**OmniVideo-Test**。在OmniVideo-100K上微调VITA-1.5、Qwen2.5-Omni-7B和Qwen3-Omni-30B,在OmniVideo-Test上获得了高达20.59%的性能提升,并在Daily-Omni和JointAVBench等现有基准上表现出强大的泛化能力(提升高达12.64%)。

English abstract

Current automated pipelines for audio-visual question answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) **Entity-Anchored Video Scripting** transforms videos into structured scripts comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) **Clue-Guided QA Generation** prompts models to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset **OmniVideo-100K** and a human-verified test set, **OmniVideo-Test**. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench (improvements of up to 12.64%).

2606.14701 2026-06-15 cs.CV New submission

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

RATS!补丁通过寄存器对话:寄存器注意力Transformer中的涌现部件

Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang

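Affiliations * Johns Hopkins University; Office of Naval Research, Arlington, VA; Department of Laboratory Medicine and Pathology, Mayo Clinic, MN, USA

AI summary Proposes RATS, which decomposes the classification token into learnable register tokens that route patch information through an L→N→N→L bottleneck; without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region resembling an object part, for an average gain of 12 mIoU across five segmentation benchmarks.

AI-generated Chinese abstract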

当人类看到一只鸟时,他们识别出的远不止是“鸟”——他们看到头部、翅膀和爪子,这是一个可重复使用部件的结构化组合,这些部件可以在他们见过的每一只鸟中被识别出来。我们询问一个自监督视觉模型能否自行发现相同的组合结构。为此,我们提出了RATS(寄存器注意力Transformer),它将分类令牌分解为N个可学习的寄存器令牌,通过三步压缩-通信-广播注意力机制,在L→N→N→L瓶颈中路由补丁信息。这N个寄存器被分配到H个注意力头上,因此分配给不同头的寄存器之间不相互作用。在没有辅助损失或部件标注的情况下,每个寄存器自发地专化为一个原语义区域,其涌现结构类似于物体部件。RATS在五个分割基准上平均超过所有基线+12 mIoU,在ADE20K(+1.11 mIoU)和COCO(+0.2 AP^m)上持续提升。其寄存器字典进一步展示了跨相关类别的部件级一致性和语义接近性。我们的结果表明,RATS可能为结构化和可解释的视觉表示学习提供有用的架构先验。

English abstract

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
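
To make the L->N->N->L routing concrete, here is a minimal single-head sketch of a compress-communicate-broadcast pass. It assumes plain dot-product attention with no learned projections and ignores the partition of registers across heads; the function names are hypothetical and this is not the authors' implementation.

```python
from math import exp, sqrt

def softmax(row):
    m = max(row)
    e = [exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attend(queries, keys_values):
    """Dot-product attention with keys == values and no learned projections."""
    d = len(keys_values[0])
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / sqrt(d) for k in keys_values])
        out.append([sum(wj * kv[c] for wj, kv in zip(w, keys_values)) for c in range(d)])
    return out

def register_attention(patches, registers):
    """L -> N -> N -> L bottleneck: patches never attend to each other directly."""
    compressed = attend(registers, patches)        # compress: N registers read L patches
    communicated = attend(compressed, compressed)  # communicate: N x N among registers
    return attend(patches, communicated)           # broadcast: L patches read N registers
```

In this sketch every attention map has shape N x L, N x N, or L x N, so the cost grows with L*N + N*N rather than with L*L as in full patch-to-patch attention.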

2606.14700 2026-06-15 cs.CV New submission

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

RepFusion:利用多模态先验在表示空间中进行去噪

Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie

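Affiliations * Meta AI; New York University

AI summary Proposes RepFusion, which uses a multimodal large language model as a noisy-representation encoder to supply the conditioning signal for a diffusion transformer, outperforming baselines with newly initialized denoisers at similar inference budgets.

Comments Project Page: https://xichenpan.com/repfusion

AI-generated Chinese abstract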

大型语言模型(LLMs)广泛用于文本到图像(T2I)系统,但它们通常仅限于文本编码,而去噪由新训练的生成骨干网络处理。表示自编码器(RAEs)的出现将生成目标转向语义结构化的视觉表示,创建了一个与预训练LLM先验更兼容的潜在空间。受多模态LLM(MLLMs)的启发,其中MLP投影仪足以将干净的视觉表示与预训练LLM对齐,我们将MLLM本身重新用作噪声表示编码器,将此机制从干净输入扩展到噪声输入。我们提出了RepFusion,它使用生成的MLLM输出作为扩散变压器的条件信号。在相似推理预算下的受控比较中,RepFusion优于将可比容量分配给新初始化解码器的基线。这些结果表明,MLLMs为去噪视觉表示提供了强大的先验,并且通过以演化的噪声表示为条件,测试时的计算可以有效地用于现代T2I系统中重复的MLLM条件化。

English abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.

2606.14699 2026-06-15 cs.CV cs.GR cs.RO New submission

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Instruct-Particulate: 基于运动学控制的可扩展前馈式3D物体关节化

Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

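Affiliations * University of Oxford; University of Cambridge; Nanyang Technological University

AI summary Proposes Instruct-Particulate, a model guided by a kinematic specification (part descriptions, connectivity, joint types, and so on) to predict articulated part segmentation and motion parameters for 3D meshes; trained on a heterogeneous dataset of more than 150,000 objects, it generalizes across categories and to AI-generated meshes.

Comments Project page: https://instruct-particulate.github.io/

AI-generated Chinese abstract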

重建关节式3D物体对于动画、游戏和机器人模拟至关重要。最近的神经网络可以估计3D物体的关节结构,但其泛化能力仍然受到该任务标注数据稀缺的限制。为了解决这一差距,我们引入了Instruct-Particulate,一个模型,它接受一个3D网格以及一个目标运动学规范,包括部件描述、连接性、关节类型和可选的点提示,并预测相应的运动学部件分割和关节运动参数。运动学规范消除了任务的歧义,并允许模型针对不同粒度的标注,从而使得使用更丰富的异构训练数据成为可能。在测试时,运动学规范可以从大规模视觉-语言模型中自动获得,因此该模型可以应用于任何输入网格。为了大规模训练我们的模型,我们构建了一个包含超过15万个关节式3D物体的异构数据集,通过使用视觉-语言模型对部分其他3D模型(整体或已分解为部件)进行运动学标注,扩展了现有的公开数据集。实验表明,我们的模型在跨类别和AI生成网格上泛化更好,通过图像到3D模型实现了从真实世界图像重建关节式资产。

English abstract

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

2606.14697 2026-06-15 cs.CV cs.AI cs.CL New submission

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu: 用于诊断医学多模态大语言模型推理中阶段式幻觉的基准

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

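Affiliations * The Hong Kong University of Science and Technology (Guangzhou); DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

AI summary Introduces the ClinHallu benchmark of 7,031 instances, each with a structured reasoning trace (visual recognition, knowledge recall, reasoning integration); stage-replacement interventions and trace-supervised fine-tuning enable fine-grained hallucination diagnosis and mitigation.

Comments Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

AI-generated Chinese abstract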

构建可信的医学多模态大语言模型(MLLM)对于可靠的临床决策支持至关重要。现有的医学幻觉基准主要关注数据收集,但往往忽略了推理过程中幻觉的起源。我们发现幻觉来源因样本而异:错误可能源于视觉误识别、不正确的医学知识回忆或有缺陷的推理整合。为了实现源级别的幻觉诊断,我们引入了ClinHallu,一个用于医学MLLM推理中阶段式幻觉诊断的基准。ClinHallu包含7031个经过验证的实例,每个实例都附有分解为视觉识别、知识回忆和推理整合的结构化推理轨迹。我们还使用阶段替换干预来测量纠正特定阶段如何影响最终答案。除了评估,我们表明轨迹监督微调减少了阶段式幻觉。ClinHallu为诊断和缓解医学MLLM中的推理失败提供了一个细粒度的幻觉测试平台。该基准可从此https URL公开获取。

English abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

2606.14695 2026-06-15 cs.LG cs.CL New submission

Persona-Pruner: Sculpting Lightweight Models for Role-Playing

Persona-Pruner: 为角色扮演雕琢轻量级模型

Jinsu Kim, Jihoon Tack, Noah Lee, Jongheon Jeong

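AI summary Proposes Persona-Pruner, a framework that prunes a language model by isolating a persona-specific sub-network from a single description, substantially cutting computational cost while preserving role-playing performance; the performance drop is up to 93.8% smaller than with the strongest baseline.

Comments 25 pages; ICML 2026; Code is available at https://github.com/jsu-kim/Persona-Pruner

AI-generated Chinese abstract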

语言模型(LMs)作为角色扮演聊天机器人展现出显著潜力,在给定角色或用户画像规范时,能够提供一致且风格化的交互。然而,将这些能力应用于现实世界应用(例如,众多NPC同时交互的生态系统)时,由于过高的计算成本,暴露了关键的效率问题。在本文中,我们质疑将完整的通用模型专用于单一角色的必要性,假设特定角色身份仅依赖于模型总容量的一小部分。我们观察到,朴素地剪枝LM通常会严重降低特定角色的角色扮演性能;它无法区分冗余知识和基本角色特征。我们提出Persona-Pruner,一个通过从单个描述中隔离特定角色的子网络来雕琢轻量级角色扮演模型的框架。我们的实验一致表明,Persona-Pruner在保留角色扮演性能方面比现有最先进的LLM剪枝技术有效得多,在RoleBench上使用LLM-as-a-judge评分,将性能下降从密集模型减少至多93.8%(相比最强基线),同时仍保持通用LLM能力。代码可在以下网址获取:此https URL。

English abstract

Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model's total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at https://github.com/jsu-kim/Persona-Pruner.

2606.14693 2026-06-15 cs.MA cs.AI New submission

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

学习协调偏好用于多目标多智能体强化学习

Pengxin Wang, Lihao Guo, Yi Xie, Bo Liu, Siyang Cao, Jingdi Chen

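Affiliations * Department of Electrical and Computer Engineering, University of Arizona

AI summary Proposes Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to achieve complementary trade-offs in multi-objective multi-agent reinforcement learning; theory shows that preference diversity can induce team improvement, and experiments confirm gains in performance and coordination.

AI-generated Chinese abstract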

合作性多目标多智能体强化学习(MOMARL)对团队在多个可能冲突的目标下的决策进行建模。在此设置中,冲突不仅出现在目标之间,也出现在具有不同观察、角色和贡献的智能体之间。我们提出了偏好协调多智能体策略优化(PCMA),它学习协调的智能体特定偏好,以实现智能体之间的互补权衡。理论上,我们将合作性MOMARL形式化为一个团队最优博弈,并证明在适当条件下,偏好多样性可以通过一阶改进分解诱导团队改进。在多个合作性MOMA环境和一个实际交通控制场景上的实验表明,PCMA提高了性能和权衡协调性。

English abstract

Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to enable complementary trade-offs among agents. Theoretically, we formulate cooperative MOMARL as a team-optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.

2606.14691 2026-06-15 cs.CL New submission

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

CORA: 通过一致性导向的推理对齐分析与弥合多模态RLVR中的思考-答案差距

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu

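Affiliations * University of Chinese Academy of Sciences; Wuhan University; Tsinghua University; Tianjin University

AI summary Analyzes the semantic inconsistency between thinking and answer in multimodal RLVR and proposes CORA, which introduces semantic consistency through a lightweight consistency reward model and stabilizes optimization with Hybrid Reward Advantage Splitting, improving reasoning faithfulness.

Comments Submitted to EMNLP 2026

AI-generated Chinese abstract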

可验证奖励的强化学习(RLVR)已成功激发大语言模型的推理能力,推动其向多模态场景扩展。现有方法主要关注提升推理轨迹的视觉覆盖和缓解视觉幻觉,但低估了推理过程与最终答案之间的语义不一致性。本文深入研究了大型视觉语言模型(LVLMs)中RLVR的思考-答案不一致性,通过对组相对策略优化(GRPO)训练过程中收集的轨迹以及RLVR后评估输出的分析,表明该问题在训练期间持续存在,并在推理时仍然存在。受此分析启发,我们提出一致性导向的推理对齐(CORA),通过轻量级即插即用的一致性奖励模型将思考-答案语义一致性引入RLVR,并进一步结合混合奖励优势分裂(HRAS)以稳定协调任务和一致性优化。在代表性多模态推理基准和主流LVLMs上的大量实验表明,CORA在提升任务性能的同时有效缓解了思考-答案不一致性,从而产生更忠实的推理轨迹。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we examine thinking-answer inconsistency in RLVR for large vision-language models (LVLMs). Analyses of rollouts collected throughout Group Relative Policy Optimization (GRPO) training and of post-RLVR evaluation outputs show that this issue persists during training and remains present at inference. Motivated by this analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.
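
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage it is built on: each rollout's reward is standardized within its group. The second function is only one hypothetical way to handle two reward streams (normalize each separately, then add with a weight); the abstract does not describe how HRAS actually splits advantages, and the population standard deviation and `eps` are assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

def combined_advantages(task_rewards, consistency_rewards, weight=0.5):
    """Hypothetical mix: normalize each reward stream separately, then add."""
    a_task = group_advantages(task_rewards)
    a_cons = group_advantages(consistency_rewards)
    return [t + weight * c for t, c in zip(a_task, a_cons)]
```

With this mix, a rollout that answers correctly but reasons inconsistently receives a smaller advantage than one that is both correct and consistent.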

2606.14690 2026-06-15 cs.LG cs.IT math.IT New submission

A Complexity Measure for Active Learning in Multi-group Mean Estimation

多组均值估计中主动学习的复杂度度量

Abdellah Aznag, Rachel Cummings, Adam N. Elmachtoub

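Affiliations * Department of Industrial Engineering and Operations Research & Data Science Institute, Columbia University

AI summary For the max-risk objective in multi-group mean estimation, develops a local minimax framework and proves a general lower bound, introduces the Variance Local Curvature (VLC) as a complexity measure, relates it to a variance-Fisher information for smooth classes, and exposes a systematic gap in highly heterogeneous instances.

AI-generated Chinese abstract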

我们研究了多组均值估计$d$臂老虎机中主动学习的\emph{max-risk}目标:学习者在$d$组间自适应分配$T$个样本的预算,以最小化最坏情况不确定性指标$\max_{k\in[d]}\sigma_k^2/n_k$,其中$\sigma_k$是臂$d$分布的标准差,$n_k$是臂$d$被采样的次数。我们开发了一个局部极小极大框架,并证明了该目标的第一个通用下界,适用于任何有限方差假设类。该下界将难度分解为三个正交因素:\emph{预算}项、衡量不确定性在臂间分布不均匀程度的\emph{异方差性}指数,以及一个模型相关的复杂度度量——\emph{方差局部曲率}($\mathrm{VLC}$),它捕捉了局部方差变化在假设类内创造的信息量。对于平滑类,$\mathrm{VLC}$是方差-费希尔信息的重新参数化,常见族具有闭式值。与现有最强上界对比表明,在广泛范围内接近最优(对数因子内),并在高度异质实例中指出了系统性差距。我们的证明引入了两个关键要素:决策空间上的损失诱导$\ell_1$几何,以及一个基于表示的实例生成器,将困难实例构造简化为显式随机矩阵计算。

English abstract

We study a \emph{max-risk} objective for active learning in a multi-group mean estimation $d$-armed bandit: a learner adaptively allocates a budget of $T$ samples across $d$ groups to minimize the worst-case uncertainty index $\max_{k\in[d]}\sigma_k^2/n_k$, where $\sigma_k$ is the standard deviation of the distribution of arm $k$, and $n_k$ is the number of times arm $k$ is sampled. We develop a local minimax framework and prove the first general lower bound for this objective, valid for any finite-variance hypothesis class. The bound separates difficulty into three orthogonal factors: a \emph{budget} term, a \emph{heteroscedasticity} index measuring how unevenly the uncertainty is spread across arms, and a model-dependent complexity measure, the \emph{Variance Local Curvature} ($\mathrm{VLC}$), which captures how much information a local change of variance creates inside the hypothesis class. For smooth classes, the $\mathrm{VLC}$ is a reparametrization of a variance--Fisher information, with closed-form values for common families. Benchmarking against the strongest available upper bound shows near-optimality up to logarithmic factors in broad regimes, and pinpoints a systematic gap in highly heterogeneous instances. Our proof introduces two key ingredients: a loss-induced $\ell_1$ geometry on the decision space, and a representation-based instance generator that reduces hard-instance construction to an explicit random matrix calculation.
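
A small worked example may help with the objective. If the variances were known in advance (an oracle assumption; the learner in the paper must discover them adaptively), the real-valued allocation that minimizes $\max_k \sigma_k^2/n_k$ equalizes all ratios, giving $n_k = T\sigma_k^2/\sum_j \sigma_j^2$ and a max-risk of $\sum_j \sigma_j^2/T$. The function names and numbers below are illustrative only, and integer rounding is ignored.

```python
def oracle_allocation(variances, budget):
    """Real-valued allocation n_k proportional to sigma_k^2 (ignores integer rounding)."""
    total = sum(variances)
    return [budget * v / total for v in variances]

def max_risk(variances, counts):
    """Worst-case uncertainty index max_k sigma_k^2 / n_k."""
    return max(v / n for v, n in zip(variances, counts))

variances = [1.0, 4.0, 0.25]   # sigma_k^2 for three groups (made-up values)
T = 210
alloc = oracle_allocation(variances, T)            # 40, 160, and 10 samples
uniform = [T / len(variances)] * len(variances)    # 70 samples per group

print(max_risk(variances, alloc))    # equals sum(variances) / T = 0.025
print(max_risk(variances, uniform))  # 4 / 70, more than twice as large
```

The gap between the two printed values is what heteroscedasticity costs a learner that samples uniformly.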

2606.14688 2026-06-15 cs.LG cs.AI cs.CL cs.DS New submission

Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

洪流与收获:通过极限语言生成视角证明琐碎知识对于生成有价值数学的必要性

Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao

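Affiliations * University of New South Wales; University of Sydney; University of Cambridge

AI summary Using the model of language generation in the limit, proves that a verifier cannot substitute for taste in formal mathematics generation: covering unrecorded valuable mathematics necessarily produces an infinite but asymptotically negligible stream of trivial statements.

AI-generated Chinese abstract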

与证明助手耦合的AI系统现在能够大规模生成形式化数学,而验证器可验证的内容与数学家认为有价值的内容之间的差距已成为制约因素。我们将有价值数学的生成建模为极限下的嵌套语言生成:通过成员查询预言机(证明检查器)访问的可验证形式语言$F$包含一个未知的有价值语言$H \in \mathcal{H}$,该语言仅通过核心$C \subseteq H$的对抗性枚举揭示,其精确密度为$\alpha$(文献)。每个输出要么是有价值的($\in H$),要么是琐碎的($\in F \setminus H$),要么是幻觉($\notin F$)。我们解决了四个问题。第一,验证器不是品味:允许广度生成的集合恰好是无预言机模型中的那些,按纤维由Angluin条件刻画。第二,验证器确实提供了可靠覆盖,覆盖所有未见过的有价值陈述同时仅断言有效陈述:有验证器可能,无验证器不可能;它将不可避免的错误从虚假转移到琐碎。第三,核心地,关于紧族存在尖锐二分法:生成有限个琐碎语句的生成器达到最优覆盖$\alpha/2$,而任何无限琐碎语句的允许,即使以消失速率,也将最优值跃升至$1-\alpha/2$(两者均为紧界,对于以候选交集形式呈现的核心),且存在一个生成器同时达到两端。转变在于琐碎语句的数量而非速率;间隙$1-\alpha$是未记录的质量。第四,两种机制在数学的压缩模型中实例化。完美的验证器无法替代品味:正确但无价值的语句的无界流并非工程事故,而是可证明的必要性,因为覆盖未记录的有价值数学需要无限但渐近可忽略的已认证琐碎语句流。

English abstract

AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.
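
The gap quoted in the abstract follows directly from subtracting the two optimal coverage values. The numerical value of the density used below is an illustrative assumption, not a figure from the paper.

```latex
\[
\underbrace{\Bigl(1-\tfrac{\alpha}{2}\Bigr)}_{\text{infinitely many trivia}}
-\underbrace{\tfrac{\alpha}{2}}_{\text{finitely many trivia}}
= 1-\alpha,
\qquad
\text{e.g. } \alpha = 0.2:\quad 0.9 - 0.1 = 0.8 .
\]
```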

2606.14686 2026-06-15 cs.CV cs.AI New submission

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

CottonLeafVision:一种可解释且鲁棒的棉花叶部病害分类深度学习框架

Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

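AI summary Proposes the CottonLeafVision framework, which reaches 98% classification accuracy on a cotton leaf disease dataset with DenseNet201 and integrates Grad-CAM, occlusion sensitivity analysis, and adversarial training to improve interpretability and robustness.

Comments This paper contains 11 figures and 4 tables. It was presented at the 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026.

AI-generated Chinese abstract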

全球范围内,棉花是一种高度经济价值的作物,因为纺织工业严重依赖它。因此,精确识别和检测棉花叶部病害对经济稳定至关重要。“CottonLeafVision”的开发目标是准确分类和检测棉花叶部病害。为此,我们在公开的棉花叶部病害图像数据集上评估了多个预训练的深度卷积神经网络,包括DenseNet201、InceptionV3和VGG19。该图像数据集包含七个类别,六个病害类别和一个健康类别,是在反映现实挑战的各种田间条件下收集的。在这些预训练模型中,使用DenseNet201,我们实现了98%的最高分类准确率。为了增强模型的可靠性和可解释性,我们实施了不同的技术和方法,如梯度加权类激活映射(Grad-CAM)、遮挡敏感性分析和对抗训练,以提高模型的抗噪声能力。最后,我们开发了一个原型,以便在现实农业中利用模型的能力。本文展示了深度学习模型在现实棉花病害管理情况下分类病害的能力。

English abstract

Cotton is an economically important crop worldwide because the textile industry depends heavily on it, so precise identification and detection of cotton leaf disease is crucial for economic stability. "CottonLeafVision" aims to classify and detect cotton leaf disease accurately. To this end, we evaluate multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset comprises seven classes (six disease classes and one healthy class) collected under various field conditions that reflect real-world challenges. Among the pretrained models, DenseNet201 achieves the highest classification accuracy, 98%. To improve reliability and interpretability, we apply Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and we use adversarial training to increase the model's resistance to noise. Finally, we develop a prototype that makes the model usable in real-world agriculture. The paper demonstrates that deep learning models can classify disease in real-life cotton disease management settings.

2606.14684 2026-06-15 cs.CV cs.LG New submission

HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification

HumP-KD: 一种混合不确定性感知的多阶段渐进式知识蒸馏框架用于高效火灾分类

Mohammed Arif Mainuddin, Najifa Tabassum, Omar Ibne Shahid, Riasat Khan

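AI summary Proposes HumP-KD, which uses hierarchical progressive and multi-stage knowledge distillation to transfer knowledge from two frozen heterogeneous transformer teachers (Swin-Tiny and ViT-Base) and their ensemble into a lightweight MobileViT-S student, substantially improving fire classification while keeping the parameter count low and inference real-time.

AI-generated Chinese abstract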

实时火灾分类系统需要模型同时具备准确性、计算效率以及可在资源受限硬件上部署的能力。本文提出\textbf{HumP-KD},一种混合不确定性感知的多阶段渐进式知识蒸馏框架,用于高效火灾分类。使用了两个数据集:FlameVision(8600张图像)和Dataset-II(31309张图像)。在标准预处理、在线增强、高斯噪声和运动模糊鲁棒性条件下,应用了多种CNN和Transformer基线模型。所提出的HumP-KD模型通过三个紧密集成的组件,将两个冻结的异构Transformer教师(Swin-Tiny和ViT-Base)及其Meta-MLP集成的知识蒸馏到轻量级MobileViT-S学生中。层次化渐进式知识蒸馏采用层次化特征构建器,生成融合的空间注意力掩码,以选择性地引导蒸馏到判别性区域。多阶段知识蒸馏在训练过程中逐步激活三个蒸馏阶段。在Dataset-II上,HumP-KD在10次独立试验中平均F1分数达到$0.9876 \pm 0.0063$,显著优于未使用蒸馏训练的MobileViT-S基线($0.9537 \pm 0.0351$),独立t检验($p = 0.0195$)和Wilcoxon符号秩检验($W = 1$,$p = 0.0039$)均证实了统计显著性。所提出的方法还展示了跨数据集的强泛化能力和在退化视觉条件下的鲁棒性。学生模型仅保留4.94M参数和19.01Mb模型大小,相比Swin-Tiny参数减少$5.7\times$,相比ViT-Base减少$17.5\times$,同时达到37.72 CPU FPS,适合实时部署。

English abstract

Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes **HumP-KD**, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets are used: FlameVision (8,600 images) and Dataset-II (31,309 images). Various CNN and transformer baselines are evaluated under standard preprocessing, online augmentation, and Gaussian-noise and motion-blur robustness conditions. HumP-KD distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder, which generates a fused spatial attention mask that selectively guides distillation toward discriminative regions. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of $0.9876 \pm 0.0063$ across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation ($0.9537 \pm 0.0351$), with statistical significance confirmed by both an independent t-test ($p = 0.0195$) and a Wilcoxon signed-rank test ($W = 1$, $p = 0.0039$). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and a 19.01 MB model size, representing a $5.7\times$ parameter reduction over Swin-Tiny and a $17.5\times$ reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.
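
The reported Wilcoxon figures can be sanity-checked by brute force. The sketch below enumerates all sign assignments for 10 paired trials and computes the exact two-sided p-value, assuming $W$ is the smaller of the two signed-rank sums and that there are no ties or zero differences; those assumptions are ours, since the abstract does not state which variant was used.

```python
from itertools import product

def wilcoxon_exact_two_sided(w, n):
    """Exact two-sided p-value for a signed-rank statistic w with n untied, nonzero pairs."""
    ranks = range(1, n + 1)
    # Under the null hypothesis, each rank carries a + or - sign with probability 1/2.
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) <= w
    )
    return min(1.0, 2 * hits / 2 ** n)

p = wilcoxon_exact_two_sided(1, 10)   # 4 / 1024
print(round(p, 4))                    # 0.0039
```

Only two of the 1,024 sign patterns give a rank sum of at most 1 (no positive ranks, or rank 1 alone), so the two-sided value is 4/1024, which agrees with the $p = 0.0039$ quoted for $W = 1$.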

2606.14679 2026-06-15 cs.LG cs.SY eess.SY math.OC stat.ML New submission

Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets

一般凸集上在线库存优化的最优隐藏目标学习

Anthony Pineci, Yunzong Xu

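Affiliations * UIUC (University of Illinois Urbana-Champaign)

AI summary For online inventory optimization on general convex capacity sets, proposes a hidden-target projection method that improves the regret from inverse to inverse-square-root dependence on the common-demand probability, proves a matching lower bound, and gives the first polylogarithmic regret guarantee for strongly convex losses and the first dynamic regret guarantee.

AI-generated Chinese abstract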

在线库存优化(OIO)是具有物理记忆的在线凸优化:库存结转使得可行动作集依赖于过去。一个自然的原则——在随机库存学习以及最近在单一线性容量约束下的OIO中使用——是维护一个由在线学习器选择的隐藏目标,并将其投影到当前可行的订货上限集上。我们证明,对于任意有界凸容量集上的OIO,这一简单原则是最优的。以在线梯度下降为基础学习器,该方法将一般凸集上OIO的最佳已知遗憾保证从对共同需求概率的逆依赖改进为平方根逆依赖,并且我们证明了匹配的下界。同样的原则为强凸损失提供了首个多对数遗憾保证,并为一般凸容量集上的欧几里得路径变化提供了首个动态遗憾保证。分析引入了一个范数对齐原则:正确的状态变量是隐藏目标到可行集的距离,以与投影相同的范数度量。在范数对齐下,该距离路径地演化为一个标量队列,目标移动作为到达,共同需求作为服务。这种简化为一维队列控制解决了状态依赖性,并将保证扩展到一般凸容量集,超出了先前乘积方法的范围。在合成和真实库存数据上的实验证实了该理论。

English abstract

Online inventory optimization (OIO) is online convex optimization with physical memory: inventory carryover makes the feasible action set depend on the past. A natural principle, used in stochastic inventory learning and recently in OIO under a single linear capacity constraint, is to maintain a hidden target chosen by an online learner and implement its projection onto the currently feasible order-up-to set. We prove that this simple principle is optimal for OIO on arbitrary bounded convex capacity sets. With online gradient descent as the base learner, the method improves the best known regret guarantee for OIO on general convex sets from inverse to inverse-square-root dependence on the common-demand probability, and we prove a matching lower bound. The same principle gives the first polylogarithmic regret guarantee for strongly convex losses and the first dynamic regret guarantee adapting to Euclidean path variation on general convex capacity sets. The analysis introduces a norm alignment principle: the right state variable is the distance from the hidden target to the feasible set, measured in the same norm as the projection. Under norm alignment, this distance evolves pathwise as a scalar queue, with target movement as arrival and common demand as service. This reduction to one-dimensional queue control resolves the state dependence and extends the guarantees to general convex capacity sets, beyond the reach of prior productwise approaches. Experiments on synthetic and real-world inventory data corroborate the theory.
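
The hidden-target principle can be sketched for the simplest capacity set, a box $[0, \text{cap}]^d$, where projection reduces to clipping. This is a hypothetical special case for illustration: the function name, the step-size handling, and the choice of which gradient is fed back to the learner are assumptions, and the paper treats arbitrary bounded convex sets.

```python
def clip(v, lo, hi):
    return max(lo, min(hi, v))

def hidden_target_step(target, inventory, grad, lr, cap):
    """One round for a box capacity set [0, cap]^d (a hypothetical special case).

    target: hidden target kept by the learner (may be infeasible right now).
    inventory: stock carried over (assumed <= cap); levels cannot drop below it.
    grad: loss (sub)gradient fed back to the learner.
    """
    # Online gradient descent on the hidden target, kept inside the capacity set.
    new_target = [clip(y - lr * g, 0.0, cap) for y, g in zip(target, grad)]
    # Play the projection of the target onto the currently feasible order-up-to set.
    played = [clip(y, inv, cap) for y, inv in zip(new_target, inventory)]
    return new_target, played
```

The learner's state (the target) is never overwritten by the feasibility constraint; only the played level is. Once demand draws the inventory down, the played level returns to the target.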

2606.14677 2026-06-15 quant-ph cs.LO cs.PL cs.SE math.CT 新提交

Quasilinear Equivalence Checking for Detector Error Models

探测器错误模型的拟线性等价性检查

Mathys Rennela

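AI summary: Develops an equational theory for detector error models (DEMs) whose quasilinear-time rewriting system decides structural equivalence, with applications to verifying and optimizing quantum compilers.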

Comments 19 pages, 5 figures

详情
AI中文摘要

探测器错误模型(DEM)是量子电路中错误机制的结构化表示,因其能够在电路层面捕获容错性而在量子编译流程中广受欢迎。它将错误机制列为针对探测器和可观测量的指令,为每个物理错误通道指定错误触发的概率、触发的探测器以及翻转的可观测值。在本文中,我们为DEM开发了一个等式理论及其相关的范畴语义。我们提出了一个对于DEM项而言是完备、终止且合流的重写系统,将其表述为Giry单子上的对称幺半理论(PROP)。我们证明每个DEM项都有唯一的范式,可以在拟线性时间$O(k|E|\log|E|)$内高效计算,其中$|E|$是指令数量,$k$是目标集大小的上界。这为结构DEM等价性提供了完整的(通过Tanner图)不变量集合。我们提供了第一个用于DEM等价性的静态判定程序,具有严格的正确性保证。对于非自适应量子纠错(QEC)流程,它是完备的(精确判定完整的解码器等价性),并且可以扩展到部分自适应电路(晶格手术、分布式QEC等)的可靠且适用的判定程序,而不会遭受指数级开销。我们讨论了其在量子编译器验证和优化中的应用。

英文摘要

A Detector Error Model (DEM) is a structured representation of error mechanisms in quantum circuits, which has gained popularity in quantum compilation pipelines for its ability to capture fault-tolerance at a circuit level. It lists error mechanisms as instructions targeting detectors and observables, specifying for each physical fault channel the probability that the fault fires, the detectors it triggers, and the observables it flips. In this paper, we develop an equational theory for DEMs, with its associated categorical semantics. We present a sound, terminating, confluent rewriting system for DEM terms, formulating it as a symmetric monoidal theory (a PROP) over the Giry monad. We prove that every DEM term has a unique normal form, which can be computed efficiently in quasilinear time $O(k|E|\log|E|)$, where $|E|$ is the number of instructions and $k$ bounds the size of a target set. This provides a complete set of invariants (via Tanner graphs) for structural DEM equivalence. We provide the first static decision procedure for DEM equivalence, with rigorous correctness guarantees. It is complete (decides full decoder-equivalence exactly) for non-adaptive quantum error correction (QEC) pipelines, and scales to a sound and applicable decision procedure for partially-adaptive circuits (lattice surgery, distributed QEC, ...) without suffering exponential overhead. We discuss its application to the verification and optimisation of quantum compilers.
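To make the idea of a sort-based normal form concrete, here is a hedged sketch. It is not the paper's rewriting system: it assumes a DEM is a list of `(probability, targets)` pairs, that instructions with identical target sets are independent, and that such instructions merge by the standard rule for two independent flips, `p + q - 2pq`. The names `normalize_dem` and `structurally_equivalent` are hypothetical.

```python
def normalize_dem(instructions):
    """Return a canonical list of (targets, probability) pairs.

    instructions: iterable of (probability, targets), where targets is a list
    of labels such as "D0" (detector) or "L0" (observable). Assumed format.
    """
    # Canonical key: the sorted tuple of targets. Sorting |E| keys of length
    # at most k costs O(k |E| log |E|) comparisons of label positions.
    keyed = sorted((tuple(sorted(targets)), p) for p, targets in instructions)
    merged = []
    for key, p in keyed:
        if merged and merged[-1][0] == key:
            q = merged[-1][1]
            # Two independent faults with the same effect: net flip iff
            # exactly one of them fires.
            merged[-1] = (key, p * (1 - q) + q * (1 - p))
        else:
            merged.append((key, p))
    # Instructions that never fire carry no information.
    return [(key, p) for key, p in merged if p > 0.0]


def structurally_equivalent(a, b, tol=1e-12):
    """Compare two DEMs by their canonical forms, up to a float tolerance."""
    na, nb = normalize_dem(a), normalize_dem(b)
    return len(na) == len(nb) and all(
        ka == kb and abs(pa - pb) <= tol
        for (ka, pa), (kb, pb) in zip(na, nb)
    )
```

For example, `[(0.1, ["D1", "D0"]), (0.2, ["D0", "D1"])]` and `[(0.26, ["D0", "D1"])]` normalize to the same form under these assumptions.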

2606.14674 2026-06-15 cs.CL 新提交

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

AgentSpec: 通过受控组合理解具身智能体脚手架

Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin

发表机构 * University of California, San Diego(加利福尼亚大学圣迭戈分校) Johns Hopkins University(约翰霍普金斯大学) University of Washington(华盛顿大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

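AI summary: Introduces AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components; standardized interfaces allow components to be swapped and recombined under controlled conditions, showing that scaffold compatibility and interaction effects govern performance.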

详情
AI中文摘要

LLM智能体越来越多地构建为脚手架系统,而非单一模型调用,这些系统结合了推理、记忆、反思、动作执行和学习。虽然此类脚手架通常能提升性能,但它们往往嵌入在紧密耦合的流水线中,使得难以隔离组件贡献、比较替代设计或理解模块交互如何塑造智能体行为。我们引入AgentSpec,一个模块化规范框架,将具身智能体表示为具有标准化接口的可复用策略组件的类型化组合。AgentSpec标准化了感知、记忆、推理、反思、动作和可选学习之间的接口,使得组件能够在受控条件下被交换和重组。我们在DeliveryBench、ALFRED、MiniGrid和RoboTHOR上实例化该框架,并分析了跨模型骨干的推理、记忆、反思和强化学习模块。我们的结果表明,智能体性能由脚手架兼容性和交互效应主导,而非孤立模块强度。特别是,结构化多粒度记忆改善了长程状态跟踪,推理和记忆在不同环境中非均匀交互,反思在纠正和成本之间权衡,而RL训练的策略在与部署时脚手架结构共同优化时组合最佳。AgentSpec为研究、比较和设计可组合的LLM智能体提供了受控基础。我们的代码、基线和交互式游乐场在此https URL公开。

英文摘要

LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at https://agentspec-embodied.github.io.

2606.14673 2026-06-15 cs.LG 新提交

Compressed Computation is (probably) not Computation in Superposition

压缩计算(可能)不是叠加计算

Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, Stefan Heimersheim

发表机构 * Metamorphic Independent(独立研究者) UK AI Security Institute(英国人工智能安全研究所) Apollo Research

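AI summary: Analysis of the Compressed Computation (CC) model shows that its performance gain comes from a mixing matrix in the labels rather than from genuine computation in superposition; an SNMF baseline reproduces its qualitative loss profile.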

Comments Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025

详情
AI中文摘要

我们研究压缩计算(CC)玩具模型(Braun等人,2025)是否是叠加计算的一个实例。CC模型似乎仅用50个神经元就能计算100个ReLU函数,其损失优于仅表示50个ReLU函数的预期。我们表明,该模型通过其带噪的残差流混合输入,对应于标签中一个非预期的混合矩阵。将训练目标分解为ReLU项和混合项,我们发现性能增益随混合矩阵的幅度缩放,并在移除该矩阵时消失。学习到的神经元方向集中在与混合矩阵前50个特征值相关的子空间中,表明混合项主导了解决方案。最后,仅从混合矩阵导出的半非负矩阵分解(SNMF)基线重现了定性损失曲线,并改进了先前的基线,尽管它未能匹配训练后的模型。这些结果表明CC不是叠加计算的一个合适玩具模型。

英文摘要

We study whether the Compressed Computation (CC) toy model (Braun et al., 2025) is an instance of computation in superposition. The CC model appears to compute 100 ReLU functions with just 50 neurons, achieving a better loss than expected from only representing 50 ReLU functions. We show that the model mixes inputs via its noisy residual stream, corresponding to an unintended mixing matrix in the labels. Splitting the training objective into the ReLU term and the mixing term, we find that performance gains scale with the magnitude of the mixing matrix and vanish when the matrix is removed. The learned neuron directions concentrate in the subspace associated with the top 50 eigenvalues of the mixing matrix, suggesting that the mixing term governs the solution. Finally, a semi-non-negative matrix factorization (SNMF) baseline derived solely from the mixing matrix reproduces the qualitative loss profile and improves on prior baselines, though it does not match the trained model. These results suggest CC is not a suitable toy model of computation in superposition.

2606.14672 2026-06-15 cs.AI cs.CL 新提交

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

面向LLM-Agent工作流中并行分支的直接潜在空间合成

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Meta

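AI summary: Proposes Parallel-Synthesis, a framework in which the synthesizer consumes the KV caches of parallel worker agents directly, avoiding the redundant prefill of text concatenation; it matches or outperforms text-based synthesis on seven of nine datasets and reduces time-to-first-token by 2.5x-11x.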

详情
AI中文摘要

大型语言模型越来越多地作为代理系统的执行引擎,但它们仍然通过顺序文本接口消耗上下文。这与现代结构化代理工作流不匹配,其中独立分支探索子任务、检索证据或生成候选解决方案,然后进行最终合成步骤。现有系统通常通过拼接这些分支的文本输出来合并它们,这丢弃了并行结构并导致冗余的预填充计算。在这项工作中,我们引入了Parallel-Synthesis,一个即插即用的框架,使合成器能够直接消耗由并行工作代理产生的KV缓存。Parallel-Synthesis结合了一个缓存映射器,用于校准独立生成的分支缓存,以及一个微调的合成器适配器,用于从此非顺序缓存接口生成。我们使用数据训练Parallel-Synthesis,这些数据使合成器暴露于并行缓存上下文,教授跨缓存分支的聚合,并从基于标准文本拼接的合成中蒸馏推理行为。在跨越数学、科学问答、代码生成、GAIA和多代理数据库诊断的九个下游数据集上,Parallel-Synthesis在七个数据集上匹配或优于基于文本的合成,并在另外两个数据集上保持接近。它还将首令牌时间减少了2.5-11倍,表明直接基于缓存的合成是一种更有前途的接口,用于在并行代理分支上进行更原生和高效的合成。

英文摘要

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

2606.14667 2026-06-15 cs.CV 新提交

Memento: Reconstruct to Remember for Consistent Long Video Generation

Memento: 通过重建来记忆以实现一致的长视频生成

Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Qingqi Hong

发表机构 * Xiamen University(厦门大学) ERNIE Team, Baidu Inc.(百度公司ERNIE团队)

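AI summary: Proposes Memento, a framework that uses subject-reconstruction guidance and a dual-query memory mechanism to address the loss of subject consistency in long video generation and keep generation coherent across shots.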

Comments Project page: https://ernie-research.github.io/Memento/

详情
AI中文摘要

长视频生成需要重复出现的主题在各种镜头、视角、运动和场景转换中保持一致。现有的时间分解方法通过逐镜头生成视频来提高可扩展性。然而,它们主要专注于优化合理的下一镜头延续,而没有验证历史记忆是否保留了身份关键的主体证据。因此,随着生成的进行,重复出现的主题可能会被稀释、覆盖或遗忘。在本文中,我们提出了Memento,一个主体重建引导的框架,将主体保留视为一个显式的身份基础问题,其前提是:一个忠实保存主体的记忆库应该能够仅从记忆中重建该主体。具体来说,Memento联合训练自回归的下一镜头生成和基于记忆的主体重建,利用历史记忆和全局故事描述恢复目标外观。为了从短程线索中分离出长程主体证据,Memento引入了一种双查询记忆机制,其中一个查询检索与身份相关的记忆,另一个选择短上下文关键帧以实现连贯的延续。此外,一个主体感知的电影数据管道通过一致、无代词的主体描述提供精确的重建监督。实验表明,Memento在长期主体一致性、跨镜头连贯性和视觉质量方面达到了最先进的性能。

英文摘要

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

2606.14665 2026-06-15 cs.RO 新提交

EgoGuide: Egocentric Guidance for Efficient Robot-Free Demonstration Collection and Learning

EgoGuide: 以自我为中心引导的高效无机器人演示收集与学习

Yue Xu, Mingtao Nie, Tianle Li, Hong Li, Yibo Luo, Siyuan Huang, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Beijing Institute for General Artificial Intelligence (BIGAI)(北京通用人工智能研究院)

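AI summary: Proposes EgoGuide, a data collection interface that records synchronized wrist and head/egocentric observations with online visual-geometric quality guidance; combined with a Gated Egocentric Residual Policy, it reduces the number of episodes required and improves data efficiency.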

详情
AI中文摘要

目前,从真实世界演示中进行的机器人学习受到数据扩展的限制。通用操作接口(UMI)提供了一种高效的无机器人数据收集接口,然而当前的UMI风格流程通常收集冗余的演示,并且缺乏全局场景上下文。为了提高数据效率,我们提出了EgoGuide,一种收集接口,它记录同步的腕部和头部/自我中心观察,并将其与在线视觉-几何数据质量引导相结合。我们还引入了一种门控自我中心残差策略,用于从视角变化的自我中心相机中进行鲁棒学习,允许头部/自我中心上下文纠正模糊的局部观察,同时保持稳定的腕部视角控制。真实世界实验表明,EgoGuide减少了所需的数据集数并提高了数据效率。残差策略进一步提高了视觉遮挡下的鲁棒性。项目页面:此 https URL

英文摘要

Robot learning from real-world demonstrations is currently constrained by data scaling. Universal Manipulation Interface (UMI) provides an efficient robot-free data collection interface, yet current UMI-style pipelines often collect redundant demonstrations and lack global scene context. To improve data efficiency, we present EgoGuide, a collection interface that records synchronized wrist and head/egocentric observations and couples them with online visual-geometric data quality guidance. We also introduce a Gated Egocentric Residual Policy for robust learning from a viewpoint-varying egocentric camera, allowing head/egocentric context to correct ambiguous local observations while preserving stable wrist-view control. Real-world experiments show that EgoGuide reduces the required number of data episodes and improves data efficiency. The residual policy further improves robustness under visual occlusion. Project Page: https://silicx.github.io/EgoGuide

2606.14664 2026-06-15 cs.HC 新提交

The Self-Aware Body: A User-Centered Framework for Designing Therapeutic Sonic Interactions

自我感知的身体:以用户为中心的设计治疗性声音交互框架

Prithvi Ravi Kantan, Sofia Dahl, Erika G. Spaich

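AI summary: Proposes a framework for designing therapeutic movement-sonification technologies, comprising a conceptual reframing, a design platform, and a user-centered methodology, to support clinical adoption.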

详情
AI中文摘要

本章提出了一个设计治疗性声音交互技术的框架,重点关注运动声音化:将身体运动实时转换为声音,在运动康复过程中作为反馈。尽管有越来越多的证据表明其有效性,但实现运动声音化的技术尚未系统性地被临床实践采纳,这可能是由于缺乏标准化的开发方法以及临床利益相关者视角在交互设计中的整合不足。该框架通过三个相互关联的贡献来解决这些障碍。第一个是将设计任务概念重构为将声音变异性校准到听众的感知可供性和临床情境的需求。第二个是受专业音频混音工作流程启发的实用设计平台,该平台为交互设计过程施加了结构化和可学习的信号流架构,并支持快速迭代探索。第三个是改编自医疗干预科学的以用户为中心的开发方法,该方法将设计决策基于与将使用最终系统的临床医生和患者的互动。HearWalk生物反馈系统用于偏瘫步态康复说明了该框架,本章最后探讨了大语言模型和AI工具如何有意义地协助设计过程的每个阶段,以及人类临床和感知专业知识在哪些方面仍然不可替代。

英文摘要

This chapter presents a framework for designing therapeutic sonic interaction technologies, with a focus on movement sonification: the real-time conversion of bodily motion into sound that serves as feedback during motor rehabilitation. Despite growing evidence for their effectiveness, technologies implementing movement sonification are yet to be systematically adopted as part of clinical practice, potentially due to a lack of standardized development methodologies as well as inadequate integration of clinical stakeholder perspectives into interaction design. The framework addresses these barriers through three interconnected contributions. The first is a conceptual reframing of the design task as the calibration of sonic variability to the perceptual affordances of the listener and the demands of the clinical context. The second is a practical design platform inspired by professional audio mixing workflows, which imposes a structured and learnable signal-flow architecture on the interaction design process and enables rapid iterative exploration. The third is a user-centered development methodology adapted from healthcare intervention science, which grounds design decisions in engagement with the clinicians and patients who will use the resulting systems. The HearWalk biofeedback system for hemiparetic gait rehabilitation illustrates the framework, and the chapter concludes by examining where large language models and AI tools can meaningfully assist each stage of this design process, as well as where human clinical and perceptual expertise remains irreplaceable.

2606.14662 2026-06-15 cs.LG cs.SD 新提交

Beyond task performance: Decoding bioacoustic embeddings with speech features

超越任务性能:用语音特征解码生物声学嵌入

Ines Nolasco, Jules Cauzinille, Marius Miron, Gagan Narula, Milad Alizadeh, Emmanuel Fernandez, Matthieu Geist, Ellen Gilsenan-McMahon, Olivier Pietquin, Emmanuel Chemla, Sara Keen

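AI summary: Uses linear and nonlinear regression probes to reveal which speech features pretrained bioacoustic embeddings encode, finds that different models cover the acoustic space in complementary ways, and derives model selection guidance based on feature recoverability.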

Comments Accepted at Interspeech 2026

详情
AI中文摘要

预训练音频嵌入在生物声学中是标准做法,但关于这些模型编码了哪些声学特征以及哪些特征对特定任务有用,我们知之甚少。这阻碍了透明度,并限制了向稀有物种或数据稀缺领域的扩展。在这里,我们揭示了生物声学表示中编码了哪些类似语音的特征。使用跨越六个分类群的88个eGeMAPS特征,我们应用线性和非线性回归探针来量化每个模型捕获了哪些声学属性。结果证实了“没有免费午餐”的模式:没有单个模型能捕获完整的特征空间。拼接嵌入实现了最高性能,表明模型之间互补的声学空间覆盖。响度特征编码最好($R^2 = 0.76$),而F0最难恢复($R^2 = 0.33$)。通过将可恢复性与每个物种的特征显著性(NMI)交叉引用,我们为生物声学得出了数据驱动的模型选择指南。

英文摘要

Pretrained audio embeddings are standard in bioacoustics, yet little is known about which acoustic features these models encode or which are useful for a given task. This hinders transparency and limits extension to rare species or data-scarce domains. Here we reveal which speech-like features are encoded in bioacoustic representations. Using the 88 eGeMAPS features across six taxonomic groups, we apply linear and nonlinear regression probes to quantify which acoustic properties each model captures. Results confirm a "no free lunch" pattern: no single model captures the full feature space. A concatenated embedding achieves the highest performance, suggesting complementary acoustic space coverage across models. Loudness features are best encoded ($R^2 = 0.76$) while F0 is hardest to recover ($R^2 = 0.33$). By cross-referencing recoverability with per-species feature salience (NMI), we derive data-driven model selection guidance for bioacoustics.

2606.14659 2026-06-15 cs.DC 新提交

parRSB: Exascale Spectral Element Mesh Partitioning

parRSB: 极大规模谱元网格划分

Thilina Ratnayaka, Paul Fischer

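AI summary: Presents parRSB, a parallel graph partitioner based on recursive spectral bisection that produces high-quality partitions of spectral element meshes; it computes the Fiedler vector with Lanczos and with inverse iteration using conjugate gradient, and its scalability and partition quality are demonstrated on Summit and Frontier.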

详情
AI中文摘要

我们介绍了parRSB——一种用于谱元网格的并行、高度可扩展的图划分器,能够产生高质量的划分。parRSB基于递归谱二分法(RSB)算法,该算法在输入网格的对偶图上实现。RSB使用Fiedler向量,即对偶图拉普拉斯矩阵的最小非零特征值对应的特征向量,来做出划分决策,并试图最小化划分之间的通信量。我们实现了两种数值方法:Lanczos方法和使用共轭梯度法的逆迭代,以计算Fiedler向量。我们展示了在橡树岭国家实验室的Summit和Frontier超级计算机上使用parRSB的划分结果,以说明parRSB产生的划分质量以及我们实现的可扩展性。我们还展示了一些为加速划分过程所做的优化结果。

英文摘要

We introduce parRSB, a parallel, highly scalable graph partitioner for spectral element meshes that produces high-quality partitions. parRSB is based on the Recursive Spectral Bisection (RSB) algorithm, implemented on the dual graph of the input mesh. RSB makes partitioning decisions using the Fiedler vector, which is the eigenvector associated with the smallest non-zero eigenvalue of the Laplacian matrix of the dual graph, and tries to minimize the communication volume between the partitions. We implemented two numerical methods to compute the Fiedler vector: Lanczos, and inverse iteration using the conjugate gradient method. We present partitioning results using parRSB on the Summit and Frontier supercomputers at Oak Ridge National Laboratory to illustrate the quality of the partitions produced by parRSB and the scalability of our implementation. We also present results for some of the optimizations we made to speed up the partitioning process.
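One bisection step of RSB can be illustrated on a tiny graph. This sketch is serial and swaps in plain shifted power iteration for the Lanczos and inverse-iteration solvers named in the abstract, purely to stay short and dependency-free. The adjacency-list input format and the function names are assumptions, and the graph is assumed to be connected.

```python
def fiedler_vector(adjacency, iterations=500):
    """Approximate the Fiedler vector of an undirected, connected graph.

    adjacency[i] lists the neighbours of vertex i (hypothetical input format).
    """
    n = len(adjacency)
    degree = [len(nbrs) for nbrs in adjacency]
    shift = 2.0 * max(degree)  # upper bound on the largest Laplacian eigenvalue
    # Start from a ramp, which is already orthogonal to the constant vector.
    # (An unlucky vertex numbering could make it orthogonal to the answer too.)
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        # w = (shift * I - L) v, with L = D - A the graph Laplacian.
        w = [(shift - degree[i]) * v[i] + sum(v[j] for j in adjacency[i])
             for i in range(n)]
        mean = sum(w) / n  # deflate the constant eigenvector (eigenvalue 0)
        w = [x - mean for x in w]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def spectral_bisect(adjacency):
    """Split vertices into two halves by the median Fiedler value."""
    v = fiedler_vector(adjacency)
    order = sorted(range(len(v)), key=lambda i: v[i])
    half = len(order) // 2
    return sorted(order[:half]), sorted(order[half:])
```

Applying `spectral_bisect` recursively to each half gives the full recursive bisection; for a mesh, the vertices would be elements and the edges would connect elements that share a face or vertex.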

2606.14658 2026-06-15 cs.CV cs.AI 新提交

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

给AI带来头痛:针对计算机视觉应用的声学对抗攻击

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

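AI summary: Studies how audible-range sound (<20 kHz) induces physical camera vibration that causes AI vision models such as YOLO11 to misclassify, miss targets, or hallucinate objects, and analyzes the factors that affect attack effectiveness.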

Comments 9 pages, 7 figures, SPIE Defense + Security

详情
Journal ref
Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026)
AI中文摘要

人工智能(AI)越来越多地被用于自动化各种现实世界的计算机视觉(CV)应用,如自动驾驶车辆控制、面部识别和安全摄像头。最近的研究表明,声学振动可以引起相机真实的物理运动,干扰其内部稳定机制。由于这种运动超出了稳定系统设计处理的条件,系统会在帧中引入伪影,导致基于AI的CV模型误分类、错过目标或产生幻觉对象。先前的工作使用超声波频率(>20 kHz)进行短距离攻击,由于高频的衰减,这些攻击仅限于短距离。在这项工作中,我们研究了使用可听范围内较低频率(<20 kHz)的声学攻击,并进一步扩展了我们的分析,包括各种图像和物体特征如何受到攻击的影响。具体来说,我们进行了物理实验,通过用各种频率共振商用相机,证明了我们的攻击对现成目标检测模型(YOLO11)的可行性。基于我们的结果,我们提供了关于使AI CV系统更容易受到这些攻击的几个因素的见解,这可能有助于未来缓解策略的开发。

英文摘要

Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz), which restricts attacks to short distances because high frequencies attenuate quickly. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

2606.14657 2026-06-15 cs.CV 新提交

HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

HPSv3++:跨扩散模型能力全谱系扩展奖励模型

Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu

发表机构 * Tsinghua University(清华大学) JD Explore Academy(京东探索研究院) Peking University(北京大学) Zhejiang University(浙江大学)

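AI summary: Proposes the HPSv3++ reward model framework, which uses the dual-dimension preference dataset HPDv3++ and two-stage training (orthogonal gradient projection, then unsupervised guidance) to improve preference prediction across T2I models and RL iterations, achieving state-of-the-art results on several benchmarks.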

详情
AI中文摘要

奖励模型引导文本到图像(T2I)系统输出符合人类偏好的结果。然而,典型的奖励模型(如HPSv3)是在早期T2I模型的预标注数据上训练的,没有考虑因模型能力演进和强化学习(RL)迭代而产生的质量判别偏移,限制了其更广泛的适用性。在这项工作中,我们提出了HPSv3++,一个奖励模型框架,将HPSv3模型提升到适应不同T2I模型能力及其RL迭代变化的全能力-迭代谱系。具体来说,我们首先引入了HPDv3++,一个212K双维度偏好数据集,使用近期高能力(Qwen-Image)模型并辅以人工监督,对文本保真度和美学质量进行标注。然后我们提出了一个两阶段训练框架。第一阶段采用数据感知的正交梯度投影,从HPDv3++中融入多样化的美学感知,同时保留HPSv3中原始有效的人类偏好知识。第二阶段进一步利用来自不同能力水平和RL迭代的T2I模型的无标注数据,并引入一个联合能力-迭代条件的信号给奖励模型,以及一个标准差驱动的无监督引导机制,从而在能力-迭代谱系上强化奖励模型。HPSv3++实现了最先进的偏好预测,在HPDv3上比HPSv3高出9.8%,在GenAI-Bench上高出5.5%,同时在我们提出的HPDv3++上达到79.1%/88.1%。当用于T2I RL训练时,它持续提升了多种T2I模型的GenEval分数,展示了其广泛的能力。代码可在该网址获取。

英文摘要

Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for shifts in quality discrimination that arise from evolving model capabilities and reinforcement learning (RL) iterations, which limits their broader applicability. In this work, we propose HPSv3++, a reward model framework that extends the HPSv3 model to varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard-deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-ranging capabilities. The code is available at https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus.
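The abstract does not spell out how the Stage 1 projection is built. As a rough illustration only, one common form of orthogonal gradient projection removes from a new-data gradient its component along a reference direction that stands for previously learned knowledge. The function name and the single reference vector are assumptions; the "data-aware" choice of that reference in HPSv3++ is not reproduced here.

```python
def project_orthogonal(grad, reference):
    """Remove from `grad` its component along `reference`.

    Both arguments are flat lists of floats. `reference` is a hypothetical
    stand-in for a direction that encodes knowledge to be preserved.
    """
    ref_sq = sum(r * r for r in reference)
    if ref_sq == 0.0:
        return list(grad)  # nothing to protect, keep the gradient unchanged
    scale = sum(g * r for g, r in zip(grad, reference)) / ref_sq
    return [g - scale * r for g, r in zip(grad, reference)]
```

To first order, a step along the projected gradient leaves the loss associated with `reference` unchanged, which is the usual motivation for this kind of update.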

2606.14655 2026-06-15 math.NA cs.NA 新提交

Discontinuous Galerkin approximations of the Jordan-Moore-Gibson-Thompson equation in the vanishing relaxation limit

Jordan-Moore-Gibson-Thompson方程在消失松弛极限中的间断Galerkin逼近

Vanja Nikolić

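AI summary: For the JMGT equation, studies discontinuous Galerkin spatial discretizations, derives a priori error estimates independent of the relaxation parameter, proves that the semi-discrete approximations converge at a linear rate to the damped Westervelt equation in the vanishing relaxation limit, and proposes a fully discrete Newmark-type method.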

详情
AI中文摘要

Jordan-Moore-Gibson-Thompson (JMGT) 方程模拟热松弛介质中的非线性声波传播,在消失松弛极限下趋近于阻尼Westervelt方程。我们研究了JMGT方程在单纯形网格上的间断Galerkin空间离散化,并分析了它们关于松弛参数的一致行为。在实践相关的混合Neumann和吸收边界条件下,我们推导了与松弛参数无关的先验误差估计。这些估计使得严格的奇异极限分析成为可能,证明了半离散JMGT逼近以线性速率收敛到相应的Westervelt压力分布。这也揭示了精确解在消失松弛极限中的预期行为。对于全离散问题,我们提出了一种基于耦合二阶/一阶系统重新表述的Newmark型方法。数值实验支持理论发现,并证明了该方法在小参数区域中的鲁棒性。

英文摘要

The Jordan-Moore-Gibson-Thompson (JMGT) equation models nonlinear acoustic wave propagation in thermally relaxing media and in the vanishing relaxation limit approaches the damped Westervelt equation. We investigate discontinuous Galerkin spatial discretizations of the JMGT equation on simplicial meshes and analyze their behavior uniformly with respect to the relaxation parameter. Under practically relevant mixed Neumann and absorbing boundary conditions, we derive a priori error estimates independent of the relaxation parameter. These estimates enable a rigorous singular limit analysis, yielding convergence of the semi-discrete JMGT approximations to the corresponding Westervelt pressure profile at a linear rate. This also sheds light on the expected behavior of exact solutions in the vanishing relaxation limit. For the fully discrete problem, we propose a Newmark-type method based on a reformulation as a coupled second-/first-order system. Numerical experiments support the theoretical findings and demonstrate the robustness of the approach in the small-parameter regime.

2606.14654 2026-06-15 cs.AI cs.CL cs.LG 新提交

Abstracting Cross-Domain Action Sequences into Interpretable Workflows

将跨领域动作序列抽象为可解释的工作流

Gaurav Verma, Scott Counts

发表机构 * Microsoft Corporation(微软公司)

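AI summary: Introduces WorkflowView, a framework that uses large language models to abstract low-level action sequences into high-level activities; its effectiveness and generality are shown on three distinct tasks, with high semantic similarity and predictive performance.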

Comments preprint; 9 pages, 5 figures

详情
AI中文摘要

序列或时间戳交互日志提供了数字应用使用的客观记录,但其粒度和噪声常常掩盖了关于人们工作的有意义见解。这些见解对于以真实用户交互为基础改进数字产品至关重要。先前的研究应用深度学习模型将用户动作聚类为高层活动,但这些方法对噪声高度敏感且难以跨应用泛化。为解决这一局限,我们引入了WorkflowView,一个使用大语言模型(LLMs)将低层动作序列抽象为高层活动的框架。我们在三个不同且具有挑战性的序列任务和多样化领域中建立了该方法的有效性和泛化性:(a)从浏览器日志中进行零样本任务描述重构(实现高语义相似度,$\mu_{sim} = 0.91$),(b)使用MOOC交互日志进行少样本学生退学预测(仅用五个少样本示例达到加权$F_1 = 0.90$),以及(c)对Microsoft Word中文档工作流中AI工具集成进行匿名化、隐私保护分析。我们的工作表明,基于LLM的抽象是将低层行为数据转化为高层、可解释且可操作见解的稳健高效途径。我们还讨论了在日志基础设施中部署基于LLM的推理时的实际考虑,包括计算效率和用户隐私。

英文摘要

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

2606.14652 2026-06-15 cs.LO cs.PL 新提交

Syntax and semantics of focalisation with relative monads and comonads

关于相对单子和余单子的焦点化的语法与语义

Éléonore Mangel, Paul-André Melliès, Guillaume Munch-Maccagnoni

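AI summary: Studies focalised syntax for resource and effect modalities in intuitionistic and linear settings, obtains completeness for these modalities in linear call-by-push-value models via relative (co)monads, and establishes correspondences through adjunctions over non-associative categories.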

Comments Presented at the Sixth International Workshop on Structures and Deduction 2026 (SD 2026)

详情
AI中文摘要

焦点化和极化逻辑原则可用于设计良行为的相继式演算项语法,作为描述效应计算的元语言。在语义方面,这对应于以非结合范畴上的伴随形式陈述的模型计算公理化极化概念。本文研究一般直觉主义和线性设置中资源和效应模态的特殊微妙情况:指数余单子 $!$(细化 $\square$)和强单子 $\lozenge$。我们贡献的出发点是注意到,如果转向相对(余)单子,即对于 $!$ 相对于 $\downarrow$(正移位函子)的余单子和对于 $\lozenge$ 相对于 $\uparrow$(负移位函子)的单子,可以实现线性call-by-push-value模型中关于 $!$ 和 $\lozenge$ 的极化语法的完备性。这些相对(余)单子概念对call-by-push-value伴随的特殊化最近出现。然而我们提出的语法源于证明论考虑,当时未注意到与相对(余)单子的联系。因此我们的第一个评论是,在焦点化背景下,从证明论角度先前已经激发了相对于call-by-push-value伴随的(余)单子,这也为效应设置中的这些概念提供了元语言。我们从公理化、非结合的角度研究这些模态。我们回顾非结合范畴上的伴随概念,并建立该伴随概念与相对伴随概念之间的对应结果。然后将该对应扩展到建模 $!$ 和 $\lozenge$ 所需的线性-非线性及强版本的伴随。

英文摘要

The logical principles of focalisation and polarisation can be used to design well-behaved term syntaxes for sequent calculus, which play a role as meta-languages for describing effectful computation. On the semantics side, this corresponds to an axiomatic and polarised notion of model of computation stated in terms of adjunctions over non-associative categories. In this paper, we study the special and delicate cases of resource and effect modalities in a general intuitionistic and linear setting: an exponential comonad $!$ (refining $\square$) and a strong monad $\lozenge$. The starting point of our contribution is noticing that the completeness for a polarised syntax for $!$ and $\lozenge$ with respect to (co)monads in linear call-by-push-value models can be achieved if we move to relative (co)monads: more precisely, comonads relative to $\downarrow$ (the positive shift functor) for $!$ and monads relative to $\uparrow$ (the negative shift functor) for $\lozenge$. These specialisations of the concept of relative (co)monad to call-by-push-value adjunctions recently appeared. Yet the syntax we present arose from proof-theoretic consideration, without the link with relative (co)monads being noticed at the time. Our first remark is thus that (co)monads relative to a call-by-push-value adjunction have been motivated previously from a proof-theoretic perspective in the context of focalisation, which also provides a meta-language for these concepts in an effectful setting. We carry out the study of these modalities from the axiomatic, non-associative point of view. We recall the notion of adjunction over non-associative categories, and establish correspondence results between this notion of adjunction and that of relative adjunction. This correspondence is then extended to linear-non-linear and strong versions of adjunctions as needed to model $!$ and $\lozenge$.

2606.14650 2026-06-15 cs.LG 新提交

Graph Structured Combinatorial Semi-Bandit with Nonlinear Reward Associations through Separable Signals

具有非线性奖励关联的图结构组合半赌博机通过可分离信号

Christoph Bauschmann, Setareh Maghsudi

发表机构 * IEEE

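AI summary: For graph-structured combinatorial semi-bandits, develops adaptive strategies based on graph-based causal reward modeling, reproducing kernel methods, and Taylor approximation, with performance guarantees sublinear in time and linear in data volume, evaluated on synthetic and real-world transportation data.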

详情
AI中文摘要

在大量互连数据中识别最优结构需要大量的采样和计算工作。学习和利用潜在的信号依赖关系可以显著提高效率和预测能力,但非线性统计关系的普遍性增加了此类任务的复杂性。在本文中,我们开发了新颖的通用自适应策略,配备了基于图的因果奖励建模、解析再生核方法以及函数过程的泰勒逼近。我们建立了理论性能保证,在时间上呈次线性,在数据量上随时间呈线性。我们的分析涵盖了对噪声干扰、渐进模型收敛和解空间不匹配等多种不确定性的鲁棒性。该框架的通用性通过最小化条件集或对先验估计的依赖得到证实,而各种概述的修改则针对特定或扩展设置。为了证明实际有效性,我们使用基准合成和真实世界交通数据集进行了数值实验。

英文摘要

The identification of optimal structures within vast arrays of interconnected data necessitates significant sampling- and computational effort. Learning and leveraging underlying signal dependencies can improve efficiency and predictive capabilities considerably, but the ubiquity of nonlinear statistical relations amplifies the complexity of such undertakings. In this paper, we develop novel generic and adaptive strategies equipped with routines for graph-based causal reward modeling, analytic reproducing kernel methods, and Taylor approximation of functional processes. We establish theoretical performance guarantees sublinear in time and linear in data volume over time. Our analyses cover robustness to a multitude of uncertainties arising from noise interference, gradual model convergence, and solution space mismatch. The framework's general appeal is substantiated by a minimalistic set of conditions or reliance on prior estimates, while various outlined modifications address specific or extended settings. To demonstrate practical effectiveness, we conduct numerical experiments using both benchmarked synthetic and real-world transportation datasets.

2606.14648 2026-06-15 cs.LG math.OC 新提交

Which Directions Matter? Sparse Design for Affine Robust Optimization

哪些方向重要?仿射鲁棒优化的稀疏设计

Pedro Chumpitaz-Flores, My Duong, Juan S. Borrero, Kaixun Hua

发表机构 * University of South Florida(南佛罗里达大学)

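AI summary: Studies which uncertainty directions to select in robust optimization under a finite dictionary and a budget constraint, proposes a data-driven selection rule based on a coverage objective, proves the objective monotone and submodular, and gives a greedy approximation guarantee with a matching hardness barrier.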

Comments Accepted at UAI 2026

详情
AI中文摘要

鲁棒机器学习和优化依赖于不确定性模型的选择。我们研究了当由有限字典和预算约束定义时,模型必须覆盖哪些不确定性方向。选择一个子集形成一个具有闭式支持函数的原子不确定性集,从而为仿射目标产生可处理的鲁棒程序。我们提出了一种基于评估方向(包括梯度、对抗扰动或保留数据上观察到的偏移)上的覆盖目标的数据驱动选择规则。我们证明该目标是单调且次模的,支持具有$(1-1/e)$近似保证的贪心方法和匹配的难度障碍。我们还提供了一个证书,用于限制所选子集的损失,以及一个具有样本外控制的半径校准规则。

英文摘要

Robust machine learning and optimization rely on the choice of uncertainty model. We investigate which uncertainty directions a model must cover when it is defined by a finite dictionary and a budget constraint. Selecting a subset forms an atomic uncertainty set with a closed-form support function, yielding tractable robust programs for affine objectives. We propose a data-driven selection rule based on a coverage objective over evaluation directions, including gradients, adversarial perturbations, or shifts observed on held-out data. We prove this objective is monotone and submodular, supporting a greedy method with a $(1-1/e)$ approximation guarantee and a matching hardness barrier. We also provide a certificate bounding the loss from the selected subset and a radius calibration rule with out-of-sample control.
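The greedy selection mentioned in this abstract can be sketched as follows. The coverage objective used here is an assumption, not necessarily the paper's: each evaluation direction is scored by the best absolute inner product with any selected atom, which is a facility-location function and therefore monotone and submodular. The function names and the list-of-lists vector format are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def coverage(selected, atoms, directions):
    """Sum over directions of the best |<atom, direction>| among selected atoms."""
    if not selected:
        return 0.0
    return sum(max(abs(dot(atoms[i], g)) for i in selected) for g in directions)


def greedy_select(atoms, directions, budget):
    """Pick up to `budget` atoms, each time adding the largest marginal gain."""
    selected = []
    for _ in range(min(budget, len(atoms))):
        base = coverage(selected, atoms, directions)
        best, best_gain = None, 0.0
        for i in range(len(atoms)):
            if i in selected:
                continue
            gain = coverage(selected + [i], atoms, directions) - base
            if gain > best_gain:  # strict: ties keep the lower index
                best, best_gain = i, gain
        if best is None:  # no remaining atom improves coverage
            break
        selected.append(best)
    return selected
```

For a monotone submodular objective under a cardinality budget, this greedy rule attains at least a $(1-1/e)$ fraction of the optimal value, which matches the guarantee quoted in the abstract.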