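arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.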

2606.10537 2026-06-10 cs.CL New submission

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Jing Xiong, Qi Han, Shansan Gong, Yunta Hsieh, Chengyue Wu, Chaofan Tao, Chenyang Zhao, Ngai Wong

Affiliations: The University of Hong Kong; University of Michigan, Ann Arbor; LMSYS Org

AI summary: Targets the quadratic growth in computation caused by diffusion language models repeatedly re-encoding the prefix in long contexts. Proposes the Prefilling-dLLM framework, which caches KV representations chunk by chunk and selects relevant chunks based on sparsity for efficient decoding, reaching state-of-the-art acceleration results on benchmarks such as LongBench.

Comments: Technical Report

Abstract

Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1--28.0x speedup at 8K--32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.
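As a rough illustration of the top-K chunk selection described in the abstract, the following Python sketch splits a cached prefix into chunks and keeps the K chunks that best match a decode-side query. The function names, the mean dot-product relevance score, and the toy vectors are assumptions made for illustration; the abstract does not say how relevance is scored, and this is not the authors' implementation.

```python
import math


def split_into_chunks(tokens, num_chunks):
    """Partition a prefix into at most num_chunks contiguous chunks."""
    size = math.ceil(len(tokens) / num_chunks)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_top_k_chunks(query, chunk_keys, k):
    """Return indices of the k chunks whose keys best match the query.

    Scoring a chunk by its mean query-key dot product is an assumption
    of this sketch, not something stated in the abstract.
    """
    scores = []
    for idx, keys in enumerate(chunk_keys):
        mean_score = sum(dot(query, key) for key in keys) / len(keys)
        scores.append((mean_score, idx))
    top = sorted(scores, reverse=True)[:k]
    # Return the winners in their original prefix order.
    return sorted(idx for _, idx in top)


# Toy prefix: 8 cached key vectors split into N = 4 chunks.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
        [-1.0, 0.0], [-0.9, 0.0], [0.5, 0.5], [0.6, 0.4]]
chunk_keys = split_into_chunks(keys, num_chunks=4)
selected = select_top_k_chunks(query=[1.0, 0.0], chunk_keys=chunk_keys, k=2)
print(selected)
```

Under this reading, only the cached keys and values of the selected chunks would be attended to at each denoising step, so the per-step work no longer depends on the full prefix length.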

2606.10533 2026-06-10 cs.CV New submission

Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

Zihan Meng, Dexiang Hong, Weidong Chen, Ziyu Zhou, Bo Hu, Zhendong Mao

Affiliations: University of Science and Technology of China

AI summary: Proposes AVEX-Prune, a reinforcement-learning-based method that selects high-confidence tokens through a cross-modal token exchange strategy and preserves full-token quality at a 40% retention ratio.

Abstract

Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet hard-threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly confusable tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. AVEX-Prune uses an audio-visual token exchange strategy that selects truly valuable tokens by replacing low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality and measuring the resulting differences in caption generation. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).

2606.10532 2026-06-10 cs.AI New submission

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Yunhan Jiang, Wenbin Duan, Shasha Guo, Liang Pang, Xiaoqian Sun, Huawei Shen

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes ActiveMem, a framework that decouples memory from core reasoning and accumulates semantic gists through a distributed active memory system, achieving high accuracy with low overhead on long-horizon reasoning tasks.

Abstract

Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade-off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp-Plus and GAIA show that ActiveMem achieves state-of-the-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long-horizon reasoning.

2606.10531 2026-06-10 cs.CL cs.AI New submission

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Haoyu Wang, Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han

Affiliations: University of Science and Technology of China

AI summary: Proposes LC-QAT, a vector quantization-aware training framework for 2-bit weight quantization that avoids discrete codebook lookup through a differentiable linear mapping, enabling high-quality PTQ initialization and end-to-end optimization, and outperforming existing methods with only 0.1%-10% of the training data.

Comments: Accepted by ICML 2026

Abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%--10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

2606.10530 2026-06-10 cs.LG cs.AI New submission

Machine Learning Methods for Studying Latent Neural Activity Dynamics

Shufeng Kong, Fumei Deng, Xinyi Dong, Caihua Liu, Weiwei Chen, Yingheng Wang, Daniel Cao, Azahara Oliva, Antonio Fernandez-Ruiz, Carla Gomes

Affiliations: School of Software Engineering, Sun Yat-sen University; Department of Computer Science, Cornell University; Department of Neurobiology and Behavior, Cornell University; Department of Ecology and Evolutionary Biology, Cornell University; School of Computer Science and Artificial Intelligence, Foshan University

AI summary: Surveys latent variable models from state-space models to deep generative models, covering single-region dynamics, multi-region communication, and behavior-aligned modeling, and discusses large-scale neural foundation models and future challenges.

Comments: Accepted by IJCAI 2026 survey track

Abstract

Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which spans models from linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas, considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we conclude and discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research for bridging interpretable brain dynamics with reliable neural decoding.

2606.10528 2026-06-10 cs.LG cs.CL New submission

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

Guozheng Li, Xiyan Fu, Yiwen Guo

Affiliations: Southeast University; Nanyang Technological University; Independent Researcher

AI summary: Proposes representation-aware advantage estimation, which uses reward-model hidden states as auxiliary signals and computes advantages via graph propagation, improving the sample efficiency and robustness of RLHF.

Abstract

Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph, where nodes correspond to responses and edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO, and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
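The graph view of a sampled group can be sketched in a few lines of Python. The abstract does not give the propagation rule, so everything here is an assumption for illustration: group-mean baselines as the starting advantages, non-negative cosine similarity as edge weights, a single propagation step, and a hypothetical mixing coefficient `alpha`. It should not be read as the authors' GraphAE.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def group_advantages(rewards):
    """Baseline group-relative advantage: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def graph_propagated_advantages(rewards, hidden_states, alpha=0.25):
    """Blend each advantage with a similarity-weighted average of its neighbors.

    One propagation step over a cosine-similarity graph; both the edge
    weights and the mixing rule are assumptions of this sketch.
    """
    adv = group_advantages(rewards)
    n = len(adv)
    out = []
    for i in range(n):
        # Edge weights to every other response (no self-loop).
        weights = [
            max(cosine(hidden_states[i], hidden_states[j]), 0.0) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        if total:
            neighbor_avg = sum(w * a for w, a in zip(weights, adv)) / total
        else:
            neighbor_avg = adv[i]  # isolated node keeps its own advantage
        out.append((1 - alpha) * adv[i] + alpha * neighbor_avg)
    return out


# Four responses: pairs (0, 1) and (2, 3) have identical hidden states
# but disagreeing scalar rewards, mimicking a noisy reward model.
rewards = [1.0, 0.0, 1.0, 0.0]
hidden = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
smoothed = graph_propagated_advantages(rewards, hidden, alpha=0.25)
print(smoothed)
```

In this toy group, responses that look identical to the reward model but received different scalar rewards have their advantages pulled toward each other, which is the kind of noise damping the abstract attributes to using neighbor information.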

2606.10522 2026-06-10 cs.CV New submission

GUI-AC: Enhancing Continual Learning in GUI Agents

Can Lin, Tao Feng, Hangjie Yuan, Dan Zhang, Yifan Zhu, Zhonghong Ou

Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; Zhejiang University; National University of Singapore

AI summary: Targets distribution shift and reinforcement fine-tuning instability in continual learning for GUI agents. Proposes GUI-AC, which improves performance through adaptive advantage and dynamic clipping mechanisms and surpasses existing baselines.

Abstract

Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage broader exploration. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at https://anonymous.4open.science/r/GUI-AC.

2606.10520 2026-06-10 cs.CL New submission

UniSVQ: 2-bit Unified Scalar-Vector Quantization

Haoyu Wang, Haiyan Zhao, Xingyu Yu, Zhangyang Yao, Xu Han, Zhiyuan Liu, Maosong Sun

Affiliations: University of Science and Technology of China

AI summary: Proposes UniSVQ, which unifies scalar and vector quantization by parameterizing codewords as affine transforms of integer lattice points, outperforming scalar quantization and matching vector quantization at 2 bits while delivering higher inference throughput.

Comments: Accepted by ICML 2026

Abstract

Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are the two primary quantization methods; however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.
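One way to picture "codewords as an affine transform of integer lattices" is the small Python sketch below, which quantizes pairs of weights with 16 codewords (4 bits per pair, i.e. 2 bits per weight). The pair dimension, the lattice {0,1,2,3}², the matrices `A` and offsets `b`, and the nearest-codeword search are all assumptions chosen for illustration rather than details taken from the paper.

```python
from itertools import product


def make_codebook(A, b, levels=4):
    """Codewords c = A z + b for every integer point z in {0..levels-1}^2."""
    codebook = []
    for z in product(range(levels), repeat=2):
        c = (A[0][0] * z[0] + A[0][1] * z[1] + b[0],
             A[1][0] * z[0] + A[1][1] * z[1] + b[1])
        codebook.append((z, c))
    return codebook


def quantize(w, codebook):
    """Return the (integer code, codeword) pair nearest to the weight pair w."""
    return min(
        codebook,
        key=lambda zc: (zc[1][0] - w[0]) ** 2 + (zc[1][1] - w[1]) ** 2,
    )


# Diagonal A: the lattice is axis-aligned, which is ordinary per-coordinate
# scalar quantization with levels -0.75, -0.25, 0.25, 0.75.
sq_book = make_codebook(A=[[0.5, 0.0], [0.0, 0.5]], b=[-0.75, -0.75])

# Non-diagonal A: the same integer codes now index a sheared lattice,
# giving a structured vector quantizer with coupled coordinates.
vq_book = make_codebook(A=[[0.5, 0.25], [0.0, 0.5]], b=[-1.0, -0.75])

z_sq, c_sq = quantize((0.3, -0.7), sq_book)
print(z_sq, c_sq)
```

In both cases the stored code is a small integer vector, which is consistent with the abstract's point about staying compatible with integer kernels; only the affine map differs between the scalar-like and vector-like settings.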

2606.10517 2026-06-10 cs.CV New submission

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

Jiexi Lyu, Xizhou Bu, Qingqiu Huang, Chufeng Tang, Xiaoshuai Hao, Hongbo Wang, Wei Li

Affiliations: Fudan University; Morphi

AI summary: Proposes LAFP, which learns latent policies with flow matching and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment, improving success rates on imitation learning tasks by 10-15% with less than 1x additional inference overhead.

Abstract

Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavior cloning, which tends to collapse inherently multimodal action distributions into unimodal ones, thereby degrading the pretrained latent action structure. While flow matching provides a potential alternative, directly applying it leads to a misalignment between latent actions and physical actions during action decoder training, due to the stochastic nature of the learned policy. To address these issues, we propose Latent Action Flow Policy (LAFP), which leverages flow matching for latent policy learning and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. Experimental results demonstrate that LAFP consistently outperforms prior methods on downstream imitation learning tasks, achieving up to 10-15% improvement in success rate while incurring less than 1x additional inference overhead.

2606.10507 2026-06-10 cs.AI New submission

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Juncheng Diao, Zhicong Lu, Peiguang Li, Yongwei Zhou, Changyuan Tian, Qingbin Li, Rongxiang Weng, Jingang Wang, Xunliang Cai

Affiliations: Meituan; University of Chinese Academy of Sciences

AI summary: Proposes hierarchical planning and information folding, which reduces long-context interference through subgoal decomposition and history folding and combines hierarchical reflection with subgoal-oriented process rewards to improve LLM performance on multi-turn long-horizon tasks.

Abstract

While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.

2606.10504 2026-06-10 cs.AI New submission

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Trong Khiem Tran, Anh Duc Chu, Quang Hung Pham, Phi Le Nguyen, Trong Nghia Hoang

Affiliations: School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam; School of Electrical Engineering and Computer Science, Washington State University, Pullman, US

AI summary: Proposes a cross-modal knowledge distillation framework for unpaired data that transfers knowledge across modalities through two distribution-alignment mechanisms, feature alignment and label alignment, with theoretical guarantees and strong experimental results.

Abstract

Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model built on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.

2606.10501 2026-06-10 cs.RO New submission

Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults

Minsoo Jo, Taeju Kwon, Junha Chun, Youngjoon Jeong, Taesup Kim

Affiliations: Graduate School of Data Science, Seoul National University

AI summary: Shows that VLA models degrade significantly under joint-level physical faults in robots, such as actuator degradation and increased friction, and proposes J-PARC, a lightweight residual calibration framework that infers the joint-fault state and adaptively corrects actions to improve robustness.

Abstract

Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are critical because they alter the action-to-motion interface of a policy, disrupting the learned closed-loop relationship between commanded actions, realized motion, and subsequent observations. In this work, we study realistic joint-level physical faults and show that VLA models are vulnerable when predicted actions are executed through a perturbed robot body. Our analysis reveals joint-dependent effects, with heterogeneous degradation in task success across affected joints. We also show that performance drops cannot be attributed solely to physical infeasibility, since feasible faults such as increased joint friction can still substantially reduce success rates and induce closed-loop execution mismatch. Motivated by these findings, we propose Joint-level Physical-fault Aware Residual Calibrator (J-PARC), a lightweight residual calibration framework built on top of a frozen VLA policy. J-PARC infers a latent joint-fault regime from recent joint dynamics and conditions a shared residual calibrator on this regime, enabling adaptive action correction across faulty joints. Experiments show that J-PARC improves robustness under joint-level faults while preserving fault-free environment performance.

2606.10500 2026-06-10 cs.AI New submission

A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

Mingyuan Liu, Dan Yin, Zongzong Wu

Affiliations: Central South University

AI summary: Addresses the reliability of sensor readings in fault diagnosis with a reliable diagnosis method based on a belief rule base, improving model accuracy and robustness through robustness analysis and optimization strategies, validated on diesel engine and bearing fault diagnosis.

Abstract

In equipment operation, fault diagnosis is essential to ensure the continuity and safety of production equipment, improve operational efficiency, and reduce maintenance costs. Since sensor readings are widely used for fault diagnosis, their reliability directly affects the diagnosis results. To address the two problems of robustness assessment and robustness optimization of fault diagnosis models, a reliable fault diagnosis method based on a belief rule base (BRB) that considers robustness analysis is proposed. First, a systematic robustness analysis of the BRB model is carried out. Second, three robustness constraint strategies are proposed to optimize the robustness of the BRB fault diagnosis model. Finally, the effectiveness of the proposed model is verified on fault diagnosis of a WD615 diesel engine and the Case Western Reserve University bearings, and the experiments show that the proposed model improves both accuracy and robustness.

2606.10499 2026-06-10 cs.LG cs.AI New submission

MoE Enhanced Federated Learning for Spatiotemporal Prediction

Zhehao Dai, Xiao Han, Zhaolin Deng, Zijian Zhang, Xiangyu Zhao, Guojiang Shen, Xiangjie Kong

Affiliations: Zhejiang University of Technology, Zhejiang Key Laboratory of Visual Information Intelligent Processing; Jilin University; City University of Hong Kong

AI summary: Proposes MoE-FedTP, a framework that uses lightweight mixture-of-experts networks and a gating mechanism for privacy-preserving cross-city spatiotemporal prediction, mitigating data scarcity and heterogeneity.

Abstract

Traffic prediction is fundamental to intelligent transportation systems and urban computing, yet many cities continue to suffer from traffic data scarcity due to limited sensor deployment and uneven urban development. Cross-city knowledge transfer has thus attracted increasing attention, enabling data-rich cities to assist data-scarce ones. However, centralized approaches raise privacy concerns, while existing federated methods struggle with pronounced spatiotemporal heterogeneity across cities. To address these challenges, we propose MoE-FedTP, a personalized federated cross-city spatiotemporal prediction framework based on lightweight Mixture-of-Experts (MoE) networks. MoE-FedTP first employs spatiotemporal neural networks to extract features from both source and target cities, then introduces a set of expert networks derived from different source cities through partial parameter sharing. A gating mechanism dynamically fuses the experts to capture diverse traffic dynamics, achieving fine-grained modeling of urban heterogeneity while preserving privacy. Experiments on four real-world traffic datasets show that MoE-FedTP consistently outperforms state-of-the-art cross-city and federated learning baselines, demonstrating its effectiveness in enhancing prediction accuracy for data-scarce cities.
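The gated fusion of experts mentioned in the abstract follows the general mixture-of-experts pattern, which can be sketched as a softmax-weighted sum of expert predictions. The function names, the toy predictions, and the idea of passing gate logits in directly (a real gate would compute them from input features) are assumptions for illustration, not the MoE-FedTP implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_outputs, gate_logits):
    """Weighted sum of expert predictions with softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


# Three hypothetical source-city experts, each predicting traffic flow
# for the next two time steps in the target city.
expert_outputs = [[10.0, 12.0], [30.0, 28.0], [20.0, 20.0]]

# Equal gate logits weight every expert equally.
fused = fuse_experts(expert_outputs, gate_logits=[0.0, 0.0, 0.0])
print(fused)
```

Raising one gate logit shifts the fused prediction toward that expert, which is how a learned gate could favor the source city whose traffic dynamics best resemble the target city.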

2606.10492 2026-06-10 cs.CV New submission

PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

Haodong Lei, Hongsong Wang, Bingxuan Dai, Pan Zhou

Affiliations: College of Software Engineering, Southeast University; School of Computer Science and Engineering, Southeast University; School of Cyber Science and Engineering, Southeast University; School of Computing and Information Systems, Singapore Management University

AI summary: Targets slow inference in autoregressive text-to-image models caused by long sequences. Proposes a parallel-path cross-relaxed speculative Jacobi decoding framework that expands the search space with a multi-sequence draft tree and raises acceptance rates using cross-path semantic similarity, achieving 3.95-4.18x speedup.

Comments: 10 pages, 5 figures

Abstract

The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax), which exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves speedup ratios of 4.14×, 3.95×, and 4.18×, respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods such as GSD and LANTERN in speedup ratio. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at https://github.com/Haodong-Lei-Ray/PathSpec.

2606.10489 2026-06-10 cs.AI 新提交

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

PlanGPT的补充研究：使用定义性能指标评估并与规划器比较

Youssef Abdelkader, Humbert Fiorino, Damien Pellier

发表机构 * Univ. Grenoble Alpes - LIG（格勒诺布尔阿尔卑斯大学 - 信息学实验室（LIG））

AI总结本文对大型语言模型PlanGPT进行补充实验，使用规划成本和生成时间两个指标评估其性能，并与传统规划器比较，发现PlanGPT并不优于贪心搜索策略。

Comments 7 pages

AI中文摘要

自动规划是人工智能（AI）的一个子领域，其主要目标是生成一系列动作（称为规划），帮助我们从初始状态达到目标状态。规划问题由一组对象、初始状态和期望目标状态定义。目标是计算一个从初始状态到目标状态的规划。生成规划的程序称为规划器。在本文中，我们对去年发布的最新LLM——PlanGPT进行了补充研究。我们重新进行了一些实验，以验证使用LLM进行规划是否恰当且有价值。我们还检查了官方PlanGPT论文中关于规划覆盖的结果是否正确，并对PlanGPT的性能进行了更全面的研究：在我们的论文中，PlanGPT的性能使用两个指标进行评估：规划成本和规划生成时间。将PlanGPT的结果与同一规划和相同指标下传统规划器产生的结果进行比较。我们发现PlanGPT并不优于贪心搜索策略。

英文摘要

Automated Planning is a subfield of Artificial Intelligence (AI) where the main objective is generating a sequence of actions, known as a plan, that helps us reach a goal state from an initial state. A planning problem is defined by a set of objects, an initial state and a desired goal state. The objective is to compute a plan that leads us from the initial state to the goal state. Programs that generate plans are called planners. In this paper, we present a complementary study of the state-of-the-art LLM called PlanGPT, which was released last year. We redid some experiments to verify whether planning with LLMs is pertinent and worthwhile. We also checked whether the results reported in the official PlanGPT paper for plan coverage were correct, and we performed a more comprehensive study of PlanGPT's performance: in our paper, PlanGPT's performance was evaluated using two metrics, Plan Cost and Plan Generation Time. The results of PlanGPT were compared to those produced by a traditional planner for the same plans and same metrics. We discovered that PlanGPT is no better than a Greedy search strategy.

2606.10488 2026-06-10 cs.CV 新提交

5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

5% > 100%: 平坦性偏好是您进行多模态参数高效微调所需的一切

Yifan Zhu, Can Lin, Hangjie Yuan, Zixiang Zhao, Pengfei Zhang, Tao Feng, Zhonghong Ou

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Zhejiang University（浙江大学）； ETH Zürich（苏黎世联邦理工学院）； Anhui University of Science and Technology（安徽理工大学）； Tsinghua University（清华大学）； State Key Laboratory of Networking and Switching Technology（网络与交换技术国家重点实验室）

AI总结揭示参数高效微调方法中普遍存在的平坦性偏好，即少量尖锐维度主导泛化，并提出FlatPO方法优化这些维度以提升泛化性能。

AI中文摘要

参数高效微调（PEFT）方法为将大型模型适应特定领域的多模态下游任务提供了一种简化和高效的工具。尽管这些方法在实践中证明了其实际效果，但其主要方面仍未得到充分探索。因此，我们仍然对各种PEFT方法中的潜在泛化机制以及如何进一步增强它们感到好奇。在本文中，我们揭示了各种PEFT中普遍存在的平坦性偏好，其中一小部分尖锐维度主导了PEFT的泛化。这一发现暗示了一种有吸引力的可能性：我们可能只需关注这一小部分尖锐维度而非所有维度，就能获得更好的泛化。此外，我们提出了平坦性偏好优化（FlatPO）来平坦化这些关键的尖锐维度，使各种PEFT朝向更好的泛化。大量实验证明了我们的发现和所提方法的有效性。代码可在以下网址获取：https://github.com/Can-Lin/FlatPO。

英文摘要

Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods have proven effective in practice, the principles behind them remain under-explored. We therefore ask what generalization mechanisms underlie various PEFT methods and how they can be further enhanced. In this paper, we reveal the flatness preference widely present in various PEFT methods, where a small fraction of sharp dimensions dominates the generalization of PEFT. This finding suggests an appealing possibility: we may obtain better generalization by attending only to this small fraction of sharp dimensions instead of all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharp dimensions, leading various PEFT methods toward better generalization. Extensive experiments demonstrate the validity of our findings and the effectiveness of the proposed method. Code is available at https://github.com/Can-Lin/FlatPO.

2606.10487 2026-06-10 cs.LG cs.AI 新提交

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

早停早省：隐藏状态探针作为LLM输出流式审核的实用方案

Huizhen Shu, Xuying Li, Piao Xue

发表机构 * ModelOneAI ； yunshanai（云山AI）

AI总结提出基于隐藏状态的轻量级词元级探针，在解码循环中实时检测不安全输出，无需额外前向传播，实现亚毫秒级安全审核，可提前中断或修改生成。

Comments Technical Report. 14 pages, 3 figures, 4 tables

AI中文摘要

在面向用户的系统中部署大型语言模型需要高效的输出安全过滤。现有方法通常依赖于生成后应用的单独审核模型，这会使推理成本翻倍，并且仅在生成完成后检测违规。我们观察到审核所需的信号已经存在于模型隐藏状态中。基于此，我们训练了轻量级的词元级探针，直接操作内部激活，生成每个词元的安全分数，这些分数可以聚合用于离线评估和在线干预。该探针重用生成器的激活，无需额外的前向传播，从而在解码循环内实现亚毫秒级的逐词元安全检查。应用于单个中间层的探针可以恢复强防护模型的大部分决策，作为一个低成本替代方案，优化延迟而非准确性。在流式设置中，它可以在不安全输出完全生成之前暂停或修改它们，用连续的词元级监控取代序列结束时的审核。与事后和流式防护模型相比，我们的方法实现了数量级的计算开销降低，且延迟成本最小。我们还提供了一个实用的部署方案，包括层选择、聚合策略、探测频率和触发阈值。最后，我们展示了探针的线性分量对应于残差空间中的一个方向，从而能够以可忽略的成本实现检测和激活引导。

英文摘要

Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a separate moderation model applied after generation, which doubles inference cost and only detects violations after generation completes. We observe that the signal needed for moderation is already present in the model's hidden states. Based on this, we train lightweight token-level probes that operate directly on internal activations, producing per-token safety scores that can be aggregated for both offline evaluation and online intervention. The probe reuses activations from the generator and requires no additional forward pass, enabling sub-millisecond per-token safety checks inside the decoding loop. A probe applied to a single middle layer recovers most decisions of a strong guard model, acting as a low-cost surrogate optimized for latency rather than accuracy. In streaming settings, it can halt or modify unsafe outputs before they are fully generated, replacing end-of-sequence moderation with continuous token-level monitoring. Compared to post hoc and streaming guard models, our method achieves orders of magnitude lower compute overhead with minimal latency cost. We also provide a practical deployment recipe, including layer selection, aggregation strategy, probing frequency, and triggering thresholds. Finally, we show that the probe's linear component corresponds to a direction in residual space, enabling both detection and activation steering at negligible cost.
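
The streaming check described in this abstract can be illustrated with a small sketch. It assumes a logistic linear probe (weights `w`, bias `b`) applied to one hidden-state vector per generated token, a sliding-window mean as the aggregation rule, and a fixed trigger threshold. The function names, the aggregation rule, and the default values are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from typing import Iterable, List, Optional, Sequence


def probe_score(hidden: Sequence[float], w: Sequence[float], b: float) -> float:
    """Logistic probe: sigmoid(w . h + b), read here as P(unsafe) for one token."""
    z = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    return 1.0 / (1.0 + math.exp(-z))


def first_unsafe_step(
    hidden_states: Iterable[Sequence[float]],
    w: Sequence[float],
    b: float,
    threshold: float = 0.8,  # assumed trigger threshold
    window: int = 2,         # assumed aggregation window
) -> Optional[int]:
    """Return the index of the first token at which the mean score over the
    last `window` tokens reaches `threshold`, or None if it never does."""
    recent: List[float] = []
    for step, h in enumerate(hidden_states):
        recent.append(probe_score(h, w, b))
        recent = recent[-window:]  # keep only the sliding window
        if len(recent) == window and sum(recent) / window >= threshold:
            return step  # a decoder could halt or rewrite output here
    return None


if __name__ == "__main__":
    # Toy 2-d hidden states: the first coordinate drifts toward "unsafe".
    states = [[-5.0, 0.0], [-5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
    print(first_unsafe_step(states, w=[1.0, 0.0], b=0.0))
```

Because the probe only reads vectors the generator has already computed, the per-token cost in this sketch is a single dot product, which is the property the abstract relies on for in-loop checks.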

2606.10481 2026-06-10 cs.LG cs.AI cs.CL cs.CR stat.ML 新提交

Advancing the State-of-the-Art in Empirical Privacy Auditing

推进经验隐私审计的最新水平

Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz

发表机构 * Google Research（谷歌研究院）

AI总结提出通过高温采样生成合成金丝雀，用于经验隐私审计，并引入基于辅助模型的合成数据审计方法，系统研究模型容量与金丝雀熵对记忆化的交互影响。

AI中文摘要

大型语言模型的参数高效微调可能表现出对个别训练示例的问题性记忆。经验隐私审计（EPA）通过测量成员推断（MI）或重构攻击上的实际数据泄露来量化这种风险。EPA的一个关键挑战是设计与隐私敏感训练数据混合的“金丝雀”示例。我们提出通过从LLM中进行高温采样（$T \geq 0.8$）生成合成金丝雀，使用针对隐私敏感训练数据定制的提示。这些金丝雀作为高影响异常值，确保高可识别性，从而实现强审计。此外，由于金丝雀本身是非私有的，它们是可检查的，并且可以重复插入，而不会危及真实数据的隐私。在隐私敏感数据上微调的模型的一个重要用途是生成合成数据。这也带来了隐私风险。我们引入了一种强大的合成数据审计方法，基于在合成数据上微调辅助模型。然后，对原始金丝雀的辅助模型进行审计，可以强有力地估计通过合成数据的隐私泄露。最后，利用我们强大的审计方法，我们系统研究了模型容量和金丝雀熵对记忆化的交互影响。

英文摘要

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.
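
To make the link between sampling temperature and canary entropy concrete, the sketch below applies temperature scaling to a softmax over next-token logits and measures the Shannon entropy of the result. The logits are invented toy values and the function names are hypothetical; this shows only the generic temperature mechanism, not the paper's canary-generation pipeline.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits / T; larger T flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    toy_logits = [4.0, 2.0, 1.0, 0.5]  # assumed values, for illustration only
    for t in (0.2, 0.8, 1.2):
        h = entropy(softmax_with_temperature(toy_logits, t))
        print(f"T={t}: entropy={h:.3f} nats")
```

Higher temperatures spread probability mass over less likely tokens, so sampled sequences are more unusual, which is consistent with the abstract's description of canaries as high-influence outliers.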

2606.10479 2026-06-10 cs.AI 新提交

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

ComBench: 奥林匹克级组合数学中严格证明推理与构造实现的基准测试

Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； Peking University（北京大学）； Shanghai Jiao Tong University（上海交通大学）； Tsinghua University（清华大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出ComBench基准，包含100道奥林匹克级组合问题，分分析和构造两类，通过评分与验证评估大模型推理能力，发现最强模型准确率仅65.4%，且证明推理与构造实现能力存在差异。

Comments 39 pages, 6 figures, 26 tables. Project page: https://simplified-reasoning.github.io/ComBench/docs/

AI中文摘要

组合数学是奥林匹克级数学问题解决的核心，需要深入的离散推理、创造性构造和严格的结构洞察。最近的证据表明，即使今天最强的前沿模型在奥林匹克组合问题上仍表现不均，揭示了创造性数学推理方面的差距。我们引入了ComBench，一个奥林匹克级组合数学基准，用于评估和诊断大语言模型的组合推理能力。ComBench包含100道人工标注的竞赛级问题，围绕两个互补的设置组织：以分析为中心的问题，主要需要严格的数学论证；以及以构造为中心的问题，除了正确性证明外还需要显式构造。评估协议结合了基于评分标准的证明评分和确定性构造验证，揭示了证明质量和构造有效性存在分歧的情况。对前沿开源和闭源模型的实验表明，ComBench远未饱和：最强模型总体平均准确率为65.4%，总体Best@4为75.3%。我们进一步发现，严格证明推理和构造实现是不同的能力：Kimi-K2.6在分析中心的证明评分上落后于GPT-5.5，但在构造中心的Best@4上超过它，而存在性和构造问题在代表性前沿模型中始终是最难的。

英文摘要

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
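
The abstract reports an overall "Avg." and an overall "Best@4" without defining them. The sketch below shows one plausible reading, stated as an assumption: each problem is attempted k = 4 times, "Avg." is the mean score over those attempts, "Best@4" is the best score among them, and both are then averaged over problems. The benchmark's actual definitions may differ, and the scores below are made up.

```python
from typing import Dict, List, Tuple


def avg_and_best_at_k(scores: Dict[str, List[float]], k: int = 4) -> Tuple[float, float]:
    """Return (mean over problems of the per-problem average of k attempts,
    mean over problems of the per-problem best of k attempts)."""
    if not scores:
        raise ValueError("no problems given")
    per_avg: List[float] = []
    per_best: List[float] = []
    for problem, attempts in scores.items():
        if len(attempts) < k:
            raise ValueError(f"{problem} has fewer than {k} attempts")
        first_k = attempts[:k]
        per_avg.append(sum(first_k) / k)
        per_best.append(max(first_k))
    n = len(scores)
    return sum(per_avg) / n, sum(per_best) / n


if __name__ == "__main__":
    toy = {  # hypothetical per-attempt scores in [0, 1]
        "p1": [1.0, 0.0, 0.0, 1.0],
        "p2": [0.0, 0.0, 0.0, 0.0],
        "p3": [0.0, 0.5, 1.0, 0.5],
    }
    print(avg_and_best_at_k(toy))
```

Under this reading Best@4 can never fall below Avg., which matches the ordering of the two headline numbers in the abstract (75.3% versus 65.4%).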

2606.10478 2026-06-10 cs.CV 新提交

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

3D-CoS：基于VLM代码合成的新型3D重建范式

Yuhao Wang, Puyi Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yu Cheng

发表机构 * Shanghai Jiao Tong University（上海交通大学）； The Chinese University of Hong Kong（香港中文大学）； Microsoft（微软）； University of Oxford（牛津大学）

AI总结提出3D代码合成（3D-CoS）范式，将3D资产表示为可执行的Blender代码，利用VLM进行程序化重建，实现高可控性和局部编辑能力。

Comments Preprint. 24 pages, 11 figures

AI中文摘要

最近的3D重建和编辑系统大多基于隐式或显式表示，如NeRF、点云或网格。尽管这些表示能够实现高保真渲染，但它们本质上是低层次的，难以通过编程控制。相比之下，我们提出并系统评估了一种新的3D重建范式——3D代码合成（3D-CoS），其中3D资产被构建为可执行的Blender代码，这是一种可编程且可解释的媒介。为了评估当前VLM使用代码表示3D对象的能力，我们在统一协议下评估了代表性的开源和闭源VLM在基于代码的重建中的表现。我们进一步引入了一套结构化的代码合成工作流，包括基于蓝图的规划、Blender API文档的检索增强生成（RAG）、少样本几何演示以及用于逐部分代码生成的组件级Agent工作流。为了展示这种表示的独特优势，我们进一步评估了局部文本驱动的修改，并将我们的基于代码的编辑与基于点云的3D编辑基线进行了比较。我们的研究表明，代码作为3D表示提供了强大的可控性和局部性，在目标编辑评估中产生了更强的编辑保真度和更好的未编辑区域保持。我们的工作还分析了这种范式的潜力，描绘了当前VLM在程序化3D建模中的能力边界，并强调了代码合成作为可编辑3D重建的一个有前景的方向。

英文摘要

Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

2606.10471 2026-06-10 cs.CL cs.AI 新提交

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

使用递归神经张量网络检测生物医学文本中的推测性语言

Dhruv Dixit

发表机构 * Stevens Institute of Technology（史蒂文斯理工学院）

AI总结利用分布式句子表示和深度学习技术，提出递归神经张量网络（RNTN）用于自动检测生物医学文献中的推测性语言，性能略优于线性双元SVM（F1=0.885 vs 0.881）。

Comments 12 Pages

AI中文摘要

在本研究中，我们通过利用分布式句子表示和先进的深度学习技术，深入探讨了生物医学文章中推测性语言的自动检测。这种识别的意义延伸至信息检索、多文档摘要以及新知识的探索。我们的探索涵盖了两种获取分布式句子表示的不同方法：段落向量模型和递归神经张量网络。然后，将这些方法与三种基础基线算法进行严格比较：支持向量机、朴素贝叶斯和模式匹配。我们的发现表明，递归神经张量网络（RNTN）的性能（F1=0.885）略优于表现最佳的基线线性双元SVM（F1=0.881）。同时，段落向量模型即使在使用大规模未标记数据集进行广泛训练后，效果也较差（F1=0.368）。我们对影响这些性能差异的因素进行了全面讨论，并为未来的研究方向提供了有见地的建议。

英文摘要

In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.

2606.10468 2026-06-10 cs.CV 新提交

Geometric Coastline Localization using Vision-Language Models

基于视觉语言模型的海岸线几何定位

Rafia Malik, Bernhard Pfahringer, Karin Bryan, Mark Dickson, Eibe Frank

发表机构 * The University of Waikato（怀卡托大学）； The University of Auckland（奥克兰大学）

AI总结提出将海岸线提取视为几何边界定位任务，基于GeoChat-7B/LLaVA-1.5架构构建CoastlineVLM-7B模型，直接预测折线而非分割掩码，在几何指标上优于传统分割方法。

AI中文摘要

遥感图像中的海岸线检测通常被表述为逐像素分割问题，通过后处理从预测掩码中提取最终海岸线。这种表述将海岸线几何（海岸变化分析中使用的主要表示）降级为次要产物而非学习目标。在实践中，海岸线由地貌代理（如植被线、沙丘趾或悬崖边缘）定义，而非像素级分割方法中常用的瞬时水陆边界。在这项工作中，我们从表示角度重新审视海岸线提取，并将任务表述为几何边界定位。我们使用新西兰海岸变化数据集（NZCCD）和来自新西兰土地信息局（LINZ）的高分辨率航空影像，开发了CoastlineVLM-7B，这是一个基于GeoChat-7B/LLaVA-1.5架构的视觉语言模型（VLM），联合执行海岸线存在检测、代理类型分类和海岸线定位。该模型直接预测海岸线为折线，而非密集分割掩码。我们在严格的单像素边界监督下，将CoastlineVLM-7B与分割基线进行评估。结果表明，基于几何的指标比像素重叠指标（如交并比IoU）更适合评估海岸线定位质量。CoastlineVLM-7B改善了与参考海岸线的全局几何对齐，将豪斯多夫距离从37.74米降至31.84米，地球移动距离从21.12米降至17.32米。这些结果表明，输出表示是海岸线提取中的关键设计选择，而面向几何的学习结合视觉语言模型的语义推理能力，与运营海岸监测中海岸线的定义和评估方式高度一致。

英文摘要

Coastline detection in remote sensing imagery is commonly formulated as a pixel-wise segmentation problem, where the final coastline is extracted from a predicted mask through post-processing. This formulation relegates coastline geometry, the primary representation used in coastal change analysis, to a secondary artifact rather than the learning objective. In practice, coastlines are defined by geomorphic proxies such as vegetation lines, dune toes, or cliff edges, rather than an instantaneous land-water boundary often used in pixel-based segmentation approaches. In this work, we revisit coastline extraction from a representation perspective and formulate the task as geometric boundary localization. We use the New Zealand Coastal Change Dataset (NZCCD) and high-resolution aerial imagery from Land Information New Zealand (LINZ) to develop CoastlineVLM-7B, a vision-language model (VLM) built on the GeoChat-7B/LLaVA-1.5 architecture that jointly performs coastline presence detection, proxy-type classification, and coastline grounding. The model directly predicts a coastline as a polyline rather than a dense segmentation mask. We evaluate CoastlineVLM-7B against segmentation baselines under strict one-pixel boundary supervision. Results show that geometry-based metrics are more suitable for assessing coastline localization quality than pixel-overlap metrics such as Intersection over Union (IoU). CoastlineVLM-7B improves global geometric alignment with reference coastlines, reducing Hausdorff distance from 37.74 m to 31.84 m and Earth Mover's Distance from 21.12 m to 17.32 m. These results indicate that output representation is a critical design choice in coastline extraction, and that geometry-oriented learning, combined with the semantic reasoning capabilities of vision-language models, aligns well with how coastlines are defined and evaluated in operational coastal monitoring.
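
The geometry-based evaluation mentioned above can be illustrated with the Hausdorff distance between a predicted and a reference polyline. The sketch below is a simplified, vertex-only version: it treats each polyline as a set of sampled points, which only approximates the distance between the continuous curves. The coordinates are invented, and the paper's exact evaluation procedure is not specified in the abstract.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: List[Point], b: List[Point]) -> float:
    """Largest distance from any point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: List[Point], b: List[Point]) -> float:
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # toy coordinates in metres
    reference = [(0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
    print(hausdorff(predicted, reference))
```

Unlike IoU on a one-pixel-wide mask, this measure grows smoothly with how far the worst-placed vertex lies from the reference line, which is why a distance-based metric suits polyline outputs.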

2606.10467 2026-06-10 cs.CL 新提交

Large Language Models as Modal Models in Linguistics

大语言模型作为语言学中的模态模型

Haruto Suzuki, Saku Sugawara

发表机构 * Keio University（庆应义塾大学）； National Institute of Informatics（国立信息学研究所）； University of Tokyo（东京大学）

AI总结本文应用科学哲学中的模态建模框架，论证大语言模型作为最小模型具有真正的认知价值，能提供“如何可能解释”，但当前尚不满足“如何实际解释”的条件，其解释力位于两者之间的连续统上。

AI中文摘要

大语言模型（LLMs）的快速发展加剧了关于它们对语言学理论重要性的争论。这些争论通常分为三种立场：绝缘主义，认为LLMs与人类语言无关；消除主义，声称LLMs可以取代传统语言学理论；以及调和主义，将LLMs视为语言学研究的有用工具。为澄清这些立场，本文应用了科学哲学中的模态建模框架。我们认为，即使没有与人类认知的结构对应，LLMs作为最小模型也具有真正的认知价值。特别是，它们可以通过测试关于语言习得和语言能力的模态主张来提供“如何可能解释”（HPEs）。然后，我们基于科学解释的机制说明，考察了LLMs有资格成为人类语言的“如何实际解释”（HAEs）的条件。我们认为当前的LLMs尚未满足这些要求。在此分析基础上，我们提出将LLMs的解释力理解为位于HPEs和HAEs之间的连续统上。这一框架既避免了夸大也避免了低估它们的解释意义，并为评估LLMs在语言科学研究中的作用提供了更精确的基础。

英文摘要

The rapid advancement of large language models (LLMs) has intensified debates about their significance for linguistic theory. These debates are commonly divided into three positions: insulationism, which regards LLMs as irrelevant to human language; eliminativism, which claims that LLMs can replace traditional linguistic theories; and conciliationism, which views them as useful tools for linguistic research. To clarify these positions, this paper applies the framework of modal modeling from the philosophy of science. We argue that LLMs possess genuine epistemic value as minimal models, even without structural correspondence to human cognition. In particular, they can provide how-possibly explanations (HPEs) by testing modal claims about language acquisition and linguistic competence. We then examine the conditions under which LLMs could qualify as how-actually explanations (HAEs) of human language, drawing on the mechanistic account of scientific explanation. We argue that current LLMs do not yet satisfy these requirements. On the basis of this analysis, we propose understanding the explanatory power of LLMs as lying on a continuum between HPEs and HAEs. This framework avoids both overstating and understating their explanatory significance and offers a more precise basis for evaluating the role of LLMs in the scientific study of language.

2606.10461 2026-06-10 cs.LG cs.AI cs.CL 新提交

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

ERAlign: 文本属性图上GNN与LLM的基于能量的表示对齐

Xianlin Zeng, Fan Xia, Xiangyu Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出ERAlign框架，利用能量模型对齐GNN和LLM的表示，通过能量差异优化实现分布一致性，在8个数据集上取得最优性能。

Comments Accepted to ICML 2026

AI中文摘要

文本属性图（TAGs）将文本节点属性与图结构相结合，以描述丰富的关联语义。最近整合图神经网络（GNNs）和大语言模型（LLMs）的努力在TAGs学习上显示出前景，但实现良好对齐的表示仍然具有挑战性。先前的研究主要依赖于执行粗粒度匹配的启发式方法。它们缺乏足够的约束，忽略了分布对齐，导致表示漂移和泛化能力有限。基于能量模型（EBMs），我们提出了一种基于能量的表示对齐（ERAlign）框架，该框架将GNN编码的图结构和LLM导出的文本嵌入投影到共享潜在空间，以实现分布一致性。具体来说，层间对齐通过距离度量量化，并通过EBM目标进行优化。通过降低能量值，我们的框架为下游任务产生良好对齐的表示。在训练过程中，我们引入能量差异（ED）以避免与难以处理的归一化相关的高采样成本。ED还具有更高的训练效率和减少能量景观失真的理论保证。在八个TAG数据集上的实证评估表明，ERAlign在不同监督水平和跨任务迁移场景下均获得了最先进的性能。

英文摘要

Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose an Energy-based Representation Alignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenarios.

2606.10460 2026-06-10 cs.CL cs.AI 新提交

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

LakeQA：百万级数据湖上的探索性问答基准

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

发表机构 * Columbia University（哥伦比亚大学）； New York University（纽约大学）； Barnard College（巴纳德学院）

AI总结提出LakeQA基准，要求LLM在9.5TB异构数据湖中搜索并多跳推理，GPT-5.2仅达18.37%精确匹配，挑战性强。

AI中文摘要

近期的大语言模型（LLM）在基于阅读的问答（QA）方面取得了快速进展，其中证据被明确提供或可以轻松检索。相比之下，现实世界的问题通常不与准确的证据文档配对。有用的证据存在于海量数据湖中，使得搜索成为回答的前提。然而，目前缺乏要求在大型数据湖上进行搜索和推理的综合基准。为此，我们引入了LakeQA，一个针对数据湖上以搜索为中心的问答的综合基准，同时强调搜索和推理能力。LakeQA建立在来自维基百科和开源政府数据的大约9.5 TB文本资源的异构集合上，涵盖结构化和非结构化数据。为确保任务质量，每个样本至少由一名博士级专家标注。每个任务需要长期的多跳推理，包含隐式的中间步骤：智能体需要发现正确的文档，然后跨来源组合证据以产生答案。在七个前沿LLM上的实验结果表明，LakeQA具有挑战性。例如，GPT-5.2在LakeQA上仅达到18.37%的精确匹配分数。总体而言，LakeQA为开发能够在现代数据湖中查找和分析数据的LLM智能体提供了一个现实的测试平台。

英文摘要

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

2606.10457 2026-06-10 cs.AI 新提交

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Trace2Policy：从专家行为轨迹到自我进化的决策代理

Junli Zha, Jinbo Wang, Chao Zhou, Xiang Song

发表机构 * SF Express（顺丰速运）

AI总结提出Trace2Policy框架，通过错误驱动的迭代技能精炼（EISR）从专家行为中提取可读规则，在合规敏感任务中规则质量是关键性能杠杆，经8轮迭代后编译为确定性Python代码达到79.6%准确率，并在实际部署中优于纯LLM基线。

AI中文摘要

企业专家在审计、合规和合同审查中隐性应用的决策规则可以通过迭代错误分析系统地恢复和改进。我们提出Trace2Policy，其核心机制EISR（Error-driven Iterative Skill Refinement）将人类可读的规则文档作为优化目标：每轮在验证集上执行规则，按根本原因将错误聚类为MISSING、WRONG或CONFLICT类型，应用针对性补丁，并仅提交通过回归门的补丁。对于这类合规敏感、基率偏斜的决策任务，我们确定规则质量（而非模型能力）是主导性能杠杆：在五个LLM上，一次性蒸馏在部署池上停滞在约70%，而八轮EISR将相同规则提升至79.6%（编译为确定性Python，推理时零LLM调用）。执行形式放大了收益：在生产中，相同的EISR精炼内容作为编译Python运行比作为LLM提示高出9.8个百分点，这是一个形式与工程捆绑包，经过22天部署共同成熟。在一家大型物流承运商（3,349个审计案例）部署22天后，编译管道优于其替代的纯LLM基线（72.7%）；在这些校准的、基率偏斜的工作负载上，重新启用LLM回退会单调地降低准确率。一种LLM驱动的变体Auto-EISR以每周期5-10美元（对比约70专家小时）复现了这种精炼，并无需重新工程即可迁移到涵盖法律推理（LegalBench）和流程挖掘决策（BPIC 2012）的四个公开基准上。

英文摘要

Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and improved through iterative error analysis. We present Trace2Policy, whose core mechanism -- EISR (Error-driven Iterative Skill Refinement) -- maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate. For this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality -- not model capability -- as the dominant performance lever: across five LLMs, one-shot distillation plateaus at roughly 70% on the deployed pool, while eight EISR rounds lift the same rules to 79.6% when compiled into deterministic Python -- zero LLM calls at inference. Execution form compounds the gain: in production, the same EISR-refined content runs 9.8 pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together. Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy. An LLM-driven variant, Auto-EISR, reproduces this refinement at $5 to $10 per cycle versus roughly 70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.
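
The "commit only patches that pass a regression gate" step can be sketched as a small loop over candidate rule patches. Everything here is a hypothetical illustration: rules are modelled as ordered (condition, decision) pairs with first-match-wins semantics, a patch is a single new rule placed at the front, and the gate is "validation accuracy improves and regression accuracy does not drop". The clustering of errors into MISSING, WRONG, and CONFLICT types is omitted, and none of these names come from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Case = Dict[str, float]
Rule = Tuple[Callable[[Case], bool], str]  # (condition, decision)
Labelled = Sequence[Tuple[Case, str]]


def decide(rules: Sequence[Rule], case: Case, default: str = "approve") -> str:
    """First matching rule wins; deterministic, with no model call."""
    for condition, decision in rules:
        if condition(case):
            return decision
    return default


def accuracy(rules: Sequence[Rule], labelled: Labelled) -> float:
    """Fraction of labelled cases on which the rules reproduce the label."""
    return sum(decide(rules, c) == y for c, y in labelled) / len(labelled)


def refine(rules: Sequence[Rule], patches: Sequence[Rule],
           validation: Labelled, regression: Labelled) -> List[Rule]:
    """Try each patch in turn; keep it only if validation accuracy improves
    and regression accuracy does not drop (the assumed regression gate)."""
    current = list(rules)
    for patch in patches:
        trial = [patch] + current  # the patch takes priority over older rules
        if (accuracy(trial, validation) > accuracy(current, validation)
                and accuracy(trial, regression) >= accuracy(current, regression)):
            current = trial
    return current


if __name__ == "__main__":
    validation = [({"amount": 500.0}, "reject"), ({"amount": 50.0}, "approve")]
    regression = [({"amount": 120.0}, "approve"), ({"amount": 150.0}, "approve")]
    too_broad = (lambda c: c["amount"] > 100.0, "reject")  # breaks regression cases
    targeted = (lambda c: c["amount"] > 400.0, "reject")   # fixes the error only
    final = refine([], [too_broad, targeted], validation, regression)
    print(len(final), decide(final, {"amount": 500.0}))
```

In the toy run the over-broad patch fixes the validation error but flips two previously correct regression cases, so it is discarded, while the targeted patch is committed.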

2606.10450 2026-06-10 cs.CV cs.LG 新提交

Few-step Generative Models as Lossy Compression

少步生成模型作为有损压缩

Fuma Kimishima, Jinjia Zhou

发表机构 * University of Tokyo（东京大学）

AI总结研究将少步生成模型（Rectified Flow、CTM、MeanFlow）用于反向信道编码框架进行有损压缩，通过参数化等效和局部高斯近似实现无需重训练的编解码，在低分辨率基准上减少编解码时间并提升低比特率下的真实性。

AI中文摘要

DiffC 提供了一种重用预训练扩散模型进行有损压缩的原则性方法，但其编码和解码过程仍然缓慢，因为它们需要许多离散化的前向和反向步骤。我们研究少步生成模型——Rectified Flow、一致性轨迹模型（CTM）和 MeanFlow——是否可以在相同的反向信道编码（RCC）框架中作为编解码器使用。主要挑战在于 RCC 需要后验和共享分布参数，而这些模型并未显式参数化中间条件分布。对于 Rectified Flow 和 MeanFlow，我们利用速度参数化与扩散式去噪参数化之间的等价性来推导 RCC 所需的量。对于从 EDM 蒸馏得到的 CTM，我们采用 EDM 噪声参数化以及中间状态下发送方和共享分布的局部高斯近似。这产生了一个概念验证的概率公式，使得无需重新训练即可使用预训练的少步生成模型进行压缩。在低分辨率基准上，由此产生的编解码器减少了编码和解码时间，并在低比特率范围内提高了真实性。

英文摘要

DiffC provides a principled way to reuse pre-trained diffusion models for lossy compression, but its encoding and decoding procedures remain slow because they require many discretized forward and reverse steps. We study whether few-step generative models -- Rectified Flow, Consistency Trajectory Models (CTM), and MeanFlow -- can be cast as codecs within the same reverse channel coding (RCC) framework. The main challenge is that RCC requires posterior and shared distribution parameters, whereas these models do not explicitly parameterize intermediate conditional distributions. For Rectified Flow and MeanFlow, we use the equivalence between velocity parameterization and diffusion-style denoising parameterization to derive the quantities required by RCC. For CTM, which is distilled from EDM, we adopt the EDM noise parameterization together with local Gaussian approximations of the sender and shared distributions at intermediate states. This yields a proof-of-concept probabilistic formulation that enables compression with pre-trained few-step generative models without retraining. On low-resolution benchmarks, the resulting codecs reduce encoding and decoding time and improve realism in the low-bit-rate regime.

2606.10448 2026-06-10 cs.LG cs.AI 新提交

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

通过量子表示缓解低信噪比金融强化学习中的偏差

Zeyu Liu, Xuanzhi Feng, Sing Kwong Lai, Yuanchen Gao, Xiaoyi Pang, Hualei Zhang, Jingcai Guo, Jie Zhang, Song Guo

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）

AI总结针对低信噪比金融市场中SAC算法的不稳定性，提出FPQC-SAC变体，在表征层使用参数化量子电路约束特征传播，减少极端波动影响，在真实组合管理任务中累计收益相对提升66.89%。

Comments Preprint. Code available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main

详情

AI中文摘要

金融市场是典型的低信噪比（SNR）环境，这常常使Soft Actor-Critic（SAC）等离策略最大熵方法不稳定。具体来说，噪声状态表示可能产生不可靠的Q值估计，而自举会放大这些误差，形成我们称之为“金融熵陷阱”的失效模式。在本文中，我们提出FPQC-SAC，一种高效且即插即用的SAC变体，它在演员和评论家网络之前放置一个紧凑且有界的参数化量子电路（PQC），以在表征层约束特征传播，而不是过滤原始输入或在自举后正则化Q值。值得注意的是，FPQC-SAC减少了极端市场波动对贝尔曼目标估计的影响，而可训练的量子纠缠保留了灵活的跨资产交互。在真实投资组合管理任务上的实证评估表明，FPQC-SAC通过实现比标准无约束SAC累计收益相对提升66.89%，显著增强了样本外稳定性和累计收益，并且比最佳连续控制深度强化学习基线高出约27%。开源代码可在 https://github.com/ZeyuLIU-UST/FPQC-SAC-main 获取。

英文摘要

The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the "Financial Entropy Trap". In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns by achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforms the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main.

2606.10445 2026-06-10 cs.LG cs.CL New submission

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

Jaeseong Lee, Seung-won Hwang, Samyam Rajbhandari

Affiliations: Snowflake AI Research; Seoul National University

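AI summary: Proposes Spense, a hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. Combined with the one-shot pruning method SpenseGPT, it achieves up to 1.2x end-to-end decoding speedup on B200 GPUs while preserving model accuracy.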

Abstract

Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.
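To make the hybrid format concrete, here is a minimal pure-Python sketch of a matrix split into a 2:4 sparse region and a dense region. It rests on several assumptions that the abstract does not confirm: the split is taken along output rows, the 2:4 pattern is chosen by plain magnitude (SpenseGPT's actual pruning criterion and its two dense-region selection strategies are not reproduced), and the names `prune_2_4`, `split_hybrid`, `hybrid_matvec`, and `effective_sparsity` are hypothetical. What it does show is that the two regions can be multiplied independently, and that the overall sparsity falls below the strict 50% of pure 2:4.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out


def split_hybrid(weight, sparse_rows):
    """Assumed split: the first `sparse_rows` rows become 2:4 sparse."""
    sparse = [prune_2_4(r) for r in weight[:sparse_rows]]
    dense = [list(r) for r in weight[sparse_rows:]]
    return sparse, dense


def matvec(rows, x):
    """Plain matrix-vector product, standing in for a GEMM call."""
    return [sum(w * v for w, v in zip(r, x)) for r in rows]


def hybrid_matvec(sparse, dense, x):
    """Multiply each region separately and concatenate the output slices."""
    return matvec(sparse, x) + matvec(dense, x)


def effective_sparsity(sparse, dense):
    """Fraction of zero entries across both regions."""
    rows = sparse + dense
    total = sum(len(r) for r in rows)
    zeros = sum(v == 0.0 for r in rows for v in r)
    return zeros / total
```

With half of the rows in the sparse region, a matrix that has no zeros to begin with ends up 25% sparse overall, since each sparse row drops exactly half of its entries. On real hardware the two `matvec` calls would map to a sparse GEMM and a dense GEMM respectively.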

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Machine Learning Methods for Studying Latent Neural Activity Dynamics

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

GUI-AC: Enhancing Continual Learning in GUI Agents

UniSVQ: 2-bit Unified Scalar-Vector Quantization

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults

A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

MoE Enhanced Federated Learning for Spatiotemporal Prediction

PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

Advancing the State-of-the-Art in Empirical Privacy Auditing

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

Geometric Coastline Localization using Vision-Language Models

Large Language Models as Modal Models in Linguistics

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Few-step Generative Models as Lossy Compression

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference