arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04552 2026-06-04 cs.CL q-bio.GN

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

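Daria Ledneva, Denis Kuznetsov

Affiliations: University of California, Berkeley

AI summary: Proposes LDARNet, a 120M-parameter hierarchical genomic foundation model that combines dynamic chunking with bidirectional routing; it outperforms larger models across 27 tasks, and its learned boundaries align with biological motifs.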

Abstract

Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models ($<$300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20$\times$ larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.

2606.04545 2026-06-04 cs.CV

Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization

Zhenliang Li, Yutao Hu, Qixiong Wang, Wenpeng Du, Hongxiang Jiang, Jiasong Wu, Xiaolong Jiang, Jungong Han

Affiliations: Southeast University; Xiaohongshu Inc.; Tsinghua University

AI summary: To address the limited visual realism, manipulation diversity, and generator coverage of existing image manipulation detection and localization benchmarks, proposes the Impostor dataset and the CraftAgent framework, and designs PhaseAware-Net, which performs strongly on multiple benchmarks.

Comments 10 pages, 3 figures, 5 tables

Abstract

Recent advances in generative image editing have improved the realism and controllability of localized image manipulation, raising new challenges for image manipulation detection and localization (IMDL). However, existing IMDL benchmarks still have limitations in visual realism, manipulation diversity, and generator coverage, making it difficult to reflect recent trends in image manipulation. To address these limitations, we introduce Impostor, a high-quality AI-edited image manipulation localization dataset containing 100K manipulated images. Impostor is constructed by CraftAgent, a closed-loop agent framework that integrates scene perception, editing planning, manipulation execution, quality validation, and iterative reflection to automatically generate diverse and visually realistic manipulated images. Moreover, Impostor contains images generated by seven recent AIGC models across three manipulation types and includes multiple manipulated regions, providing a more comprehensive benchmark for AIGC-based IMDL. Furthermore, we propose PhaseAware-Net (PANet), a semantic-forensic framework that introduces local phase modeling and semantic-forensic consistency learning to better localize semantically plausible yet forensically disrupted manipulated regions. Extensive experiments show that Impostor poses significant challenges to existing large vision-language models (LVLMs) and specialized IMDL methods, while PANet achieves superior performance on Impostor and multiple public benchmarks.

2606.04536 2026-06-04 cs.AI

Scaling Self-Evolving Agents via Parametric Memory

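Tao Ren, Weiyao Luo, Hui Yang, Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Bingxue Chou, Jieping Ye, Jiafeng Liang, Yongbin Li, Yijie Peng

Affiliations: Alibaba Group; Peking University

AI summary: Proposes the TMEM framework, in which the agent learns from experience through online updates to LoRA weights, changing its future behavior within a single episode, and outperforms summary-based and retrieval-based methods on multiple benchmarks.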

Abstract

Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can \emph{look up} what they have seen but cannot \emph{learn from} it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce \texttt{TMEM}, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $Δ_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $π_{θ_0+Δ_t}$, while extraction actions produce supervision that updates $Δ_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $θ_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that \texttt{TMEM} consistently outperforms summary-based and retrieval-based baselines across different model scales.
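To make the fast-weight idea concrete, here is a minimal sketch, not TMEM's implementation: a frozen weight matrix `W0` (standing in for $θ_0$) plus a low-rank delta `B A` (standing in for $Δ_t$), updated by one plain gradient step on a squared-error loss. The dimensions, learning rate, loss, and helper names are all illustrative assumptions; the abstract does not say how supervision is distilled or which optimizer is used.

```python
# Hypothetical toy example: frozen base weights plus a rank-1 LoRA delta.
# Only the idea "effective weights = W0 + delta" comes from the abstract.

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def forward(w0, b, a, x):
    """y = (W0 + B A) x, computed without materializing B A."""
    base = matvec(w0, x)
    low = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, low)]

def loss(y, target):
    return 0.5 * sum((p - q) ** 2 for p, q in zip(y, target))

def online_update(w0, b, a, x, target, lr):
    """One SGD step on the LoRA factors only; W0 is never modified."""
    rank = len(a)
    y = forward(w0, b, a, x)
    err = [p - q for p, q in zip(y, target)]            # dL/dy
    ax = matvec(a, x)                                    # A x, length = rank
    # dL/dB = err (A x)^T and dL/dA = (B^T err) x^T
    bt_err = [sum(b[i][k] * err[i] for i in range(len(b))) for k in range(rank)]
    new_b = [[b[i][k] - lr * err[i] * ax[k] for k in range(rank)]
             for i in range(len(b))]
    new_a = [[a[k][j] - lr * bt_err[k] * x[j] for j in range(len(x))]
             for k in range(rank)]
    return new_b, new_a

W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (stand-in for theta_0)
B = [[0.0], [0.0]]              # LoRA "up" factor, zero-initialized
A = [[0.5, 0.5]]                # LoRA "down" factor
x, target = [1.0, 1.0], [2.0, 0.0]

before = loss(forward(W0, B, A, x), target)
B, A = online_update(W0, B, A, x, target, lr=0.5)
after = loss(forward(W0, B, A, x), target)
print(before, after)
```

Because `B` starts at zero, the first step changes only `B`; the base weights stay untouched, so the adaptation can be discarded or reset at the end of an episode.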

2606.04535 2026-06-04 cs.CL cs.AI

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

Boyan Han, Yiwei Wang, Yi Song, Yujun Cai, Chi Zhang

Affiliations: AGI Lab, Westlake University, China; University of California, Merced, USA; Teeni AI, China; The University of Queensland, Australia

AI summary: Proposes Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length, preserving structural correctness and semantic coherence under format constraints and yielding zero-shot gains on GSM8K and MATH.

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract

Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.

2606.04534 2026-06-04 cs.RO

MAD: Mapping-Aware World Models for Agile Quadrotor Flight

Xinhong Zhang, Runqing Wang, Yunfan Ren, Ding Yu, Boyu Zhou, Jian Sun, Fang Deng, Jie Chen, Gang Wang

Affiliations: State Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China; Zhongguancun Academy, Beijing 100094, China; School of Computer Science and Technology, Tongji University; Department of Mechanical and Energy Engineering, Southern University of Science and Technology; Harbin Institute of Technology

AI summary: Proposes MAD, a mapping-aware world model that learns geometry-aware latent dynamics by reconstructing robocentric occupancy and visibility grid maps, achieving higher success rates, faster flight, and better cross-task transfer in visual navigation and racing tasks.

Comments 12 pages, 14 figures

Abstract

Agile quadrotor flight in cluttered scenes requires more than a reactive mapping from a depth image to a control command: the vehicle must remember which regions have been observed, infer nearby occupied space, and act under partial visibility and tight latency. In this paper, we present Mapping-Aware Dreamer (MAD), a geometry-aware world model for vision-based quadrotor flight. Instead of using raw-image reconstruction as the main self-supervised objective, MAD learns recurrent latent dynamics that reconstruct robocentric occupancy and visibility grid maps together with proprioceptive states. This design forces the latent state to encode local geometry, visibility history, and ego-motion in a form that is directly relevant to collision avoidance. MAD is trained in DiffAero using a GPU-parallel map-construction module that provides high-throughput supervision for occupancy and visibility. The learned representation is used in three policy-learning modes: imagination-based MAD-Dreamer and feature-extractor variants based on PPO and SHAC. Across visual navigation and racing tasks, MAD-based agents achieve higher success rates, faster flight, and better cross-task transfer than corresponding vision-only baselines. The model also produces interpretable map predictions and accurate ego-motion estimates from depth observations. We further deploy the learned policy on a physical quadrotor with an Intel RealSense D435i and demonstrate safe indoor and outdoor flight under limited sensing, reaching 9.66 m/s in simulation and 5.05 m/s in real-world forest experiments. These results show that mapping-aware world models provide a practical middle ground between modular aerial navigation and end-to-end learning.

2606.04528 2026-06-04 cs.CV cs.AI

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Fan Zhang, Sijin Zheng, Fei Ma, Qiang Yin, Yongsheng Zhou, Fei Gao, Xian Sun

Affiliations: Beihang University, Beijing 100191, China; Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

AI summary: To address data scarcity and azimuth sensitivity in few-shot class-incremental learning on SAR imagery, uses orthogonal subspaces derived from an optical ATR dataset as geometric priors and jointly induces neural collapse through a projection loss and a classifier loss, yielding compact features and inter-class separability.

Comments 16 pages, 6 figures

Abstract

Few-shot class-incremental learning (FSCIL) in synthetic aperture radar imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and FSCIL sequential updates further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
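For readers unfamiliar with the simplex-ETF geometry mentioned above, the following sketch builds a standard simplex equiangular tight frame for K classes and checks its defining property: unit-norm class vectors whose pairwise cosine is -1/(K-1). This is the textbook construction, not code from the paper; K = 24 mirrors the number of SAR target classes in the abstract, and the function names are hypothetical.

```python
import math

def simplex_etf(k):
    """Return k unit vectors in R^k with pairwise inner product -1/(k-1).

    Row i is sqrt(k/(k-1)) * (e_i - 1/k): the standard basis, centered and rescaled.
    """
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

K = 24
etf = simplex_etf(K)
norms = [math.sqrt(dot(v, v)) for v in etf]
cosines = [dot(etf[i], etf[j]) for i in range(K) for j in range(i + 1, K)]
print(min(norms), max(norms), min(cosines), max(cosines))
```

The vectors here live in R^K for simplicity; a frozen classifier would typically map them into the feature dimension with an orthonormal projection, which preserves the same angles.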

2606.04518 2026-06-04 cs.RO

Cooperative Circumnavigation for Multiple Unmanned Surface Vehicles Without External Localization

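Xueming Liu, Lin Li, Xiang Zhou, Tianjiang Hu, Qingrui Zhang

Affiliations: School of Aeronautics and Astronautics, Sun Yat-sen University (Shenzhen Campus)

AI summary: For multiple unmanned surface vehicles without external localization, proposes a cooperative circumnavigation framework based on heterogeneous perception and coupled oscillators, estimating relative positions with a Maximum Correntropy Kalman Filter and a Pseudo-Linear Kalman Filter to achieve a uniform circular formation of a specified radius.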

Comments 17 pages, 15 figures

Abstract

This paper proposes a cooperative target circumnavigation framework for multiple unmanned surface vehicles (USVs) operating without external localization. The objective is to maintain a uniform circular formation of a specified radius around a target using only limited onboard sensing. The framework adopts a heterogeneous perception strategy that distinguishes between the asymmetric sensing relationships with the target and among the USVs. Specifically, the USVs obtain relative range and displacement measurements through active perception and inter-vehicle communication, while bearing measurements to a non-cooperative target are acquired via passive sensors. To estimate relative positions--both among USVs and between each USV and the target--we employ a Maximum Correntropy Kalman Filter and a Pseudo-Linear Kalman Filter, respectively. A coupled oscillator-based formation controller is designed to ensure system observability while achieving circumnavigation. Theoretical analysis demonstrates that the controller ensures the relative motions between the USVs, as well as that between each USV and the target, satisfy the persistent excitation condition, thereby guaranteeing observability of the Kalman-based filters. The effectiveness of the proposed approach is validated through numerical simulations.
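The "pseudo-linear" idea behind bearing-only estimation can be illustrated without the full filter. A bearing θ measured from a known position p to an unknown target q satisfies sin θ (q_x - p_x) - cos θ (q_y - p_y) = 0, which is linear in q. The sketch below stacks two such equations from two vehicle positions and solves for q. It is a noise-free batch illustration with assumed positions and hypothetical function names, not the paper's Pseudo-Linear Kalman Filter.

```python
import math

def bearing(p, q):
    """Bearing angle from observer p to target q, measured from the x-axis."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def pseudo_linear_row(p, theta):
    """Return (h, z) such that h . q = z holds for the true target q."""
    h = (math.sin(theta), -math.cos(theta))
    z = h[0] * p[0] + h[1] * p[1]
    return h, z

def locate(p1, theta1, p2, theta2):
    """Solve the 2x2 pseudo-linear system with Cramer's rule."""
    (a, b), z1 = pseudo_linear_row(p1, theta1)
    (c, d), z2 = pseudo_linear_row(p2, theta2)
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("bearings are parallel; target is unobservable from these views")
    return ((z1 * d - b * z2) / det, (a * z2 - z1 * c) / det)

target = (3.0, 4.0)                 # hypothetical true target position
p1, p2 = (0.0, 0.0), (6.0, 0.0)     # hypothetical vehicle positions
estimate = locate(p1, bearing(p1, target), p2, bearing(p2, target))
print(estimate)
```

The singular case, where both bearing lines are parallel, is a small-scale view of why relative motion has to remain persistently exciting for the target position to stay observable.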

2606.04516 2026-06-04 cs.LG cs.AI

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

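Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Kai Tang, Zhengqing Zang, Bowen Song, Weiqiang Wang, Gang Chen

Affiliations: Zhejiang University; Ant Group

AI summary: Proposes GeoMin, which models the global feature distribution of labeled data to decode the structural differences between correct and incorrect rollouts, establishing a robust prior for assessing the reliability of self-reward signals; it uses unlabeled data efficiently with few labels and surpasses the fully supervised model with only 10% of the annotations.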

Abstract

Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.

2606.04511 2026-06-04 cs.CL cs.LG

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

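Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

Affiliations: NVIDIA; Thinking Machines Lab; ByteDance Seed; MIT

AI summary: Proposes the SparDA architecture, which introduces a fourth projection, the Forecast, to decouple KV-cache prefetching from attention and reduce sparse-selection overhead, delivering up to 1.25× prefill and 1.7× decode speedups in long-context inference.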

Abstract

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25$\times$ prefill speedup and 1.7$\times$ decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3$\times$ higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.

2606.04507 2026-06-04 cs.CL cs.AI

Self-Evolving Deep Research via Joint Generation and Evaluation

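Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

Affiliations: The Hong Kong University of Science and Technology; ByteDance, China; University College London

AI summary: Proposes the SCORE framework, which jointly optimizes an evaluator and a solver through shared-parameter co-evolutionary training, addressing the unverifiable rewards of deep research report generation and steadily improving generation quality.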

Abstract

Large Language Models (LLMs) are increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a \textbf{s}elf-evolving \textbf{co}-evolutionary training framework for deep \textbf{re}search evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To constrain this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

2606.04505 2026-06-04 cs.AI

Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

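Yuhan Yang, Ruipu Li, Alexander Rodríguez

Affiliations: Computer Science and Engineering, University of Michigan

AI summary: Proposes the MechSim framework, which uses neuro-symbolic reasoning to let LLMs reason about the mechanisms and assumptions of scientific simulators, improving decision transparency and reliability.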

Abstract

Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black-box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework for executable scientific simulators. Unlike prior neuro-symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence-grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high-stakes domains and show that it improves mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability.

2606.04503 2026-06-04 cs.LG cs.AI

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

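Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen

Affiliations: Zhejiang University; Ant Group

AI summary: To address low data efficiency in reinforcement learning with verifiable rewards (RLVR), proposes the PivotTrace framework, which uses attention dynamics to trace metacognitive pivots during reasoning and quantifies uncertainty through pivot density to triage data automatically, surpassing the fully supervised model with only 29.3% of samples annotated and 2.75× faster convergence.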

Abstract

Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely studied from two perspectives: (i) data selection methods identify a small subset of "golden" samples that yield near-full-data performance, but they rely on a pre-existing pool of labeled data; (ii) unsupervised RLVR methods train the model using its own internal supervision signals on large-scale unlabeled data, yet they exhibit suboptimal performance. Accordingly, we investigate the "pick in the dark" setup for RLVR, which aims to select, without prior supervision, unlabeled samples that are most beneficial for training and worthy of annotation. Through systematic analysis, we demonstrate that smart picks hinge on a well-calibrated uncertainty estimator to enable strategic partitioning of data for adaptive training regimes. Building on this insight, we propose PivotTrace, a three-way data triage framework that leverages attention dynamics to trace metacognitive pivots during reasoning. By precisely quantifying uncertainty through pivot density, PivotTrace achieves automated data routing to synergistically maximize both annotation and training efficiency. Empirically, PivotTrace surpasses the fully supervised LRM with only 29.3% annotated samples and 2.75$\times$ faster convergence.

2606.04500 2026-06-04 cs.CL

SANE Schema-aware Natural-language Evaluation of Biological Data

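Rolf Gattung, Martin Krueger, Markus Reischl

Affiliations: Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT)

AI summary: Proposes the SANE paradigm, which uses schema-aware, automatically generated benchmarks to evaluate the reliability of a few-shot large language model on domain-specific text-to-SQL tasks, finding that structured prompting and guardrails enable accurate query generation.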

Comments 5 pages, 3 figures, submitted but not yet reviewed by BMT2026

Abstract

High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a natural-language alternative, yet their tendency to hallucinate raises concerns about result reliability. We present SANE (Schema-Aware Natural-language Evaluation), a novel paradigm for domain-specific text-to-SQL evaluation: schema-grounded, automatically generated benchmarks tied to real and specific experimental structure. SANE makes evaluation more scalable, systematic, and reproducible. Using SANE, we evaluate a few-shot large language model and show that, under constrained schemas with structured prompting and guardrails, accurate query generation is achievable without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and manifest as overly cautious clarification requests or answers to queries that should first be disambiguated, rather than incorrect SQL generation. These results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.
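As a rough illustration of what a schema-grounded, automatically generated text-to-SQL benchmark can look like, the sketch below derives question/SQL pairs from a table schema and scores a candidate query by execution match against an in-memory SQLite database. The table, column names, question templates, and scoring rule are all hypothetical; the abstract does not describe SANE's actual generator or metric.

```python
import sqlite3

# Hypothetical schema: one table of per-well measurements from a screening plate.
SCHEMA = {"wells": ["compound", "dose_um", "cell_count"]}

def generate_benchmark(schema):
    """Derive (question, gold SQL) pairs from table and column names alone."""
    items = []
    for table, columns in schema.items():
        items.append((f"How many rows are in {table}?",
                      f"SELECT COUNT(*) FROM {table}"))
        for col in columns:
            items.append((f"How many distinct values of {col} are in {table}?",
                          f"SELECT COUNT(DISTINCT {col}) FROM {table}"))
    return items

def execution_match(conn, gold_sql, candidate_sql):
    """True if the candidate runs and returns the same rows as the gold query."""
    try:
        got = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False
    return sorted(got) == sorted(conn.execute(gold_sql).fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wells (compound TEXT, dose_um REAL, cell_count INTEGER)")
conn.executemany("INSERT INTO wells VALUES (?, ?, ?)",
                 [("DMSO", 0.0, 950), ("drugA", 1.0, 700), ("drugA", 10.0, 320)])

benchmark = generate_benchmark(SCHEMA)
question, gold = benchmark[1]   # the distinct-compound question
ok = execution_match(
    conn, gold, "SELECT COUNT(*) FROM (SELECT DISTINCT compound FROM wells)")
bad = execution_match(conn, gold, "SELECT COUNT(*) FROM wells")
print(question, ok, bad)
```

Comparing executed results instead of SQL strings lets differently written but equivalent queries count as correct, which matters when the candidate comes from a language model.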

2606.04494 2026-06-04 cs.AI

Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

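Zhangtianyi Chen, Florensia Widjaja, Wufei Dai, Xiangjun Zhang, Yuhao Shen, Juexiao Zhou

Affiliations: The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the BioManus system, which compiles heterogeneous bioinformatics tools into standardized MCP servers and builds a typed heterogeneous graph for graph-structured planning, addressing tool confusion and context inefficiency and improving execution accuracy and workflow validity on BioAgentBench and LAB-Bench.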

Abstract

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of $\Theta(N / (h \cdot \bar{m}))$ under high-recall retrieval, where $N$ is the total tool count, $h$ is the workflow horizon, and $\bar{m} \ll N$ is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than ever-larger prompt-level tool retrieval.
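A quick worked example of the stated compression ratio may help. Assume a flat prompt lists all N tools, while a scaffolded plan needs only about h times m_bar candidate tools (m_bar being the average number of candidates per operation, over h workflow steps). The ratio is then N / (h * m_bar), up to the constant factors that Theta hides. The function name and the numbers below are invented for illustration and do not come from the paper.

```python
def context_compression_ratio(n_tools, horizon, avg_candidates):
    """Ratio of flat tool-listing size to scaffolded size, ignoring constants.

    n_tools:        N, total tools in the ecosystem
    horizon:        h, number of operations in the workflow
    avg_candidates: m_bar, average candidate tools per operation
    """
    if horizon <= 0 or avg_candidates <= 0:
        raise ValueError("horizon and avg_candidates must be positive")
    return n_tools / (horizon * avg_candidates)

# Hypothetical numbers: 1,000 tools, a 5-step workflow, 4 candidates per step.
ratio = context_compression_ratio(1000, 5, 4)   # 1000 / 20
print(ratio)
```

Under these assumed numbers the planner would see roughly one fiftieth of the tool descriptions a flat listing would require, and the saving grows linearly with N as long as h and m_bar stay fixed.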

2606.04492 2026-06-04 cs.LG cs.GT

Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning

面向合作多智能体强化学习的情景记忆时间一致性

Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu

发表机构 * Xi’an Jiaotong University(西安交通大学) Queen Mary University of London(伦敦玛丽女王大学)

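AI Summary: To address reward sparsity and exploration bottlenecks in cooperative multi-agent reinforcement learning, the paper proposes the Episodic Memory Temporal Consistency (EMTC) framework, which uses a temporally consistent semantic embedder and a gating mechanism to prevent representation collapse and filter out pseudo-successful trajectories. It provides a theoretical error bound and outperforms existing methods on the SMAC and GRF benchmarks.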

Comments Under Review

AI中文摘要

合作多智能体强化学习(MARL)经常遭受严重的奖励稀疏性和探索瓶颈。虽然情景记忆机制通过重用高回报轨迹缓解了这些问题,但由于无约束的激励分布和语义表示崩溃,它们常常使智能体陷入局部最优。为了解决这个问题,我们提出了 Episodic Memory Temporal Consistency (EMTC),一个能够稳健构建并选择性利用历史经验的框架。EMTC 引入了两个协同组件:(1) 时间一致性语义嵌入器,它将对比学习与时间条件状态重建相结合,防止表示崩溃并实现精确的记忆检索;(2) 时间一致性门控机制,它根据时间一致性误差动态调节情景激励。这个自适应门从伪成功轨迹中过滤误导信号,有效缓解 Q 值高估。我们提供了理论保证,建立了严格误差界,将可观测的时间一致性误差直接与底层轨迹最优性和表示质量联系起来。在 SMAC 和 GRF 基准上的广泛评估表明,EMTC 持续优于最先进的基线。值得注意的是,与最强的情景记忆基线相比,EMTC 在超难 SMAC 场景中实现了高达 24% 的绝对胜率提升,在 GRF 任务上平均提升 28%。

英文摘要

Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often trap agents in local optima due to unconstrained incentive distribution and semantic representation collapse. To address this, we propose Episodic Memory Temporal Consistency (EMTC), a framework that robustly constructs and selectively leverages historical experiences. EMTC introduces two synergistic components: (1) a Temporally Consistent Semantic Embedder that integrates contrastive learning with time-conditioned state reconstruction, preventing representation collapse and enabling precise memory retrieval; and (2) a Temporal Consistency Gating Mechanism that dynamically modulates episodic incentives based on temporal consistency error. This adaptive gate filters misleading signals from pseudo-successful trajectories, effectively mitigating Q-value overestimation. We provide theoretical guarantees, establishing a strict error bound that directly links the observable temporal consistency error to the underlying trajectory optimality and representation quality. Extensive evaluations on the SMAC and GRF benchmarks demonstrate that EMTC consistently outperforms state-of-the-art baselines. Notably, compared to the strongest episodic baseline, EMTC achieves absolute win-rate improvements of up to 24% in super-hard SMAC scenarios and an average improvement of 28% across GRF tasks.
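
The gating idea can be illustrated with a small Python sketch in which an episodic bonus is scaled down as the temporal consistency error grows. The exponential gate, the `temperature` and `beta` parameters, and both function names are assumptions chosen for illustration; the abstract does not specify the functional form that EMTC uses.

```python
import math


def consistency_gate(tc_error: float, temperature: float = 1.0) -> float:
    """Map a non-negative temporal-consistency error to a gate value in (0, 1]."""
    return math.exp(-tc_error / temperature)


def gated_reward(env_reward: float, episodic_bonus: float, tc_error: float, beta: float = 0.5) -> float:
    """Environment reward plus an episodic bonus scaled by the consistency gate."""
    return env_reward + beta * consistency_gate(tc_error) * episodic_bonus


# A consistent trajectory keeps its bonus; an inconsistent one has it suppressed.
print(gated_reward(0.0, 1.0, tc_error=0.0))
print(gated_reward(0.0, 1.0, tc_error=5.0))
```

Under this assumed form, a pseudo-successful trajectory with a large consistency error contributes almost no episodic incentive, which is the filtering behavior the abstract describes.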

2606.04484 2026-06-04 cs.AI cs.LG cs.MA

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

AgentJet:一种用于智能体强化学习的灵活群体训练框架

Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding

发表机构 * Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)

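AI Summary: Proposes AgentJet, a decoupled multi-node swarm training framework that supports heterogeneous multi-model reinforcement learning, multi-task cocktail training, fault-tolerant execution, and live code iteration, and achieves a 1.5-10$\times$ training speedup through its context tracking module.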

Comments Technical report, 27 pages

AI中文摘要

我们提出了AgentJet,一个用于大型语言模型(LLM)智能体强化学习的分布式群体训练框架。与将智能体运行与模型优化紧密耦合的集中式框架不同,AgentJet采用解耦的多节点架构,其中群体服务器节点托管可训练模型并在GPU集群上运行优化,而群体客户端节点在任意设备上执行任意智能体。这种设计提供了集中式框架难以支持的能力:(1)异构多模型强化学习,支持训练具有多个LLM作为大脑的异构多智能体团队;(2)具有隔离智能体运行时的多任务鸡尾酒训练;(3)容错执行,防止外部环境故障中断训练过程;(4)实时代码迭代,允许通过替换群体客户端节点在训练期间编辑智能体。为了支持多模型、多轮和多智能体设置中的高效强化学习,AgentJet引入了一个带有时间线合并的上下文跟踪模块,该模块合并冗余上下文并实现1.5-10倍的训练加速。最后,AgentJet引入了一个自动化研究系统,该系统以研究主题为输入,并在大规模集群上自主进行长期、多天的强化学习研究。通过利用群体架构,该系统在无需人工干预的情况下复现了强化学习研究人员的关键探索工作流程。

英文摘要

We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10$\times$ training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

2606.04483 2026-06-04 cs.CL

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

分布外声音:同人小说子类型作为对齐LLM的通用白话越狱

Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang

发表机构 * School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)科学与工程学院) The Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen)(深圳未来网络智能研究院) The Guangdong Provincial Key Laboratory of Future Networks of Intelligence(广东省未来网络智能重点实验室) School of Microelectronics, Xi’an Jiaotong University(西安交通大学微电子学院)

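AI Summary: The paper argues that the real failure mode of aligned LLMs is a register of natural human writing that safety training under-covers, and proposes the first jailbreak method that uses real fanfiction subgenres as universal attack carriers, substantially raising attack success rates.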

Comments 23 pages

AI中文摘要

现有的针对对齐LLM的越狱方法是离散的产物,其表面形式容易被指纹识别和修补。我们认为真正的失败模式不是任何特定的提示,而是安全训练覆盖不足的整个自然人类写作语域。基于这一见解,我们引入了第一个使用真实同人小说子类型作为通用攻击载体的越狱家族:一种创意写作元条件基于来自十二个Archive of Our Own (AO3)子类型之一的段落,有害行为被嵌入为结果场景的高潮。该构造不需要攻击者LLM,也不需要针对每个目标进行适应。在HarmBench和JailbreakBench的并集上对八个对齐LLM,该攻击在四评委集成下将平均ASR从0.278提升到0.731;因子分解显示增益由语域而非长度或结构带来。两种主动防御扩大了而非缩小了白话与基线的比率,表明针对模板的防御仅仅将攻击者引向像我们这样的基于语域的攻击。我们还提出了SAGA-A4,一种静态的四轮扩展,实现了平均ASR 0.924,大大超过了现有的三种多轮方法。

英文摘要

Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.

2606.04480 2026-06-04 cs.CV cs.HC

IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation

IMPose: 基于动态校正传播的交互式多人姿态估计

Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li

发表机构 * Tianjin University(天津大学) Zhongguancun Academy(中关村学院) Tiandi(天迪)

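AI Summary: Proposes IMPose, an interactive tool whose dual-level tracking mechanism (keypoint level and instance level) propagates sparse multi-person pose corrections across an entire video, substantially reducing manual annotation effort.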

AI中文摘要

高质量动态人体姿态标注为人工智能提供精确的运动学信息,使其能够掌握人类行为,但仍然劳动密集且耗时。当前的标注工具要么缺乏时间校正传播,要么在多人场景中失败,需要过多的人工干预。在本文中,我们介绍了IMPose,一种用于多人动态姿态标注的交互式工具。它具有双级跟踪机制,可将标注者的一帧多人姿态校正传播到整个视频。关键点级通过顺序建模确保校正的时间传播,而实例级采用关键点感知嵌入和相对位置编码来维持多人跨帧一致性。为了进一步提高鲁棒性,IMPose在轨迹库中维护历史姿态和实例线索,增强了长程时间关联,并在遮挡和运动模糊等挑战性情况下稳定标注。通过将稀疏的人工校正转换为密集且连贯的姿态轨迹,我们的框架显著减少了跨帧的重复手动细化。大量实验表明,IMPose在不同交互预算下始终实现强精度-效率权衡,在低点击标注设置中表现出特别优势。IMPose实现了高精度和高效率的标注,在3DPW上每1050帧视频仅需27次点击,在PoseTrack21上每个轨迹段每84帧仅需3次点击。我们进一步扩展了PoseTrack21,以10名标注员10小时的最小成本添加了188K个姿态实例(355万个关键点)。标注工具、代码和扩展数据集将开源。

英文摘要

High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates one-frame multi-person pose corrections from annotators across entire videos. The keypoint level ensures temporal propagation of corrections via sequential modeling, while the instance level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy-efficiency trade-off under different interaction budgets, demonstrating particular advantages in low-click annotation settings. IMPose achieves high-precision annotation with high efficiency, requiring only 27 clicks per 1,050-frame video on 3DPW and 3 clicks per tracklet per 84 frames on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, code, and extended dataset will be open-sourced.

2606.04479 2026-06-04 cs.CV cs.AI cs.CL

Evaluating Reasoning Fidelity in Visual Text Generation

评估视觉文本生成中的推理保真度

Jiajun Hong, Jiawei Zhou

发表机构 * Stony Brook University(石溪大学)

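AI Summary: Using long text rendering, factual knowledge probing, context understanding, and multi-step reasoning tasks, the paper evaluates whether current text-to-image models faithfully preserve reasoning ability in visual text generation, and finds that they often produce semantic errors and logical inconsistencies, leaving a substantial gap to text-only models.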

Comments Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

AI中文摘要

最近的文本到图像(T2I)模型能够在图像中渲染高度清晰且结构良好的文本,从而支持文档生成和幻灯片生成等应用。然而,当复杂解决方案必须直接通过渲染文本表达时,这些系统是否忠实地保留了推理能力,还是仅仅模仿表面模式,目前尚不清楚。我们通过评估视觉文本生成中的推理保真度来研究这一问题,其中模型必须将完整的推理过程表达为图像。我们的评估包括长文本渲染、事实知识探测、上下文理解和多步推理。在这些设置中,我们发现当前的T2I模型经常产生语义错误、逻辑不一致和错误的中间步骤,即使渲染的文本在视觉上清晰。这些失败与纯文本模型在相同任务上的强推理表现形成对比。我们的发现揭示了视觉文本生成与程序性推理之间的显著差距,促使更可靠的视觉文本推理。

英文摘要

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2606.04477 2026-06-04 cs.RO

TransTac: Visuo-Tactile Modality Transition via Ultraviolet-Encoded Transparent Elastomers

TransTac: 通过紫外编码透明弹性体实现视觉-触觉模态转换

Lingyue Yang, Bin Fang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

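AI Summary: Proposes TransTac, a transparent UV-encoded binocular vision-based tactile sensor that combines visual observation with marker-based tactile reconstruction. A prior-guided Delaunay stereo matching algorithm provides robust sparse triangulation, and the sensor reaches up to 83.3% zero-shot recognition accuracy on tactile images while substantially strengthening cross-modal alignment.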

Comments Accepted at IEEE International Conference on Robotics and Automation (ICRA) 2026. 8 pages, 7 figures

AI中文摘要

基于视觉的触觉传感器(VBTS)能够恢复高分辨率接触几何形状,但通常依赖于不透明的弹性体层,这阻碍了视觉透明性;而RGB-D相机提供全局深度感知,但在近距离时性能显著下降。为解决这一局限,我们提出了TransTac,一种透明的紫外(UV)编码双目VBTS,它将视觉观察和基于标记的触觉重建集成在一个紧凑设备中。该系统采用嵌入UV反射标记的透明弹性体,以及一种先验引导的Delaunay立体匹配算法,用于鲁棒的稀疏三角化。为了可靠地检测密集分布的半透明标记,我们开发了一种轻量级检测器,能够在接触和变形下实现稳定定位。所提出的先验引导的Delaunay匹配相比全局分配基线,将对应鲁棒性提高了约21%,同时保持高重建精度。在语义评估中,TransTac在触觉图像上实现了高达83.3%的零样本识别准确率,超过不透明触觉基线约50个百分点。嵌入分析进一步揭示了与自然图像的跨模态对齐显著增强,类中心相似度从约0.2提升至超过0.77。受控的近距实验量化了RGB-D深度可靠性的下降,并展示了通过视觉-触觉集成实现的扩展几何覆盖。最后,实现了一个紧凑原型,硬件成本约为70美元。

英文摘要

Vision-based tactile sensors (VBTS) recover high-resolution contact geometry but typically rely on opaque elastomer layers that prevent visual transparency, while RGB-D cameras provide global depth perception yet degrade significantly at close range. To address this limitation, we present TransTac, a transparent ultraviolet (UV)-encoded binocular VBTS that integrates visual observation and marker-based tactile reconstruction within a single compact device. The system employs a transparent elastomer embedded with UV-reflective markers and a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation. To reliably detect densely distributed semitransparent markers, we develop a lightweight detector that enables stable localization under contact and deformation. The proposed prior-guided Delaunay matching improves correspondence robustness by approximately 21% compared with global assignment baselines while maintaining high reconstruction accuracy. In semantic evaluation, TransTac achieves up to 83.3% zero-shot recognition accuracy on tactile images, exceeding opaque tactile baselines by approximately 50 percentage points. Embedding analysis further reveals substantially stronger cross-modal alignment with natural images, with class-center similarity increasing from around 0.2 to over 0.77. Controlled near-distance experiments quantify the degradation of RGB-D depth reliability and demonstrate extended geometric coverage enabled by visuo-tactile integration. Finally, a compact prototype is implemented with an approximate hardware cost of $70.

2606.04476 2026-06-04 cs.LG math.OC math.ST stat.ML stat.TH

When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks

当两层都学习:通过ReLU网络表示线性模型的训练动力学

Berk Tinaz, Changzhi Xie, Mahdi Soltanolkotabi

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Department of Computer Science(计算机科学系)

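AI Summary: Studies the gradient descent dynamics of jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function, and uses a three-phase analysis to prove that, from random initialization, gradient descent converges to a global minimizer at a linear rate with order-wise optimal sample complexity.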

Comments 47 pages, 8 figures, published at the 39th Annual Conference on Learning Theory (COLT), 2026

AI中文摘要

在本文中,我们研究了联合训练单隐层ReLU网络的两层以拟合线性目标函数的梯度下降动力学。具体来说,我们考虑一个可实现设置,其中输入从高斯分布中独立同分布采样,标签遵循一个植入的线性模型。这种风格化的框架捕捉了逆问题和某些自编码器模型中端到端训练的关键特征。尽管其表面简单,但动力学仍然难以理解,部分原因是损失景观包含多个非严格鞍点,这使得不清楚为什么从随机初始化开始的梯度下降能够可靠地逃离坏的驻点区域。我们提供了优化景观的详细刻画,并证明从适度小的随机初始化开始——同时训练两层——梯度下降以线性速率收敛到全局最小化器,并具有阶次最优的样本复杂度。我们的分析通过三个阶段追踪轨迹:对齐阶段,其中隐藏权重逐渐与植入方向对齐,而输出权重保持正确的符号模式;增长阶段,其中两层的范数增加同时保持对齐;以及局部细化阶段,其中对齐的神经元快速收敛到植入方向,产生快速的局部收敛。为了严格证明梯度下降避免非严格鞍点,我们为端到端动力学开发了轨迹级控制论证。此外,我们建立了沿整个轨迹成立的新颖的均匀集中结果,这对于获得阶次最优的样本复杂度至关重要。我们通过一系列配置的大量实验验证了我们的理论。

英文摘要

In this paper, we study the gradient descent dynamics for jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function. Concretely, we consider a realizable setting where inputs are drawn i.i.d. from a Gaussian distribution and labels follow a planted linear model. This stylized framework captures salient features of end-to-end training in inverse problems and certain auto-encoder models. Despite its apparent simplicity, the dynamics remain poorly understood, in part because the loss landscape contains multiple non-strict saddle points, making it unclear why gradient descent from random initialization reliably escapes bad stationary regions. We provide a detailed characterization of the optimization landscape and prove that gradient descent from a moderately small random initialization, with both layers trained simultaneously, converges to a global minimizer at a linear rate with order-wise optimal sample complexity. Our analysis tracks the trajectory through three phases: an alignment phase in which hidden weights progressively align with the planted direction while the output weights maintain the correct sign pattern; a growth phase in which the norms of both layers increase while preserving alignment; and a local refinement phase in which the aligned neurons rapidly converge to the planted direction, yielding fast local convergence. To rigorously show that gradient descent avoids non-strict saddles, we develop trajectory-level control arguments for the end-to-end dynamics. In addition, we establish novel uniform concentration results that hold along the entire trajectory and are essential for obtaining order-wise optimal sample complexity. We corroborate our theory with extensive experiments across a range of configurations.
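
One reason the setting is realizable is the identity $t = \mathrm{ReLU}(t) - \mathrm{ReLU}(-t)$: two hidden neurons with opposite weight vectors and output weights $+1$ and $-1$ reproduce a linear function exactly. The plain-Python sketch below checks this on Gaussian inputs. The planted vector and the helper names are illustrative assumptions, and the global minimizers analyzed in the paper need not take this exact two-neuron form.

```python
import random


def relu(t: float) -> float:
    return t if t > 0.0 else 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_neuron_relu_net(x, w):
    """One-hidden-layer ReLU net with hidden weights (w, -w) and output weights (+1, -1)."""
    hidden = [relu(dot(w, x)), relu(dot([-wi for wi in w], x))]
    output_weights = [1.0, -1.0]
    return dot(output_weights, hidden)


random.seed(0)
w_planted = [0.5, -1.25, 2.0]  # hypothetical planted direction
for _ in range(5):
    x = [random.gauss(0.0, 1.0) for _ in w_planted]  # i.i.d. Gaussian inputs
    assert abs(two_neuron_relu_net(x, w_planted) - dot(w_planted, x)) < 1e-12
```

The sign pattern of the output weights matters here: flipping either sign breaks the identity, which is consistent with the abstract's emphasis on output weights keeping the correct sign pattern during the alignment phase.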

2606.04475 2026-06-04 cs.SD cs.MM math.SP

A Second-Order Cepstral Signature of Contact-Vibration Sounds Reproduced by Laptop Loudspeakers: A Synthetic Case Study

笔记本电脑扬声器再现的接触振动声音的二阶倒谱特征:一个合成案例研究

Jim Salsman

发表机构 * TalkNicer, Inc.(TalkNicer公司)

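AI Summary: Using a synthetic signal-chain analysis, the paper proposes that contact-vibration sounds reproduced by laptop loudspeakers exhibit first- and second-order cepstral periodic structure, with second-order cepstral bimodality most evident at the mechanical source and at loudspeaker playback.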

Comments 11 pages, 4 tables, 5 figures, 8 references

AI中文摘要

手机在硬表面上振动时,通过笔记本电脑扬声器再现的声音通常在质量上不同于普通的视听录音。我们提出这种感知独特性的部分原因可以描述为嵌套周期性:一阶倒谱结构反映振动周期及其倍数,二阶倒谱结构反映一阶倒谱内的重复间隔。将感知效应视为真实的,并使用刻意透明的合成信号链,我们建模了六个阶段:机械生成、表面和空气传播、麦克风捕获、编码和解码、笔记本电脑扬声器播放以及重新录制或后处理。合成分析表明,一阶倒谱周期性在整个链中得以保留,而更干净的双峰或准双峰二阶倒谱特征在机械源和笔记本电脑扬声器播放时最为明显。该结果支持但未证明以下假设:笔记本电脑再现可以重新强调潜在的接触振动周期性,而这种周期性在中间记录和编码形式中表达得不够清晰。我们将二阶倒谱双峰性视为接触振动播放的探索性描述符,而非完整的感知度量。所需的验证包括真实设备的录音、受控的播放传递函数、感知判断以及与普通语音、音乐和环境录音的比较。

英文摘要

A mobile phone vibrating on a hard surface often sounds qualitatively unlike ordinary audiovisual recordings when reproduced through laptop loudspeakers. We propose that part of this perceptual distinctiveness can be described as a nested periodicity: a first-order cepstral structure reflecting the vibration period and its multiples, and a second-order cepstral structure reflecting repeated spacing within the first-order cepstrum. Treating the perceptual effect as real and using a deliberately transparent synthetic signal chain, we model six stages: mechanical generation, surface and air propagation, microphone capture, encoding and decoding, laptop-speaker playback, and re-recording or post-processing. The synthetic analysis shows that the first-order cepstral periodicity is preserved across the chain, whereas a cleaner bimodal or quasi-bimodal second-order cepstral signature is most evident at the mechanical source and at laptop-speaker playback. The result supports, but does not prove, the hypothesis that laptop reproduction can re-emphasize a latent contact-vibration periodicity that is less cleanly expressed in intermediate recorded and encoded forms. We frame second-order cepstral bimodality as an exploratory descriptor of contact-vibration playback rather than as a completed perceptual metric. Required validation includes recordings of real devices, controlled playback transfer functions, perceptual judgments, and comparisons against ordinary speech, music, and environmental recordings.
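
As background for the first-order step, the sketch below computes a real cepstrum (the inverse DFT of the log-magnitude spectrum) with the standard library only and locates the quefrency peak of a toy signal. The toy signal, the naive DFT, and the helper names are assumptions for illustration. The abstract does not say how the second-order cepstrum is computed, so that step is deliberately left out.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for short toy signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(x, eps=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    n = len(x)
    log_mag = [math.log(abs(v) + eps) for v in dft(x)]
    # For a real input the log-magnitude is even, so the inverse DFT is real.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
            for q in range(n)]


def peak_quefrency(c):
    """Index of the largest cepstral value, skipping quefrency 0 and the mirrored half."""
    return max(range(1, len(c) // 2), key=lambda q: c[q])


# Toy signal: a unit impulse plus a half-amplitude repeat 8 samples later.
n, period = 64, 8
signal = [0.0] * n
signal[0], signal[period] = 1.0, 0.5

first_order = real_cepstrum(signal)
print(peak_quefrency(first_order))  # 8
```

The cepstrum of this signal also has smaller components at 16, 24, and further multiples of the repeat interval, which is the kind of regular spacing within the first-order cepstrum that the abstract's second-order descriptor is meant to capture.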

2606.04473 2026-06-04 cs.LG cs.AI

ChessMimic: Per-Rating Transformer Models for Human Move, Clock, and Outcome Prediction in Online Blitz Chess

ChessMimic: 用于在线闪电棋中人类走棋、时钟和结果预测的按等级划分的Transformer模型

Thomas Johnson

发表机构 * nascent.xyz(nascent实验室)

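AI Summary: Proposes ChessMimic, a system of three small encoder-only transformer models for move, thinking-time, and outcome prediction, trained separately per Elo rating band for sharper skill calibration. On Lichess blitz data its move prediction accuracy exceeds that of Maia-2, its outcome model reaches an AUC of 0.78, and its clock model provides a usable but non-SOTA thinking-time signal.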

AI中文摘要

我们提出了ChessMimic,一个由三个小型编码器Transformer组成的系统——分别用于走棋、思考时间和结果预测——以局面、最近走棋历史、玩家等级和时钟状态为条件。我们为每100 Elo等级区间拟合每个模型的独立实例,以参数效率换取更精细的技能校准。在Lichess Rated Blitz游戏的一个月保留切片上,ChessMimic的人类走棋预测准确率在每个Elo区间都优于Maia-2。与Maia-3相比,我们的9M参数模型的准确率介于Maia-3-5M和Maia-3-23M之间,且没有几何注意力偏置的额外复杂性。除了走棋匹配模型,我们还训练了一个游戏结果模型,该模型不仅以局面为条件,还以玩家等级、时间控制和剩余时钟时间为条件。结果模型在样本外达到了0.78的AUC,击败了Maia-2以及基于子力、等级和时钟时间的逻辑回归。最后,我们训练了一个时钟模型来预测人类思考时间。该时钟模型在ALLIE风格过滤器下提供了可用但非最优的每步思考时间信号(Pearson r = 0.41,Spearman rho = 0.50,MAE 4.10秒,而ALLIE报告的r = 0.70),残差差距集中在每位置桶的锐度上,而非桶边际校准。公开演示在1e4.ai,我们在GitHub上发布了代码、每个区间的权重以及C++数据过滤管道代码。

英文摘要

We present ChessMimic, a system of three small encoder-only transformers (for move, thinking-time, and outcome prediction) conditioned on the position, recent move history, player rating, and clock state. We fit a separate instance of each model per 100-Elo rating band, trading parameter efficiency for sharper per-skill calibration. On a held-out month-wide slice of Lichess Rated Blitz games, ChessMimic's human move prediction accuracy outperforms Maia-2 in every Elo band. Compared to Maia-3, our 9M-parameter model's accuracy sits between Maia-3-5M and Maia-3-23M without the additional complexity of Geometric Attention Bias. In addition to the move matching model, we also train a game outcome model that conditions not only on the position, but also on player ratings, time control, and remaining clock times. The outcome model achieves an AUC of 0.78 out of sample, beating Maia-2 as well as logistic regressions based on material, ratings, and clock time. Finally, we train a clock model that predicts human thinking times. The clock model provides a usable but non-SOTA per-ply think-time signal under ALLIE-style filters (Pearson r = 0.41, Spearman rho = 0.50, MAE 4.10 s, against ALLIE's reported r = 0.70), with the residual gap concentrated in per-position bucket sharpness rather than bucket-marginal calibration. A public demo is at 1e4.ai, and we release code, per-band weights, and the C++ data-filter pipeline code on GitHub.
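
Fitting one model per 100-Elo band implies a routing step at inference time that maps a player's rating to the matching model instance. A minimal sketch of such routing follows. The band alignment at multiples of 100, the half-open interval convention, and the checkpoint naming scheme are all assumptions, not details from the released code.

```python
def rating_band(elo: int, width: int = 100) -> tuple[int, int]:
    """Return the half-open [low, high) rating band that contains `elo`."""
    low = (elo // width) * width
    return low, low + width


def checkpoint_name(task: str, elo: int) -> str:
    """Hypothetical naming scheme: one checkpoint per task and per band."""
    low, high = rating_band(elo)
    return f"{task}_{low}_{high}"


print(checkpoint_name("move", 1537))  # move_1500_1600
```

The trade-off named in the abstract is visible here: every band multiplies the number of stored checkpoints, in exchange for each model only having to fit players of similar strength.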

2606.04469 2026-06-04 cs.CV cs.AI

Adaptive Calibration for Fair and Performant Facial Recognition

自适应校准:实现公平且高性能的面部识别

Ryan Brown, Chris Russell

发表机构 * University of Oxford(牛津大学)

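AI Summary: Proposes Adaptive Calibration (AC), which maps cosine similarity between normalized embeddings to calibrated probabilities and incorporates local context to correct for regional differences, improving both overall performance and fairness in facial recognition without demographic metadata.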

AI中文摘要

我们引入自适应校准(AC),一种新颖的面部识别校准策略,将归一化嵌入之间的余弦相似度映射为良好校准的概率。通过将局部上下文纳入校准,自适应校准纠正了余弦相似度中的一个基本不匹配问题,即相同的距离在不同嵌入区域可能对应不同的匹配概率。我们的方法在无需人口统计元数据的情况下,既提高了整体性能,又实现了更公平的校准。在各种预训练模型和标准基准上,我们的方法在准确性和公平性指标上始终优于现有方法。AC为公平的面部识别提供了实用的解决方案,无需人口统计组注释,同时提高了整体性能。与现有方法不同,我们的方法提供了连续的、区域特定的校准,避免了“向下拉平”现象,即公平性以牺牲某些群体的性能为代价。

英文摘要

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach both improves overall performance and yields fairer calibration without requiring demographic metadata. It consistently dominates existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition that requires no demographic group annotations while also improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness comes at the cost of degraded performance for some groups.
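
The mismatch described above can be illustrated with a toy Python sketch in which the same cosine similarity is mapped to different match probabilities depending on a region-specific threshold. The logistic (Platt-style) map, the `slope` value, and the two thresholds are assumptions chosen for illustration and should not be read as the calibration function used by AC.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_probability(similarity: float, slope: float, threshold: float) -> float:
    """Logistic map from a similarity score to a match probability."""
    return 1.0 / (1.0 + math.exp(-slope * (similarity - threshold)))


sim = cosine_similarity([1.0, 0.0], [0.8, 0.6])  # roughly 0.8

# Hypothetical thresholds: in a densely populated embedding region, a higher
# similarity is needed for the same confidence than in a sparse region.
p_sparse = match_probability(sim, slope=20.0, threshold=0.6)
p_dense = match_probability(sim, slope=20.0, threshold=0.8)
print(round(p_sparse, 3), round(p_dense, 3))
```

A single global threshold would assign both pairs the same probability; letting the threshold vary continuously with location in embedding space is one simple way to picture region-specific calibration.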

2606.04468 2026-06-04 cs.LG cs.AI cs.NE math.OC

ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion

ParetoPilot:通过推断-扰动-引导扩散实现零代理离线多目标优化

Ruiqing Sun, Sen Yang, Dawei Feng, Bo Ding, Yijie Wang, Huaimin Wang

发表机构 * Nanyang Technological University(南洋理工大学)

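AI Summary: Proposes ParetoPilot, a zero-surrogate diffusion framework that needs no external surrogate model. Its Infer-Perturb-Guide engine, interleaved with the unconditional denoising steps, implicitly infers the objective direction and orthogonalizes a parallel gravity field and an edgeness-aware repulsive force to reach Pareto-optimal designs in offline multi-objective optimization.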

AI中文摘要

离线多目标优化旨在基于静态数据集发现新颖的帕累托最优设计,而无需昂贵的环境交互。尽管最近的生成方法取得了显著成功,但它们主要依赖外部代理模型。这种依赖引入了显著的计算开销,遭受欺骗性评估,并偏离了联合训练主流生成模型与条件的流行范式。为了解决这些瓶颈,我们提出了ParetoPilot,一种用于离线多目标优化的新颖零代理扩散框架。ParetoPilot充分利用预训练扩散模型中固有的条件先验。其核心是引入了推断-扰动-引导引擎,该引擎无缝地插入在反向生成过程的无条件去噪步骤中。首先,通过匹配条件噪声预测和无条件噪声预测,隐式推断瞬时目标方向。其次,数学上正交化一个用于严格收敛的平行引力场和一个用于相互多样性的边缘感知排斥力,从而生成一个动态退火的扰动向量。最后,这个扰动目标通过标准的无分类器引导无缝地引导生成过程。在51个任务上的大量实验表明,ParetoPilot优于14个最先进的基于代理和逆生成基线。通过消除辅助代理训练,我们的方法在实现超体积改进和鲁棒帕累托前沿覆盖的同时,保护了数据隐私。

英文摘要

Offline multi-objective optimization (Offline MOO) aims to discover novel Pareto-optimal designs based on static datasets without expensive environment interactions. While recent generative methods have achieved notable success, they predominantly rely on external surrogate models. This dependency introduces significant computational overhead, suffers from deceptive evaluations, and deviates from the prevailing paradigm of jointly training mainstream generative models with conditions. To address these bottlenecks, we propose ParetoPilot, a novel zero-surrogate diffusion framework for offline MOO. ParetoPilot fully leverages the conditional priors inherently embedded within pre-trained diffusion models. At its core, the framework introduces the Infer-Perturb-Guide (IPG) engine, which is seamlessly interleaved within the unconditional denoising steps of the reverse generation process. First, it implicitly infers the instantaneous objective direction by matching conditional and unconditional noise predictions. Next, it mathematically orthogonalizes a parallel gravity field for strict convergence and an edgeness-aware repulsive force for mutual diversity, creating a dynamically annealed perturbation vector. Finally, this perturbed target seamlessly steers the generation process via standard Classifier-Free Guidance (CFG). Extensive experiments across 51 tasks demonstrate that ParetoPilot outperforms 14 state-of-the-art surrogate-based and inverse generative baselines. By eliminating auxiliary proxy training, our approach preserves data privacy while achieving hypervolume improvement and robust Pareto front coverage.
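
For reference, standard classifier-free guidance combines the conditional and unconditional noise predictions of a single diffusion model as shown below. The symbols (noise predictor $\epsilon_\theta$, condition $c$, null condition $\varnothing$, guidance scale $w$) follow common usage and are not taken from the paper; under the description above, the perturbed target produced by the IPG engine would play the role of $c$.

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \, \bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)$$

With $w = 1$ this reduces to the conditional prediction, and $w > 1$ extrapolates further away from the unconditional prediction.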

2606.04466 2026-06-04 cs.CL

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

学习什么:小语言模型推理中SFT-then-RL的阶段特定数据集

Chongyang He, Rui Zhang, Zixuan Wang, Xin Li

发表机构 * Tsinghua University(清华大学) National University of Singapore(新加坡国立大学) DiDi(滴滴出行) University of Electronic Science and Technology of China(电子科技大学)

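AI Summary: Proposes a difficulty-aware SFT-then-RL framework that coordinates data difficulty through stage-specific datasets (a Bridge mechanism in the SFT stage and Critique Fine-Tuning in the RL stage) to improve the reasoning performance of small language models.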

Comments 25 pages, 12 figures

AI中文摘要

后训练小语言模型(SLM)进行推理通常遵循SFT-then-RL流程,但现有工作很少考虑每个阶段应该学习什么数据。我们认为数据策略应与SFT和RL的不同角色对齐:SFT更适合获取尚未掌握的推理技能,而RL更适合巩固模型已部分掌握的技能。基于这一原则,我们提出了一种难度感知的SFT-then-RL框架,将训练数据组织成阶段特定的数据集。对于SFT阶段的困难样本,我们引入桥接机制,将教师生成的原始推理轨迹转化为SLM更易学习的监督信号。对于RL阶段仍未解决的困难样本,我们应用批判微调,将零奖励失败转化为诊断、修复和新的推理轨迹监督,用于下一SFT阶段。在两个SLM上跨越五个推理基准的实验表明,我们的方法在代表性SFT、蒸馏和RL基线上持续改进。我们的结果强调了协调SFT和RL之间数据难度对于有效SLM推理后训练的重要性。

英文摘要

Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning, converting all-zero-reward failures into diagnosis, repair, and new reasoning-trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.

2606.04465 2026-06-04 cs.CL cs.AI

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

SePO: 用于系统提示优化的自我进化提示智能体

Wangcheng Tao, Han Wu, Weng-Fai Wong

发表机构 * National University of Singapore(新加坡国立大学) City University of Hong Kong(香港城市大学)

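AI Summary: Proposes SePO, whose self-referential design lets a prompt agent optimize both the task agents' system prompts and its own. Trained with a two-stage evolutionary procedure, it improves average accuracy by 4.49 points over Manual-CoT across five benchmarks.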

Comments 26 pages. Code: https://github.com/taowangcheng/SePO

AI中文摘要

系统提示优化在不修改底层模型的情况下改善智能体行为,生成可读且模型无关的指令。现有方法构建一个提示智能体来优化任务智能体的系统提示,但提示智能体自身的系统提示仍由人工设计且固定不变。我们提出自我进化提示优化(SePO),将提示智能体自身的系统提示与任务智能体的系统提示一同作为优化目标。SePO采用自我指涉设计:一个单一的提示智能体在开放式进化搜索下同时改进任务智能体的系统提示和自身的系统提示,该搜索维护一个候选提示档案作为垫脚石。训练分为两个阶段:预训练在多任务池上进化提示智能体,微调则将其应用于目标任务。在涵盖数学(AIME'25)、抽象推理(ARC-AGI-1)、研究生级科学(GPQA)、代码生成(MBPP)和逻辑谜题(数独)的五个基准上,SePO始终优于Manual-CoT、TextGrad和MetaSPO,与Manual-CoT相比平均准确率提升4.49个百分点。预训练中的提示优化技能也能泛化到预训练混合任务之外的任务,而非记忆每个任务的提示。

英文摘要

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

2606.04461 2026-06-04 cs.CV

ChannelTok: Efficient Flexible-Length Vision Tokenization

ChannelTok: 高效灵活长度视觉分词

Sukriti Paul, Arpit Bansal, Tom Goldstein

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

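AI Summary: Proposes a lightweight channel-wise flexible-length tokenizer that uses stochastic tail-dropping during training to order channels by semantic importance, greatly improving decoding speed and model efficiency while maintaining high quality.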

AI中文摘要

领先的灵活视觉分词器以极端成本实现SOTA质量,依赖参数繁重的骨干网络和缓慢的多步生成解码器。我们摆脱这种复杂的空间分词范式,引入一种简单、轻量且快速的通道级灵活长度分词器。我们的方法将每个潜在通道视为一个视觉标记,采用参数高效的CNN-Transformer混合骨干网络。此外,在训练过程中采用随机尾部丢弃范式,自然地迫使通道按语义重要性排序。这使得在推理时只需保留前$k$个通道即可实现灵活压缩,并自然支持可变长度自回归图像生成。我们通过在ImageNet上的大量实验验证了该方法,展示了在不同标记预算下的一致质量。结果建立了新的质量-效率前沿:我们的模型实现了最先进的感知质量(rFID 2.92),同时解码速度比次优方案快$8.6\times$,参数量小$2.1\times$(1.59亿参数)。我们的工作将通道级分词确立为高效视觉表示的一种强大且实用的范式。项目页面:https://channeltok.github.io

英文摘要

Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
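
The tail-dropping idea can be sketched in a few lines of Python: during training a random number of leading channels is kept and the rest are masked, and at inference the first $k$ channels are kept for a chosen token budget. Treating each channel as a single float, sampling the cutoff uniformly, and masking with zeros rather than truncating are simplifying assumptions, not details confirmed by the abstract.

```python
import random


def tail_drop(channels, keep):
    """Zero every channel after the first `keep`, preserving the channel count."""
    return [c if i < keep else 0.0 for i, c in enumerate(channels)]


def stochastic_tail_drop(channels, rng):
    """Training-time step: sample how many leading channels survive, then drop the tail."""
    keep = rng.randint(1, len(channels))  # randint is inclusive on both ends
    return tail_drop(channels, keep), keep


latent = [0.9, -0.4, 0.2, 0.05]  # one hypothetical value per latent channel
rng = random.Random(0)
dropped, kept = stochastic_tail_drop(latent, rng)
print(kept, dropped)

# Inference-time compression to a budget of k = 2 tokens.
print(tail_drop(latent, 2))  # [0.9, -0.4, 0.0, 0.0]
```

Because early channels survive every sampled cutoff while late channels are often masked, the decoder is pushed to rely on the early ones for the most important content, which is the ordering effect the abstract attributes to this training scheme.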

2606.04457 2026-06-04 cs.CV

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

Affiliations: Nanyang Technological University; National University of Singapore; Zhejiang University; The Chinese University of Hong Kong

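AI summary: Proposes Visual Prompt Engineering (VPE), in which a single model first generates visual semantic tokens as an intermediate plan and then generates the full image, avoiding the information bottleneck and improving image generation quality and editing fidelity.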

Abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
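To put the reported PSNR gap in perspective, the short calculation below converts a PSNR difference into the implied ratio of mean squared errors, using the standard definition PSNR = 10 log10(MAX^2 / MSE). The helper names are hypothetical, and the result is only indicative: the abstract does not say how PSNR was aggregated across images, and a ratio derived from averaged PSNR values need not equal the ratio of average MSEs.

```python
import math

def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_from_psnr_gap(psnr_high, psnr_low):
    """How many times smaller the MSE is at psnr_high than at psnr_low.

    The peak value cancels out, so only the dB difference matters.
    """
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)

# PSNR values quoted in the abstract (internal vs. external integration).
ratio = mse_ratio_from_psnr_gap(26.76, 19.92)
print(f"MSE ratio implied by the PSNR gap: {ratio:.2f}")
```

Under these assumptions, the 6.84 dB gap corresponds to a pixel-level squared error that is between four and five times lower for the internally integrated model.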

2606.04455 2026-06-04 cs.AI cs.CL

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Ant Group

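AI summary: Proposes the Meta-Agent Challenge (MAC) framework for evaluating how well frontier models can autonomously develop agent systems, finding that most meta-agents struggle to match human-designed baseline policies and that robustness and alignment problems remain.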

Comments: Website: https://meta-agent-challenge.com/

Abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, the framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors such as ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.