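arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.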

2606.04552 2026-06-04 cs.CL q-bio.GN

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Daria Ledneva, Denis Kuznetsov

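Affiliations: University of California, Berkeley

AI summary: Proposes LDARNet, a 120M-parameter hierarchical genomic foundation model that combines dynamic chunking with bidirectional routing; it outperforms larger models across 27 tasks, and its learned boundaries align with biological motifs.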

Abstract

Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models ($<$300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20$\times$ larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.

2606.04545 2026-06-04 cs.CV

Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization

Zhenliang Li, Yutao Hu, Qixiong Wang, Wenpeng Du, Hongxiang Jiang, Jiasong Wu, Xiaolong Jiang, Jungong Han

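Affiliations: Southeast University; Xiaohongshu Inc.; Tsinghua University

AI summary: To address the limited visual realism, manipulation diversity, and generator coverage of existing image manipulation detection and localization benchmarks, the authors propose the Impostor dataset and the CraftAgent framework, and design PhaseAware-Net, which performs strongly on multiple benchmarks.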

Comments 10 pages, 3 figures, 5 tables

Abstract

Recent advances in generative image editing have improved the realism and controllability of localized image manipulation, raising new challenges for image manipulation detection and localization (IMDL). However, existing IMDL benchmarks still have limitations in visual realism, manipulation diversity, and generator coverage, making it difficult to reflect recent trends in image manipulation. To address these limitations, we introduce Impostor, a high-quality AI-edited image manipulation localization dataset containing 100K manipulated images. Impostor is constructed by CraftAgent, a closed-loop agent framework that integrates scene perception, editing planning, manipulation execution, quality validation, and iterative reflection to automatically generate diverse and visually realistic manipulated images. Moreover, Impostor contains images generated by seven recent AIGC models across three manipulation types and includes multiple manipulated regions, providing a more comprehensive benchmark for AIGC-based IMDL. Furthermore, we propose PhaseAware-Net (PANet), a semantic-forensic framework that introduces local phase modeling and semantic-forensic consistency learning to better localize semantically plausible yet forensically disrupted manipulated regions. Extensive experiments show that Impostor poses significant challenges to existing large vision-language models (LVLMs) and specialized IMDL methods, while PANet achieves superior performance on Impostor and multiple public benchmarks.

2606.04536 2026-06-04 cs.AI

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Kai Tang, Zhengqing Zang, Bowen Song, Weiqiang Wang, Gang Chen

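Affiliations: Zhejiang University; Ant Group

AI summary: Proposes GeoMin, which models the global feature distribution of labeled data to decode the structural differences between correct and incorrect rollouts, building a robust prior for judging the reliability of self-reward signals. It uses unlabeled data efficiently with few labels and surpasses fully supervised models with only 10% of the annotations.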

Abstract

Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.

2606.04511 2026-06-04 cs.CL cs.LG

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

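Affiliations: NVIDIA; Thinking Machines Lab; ByteDance Seed; MIT

AI summary: Proposes SparDA, an architecture that adds a fourth projection, the Forecast, to decouple KV-cache prefetching from attention and cut sparse-selection overhead, yielding up to 1.25× prefill and 1.7× decode speedups in long-context inference.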

Abstract

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25$\times$ prefill speedup and 1.7$\times$ decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3$\times$ higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
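The lookahead selection described in this abstract can be illustrated with a toy sketch: a forecast vector computed at the current layer scores per-block key summaries, and the top-scoring blocks are the ones to prefetch for the next layer. The mean-pooled block summary, the dot-product score, and the names `block_summaries` and `select_blocks` are assumptions made for illustration; they are not taken from the SparDA code.

```python
from typing import List, Sequence


def block_summaries(keys: Sequence[Sequence[float]], block_size: int) -> List[List[float]]:
    """Mean-pool keys within each block (an assumed summary, not the paper's)."""
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        summaries.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return summaries


def select_blocks(forecast: Sequence[float],
                  summaries: Sequence[Sequence[float]],
                  top_k: int) -> List[int]:
    """Return indices of the top_k blocks ranked by dot product with the forecast."""
    scores = [sum(f * s for f, s in zip(forecast, summ)) for summ in summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sort the chosen indices so a prefetcher could copy blocks in memory order.
    return sorted(ranked[:top_k])
```

In a real system the selected indices would drive an asynchronous CPU-to-GPU copy that overlaps with the current layer's compute; that part is omitted here.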

2606.04507 2026-06-04 cs.CL cs.AI

Self-Evolving Deep Research via Joint Generation and Evaluation

Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

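Affiliations: The Hong Kong University of Science and Technology; ByteDance, China; University College London

AI summary: Proposes the SCORE framework, which jointly optimizes an evaluator and a solver through shared-parameter co-evolutionary training, addressing the unverifiable rewards of deep research report generation and steadily improving generation quality.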

Abstract

Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a \textbf{s}elf-evolving \textbf{co}-evolutionary training framework for deep \textbf{re}search evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

2606.04505 2026-06-04 cs.AI

Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

Yuhan Yang, Ruipu Li, Alexander Rodríguez

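Affiliations: Computer Science and Engineering, University of Michigan

AI summary: Proposes the MechSim framework, which uses neuro-symbolic reasoning to let LLMs reason about the mechanisms and assumptions of scientific simulators, improving decision transparency and reliability.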

Abstract

Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black-box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework for executable scientific simulators. Unlike prior neuro-symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence-grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high-stakes domains and show that it improves mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability.

2606.04503 2026-06-04 cs.LG cs.AI

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen

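Affiliations: Zhejiang University; Ant Group

AI summary: To address the low data efficiency of reinforcement learning with verifiable rewards (RLVR), proposes the PivotTrace framework, which uses attention dynamics to trace metacognitive pivots during reasoning and quantifies uncertainty through pivot density to route data automatically. It surpasses the fully supervised model with only 29.3% annotated samples and 2.75× faster convergence.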

Abstract

Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires time-consuming training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely studied from two perspectives: (i) data selection methods identify a small subset of "golden" samples that yield near-full-data performance, but they rely on a pre-existing pool of labeled data; (ii) unsupervised RLVR methods train the model using its own internal supervision signals on large-scale unlabeled data, yet they exhibit suboptimal performance. Accordingly, we investigate the "pick in the dark" setup for RLVR, which aims to select, without prior supervision, unlabeled samples that are most beneficial for training and worthy of annotation. Through systematic analysis, we demonstrate that smart picks hinge on a well-calibrated uncertainty estimator to enable strategic partitioning of data for adaptive training regimes. Building on this insight, we propose PivotTrace, a three-way data triage framework that leverages attention dynamics to trace metacognitive pivots during reasoning. By precisely quantifying uncertainty through pivot density, PivotTrace achieves automated data routing to synergistically maximize both annotation and training efficiency. Empirically, PivotTrace surpasses the fully supervised LRM with only 29.3% annotated samples and 2.75$\times$ faster convergence.

2606.04500 2026-06-04 cs.CL

SANE: Schema-aware Natural-language Evaluation of Biological Data

Rolf Gattung, Martin Krueger, Markus Reischl

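Affiliations: Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT)

AI summary: Proposes the SANE paradigm, which uses schema-aware, automatically generated benchmarks to evaluate the reliability of a few-shot large language model on domain-specific text-to-SQL tasks, and finds that structured prompting and guardrails enable accurate query generation.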

Comments 5 pages, 3 figures, submitted but not yet reviewed by BMT2026

Abstract

High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a natural-language alternative, yet their tendency to hallucinate raises concerns about result reliability. We present SANE (Schema-Aware Natural-language Evaluation), a novel paradigm for domain-specific text-to-SQL evaluation: schema-grounded, automatically generated benchmarks tied to real and specific experimental structure. SANE makes evaluation more scalable, systematic, and reproducible. Using SANE, we evaluate a few-shot large language model and show that, under constrained schemas with structured prompting and guardrails, accurate query generation is achievable without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and manifest as overly cautious clarification requests or answers to queries that should first be disambiguated, rather than incorrect SQL generation. These results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.

2606.04494 2026-06-04 cs.AI

ChessMimic: Rating-Banded Transformers for Human Move, Clock, and Outcome Prediction in Online Blitz Chess

Thomas Johnson

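Affiliations: nascent.xyz

AI summary: Proposes ChessMimic, a system of three small encoder-only transformers for move, thinking-time, and outcome prediction, trained separately per Elo rating band for sharper skill calibration. On Lichess blitz data its move-prediction accuracy exceeds Maia-2, its outcome model reaches an AUC of 0.78, and its clock model gives a usable but non-SOTA thinking-time signal.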

Abstract

We present ChessMimic, a system of three small encoder-only transformers (for move, thinking-time, and outcome prediction) conditioned on the position, recent move history, player rating, and clock state. We fit a separate instance of each model per 100-Elo rating band, trading parameter efficiency for sharper per-skill calibration. On a held-out month-wide slice of Lichess Rated Blitz games, ChessMimic's human move prediction accuracy outperforms Maia-2 in every Elo band. Compared to Maia-3, our 9M parameter model's accuracy sits between Maia-3-5M and Maia-3-23M without the additional complexity of Geometric Attention Bias. In addition to the move matching model, we also train a game outcome model that conditions not only on the position, but also on player ratings, time control, and remaining clock times. The outcome model achieves an AUC of 0.78 out of sample, beating Maia-2 as well as logistic regressions based on material, ratings, and clock time. Finally, we train a clock model that predicts human thinking times. The clock model provides a usable but non-SOTA per-ply think-time signal under ALLIE-style filters (Pearson r = 0.41, Spearman rho = 0.50, MAE 4.10 s, against ALLIE's reported r = 0.70), with the residual gap concentrated in per-position bucket sharpness rather than bucket-marginal calibration. A public demo is at 1e4.ai, and we release code, per-band weights, and the C++ data-filter pipeline code on GitHub.
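The per-band setup in this abstract (one model instance per 100-Elo band) amounts to routing each query to the model whose band contains the player's rating. A minimal sketch follows; `elo_band` and `BandedPredictor` are hypothetical names, and the model is any callable stand-in rather than the released weights.

```python
from typing import Any, Callable, Dict


def elo_band(rating: int, width: int = 100) -> int:
    """Return the lower edge of the rating band that contains `rating`."""
    return (rating // width) * width


class BandedPredictor:
    """Routes a query to the model registered for the player's rating band."""

    def __init__(self) -> None:
        self.models: Dict[int, Callable[[Any], Any]] = {}

    def register(self, band_start: int, model: Callable[[Any], Any]) -> None:
        self.models[band_start] = model

    def predict(self, rating: int, features: Any) -> Any:
        band = elo_band(rating)
        if band not in self.models:
            raise KeyError(f"no model registered for band starting at {band}")
        return self.models[band](features)
```

For example, a 1547-rated player would be served by the model registered for the band starting at 1500. How the released system handles ratings outside its trained bands is not stated in the abstract, so this sketch simply raises an error.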

2606.04469 2026-06-04 cs.CV cs.AI

Adaptive Calibration for Fair and Performant Facial Recognition

Ryan Brown, Chris Russell

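Affiliations: University of Oxford

AI summary: Proposes Adaptive Calibration (AC), which maps cosine similarity between normalized embeddings to calibrated probabilities and uses local context to correct regional differences, improving the overall performance and fairness of facial recognition without demographic metadata.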

Abstract

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves overall performance and yields a fairer calibration without requiring demographic metadata. It consistently dominates existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition that requires no demographic group annotations while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness comes at the cost of degraded performance for some groups.

2606.04468 2026-06-04 cs.LG cs.AI cs.NE math.OC

ChannelTok: Efficient Flexible-Length Visual Tokenization

Sukriti Paul, Arpit Bansal, Tom Goldstein

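Affiliations: University of Maryland, College Park

AI summary: Proposes a lightweight, channel-wise flexible-length tokenizer that is trained with stochastic tail dropping so that channels are ordered by semantic importance, greatly improving decoding speed and model efficiency while maintaining high quality.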

Abstract

Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
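The tail-dropping idea in this abstract can be sketched in a few lines: during training a random number of leading channels is kept and the rest are dropped, and at inference the latent is simply truncated to the first $k$ channels. The uniform sampling of the keep count, the choice to zero dropped channels rather than remove them, and the function names are all assumptions for illustration; the abstract does not specify them.

```python
import random
from typing import List

Latent = List[List[float]]  # one inner list of values per latent channel


def sample_keep_count(num_channels: int, rng: random.Random, min_keep: int = 1) -> int:
    """Sample how many leading channels survive this training step (uniform, assumed)."""
    return rng.randint(min_keep, num_channels)


def drop_tail(latent: Latent, keep: int) -> Latent:
    """Training-time view: zero every channel at index >= keep."""
    return [ch if i < keep else [0.0] * len(ch) for i, ch in enumerate(latent)]


def truncate(latent: Latent, k: int) -> Latent:
    """Inference-time compression: retain only the first k channels."""
    return latent[:k]
```

Because early channels survive every sampled cutoff while late channels are often dropped, a decoder trained this way has an incentive to place the most important information in the leading channels, which is what makes truncation at inference usable.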

2606.04457 2026-06-04 cs.CV

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

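Affiliations: Nanyang Technological University; National University of Singapore; Zhejiang University; The Chinese University of Hong Kong

AI summary: Proposes Visual Prompt Engineering (VPE), in which a single model first generates visual semantic tokens as an intermediate plan and then generates the full image, avoiding the information bottleneck and improving generation quality and editing fidelity.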

Abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
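The editing-preservation figures in this abstract are reported as PSNR. For readers unfamiliar with the metric, the standard definition is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ in decibels; a minimal sketch over flattened pixel lists follows. The 8-bit peak value of 255 is an assumption, since the abstract does not state the evaluation protocol.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], edited: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(edited) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For a single image pair at a fixed peak value, a gap of 26.76 - 19.92 = 6.84 dB corresponds to a mean squared error about 4.8 times lower; averaged benchmark scores need not translate this directly.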

2606.04455 2026-06-04 cs.AI cs.CL

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Xinyu Lu, Tianshu Wang, Pengbo Wang, zujie wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

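Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Ant Group

AI summary: Proposes the Meta-Agent Challenge (MAC), a framework for evaluating whether frontier models can autonomously develop agent systems, and finds that most meta-agents fail to match human-designed baseline policies and show robustness and alignment problems.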

Comments Website: https://meta-agent-challenge.com/

Abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.
