arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12940 2026-06-12 cs.SD cs.LG New submission

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

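Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes self-guidance, which aligns the decoder's internal feature manifolds with a lightweight feature-mapping loss. It improves the reconstruction quality of VQ-VAE neural speech codecs without changing inference, reaches state-of-the-art low-bitrate performance, and supports a 4x codebook reduction.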

Comments 20 pages, 9 figures, accepted to ICML 2026, demo website available at https://sgvqvae.github.io/sgvqvae-demo

Abstract

Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.
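
The alignment idea in the abstract can be illustrated with a minimal sketch. It assumes the feature-mapping loss is a per-layer mean squared gap between decoder features computed from the continuous embedding and from its quantized version; the abstract does not specify the loss form, which layers are matched, or how gradients flow, so `nearest_code`, `feature_alignment_loss`, and the toy layers are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def nearest_code(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> Vector:
    """Vector quantization: return the codebook entry closest to z (squared L2)."""
    best = min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))
    return list(best)


def feature_alignment_loss(layers: Sequence[Layer], z: Sequence[float],
                           z_q: Sequence[float]) -> float:
    """Sum over decoder layers of the mean squared gap between the two feature paths.

    Assumption: the same decoder layers process both the continuous embedding z
    and the quantized embedding z_q, and the loss compares their activations.
    """
    h_cont, h_quant = list(z), list(z_q)
    loss = 0.0
    for layer in layers:
        h_cont, h_quant = layer(h_cont), layer(h_quant)
        loss += sum((a - b) ** 2 for a, b in zip(h_cont, h_quant)) / len(h_cont)
    return loss


# Toy "decoder": two hand-written layers standing in for real network blocks.
toy_layers: List[Layer] = [
    lambda h: [2.0 * x for x in h],
    lambda h: [x + 1.0 for x in h],
]
toy_codebook = [[1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.1]
z_q = nearest_code(z, toy_codebook)
aux_loss = feature_alignment_loss(toy_layers, z, z_q)
```

Because such a term only exists at training time, dropping it at inference leaves the codec's encode, quantize, and decode path unchanged, which matches the "no inference-time changes" claim.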

2606.12939 2026-06-12 cs.CV New submission

MAMVI: 3D Test-Time Adaptation via Masked Multi-View Point Clouds

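Inseok Kong, Geunyoung Jung, Jiyoung Jung

Affiliations: Department of Geo Informatics, University of Seoul; Department of Artificial Intelligence, University of Seoul

AI summary: To address the performance drop of 3D point cloud models under distribution shift, proposes MAMVI, which replaces sequential optimization with a unified single-step adaptation and combines a hybrid masking strategy with multi-view loss aggregation for fast, accurate test-time adaptation.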

Comments Accepted by ICPR 2026

Abstract

3D point cloud models suffer significant performance degradation under distribution shifts caused by sensor noise, occlusions, and environmental changes. Test-time adaptation (TTA) has emerged as a practical paradigm for mitigating this issue during inference. Recently, leveraging multi-view augmentation has shown promise in improving 3D TTA performance. However, existing multi-view approaches are often constrained by sequential optimization that treats each view independently. This sequential optimization leads to substantial inference latency due to repetitive optimization steps, making real-time adaptation impractical. To address this, we propose Masked Multi-View Test-Time Adaptation (MAMVI), which replaces sequential optimization with a unified single-step adaptation. Specifically, MAMVI utilizes a hybrid masking strategy that combines fixed ratios for stability with Beta-distributed sampling for diversity. By aggregating losses across multiple views, MAMVI performs adaptation through a single backward pass based on multi-view consensus. Additionally, a confidence-based adaptive learning rate is used to dynamically adjust the adaptation intensity for each sample. Extensive experiments on ModelNet-40C, ShapeNet-C, and ScanObjectNN-C demonstrate that MAMVI achieves state-of-the-art accuracy on ShapeNet-C and ScanObjectNN-C. Moreover, it remains competitive on ModelNet-40C while delivering 4.9-8.9 times faster inference, making it highly suitable for real-time applications. Our code is available at https://github.com/Inseok-kong/MAMVI
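
A minimal sketch of the hybrid masking and loss aggregation described above, under stated assumptions: the fixed ratios, the Beta parameters, random point dropping as the masking operator, and averaging as the aggregation rule are all illustrative choices, not values or design details taken from the paper.

```python
import random
from typing import List, Sequence


def hybrid_mask_ratios(fixed: Sequence[float], n_random: int, alpha: float,
                       beta: float, rng: random.Random) -> List[float]:
    """Fixed ratios (stability) followed by Beta-sampled ratios (diversity)."""
    return list(fixed) + [rng.betavariate(alpha, beta) for _ in range(n_random)]


def mask_points(points: Sequence, ratio: float, rng: random.Random) -> list:
    """Drop roughly `ratio` of the points at random, keeping at least one."""
    keep = max(1, round(len(points) * (1.0 - ratio)))
    return rng.sample(list(points), keep)


def aggregate_view_losses(view_losses: Sequence[float]) -> float:
    """Combine per-view losses into one scalar (mean is an assumption here).

    A single backward pass on this scalar replaces one optimization step per view.
    """
    return sum(view_losses) / len(view_losses)
```

In a real pipeline each masked view would be passed through the model, a per-view loss computed, and one optimizer step taken on the aggregated value.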

2606.12936 2026-06-12 cs.RO cs.AI New submission

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang, He Xu, Peijia Li, Jiaming Gu, Quan Lu, Qi Wang, Bin Ji, Ting Xiao

Affiliations: Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education; Department of Computer Science and Engineering; Department of Laboratory Medicine; Shanghai Jiao Tong University School of Medicine

AI summary: Presents the Pipette platform, comprising editable assets, a simulation-based data augmentation pipeline, and an 11-task benchmark. With 30 demonstrations per task, simulation augmentation raises SmolVLA's success rate from 44.1% to 74.7%.

Comments 25 pages, 17 figures

Abstract

Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters the generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves a 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.
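
The replay, perturb, and filter loop described in the abstract could look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the `simulate` callable is a hypothetical stand-in for a simulator replay that returns an episode and a success flag, and the perturbation names and ranges are invented, not Pipette's actual parameters.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

Perturbation = Dict[str, float]
Simulator = Callable[[Any, Perturbation], Tuple[Any, bool]]


def augment(demos: Sequence[Any], simulate: Simulator, n_variants: int,
            rng: random.Random) -> List[Any]:
    """Replay each demonstration under random perturbations; keep successful episodes."""
    kept: List[Any] = []
    for demo in demos:
        for _ in range(n_variants):
            perturbation = {
                "light_scale": rng.uniform(0.7, 1.3),      # lighting change (assumed range)
                "camera_jitter": rng.uniform(-0.02, 0.02),  # camera offset (assumed range)
                "speed_scale": rng.uniform(0.8, 1.2),       # replay speed (assumed range)
                "action_noise": rng.uniform(0.0, 0.01),     # action perturbation (assumed range)
            }
            episode, success = simulate(demo, perturbation)
            if success:  # automatic task success check acts as the filter
                kept.append(episode)
    return kept
```

The filter step matters because perturbed replays can fail the task; only episodes that still succeed are added to the training set.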

2606.12935 2026-06-12 cs.AI New submission

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

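Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su, Tianpei Xie

Affiliations: Amazon; Stanford University; University of Pennsylvania

AI summary: Proposes the MARS stopping rule, which monitors the aggregate vote at intermediate checkpoints and uses an adversarial bound to estimate future vote movement, saving 25-47% of self-consistency tokens while matching full-budget accuracy.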

Abstract

Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
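
To make the margin-adversarial idea concrete, here is a deliberately simplified sketch. It is not the paper's rule: it works with expected vote counts rather than a calibrated high-probability bound, and it assumes the worst case in which every switching trace lands on the current runner-up. The function name and the `slack` parameter are hypothetical.

```python
from collections import Counter
from typing import Sequence


def safe_to_stop(answers: Sequence[str], switch_probs: Sequence[float],
                 slack: float = 0.0) -> bool:
    """Return True if the current leader survives a pessimistic estimate of vote movement.

    answers[i] is the answer probed from trace i at this checkpoint and
    switch_probs[i] is its estimated probability of changing answer later.
    """
    votes = Counter(answers).most_common()
    leader, leader_votes = votes[0]
    runner_up_votes = votes[1][1] if len(votes) > 1 else 0
    # Votes the leader is expected to lose because its own traces switch away.
    expected_leader_loss = sum(p for a, p in zip(answers, switch_probs) if a == leader)
    # Adversarial assumption: all expected switching mass lands on the runner-up.
    worst_case_gain = sum(switch_probs)
    margin = (leader_votes - expected_leader_loss) - (runner_up_votes + worst_case_gain)
    return margin > slack
```

For example, six traces answering "42" and two answering "7", each with switch probability 0.05, leave a pessimistic margin of 3.3 votes, so generation could stop; three against two with switch probability 0.4 does not.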

2606.12930 2026-06-12 cs.LG New submission

Is Spurious Correlation Removal Always Learnable?

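Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang, Xiaokang Zhang, Ruifan Zhang

Affiliations: University of Science and Technology of China

AI summary: Studies computational barriers to invariant learning when the invariant structure is statistically identifiable. Shows, conditionally, that there exist samplable multi-environment instances with a one-dimensional invariant subspace on which no polynomial-time algorithm reaches constant accuracy, and quantifies how environment diversity affects identifiability and risk.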

Comments Poster paper at ICML 2026

Abstract

Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\operatorname{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.
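
As a rough worked instance of the stated phase-transition scaling, consider illustrative values that are assumptions and not taken from the paper, with the constant hidden by the proportionality ignored:

```latex
n^* \;\propto\; \frac{k(d-k)}{|\mathcal{E}|\,\gamma^2}
    \;=\; \frac{1\cdot(101-1)}{5\cdot(0.1)^2}
    \;=\; \frac{100}{0.05}
    \;=\; 2000
\qquad \text{for } k=1,\; d=101,\; |\mathcal{E}|=5,\; \gamma=0.1 .
```

Under the same scaling, halving the separation to $\gamma=0.05$ quadruples the threshold to 8000, while doubling the number of environments halves it.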

2606.12925 2026-06-12 cs.CV cs.LG New submission

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

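Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Qing Gu

Affiliations: University of Science and Technology of China

AI summary: Proposes Bayesian Conditional Priors (BCP) estimation, a gradient-free test-time adaptation method that injects label dependency through online estimation of anchor-conditioned priors, making frozen vision-language models more robust to distribution shift in multi-label recognition.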

Comments Accepted by ICML 2026

Abstract

Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
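
The PMI interpretation mentioned above can be illustrated with a small sketch. It assumes the refinement adds a scaled PMI term between each label and the anchor to that label's logit, with PMI estimated from running co-occurrence counts of pseudo-labels; the class name, the `eps` smoothing, and the `strength` parameter are hypothetical and the paper's exact closed form may differ.

```python
import math
from typing import Iterable, List, Sequence


class CooccurrenceStats:
    """Running first- and second-order label counts from an unlabeled test stream."""

    def __init__(self, n_labels: int, eps: float = 1e-3) -> None:
        self.n_images = 0
        self.single = [0] * n_labels
        self.pair = [[0] * n_labels for _ in range(n_labels)]
        self.eps = eps  # small constant to avoid log(0); an assumed smoothing choice

    def update(self, pseudo_labels: Iterable[int]) -> None:
        labels = sorted(set(pseudo_labels))
        self.n_images += 1
        for i in labels:
            self.single[i] += 1
            for j in labels:
                if i != j:
                    self.pair[i][j] += 1

    def pmi(self, j: int, anchor: int) -> float:
        """Pointwise mutual information log p(j, a) / (p(j) p(a))."""
        e = self.eps
        joint = (self.pair[j][anchor] + e) / (self.n_images + e)
        p_j = (self.single[j] + e) / (self.n_images + e)
        p_a = (self.single[anchor] + e) / (self.n_images + e)
        return math.log(joint / (p_j * p_a))


def refine_logits(logits: Sequence[float], stats: CooccurrenceStats,
                  strength: float = 1.0) -> List[float]:
    """Pick the highest-logit label as anchor and shift the other logits by PMI."""
    anchor = max(range(len(logits)), key=lambda i: logits[i])
    return [x if j == anchor else x + strength * stats.pmi(j, anchor)
            for j, x in enumerate(logits)]
```

Labels that frequently co-occur with the anchor receive a positive shift and labels that rarely do receive a negative one, which is the promote-and-suppress behavior the abstract describes.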

2606.12924 2026-06-12 cs.AI New submission

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

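Jetlir Duraj, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma, Rui Kong, Ishita Khan, Qunzhi Zhou

Affiliations: eBay Inc.

AI summary: Proposes a modular two-agent simulation framework that holds the buyer agent fixed to compare responder designs. Rolling-window memory beats intent-extraction memory on quality and speed, and fixes guided by failure analysis cut failure and near-failure rates by 62%.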

Abstract

We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16-0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.
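
For readers unfamiliar with the term, a rolling-window memory simply keeps the most recent turns verbatim and drops older ones. The sketch below is a generic illustration, not the paper's implementation; the class name and the turn format are assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class RollingWindowMemory:
    """Keep only the last `max_turns` conversation turns as responder context."""

    def __init__(self, max_turns: int) -> None:
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        # deque(maxlen=...) silently evicts the oldest turn once the window is full.
        self._turns.append((role, text))

    def context(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self._turns)
```

Compared with an intent-extraction memory, this design needs no extra summarization call per turn, which is one plausible reason for a per-query speed difference.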

2606.12922 2026-06-12 cs.CL cs.CY New submission

Polar: A Benchmark for Evaluating Political Bias in LLMs

Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee

Affiliations: Graduate School of Data Science, Seoul National University; Dept. of Computer Science and Engineering, Seoul National University

AI summary: Proposes the Polar benchmark, which measures political bias in large language models through option-level likelihoods across U.S. and South Korean political contexts, and finds that bias varies with context, issue category, model group, and language.

Comments Submitted to ARR 2026 May cycle

Abstract

Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.
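
Option-level likelihood scoring, as opposed to prompt-based generation, can be sketched as follows. The sketch assumes that each answer option has a list of token log-probabilities obtained from the model under evaluation and that options are compared with a softmax over their summed (or optionally length-normalized) log-probabilities; how Polar actually aggregates and normalizes is not stated in the abstract.

```python
import math
from typing import Dict, Sequence


def option_probabilities(option_logprobs: Dict[str, Sequence[float]],
                         length_normalize: bool = False) -> Dict[str, float]:
    """Turn per-option token log-probabilities into a distribution over options."""
    scores = {
        opt: (sum(lps) / len(lps) if length_normalize else sum(lps))
        for opt, lps in option_logprobs.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(s - top) for opt, s in scores.items()}
    total = sum(exps.values())
    return {opt: e / total for opt, e in exps.items()}
```

Because the score is read directly from the model's likelihoods, the measurement does not depend on sampling temperature or on parsing free-form text, which is what makes it easier to reproduce.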

2606.12921 2026-06-12 cs.LG cs.AI New submission

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

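Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

Affiliations: Ateneo de Manila University; EleutherAI; NaXys, UNamur

AI summary: Proposes the LoRA-Muon optimizer, which applies Muon's spectral steepest-descent rule to low-rank finetuning to address LoRA's sensitivity to initialization and the poor transfer of optimal learning rates across ranks. On TinyShakespeare, a rank-32 run reaches lower validation loss than the dense baseline.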

Comments 20 pages, 4 figures

Abstract

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning deep learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-2 proxy recovers the dense best tested learning rate, and a rank-32 LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
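
As background for the phrase "spectral steepest descent": for a dense gradient matrix with singular value decomposition U S V^T, the steepest-descent direction under the spectral norm is the orthogonal factor U V^T. The sketch below computes that factor for a small full-rank matrix with the classical cubic Newton-Schulz iteration. It illustrates only the dense direction, not the LoRA-Muon low-rank update derived in the paper, and it uses the textbook cubic iteration rather than any tuned variant; the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def orthogonalize(g: Matrix, steps: int = 30) -> Matrix:
    """Approximate U V^T for a nonzero, full-rank matrix g = U S V^T.

    Cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X, started from g scaled by its
    Frobenius norm so that all singular values lie in (0, 1].
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxtx)]
    return x
```

Each iteration pushes every singular value toward 1 while leaving the singular vectors unchanged, so the result keeps the gradient's directions but discards its scale.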

2606.12916 2026-06-12 cs.AI cs.CL cs.LG New submission

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

Affiliations: University of Notre Dame; University of Connecticut

AI summary: Proposes MDForge, an LLM agent that densifies sparse rewards through multi-agent debate to design molecular dynamics pipelines automatically. It is competitive with human experts on SAMPL benchmarks and discovers a novel high-affinity CB[7] binder.

Abstract

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principles physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong-Wang/MDForge.

2606.12913 2026-06-12 cs.LG cs.CV New submission

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

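Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

Affiliations: University of Science and Technology of China

AI summary: Proposes a unified graph-based dataset pruning framework that models the dataset as a weighted graph and selects samples by solving a Maximum Weight Clique Problem with a greedy algorithm. It outperforms existing methods across pruning ratios and cuts training time on ImageNet-1k by over 40% without losing accuracy.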

Comments ICML 2026

Abstract

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
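
A greedy selection driven by sample-wise marginal gains can be sketched as below. The sketch assumes the objective for a selected set is the sum of its node weights plus the sum of edge weights between selected pairs, so the marginal gain of a candidate is its node weight plus its edge weights to everything already chosen. The weight definitions, tie-breaking, and function name are illustrative assumptions, not the paper's exact algorithm.

```python
from typing import List, Sequence


def greedy_prune(node_w: Sequence[float], edge_w: Sequence[Sequence[float]],
                 budget: int) -> List[int]:
    """Select up to `budget` samples by repeatedly taking the largest marginal gain."""
    selected: List[int] = []
    remaining = set(range(len(node_w)))
    while remaining and len(selected) < budget:
        def gain(i: int) -> float:
            # Intrinsic value of sample i plus its pairwise value to the current set.
            return node_w[i] + sum(edge_w[i][j] for j in selected)

        best = max(sorted(remaining), key=gain)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected
```

With node weights `[1.0, 0.9, 0.2]` and edge weights that are zero between the two near-duplicate samples 0 and 1 but 1.0 between either of them and sample 2, a budget of two picks sample 0 and then sample 2, since diversity outweighs the higher intrinsic score of sample 1.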

2606.12911 2026-06-12 cs.CL New submission

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

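Giang Son Nguyen, Tung X. Nguyen, Hieu Minh Truong, Nhu Vo, Wray Buntine, Dung D. Le

Affiliations: VinUniversity; University of Technology Sydney; Monash University

AI summary: To counter ASR error propagation in cascaded speech translation, proposes PiDA, a phonetically-informed data augmentation method that uses phonetic word embeddings to generate similar-sounding substitutions, improving translation of erroneous ASR outputs on FLEURS Vietnamese-English by up to +2.04 BLEU.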

Comments Accepted to INTERSPEECH 2026

Abstract

Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.
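
The substitution step can be sketched as a nearest-neighbor lookup in a phonetic embedding space. The embedding table below contains made-up two-dimensional vectors purely for illustration (they are not real phonetic embeddings), and the corruption rate, neighbor count, and function names are assumptions rather than PiDA's actual settings.

```python
import math
import random
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def phonetic_neighbors(word: str, embeddings: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k words whose embeddings are most similar to `word`."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine(embeddings[word], embeddings[w]), reverse=True)
    return others[:k]


def corrupt(tokens: Sequence[str], embeddings: Dict[str, Sequence[float]], rate: float,
            rng: random.Random, k: int = 1) -> List[str]:
    """Replace each in-vocabulary token, with probability `rate`, by a similar-sounding word."""
    out: List[str] = []
    for tok in tokens:
        if tok in embeddings and rng.random() < rate:
            out.append(rng.choice(phonetic_neighbors(tok, embeddings, k)))
        else:
            out.append(tok)
    return out


# Toy vectors only: words placed close together are treated as sounding alike.
toy_embeddings = {
    "sáu": [1.0, 0.0], "sấu": [0.9, 0.1],
    "bàn": [0.0, 1.0], "bạn": [0.1, 0.9],
}
```

Corrupted source sentences paired with the original reference translations would then form the augmented fine-tuning data, so the translation model learns to recover the intended meaning from ASR-like errors.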

2606.12908 2026-06-12 cs.CL New submission

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang

Affiliations: Northeastern University; Independent Researcher; Northwestern University

AI summary: Proposes the SENTINEL framework, which turns agent failures into targeted training tasks. On Tau2-Bench Retail it raises the Pass^1 of Qwen3-4B-Thinking-2507 from 66.4 to 74.9 and outperforms reinforcement learning on general synthetic tasks.

Abstract

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
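
The pass-hat-k metric reported above differs from the more familiar pass@k: it measures the chance that all k independent trials of a task succeed, averaged over tasks. The sketch below assumes the standard unbiased estimator used for tau-bench-style evaluation, C(c, k) / C(n, k) for a task with c successes in n trials; whether the paper computes it in exactly this way is an assumption.

```python
from math import comb
from typing import Sequence, Tuple


def pass_hat_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where each task has c successes in n trials."""
    if any(n < k for _, n in results):
        raise ValueError("every task needs at least k trials")
    per_task = [comb(c, k) / comb(n, k) for c, n in results]
    return sum(per_task) / len(per_task)
```

For two tasks with 4 of 4 and 2 of 4 successful trials, the value is 0.75 at k=1 but drops to about 0.58 at k=2, which shows why the metric rewards consistency rather than occasional success.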

2606.12903 2026-06-12 cs.CL New submission

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

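Yongqi Kang, Yu Fu, Yong Zhao

Affiliations: Sichuan University

AI summary: Proposes the X-MADAM-RAG pipeline, which handles Chinese-English evidence conflict in RAG by decomposing evidence handling into candidate extraction, visible-evidence repair, deterministic grouping, and conflict-aware aggregation. It achieves high accuracy on a controlled benchmark, but document-level extraction turns out to be the main bottleneck.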

Abstract

Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.
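
The last two pipeline stages, deterministic candidate grouping and conflict-aware aggregation, can be sketched as follows. The sketch assumes per-document candidates have already been extracted as strings and groups them by a simple Unicode and case normalization; it does not attempt cross-lingual matching (for example, a Chinese and an English name for the same entity), and the function names and output format are hypothetical.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Extraction = Tuple[str, str, str]  # (document id, language, extracted candidate)


def normalize(candidate: str) -> str:
    """Deterministic key: Unicode NFKC, trimmed, case-folded."""
    return unicodedata.normalize("NFKC", candidate).strip().casefold()


def group_candidates(extractions: Sequence[Extraction]) -> Dict[str, List[Tuple[str, str]]]:
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for doc_id, lang, cand in extractions:
        groups[normalize(cand)].append((doc_id, lang))
    return dict(groups)


def aggregate(extractions: Sequence[Extraction]) -> dict:
    """Report all supported candidates and flag a conflict when more than one survives."""
    groups = group_candidates(extractions)
    ranked = sorted(groups, key=lambda key: (-len(groups[key]), key))
    return {"conflict": len(ranked) > 1, "candidates": ranked, "support": groups}
```

Keeping these stages rule-based means any error in the final answer can be traced back to the extraction step, which is consistent with the abstract's finding that document-level extraction is the main bottleneck.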

2606.12902 2026-06-12 cs.CL New submission

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

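Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang

Affiliations: School of Computer Science and Engineering, Northeastern University

AI summary: Proposes PRISM, a multi-agent framework that decouples speech perception, response generation, and speech synthesis and introduces a prosody-to-language translation mechanism to achieve prosodic appropriateness and knowledge integration in empathetic spoken dialogue.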

Comments Accepted to Interspeech 2026

Abstract

Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: https://github.com/Bxzfrm/PRISM.

2606.12898 2026-06-12 cs.CV cs.CL 新提交

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

放大关键信息:面向视觉文本理解的注意力引导自适应渲染

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu

发表机构 * Michigan State University(密歇根州立大学) Xi’an Jiaotong University(西安交通大学)

AI总结 针对视觉语言模型在视觉文本理解任务中存在的定位与利用脱节问题,提出无需训练、模型无关的注意力引导自适应渲染方法AGAR,通过放大关键文本跨度提升模型性能。

详情
AI中文摘要

视觉文本理解(VTC)将文本渲染为图像供视觉语言模型(VLM)阅读,绕过了LLM的上下文窗口限制,并支持从长页OCR到多页记忆问答等应用。然而,现有的VTC流水线将渲染和布局视为固定的、内容无关的预处理步骤,并且对VLM内部如何处理可视化文本的机制理解甚少。通过对VTC问答任务的聚焦实证研究,我们揭示了VLM存在一种“定位而不利用”的模式:证据定位注意力在中间到后期层中急剧出现,并且与答案正确性在很大程度上解耦,然而仅仅放大渲染页面上定位的跨度就能恢复大部分失败。基于这些观察,我们提出了AGAR(注意力引导自适应渲染),一种无需训练、模型无关的方法,该方法利用VLM自身的中间到后期层注意力来识别前K个重要的视觉补丁,将它们映射回单词跨度,并在重新推理答案之前重新渲染页面,放大这些跨度。在九个VTC基准测试(短文本、长上下文和多页记忆问答)和四个VLM骨干上的大量实验表明,AGAR(i)作为即插即用的增强,持续改进了现成的VLM,(ii)与VLM后训练相结合可带来进一步收益,并且(iii)在视觉和文本侧输入退化下保持鲁棒性。

英文摘要

Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
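
The select, map, and enlarge steps described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the patch-to-word mapping, the `scale` factor, and all function names are hypothetical, and the abstract does not say how AGAR implements any of these steps.

```python
from typing import Dict, List


def top_k_patches(attention: List[float], k: int) -> List[int]:
    """Return indices of the k patches with the highest attention mass."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])


def words_to_enlarge(patches: List[int], patch_to_words: Dict[int, List[int]]) -> List[int]:
    """Map selected patches back to the word indices they cover (assumed mapping)."""
    words = set()
    for p in patches:
        words.update(patch_to_words.get(p, []))
    return sorted(words)


def font_scales(num_words: int, enlarged: List[int], scale: float = 1.5) -> List[float]:
    """Per-word font scale for re-rendering: `scale` for selected words, 1.0 otherwise."""
    chosen = set(enlarged)
    return [scale if w in chosen else 1.0 for w in range(num_words)]


# Toy example (made-up numbers): 6 patches over a 5-word page.
attention = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
patch_to_words = {0: [0], 1: [1, 2], 2: [2], 3: [3], 4: [4], 5: [4]}
selected = top_k_patches(attention, k=2)
enlarged = words_to_enlarge(selected, patch_to_words)
scales = font_scales(5, enlarged)
```

In a real pipeline the attention vector would come from the VLM's middle-to-late layers and the scales would feed a text renderer; both are outside the scope of this sketch.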

2606.12897 2026-06-12 cs.CL 新提交

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

SafeLLM: 在安全关键场景中,提取作为重写的抗幻觉替代方案

Julia Ive, Felix Jozsa, Evridiki Georgaki, Nabeel Sheikh, Emma Cattell, Nick Jackson, Paulina Bondaronek, Ciaran Scott Hill, Richard Dobson

发表机构 * Institute of Health Informatics, University College London(伦敦大学学院健康信息学研究所) National Hospital for Neurology and Neurosurgery(国家神经内科与神经外科医院) Somerset NHS Foundation Trust(萨默塞特NHS基金会信托) King's College Hospital(国王学院医院) King's College London(伦敦国王学院)

AI总结 提出将提取作为重写型RAG的抗幻觉替代方案,通过行号选择策略在安全关键文档中实现高召回(95%)和低幻觉,优于直接复制和安全导向方法。

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于访问组织文档,包括标准操作程序(SOP)、人力资源政策和机构指南。然而,依赖自由形式重写的检索增强生成(RAG)系统可能引入幻觉,并在完整性和简洁性之间产生不稳定的权衡,尤其是在安全和合规关键场景中。目标:评估提取作为基于重写的RAG的抗幻觉替代方案,并比较在文档类型和模型规模之间平衡精确度、召回率和安全性的策略。方法:我们比较了多种提示策略,包括基于行号的源选择、提取带有明确安全注释的相关指南句子,以及使用源指南中的支持证据细化草稿答案的多阶段流水线。实验在长度和结构各异的文档上进行,包括当地NHS急症护理和肿瘤学指南以及英国范围内的NICE指南,使用前沿规模和本地可部署模型。使用自动指标和人类专家评估相关性和完整性来评估性能。结果:行号选择取得了最强结果,在大型和小型模型上均优于直接复制和安全导向策略,同时保持高术语召回率(高达95%)并与源文本紧密对齐。安全导向方法提高了精确度,但引入了系统性遗漏,而多阶段过滤进一步放大了这种权衡。性能随文档结构变化:基于行的提取在协议类内容中表现出色,而替代策略在更冗长的文档上表现更好(术语召回率高达97%)。

英文摘要

Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
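
A minimal sketch of line-number-based source selection, assuming the model is shown a numbered copy of the document and replies with line numbers only. The reply format, the function names, and the toy document are assumptions for illustration, not details taken from the paper. Because the answer is assembled by copying source lines verbatim, the output cannot contain text that is absent from the source.

```python
import re
from typing import List


def number_lines(document: str) -> str:
    """Prefix each source line with a 1-based line number for the prompt."""
    lines = document.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def parse_line_numbers(model_reply: str) -> List[int]:
    """Extract integers from a reply such as '2, 4' (reply format is assumed)."""
    return [int(tok) for tok in re.findall(r"\d+", model_reply)]


def extract(document: str, selected: List[int]) -> List[str]:
    """Copy the selected lines verbatim; ignore duplicates and out-of-range numbers."""
    lines = document.splitlines()
    seen = set()
    out = []
    for n in selected:
        if 1 <= n <= len(lines) and n not in seen:
            seen.add(n)
            out.append(lines[n - 1])
    return out


# Toy procedure text (invented for this example).
document = "Wear gloves.\nLabel the sample.\nRecord the time.\nNotify the supervisor."
prompt_body = number_lines(document)
reply = "2, 4, 9"  # stand-in for a model reply; 9 is out of range and is dropped
answer = extract(document, parse_line_numbers(reply))
```

Dropping out-of-range numbers silently is one possible design choice; a stricter variant could reject the whole reply instead.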

2606.12895 2026-06-12 cs.LG 新提交

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

LongSpike:用于高效长序列学习的分数阶脉冲状态空间模型

Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha

发表机构 * Wuhan University(武汉大学) University of Science and Technology of China(中国科学技术大学) Anhui University(安徽大学)

AI总结 提出LongSpike框架,将分数阶状态空间模型(f-SSM)引入脉冲神经网络,通过长记忆核实现高效长序列学习,在多个基准上超越现有SNN。

详情
AI中文摘要

脉冲神经网络(SNN)因其生物合理性和处理序列数据时的能量效率而备受推崇。然而,主流的SNN架构通常依赖一阶常微分方程(ODE)来控制神经元状态转换。这种一阶假设引入了“无记忆”瓶颈,限制了模型捕捉长序列任务中固有的复杂长程依赖关系的能力。在这项工作中,我们提出了LongSpike,一种新颖的SNN框架,它将控制理论中的分数阶状态空间建模(f-SSM)集成到脉冲域中。通过将传统的整数阶SSM扩展到分数阶微积分领域,LongSpike实现了具有长记忆核的神经元动力学的层次化集成。为了缓解分数算子通常带来的计算开销和并行化挑战,我们利用了一种支持高效并行训练的状态空间公式。在具有挑战性的基准测试(包括Long Range Arena(LRA)、大规模WikiText-103和Speech Commands)上的实证评估表明,LongSpike在保持稀疏突触计算的同时,在准确性上优于最先进的SNN。代码可在以下网址获取:https://github.com/xinruihe389-commits/LongSpike。

英文摘要

Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at https://github.com/xinruihe389-commits/LongSpike.
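
To see why a fractional-order operator carries long memory while a first-order one does not, it can help to look at the Grünwald-Letnikov discretisation, a standard textbook scheme in which past states are weighted by generalized binomial coefficients. The sketch below uses that scheme purely as an illustration; the abstract does not state which discretisation LongSpike uses, and the function name is hypothetical.

```python
from typing import List


def gl_weights(order: float, n: int) -> List[float]:
    """First n Grünwald-Letnikov coefficients w_k = (-1)^k * C(order, k).

    Computed with the recurrence w_0 = 1, w_k = w_{k-1} * (1 - (order + 1) / k).
    """
    weights = [1.0]
    for k in range(1, n):
        weights.append(weights[-1] * (1.0 - (order + 1.0) / k))
    return weights


# Order 1: weights vanish after k = 1, so only the previous step matters.
first_order = gl_weights(1.0, 6)
# Order 0.5: weights decay slowly, so every past step keeps contributing.
half_order = gl_weights(0.5, 6)
```

For order 0.5 the weights follow the series of (1 - z)^0.5, whose slowly decaying tail is the long-memory kernel that a first-order update lacks.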

2606.12890 2026-06-12 cs.RO 新提交

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

学会适应:基于表示的多任务技能迁移强化学习

Aryan Naveen, Haitong Ma, Haldun Balim, Na Li

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Harvard School of Engineering and Applied Sciences(哈佛大学工程与应用科学学院)

AI总结 提出RepMT-SAC框架,通过谱MDP分解捕获可迁移动力学,实现任务无关核心与最小任务特定调整的价值函数结构,在四旋翼轨迹跟踪任务上零样本性能提升30%。

Comments 8 pages, 4 figures, 1 table

详情
AI中文摘要

强化学习在学习复杂控制策略方面取得了显著成功,但由于样本效率低和跨任务泛化能力差,其适用性仍然有限。在这项工作中,我们提出了RepMT-SAC,一个多任务强化学习框架,能够实现高效的知识共享和稳健的新任务迁移。RepMT-SAC使用谱MDP分解来捕获可迁移的动力学,将价值函数结构化为一个任务无关的核心和最小的任务特定调整。这种设计允许在分布内任务上具有强大的零样本性能,并在分布外任务上实现快速的少样本适应。我们在四旋翼轨迹跟踪任务上评估了RepMT-SAC在分布内和分布外上下文中的表现,证明其性能优于基线方法高达30%。

英文摘要

Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.

2606.12886 2026-06-12 cs.CV cs.AI 新提交

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

交错思维中的模态隔离桥接:通过逐步强化监督模态转换

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Zhejiang University(浙江大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出MoTiF框架,通过反射式SFT和Flow-GRPO优化模态转换保真度,解决交错思维中图像与文本脱节的模态隔离问题,提升跨模态一致性和任务准确性。

Comments 22 pages, 5 figures, 6 tables

详情
AI中文摘要

交错思维是一种统一的多模态模型交替进行文本推理和视觉生成的方法,在空间和物理任务上显示出潜力。然而,在复杂的长链场景中,我们识别出一个基本故障模式:生成的图像偏离文本上下文,而后续文本忽略视觉证据,导致两种模态交替但并未真正相互通知。我们将其称为模态隔离,并归因于模态边界处的信息损失累积。我们将每个推理循环分解为原子操作,并定义模态转换损失,量化每个边界处的跨模态幻觉(文本到图像)和视觉利用不足(图像到文本)。我们提出MoTiF(模态转换保真度),一个两阶段训练框架,直接优化这些转换:反射式SFT训练模型检测和恢复错误的视觉输出;Flow-GRPO通过强化学习提高图像生成保真度。MoTiF中的所有训练信号来自转换级保真度而非最终任务准确性。在四个视觉谜题基准测试中,这种转换级监督显著提高了跨模态一致性和最终任务准确性。结果表明,有效的交错推理需要在模态边界处进行明确的结构监督,而不仅仅是扩展或最终任务优化。

英文摘要

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

2606.12883 2026-06-12 cs.AI 新提交

The Hidden Power of Scaling Factor in LoRA Optimization

缩放因子在LoRA优化中的隐藏力量

Zicheng Zhang, Haoran Li, Jiaxing Wang, Guoqiang Gong, Anqi Li, Yudong Hu, Ting Xiong, Yurong Gao, Junxing Hu, Zhida Jiang, Yifeng Zhang, Pengzhang Liu, Qixia Jiang

发表机构 * School of Mathematical Sciences, UCAS(中国科学院大学数学科学学院) School of Mathematical Sciences, NKU(南开大学数学科学学院) School of Advanced Interdisciplinary Sciences, UCAS(中国科学院大学前沿交叉科学学院)

AI总结 本文揭示LoRA中缩放因子α与学习率功能不同,α主导优化效果,通过信号-漂移框架发现α能放大任务信号而不增加漂移比,并提出LoRA-α框架以简化超参数搜索并提升性能。

详情
AI中文摘要

在低秩适应(LoRA)中,缩放因子α通常被视为学习率的简单补充,但其在优化中的作用仍未被充分理解。本文揭示缩放因子α和学习率功能不同,α成为有效优化的主导驱动因素,带来无法通过单独缩放学习率复现的收益。通过大量实证分析和理论信号-漂移框架的协同作用,我们发现了关于LoRA缩放机制的三点发现:首先,LoRA的频谱抑制平滑了优化景观,使得标准超参数过于保守,造成优化差距。其次,当利用这种平滑性加速收敛时,α通过放大任务信号而不增加漂移比,优于学习率。第三,最优缩放因子与秩呈次线性关系,由平方根定律很好地刻画,且系数出乎意料地大,揭示了现有秩相关启发式方法的缩放不足。基于这些见解,我们提出LoRA-α,一个极简框架,将α恢复到其原则性状态,使LoRA与标准小学习率兼容。跨多种任务的广泛评估表明,LoRA-α在简化超参数搜索的同时持续提升性能,释放了LoRA的学习潜力。

英文摘要

In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $\alpha$ and the learning rate function differently, with $\alpha$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three findings about LoRA's scaling mechanism: First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $\alpha$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$\alpha$, a minimalist framework that restores $\alpha$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$\alpha$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
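
For orientation, the conventional LoRA update is W + (alpha / r) * B @ A, where r is the rank. The sketch below contrasts a rank-tied choice of alpha with a square-root rule of the kind the abstract describes. It assumes that the square-root law applies to alpha itself, and the coefficient `coeff=8` is a placeholder chosen for the example; the abstract reports neither the exact form nor the coefficient.

```python
import math


def lora_multiplier(alpha: float, rank: int) -> float:
    """Multiplier applied to B @ A in the conventional update W + (alpha / rank) * B @ A."""
    return alpha / rank


def sqrt_law_alpha(rank: int, coeff: float) -> float:
    """Square-root rule alpha = coeff * sqrt(rank); `coeff` is a placeholder value."""
    return coeff * math.sqrt(rank)


rows = []
for rank in (4, 16, 64):
    tied_alpha = 2 * rank                       # a common rank-tied heuristic
    root_alpha = sqrt_law_alpha(rank, coeff=8)  # hypothetical coefficient
    rows.append((rank, tied_alpha, root_alpha, lora_multiplier(root_alpha, rank)))

for row in rows:
    print(row)
```

Under the rank-tied heuristic the multiplier alpha / r stays constant as the rank grows, whereas under a square-root rule it shrinks like 1 / sqrt(r).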

2606.12871 2026-06-12 cs.AI 新提交

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

DailyReport: 一个用于评估搜索代理在日常搜索任务上的开放式基准

Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan(美团)

AI总结 提出DailyReport基准,包含150个开放式日常搜索任务和3546个级联评分标准,通过分解子任务和维度评估,揭示当前搜索代理系统仍未能满足用户期望。

详情
AI中文摘要

搜索代理(SAs)通常利用大型语言模型(LLMs)通过自主探索网络资源并将信息综合成全面响应来支持复杂的信息寻求任务。对于SAs的评估,先前的基准主要关注在真实用户场景中不太可能出现的专门任务。此外,它们依赖于粗略的任务级评分标准,通常限制了评估的可解释性。为弥补这一差距,我们引入了DailyReport,一个用于评估SA在日常搜索任务上能力的开放式基准。它包含150个开放式任务,配有3546个相关评分标准,捕捉了真实用户广泛讨论和及时的信息需求。每个任务被分解为子任务,并通过跨解缠维度的级联评分标准进行评估。通过级联性能归因和以用户为中心的聚合,我们为每个维度推导出高度可解释的分数,以及一个用户偏好分数。我们在17个代理系统上的结果表明,当前系统仍未能达到用户的期望。为促进未来研究,我们的数据集和代码已在https://github.com/AGI-Eval-Official/DailyReport公开。

英文摘要

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.

2606.12869 2026-06-12 cs.CV 新提交

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

通过密度均衡映射学习具有共享显著性的任务感知采样

Tsz Lok Ip, Han Zhang, Lok Ming Lui

发表机构 * Department of Mathematics, The Chinese University of Hong Kong(香港中文大学数学系) Department of Mathematics, City University of Hong Kong(香港城市大学数学系)

AI总结 提出DECNN框架,利用密度均衡映射根据数据空间重要性动态重分配卷积计算资源,实现任务自适应采样,提升模型效率与可解释性。

Comments 16 pages, 10 figures

详情
AI中文摘要

在基于图像和表面的学习任务中,卷积特征通常使用在整个域上均匀采样的感受野来提取。然而,信息丰富的结构在实践中很少均匀分布,通常集中在局部区域。这种现象在医学影像中尤为常见,其中病理变化在空间上受限。因此,均匀卷积将相同的计算量分配给信息丰富和信息不丰富的区域,导致特征提取效率低下和模型容量利用不充分。为了解决这个问题,我们提出了一个任务自适应采样框架,根据数据的空间重要性动态重分配计算注意力。具体来说,我们引入了密度均衡卷积神经网络(DECNN),它通过密度均衡映射,利用学习到的密度函数来引导卷积。密度函数编码了不同区域的相对重要性,并诱导一种变换,放大信息丰富的区域,同时压缩不太相关的区域。结果,卷积感受野在域上非均匀地重新分布,使得在任务相关区域能够进行更密集的采样。通过将这种重要性驱动的变换与卷积相结合,DECNN执行自适应特征提取,将计算资源集中在信息丰富的结构上。这导致更有效地利用模型容量,产生一个轻量级但表达力强的架构,同时生成可解释的显著性图。在图像分类和颅面表面分析上的实验表明,DECNN以更少的参数实现了竞争性或更优的性能,准确识别任务相关区域,并在复杂的几何变化下保持鲁棒性。

英文摘要

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.
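
A one-dimensional analogy may make the idea of density-driven sampling concrete. In 1D, a map that equalizes a density is its cumulative distribution, so placing samples at equal steps of cumulative mass puts more of them where the density is high. The sketch below does this for a piecewise-constant density on [0, 1]; it is an illustration only, with hypothetical names, and is far simpler than the 2D and surface mappings the paper works with.

```python
from typing import List


def equalizing_samples(density: List[float], num_samples: int) -> List[float]:
    """Place samples on [0, 1] so each carries equal mass of a piecewise-constant density."""
    total = sum(density)
    width = 1.0 / len(density)
    samples = []
    for i in range(num_samples):
        target = (i + 0.5) / num_samples * total   # mass to the left of this sample
        acc = 0.0
        for b, d in enumerate(density):
            if acc + d >= target:
                frac = (target - acc) / d           # relative position inside bin b
                samples.append((b + frac) * width)
                break
            acc += d
    return samples


# Toy density: the second half of the domain is three times as important.
density = [1.0, 1.0, 3.0, 3.0]
points = equalizing_samples(density, 8)
in_dense_half = sum(1 for x in points if x >= 0.5)
```

With this toy density, three quarters of the mass lies in the second half of the domain, and the same share of the sample points lands there.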

2606.12859 2026-06-12 cs.RO 新提交

AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

AIR-VLA+: 通过级联双动作解码器与非对称MoE解耦空中机器人的移动与操作

Jianli Sun, Bin Tian, Qiyao Zhang, Zijian Liu, Yutong Wang, Zhiyong Cui, Bai Li, Yisheng Lv, Yonglin Tian

发表机构 * The Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Automation, Beijing Institute of Technology(北京理工大学自动化学院) College of Automotive and Energy Engineering, Tongji University(同济大学汽车与能源工程学院) School of Transportation Science and Engineering, Beihang University(北京航空航天大学交通科学与工程学院) Information Science, East China Normal University(华东师范大学信息科学)

AI总结 针对空中机器人移动与操作在动作尺度、动力学和控制目标上的显著差异,提出级联双动作解码器与非对称MoE架构,实现解耦协调控制,在AIR-VLA基准上取得48.0平均分,任务完成度提升80.2%。

详情
AI中文摘要

空中操作系统长期以来在端到端控制中遭受表示耦合问题,因为平台级无人机(UAV)移动与末端执行器级机械臂操作在动作尺度、动力学和控制目标上存在显著差异。本文提出AIR-VLA+,一种专为空中操作设计的流匹配动作生成架构,具有级联双动作解码器和非对称特征级混合专家(MoE)。我们构建了级联的操作和移动解码器,使无人机在移动过程中单向观察机械臂的意图以实现工作流协调,同时隔离无人机移动信息反向传播对机械臂操作稳定性的影响。针对空中操作中无人机移动高度依赖高层语义并负责任务状态转换的特点,我们为无人机移动解码器设计了输入特征增强模块,该模块引入隐式视觉抓取投影器以感知夹爪与物体的交互状态,并注入压缩的全局语义特征。在无人机移动解码器内部,我们部署了隐式MoE架构,使不同的移动专家在训练过程中自发地对不同任务阶段表现出能力倾向。通过在特征流形上进行密集软混合计算,无人机移动获得了更强的任务阶段适应性。在标准化AIR-VLA基准上的实验表明,我们的方法以48.0的总体平均分全面超越所有基线。与单头$\pi_{0.5}$策略相比,整体任务完成分数提高了80.2%,有效缓解了复合机器人的异构协调控制冲突。

英文摘要

Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Because UAV movement in aerial manipulation depends heavily on high-level semantics and is responsible for task state transitions, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2% compared to the single-head $\pi_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.

2606.12854 2026-06-12 cs.CL q-bio.QM 新提交

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

小型LLM用于生物医学声明验证:成本效益微调、结构性数据集捷径与跨域泛化

Gaurav Kumar

发表机构 * Moveworks AI University of California San Diego(加州大学圣迭戈分校)

AI总结 通过QLoRA微调小型LLM(Phi-3-mini、Qwen2.5-3B、Mistral-7B),在生物医学声明验证中超越GPT-4o和GPT-5(F1提升12%),并发现SciFact数据集的结构性伪影,提出基于结构稳健数据的跨域迁移方法。

Comments 8 pages, 2 figures, 12 tables. To appear at BioNLP Workshop, ACL 2026

详情
AI中文摘要

大型语言模型如GPT-4o和GPT-5在生物医学声明验证上表现出强大的零样本性能,但成本和透明度限制了其可扩展使用。我们通过QLoRA在SciFact和HealthVer上微调了三个小型LLM:Phi-3-mini(3.8B)、Qwen2.5-3B和Mistral-7B,首次研究了QLoRA模型与GPT-4o及微调BioLinkBERT编码器的对比。Mistral-7B QLoRA在仅使用1,008个训练样本的情况下,以极低的成本超越了GPT-4o和GPT-5(F1提升高达12%)。我们进行了广泛的域内和跨域评估:在SciFact上训练的模型在HealthVer上测试,反之亦然,并匹配模型大小以隔离数据集结构与数据量的影响。我们识别了SciFact中一个先前未报告的结构性伪影,该伪影夸大了域内得分,并通过双向域外评估表明,在结构稳健的数据上训练能够实现鲁棒的跨域迁移。我们计划发布所有代码和适配器检查点。

英文摘要

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fractional cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

2606.12852 2026-06-12 cs.AI 新提交

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

WISE:具有Why-Which推理的Minecraft长时域智能体

Renmin Cheng, Changhao Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出WISE框架,通过因果事件图增强情景记忆并解耦what-where-when与which-why推理,结合机会主义任务调度和多尺度探索,显著提升长时域稀疏任务的成功率和效率。

详情
AI中文摘要

通过采用LLM增强的分层方法,在Minecraft等环境中开发通用具身智能体取得了快速进展。尽管前景广阔,但低级控制器由于重复执行失败常常成为性能瓶颈。我们认为,一个关键限制不仅是缺乏情景记忆,而且是将\textit{what-where-when}记忆与\textit{which-why}推理解耦。为了解决这个问题,我们提出\textbf{WISE}(Which-Why Informed Semantic Explorer),一个长时域智能体框架,其增强的低级控制器配备因果事件图,通过将观察与任务相关性关联的显式因果结构来增强情景记忆。与先前依赖特征相似性进行检索的工作(如MrSteve)不同,WISE能够在视角变化下实现稳健回忆,并通过因果推理支持机会主义任务重排序。基于这种记忆,我们提出一个机会主义任务调度器,当检测到因果相关机会时动态重新优先化子任务。我们进一步为WISE配备多尺度渐进探索策略,为下游推理提供空间上全面的观察。实验表明,WISE在长时域稀疏任务上大幅提高了任务成功率和效率,特别是在需要自适应决策的场景中。

英文摘要

Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

2606.12848 2026-06-12 cs.AI econ.GN q-fin.EC 新提交

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

(人类的)注意力(仍然)就是一切:人类监督使AI辅助的社会科学变得可靠

Chen Zhu, Xiaolu Wang, Weilong Zhang

发表机构 * China Agricultural University(中国农业大学) University of Cambridge(剑桥大学)

AI总结 提出人机协同决策架构HLER,通过预承诺、决策排序、问责和注意力分配,将AI辅助研究的失败率从72%降至16%。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于曾经只有训练有素的研究人员才能完成的任务,包括假设生成、规范选择和结论起草。我们认为,AI辅助研究的可靠性不仅取决于模型能力,还取决于认知劳动在人与机器之间的分配方式。我们通过人机协同经济研究(HLER)来研究这个问题,这是一种基于预承诺、决策排序、问责和注意力分配的决策架构。在一个预先指定的2*4因子实验中,涉及四个数据集的280个完整研究运行,无约束的多智能体基线在72%的运行中产生了关键失败。使用相同的底层模型、相同的智能体分解以及共享推理智能体的相同提示,HLER通过施加三个架构承诺将失败率降低到16%:LLMs进行推理但不执行数据工作,数据和估计以确定性方式处理,以及三个人类决策门约束工作流程。Fisher精确检验在p<0.001水平上拒绝失败率相等的假设。可靠性增益在公开代表性最低的数据集(一份清代人口登记册)上最大,这与基于任务的产出质量服从弗雷歇分布的生产模型一致。一项80次运行的消融研究表明,确定性计算和人类决策门独立贡献,并存在互补性的探索性证据。我们将HLER解释为一种研究框架而非自主的AI科学家:它大幅减少失败,使残留的弱点更加可见,并防止不可靠的主张作为可发表的成果被提出。

英文摘要

Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.
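
For readers unfamiliar with the test named in this abstract, the sketch below computes a two-sided Fisher exact p-value for a 2x2 table of failures and successes in two arms. The counts are toy values chosen for illustration; the abstract does not report per-arm counts, so nothing here reproduces the paper's analysis.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# Toy tables (hypothetical counts): rows are arms, columns are (failed, succeeded).
p_small = fisher_exact_two_sided(3, 1, 1, 3)
p_wide_gap = fisher_exact_two_sided(18, 7, 4, 21)
```

The first toy table has too few observations to separate the arms, while the second, with a wide gap between failure counts, yields a much smaller p-value.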

2606.12847 2026-06-12 cs.CV 新提交

Language-Guided Abstraction for Visual Reasoning

语言引导的视觉推理抽象

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang

发表机构 * School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院) Traditional Chinese Medicine Hospital of Zengcheng District(广州市增城区中医医院)

AI总结 提出L-VARC框架,通过语言引导的特权信息学习分支增强视觉推理,设计语义压缩模块和交叉注意力投影器,在ARC任务上以18M参数超越现有方法。

详情
AI中文摘要

抽象与推理语料库(ARC)被视为通往通用人工智能(AGI)的关键途径,因为它使模型能够从少量示例中学习抽象转换规则,然后泛化到新任务。然而,主流的ARC方法要么是纯语言,要么是纯视觉(即VARC)。前者严重依赖大语言模型,消耗数十亿参数;后者通常难以捕捉高层语义,导致在像素级模式上过拟合。为弥合这一差距,我们提出L-VARC,一种通过语言引导的特权信息学习(LUPI)分支增强视觉推理的新框架。具体来说,我们通过将统一的任务无关提示输入DeepSeek-V3来设计语义压缩模块。这样,原始的LARC(一个众包语言描述数据集)可以被大幅精炼和结构化,以适应标准文本编码器(如CLIP)的上下文长度约束。此外,我们设计了交叉注意力投影器来对齐视觉特征与语义嵌入,旨在指导ARC模型的训练。值得注意的是,LUPI分支在训练过程中使用,推理时被丢弃,从而产生一个仅1800万参数的轻量级模型。大量实验表明,我们的L-VARC有效利用语言先验提升视觉推理,并超越现有最优方法。消融研究进一步证实了这两个新设计对L-VARC框架的贡献。代码见https://github.com/GZHU-DVL/L-VARC。

英文摘要

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either language-only or vision-only (i.e., VARC). The former depend heavily on LLMs, consuming billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured to fit the context-length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.

2606.12843 2026-06-12 cs.LG cs.CE 新提交

Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

可解释因子分解用于大规模金融市场决策智能:来自中国A股市场的证据

Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出可解释机器学习流程,将截面股票收益预测分解为可审计因子贡献,使用XGBoost和TreeSHAP在中国A股市场验证,发现行为信号贡献58.2%预测归因。

详情
AI中文摘要

我们提出一个可解释的机器学习流程,将截面股票收益预测分解为可审计的因子贡献。我们应用带有TreeSHAP归因的XGBoost模型,对2009年至2019年的3632只中国A股进行压力测试。使用60个月滚动窗口,在55个月的样本外数据上,XGBoost获得平均AUC为0.547,且前五分之一与后五分之一的多空价差为+2.38%/月(Newey-West t = 5.94;年化夏普比率2.23)。在调整Carhart四因子模型后,该alpha持续存在(+2.31%/月;t = 7.48)。SHAP分解表明,在55个行业组中,行为信号(换手率和动量)平均占预测归因的58.2%,而估值比率仅占10.7%。消融分析用于交叉验证这一排名,并提供证据表明SHAP和消融以突出特征可替代性结构的方式产生分歧,而这种结构在单独使用任一方法时几乎不可见。

英文摘要

We present an interpretable machine learning pipeline that decomposes cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3,632 Chinese A-share stocks from 2009 to 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month between the top and bottom quintiles (Newey-West t = 5.94; annualized Sharpe 2.23). This alpha persists after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that behavioral signals (turnover and momentum) account for 58.2% of predictive attribution, compared with 10.7% for valuation ratios, on average across 55 industry groups. Ablation analysis cross-validates this ranking and provides evidence that SHAP and ablation diverge in a way that highlights a feature-substitutability structure largely invisible to either method used in isolation.
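
The headline metrics in this abstract, a top-minus-bottom quintile spread and an annualized Sharpe ratio, can be sketched as follows. The numbers are made up, the function names are hypothetical, and the sketch assumes equal-weighted quintiles and a Sharpe ratio computed directly on monthly spreads without a risk-free adjustment; the abstract does not state the paper's exact conventions.

```python
from statistics import mean, stdev
from typing import List, Tuple


def long_short_spread(scores: List[float], returns: List[float]) -> float:
    """Mean return of the top score quintile minus that of the bottom quintile."""
    ranked: List[Tuple[float, float]] = sorted(zip(scores, returns))
    q = len(ranked) // 5
    bottom = [r for _, r in ranked[:q]]
    top = [r for _, r in ranked[-q:]]
    return mean(top) - mean(bottom)


def annualized_sharpe(monthly_spreads: List[float]) -> float:
    """Annualize a monthly Sharpe ratio with the usual sqrt(12) factor."""
    return mean(monthly_spreads) / stdev(monthly_spreads) * 12 ** 0.5


# Toy cross-section of 10 stocks for one month (made-up numbers).
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.5, 0.7, 0.3, 0.6, 0.95]
returns = [-0.02, 0.03, 0.00, 0.02, -0.01, 0.01, 0.01, 0.00, 0.01, 0.04]
spread = long_short_spread(scores, returns)
```

In a rolling-window evaluation, `long_short_spread` would be applied once per out-of-sample month and the resulting series passed to `annualized_sharpe`.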

2606.12841 2026-06-12 cs.LG cs.AI New submission

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; National University of Singapore; University of Science and Technology of China

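AI summary: Proposes TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for masked diffusion language models. It locates key coordinates via temporal causal tracing and applies low-rank residual edits, removing facts efficiently while preserving model performance.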

Abstract

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward pass, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.
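
The abstract does not give the closed form of the edit memory. As a rough sketch only, assuming it resembles the standard ridge-regression solution, one could stack the subject keys of the $n$ forget facts as columns of $K \in \mathbb{R}^{d_k \times n}$ and the target deltas as columns of $D \in \mathbb{R}^{d_v \times n}$. The symbols $K$, $D$, and $\Delta W$, and the reading of $\lambda$ as the ridge strength, are assumptions for illustration rather than the authors' notation.

```latex
\begin{aligned}
\Delta W^{\star}
  &= \arg\min_{\Delta W}\; \lVert \Delta W K - D \rVert_F^2
     + \lambda \lVert \Delta W \rVert_F^2 \\
  &= D K^{\top} \left( K K^{\top} + \lambda I_{d_k} \right)^{-1} \\
  &= D \left( K^{\top} K + \lambda I_{n} \right)^{-1} K^{\top}.
\end{aligned}
```

The last form inverts only an $n \times n$ matrix and shows that $\operatorname{rank}(\Delta W^{\star}) \le n$, which is one way an update aggregated over a small set of facts can be both closed-form and low-rank.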