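arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.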

2605.27081 2026-05-27 cs.LG cs.AI cs.DC

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

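Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao

AI summary: Proposes ReMoE, a router fine-tuning framework that biases routing toward recently selected experts to produce temporally stable routing and reduce expert fetches from external storage. It improves expert reuse by 26% while maintaining downstream task performance, and in real systems delivers an 8.4% throughput gain and a 1.77-1.99× decode speedup.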

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99× decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
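The caching argument and the TPOT-to-speedup figures in this abstract can be made concrete with a small sketch. The LRU eviction policy, the trace format, and the two toy traces are assumptions made for illustration and are not ReMoE's actual cache or data; the speedup helper only applies the identity speedup = 1 / (1 - fractional TPOT reduction).

```python
from collections import OrderedDict


def count_expert_fetches(routing_trace, cache_size):
    """Count misses (fetches from slow storage) for an LRU expert cache.

    routing_trace: list of per-token lists of expert ids (hypothetical format).
    cache_size: number of experts that fit in fast memory.
    """
    cache = OrderedDict()  # ordered from least to most recently used
    fetches = 0
    for experts in routing_trace:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)      # hit: mark as most recently used
            else:
                fetches += 1                   # miss: fetch from storage
                cache[expert] = None
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used
    return fetches


def speedup_from_tpot_reduction(reduction):
    """Decode speedup implied by a fractional reduction in time per output token."""
    return 1.0 / (1.0 - reduction)


# Toy top-2 routing traces over 6 experts, invented for illustration.
scattered = [[0, 1], [2, 3], [4, 5], [0, 1], [2, 3], [4, 5]]
stable = [[0, 1], [0, 1], [0, 2], [0, 2], [2, 3], [2, 3]]

scattered_fetches = count_expert_fetches(scattered, cache_size=4)
stable_fetches = count_expert_fetches(stable, cache_size=4)
```

With a four-expert cache, the temporally stable trace needs far fewer fetches than the scattered one, which is the locality effect the abstract describes. The reported 43.6% and 49.8% TPOT reductions map to roughly 1.77× and 1.99× under the identity above.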

2605.27080 2026-05-27 cs.CV

Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning

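Qida Tan, Hongyu Yang, Wenchao Du

AI summary: Proposes DSCL, a semi-supervised learning framework that uses Jacobian regularization to disentangle features into pitch and yaw subspaces and applies ordinal contrastive learning within each subspace, reaching competitive performance with only 5%-20% of the labeled data.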

Comments ICML2026

Abstract

Appearance-based gaze estimation often suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate domain shift. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotations. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20%, 10%, and even 5% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at https://github.com/da60266/DSCL.

2605.27079 2026-05-27 cs.LG cs.AI cs.RO

Trust Region Q Adjoint Matching

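Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

AI summary: To address the instability of off-policy reinforcement learning for pretrained flow policies, proposes Trust Region Q-Adjoint Matching, which adaptively controls the path-space KL divergence through projected dual descent for stable fine-tuning, reaching a 68% offline RL success rate across 50 OGBench tasks.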

Abstract

Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence with respect to pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter $λ$ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of $λ$. As a result, our method can precisely control the deviation from pretrained flow policies, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline's 46%.

2605.27075 2026-05-27 cs.CV

SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

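Yuhang Zhang, Junxiang Qiu, Huixia Ben, Zhenhua Tang, Shuo Wang, Yanbin Hao

AI summary: Proposes SoftCap, a training-free soft-budget control layer that uses a Trajectory Drift Observer and a Soft-Budget PI Controller to dynamically adjust the Full-step triggering threshold, improving image quality while keeping a soft ceiling on the compute budget.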

Abstract

Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose SoftCap, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.
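The interplay between a drift estimate and a budget-aware threshold can be sketched with a generic PI controller. Everything named here is hypothetical: the class, the gains, the sign convention (running ahead of the reference profile raises the threshold), and the scalar drift values are assumptions for illustration, not SoftCap's published design.

```python
class SoftBudgetPI:
    """Generic PI controller over the gap between realized and reference compute."""

    def __init__(self, base_threshold, kp, ki):
        self.base_threshold = base_threshold
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def threshold(self, realized_full_steps, reference_full_steps):
        # Positive error means more Full steps were spent than the reference allows so far.
        error = realized_full_steps - reference_full_steps
        self.integral += error
        raw = self.base_threshold + self.kp * error + self.ki * self.integral
        return max(0.0, raw)


def run_schedule(drifts, reference_profile, controller):
    """Decide per step whether to run a Full step (True) or reuse the cache (False)."""
    full_steps = 0
    decisions = []
    for drift, reference in zip(drifts, reference_profile):
        is_full = drift > controller.threshold(full_steps, reference)
        full_steps += int(is_full)
        decisions.append(is_full)
    return decisions


# Hypothetical cumulative reference: roughly one Full step every two steps.
reference_profile = [0.0, 0.5, 1.0, 1.5]
moderate = run_schedule([0.45] * 4, reference_profile, SoftBudgetPI(0.4, 0.2, 0.05))
severe = run_schedule([5.0] * 4, reference_profile, SoftBudgetPI(0.4, 0.2, 0.05))
```

In this toy run, moderate drift alternates between Full and cached steps as the threshold rises and falls. Severe drift still triggers a Full step every time, because the budget only shapes the threshold and never forbids a step, which is the sense in which the ceiling is soft.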

2605.27074 2026-05-27 cs.CV

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

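Jinzhao Li, Yinuo Chen, Wenxuan Song, Yijia Lei, Yichi Zhang, Honglei Yan, Panwang Pan, Miao Liu

AI summary: Proposes the IPIBench benchmark for evaluating the interactive proactive intelligence of MLLMs in streaming video scenarios, and designs the IPI-Agent framework to improve proactive triggering and interaction coordination.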

Abstract

Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.

2605.27073 2026-05-27 cs.LG

Learning to Orchestrate Agents under Uncertainty

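Mary Chriselda Antony Oliver, Lan Jiang, Aaron Bundi Anampiu, Elaf Almahmoud, Francesco Quinzan, Umang Bhatt

AI summary: Proposes BOT-Orch, a framework that recasts orchestration as a regularized multi-armed bandit problem for adaptive orchestration of heterogeneous agents under uncertainty, with a theoretical regret bound of O(√T) and better performance than baselines.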

Abstract

Adaptive orchestration of heterogeneous agents requires making sequential delegation decisions under uncertain and evolving agent behaviour, e.g., coordinating specialised AI models with varying reliability, cost, and response quality. While prior work on agent orchestration focuses on performance or cost, uncertainty in agent reliability and output distributions is typically not modelled explicitly at the orchestration level. In this work, we study the problem of adaptive orchestration of heterogeneous agents under uncertainty, where a meta-controller must decide when to delegate to an agent, accounting for reliability, cost, and uncertainty. We propose BOT-Orch, a lightweight framework that recasts orchestration as a bandit problem over agents, regularized by OT distances between agent output distributions and task-specific reference distributions. We show that the regularised orchestration enjoys $\mathcal{O}(\sqrt{T})$ regret under standard assumptions, and provably induces preference ordering among agents with identical mean rewards but differing distributional alignment. Empirically, we demonstrate that BOT-Orch outperforms standard bandit and heuristic baselines in synthetic but adversarial task allocation settings with heterogeneous, non-i.i.d. agent behaviour.
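The idea of ranking agents with equal mean reward by distributional alignment can be illustrated with a UCB-style index that subtracts an optimal-transport penalty. This is a minimal sketch under stated assumptions: the UCB1 bonus, the weight `lam`, the scalar outputs, and the equal-size samples are illustrative choices, not BOT-Orch's actual formulation.

```python
import math


def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical samples on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    # For equal-size 1D samples, the optimal coupling matches sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def regularized_index(mean_reward, pulls, step, ot_distance, lam=1.0):
    """UCB1-style index minus an OT penalty; lam is a hypothetical weight."""
    bonus = math.sqrt(2.0 * math.log(step) / pulls)
    return mean_reward + bonus - lam * ot_distance


# Hypothetical task reference and two agents whose outputs share the same mean.
reference = [0.4, 0.5, 0.6]
aligned_agent = [0.6, 0.4, 0.5]
spread_agent = [0.0, 0.5, 1.0]

w_aligned = wasserstein_1d(aligned_agent, reference)
w_spread = wasserstein_1d(spread_agent, reference)
index_aligned = regularized_index(0.5, pulls=10, step=100, ot_distance=w_aligned)
index_spread = regularized_index(0.5, pulls=10, step=100, ot_distance=w_spread)
```

Both agents have the same mean reward and pull count, so a plain UCB index would tie them. The OT penalty breaks the tie in favor of the agent whose outputs match the reference distribution.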

2605.27072 2026-05-27 cs.CL cs.AI

E3: Issue-Level Backtesting for Automated Research Critique

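Yashwardhan Chaudhuri, Sanyam Jain, Paridhi Mundra

AI summary: Proposes E3, an automated review assistant, and evaluates its ability to identify technical issues in research papers with an issue-level backtesting protocol, achieving the highest recall compared with human reviews and LLM baselines.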

Abstract

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.
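The two recall figures can be read as different ways of counting the Caught, Partial, and Missed labels. The sketch below assumes strict recall counts only Caught and partial-inclusive recall counts Caught plus Partial, both over all judged issues for a source. That reading, and the toy labels, are assumptions; the paper's exact definitions may differ.

```python
from collections import Counter


def recall_metrics(labels):
    """Return (strict, partial_inclusive) recall for one source's issue labels."""
    counts = Counter(labels)
    total = len(labels)
    strict = counts["Caught"] / total
    partial_inclusive = (counts["Caught"] + counts["Partial"]) / total
    return strict, partial_inclusive


# Toy labels for a single source, invented for illustration.
toy_labels = ["Caught", "Caught", "Partial", "Missed"]
strict_recall, partial_inclusive_recall = recall_metrics(toy_labels)
```

Under this reading, partial-inclusive recall can never fall below strict recall, which is consistent with the 90.2 percent and 65.8 percent figures reported for E3.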

2605.27071 2026-05-27 cs.AI

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

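Changqing Su, Yu Ding, Zuhong Lin, Hongyu Liu, Xi He, Zheng Zeng, Liqing Li

AI summary: To address scattered knowledge on VOCs governance in the steel industry and the tendency of general-purpose LLMs to hallucinate, proposes Chat-ISV, a knowledge-graph-enhanced multi-agent question-answering system that provides highly reliable decision support through topology optimization, multi-agent routing, and source-backtracking retrieval.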

Abstract

Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control-technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low-frequency industrial questions. Here we developed Chat-ISV, a knowledge graph (KG) enhanced multi-agent Q&A system that parses a curated steel-industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt-constrained extraction, chunk-centered topology optimization, multi-agent routing, source-backtracking retrieval, local literature retrieval, open-domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat-ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1-score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental-engineering literature into traceable, queryable, and decision-support-oriented knowledge, Chat-ISV establishes a scalable environmental-informatics paradigm for reliable LLM deployment and intelligent pollution-control decision support in specialized industrial domains.
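The reported F1-score follows from the reported precision and recall through the standard harmonic-mean formula, which a short check confirms. Only the formula and the two figures from the abstract are used; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
chat_isv_f1 = f1_score(0.9693, 0.7263)
```

Rounded to three decimals this gives 0.830, matching the abstract. The gap between precision and recall indicates the system rarely states something wrong but does miss some reference facts.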

2605.27068 2026-05-27 cs.CL cs.AI cs.MA

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

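Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu

AI summary: Proposes the QUACK framework, which evaluates agents at three levels (game outcomes, behavioral trajectories, and utterance consistency) to automatically audit whether the language of multimodal social deduction agents is consistent with what they perceived and did. Even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence.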

Abstract

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

2605.27067 2026-05-27 cs.CV

BEAT: Rhythm-Elastic Alignment for Agentic Music-guided Movie Trailer Generation

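Yutong Wang, Yunke Wang, Xinyuan Chen, Chang Xu

AI summary: Proposes the BEAT framework, which uses the music-visual alignment encoder MuVA and the energy-adaptive dynamic programming algorithm Bar-DP to achieve elastic many-to-one rhythm alignment for end-to-end movie trailer generation.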

Abstract

Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.

2605.27066 2026-05-27 cs.CL cs.IR

Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search

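Mingyue Wang, Xingyu Xie, Hang Yang, Li Gao, Lixin Su, Ge Chen, Dawei Yin, Daiting Shi

AI summary: Proposes the QDET system, which performs query-driven event timeline summarization through multi-task fine-tuning and reinforcement learning, significantly improving user engagement on Baidu Search.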

Comments Accepted at KDD 2026

DOI: 10.1145/3770855.3818439

Abstract

Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks (temporal ordering, causal judgment, and timeline completion) that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement-learning-based concise event summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B-parameter model achieves a 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only about 1% of its parameters, demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational cost. Online A/B tests on Baidu Search validate real-world effectiveness, showing a 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.

2605.27063 2026-05-27 cs.LG

Learning Dynamic Graph Representations through Timespan View Contrasts

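Yiming Xu, Zhen Peng, Bin Shi, Xu Hua, Bo Dong

AI summary: Proposes CLDG and CLDG++, dynamic graph representation frameworks based on temporal translation invariance, which use contrastive learning across timespans and multi-scale contrastive learning to improve node classification and dynamic graph anomaly detection.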

Comments Accepted by Neural Networks

Abstract

The rich information underlying graphs has inspired further investigation of unsupervised graph representation. Existing studies mainly depend on node features and topological properties within static graphs to create self-supervised signals, neglecting the temporal components carried by real-world graph data, such as timestamps of edges. To overcome this limitation, this paper explores how to model temporal evolution on dynamic graphs elegantly. Specifically, we introduce a new inductive bias, namely temporal translation invariance, which captures the tendency of the same node to keep similar labels across different timespans. Based on this assumption, we develop a dynamic graph representation framework, CLDG, that encourages each node to maintain locally consistent temporal translation invariance through contrastive learning on different timespans. Beyond the standard CLDG, which considers only explicit topological links, our further proposed CLDG++ additionally employs graph diffusion to uncover global contextual correlations between nodes, and designs a multi-scale contrastive learning objective composed of local-local, local-global, and global-global contrasts to enhance representation capabilities. Interestingly, by measuring the consistency between different timespans to shape anomaly indicators, CLDG and CLDG++ integrate seamlessly with the task of spotting anomalies on dynamic graphs, which has broad applications in many high-impact domains, such as finance, cybersecurity, and healthcare. Experiments demonstrate that CLDG and CLDG++ both exhibit desirable performance in downstream tasks including node classification and dynamic graph anomaly detection. Moreover, CLDG significantly reduces time and space complexity by implicitly exploiting temporal cues instead of complicated sequence models.
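Contrasting the same node across timespans can be sketched with a generic InfoNCE loss, where a node's embeddings from two timespans form the positive pair and other nodes act as negatives. This is a standard contrastive objective written in plain Python for illustration; the temperature, the toy embeddings, and the choice of InfoNCE itself are assumptions rather than CLDG's exact loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def timespan_infonce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed node i in two timespans."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the positive pair
    return total / n


# Toy two-node embeddings, invented for illustration.
timespan_1 = [[1.0, 0.0], [0.0, 1.0]]
consistent_timespan_2 = [[0.9, 0.1], [0.1, 0.9]]
inconsistent_timespan_2 = [[0.1, 0.9], [0.9, 0.1]]

consistent_loss = timespan_infonce(timespan_1, consistent_timespan_2)
inconsistent_loss = timespan_infonce(timespan_1, inconsistent_timespan_2)
```

The loss is lower when each node's embedding stays close to itself across timespans. The same consistency measure hints at how a per-node disagreement score between timespans could serve as an anomaly indicator, as the abstract suggests.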

2605.27062 2026-05-27 cs.CL cs.LG

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

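Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad

AI summary: To address the shortage of European Portuguese speech resources, builds the FalAR corpus, containing 5,800 hours of parliamentary session speech with speaker annotations. Experiments show that using it as pre-training data yields up to 14% relative WER improvement for ASR.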

Comments Published in LREC2026

Abstract

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.

2605.27050 2026-05-27 cs.CL cs.LG

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

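Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar

AI summary: Presents the BhashaSetu dataset, a large-scale, multi-domain, morphology-aware English-Marathi parallel corpus, and shows that corpus-level deduplication has a key impact on low-resource neural machine translation quality.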

Abstract

We present BhashaSetu, a linguistically enriched English-Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
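Corpus-level deduplication across sources can be sketched as keying each sentence pair on a normalized form and keeping the first occurrence. The normalization steps (Unicode NFC, case folding, whitespace collapsing) and the toy pairs are assumptions for illustration; the dataset's actual preprocessing pipeline is not described in the abstract.

```python
import unicodedata


def normalize(text):
    """Canonicalize Unicode, fold case, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.casefold().split())


def deduplicate_pairs(pairs):
    """Keep the first occurrence of each (source, target) pair after normalization."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = (normalize(source), normalize(target))
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique


# Toy pairs pooled from several hypothetical sources.
pooled = [
    ("Hello world", "नमस्कार जग"),
    ("hello   world", "नमस्कार  जग"),
    ("Good morning", "सुप्रभात"),
]
deduplicated = deduplicate_pairs(pooled)
```

Keying on the pair rather than on the source sentence alone keeps legitimate alternative translations of the same sentence. Exact matching after normalization is the cheapest variant; near-duplicate detection would need a fuzzier key.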

2605.27046 2026-05-27 cs.RO

Learning to Balance Motor Thermal Safety and Quadrupedal Locomotion Performance with Residual Policy

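Yuhang Wan, Weixian Lin, Letian Qian, Yiqi Zou, Weiwei Wu, Shengwei Wu, Chuanlin Zhao, Xin Luo

AI summary: Proposes a two-stage training framework that combines a whole-body thermal model with a residual policy to prevent motor overheating while preserving locomotion performance, enabling long-duration locomotion under payload.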

Abstract

Motor thermal management is often overlooked in the context of electrically-actuated robots, particularly legged robots, but motor overheating is a key factor that limits long-duration locomotion especially under payload conditions. This paper integrates a whole-body thermal model of a quadruped robot into the reinforcement learning pipeline to update motor temperatures, and proposes a two-stage training framework for motor thermal management. In this framework, a nominal policy is first pre-trained as a locomotion baseline capable of traversing diverse terrains. A residual policy is then trained on top of the nominal policy to provide corrective actions based on the robot's thermal state, ensuring high performance under low-temperature conditions and preventing motor overheating under high-temperature conditions. Simulation results demonstrate that the proposed policy achieves an effective balance between motor thermal safety and locomotion performance. Real-world experiments on a Unitree A1 quadruped robot further validate the approach: under a 3 kg payload, the robot achieves stable locomotion across multiple terrains for over 13 minutes, while the nominal policy alone leads to motor overheating in about 5 minutes.
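The structure of a residual policy layered on a nominal policy, driven by a simulated motor temperature, can be sketched as follows. Every constant and function here is a hypothetical stand-in: the lumped first-order thermal model, the hand-written residual rule, and the temperature limits are assumptions for illustration, whereas the paper uses a whole-body thermal model and learned policies.

```python
def thermal_step(temp, torque, dt, ambient=25.0, heat_gain=0.05, cooling_rate=0.01):
    """One Euler step of a lumped model: heating ~ torque^2, cooling ~ (temp - ambient)."""
    d_temp = heat_gain * torque ** 2 - cooling_rate * (temp - ambient)
    return temp + dt * d_temp


def nominal_policy(observation):
    """Stand-in for the pre-trained locomotion policy: fixed torque per joint."""
    return [10.0, 10.0]


def residual_policy(observation, motor_temps, soft_limit=60.0, hard_limit=80.0):
    """Stand-in for the learned residual: cut torque as the hottest motor heats up."""
    hottest = max(motor_temps)
    gate = min(1.0, max(0.0, (hottest - soft_limit) / (hard_limit - soft_limit)))
    return [-0.5 * gate * action for action in nominal_policy(observation)]


def act(observation, motor_temps):
    """Final command is the nominal action plus the thermal-aware correction."""
    nominal = nominal_policy(observation)
    residual = residual_policy(observation, motor_temps)
    return [n + r for n, r in zip(nominal, residual)]


cool_action = act(None, [40.0, 40.0])
hot_action = act(None, [80.0, 50.0])
next_temp = thermal_step(25.0, torque=10.0, dt=1.0)
```

When the motors are cool the correction is zero and the nominal behavior is untouched, which mirrors the stated goal of keeping high performance at low temperature. Near the limit the correction halves the commanded torque in this toy rule, trading performance for thermal safety.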

2605.27045 2026-05-27 cs.CL

ExTax: Explainable Disinformation Detection via Persuasion, Emotion, and Narrative Role Taxonomies

ExTax：基于说服、情感和叙事角色分类学的可解释虚假信息检测

Shang Luo, Yingguang Yang, Zhenchen Sun, Yang Liu, Bin Chong, Jingru Chen, Yancheng Chen, Jiayu Liang, Kefu Xu, Hao Peng, Philip S. Yu

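AI summary: Proposes ExTax, a framework that unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space and fuses taxonomic and contextual features through entropy-driven dynamic label smoothing and multi-head attention, achieving explainable disinformation detection with 0.8456 Macro F1 on cross-domain benchmarks.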

AI中文摘要

LLMs的普及加速了高度流畅虚假信息的生成和传播，使得传统的句法语义验证越来越不足。这种欺骗很少仅依赖表面虚假；相反，它常常结合说服性修辞、情感操纵和叙事角色构建，通过多种认知途径影响读者的解读。然而，现有检测器通常强调孤立信号——如句法、外部知识、说服或情感线索——因此难以捕捉虚假信息背后的多方面操纵意图或提供人类可审计的解释。为填补这一空白，我们提出了ExTax，一个面向分类学的可解释虚假信息检测框架。ExTax将说服修辞、情感操纵和叙事角色统一到17维分类空间中，涵盖6种说服修辞策略、5种情感操纵方法和6种叙事角色类别。它从多个前沿LLMs中提取属性，通过熵驱动动态标签平滑协调它们的分歧，并通过异构多头注意力将所得分类表示与上下文编码融合，将每个预测基于可解释的操纵画像。在五个跨领域和跨体裁基准上，ExTax实现了0.8456的整体Macro $F_1$，优于最先进的深度学习和基于LLM的基线。在严重的体裁不平衡下，最强的深度基线从0.9454降至0.6194，而ExTax保持稳健。

英文摘要

The democratization of LLMs has accelerated the generation and circulation of highly fluent disinformation, making traditional syntax-semantic verification increasingly insufficient. Such deception rarely relies solely on surface-level falsity; instead, it often combines persuasive rhetoric, emotional manipulation, and narrative role construction to influence readers' interpretations through multiple cognitive pathways. However, existing detectors typically emphasize isolated signals (such as syntax, external knowledge, persuasion, or affective cues) and therefore struggle to capture the multi-faceted manipulative intents underlying disinformation or provide human-auditable explanations. To address this gap, we present ExTax, a taxonomy-aligned framework for explainable disinformation detection. ExTax unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space, covering 6 persuasive-rhetoric strategies, 5 emotional-manipulation methods, and 6 narrative-role categories. It elicits attributes from multiple frontier LLMs, reconciles their disagreements through Entropy-driven Dynamic Label Smoothing, and fuses the resulting taxonomic representations with contextual encodings via Heterogeneous Multi-Head Attention, grounding each prediction in an interpretable manipulation profile. Across five cross-domain and cross-genre benchmarks, ExTax achieves an overall Macro $F_1$ of 0.8456, outperforming state-of-the-art deep learning and LLM-based baselines. It also remains robust under severe genre imbalance, where the strongest deep baseline degrades from 0.9454 to 0.6194.
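
One way to picture entropy-driven label smoothing over disagreeing LLM annotators is sketched below for a single binary attribute. The abstract does not give the formula, so the rule used here (pull the majority label toward 0.5 in proportion to the binary entropy of the votes) and the function name are purely assumptions for illustration.

```python
import math

def smoothed_label(votes):
    """Soft label for one binary attribute from several annotators' 0/1 votes.

    Hypothetical rule: the more the annotators disagree (higher entropy),
    the closer the target moves from the majority label toward 0.5.
    """
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    hard = 1.0 if p >= 0.5 else 0.0
    return (1 - entropy) * hard + entropy * 0.5

unanimous = smoothed_label([1, 1, 1])  # full agreement keeps the hard label
split = smoothed_label([1, 0])         # maximal disagreement gives 0.5
majority = smoothed_label([1, 1, 0])   # partial agreement lands in between
```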

2605.27038 2026-05-27 cs.RO

TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

TPS-Drive: 基于VLM的自动驾驶任务引导表示净化

Jiaxiang Li, Yumao Liu, Ke Ma

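AI summary: Proposes TPS-Drive, a framework that uses task-guided representation purification (an Agent-Centric Tokenizer) to address spatial hallucination and representation interference when VLMs are applied to autonomous driving, enabling precise 3D spatial forecasting and safe planning.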

AI中文摘要

视觉-语言模型（VLM）为自动驾驶规划提供了有前景的基础，但弥合语义推理与精确3D空间预测之间的差距仍然是一个关键挑战。现有的表示策略通常遵循两条路径：文本对齐方法将连续空间状态扁平化为符号，这损害了几何结构并导致“空间幻觉”；密集视觉方法保留了空间拓扑，但用冗余的背景纹理压垮了标准分词器，导致“表示干扰”。为了解决这些限制，我们引入了TPS-Drive，一个以任务引导表示净化为核心的新框架，使VLM能够在净化空间中思考。其核心是一个以智能体为中心的分词器，利用由冻结的3D检测头监督的任务引导向量量化机制，将有限的码本容量从普遍的静态背景显式重新分配给关键的动态智能体，并有效隔离空间冗余。利用这种净化的空间词汇，TPS-Drive采用解耦的推理流程，依次执行场景理解、未来预测和动作生成。该框架通过渐进的三阶段训练范式进行优化，最终通过奖励驱动的细化超越纯模仿学习。大量实验验证了我们的方法：TPS-Drive在开环nuScenes评估中实现了准确的智能体空间状态预测并降低了碰撞率，同时在严格的闭环NAVSIMv1和NAVSIMv2基准测试中建立了新的安全记录。

英文摘要

Vision-Language Models (VLMs) provide a promising foundation for autonomous driving planning, yet bridging semantic reasoning and precise 3D spatial forecasting remains a critical challenge. Existing representation strategies generally follow two paths: text-aligned methods flatten continuous spatial states into symbols, which compromises geometric structure and induces "spatial hallucinations"; dense visual methods preserve spatial topology but overwhelm standard tokenizers with redundant background textures, leading to "representation interference". To address these limitations, we introduce TPS-Drive, a novel framework centered on Task-Guided Representation Purification that empowers VLMs to Think in Purified Space. At its core, an Agent-Centric Tokenizer utilizes a task-guided vector quantization mechanism supervised by a frozen 3D detection head, which explicitly reallocates limited codebook capacity from pervasive static backgrounds to critical dynamic agents and effectively isolates spatial redundancy. Leveraging this purified spatial vocabulary, TPS-Drive employs a decoupled reasoning pipeline that sequentially performs scene understanding, future forecasting, and action generation. The framework is optimized via a progressive three-stage training paradigm, culminating in reward-driven refinement that surpasses pure imitation learning. Extensive experiments validate our approach: TPS-Drive achieves accurate agent spatial state forecasting and reduces collision rates in open-loop nuScenes evaluations, while establishing new safety records on the rigorous closed-loop NAVSIMv1 and NAVSIMv2 benchmarks.

2605.27033 2026-05-27 cs.CL cs.AI cs.LG

Tracing Computation Density in LLMs

追踪LLMs中的计算密度

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

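AI summary: Proposes the s-Trace method for estimating the best-approximating subgraph, and finds that LLM computation splits into two phases, a sparse early-layer core followed by denser later-layer refinement, with the amount of computation correlating with model uncertainty.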

AI中文摘要

基于Transformer的大型语言模型（LLMs）由数十亿个参数组成，这些参数排列在深度和宽度都很大的计算图中，但尚不清楚它们是否对所有输入都充分利用了全部容量。我们引入了s-Trace方法，以有效估计最能近似完整模型输出的大小为s的子图。通过这种方法，我们发现各种LLM中的计算组织成两个不同的阶段。一个主要由早期层节点组成的小子图可以重建完整模型输出分布的头部。添加更多节点（主要位于后期层，且越来越多地由注意力头组成）会导致近似完整输出分布的逐步细化。此外，我们发现每个输入所需的计算量与模型不确定性相关，并且更稀疏的子图编码浅层统计信息，例如单字频率。总体而言，我们的结果表明，有效的LLM计算中存在一致的模块化组织，其中稀疏的早期层核心提供粗略预测，然后通过后期层中更密集的计算进一步细化。

英文摘要

Transformer-based large language models (LLMs) consist of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

2605.27032 2026-05-27 cs.CV

SCKAN: Structural Consensus-based KAN Prototype Learning for Semi-Supervised Pancreas Segmentation

SCKAN: 基于结构一致性的KAN原型学习用于半监督胰腺分割

Yuqi Liu, Yufei Chen, Wei Fu, Xiaodong Yue, Shuo Li

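AI summary: Targets the supervision bias caused by sparse supervision in semi-supervised pancreas segmentation, proposing Structural Consensus-based KAN Prototype Learning (SCKAN), which achieves more generalizable and accurate segmentation through cross-sample structural consensus learning and adaptive KAN fusion.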

Comments 10.5 pages, 5 figures, Medical Image Computing and Computer Assisted Intervention 2026

AI中文摘要

精确的胰腺分割对于早期癌症诊断至关重要，而标注稀缺使得半监督学习（SSL）成为必要。然而，由于样本间显著的形态变异性，现有SSL方法在稀疏监督下存在严重的泛化限制，导致监督偏差问题。为解决这一问题，我们提出了基于结构一致性的KAN原型学习（SCKAN），该方法首次利用Kolmogorov-Arnold网络（KANs）构建跨样本结构一致性学习，以实现更泛化和准确的分割。具体而言，SCKAN包含两个关键设计：结构约束的原型一致性学习（SPCL），通过原型级对比优化强制跨样本一致性，促进无偏结构表示；以及基于一致性的Kolmogorov-Arnold融合（CKaF），通过KAN的自适应B样条非线性聚合稳定一致性并过滤样本特定噪声，减少形态特异性偏差。在两个公开胰腺数据集上的大量实验证明了SCKAN的有效性。代码位于https://github.com/rhodaliu17/SCKAN。

英文摘要

Accurate pancreas segmentation is critical for early cancer diagnosis, where annotation scarcity necessitates Semi-Supervised Learning (SSL). However, due to significant inter-sample morphological variability, existing SSL methods face severe generalizability limitations under sparse supervision, leading to the Supervision Bias problem. To address this, we propose Structural Consensus-based KAN Prototype Learning (SCKAN), the first method to build cross-sample structural consensus learning on Kolmogorov-Arnold Networks (KANs), to achieve more generalizable and accurate segmentation. Specifically, SCKAN contains two key designs: Structure-constrained Prototype Consistency Learning (SPCL), which promotes unbiased structural representation by enforcing cross-sample consistency via prototype-level contrastive optimization, and Consensus-based Kolmogorov-Arnold Fusion (CKaF), which reduces morphology-specific bias by aggregating stable consensus and filtering sample-wise noise via KAN's adaptive B-spline nonlinearity. Extensive experiments on two public pancreas datasets demonstrate the effectiveness of SCKAN. Code is at https://github.com/rhodaliu17/SCKAN.

2605.27030 2026-05-27 cs.CL

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

分享更多，搜索更少：面向高效测试时间扩展的协作并行思考

Xinglin Wang, Hao Lin, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

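AI summary: Proposes a training-free collaborative parallel thinking framework that shares search information across parallel branches to reduce redundant exploration, yielding a better accuracy-latency Pareto frontier for test-time scaling.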

Comments Preprint

AI中文摘要

测试时间扩展（TTS）通过分配额外的推理计算来探索解空间，从而增强大型语言模型的推理能力。然而，现有的并行TTS方法通常在搜索过程中保持分支隔离：中间发现保持分支私有，无法及时指导其他分支。这种信息隔离导致大量冗余探索，因为分支反复重新发现其他地方已有的信息，并且需要更多搜索步骤来收集做出正确回答所需的完整决策信息。为弥补这一差距，我们提出协作并行思考（CPT），一种无需训练的推理框架，能够在并行分支间实现搜索时信息共享。CPT从正在运行的分支中提取紧凑的中间信息，维护一个去重的查询级信息池，并通过输入上下文广播池条目，使得后续搜索步骤中的每个分支能够重用其他分支的发现，而不是重新发现相同信息。实验上，在HMMT和AIME基准测试上的结果表明，CPT在不同rollout预算和模型规模下，相比强基线建立了更强的准确率-延迟帕累托前沿，突显了搜索时协作作为高效并行TTS的一个有效方向。

英文摘要

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and cannot guide other branches in time. This information isolation causes substantial redundant exploration, as branches repeatedly rediscover information already found elsewhere and require more search steps to collect the complete decision information needed to reach correct answers. To bridge this gap, we propose Collaborative Parallel Thinking (CPT), a training-free inference framework that enables search-time information sharing across parallel branches. CPT extracts compact intermediate information from ongoing branches, maintains a deduplicated query-level information pool, and broadcasts pool entries through the input context, allowing each branch in subsequent search steps to reuse discoveries made by other branches rather than rediscover the same information. Empirically, experiments on HMMT and AIME benchmarks show that CPT establishes a stronger accuracy-latency Pareto frontier than strong baselines across rollout budgets and model scales, highlighting search-time collaboration as an effective direction for efficient parallel TTS.
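
The deduplicated pool and context broadcast described above might look roughly like the following. The class name, the exact-match deduplication after lowercasing, and the prompt wording are all assumptions for this sketch; the abstract does not say how CPT extracts, compares, or formats findings.

```python
class InfoPool:
    """Query-level pool of intermediate findings shared by parallel branches."""

    def __init__(self):
        self._seen = set()
        self.entries = []

    def add(self, finding: str) -> bool:
        """Store a finding unless a normalized duplicate is already present."""
        key = " ".join(finding.lower().split())  # hypothetical dedup rule
        if key in self._seen:
            return False
        self._seen.add(key)
        self.entries.append(finding.strip())
        return True

    def broadcast(self, prompt: str) -> str:
        """Prepend pooled findings to a branch's input context."""
        if not self.entries:
            return prompt
        shared = "\n".join(f"- {e}" for e in self.entries)
        return f"Findings from other branches:\n{shared}\n\n{prompt}"

pool = InfoPool()
pool.add("n must be even")
pool.add("N must be  even")  # duplicate after normalization, ignored
pool.add("the sum telescopes")
context = pool.broadcast("Continue solving the problem.")
```

Each branch would call `broadcast` before its next search step, so a finding made once is visible to every other branch without any retraining.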

2605.27028 2026-05-27 cs.LG cs.AI

Less is More: Early Stopping Rollout for On-Policy Distillation

少即是多：用于在线策略蒸馏的早停展开

Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos

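AI summary: Addresses the "off-policy teacher decay" problem in on-policy distillation by proposing Early Stopping Rollout (ESR), which restricts rollout generation to the first few response tokens and improves performance, GPU efficiency, and training stability.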

AI中文摘要

在线策略蒸馏最近成为标准序列级模仿的有前途的替代方案，通过使用教师模型对学生自身的展开进行评分来训练学生。然而，我们观察到这种范式中的“离策略教师衰减”问题：对于后面的token，由于学生的早期轨迹作为上下文对于教师来说是离策略的，教师产生纠正性分数的能力会衰减，并可能退回到预训练阶段学习的token补全行为。我们通过实验验证了这个问题，并提出了早停展开（ESR）来解决它：一种简单而有效的蒸馏策略，仅限制展开生成到前几个响应token。我们表明，ESR在模型大小、家族、任务和训练制度上均超越了全展开在线策略蒸馏的性能，并且在跨模型家族场景下表现出更高的GPU效率和训练稳定性。我们进一步研究了这一惊人性能背后的机制，发现了ESR的“级联对齐”和“子模式承诺”效应，这可能解释其为何有效，甚至有时超过教师模型性能。此外，我们表明这种基于位置的token选择策略不能完全由KL散度和熵信号解释。

英文摘要

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the student's earlier trajectory serves as context that is off-policy for the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout on-policy distillation (OPD) across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works and why it sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.
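
A minimal sketch of the position-based truncation follows, assuming a per-token reverse KL objective, which is a common choice in on-policy distillation but is not confirmed by the abstract. The function names and toy distributions are hypothetical. In practice generation would stop after k tokens, so the later positions would never be produced at all; here they are simply excluded from the loss.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single token position."""
    return sum(p * math.log(p / q) for p, q in zip(student_probs, teacher_probs) if p > 0)

def esr_loss(student_dists, teacher_dists, k):
    """Average per-token reverse KL over only the first k rollout positions (k >= 1)."""
    kept = list(zip(student_dists, teacher_dists))[:k]
    return sum(reverse_kl(s, t) for s, t in kept) / len(kept)

# Toy two-token vocabulary: student and teacher agree early and diverge at position 3.
student = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
teacher = [[0.5, 0.5], [0.9, 0.1], [0.8, 0.2]]
early_only = esr_loss(student, teacher, k=2)
full_rollout = esr_loss(student, teacher, k=3)
```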

2605.27027 2026-05-27 cs.LG

SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

SQARL: 一种适用于分布式量子架构中电路分配的大小无关强化学习方法

Víctor Carballo, Júlia López-Closa, Mario Martin

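AI summary: For the qubit allocation problem in distributed quantum computing, proposes a flexible Transformer-based reinforcement learning architecture that handles arbitrary numbers of qubits and cores without retraining, reducing allocation cost by 33% relative to Hungarian Qubit Allocation on the Cuccaro adder.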

AI中文摘要

量子处理器的扩展目前受到退相干和串扰等技术挑战的限制。随着量子比特数量的增加，干扰会增大计算噪声。分布式量子计算通过互连更小、更易处理的量子处理器（核心）来解决这些限制，但引入了最小化缓慢且易出错的核间通信的挑战。在最小化通信成本的同时将量子电路分配到核心的任务被称为量子比特分配问题。本文致力于开发一种深度学习方法来解决该问题，强调对量子硬件拓扑的灵活性，并提升现有最优性能。启发式和非学习算法，如匈牙利量子比特分配（HQA），目前代表了最优水平。强化学习（RL）方法利用学习到的分配策略，但通常缺乏灵活性，当硬件配置改变时需要重新训练，并且其解的质量不如非学习方法。然而，学习机制可能超越人工设计的启发式方法。为克服这些限制，本文提出一种灵活的基于Transformer的架构，无需重新训练即可处理任意数量的量子比特和核心。结果表明，训练后的策略持续优于先前的RL最优水平，并缩小了RL与HQA在大多数常见电路上的差距。对于Cuccaro加法器，它相对于HQA实现了33%的分配成本降低，对于随机电路平均降低25%。这些发现表明，基于学习的方法可以有效地匹配手工启发式方法的性能，这是向实际应用迈出的关键一步。

英文摘要

The scaling of quantum processors is currently limited by technical challenges such as decoherence and cross-talk. As the number of qubits grows, interference increases the computational noise. Distributed quantum computing addresses these limitations by interconnecting smaller, easier-to-handle quantum processors (cores), but it introduces the challenge of minimizing slow, error-prone inter-core communication. The task of distributing quantum circuits across cores while minimizing communication costs is known as the Qubit Allocation problem. This work focuses on developing a deep learning approach to this problem, emphasizing flexibility to quantum hardware topology and improving state-of-the-art performance. Heuristic and non-learning algorithms, such as the Hungarian Qubit Allocation (HQA), currently represent the state of the art. Reinforcement Learning (RL) approaches leverage learned allocation policies but often lack flexibility, requiring retraining when hardware configurations change, and they fall short of the solution quality achieved by non-learning methods. However, learning mechanisms could outperform human-crafted heuristics. To overcome these limitations, this work proposes a flexible, transformer-based architecture that can handle arbitrary numbers of qubits and cores without retraining. Results show that the trained policy consistently outperforms the previous RL state of the art and narrows the gap between RL and HQA for the most common circuits. It achieves a 33% reduction in allocation cost relative to the HQA for the Cuccaro Adder and 25% on average for random circuits. These findings show that learning-based approaches can effectively match the performance of hand-crafted heuristics, a crucial step towards their application in real-world scenarios.
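
To make the allocation objective concrete, the sketch below counts two-qubit gates that straddle cores under a fixed placement. This is a deliberately simplified cost: the abstract does not define the paper's cost function, which may also account for moving qubits between cores over time. The gate list and placements are made-up examples.

```python
def inter_core_gates(gates, core_of):
    """Count two-qubit gates whose operands sit on different cores."""
    return sum(1 for a, b in gates if core_of[a] != core_of[b])

# Each tuple is a two-qubit gate acting on the given qubit indices.
gates = [(0, 1), (1, 2), (2, 3), (0, 1)]

naive = {0: 0, 1: 1, 2: 0, 3: 1}    # alternating placement across two cores
grouped = {0: 0, 1: 0, 2: 1, 3: 1}  # frequently interacting qubits share a core

naive_cost = inter_core_gates(gates, naive)
grouped_cost = inter_core_gates(gates, grouped)
```

An allocation policy, learned or heuristic, is in effect searching over mappings like `core_of` to keep this kind of count low.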

2605.27025 2026-05-27 cs.CL cs.MM

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

基于属性的LLM与仇恨言论标注的对齐诊断

Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser

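AI summary: Analyzes how well LLM judgments align with human annotations across ten subjective attributes, finding good alignment on behaviorally explicit dimensions and systematic inversion on evaluative dimensions, and proposes combining attributes via confidence-weighted Ridge regression to reconstruct continuous hate speech scores, reaching an R² of up to 0.71.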

AI中文摘要

仇恨言论标注成本高昂、主观性强且容易产生标注者分歧，使得大规模数据集构建具有挑战性。我们系统分析了大型语言模型（LLM）在十个理论上基于主观属性（如去人性化、暴力和情感）上与人类判断的对齐程度，评估了Llama 3.1和Qwen 2.5的小型及大型变体。我们的分析揭示了所有模型的一致分裂：行为显式维度（侮辱、羞辱、攻击-防御）与人类标注高度相关，而评价维度（尊重、情感、仇恨言论）则系统性反转。人口统计角色条件化降低了模型置信度，但未改善对齐。基于这些发现，我们提出通过置信度加权岭回归组合属性级LLM预测，从测量仇恨言论语料库中重构连续仇恨言论分数，R²达到0.71，优于直接提示基线，表明结构化属性分解比端到端标签预测单独恢复出更丰富且更符合人类对齐的信号。

英文摘要

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving $R^2$ of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.

2605.27024 2026-05-27 cs.CV cs.MM

NeR-SC: Adapting Neural Video Representation to Screen Content

NeR-SC：适应屏幕内容的神经视频表示

Ruohan Shi, Jiaoyan Zhao, Haogang Feng

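AI summary: Proposes the NeR-SC framework, which uses a learnable color palette, multi-gate dense fusion, and an embedding-level frame skip strategy to exploit the discrete colors and strong temporal redundancy of screen content video, surpassing H.264/H.265 at low bitrates.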

Comments Submitted to PRMVAI 2026

AI中文摘要

隐式神经表示已成为视频压缩的一种有前景的范式，最近的方法在自然视频上取得了有竞争力的性能。然而，屏幕内容视频——常见于远程桌面、在线教育和云游戏——表现出独特的统计特性：锐利边缘、有限调色板和强时间冗余。现有的为自然场景设计的神经表示方法缺乏利用这些特性的机制，留下了很大的改进空间。在本文中，我们提出了NeR-SC，一个为屏幕内容视频量身定制的神经表示框架。基于SNeRV骨干网络，NeR-SC引入了三个屏幕内容特定模块：(i) 可学习调色板，通过将低频子带限制到学习到的颜色集来建模屏幕内容的离散颜色结构；(ii) 多门密集融合模块，用密集的、注意力门控的跨阶段交互替代顺序特征融合；(iii) 嵌入级帧跳过策略，绕过静态帧的冗余解码器调用，且零训练开销。在DSCVC和VCD上的实验表明，NeR-SC实现了40.32 dB和41.73 dB的平均PSNR，优于代表性的神经视频表示方法，并且在低码率下超越了H.264和H.265。帧跳过策略实现了实时解码且质量无损失。

英文摘要

Implicit neural representations have emerged as a promising paradigm for video compression, with recent methods achieving competitive performance on natural video. However, screen content video, common in remote desktop, online education, and cloud gaming, exhibits distinct statistics: sharp edges, limited color palettes, and strong temporal redundancy. Existing neural representation methods, designed for natural scenes, lack mechanisms to exploit these properties, leaving substantial room for improvement. In this paper, we propose NeR-SC, a neural representation framework tailored for screen content video. Building on the SNeRV backbone, NeR-SC introduces three screen-content-specific modules: (i) a learnable color palette that models the discrete color structure of screen content by restricting the low-frequency sub-band to a learned color set; (ii) a multi-gate dense fusion module that replaces sequential feature fusion with dense, attention-gated cross-stage interaction; and (iii) an embedding-level frame skip strategy that bypasses redundant decoder invocations for static frames, with zero training overhead. Experiments on DSCVC and VCD show that NeR-SC achieves 40.32 dB and 41.73 dB average PSNR, respectively, outperforming representative neural video representation methods and, at low bitrates, surpassing H.264 and H.265. The skip strategy enables real-time decoding with no loss in quality.
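
The idea behind an embedding-level frame skip can be sketched as a cache in front of the decoder. How NeR-SC actually decides that a frame is static is not stated in the abstract; the element-wise comparison against the last decoded embedding, the `tol` parameter, and the toy decoder below are assumptions for illustration.

```python
def decode_with_skip(embeddings, decoder, tol=0.0):
    """Decode frames, reusing the last decoded output when the embedding is unchanged."""
    frames, calls = [], 0
    ref_emb, ref_frame = None, None
    for emb in embeddings:
        static = ref_emb is not None and max(
            abs(a - b) for a, b in zip(emb, ref_emb)
        ) <= tol
        if not static:
            # Content changed: run the decoder and refresh the cached reference.
            ref_emb, ref_frame = emb, decoder(emb)
            calls += 1
        frames.append(ref_frame)
    return frames, calls

def toy_decoder(emb):
    """Stand-in for the neural decoder (hypothetical)."""
    return [2 * x for x in emb]

embs = [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2], [0.3, 0.2]]
frames, calls = decode_with_skip(embs, toy_decoder)
```

Because skipping happens purely at decode time, a scheme of this kind needs no change to training, consistent with the zero-training-overhead claim.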

2605.27022 2026-05-27 cs.AI

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

ORCA：一种用于优化根因分析的端到端交互式副驾驶

Phi Nguyen Xuan, Nicholas Tagliapietra, Lavdim Halilaj, Kristian Kersting, Juergen Luettin

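AI summary: Proposes ORCA, an end-to-end causal analysis copilot that orchestrates agents to understand user goals and guide users through causal analysis workflows ranging from fully automatic to highly user-guided, covering causal discovery, effect estimation, explainability, and root cause analysis, and producing structured reports.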

AI中文摘要

因果分析是制造、社会科学和医学等多个领域的关键任务。然而，尽管近期取得了进展，因果方法的概念和方法复杂性使得领域专家难以使用。这一差距阻碍了专家利用这些进展，并阻碍了缺乏真实世界数据进行验证的研究人员。为了弥合这一鸿沟，我们引入了ORCA，一种用于端到端因果分析的副驾驶。ORCA编排智能体以理解用户的目标，并引导他们完成最合适的因果分析工作流，从全自动到高度用户引导的执行。它具有因果发现、因果效应估计、可解释性和根因分析（RCA）功能。ORCA评估和比较性能，生成关键指标和图表，并通过结构化报告生成洞察。我们强调了它在几个真实世界用例中的有效性。

英文摘要

Causal analysis is a crucial task in many domains, including manufacturing, social science, and medicine. However, despite recent progress, the conceptual and methodological complexity of causal methods makes them largely inaccessible to domain experts. This gap prevents experts from leveraging these advances and hinders researchers who lack access to real-world data for validation. To bridge this divide, we introduce ORCA, a copilot for end-to-end causal analysis. ORCA orchestrates agents to understand the user's goals and guide them through the most appropriate causal analysis workflow, from fully automatic to highly user-guided execution. It supports causal discovery, causal effect estimation, explainability, and root cause analysis (RCA). ORCA evaluates and compares performance, generates key metrics and diagrams, and produces insights through structured reports. We highlight its effectiveness across several real-world use cases.

2605.27020 2026-05-27 cs.CV cs.AI

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

黑盒成员推断攻击：针对图像生成模型的预训练数据

Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang, Lianchao Zhao, Jinrui Wang, Zichen Qin, Shangguang Wang, Yongfeng Huang

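AI summary: Proposes SD-MIA, a black-box membership inference attack framework based on cross-modal data perturbation, which detects membership in pre-training data by analyzing how a diffusion model denoises a target image together with perturbed text instructions.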

Comments 13 pages, 9 figures; CVPR 2026 camera-ready

AI中文摘要

基于扩散的图像生成模型的快速发展引发了对涉及人类创建数据的潜在版权和隐私侵犯的严重担忧。成员推断攻击（MIA）已成为识别模型训练期间未经授权数据使用的有前景工具。现有方法通常评估模型对扰动嫌疑图像的去噪能力作为成员状态的指标。然而，此类特征的判别能力高度依赖于模型记忆程度，并且在应用于曝光较少的数据（例如预训练数据）时显著下降。尽管有几种方法尝试通过利用模型内部特征来增强检测，但这些特征在主流闭源图像生成平台中通常不可访问，限制了其实用性。在本文中，我们证明分析黑盒扩散模型如何对目标图像和相应的扰动文本指令进行去噪可以揭示更具区分性的成员线索。基于这一见解，我们提出了一种名为SD-MIA的黑盒成员推断攻击框架，该框架利用跨模态数据扰动机制来检测扩散模型中的预训练数据。我们在一个公共基准数据集和一个新构建的数据集上进行了广泛实验，每个数据集包含具有相同分布的预训练成员和非成员样本。实验结果表明，SD-MIA相比现有基线（包括那些具有不公平访问模型内部特征优势的基线）实现了更优的性能。

英文摘要

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess a model's ability to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

2605.27016 2026-05-27 cs.CL cs.AI cs.LG stat.ML

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

评估不确定性估计器与LLM幻觉的相关性

Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau, Karteek Alahari

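AI summary: Through a systematic empirical study, evaluates the association between uncertainty estimators (information-theoretic, sampling-based, and reflexive) and LLM hallucinations, finding that the association is highly variable and often weak, which challenges the use of uncertainty as a direct hallucination signal.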

Comments 35 pages, 7 figures, 9 tables

AI中文摘要

大型语言模型（LLM）容易产生幻觉，即与输入或训练数据不符的陈述，阻碍了可靠部署。同时，许多不确定性估计（UE）方法被提出来量化模型置信度，并常被隐含地视为模型失败的代理。然而，不确定性与幻觉之间的关系尚未得到充分表征。我们对不确定性估计器与LLM幻觉之间的关联进行了系统的实证研究。我们不是假设这种关联，而是直接评估它在何时以及在多大程度上成立。我们考虑了多种不确定性估计器，包括信息论、基于采样和反思性估计器，并检查了它们在幻觉设置中的行为。我们的实验涵盖了内在幻觉（违反输入忠实性）和外在幻觉（相对于训练数据的无根据主张），使用了四个互补基准，包括RAGTruth和HalluLens。我们发现，这种关联性高度可变且通常较弱，取决于幻觉类型和所评估的LLM。这些结果挑战了将不确定性作为幻觉直接信号的做法，并阐明了何时它能提供可操作的信息。

英文摘要

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
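
One simple way to test whether an uncertainty score tracks hallucination is to compute a token-entropy estimator and then measure how well it ranks hallucinated answers above faithful ones. The sketch below uses mean token entropy and AUROC; whether the paper uses these exact quantities is an assumption, and the scores and labels are made-up toy data.

```python
import math

def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) of the per-token predictive distributions."""
    ent = [-sum(p * math.log(p) for p in d if p > 0) for d in token_dists]
    return sum(ent) / len(ent)

def auroc(scores, labels):
    """Chance that a hallucinated answer (label 1) scores higher than a faithful one (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

uncertainty = [0.9, 0.2, 0.6, 0.4]  # one uncertainty score per generated answer
hallucinated = [1, 0, 0, 1]         # 1 = answer was judged a hallucination
score = auroc(uncertainty, hallucinated)
```

An AUROC near 0.5 would mean the estimator carries almost no information about hallucination, which is the kind of weak association the study reports for some settings.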

2605.27015 2026-05-27 cs.CL

PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions

PersLitEval：波斯文学问题上的细粒度基准与LLM评估

Ruhallah Niazi, Faeze Ghorbanpour, Alexander Fraser

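AI summary: Proposes the PersLitEval benchmark of 4,514 Persian literature multiple-choice questions and evaluates six LLMs under ten prompting strategies, finding that models are more accurate on conceptual similarity tasks but struggle with formal linguistic analysis such as spelling and word formation, and that prompting strategy significantly affects performance.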

AI中文摘要

尽管多语言能力令人印象深刻，但大型语言模型（LLM）在非英语语言的文学知识方面仍然缺乏充分评估。我们引入了PersLitEval，这是一个包含4514道波斯文学多选题的基准，涵盖拼写、修辞手法、语法、词汇、构词和概念理解等八个细粒度类别，题目来源于Konkur大学入学考试材料。我们评估了六种LLM在十种提示策略下的表现，揭示了三个难度层级上显著的类别差异：模型在概念相似性任务上准确率较高，但在正式语言分析上表现不佳，其中拼写和构词对所有模型来说都是最难的。提示策略对性能有显著影响，其中带解释的少样本示例效果最佳，尤其是在正式语言类别上。错误分析识别出三种失败模式：语义理解差距、正式语言知识差距以及计数/枚举错误，表明不同类别需要不同的改进策略。

英文摘要

Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.

2605.27013 2026-05-27 cs.AI

Generating Robust Portfolios of Optimization Models using Large Language Models

使用大型语言模型生成鲁棒的优化模型组合

Eleni Straitouri, Cheol Woo Kim, Milind Tambe

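AI summary: Proposes a unified framework that uses an LLM as both a stochastic generator and a reasoning evaluator to generate a robust portfolio of optimization models, with a guarantee that the portfolio contains high-quality candidates when either the generator or the evaluator is aligned with human preferences.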

Comments Accepted at the ICML 2026 LM4Plan Workshop

AI中文摘要

数学优化是跨领域（如资源分配和规划）进行结构化决策的强大工具。然而，制定忠实于现实的优化模型仍然是一个重大瓶颈，因为它通常需要领域专业知识和优化知识，而这些往往是稀缺的。最近大型语言模型（LLM）的进展有望弥合这一差距，使得从自然语言描述中生成候选优化模型成为可能。然而，无法保证任何单个LLM生成的模型是可靠的，因此仅输出一个模型的现有方法存在风险。在这项工作中，我们提出了一种新颖的算法，生成一个优化模型组合，旨在对LLM的局限性具有鲁棒性。我们的方法利用了一个观察：单个LLM可以扮演两个不同的角色——作为随机生成器和作为推理评估器——并提出了一个统一的框架，以互补的方式利用这两种能力。我们提供了理论保证，表明只要生成器或评估器中至少有一个与人类偏好良好对齐，该组合就保证包含高质量的候选模型，从而实现一个原则性的人机交互过程，决策者可以在承诺使用一个模型之前审查多个候选模型。我们进一步通过实验验证了我们的方法，展示了在一系列优化建模任务中的强大性能。

英文摘要

Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles (as a stochastic generator and as a reasoning evaluator) and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

2605.27009 2026-05-27 cs.LG

SCENT: Aligning Mass Spectra with Molecular Structure for Olfactory Perception

SCENT: 将质谱与分子结构对齐用于嗅觉感知

Ziqi Zhang, Eunyeong Jin, Miguel Vasco, Farzaneh Taleb, Nona Rajabi, Alexandra Gutmann, Jonathan Williams, Antônio H. Ribeiro, Danica Kragic

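AI summary: Proposes SCENT, a multi-modal contrastive learning framework that aligns electron ionization mass spectrometry representations with pretrained chemical structure embeddings, achieving olfactory prediction performance comparable to structure-based models without requiring molecular structure at inference.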

AI中文摘要

从分子结构预测人类嗅觉感知已取得显著进展，但这些方法在推理时需要明确的化学结构，而这在实际传感场景中并不可用。我们通过探索直接电子电离质谱（EI-MS）作为嗅觉预测的替代输入模态来弥补这一差距，该传感技术可在数秒内获取化学信息丰富的碎片指纹。我们提出了谱图到化学嵌入对齐（SCENT），这是一个多模态对比学习框架，它将EI-MS表示与预训练的化学结构嵌入对齐，同时在推理时仅需要质谱。在多标签气味描述符预测任务中，SCENT显著优于仅使用MS的基线，并实现了与基于结构的模型相当的性能，尽管在测试时不需要明确的分子结构。学习到的表示还能更好地逼近连续的人类感知评分，并泛化到真实实验室测量的谱图，表明跨模态对齐是将分析谱图嵌入化学语义的有效策略。

英文摘要

Predicting human olfactory perception from molecular structure has seen remarkable progress, yet these approaches require explicit chemical structure at inference, which is not available in practical sensing settings. We address this gap by exploring direct electron ionization mass spectrometry (EI-MS), a sensing technique that acquires chemically informative fragmentation fingerprints in seconds, as an alternative input modality for olfactory prediction. We contribute Spectrum-to-Chemical Embedding alignmeNT (SCENT), a multi-modal contrastive learning framework that aligns EI-MS representations with pretrained chemical structure embeddings, while requiring only mass spectra at inference. On the multi-label odor descriptor prediction task, SCENT significantly outperforms MS-only baselines and achieves performance comparable to structure-based models, despite requiring no explicit molecular structure at test time. The learned representations also better approximate continuous human perceptual ratings and generalize to real-world lab-measured spectra, suggesting that cross-modal alignment is an effective strategy for grounding analytical spectra in chemical semantics.
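
Contrastive alignment between two modalities is often trained with an InfoNCE-style loss, sketched below for spectrum embeddings matched against structure embeddings. The abstract does not specify SCENT's loss, so the one-directional InfoNCE form, the temperature of 0.1, and the toy 2-D embeddings are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(spec_embs, struct_embs, temperature=0.1):
    """Spectrum-to-structure InfoNCE: spectrum i should match structure i in the batch."""
    loss = 0.0
    for i, s in enumerate(spec_embs):
        logits = [cosine(s, m) / temperature for m in struct_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]  # negative log-softmax of the matching pair
    return loss / len(spec_embs)

structures = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for frozen pretrained structure embeddings
aligned = info_nce([[0.9, 0.1], [0.1, 0.9]], structures)
swapped = info_nce([[0.1, 0.9], [0.9, 0.1]], structures)
```

The loss is lower when each spectrum embedding points toward its own molecule's structure embedding, which is what lets the spectrum encoder be used alone at inference.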
