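arXivDaily: a daily digest of academic papers, synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.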

2606.01436 2026-06-02 cs.CL

Learning from Saturated Data: Signals Beyond Correctness for LLM Training

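Hanno Hiss, Jasper Dekoninck, Martin Vechev

Affiliations: ETH Zurich

AI summary: This paper studies how, once benchmarks and training datasets are saturated, fine-grained quality signals beyond binary correctness (pairwise LLM self-judgments and token-level entropy) can still improve downstream performance. On a simple arithmetic task the quality signals substantially outperform SFT, but on more complex tasks the gains are limited and require careful calibration.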

Comments 25 pages, 5 figures

Abstract

The growing capabilities of large language models (LLMs) have led to the saturation of many benchmarks and training datasets used to improve them. Motivated by this, we investigate whether questions solved with perfect empirical accuracy can nevertheless be used to improve downstream performance. To do so, we replace binary correctness with two sources of more fine-grained quality signals: (1) pairwise LLM self-judgments, in which the model evaluates the relative quality of its own solutions, and (2) token-level entropy, where token-level uncertainty is used as a proxy for solution quality. We incorporate these signals into several training algorithms and evaluate them on Qwen3-1.7B-Base. When training exclusively on a simple arithmetic task, quality-based signals improve performance by up to $18.6\%$ over the base model, substantially outperforming SFT. On GSM8K, however, gains are more modest and depend strongly on the quality signal. For instance, self-judgments show poor agreement with a stronger external judge and can even degrade performance below the base model. Overall, our results suggest that quality-based training can extract useful signal from saturated questions for base models, but that applying such signals to more complex tasks requires careful calibration and further study.
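
The token-level entropy signal can be illustrated with a minimal Python sketch. It assumes per-token probability distributions are already available and that lower mean entropy is read as higher quality; the abstract only says that token-level uncertainty serves as a proxy, so the averaging rule, the function names, and the toy distributions are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(step_distributions):
    """Average per-token entropy over one generated solution."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def rank_by_confidence(solutions):
    """Order solution ids from lowest to highest mean entropy.

    Assumption: lower entropy is treated as higher quality.
    """
    return sorted(solutions, key=lambda sid: mean_token_entropy(solutions[sid]))

# Two hypothetical solutions to the same question, both already correct.
solutions = {
    "confident": [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]],
    "hesitant": [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]],
}
ranking = rank_by_confidence(solutions)
```

A ranking like this could feed a preference-style objective even when every sampled solution is correct, which is the saturated setting the abstract targets.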


2606.01435 2026-06-02 cs.AI cs.CL cs.IR

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

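Vikas Reddy, Sumanth Challaram

Affiliations: IIT Kgp

AI summary: To address poor fact-conflict resolution in LLM-based memory systems, the paper replaces LLM judgment with deterministic aggregation (candidate extraction plus Python max(serial)), gaining 10.8 points on the single-hop task, and extends the approach to multi-hop tasks.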

Abstract

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.
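
The deterministic aggregation step can be sketched in a few lines of Python. This is an illustration only: it assumes candidate extraction has already produced (serial, value) pairs for the queried fact, and the `Candidate` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    serial: int  # position of the fact in the numbered stream
    value: str   # value that the fact asserts

def resolve(candidates):
    """Return the value with the largest serial, or None if nothing was extracted."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.serial).value

# Hypothetical candidates extracted for one question; a larger serial means a newer fact.
extracted = [Candidate(12, "Paris"), Candidate(431, "Lyon"), Candidate(87, "Nice")]
answer = resolve(extracted)
```

Because the choice is a plain `max`, the resolver itself cannot drift with prompt wording or sampling temperature. Any remaining errors would come from the extraction step that builds the candidate list.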


2606.01434 2026-06-02 cs.CL

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Qing Wang, Bo Li, Jialu Liang, Daling Shi, Bob Zhang, Qianqian Song

Affiliations: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida; PAMI Research Group, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau

AI summary: Presents DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills through a reflection-driven state-machine workflow, and DrugAudit, a 3,772-item authority-aware benchmark. DrugClaw ranks first across the reported benchmarks.

Abstract

Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).
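
For readers unfamiliar with inter-judge agreement, here is a minimal Python sketch of Cohen's kappa for two judges. The abstract reports kappa = 0.88 without naming the variant, so the choice of Cohen's kappa is an assumption, and the labels below are made up for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical verdicts from two LLM judges on eight answers.
judge_a = ["ok", "ok", "bad", "ok", "bad", "ok", "bad", "ok"]
judge_b = ["ok", "ok", "bad", "ok", "bad", "bad", "bad", "ok"]
kappa = cohen_kappa(judge_a, judge_b)
```

In this toy case the judges agree on 7 of 8 items (0.875 observed) against 0.5 expected by chance, which gives kappa = 0.75.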


2606.01432 2026-06-02 cs.LG eess.IV eess.SP stat.ML

Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks

Parastoo Farajpoor, Alireza Pourreza, Mohammadreza Narimani, Ashraf El-Kereamy, Matthew W. Fidelibus

Affiliations: Digital Agriculture Laboratory, Department of Biological and Agricultural Engineering, University of California, Davis, CA, USA; Department of Botany and Plant Sciences, University of California, Riverside, CA, USA; Department of Viticulture and Enology, University of California, Davis, CA, USA

AI summary: For specific crops such as grapevines, proposes a trait-to-spectra prediction model based on a multi-head attention neural network. On a grapevine dataset it reaches R² = 0.84 and NRMSE = 1.52%, with lower error than the conventional radiative transfer model PROSPECT-PRO.

Comments 8 pages, 5 figures. Author-accepted version of the SPIE conference paper

Journal ref Proc. SPIE 13475, 134750V (2025)


DOI: 10.1117/12.3061298

Abstract

Accurate modeling of leaf spectral reflectance from physiological and biochemical traits is essential for advancing remote sensing applications in plant science and precision agriculture. Widely used radiative transfer models, such as PROSPECT-PRO, rely on generalized trait-reflectance relationships developed from a wide range of species, which may not fully capture the spectral behavior of specific crops like grapevines. In this study, we developed a trait-to-spectra prediction model using a multi-head attention neural network trained on a grapevine-specific dataset that includes 16 leaf traits measured across multiple varieties, growth stages, and years. The model was evaluated using stratified 5-fold cross-validation and achieved an average coefficient of determination (R^2) of 0.84 and normalized root mean squared error (NRMSE) of 1.52 percent, demonstrating high accuracy and generalizability. When compared to PROSPECT-PRO in forward mode, the neural network exhibited lower mean absolute error (MAE), especially in the near-infrared (NIR) and shortwave-infrared (SWIR) regions. These results emphasize the importance of species-specific modeling approaches and show that integrating biochemical and structural traits into data-driven architectures can significantly improve spectral prediction. The proposed model provides a robust framework for generating accurate leaf-level reflectance data, with potential applications in canopy trait retrieval, vineyard monitoring, and remote sensing-driven crop management.
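
The two reported metrics can be computed as in the Python sketch below. The abstract does not say how NRMSE is normalized, so dividing by the range of the observed values is an assumption here (dividing by the mean is another common convention), and the reflectance values are invented.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def nrmse_percent(y_true, y_pred):
    """RMSE divided by the observed range, in percent (normalization is an assumption)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return 100.0 * math.sqrt(mse) / (max(y_true) - min(y_true))

# Hypothetical measured vs. predicted reflectance at five wavelengths.
measured = [0.10, 0.20, 0.30, 0.40, 0.50]
predicted = [0.12, 0.18, 0.33, 0.39, 0.48]
r2 = r_squared(measured, predicted)
nrmse = nrmse_percent(measured, predicted)
```

In a cross-validated setting such as the stratified 5-fold protocol in the abstract, these would be computed per fold and then averaged.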


2606.01425 2026-06-02 cs.LG

Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization

Gishnu Madhu, Feng Liu, Souma Chowdhury

Affiliations: Department of Computer Science and Engineering; Department of Mechanical and Aerospace Engineering

AI summary: Proposes a graph-neural-network-based directed-graph abstraction that maps the combinatorial space to a directed graph, to improve search efficiency in mixed-combinatorial nonlinear programming problems.

Comments Accepted for presentation at 2026, ASME IDETC

Abstract

Mixed-combinatorial nonlinear programming (MCNLP) problems arise in many engineering design and planning applications, e.g., due to categorical, component, and geometric design choices, as well as joint task and motion planning. Traditional representations of combinatorial spaces, such as integer or binary encoding, often introduce spurious relations, increase dimensionality, and require additional compatibility constraints. Instead, this paper draws on recent developments in robot planning and vehicle/network routing domains that aim to learn search heuristics over combinatorial spaces using graph neural networks (GNNs). More specifically, this paper presents a first-of-its-kind structured abstraction of the combinatorial space by learning a mapping from an undirected fully connected graph of combinations to a directed graph indicating improvement directions using an Edge Field Graph Network (EFGN). To demonstrate the utility of this new way of abstracting the combinatorial space in solving MCNLPs, we adopt a recent optimization framework that purely searches over the non-combinatorial (e.g., continuous) variables and retrieves the best-suited combination for each candidate design by using the abstraction model, akin to a recommender system. The presented direction-aware abstraction model provides a potentially more scalable and interpretable retrieval of combinations compared to the original recommendation system in that framework. For evaluation, the proposed method is integrated with well-known particle swarm optimization and genetic algorithm solvers on three benchmark nonlinear problems with varying numbers of combinations and variables. Compared to baseline solvers using indexified combinations, the GNN-based recommender consistently achieves better mean optimum values and robustness across multiple runs.


2606.01421 2026-06-02 cs.LG

Target localization, identification and sensing using latent symmetries

David Dukov, Malte Röntgen, Bryn Davies

Affiliations: Mathematics Institute, University of Warwick; Eastern Institute for Advanced Study; Eastern Institute of Technology Ningbo, Zhejiang, China

AI summary: Uses arrays of scatterers designed with latent symmetries as sensors. By analysing the degree of symmetry breaking, combined with Bayesian inference or artificial neural networks, the method identifies the radius and localizes the position of an intruding scatterer.

Comments Submitted to SIAM Journal on Imaging Sciences

2606.01419 2026-06-02 cs.CV

DENSER: Depth-Guided Ensemble with Staged EFA-GS Reconstruction for Soccer Novel View Synthesis

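Parthsarthi Rawat

Affiliations: GameChanger by Dick’s Sporting Goods

AI summary: Proposes DENSER, which uses a depth-guided ensemble and staged EFA-GS reconstruction, combining camera-height loss weighting, monocular depth supervision, and a three-model pixel-averaging ensemble to improve novel view synthesis quality for soccer scenes.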

Comments CVPR 2026 SoccerNet Novel View Synthesis Challenge, Rank 1

2606.01417 2026-06-02 cs.AI

GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway

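Ahmet Kaplan

Affiliations: Turkey's e-Government Gateway

AI summary: To address the lack of structured technical governance infrastructure on Turkey's e-government platform, proposes GovAI-Pipe, a four-layer AI governance pipeline built with Design Science Research methodology. It maps the AI model lifecycle to governance checkpoints and is demonstrated on high-risk use cases as an auditable technical implementation.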

Comments 7 pages

Abstract

Turkey's e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasingly integrating artificial intelligence into citizen-facing applications such as chatbot assistants and eligibility assessments. However, no structured technical governance infrastructure currently connects high-level AI policy frameworks, such as the EU AI Act, OECD AI Principles, and Turkey's own National AI Strategy, to the operational reality of deploying AI within a centralized e-government platform. We propose GovAI-Pipe, a four-layer governance pipeline designed using Design Science Research methodology that maps the AI model lifecycle to governance checkpoints: (1) pre-deployment validation for bias testing, explainability, and privacy impact assessment; (2) deployment governance for risk-tier classification and approval workflows; (3) runtime monitoring for drift detection, fairness tracking, and human-in-the-loop escalation; and (4) post-incident governance for audit trails, rollback, and citizen redress. Each layer is anchored to specific provisions of the EU AI Act, the GDPR data protection framework, and the National AI Strategy. We demonstrate the framework through two high-risk e-Devlet use cases, showing how GovAI-Pipe operationalizes governance principles as auditable, technical pipeline components.


2606.01416 2026-06-02 cs.AI

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

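Rahul Suresh Babu, Adarsh Agrawal

Affiliations: Independent Researcher; Senior Member, IEEE

AI summary: Proposes a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem, mapping failure signals to failure classes, selecting recovery actions, and verifying trajectories. On a 100-task fault-injection benchmark it reaches 98.8% task success, ahead of the retry-only and full-replanning baselines.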

Abstract

Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8% task success, compared with 94.5% for retry-only and 93.8% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0% versus 85.3% and 88.2%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.
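
The control loop described in the abstract (signal, inferred failure class, targeted recovery under a budget, verification, trace) might look roughly like the Python sketch below. The failure classes, recovery action names, and the `step`/`verify` callables are hypothetical placeholders, not the paper's taxonomy or API.

```python
# Hypothetical mappings; the paper's actual failure classes and actions are not given here.
FAILURE_CLASS = {
    "timeout": "transient",
    "schema_error": "malformed_args",
    "stale_hit": "stale_context",
}
RECOVERY = {
    "transient": "retry",
    "malformed_args": "repair_args",
    "stale_context": "refresh_context",
}

def run_with_recovery(step, verify, budget):
    """Call step(action) until the output verifies or the recovery budget is spent.

    step returns (output, signal); signal is None when no failure was observed.
    Returns (output or None, trace of (action, outcome) pairs).
    """
    trace = []
    action = "initial"
    attempts_left = budget
    while True:
        output, signal = step(action)
        if signal is None and verify(output):
            trace.append((action, "verified"))
            return output, trace
        # A clean signal with a failed check is a silent failure: class "unknown".
        failure = FAILURE_CLASS.get(signal, "unknown")
        trace.append((action, failure))
        if attempts_left == 0:
            return None, trace
        attempts_left -= 1
        action = RECOVERY.get(failure, "replan")

def flaky_tool(action):
    """Toy tool that only succeeds once its arguments have been repaired."""
    if action == "repair_args":
        return "42", None
    return None, "schema_error"

result, trace = run_with_recovery(flaky_tool, verify=lambda out: out == "42", budget=2)
```

Here the first call fails with a schema error, the loop picks the matching repair action instead of a blind retry, and the verifier accepts the second output, so only one of the two budgeted recovery attempts is used.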


2606.01414 2026-06-02 cs.CV

Agent Skills Should Go Beyond Text: The Case for Visual Skills

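Binxiao Xu, Ruichuan An, Bocheng Zou, Hang Hua

Affiliations: Peking University; University of Wisconsin; MIT-IBM Watson AI Lab

AI summary: Argues that existing skill-learning methods store only textual experience, which creates a bottleneck on visual tasks. Proposes a multimodal skill paradigm that combines textual logic with visual support, plus an automatic system that converts experience into reusable visual skills, which consistently outperform text-only skills on GUI and other visual-centric tasks.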

Abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf{\NAME}, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \textbf{\SYSTEM}, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.


2606.01412 2026-06-02 cs.LG cs.IT math.IT

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

Shihao Zhang, Rayan Saab

Affiliations: Department of Mathematics, University of California San Diego; Department of Mathematics and Halıcıoğlu Data Science Institute, University of California San Diego

AI summary: Proposes GPTQ-intrinsic LoRA, which folds the low-rank correction directly into the GPTQ quantization pass, supports its near-optimality with information-theoretic lower bounds, and improves over existing methods on language and vision models.

Abstract

Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice $L=V_r$, where $V_r$ contains the top right singular vectors of $X$, we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on $\|X\|_F^2$ is replaced by the rank-$r$ residual $\|X-X_r\|_F^2$, up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.
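
One way to see why a rank-$r$ residual of $X$ can replace $\|X\|_F^2$ is the following short derivation. It treats the simpler case of a fixed $Q$ with post-hoc compensation, not the paper's augmented-Hessian algorithm or its proof. The notation $E := W - Q$, the SVD $X = U\Sigma V^\top$, and $X_r = U_r\Sigma_r V_r^\top$ are introduced here for illustration, and $X$ is assumed to have at least $r$ nonzero singular values.

```latex
% Notation (introduced here): E := W - Q, X = U \Sigma V^\top, X_r = U_r \Sigma_r V_r^\top.
\begin{align*}
\min_{R}\ \|XE - XV_rR\|_F^2
  &= \min_{R}\ \|XE - U_r\Sigma_r R\|_F^2
  && \text{since } XV_r = U_r\Sigma_r, \\
R^\star &= \Sigma_r^{-1}U_r^\top XE = V_r^\top E
  && \text{since } U_r^\top X = \Sigma_r V_r^\top, \\
XE - XV_rR^\star &= X(I - V_rV_r^\top)E = (X - X_r)E, \\
\|XW - X(Q + V_rR^\star)\|_F^2 &= \|(X - X_r)E\|_F^2 \le \|X - X_r\|_F^2\,\|E\|_F^2 .
\end{align*}
```

Without the low-rank term the analogous bound is $\|XE\|_F^2 \le \|X\|_F^2\,\|E\|_F^2$, so choosing $L = V_r$ removes the top-$r$ singular directions of $X$ from the error.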


2606.01402 2026-06-02 cs.LG cs.AI

Neural Network Compression by Approximate Differential Equivalence

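Ravi Dhiman, Andrea Passarella, Mirco Tribastone, Lorenzo Valerio

Affiliations: IMT School for Advanced Studies Lucca; IIT CNR

AI summary: Proposes compressing neural networks by aggregating functionally similar neurons. The network is encoded as a polynomial ODE system and lumped via approximate forward differential equivalence, giving a smooth trade-off between model size and accuracy.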

Comments 19 pages, 4 figures

Abstract

Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We propose a complementary approach that compresses models by aggregating neurons with similar functional behavior rather than removing weights independently. Our method encodes a trained network as a polynomial ODE system and applies a lumping method called Approximate Forward Differential Equivalence to identify neurons with approximately matching induced dynamics. A single tolerance parameter, $\varepsilon$, controls the compression level and induces a smooth trade-off between model size and predictive accuracy. We evaluate the method on synthetic datasets derived from nonlinear dynamical systems with known ground-truth behavior and on public regression benchmarks. Across both settings, the proposed approach achieves substantial parameter reduction while preserving accuracy, and consistently compares favorably with magnitude-based pruning and Wanda at similar compression levels. These results suggest that differential equivalence-based aggregation is a principled and effective alternative to conventional weight-centric pruning.
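
The contrast between removing weights and aggregating neurons can be shown with a deliberately simplified Python sketch. It does not implement the ODE encoding or Approximate Forward Differential Equivalence. It only merges hidden neurons of a one-hidden-layer ReLU network whose incoming weights and bias agree within a tolerance `eps`, summing their outgoing weights. The network layout, the ReLU activation, and all names are assumptions for illustration.

```python
def merge_similar_neurons(neurons, eps):
    """Merge hidden neurons whose incoming parameters differ by at most eps.

    neurons: list of (w_in: tuple of floats, bias: float, w_out: float).
    The first neuron of each group is kept as representative and the
    outgoing weights of the group are summed onto it.
    """
    merged = []
    for w_in, bias, w_out in neurons:
        for i, (m_in, m_bias, m_out) in enumerate(merged):
            gap = max(abs(a - b) for a, b in zip(w_in + (bias,), m_in + (m_bias,)))
            if gap <= eps:
                merged[i] = (m_in, m_bias, m_out + w_out)
                break
        else:
            merged.append((w_in, bias, w_out))
    return merged

def forward(neurons, x):
    """Scalar output of a one-hidden-layer ReLU network."""
    return sum(
        w_out * max(0.0, sum(w * xi for w, xi in zip(w_in, x)) + bias)
        for w_in, bias, w_out in neurons
    )

# Hypothetical network: the first two neurons compute the same hidden feature.
network = [((1.0, -2.0), 0.5, 0.25), ((1.0, -2.0), 0.5, 0.75), ((0.0, 1.0), -1.0, 2.0)]
compact = merge_similar_neurons(network, eps=0.0)
x = (2.0, 0.5)
```

With `eps=0.0` the merge is exact, so the compact network returns the same output as the original. Raising `eps` merges more neurons at the cost of an approximation error, which mirrors the size-versus-accuracy trade-off the tolerance parameter controls in the abstract.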


2606.01400 2026-06-02 cs.CL cs.AI

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Denica Kjorvezir, Marko Djukanović, Ana Gjorgjevikj, Gjorgjina Cenikj, Tome Eftimov

Affiliations: Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia; Center for Astrophysics and Cosmology, University of Nova Gorica, Nova Gorica, Slovenia

AI summary: Proposes a prompt-selection framework based on maximum independent sets on similarity graphs. By selecting a diverse, non-redundant subset of prompts, it substantially cuts benchmarking cost while keeping LLM rankings consistent.

Abstract

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25--48% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95% of configurations, concentrated at low thresholds ($p_{10}$--$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
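
The selection step can be illustrated with a small greedy heuristic in Python. It takes an already-built graph, so it makes no claim about how edges are thresholded, and it uses a generic minimum-degree-first rule that is not necessarily the GREEDY solver evaluated in the paper. The prompt ids and edges are hypothetical.

```python
def greedy_independent_set(adjacency):
    """Greedy independent set: repeatedly take a minimum-degree node.

    adjacency: dict mapping node -> set of neighbours (symmetric).
    Ties are broken by node name so the result is deterministic.
    """
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    chosen = []
    while remaining:
        node = min(remaining, key=lambda n: (len(remaining[n]), n))
        chosen.append(node)
        removed = remaining[node] | {node}
        # Drop the chosen node and its neighbours, then clean up dangling edges.
        for r in removed:
            remaining.pop(r, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return chosen

# Hypothetical benchmark graph: an edge marks two prompts as redundant with each other.
graph = {
    "p0": {"p1", "p2"},
    "p1": {"p0", "p2"},
    "p2": {"p0", "p1"},
    "p3": {"p4"},
    "p4": {"p3"},
}
subset = greedy_independent_set(graph)
```

No two selected prompts share an edge, so each cluster of mutually redundant prompts contributes one representative. Exact solvers such as CPLEX would instead search for the largest such subset.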


2606.01399 2026-06-02 cs.CV

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

Heyuan Gao, Bangxun Tang, Yiren Song, Guian Fang, Zijian He, Jie Yang, Mike Zheng Shou

Affiliations: Utopai Studios; Nanyang Technological University; University of California, Irvine; Show Lab, National University of Singapore

AI summary: Proposes PAI-Studio, a reference-conditioned video synthesis task built on a Diffusion Transformer backbone. Bidirectional attention handles foreground dynamics and the background reference jointly, yielding motion-consistent background generation, high-fidelity foreground relighting, and identity preservation.

Abstract

We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting and foreground identity preservation, often resulting in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.


2606.01398 2026-06-02 cs.RO

A Sonar-Visual Dataset for Cross-Modal Underwater Robot Perception

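Weitung Chen, Phil Tinn, Per Gunnar Auran, Martin Ludvigsen, Peter Halland Haro

Affiliations: Massachusetts Institute of Technology; SINTEF; Norwegian University of Science and Technology

AI summary: Presents the SOVIS dataset of more than 76,000 paired sonar-visual frames, with an end-to-end pipeline that cleans and synchronizes the data and an interactive annotation tool that speeds up labeling. A cross-modal fish detection task shows a 7x improvement in mAP@0.10.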

Comments 6 pages, 7 figures, 3 tables. Accepted to IEEE ICRA 2026 S2S Workshop (From Sea to Space: Advancing Perception in Harsh Domains)

Abstract

Underwater robots typically use both cameras and sonar for perception to leverage the rich semantic details of vision and the robust range measurements of acoustics. However, learning to map between these modalities via cross-modal prediction remains underexplored due to limited sonar-visual paired datasets. We present SOVIS, a sonar-visual dataset for cross-modal underwater perception. SOVIS comprises over 76,000 paired frames collected across 17 dives at six sites in the Trondheimfjord, supported by an end-to-end pipeline that cleans and synchronizes the cross-modal sensor data. We also introduce an interactive annotation tool designed to accelerate the labeling process for this paired data. Finally, we demonstrate a proof-of-concept cross-modal fish detection task using a small subset of labeled data, achieving a 7x improvement in mAP@0.10 over a monocular camera baseline. SOVIS serves as the first step toward advancing cross-modal underwater perception research, enabling research directions such as dense sonar prediction from monocular images.
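
To make the mAP@0.10 figure concrete, the Python sketch below shows the usual matching test behind such a metric. It assumes that "@0.10" denotes an intersection-over-union (IoU) threshold of 0.10, which is the common convention but is not stated in the abstract; the boxes are invented.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, truth, threshold=0.10):
    """A detection counts as correct when its IoU with the ground truth reaches the threshold."""
    return iou(pred, truth) >= threshold

# Hypothetical ground-truth fish box and a loosely localized detection.
truth = (0.0, 0.0, 10.0, 10.0)
detection = (5.0, 5.0, 15.0, 15.0)
overlap = iou(detection, truth)
```

The detection overlaps the ground truth with IoU 25/175, about 0.14, so it counts at a 0.10 threshold but would be rejected at the stricter 0.5 often used for camera-only benchmarks.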

2606.01397 2026-06-02 cs.RO cs.LG cs.SY eess.SY

Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision

基于HJB启发有限动作风险滤波的保持自动驾驶仪的残差Q学习用于固定翼无人机指令监督

Mehmet Iscan, Batuhan Temiz

发表机构 * PythaLab, Yildiz Technical University, Istanbul, Turkey（耶尔德兹技术大学PythaLab实验室，伊斯坦布尔，土耳其）； Turkish Aerospace (TUSAŞ), Ankara, Turkey（土耳其航空航天（TUSAŞ），安卡拉，土耳其）

AI总结提出一种保持自动驾驶仪的残差指令监督框架，通过HJB方程启发的半离散值迭代评价器和控制Lyapunov/屏障函数启发的有限动作屏蔽，选择有限有界动作集中的残差，显著降低路径跟踪误差。

Comments 47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark, source code and a demonstration video are linked in the paper

AI中文摘要

固定翼无人机必须在风、阵风和湍流下保持空速、高度和航向参考，而这些通道相互耦合，纠正其中一个可能使另一个恶化。经典自动驾驶仪能很好地稳定机身，但在强侧风遇到激进转弯时适应能力差，而直接作用于舵面的强化学习（RL）策略将探索风险集中在执行器接口。我们将一个学习型监督器置于未经改动的自动驾驶仪之上，而不是其内部：它从作用于指令空速、高度和航向的有限有界动作集中选择一个残差；修改后的参考在到达自动驾驶仪之前被投影到允许的指令包络内，自动驾驶仪仍然是唯一面向执行器的控制器。新颖之处在于残差的选择方式。HJB残差使用秉承Hamilton-Jacobi-Bellman（HJB）方程思想的半离散值迭代评价器对候选动作评分，按相对于无操作的哈密顿优势对其排序，并通过受控制Lyapunov函数和控制屏障函数启发的有限动作屏蔽进行过滤，该屏蔽始终保留无操作回退。在固定被控对象、自动驾驶仪和执行器模型的共享12状态运行时上（因此比较是在整体方案层面进行的），HJB残差将平均均方根路径跟踪误差降低到44.809米，而基线自动驾驶仪为338.617米，表格Q残差为88.809米，相比基线降低86.77%，相比Q学习降低49.54%。增益集中在基线表现最差的区域，但伴随着可测量的空速误差上升，因此没有任何方法在所有指标上占优。我们提出这种保持自动驾驶仪的残差指令监督设计，并给出完整报告其权衡的基准。

英文摘要

A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.
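
The two percentage reductions quoted in the abstract follow directly from the three reported mean RMS errors. The short check below is only a worked arithmetic example using the figures in the abstract; the variable and function names are illustrative and do not come from the paper's code.

```python
# Mean RMS path-tracking errors quoted in the abstract, in metres.
baseline_autopilot = 338.617
tabular_q_residual = 88.809
hjb_residual = 44.809


def relative_reduction(reference: float, value: float) -> float:
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference


vs_baseline = relative_reduction(baseline_autopilot, hjb_residual)
vs_q_learning = relative_reduction(tabular_q_residual, hjb_residual)

print(f"{vs_baseline:.2f}% lower than the baseline autopilot")
print(f"{vs_q_learning:.2f}% lower than the tabular-Q residual")
```

Both values round to the percentages stated in the abstract (86.77% and 49.54%).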

2606.01394 2026-06-02 cs.CL

UniD$^3$: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning

UniD$^3$：一种用于药物-疾病发现与推理的知识图谱增强RAG框架

Qing Wang, Tianshi Liu, Minghao Zhou, Jialu Liang, Sen Guo, Guangyu Wang, Jing Su, Qianqian Song

发表机构 * Department of Health Outcomes and Biomedical Informatics, University of Florida（佛罗里达大学健康成果与生物医学信息学系）； Department of Hematology, H. Lee Moffitt Cancer Center and Research Institute（H. Lee Moffitt癌症中心与研究所血液科）； Center for Bioinformatics and Computational Biology, Houston Methodist Research Institute（休斯顿卫理公会研究所生物信息学与计算生物学中心）； Department of Cardiothoracic Surgery, Weill Cornell Medicine, Cornell University（康奈尔大学威尔康奈尔医学院心胸外科）； Department of Biostatistics and Health Data Science, Indiana University School of Medicine（印第安纳大学医学院生物统计学与健康数据科学系）

AI总结提出UniD$^3$框架，结合大语言模型与知识图谱增强检索生成（KG-RAG），从生物医学文献中提取、组织和验证药物-疾病知识，生成结构化数据集并提升推理可靠性。

AI中文摘要

药物-疾病关系的系统表征对于药物发现和重定位至关重要，但受限于生物医学文献的异质性和快速增长。现有数据集依赖劳动密集型整理且常不完整，而仅使用大语言模型的方法存在幻觉和证据基础薄弱的问题。我们提出UniD$^3$，一个统一框架，将大语言模型与知识图谱增强的检索增强生成（KG-RAG）相结合，在药物-疾病匹配（DDM）、药物有效性评估（DEA）和药物-靶点分析（DTA）中提取、组织和验证药物-疾病知识。UniD$^3$使用Llama 3.3-70B处理157,849篇PubMed文章，并通过双阶段策略构建知识图谱，该策略结合了论文级提取和以药物和疾病实体为中心的KG级整合。这些图谱支持基于KG-RAG的结构化数据集生成，并通过外部基准测试、与精选资源的模糊匹配以及临床医生评审进行评估。UniD$^3$生成了六个知识图谱和大规模数据集，包括28,915个DDM、15,042个DEA和超过4,000个DTA问答对。外部验证显示性能强劲（DDM/DEA的F1为0.85-0.87；DTA为0.82），临床医生评审确认高可靠性（AUROC = 0.90）。KG-RAG增强模型优于独立大语言模型，UniD$^3$聊天机器人支持可解释、有引用的药物-疾病关系探索。UniD$^3$提供了一个可伸缩、可扩展的框架，用于将非结构化生物医学文献转化为高质量、结构化的药物-疾病知识，支持AI驱动的发现、重定位和精准医学。

英文摘要

Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and weak evidence grounding. We introduce UniD$^3$, a unified framework that integrates Large Language Models with Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) to extract, organize, and validate drug-disease knowledge across Drug-Disease Matching (DDM), Drug Effectiveness Assessment (DEA), and Drug-Target Analysis (DTA). UniD$^3$ processes 157,849 PubMed articles with Llama 3.3-70B and constructs knowledge graphs via a dual-stage strategy combining paper-level extraction with KG-level consolidation centered on drug and disease entities. These graphs support KG-RAG-based generation of structured datasets, evaluated through external benchmarks, fuzzy matching with curated resources, and clinician review. UniD$^3$ produces six knowledge graphs and large-scale datasets, including 28,915 DDM, 15,042 DEA, and over 4,000 DTA QA pairs. External validation shows strong performance (F1: 0.85-0.87 for DDM/DEA; 0.82 for DTA), with clinician review confirming high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs, and the UniD$^3$ chatbot enables interpretable, citation-supported exploration of drug-disease relationships. UniD$^3$ provides a scalable, extensible framework for transforming unstructured biomedical literature into high-quality, structured drug-disease knowledge, supporting AI-driven discovery, repurposing, and precision medicine.

2606.01393 2026-06-02 cs.CL cs.AI cs.CV

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Dr. DocBench：专家级与困难文档解析的综合基准

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi, Konwoo Kim, Longtian Ye, Xiaolong Luo, Jinhe Bi, Henry Zhang, Haris Riaz, Xuan Zhang, Yunze Xiao, Bangya Liu, Tom Tang, Yunfei Zhao, Qunshu Lin, Zihan Wang, Minghao Liu, Michael Lingzhi Li, Yilun Du, Jesse Thomason, Rogerio Feris, Alex Pentland, Zexue He

发表机构 * Stanford University（斯坦福大学）； MIT（麻省理工学院）； Carnegie Mellon University（卡内基梅隆大学）； University of Southern California（南加州大学）； Harvard University（哈佛大学）； IBM Research（IBM研究院）； University of Arizona（亚利桑那大学）； Duke University（杜克大学）； UC Berkeley（加州大学伯克利分校）； LMU Munich（慕尼黑路德维希-马克西米利安大学）

AI总结提出Dr. DocBench基准，通过基于解析器失败的采样从多语言书籍语料库中选取挑战性文档，包含52个BISAC主题领域和65k高质量标注，用于评估专家级文档解析能力。

Comments 27 pages, 13 figures, 14 tables

AI中文摘要

文档解析和识别是视觉语言模型（VLM）和文档处理系统的基本能力。然而，现有的光学字符识别（OCR）和文档解析基准在覆盖范围和难度上日益受限：许多基准专注于常见文档类型或均匀采样的页面，现代解析器在这些页面上已表现良好，而对专家领域结构（如化学公式、乐谱、复杂表格和跨页布局）的标注有限。我们引入了Dr. DocBench，一个面向专家级文档解析的难度感知基准。Dr. DocBench基于大规模多语言书籍语料库构建，涵盖52个BISAC主题领域，并通过基于解析器失败的采样选择挑战性文档，针对多个最先进系统难以处理的案例。它包含来自平均约100页的长文档的4,514个标注页面，具有65k高质量的页面级和块级标注，涵盖布局、阅读顺序、层次关系和特定领域的视觉内容。对基于流水线的解析器和通用VLM的评估表明，在现有基准上的强性能并不能迁移到我们的专家级文档解析中。我们的分析揭示了跨主题、内容类型和结构属性的重大失败，突显了Dr. DocBench作为诊断和推进文档智能的综合测试平台的作用。

英文摘要

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.

2606.01386 2026-06-02 cs.AI cs.CL cs.DC cs.LG

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

GuidaPA: 通过联邦学习为公共行政提供隐私保护的聊天机器人

Daniel M. Jimenez-Gutierrez, Albenzio Cirillo, Raffaele Nicolussi, Alessio Beltrame, Andrea Vitaletti

发表机构 * University of Bologna（博洛尼亚大学）

AI总结提出GuidaPA，一个基于联邦学习（FL）在意大利公共行政文档上训练的隐私保护聊天机器人，通过参数高效的联邦微调（QLoRA）和角色访问控制，在保持数据本地化的同时实现了接近集中式微调的答案质量。

Comments Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)

AI中文摘要

我们提出了GuidaPA，一个为意大利公共行政（PA）设计的隐私保护聊天机器人，它通过联邦学习（FL）在两个国家PA平台SIGESON和SIDFORS的文档上进行训练。我们的语料库包括约8页的SIGESON手册和31页的SIDFORS手册/常见问题解答；虽然本研究使用公开文档作为安全代理，但预期的部署将扩展到受限制的内部来源（例如，工单、官员手册、数据库提取），这些数据由于监管和组织约束无法集中汇集。GuidaPA集成了基于角色的访问控制、安全的客户端预处理、对非独立同分布效应的显式监控以及大语言模型的参数高效联邦微调。使用QLoRA（4位）进行15轮联邦训练，每个客户端采用80/20的训练-测试划分，我们使用ROUGE、BLEU-4和METEOR评估答案质量。最佳联邦模型达到了ROUGE-1/2/L分别为61.10/55.77/59.44，BLEU-4为45.02，METEOR为63.94——接近私有集中式微调的性能，同时保持数据在本地。与通用基线相比，领域微调将ROUGE-1从41.45提高到62.18，BLEU-4从26.97提高到50.90。总体而言，结果表明FL可以在不进行集中数据共享的情况下，为公共服务提供高质量的对话式AI。

英文摘要

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that cannot be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94, close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing.

2606.01382 2026-06-02 cs.LG cs.AI

Efficient Exploration for Iterative Nash Preference Optimization

迭代纳什偏好优化的高效探索

Tianlong Nan, Xiaopeng Li, Christian Kroer, Tianyi Lin

发表机构 * Columbia University（哥伦比亚大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结针对通用偏好模型下的迭代NLHF，提出显式探索算法，结合SFT正则化与对抗性策略探索，实现O(√T)遗憾界，避免对KL正则化参数的指数依赖。

Comments 49 pages

AI中文摘要

偏好对齐是改进大语言模型的核心，但当人类偏好是循环、非传递或无法用标量奖励表示时，标准的基于奖励的公式可能具有限制性。从人类反馈中学习纳什均衡（NLHF）通过将对齐建模为偏好博弈并针对纳什均衡而非奖励最大化来解决这一限制。然而，可扩展NLHF的学习理论基础仍然有限。现有的遗憾保证依赖于基于oracle的方法，这些方法估计一个通用偏好模型并求解KL正则化的极小极大问题，而迭代NLHF方法直接优化策略级别的偏好损失，更易实现但缺乏遗憾保证。我们研究通用偏好模型下的在线迭代NLHF，并确定探索是关键障碍。首先，我们表明标准迭代NLHF可能遭受对KL正则化参数的指数依赖，揭示了通过策略更新进行的隐式探索不足以控制遗憾。其次，我们提出一种显式探索的迭代NLHF算法，结合了基于SFT的正则化与对抗性策略探索。所得方法保留了迭代NLHF的直接策略优化结构，避免了显式偏好模型估计，并实现了$O(\sqrt{T})$的遗憾界，而不依赖于KL正则化参数的指数项。我们表明，通过访问一个极小极大oracle，遗憾可以改进为$O(\log(T))$，阐明了学习通用偏好博弈中的计算-统计权衡。最后，我们将我们的方法实例化用于LLM微调，并在多个基准上对\texttt{Llama-3-8B-Instruct}进行评估，其中显式探索在现有NLHF基线上产生了一致的改进。

英文摘要

Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of scalable NLHF remain limited. Existing regret guarantees rely on oracle-based methods that estimate a general preference model and solve KL-regularized minimax problems, while iterative NLHF methods directly optimize policy-level preference losses and are easier to implement but lack regret guarantees. We study online iterative NLHF under general preference models and identify exploration as the key obstacle. First, we show that standard iterative NLHF can suffer an exponential dependence on the KL-regularization parameter, revealing that implicit exploration through policy updates is insufficient for controlling regret. Second, we propose an explicitly exploratory iterative NLHF algorithm that combines SFT-based regularization with adversarial policy exploration. The resulting method retains the direct policy optimization structure of iterative NLHF, avoids explicit preference model estimation, and achieves an $O(\sqrt{T})$ regret bound without an exponential dependence on the KL-regularization parameter. We show that the regret can be improved to $O(\log(T))$ with access to a minimax oracle, clarifying the computational-statistical tradeoff in learning general preference games. Finally, we instantiate our method for LLM fine-tuning and evaluate it on \texttt{Llama-3-8B-Instruct} across multiple benchmarks, where explicit exploration yields consistent improvements over existing NLHF baselines.

2606.01380 2026-06-02 cs.CV

Training-free image inversion for one-step diffusion models

无需训练的一步扩散模型图像反演

Tao Wu, Senmao Li, Yaxing Wang, Shiqi Yang, Kai Wang, Joost van de Weijer

发表机构 * CVC, University of Alabama in Birmingham（CVC，阿拉巴马大学伯明翰分校）； Machine Intelligence Institute, Masdar Institute of Science and Technology（机器智能研究所，马斯达尔科学技术学院）； Jilin University（吉林大学）； City University of Hong Kong, Department of Geography（香港城市大学地理系）

AI总结提出一种无需训练的反演框架TFinv，通过迭代噪声对齐和后缀学习解决一步扩散模型中真实图像反演与编辑的关键挑战，实现高效编辑。

Comments Accepted to Pattern Recognition

DOI: 10.1016/j.patcog.2026.114063

AI中文摘要

在这项工作中，我们为一步扩散模型引入了一种新颖的无需训练的反演（TFinv）框架，解决了真实图像反演和编辑中的关键挑战。我们首先确定了阻碍真实图像反演和编辑的两个关键因素：（1）初始潜在可编辑性，与初始噪声与理想高斯分布之间的距离有关；（2）描述差距，即文本描述与图像表示之间的对齐。这两个因素都影响一步扩散模型的反演效率和可编辑性。然后，我们提出了两种新颖的技术：迭代噪声对齐（iterNA），它最小化分布差距以与标准高斯分布对齐；以及后缀学习（suffL），它通过引入学习到的后缀提示令牌来增强文本到图像的描述对齐。这些技术能够将输入图像精确反演为其初始噪声表示，并促进图像编辑。此外，我们提出了一种基于掩码的编辑技术，用于局部编辑同时保持背景完整性。在PIE-Bench数据集上的全面实验验证了我们的方法TFinv不仅在一步扩散编辑中实现了最先进的性能，而且在效率上显著优于现有的多步方法。代码可在https://github.com/tttao-uwu/TFinv.git获取。

英文摘要

In this work, we introduce a novel training-free inversion (TFinv) framework for one-step diffusion models, addressing key challenges in real image inversion and editing. We first identify two critical factors hampering real-image inversion and editing: (1) Initial Latent Editability, which is related to the distance between the initial noise and the ideal Gaussian distribution, and (2) Caption Gap, which means the alignment between text captions and image representations. Both factors influence inversion efficiency and the editability of one-step diffusion models. Then, we propose two novel techniques: iterative noise alignment (iterNA), which minimizes the distribution gap to align with the normal Gaussian distribution, and suffix learning (suffL), which enhances text-to-image caption alignment by introducing learned suffix prompt tokens. These techniques enable precise inversion of input images into their initial noise representations and facilitate image editing. Furthermore, we propose a mask-based editing technique for localized edits while preserving background integrity. Comprehensive experiments on the PIE-Bench dataset validate that our method TFinv not only achieves state-of-the-art performance in one-step diffusion editing, but also significantly outperforms existing multistep approaches in efficiency. The code is available at https://github.com/tttao-uwu/TFinv.git.

2606.01374 2026-06-02 cs.LG

From Performance to Viability: A Bootstrap Framework for Latent-Space Representation Learning in Adaptive Biological Systems

从性能到生存力：自适应生物系统中潜在空间表示学习的自举框架

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * Laboratory of Bioengineering and Nanosciences (LBN)（生物工程与纳米科学实验室）； University of Montpellier（蒙彼利埃大学）； EuroMov Digital Health in Motion（EuroMov数字健康运动）； IMT Mines Alès； Certified Sophrologist, Sensorimotor Practice（认证Sophrologist，感觉运动实践）

AI总结针对自适应生物系统中性能相似但组织不同的问题，提出一个五级自举框架，通过逐步引入潜在组织、纵向生存力和内部预测近似，从观测不足中学习更具信息量的表示。

Comments 25 pages. Methodological framework for latent-space representation learning in adaptive biological systems

AI中文摘要

可观测性能通常用于表征生物系统。然而，在自适应系统中，相似性能可能源于不同的组织，且在给定时间看似相似的配置可能遵循不同的纵向轨迹。这一局限性促使我们提出一种方法论框架，以超越基于性能的解释，而无需事先假设完整的机制模型。本文提出了一个用于自适应生物系统中潜在空间表示学习的自举框架。这里的自举是在方法论和认识论意义上使用的：当先前的表示不足以解释观察到的自适应动态时，引入新的分析层次。该框架围绕五个层次组织：可观测性能、动态组织、潜在组织、纵向生存力和内部预测近似。通过三个先前报道的步态-遮挡研究来说明该框架，这些研究仅作为方法论案例序列，而非新的实验证据。本文形式化了性能分析如何导致潜在组织，静态潜在组织如何导致纵向生存力，以及观察到的生存力如何导致内部预测近似。贡献不是新的学习算法、临床协议或数据集，而是一个用于潜在空间表示学习的自举框架，描述了如何从自适应生物数据的观测不足中涌现出更具信息量的表示。

英文摘要

Observable performance is commonly used to characterize biological systems. In adaptive systems, however, similar performances may arise from distinct organizations, and configurations that appear comparable at a given time may follow different longitudinal trajectories. This limitation motivates a methodological framework for moving beyond performance-based interpretation without assuming a complete mechanistic model in advance. This article proposes a bootstrap framework for latent-space representation learning in adaptive biological systems. Here, bootstrap is used in a methodological and epistemological sense: new analytical levels are introduced when the preceding representation becomes insufficient to account for observed adaptive dynamics. The framework is organized around five levels: observable performance, dynamic organization, latent organization, longitudinal viability, and internal predictive approximation. The framework is illustrated by three previously reported gait--occlusion studies, used here only as a methodological case sequence and not as new experimental evidence. The article formalizes how performance analysis led to latent organization, how static latent organization led to longitudinal viability, and how observed viability led to internal predictive approximation. The contribution is not a new learning algorithm, clinical protocol, or dataset, but a bootstrap framework for latent-space representation learning describing how increasingly informative representations can emerge from observational insufficiencies in adaptive biological data.

2606.01372 2026-06-02 cs.LG cs.AI cs.CV

BRo-JEPA: Learning Modular Arithmetic in Latent Space

BRo-JEPA：在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； NYU Langone Health（纽约大学Langone医疗中心）

AI总结本文提出BRo-JEPA模型，通过在潜空间中施加模10算术的循环结构，实现零样本泛化，解决了标准模型无法外推未见操作的问题。

Comments 10 pages, 14 figures
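
One way to picture a cyclic structure for modulo-10 arithmetic in a latent space is to place the ten digits on a unit circle, so that addition becomes a rotation. The sketch below is only an illustration of that idea under this assumption; it is not the BRo-JEPA architecture, and the functions `embed`, `add_in_latent`, and `decode` are hypothetical names introduced here.

```python
import math


def embed(digit):
    """Place a digit on the unit circle at angle 2*pi*digit/10."""
    angle = 2.0 * math.pi * digit / 10
    return (math.cos(angle), math.sin(angle))


def add_in_latent(a, b):
    """Compose two embeddings by complex multiplication (a rotation)."""
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])


def decode(point):
    """Map a point on the circle back to the nearest digit."""
    angle = math.atan2(point[1], point[1 - 1]) % (2.0 * math.pi)
    return round(angle * 10 / (2.0 * math.pi)) % 10


# (7 + 8) mod 10, computed entirely through the circular embedding.
result = decode(add_in_latent(embed(7), embed(8)))
print(result)
```

Because the composition rule is a rotation, it applies unchanged to digit pairs that were never enumerated explicitly, which is the intuition behind extrapolating to unseen operations.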

2606.01367 2026-06-02 cs.RO cs.CV

ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo

ActMVS：基于单目多视图立体的主动场景重建

Guo Pu, Yixuan Han, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University（北京大学王选计算机研究所）

AI总结提出ActMVS框架，通过视图因子图构建和全局深度优化，实现单目相机在线生成高质量、全局一致的密集深度图，支持机器人/UAV的主动场景重建与安全轨迹规划。

Comments ICRA 2026

AI中文摘要

主动场景重建使机器人/UAV能够自主规划轨迹并重建环境，无需昂贵的手动数据采集。与被动方法不同，主动重建需要实时构建高置信度占据地图以实现无碰撞导航。现有方法依赖深度传感器更新占据地图，增加了平台成本和重量。为推进空间智能，我们旨在实现纯视觉单目解决方案。然而，当前单目场景重建方法离线运行，无法在机器人/UAV导航所需的帧率下提供全局一致的密集深度。为弥补这一差距，我们引入ActMVS，这是首个单目主动重建框架。我们的框架集成了用于信息多视图立体深度预测的视图因子图构建，以及全局深度优化，从而实现在线生成高质量、全局一致的密集深度图。这使得单目机器人/UAV能够在重建过程中维护可靠的占据地图，以实现安全的轨迹规划。在Replica数据集上的实验表明，其性能与RGB-D方法相当。我们的代码和数据可在https://github.com/TrickyGo/ActMVS获取。

英文摘要

Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.

2606.01363 2026-06-02 cs.LG cs.SY eess.SY

All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

所有模型都是错的，知道哪里有用：强化学习中的模型不确定性

Bernd Frauenknecht, Devdutt Subhasish, Artur Eisele, Friedrich Solowjow, Sebastian Trimpe

发表机构 * German Federal Ministry of Research, Technology and Space (BMFTR)（德国联邦研究、技术和空间部）； Robotics Institute Germany (RIG)（德国机器人研究所）； Institute for Data Science in Mechanical Engineering, RWTH Aachen University（机械工程数据科学研究所，亚琛工业大学）； NHR Center NHR4CES at RWTH Aachen University（亚琛工业大学NHR4CES中心）

AI总结提出通过针对性处理概率模型的不确定性来减轻模型利用的框架，并展示在硬件直接学习和安全探索方面的成功。

2606.01361 2026-06-02 cs.CV

Diamonds in the Sky: Pareidolic Animals in Clouds

天空中的钻石：云中的空想性动物

Miriam Horovicz, Yacov Hel-Or, Yael Moses

发表机构 * Reichman University, Israel（里奇曼大学，以色列）

AI总结提出基于扩散模型的方法，预测人们可能在云中感知到的空想性动物，并通过生成相似形状的动物图像和变形视频辅助识别。

AI中文摘要

人们常在云中看到动物形状，这种现象被称为空想性错视。我们提出一种基于AI的方法，旨在预测人们可能在云中感知到哪些动物，尽管最先进的识别方法通常无法检测到此类动物。此外，我们引入一种方法帮助个体感知特定的空想性动物，即使他们最初未能识别。我们的方法使用扩散模型将云片段转换为视觉上类似于原始云的动物形状。这种扩散技术的灵感来源于观察：扩散过程仅在目标动物与云形状相似时成功，且微妙的视觉线索通常足以帮助个体识别特定的空想性动物。从扩散模型成功生成的图像随后用于预测空想性动物。此外，使用从生成图像过渡回原始云片段的短变形视频进一步增强人类对空想性动物的感知。

英文摘要

People often see animal shapes in clouds, a phenomenon known as pareidolia. We propose an AI-based method that aims to predict which animals people are likely to perceive in clouds, even though state-of-the-art recognition methods typically fail to detect such animals. Additionally, we introduce a method to assist individuals in perceiving specific pareidolic animals, even if they did not recognize them initially. Our approach uses a diffusion model to transform cloud segments into an animal shape that visually resembles the original cloud. This diffusion technique is inspired by the observation that the diffusion process succeeds only when the target animal resembles the shape of the cloud, and that subtle visual hints often suffice to help individuals recognize specific pareidolic animals. A generated image, successfully derived from the diffusion model, is then used to predict the pareidolic animal. Additionally, a short morphing video transitioning from the generated image back to the original cloud segment is employed to further enhance human perception of the pareidolic animals.

2606.01352 2026-06-02 cs.AI

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

FlowTime: 基于流的个性化先验实现连续生成式观看时间预测

Hongxu Ma, Han Zhou, Chenghou Jin, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University（复旦大学）； Shanghai University of Finance and Economics（上海财经大学）； Kuaishou Technology（快手科技）； Tongji University（同济大学）

AI总结针对现有观看时间预测方法在范式上的局限性，提出连续生成式回归范式及FlowTime方法，利用一步生成变分自编码器和基于流的个性化先验，有效建模多模态用户-物品交互模式，显著提升预测性能。

Comments Accepted by KDD'26

DOI: 10.1145/3770855.3818143

AI中文摘要

观看时间已成为短视频推荐系统中优化深度用户参与度的关键指标。然而，当前的观看时间预测方法存在固有的范式特定局限性。直接回归因单峰高斯假设而面临均值崩溃，序数回归因刚性离散化而受到量化误差的困扰。同样，离散生成式回归则面临高推理延迟和启发式词汇表设计的问题。除了这些具体缺陷外，一个共同的不足是无法捕捉用户-物品交互模式的内在多模态性和异质性。为应对这些挑战，我们首先从因果角度重新审视观看时间预测问题，并将这些用户特定模式识别为调节观看时间结果的结构性混淆因素，其中相同的兴趣在不同用户习惯条件下表现为不同的观看时间结果。然后，我们正式提出一种新的（即第四种）范式——连续生成式回归，并引入FlowTime，一种利用一步生成变分自编码器的新方法。FlowTime有效规避了迭代去噪的延迟，同时保持了连续潜在空间的表达能力。此外，我们设计了一种基于流的个性化先验，利用归一化流将标准高斯先验扭曲为复杂的历史条件流形，从而实现对多模态交互模式的自适应建模。最后，我们构建了TimeRec，首个开源观看时间预测库，并引入一种新的个性化指标，以建立严格的基准测试标准。广泛的离线实验和在线A/B测试表明，FlowTime显著优于现有最先进方法。

英文摘要

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (fourth) paradigm, Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages normalizing flows (NFs) to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over state-of-the-art methods.

2606.01351 2026-06-02 cs.AI

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

识别你的编排器：面向LLM多智能体系统的熵动力学视角

Junze Zhu, Weihao Chen, Xuanwang Zhang, Zhen Wu, Xinyu Dai

AI总结提出平均场熵动力学框架，通过逆工作流生成（IWG）合成高复杂度基准，揭示推理型模型作为编排器时因上下文压缩而失效的“推理陷阱”，为多智能体系统架构设计提供物理可解释参数。

AI中文摘要

从单轮模型到多智能体系统（MAS）的转变有望增强问题解决能力，但集中式编排拓扑仍然是一个关键脆弱点。为分析此问题，我们提出平均场熵动力学框架，将编排过程建模为由任务解决和累积上下文加载两种竞争力量支配的系统。为便于验证，我们引入逆工作流生成（IWG），一种多智能体流水线，用于合成具有密集中间检查点的过程可验证、高复杂度基准。我们证明熵动力学模型拟合经验轨迹，提供量化系统稳定性和性能崩溃的物理可解释参数。关键的是，我们的分析揭示了“推理陷阱”：尽管推理密集型模型在孤立任务中表现出色，但由于上下文压缩，它们作为编排器时经常失败。阐明编排器背后的物理机制并量化系统不确定性，为MAS的架构设计提供了见解。

英文摘要

The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a "Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. Elucidating the physical mechanisms underlying the orchestrator and quantifying systemic uncertainty offers insights for the architectural design of MAS.

2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite：一种轻量级频率分解线性模型，具有自适应可逆归一化，用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * Hamdard University（哈姆达德大学）

AI总结提出FreqLite，一种超轻量级、通道独立的频率分解线性预测器，通过可学习的无损谱滤波器进行频带分解和线性预测，并引入自适应可逆实例归一化（A-RevIN）处理非平稳性，在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

Comments 26 pages, 5 figures

AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色，但仍存在两个问题：可逆实例归一化（RevIN）使用单一回溯统计量对整个预测区间进行去归一化，在非平稳性下不准确；时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite，一种超轻量级、通道独立的频率分解线性预测器：一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带，由每个频带的线性头进行预测，与低通截断方法不同，高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型，在长回溯（L=336）时，其平均误差低于PatchTST Transformer（0.3244 vs 0.3587 MSE），同时参数减少4倍，内存减少2.2倍，在单块4 GB笔记本GPU上每轮时间减少2.2倍；尽管幅度不大，但在所有匹配单元上的配对Wilcoxon检验中，其改进具有统计显著性（p < 1e-5）。我们进一步引入自适应可逆实例归一化（A-RevIN），一种自适应可逆归一化，严格推广了RevIN（在其门关闭时完全恢复），在非平稳性下起作用，并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集（ILI，MSE降低约5%）和一个受控合成漂移扫描中验证了这一点，其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融（Linear和RLinear是FreqLite的特例），所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.
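
The limitation of RevIN named in the abstract is easier to see in code. Below is a minimal pure-Python sketch of standard reversible instance normalization, assuming a single univariate window and omitting the learnable affine parameters; `naive_forecaster` is a hypothetical placeholder model, and the gated A-RevIN variant is not shown because the abstract does not specify its form.

```python
from statistics import fmean, pstdev


def revin_normalize(window, eps=1e-5):
    """Normalize a lookback window by its own mean and standard deviation."""
    mean = fmean(window)
    std = (pstdev(window) ** 2 + eps) ** 0.5
    normalized = [(x - mean) / std for x in window]
    return normalized, (mean, std)


def revin_denormalize(values, stats):
    """Map model outputs back to the original scale with the SAME statistics."""
    mean, std = stats
    return [v * std + mean for v in values]


def naive_forecaster(normalized_window, horizon):
    """Placeholder model: repeat the last normalized value."""
    return [normalized_window[-1]] * horizon


lookback = [10.0, 12.0, 14.0, 16.0]
normalized, stats = revin_normalize(lookback)
forecast = revin_denormalize(naive_forecaster(normalized, horizon=3), stats)
print(forecast)
```

Every step of the horizon is de-normalized with the one `(mean, std)` pair taken from the lookback window, so if the series drifts during the forecast horizon those statistics no longer match, which is the inaccuracy under non-stationarity that the abstract describes.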

2606.01336 2026-06-02 cs.CL

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu

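Affiliation: SambaNova Systems, Inc.

AI summary: Proposes LongAttnComp, which fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. Combined with a two-stage fine-tuning strategy, it achieves accuracy comparable to or better than full-context inference on long-context reasoning tasks.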

Comments Under review

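AI abstract (translated from Chinese)

As real-world applications increasingly need to process inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill cost while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps on demanding long-context tasks such as code reasoning. We propose LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data to cover a broader range of long-context tasks. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.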

English abstract

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
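The abstract names a "token-budget top-p algorithm" and "positional reordering" without spelling them out. The sketch below shows one plausible reading, not the paper's actual procedure: chunks are ranked by relevance score, kept until either a cumulative share `top_p` of the total score is covered or the token budget is exhausted, and then returned in their original document order. The function name `select_chunks`, its parameters, and the rule of skipping chunks that do not fit are all assumptions made for illustration.

```python
def select_chunks(scores, lengths, top_p, token_budget):
    """Return indices of the kept chunks, in original document order.

    scores:       non-negative relevance score per chunk (hypothetical input,
                  e.g. produced by some scoring layer); must not all be zero
    lengths:      token count per chunk
    top_p:        stop once this share of the total score mass is covered
    token_budget: hard cap on the total number of kept tokens
    """
    total = sum(scores)
    # Rank chunks from most to least relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass, used = [], 0.0, 0
    for i in ranked:
        if mass >= top_p:
            break  # enough relevance mass is already covered
        if used + lengths[i] > token_budget:
            continue  # this chunk does not fit; a smaller one still might
        kept.append(i)
        mass += scores[i] / total
        used += lengths[i]
    # Restore document order so the compressed context reads sequentially.
    return sorted(kept)

scores = [2, 10, 1, 6, 1]
lengths = [100, 200, 100, 400, 100]
print(select_chunks(scores, lengths, top_p=0.75, token_budget=1000))  # prints [1, 3]
```

With a generous budget the top-p threshold ends selection after the two highest-scoring chunks. With a tight budget such as `token_budget=500`, the 400-token chunk no longer fits, so this sketch falls back to smaller, lower-scoring chunks instead.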

Learning from Saturated Data: Signals Beyond Correctness for LLM Training

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks

Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization

Target localization, identification and sensing using latent symmetries

DENSER: Depth-Guided Ensemble with Staged EFA-GS Reconstruction for Soccer Novel View Synthesis

GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Agent Skills Should Go Beyond Text: The Case for Visual Skills

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

Neural Network Compression by Approximate Differential Equivalence

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

A Sonar-Visual Dataset for Cross-Modal Underwater Robot Perception

Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision

UniD$^3$: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

Efficient Exploration for Iterative Nash Preference Optimization

Training-free image inversion for one-step diffusion models

From Performance to Viability: A Bootstrap Framework for Latent-Space Representation Learning in Adaptive Biological Systems

BRo-JEPA: Learning Modular Arithmetic in Latent Space

ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo

All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

Diamonds in the Sky: Pareidolic Animals in Clouds

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning