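arXivDaily: arXiv daily research digest, updated Monday through Friday

AI Large Models

RAG / Retrieval-Augmented Generation

Retrieval-augmented generation, vector search, knowledge-base question answering, and search systems built for large language models.

Papers listed for today / the current date: 11. Signal sources: cs.IR, cs.CL, cs.AI, cs.DB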
2606.19960 2026-06-19 cs.IR New submission 90%

Stellar: Scalable Multimodal Document Retrieval for Natural Language Queries

Yuxiang Guo, Zhonghao Hu, Yuren Mao, Yuhang Liu, Congcong Ge, Xiaolu Zhang, Jun Zhou, Yunjun Gao

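Topic match (Retriever): Proposes the Stellar framework for scalable multimodal document retrieval.

AI summary: Proposes Stellar, a framework that stores token-level document embeddings on disk and loads candidate embeddings dynamically. Combining lexical-representation-based filtering with efficient disk-backed late interaction, it reduces memory overhead and query latency by 1-2 orders of magnitude while maintaining retrieval effectiveness.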

Abstract

Multimodal document retrieval--selecting the most relevant multimodal document from a large corpus to answer a natural language query--plays an essential role in Retrieval-Augmented Generation (RAG) systems. State-of-the-art methods represent each document and query with multiple token-level embeddings and use late interaction to achieve high effectiveness. However, such multi-vector representations incur substantial memory overhead during retrieval, leading to poor scalability and hindering real-world deployment. In this paper, we present Stellar, a scalable multimodal document retrieval framework that stores token-level document embeddings on disk and loads only a small set of candidate embeddings into memory for late interaction. Stellar comprises two key components: (i) Lexical Representation-based Filtering (LRF), which fine-tunes a Multimodal Large Language Model (MLLM) as a sparse encoder to produce high-quality lexical representations, enabling efficient and effective document filtering to significantly reduce the candidate set; (ii) Efficient Disk-backed Late Interaction (DLI), which designs an on-disk token embedding storage layout guided by a balanced clustering algorithm, and dynamically loads only the necessary token embeddings into memory using a simple yet effective cost model. Extensive experiments on four real-world benchmarks and a newly presented large-scale dataset demonstrate that Stellar reduces memory overhead and query latency by 1-2 orders of magnitude compared to existing methods without compromising retrieval effectiveness.
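The filter-then-late-interaction flow described in this abstract can be sketched as follows. This is a minimal illustration, not Stellar's implementation: it assumes the common MaxSim form of late interaction (each query token takes its best-matching document token, and the maxima are summed), which the abstract does not spell out. The names `maxsim_score`, `rerank_candidates`, and `load_doc_tokens` are hypothetical; `load_doc_tokens` stands in for whatever reads a candidate's token embeddings from disk.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """MaxSim late interaction (assumed scoring rule): for every query token,
    keep the highest similarity over the document's tokens, then sum.
    doc_tokens must be non-empty."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank_candidates(
    query_tokens: Sequence[Vector],
    candidate_ids: Sequence[str],
    load_doc_tokens: Callable[[str], Sequence[Vector]],  # hypothetical disk reader
    top_k: int,
) -> List[str]:
    """Score only the documents that survived the filtering stage, so token
    embeddings are loaded for the candidates rather than the whole corpus."""
    scored = [
        (maxsim_score(query_tokens, load_doc_tokens(doc_id)), doc_id)
        for doc_id in candidate_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```

In this sketch the candidate list would come from a cheaper first stage (the sparse lexical filter in the abstract), and memory use scales with the number of candidates instead of the corpus size.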

2606.19719 2026-06-19 cs.IR cs.CL cs.LG New submission 90%

Closing the Calibration Gap in Semantic Caching

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

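Affiliations: New York University; Redis

Topic match (Retriever): Studies calibration in semantic caching systems and proposes new metrics.

AI summary: Addresses the gap between offline metrics and deployed performance in semantic caching systems by proposing the P-CHR AUC and CRR metrics. Finds that the calibration gap is governed by the training objective and that model selection is fundamentally a calibration problem.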

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

Abstract

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
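The difference between ranking quality and usability at a fixed threshold can be made concrete with a small sketch. It computes the two quantities a deployed semantic cache exposes at one similarity threshold: the cache hit ratio (share of queries answered from cache) and the precision of those hits. This is not the paper's P-CHR AUC or CRR definition, which the abstract does not give; the function name and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def precision_and_hit_ratio(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[Optional[float], float]:
    """At a fixed threshold, a query is served from cache when its similarity
    score is >= threshold. labels[i] is 1 if the cached response would have
    been correct for query i, else 0.

    Returns (precision of cache hits, cache hit ratio). Precision is None
    when nothing is served from cache."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equal length")
    served = [label for score, label in zip(scores, labels) if score >= threshold]
    hit_ratio = len(served) / len(scores)
    precision = sum(served) / len(served) if served else None
    return precision, hit_ratio


def sweep(scores: List[float], labels: List[int], thresholds: List[float]):
    """Trace (threshold, precision, hit_ratio) points across thresholds."""
    return [(t, *precision_and_hit_ratio(scores, labels, t)) for t in thresholds]
```

Two models with the same score ordering produce the same PR-AUC, yet they can land on very different points of this curve at a threshold chosen in advance, which is the mismatch the abstract describes.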

2606.20113 2026-06-19 cs.CL cs.IR New submission 85%

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

Elroy Galbraith

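Affiliations: SMG Labs

Topic match (Retriever): Analyzes tool-intent stabilization in streaming RAG.

AI summary: Measures tool-intent stabilization (the point at which a speculative query's retrieval converges to the answer-bearing result) and analyzes latency hiding in streaming RAG on the CRAG benchmark. Finds that 73.9% of queries admit substantial latency hiding and identifies predictors of early versus late stabilization.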

Abstract

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, epsilon^2=0.04), directly informing when a learned speculative trigger is worth its cost.

2606.20065 2026-06-19 cs.IR cs.CL cs.CY New submission 85%

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

Pratyush Kumar

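Affiliations: Ranqo

Topic match (Retriever): Studies brand visibility in AI search engines, touching on retrieval and citation mechanisms.

AI summary: Analyzes 100K+ prompt responses to propose a way of measuring brand visibility in AI search engines. Finds that brand maturity forms a three-tier ladder, identifies the most-cited content formats, and shows that sentiment is an unstable signal.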

Comments 14 pages, 4 tables; v1.0 preprint

Abstract

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

2606.19898 2026-06-19 cs.DB cs.IR New submission 85%

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Qianqian Xiong, Mengxuan Zhang

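Topic match (Retriever): Filtered approximate nearest neighbor search, a core RAG technique.

AI summary: Proposes a query-aware routing framework in which a lightweight ML model predicts each candidate method's recall and an offline benchmark table is used to pick the best recall-QPS trade-off, reaching state-of-the-art performance on five unseen datasets.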

Comments 12 pages

Abstract

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall--QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
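The routing step can be sketched in a few lines. The sketch assumes one particular reading of "best recall-QPS trade-off": among configurations whose predicted recall meets a target, take the one with the highest benchmarked QPS, and fall back to the highest predicted recall when none qualifies. That selection rule, the `Config` type, and the method names in the example are assumptions for illustration; the abstract does not state the paper's actual rule.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Config:
    """One row of the offline benchmark table (hypothetical schema)."""
    method: str   # filtered-ANN method name
    params: str   # parameter setting, e.g. a search-width value
    qps: float    # throughput measured offline for this setting


def route(predicted_recall: Dict[Config, float], target_recall: float) -> Config:
    """Pick a configuration for one query.

    predicted_recall maps each candidate configuration to the recall a
    lightweight model predicts for this specific query."""
    if not predicted_recall:
        raise ValueError("no candidate configurations")
    eligible = [c for c, r in predicted_recall.items() if r >= target_recall]
    if eligible:
        # Fastest configuration that is predicted to be accurate enough.
        return max(eligible, key=lambda c: c.qps)
    # Nothing meets the target: fall back to the most accurate prediction.
    return max(predicted_recall, key=lambda c: predicted_recall[c])
```

Because the recall prediction is per query while the QPS figures come from the offline table, two queries against the same dataset and predicate can be routed to different methods.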

2606.19667 2026-06-19 cs.CL New submission 85%

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Kaizhen Tan, Rong Gu, Mingyuan Li

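Affiliations: Heinz College of Information Systems and Public Policy, Carnegie Mellon University

Topic match (Retriever): Cache-aware evidence ordering reduces time-to-first-token in RAG inference.

AI summary: Proposes CacheWeaver, a lightweight prompt-layer method that lowers time-to-first-token in RAG inference through cache-aware evidence ordering, without modifying the serving engine or the evidence set.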

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
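The idea of turning set overlap into prefix overlap can be sketched with a prefix tree over evidence IDs and a greedy walk. This is a simplified reading of the abstract, not CacheWeaver's code: the class and function names are hypothetical, and the tie-breaking rule (follow the first retrieved item that continues a known prefix) is an assumption. The paper's notion of "most reusable" may weigh candidates differently.

```python
from typing import Dict, List


class PrefixTrie:
    """Prefix tree over recently served evidence sequences (evidence IDs)."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTrie"] = {}

    def insert(self, sequence: List[str]) -> None:
        node = self
        for doc_id in sequence:
            node = node.children.setdefault(doc_id, PrefixTrie())


def cache_aware_order(trie: PrefixTrie, retrieved: List[str]) -> List[str]:
    """Reorder the retrieved evidence so it starts with a previously served
    prefix where possible. The evidence set itself is left unchanged."""
    remaining = list(retrieved)
    ordered: List[str] = []
    node = trie
    while remaining:
        # Greedy step: extend the current prefix with any retrieved item
        # that was served at this position before.
        nxt = next((d for d in remaining if d in node.children), None)
        if nxt is None:
            break
        ordered.append(nxt)
        remaining.remove(nxt)
        node = node.children[nxt]
    # Items that cannot extend a cached prefix keep their retrieval order.
    return ordered + remaining
```

After each request is served, the final order would be inserted back into the trie, so that a later query retrieving the same documents in a different order is rewritten to share the cached token prefix.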

2606.20047 2026-06-19 cs.IR New submission 80%

PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

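Topic match (Retriever): Proposes a submodular context selection method to optimize the context of LLM agents.

AI summary: Proposes PACMS, a context selection method based on submodular function maximization that picks content by relevance from the session, memory, and tool outputs at prompt assembly time, replacing truncation and improving information retention in long conversations.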

Abstract

Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and, often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval-augmented generation fetches external documents into the prompt but does not arbitrate the agent's already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.
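Selecting from a single candidate pool under a token budget can be sketched with a standard greedy heuristic for monotone submodular objectives. The abstract does not state PACMS's objective, so the one used here (number of distinct query terms covered) is an assumption chosen only because it is simple and submodular; the function name and the tuple layout are likewise hypothetical.

```python
from typing import List, Set, Tuple

# (item id, terms the item mentions, token cost); token cost must be > 0
Candidate = Tuple[str, Set[str], int]


def greedy_select(query_terms: Set[str], pool: List[Candidate], budget: int) -> List[str]:
    """Budgeted greedy selection: repeatedly add the item with the best
    marginal coverage gain per token among those that still fit.

    The pool mixes conversation turns, memory entries, and tool outputs;
    recency plays no role in the choice."""
    covered: Set[str] = set()
    chosen: List[str] = []
    spent = 0
    remaining = list(pool)
    while remaining:
        best = None
        best_ratio = 0.0
        for item in remaining:
            _, terms, cost = item
            if spent + cost > budget:
                continue  # does not fit in the remaining budget
            gain = len((terms & query_terms) - covered)  # newly covered terms
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = item, ratio
        if best is None:
            break  # nothing fits or nothing adds coverage
        chosen.append(best[0])
        covered |= best[1] & query_terms
        spent += best[2]
        remaining.remove(best)
    return chosen
```

In the test scenario for this sketch, an old but on-topic fact is kept while a long, recent, off-topic tool output is dropped, which is the behavior recency truncation cannot provide.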

2606.19911 2026-06-19 cs.AI cs.CL cs.IR New submission 75%

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

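Affiliations: Carnegie Mellon University; University of California, Berkeley

Topic match (Retriever): Extends RAG to the retrieval and reuse of agent-generated trajectories.

AI summary: Proposes the MATM framework, which enables knowledge reuse across heterogeneous agent populations through shared storage and retrieval of agent trajectories, improving downstream task performance and reducing interaction steps.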

Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.

2606.17041 2026-06-19 cs.CL cs.IR New submission 75%

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

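Affiliations: Tsinghua University

Topic match (Retriever): A benchmark that includes retrieval and RAG variants.

AI summary: Introduces the MetaSyn dataset of 442 expert-curated meta-analyses for evaluating LLM agents across the full retrieval-screening-synthesis pipeline, and finds a severe bottleneck at the screening stage in current systems.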

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

Abstract

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
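The gap between the retrieval ceiling and what a system finally includes can be illustrated with two recall figures computed against the same ground truth. This is a generic sketch of stage-attributed recall, not MetaSyn's evaluation code; the function names and the toy IDs in the test are hypothetical.

```python
from typing import Iterable, List, Tuple


def recall(found: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of ground-truth studies present in `found`."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("ground truth must not be empty")
    return len(relevant_set & set(found)) / len(relevant_set)


def stage_recalls(
    ranked: List[str], screened_in: Iterable[str], relevant: Iterable[str], k: int
) -> Tuple[float, float]:
    """Return (retrieval recall at K, end-to-end recall after screening).

    ranked:      retrieval output, best first
    screened_in: studies the screening stage decided to include
    Only studies inside the top-K pool can survive screening, so the first
    number is an upper bound on the second."""
    top_k = ranked[:k]
    pool = set(top_k)
    kept = [doc for doc in screened_in if doc in pool]
    return recall(top_k, relevant), recall(kept, relevant)
```

Reporting the two numbers side by side shows whether eligible studies were never retrieved or were retrieved and then wrongly screened out, a distinction a single end-to-end score hides.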

2606.20235 2026-06-19 cs.IR cs.AI New submission 70%

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, Enhong Chen

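Affiliations: State Key Lab of Cognitive Intelligence, University of Science and Technology of China

Topic match (Retriever): An academic paper search benchmark involving retrieval.

AI summary: Proposes the ScholarQuest benchmark, built from over 1,000 computer science topics and four research intents, with scalable answer construction and a shared retrieval backend, to evaluate how well LLM agents search for academic papers in open literature environments.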

Abstract

Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.

2606.20554 2026-06-19 cs.IR cs.AI New submission 55%

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

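Affiliations: University of Illinois Urbana-Champaign; Meta MRS

Topic match (Retriever): Uses graph modeling and semantic tokenization for context modeling.

AI summary: Proposes the G2Rec framework, which unifies graph modeling with semantic tokenization to model user interest context comprehensively and accurately for industrial-scale generative recommendation.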

Abstract

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.