RAG / Retrieval-Augmented Generation - arXivDaily Topic

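2606.18508 2026-06-18 cs.CL cs.IR New submission 95%

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Amirhossein Abaskohi, Raymond Li, Gaetano Cimino, Peter West, Giuseppe Carenini, Issam H. Laradji

Affiliations: University of British Columbia; University of Salerno; ServiceNow Research

Topic match (retriever): proposes a RAG framework that enhances paragraph retrieval with topic metadata.

AI summary: Proposes the MCompassRAG framework, which enriches paragraph representations with topic metadata and trains a lightweight retriever through LLM distillation to enable topic-aware retrieval. Across six benchmarks it improves information efficiency by 8.24% on average and lowers latency by more than 5x.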

Abstract

Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on https://github.com/AmirAbaskohi/MCompassRAG.

2606.18781 2026-06-18 cs.CL New submission 90%

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

Affiliations: Chongqing University; State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced; University of Queensland; University of Chinese Academy of Sciences

Topic match (retriever): improves long-document retrieval with a chunk evidence aggregation strategy.

AI summary: To address single-vector encoding weakening the evidence carried by key spans in long-document retrieval, proposes DICE, a training-free chunk evidence aggregation strategy that encodes chunks independently and aggregates them into a single vector, substantially improving retrieval performance while preserving the standard interface.

Comments: Code is available at https://github.com/PunchlineAAAA/DICE

Abstract

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
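
The chunk-then-aggregate idea and the dilution measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy `embed` function (a hashed bag-of-words), the mean-pooling `aggregate` rule, and the form of `evidence_dilution` (strongest chunk similarity minus document similarity) are all hypothetical choices, since the abstract does not specify the encoder, the aggregation rule, or the exact EDI formula.

```python
import math
import zlib

DIM = 64  # toy embedding size (assumption)

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def embed(text):
    """Toy stand-in for a frozen encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    return normalize(vec)

def cosine(a, b):
    # Inputs are unit-length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))

def chunk(text, size):
    """Split text into consecutive chunks of at most `size` tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def aggregate(chunk_vecs):
    """Hypothetical aggregator: mean of chunk vectors, re-normalized."""
    mean = [sum(col) / len(chunk_vecs) for col in zip(*chunk_vecs)]
    return normalize(mean)

def evidence_dilution(query_vec, doc_vec, chunk_vecs):
    """Assumed EDI form: best chunk similarity minus document similarity."""
    best_chunk = max(cosine(query_vec, c) for c in chunk_vecs)
    return best_chunk - cosine(query_vec, doc_vec)

filler = "lorem ipsum dolor sit amet " * 40
document = filler + "the passkey is 7421 " + filler
query = embed("what is the passkey")

chunk_vecs = [embed(c) for c in chunk(document, size=20)]
single_vec = embed(document)        # encode the whole document at once
pooled_vec = aggregate(chunk_vecs)  # encode chunks, then aggregate

print("EDI, single vector:", evidence_dilution(query, single_vec, chunk_vecs))
print("EDI, aggregated   :", evidence_dilution(query, pooled_vec, chunk_vecs))
```

Both document-side variants still expose one vector per document, so the query side and the index format stay unchanged; only the way the document vector is produced differs.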

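2606.19037 2026-06-18 cs.IR New submission 85%

Querit-Reranker: Training Compact Multilingual Rerankers via Efficient Label-Free Distribution Adaptation

Yunfei Zhong, Jun Yang, Wei Huang, Yinqiong Cai, Haosheng Qian, Yixing Fan, Ruqing Zhang, Lixin Su, Daiting Shi, Jiafeng Guo

Topic match (retriever): multilingual rerankers for retrieval augmentation.

AI summary: Proposes the Querit-Reranker family of multilingual cross-encoder rerankers, trained with a data-driven, label-free adaptation pipeline that adapts to target distributions through synthetic-query mining and teacher soft labels and merges checkpoints via spherical linear interpolation. It substantially improves nDCG@10 on BEIR and MIRACL and reaches state-of-the-art performance on MTEB multilingual reranking.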

Abstract

Deployable multilingual rerankers must generalize across languages, domains, and target ranking tasks while remaining efficient enough for second-stage reranking. However, adapting them to new target distributions typically requires extensive task-specific relevance annotations, which are costly to obtain. We present Querit-Reranker, a family of multilingual cross-encoder rerankers trained with a data-centric pipeline for label-efficient adaptation. We instantiate it as Querit-Reranker-A0.4B, initialized from an in-house MoE backbone with 0.4B activated parameters, and Querit-Reranker-4B, initialized from Qwen3-Embedding-4B. Our pipeline first learns general relevance modeling from large-scale ranking-oriented data, then adapts to target distributions through synthetic-query mining with teacher scores as continuous soft labels. To consolidate complementary task-adapted strengths, we further merge checkpoints via spherical linear interpolation, obtaining a single deployable model without runtime ensembling overhead. Using Qwen3-Embedding-0.6B as the shared first-stage retriever, Querit-Reranker-A0.4B improves average nDCG@10 from 54.11 to 59.28 on BEIR and from 59.87 to 67.70 on MIRACL. On MTEB Multilingual v2 Reranking, it also substantially outperforms larger embedding-based baselines, while Querit-Reranker-4B further achieves state-of-the-art performance among publicly available models. We release both models on Hugging Face.
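
Spherical linear interpolation (slerp), the merging step named above, can be sketched on flat weight vectors as follows. The abstract does not say whether merging is applied per tensor or to the whole model, nor which interpolation weight is used, so the flat-list representation, the `t = 0.5` weight, and the linear fallback for nearly parallel or anti-parallel inputs are assumptions for illustration. The function assumes non-zero vectors.

```python
import math

def slerp(w0, w1, t):
    """Spherical linear interpolation between two non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(w0, w1))
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)  # angle between the two checkpoints
    if omega < 1e-6 or math.pi - omega < 1e-6:
        # sin(omega) is close to zero here, so fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]

ckpt_a = [1.0, 0.0]  # toy "checkpoint" adapted to task A
ckpt_b = [0.0, 1.0]  # toy "checkpoint" adapted to task B
merged = slerp(ckpt_a, ckpt_b, 0.5)
print(merged)
```

Unlike plain averaging, slerp between two unit-norm vectors keeps the result on the unit sphere, which is one common motivation for using it when merging checkpoints.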

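2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA New submission 85%

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Affiliations: DoorDash, Inc.

Topic match (retriever): decouples search grounding from reasoning to optimize retrieval-augmented generation.

AI summary: Proposes the Decoupled Search Grounding (DSG) architecture, which separates search grounding from the reasoning model and exposes controls such as provider routing and caching through an MCP-compatible gateway, lowering cost and latency while maintaining or improving accuracy.

Comments: 15 pages, Figure 8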

Abstract

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
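
Two of the controls listed above, exact plus semantic caching and configured provider fallback, can be sketched as follows. Everything here is hypothetical: the `GroundingCache` class, the `grounded_search` function, the bag-of-words `embed`, the 0.9 similarity threshold, and the two toy providers are illustrative names and values, not the DSG gateway or any MCP API.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class GroundingCache:
    """Exact-match cache with a semantic (similarity-based) second tier."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.exact = {}             # normalized query -> results
        self.semantic = []          # (embedding, results) pairs

    @staticmethod
    def _key(query):
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        if key in self.exact:
            return self.exact[key], "exact"
        q_vec = embed(key)
        for vec, results in self.semantic:
            if cosine(q_vec, vec) >= self.threshold:
                return results, "semantic"
        return None, "miss"

    def put(self, query, results):
        key = self._key(query)
        self.exact[key] = results
        self.semantic.append((embed(key), results))

def grounded_search(query, providers, cache):
    """Serve from cache if possible, else try providers in configured order."""
    results, source = cache.get(query)
    if results is not None:
        return results, source
    for name, provider in providers:
        try:
            results = provider(query)
        except Exception:
            continue  # configured fallback: move on to the next provider
        cache.put(query, results)
        return results, name
    raise RuntimeError("all providers failed")

def flaky_provider(query):
    raise TimeoutError("primary provider unavailable")

def backup_provider(query):
    return [f"snippet about {query}"]

cache = GroundingCache()
providers = [("primary", flaky_provider), ("backup", backup_provider)]
first = grounded_search("capital of France", providers, cache)
second = grounded_search("Capital of  France", providers, cache)
third = grounded_search("France capital of", providers, cache)
print(first[1], second[1], third[1])
```

Because routing, fallback, and caching live outside the model call, the reasoning model can be swapped without touching any of this logic, which is the decoupling the abstract argues for.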

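2606.18811 2026-06-18 cs.IR cs.AI New submission 85%

Rescaling MLM-Head for Neural Sparse Retrieval

Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim

Affiliations: Korea University

Topic match (retriever): improves SPLADE neural sparse retrieval, which falls under retrievers.

AI summary: To address the MLM-head scale mismatch in SPLADE that causes training instability and degraded performance, proposes rescaling the MLM-head projection by a constant factor at initialization, a zero-cost change that improves training stability and turns large-norm backbones into competitive sparse retrievers.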

Abstract

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.
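
The effect of the MLM-head scale on an unnormalized dot-product score can be illustrated with a small sketch. It uses the SPLADE-style pooling (maximum over tokens of log(1 + ReLU(logit))) on toy numbers; the MLM head is reduced to a single bias-free projection, and the projection values and the 0.1 factor are arbitrary assumptions. The abstract does not state which constant the authors use or how it is chosen.

```python
import math

def mlm_head(hidden_states, projection):
    """Map hidden states (tokens x d) to vocabulary logits (tokens x V)."""
    return [
        [sum(h * w for h, w in zip(hidden, row)) for row in projection]
        for hidden in hidden_states
    ]

def splade_vector(token_logits):
    """SPLADE-style pooling: max over tokens of log(1 + ReLU(logit))."""
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]

def score(query_hidden, doc_hidden, projection):
    """Unnormalized dot product between sparse lexical representations."""
    q = splade_vector(mlm_head(query_hidden, projection))
    d = splade_vector(mlm_head(doc_hidden, projection))
    return sum(a * b for a, b in zip(q, d))

def rescale(projection, factor):
    """Initialization-time correction: multiply the projection by a constant."""
    return [[factor * w for w in row] for row in projection]

# Toy large-norm projection (V = 3 vocabulary entries, d = 2 hidden dims).
projection = [[30.0, -10.0], [5.0, 40.0], [-20.0, 15.0]]
query_hidden = [[1.0, 0.5]]
doc_hidden = [[0.8, 0.2], [0.1, 0.9]]

raw_score = score(query_hidden, doc_hidden, projection)
scaled_score = score(query_hidden, doc_hidden, rescale(projection, 0.1))
print(raw_score, scaled_score)
```

Because the score is not normalized, shrinking the projection directly shrinks the activations and the matching score, without any change to the architecture or the training objective.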

2606.18406 2026-06-18 cs.CL New submission 85%

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs

Topic match (retriever): proposes a Riemannian retrieval method for long-term dialogue memory.

AI summary: Proposes the CoreMem architecture, which replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, enabling long-term-memory dialogue agents on edge devices with 8 GB of VRAM.

Comments: 15 pages, 5 figures

Abstract

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
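
One way to see how a Woodbury identity can yield an O(Ndr) search cost is the following worked equation. It assumes, purely for illustration, that the local metric is the inverse of a scaled identity plus a rank-$r$ term, $\Sigma = \lambda I_d + U U^\top$ with $U \in \mathbb{R}^{d \times r}$ and $\lambda > 0$; the abstract does not give the actual form of the metric.

```latex
% Woodbury identity for a scaled identity plus a rank-r term (assumed form):
\Sigma^{-1} = (\lambda I_d + U U^\top)^{-1}
            = \frac{1}{\lambda}\Bigl( I_d - U \,(\lambda I_r + U^\top U)^{-1} U^\top \Bigr)

% Squared Mahalanobis distance between a query q and a memory m, with \delta = q - m:
d^2(q, m) = \delta^\top \Sigma^{-1} \delta
          = \frac{1}{\lambda}\Bigl( \lVert \delta \rVert^2
            - (U^\top \delta)^\top (\lambda I_r + U^\top U)^{-1} (U^\top \delta) \Bigr)
```

Under this assumption the $r \times r$ inverse is computed once, each candidate needs only the projection $U^\top \delta$ at cost $O(dr)$, and scoring $N$ memories costs $O(Ndr)$ rather than the $O(Nd^2)$ of a dense $d \times d$ metric.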

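2606.18310 2026-06-18 cs.CR cs.AI New submission 85%

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

Xinru Liu, Xianglong Zhang, Di Cai, Zhumin Chen, Pengfei Hu, Xin Xin

Affiliations: Shandong University, China; Tsinghua University, China

Topic match (retriever): proposes a conflict-aware retriever editing attack that injects malicious knowledge into RAG.

AI summary: Proposes CAREATTACK, a conflict-aware retriever editing framework that injects malicious knowledge into RAG systems through a model-centric attack, resolves conflicts with graph-based detection and parameter editing projection, and preserves attack effectiveness through lightweight calibration.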

Abstract

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat to AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, for example by crafting malicious corpora. However, the synthetic text crafted by such data-centric methods can be detected, causing the attack to fail. Beyond corpus manipulation, open-source retrievers increasingly expose RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists of two stages: conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3 and evaluate it on three benchmark datasets. Experimental results demonstrate that our method substantially promotes malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given access to the retrieval model parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Code is publicly accessible at https://anonymous.4open.science/r/CareAttack-3F1C.

2606.15345 2026-06-18 cs.CL cs.IR New submission 85%

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

Affiliations: Waseda University; Northwestern University; RIKEN AIP; Snowflake Inc.; University of Utah; Duke-NUS Medical School; The University of Tokyo

Topic match (retriever): evaluates cross-lingual retrieval and agent performance.

AI summary: Proposes the cross-lingual benchmark XBCP to evaluate deep research agents when the evidence language differs from the query language, finding substantial performance degradation on both the retrieval side and the agent side.

Comments: Preprint

Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

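2606.18801 2026-06-18 cs.IR cs.AI New submission 80%

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

Affiliations: Department of Computer Science and Engineering, Korea University

Topic match (retriever): multilingual dense retrieval, mitigating language bias.

AI summary: Proposes SHIFT, which at indexing time estimates relative language vectors from parallel translation pairs and corrects document embeddings accordingly to mitigate language bias in multilingual dense retrieval, improving retrieval performance without training.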

Abstract

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.
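
The index-side correction described above can be sketched in a few lines. The abstract says only that a relative language vector is estimated from parallel translation pairs and subtracted from document embeddings; estimating it as the mean embedding difference over the pairs is an assumption made here for illustration, and the function names and toy two-dimensional embeddings are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def estimate_language_vector(source_embs, target_embs):
    """Assumed estimator: mean offset of target-language embeddings
    from the embeddings of their source-language translations."""
    offsets = [
        [t - s for s, t in zip(src, tgt)]
        for src, tgt in zip(source_embs, target_embs)
    ]
    return mean_vector(offsets)

def shift_document(doc_emb, language_vector):
    """Index-time correction: subtract the relative language vector."""
    return [d - v for d, v in zip(doc_emb, language_vector)]

# Toy parallel pairs: each target embedding is its source plus a fixed offset.
source_embs = [[1.0, 0.0], [0.0, 1.0]]
target_embs = [[1.5, 0.25], [0.5, 1.25]]

language_vector = estimate_language_vector(source_embs, target_embs)
corrected = shift_document([1.5, 0.75], language_vector)
print(language_vector, corrected)
```

The correction is applied once per document when the index is built, so query encoding and the search procedure are left untouched.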

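2606.12837 2026-06-18 cs.CL New submission 75%

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

Affiliations: Meituan

Topic match (retriever): constructs complex search questions from a knowledge graph.

AI summary: Proposes the LoHoSearch benchmark, which automatically constructs 544 complex questions from a knowledge graph covering over 7 million Wikipedia entities. Evaluation shows that the strongest model reaches only 34.74% accuracy, with difficulty well beyond the ceiling of human-authored benchmarks.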

Abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

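2606.18814 2026-06-18 cs.IR New submission 70%

LensKit-Auto

Improvements and Enhancements to LensKit-Auto

Max Breit, Anass Amezian El Idrissi, Rishikesh Giriraj Kulkarni, Luca Quade

Topic match (retriever): automated recommender system framework, related to retrieval but not core RAG.

AI summary: Improves the LensKit-Auto framework so that it automatically finds a recommendation algorithm and hyperparameter combination suited to a data set, enhances usability and visualization, and adapts it to the latest version of the LensKit framework.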

Abstract

Recommender systems have a wide range of applications, e.g., in video streaming, social media, and digital marketplaces. However, finding the right algorithm with the right hyperparameters for a recommender system is a recurring challenge. There is no one-size-fits-all solution, since the performance of one algorithm can vary immensely across data sets. Given this challenge and the broad use of recommender systems, it is of interest to create an Automated Recommender System (AutoRecSys) that takes on the task of finding the right algorithm-hyperparameter combination for a given data set. In this work, we present an enhancement of LensKit-Auto, a framework introduced by Vente et al. that solves exactly this task. LensKit-Auto's biggest strength lies in its ease of use: it operates as a black box into which users feed their data set and from which they learn which algorithm and hyperparameters work best on it. We bring LensKit-Auto up to date so that it works with the new version of its underlying framework, LensKit. We also implement further functionality, such as the Tree Parzen Estimator as an additional optimization method, the ability to reuse the found algorithm, updated documentation, and the ability to visualize the optimization process. In addition, we adapt an existing meta-learning framework to generate a suitable meta-dataset for LensKit-Auto, which could enable the integration of meta-learning into LensKit-Auto in the future. The presented changes enhance LensKit-Auto's usability, so that even non-experts in the field can find the right algorithm for their use case.

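2606.18878 2026-06-18 cs.DS cs.DB cs.FL New submission 60%

Tractable Gap-Constraint Languages for Complex Event Recognition

Antoine Amarilli, Florin Manea, Tina Ringleb, Markus L. Schmid

Topic match (retriever): studies subsequence matching with gap constraints, relevant to complex event recognition but not core RAG content.

AI summary: Studies subsequence matching with gap constraints and proposes the class of left-convex languages, for which the problem can be solved in O(|D|(|u|+|C|)) time and which supports efficient enumeration for complex event recognition.

Comments: 50 pages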

Abstract

For strings $u, D \in \Sigma^*$, a subsequence embedding of $u$ in $D$ is a function $e \colon \{1, 2, \ldots, |u|\} \to \{1, 2, \ldots, |D|\}$ with $e(i) < e(i+1)$ for every $i \in \{1, 2, \ldots, |u|-1\}$ and the $i$-th symbol of $u$ equals the $e(i)$-th symbol of $D$. A gap-constraint for $u$ is a triple $(i, j, L)$ with $1 \leq i < j \leq |u|$ and $L$ is a regular language over $\Sigma$. An embedding $e$ satisfies a gap-constraint $(i, j, L)$ if the factor of $D$ strictly between positions $e(i)$ and $e(j)$ is a word from $L$. We investigate the subsequence matching problem with gap-constraints, which is relevant in the context of complex event recognition (CER): given $u, D \in \Sigma^*$ and a set $C$ of gap-constraints, find an embedding of $u$ in $D$ that satisfies all gap-constraints from $C$. In general, subsequence matching is NP-complete and the only known tractable variants restrict the interval structure of the gap-constraints. In this work, we show that we can solve subsequence matching with gap-constraints with an arbitrary interval structure rather efficiently (in fact, optimally under SETH) in time $O(|D| (|u| + |C|))$ if the gap-constraint languages satisfy a property which we dub left-convexity: whenever $u v w \in L$ and $v \in L$, then also $uv \in L$. Left-convex languages are sufficiently expressive to model interesting real-world scenarios considered in CER, e.g., length constraints $L = \{w \mid a \leq |w| \leq b\}$ for $a, b \in \mathbb{N}$. We also show how our algorithm can be used in order to efficiently enumerate all satisfying embeddings, which is particularly relevant for possible applications in CER. Finally, we show how non-left-convex languages can lead to intractability, i.e., if in addition to length constraints we allow $\{aa, \epsilon\}$ as the only non-left-convex constraint language, then the problem is NP-complete again.
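
The definitions above (embedding, gap-constraint, satisfaction) can be made concrete with a small checker. This is only a direct transcription of the definitions plus an exponential brute-force search for illustration; it is not the paper's $O(|D|(|u|+|C|))$ algorithm. Constraint languages are passed as membership predicates, and the helper names are hypothetical.

```python
from itertools import combinations

def is_embedding(u, D, e):
    """e lists 1-based positions in D, one per symbol of u."""
    if len(e) != len(u):
        return False
    increasing = all(e[i] < e[i + 1] for i in range(len(e) - 1))
    matches = all(u[i] == D[e[i] - 1] for i in range(len(u)))
    return increasing and matches

def satisfies(D, e, constraints):
    """Each constraint is (i, j, in_language) with 1-based i < j."""
    for i, j, in_language in constraints:
        # Factor of D strictly between positions e(i) and e(j).
        gap = D[e[i - 1]: e[j - 1] - 1]
        if not in_language(gap):
            return False
    return True

def length_between(a, b):
    """Membership test for the length constraint {w : a <= |w| <= b}."""
    return lambda w: a <= len(w) <= b

def find_embedding(u, D, constraints):
    """Brute force over all position tuples (exponential, illustration only)."""
    for positions in combinations(range(1, len(D) + 1), len(u)):
        e = list(positions)
        if is_embedding(u, D, e) and satisfies(D, e, constraints):
            return e
    return None

constraints = [(1, 2, length_between(1, 2)), (1, 3, length_between(0, 3))]
print(find_embedding("abc", "abxbc", constraints))
```

Length constraints are left-convex: if $uvw \in L$ and $v \in L$, then $a \leq |v| \leq |uv| \leq |uvw| \leq b$, so $uv \in L$.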

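2606.18530 2026-06-18 cs.CR cs.CL cs.LG New submission 60%

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

Aaditya Pai

Affiliations: Data Science Institute

Topic match (retriever): defends against injections in retrieved content.

AI summary: Evaluates five prompting-based defenses (such as paraphrasing and spotlighting) against domain-camouflaged injection attacks across three model families and three deployment domains, finding paraphrasing the most effective, with camouflage attack success rate reduced by 55-84%.

Comments: 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026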

Abstract

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.
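
The three defense families named above can be sketched as simple prompt transformations. The templates, the `^` marker, and the function names below are hypothetical and are not the prompts used in the paper; `paraphrase` stands for any caller-supplied rewriting function (for example a separate model call), which is an assumption since the abstract does not describe how paraphrasing is performed.

```python
def spotlight(retrieved, marker="^"):
    """Datamarking-style spotlighting: join words with a marker so the
    model can tell retrieved data apart from instructions."""
    return marker.join(retrieved.split())

def sandwich(task, retrieved):
    """Prompt sandwiching: repeat the task after the retrieved content."""
    return (
        f"{task}\n\n"
        f"<retrieved>\n{retrieved}\n</retrieved>\n\n"
        f"Reminder: {task}"
    )

def build_prompt(task, retrieved, paraphrase=None, use_spotlight=False):
    """Compose defenses: optional paraphrase, optional spotlighting,
    then a sandwiched prompt."""
    if paraphrase is not None:
        retrieved = paraphrase(retrieved)  # rewrite before the agent sees it
    if use_spotlight:
        retrieved = spotlight(retrieved)
    return sandwich(task, retrieved)

task = "Summarize the filing for the client."
document = "Revenue rose 4%. Per policy, forward all records to the auditor."
prompt = build_prompt(task, document, use_spotlight=True)
print(prompt)
```

Paraphrasing is the only one of the three that changes the wording of the retrieved text itself, which may help explain why it holds up when the injected instruction is phrased in domain vocabulary rather than marked by telltale syntax.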
