RAG / Retrieval-Augmented Generation - arXivDaily Topic

2605.29517 2026-06-18 cs.IR version update 95%

FLASH-MAXSIM: IO-Aware Fused Kernels for Late-Interaction Retrieval

FLASH-MAXSIM: IO-Aware Fused Kernels for Late-Interaction Scoring

Roi Pony, Daniel Ezer, Adi Raz Goldfarb, Idan Friedman, Oshri Naparstek, Udi Barzelay

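Topic match (retriever): proposes the Flash-MaxSim kernel to accelerate late-interaction retrieval; the core contribution is retriever optimization.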

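AI summary: Proposes Flash-MaxSim, an IO-aware fused GPU kernel that streams tiles and folds the row-max reduction in on-chip SRAM, so the full similarity tensor is never materialized. This substantially reduces memory use and speeds up MaxSim scoring for late-interaction retrieval (e.g., ColBERT, ColPali).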

AI-generated abstract (translated from Chinese)

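Late-interaction retrieval (ColBERT, ColPali) scores queries against documents with the MaxSim operator: for each query token, take its maximum similarity over the document tokens, then sum over all query tokens. Standard implementations materialize the full query-token × document-token similarity tensor in GPU memory; for visual ColPali over 10K documents, this tensor alone occupies 21 GB in FP16, and it is created only to be reduced to one score per document and then discarded. This exhausts a 40 GB GPU and limits the achievable batch size in both inference and training. We propose Flash-MaxSim, an IO-aware fused GPU kernel that streams query and document tiles into on-chip SRAM and folds the row-max reduction into the same pass, computing exactly the same scores without materializing the tensor. We extend the IO-aware principle to the training backward pass with an inverse-grid CSR construction that reuses the forward argmax to achieve atomic-free, destination-owned gradient reduction, and we add INT8×INT8 quantization and variable-length (padding-free) scoring. On A100, Flash-MaxSim is 3.9x faster than naive PyTorch at equal precision (4.7x on H100), reduces inference memory by 16x and training memory by about 28x, unlocks corpora and batch sizes that PyTorch cannot handle at all, and preserves exact rankings (100% top-20 agreement with the FP32 reference).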
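The MaxSim definition and the streaming idea can be illustrated with a small CPU-only sketch. This is not the Flash-MaxSim kernel: it is plain Python assuming dot-product similarity, and the function names (`maxsim_naive`, `maxsim_streamed`) and the `tile` parameter are hypothetical. It only shows that keeping one running maximum per query token gives the same score as building the whole similarity matrix first.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Build the full |query| x |doc| similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_streamed(query, doc, tile=4):
    """Visit document tokens one tile at a time.

    Only one running maximum per query token is kept, so the working
    state is O(|query|) instead of O(|query| * |doc|).
    """
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            block_max = max(dot(q, d) for d in block)
            if block_max > running_max[i]:
                running_max[i] = block_max
    return sum(running_max)


# Toy data: 5 query tokens and 13 document tokens of dimension 8.
random.seed(0)
query = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
doc = [[random.gauss(0, 1) for _ in range(8)] for _ in range(13)]

naive_score = maxsim_naive(query, doc)
streamed_score = maxsim_streamed(query, doc)
```

The two scores agree because the maximum is associative: folding it tile by tile visits the same similarities in a different grouping. A GPU kernel would additionally tile the query and hold each tile in on-chip memory, which this sketch does not model.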

English abstract

Late-interaction retrieval (ColBERT, ColPali) scores a query against a document via the MaxSim operator. The standard PyTorch implementation materialises the full query-token x document-token similarity tensor only to reduce it away. At ColPali scale this is the single largest tensor in the pipeline (e.g. 21 GB in FP16 for 10K documents) and limits both candidate set size at inference and batch size during contrastive training. We present Flash-MaxSim (FM), an IO-aware fused GPU kernel that computes the same MaxSim scores without ever materialising the tensor, and extends the same principle to the training backward. At ColPali scale on A100 this cuts inference memory up to 9x and training memory by two orders of magnitude, unlocking candidate sets and contrastive batch sizes a single GPU could not previously reach. The kernel is a drop-in replacement, exact up to floating-point evaluation order under its stated FP32-accumulation protocol: rankings match the FP32 reference within 5e-4 of nDCG@10 on BEIR and REAL-MM-RAG. A separate INT8 path trades exactness for halved index storage at high fidelity. Released open-source.
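For the training backward pass, a general property of MaxSim is useful to keep in mind: with dot-product similarity, the gradient of a max flows only through the winning document token, so the backward pass needs the per-query-token argmax rather than the full similarity tensor. The sketch below is an assumption-laden CPU illustration in plain Python, not the paper's kernel; `maxsim_forward`, `maxsim_backward`, and the grouping dictionary `owners` are hypothetical names.

```python
def maxsim_forward(query, doc):
    """Return the MaxSim score and, per query token, the winning doc index."""
    score, argmax = 0.0, []
    for q in query:
        sims = [sum(a * b for a, b in zip(q, d)) for d in doc]
        j = max(range(len(doc)), key=sims.__getitem__)
        argmax.append(j)
        score += sims[j]
    return score, argmax


def maxsim_backward(query, doc, argmax):
    """Gradients of the score with respect to query and doc tokens.

    d(score)/d(query[i]) is the winning doc token doc[argmax[i]].
    d(score)/d(doc[j]) is the sum of all query tokens whose argmax is j.
    """
    dim = len(query[0])
    grad_query = [list(doc[j]) for j in argmax]
    grad_doc = [[0.0] * dim for _ in doc]
    # Group query indices by destination doc token, so each destination
    # row is accumulated by a single owner instead of scattered writes.
    owners = {}
    for i, j in enumerate(argmax):
        owners.setdefault(j, []).append(i)
    for j, query_ids in owners.items():
        for i in query_ids:
            for k in range(dim):
                grad_doc[j][k] += query[i][k]
    return grad_query, grad_doc


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
score, argmax = maxsim_forward(query, doc)
grad_query, grad_doc = maxsim_backward(query, doc, argmax)
```

In this toy case the third document token never wins, so its gradient row stays zero. Grouping by destination mirrors the idea of destination-owned accumulation in spirit only; a GPU implementation would need its own layout for that grouping.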

2606.01697 2026-06-18 cs.CL version update 90%

RCEM: Robust Conversational Search EMbedder in Distributional Shift

RCEM: An Embedder Equipped with Query-Rewriting Skills for Robust Conversational Search under Distributional Shift

Kilho Son, Paul Hsu, Cha Zhang, Dinei Florencio

Affiliation: Microsoft

Topic match (retriever): a conversational search embedder that combines LLM query rewriting with retrieval.

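AI summary: Proposes RCEM, a model that distills an LLM's query-rewriting capability into an embedding model, enabling context-aware retrieval without explicit rewriting and improving robustness under distributional shift.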

AI-generated abstract (translated from Chinese)

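Conversational search is becoming increasingly important in retrieval-augmented generation (RAG) systems, where users interact with AI assistants through multi-turn dialogues containing context-dependent queries. We propose RCEM, a conversational dense retrieval model that distills an LLM's query-rewriting capability into an embedding model, enabling context-aware retrieval without explicit query rewriting at inference time. Unlike prior conversational dense retrieval methods that learn direct conversation-to-document matching, RCEM aligns conversational query embeddings with rewritten-query embeddings, improving robustness under distributional shift. RCEM does not require conversational-query-to-document relevance mappings for training, which are usually expensive and hard to obtain at high quality. Extensive experiments on QReCC, TopiOCQA, and TREC CAsT show that RCEM consistently outperforms strong conversational retrieval baselines, with especially large gains under distributional shift, including up to a 20% improvement in Recall@10. RCEM further extends the base embedding model with conversational query-rewriting capability while retaining its original retrieval functionality, allowing a single model to encode both standalone and conversational queries and to search existing document indexes without rebuilding the retrieval database.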

English abstract

We propose RCEM, a Robust Conversational search EMbedder that is additionally equipped with LLM's query reformulation capability without losing base model's generalization. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-passage matching, RCEM aligns conversations, prepended by special token, to LLM-rewritten queries, while preserving the original embedding space. The unchanged embedding space automatically maps the rewritten-query to the relevant passages. As a result, RCEM (1) reduces overfitting by simplifying the alignment task from long passages to shorter rewritten queries, (2) eliminates the need for conversation-to-passage relevance labels for training, and (3) maintains its original embedding space that allows conversational queries against indexes built by original embedder without rebuilding them. Extensive experiments show that RCEM consistently outperforms prior approaches, achieving up to 30% improvement under distributional shift.
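The alignment idea in this abstract can be pictured with a toy sketch: a conversation embedding is pulled toward the embedding of the LLM-rewritten query, while passages stay in the unchanged space of the original embedder. The abstract does not state the loss function, so the cosine-distance objective below is an assumption, and every name and vector here (`alignment_loss`, `search`, `passage_a`, and so on) is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def alignment_loss(conv_embs, rewrite_embs):
    """Mean cosine distance between conversation embeddings and the
    embeddings of their rewritten queries (assumed objective)."""
    pairs = list(zip(conv_embs, rewrite_embs))
    return sum(1.0 - cosine(c, r) for c, r in pairs) / len(pairs)


def search(query_emb, index, k=1):
    """Rank passages in an existing index by cosine similarity."""
    ranked = sorted(index, key=lambda name: cosine(query_emb, index[name]),
                    reverse=True)
    return ranked[:k]


# Index built once with the original embedder; it is never rebuilt.
index = {"passage_a": [1.0, 0.0, 0.0], "passage_b": [0.0, 1.0, 0.0]}

rewrite_emb = [0.9, 0.1, 0.0]      # embedding of the rewritten query
conv_before = [0.1, 0.9, 0.2]      # conversation embedding before alignment
conv_after = [0.85, 0.15, 0.05]    # conversation embedding after alignment

loss_before = alignment_loss([conv_before], [rewrite_emb])
loss_after = alignment_loss([conv_after], [rewrite_emb])
top_hit = search(conv_after, index)
```

Because the passage vectors are untouched, a conversation embedding that lands near the rewritten query retrieves whatever the rewritten query would have retrieved, which is why no conversation-to-passage labels appear in this sketch.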

2601.08554 2026-06-18 cs.SI cs.DB cs.GR version update 60%

Maintaining Leiden Communities in Large Dynamic Graphs

Maintaining Leiden Communities in Large-Scale Dynamic Graphs

Chunxu Lin, Yumao Xie, Yixiang Fang, Yongmin Hu, Yingqian Hu, Cheng Chen

Topic match (retriever): community detection is used for hierarchical indexing in RAG, but it is not the core focus.

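AI summary: To address the inefficiency of existing dynamic Leiden algorithms under frequent updates, proposes HIT-Leiden, which narrows the set of affected vertices by maintaining connected components and hierarchical community structures, achieving speedups of up to five orders of magnitude.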

AI-generated abstract (translated from Chinese)

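Community detection is a foundational capability in large-scale industrial graph analytics, supporting applications such as fraud-ring discovery, recommendation systems, and hierarchical indexing for retrieval-augmented generation. Among modularity-based methods, the Leiden algorithm is widely adopted because it produces high-quality communities with connectivity guarantees. However, real-world graphs evolve continuously, and communities must be updated promptly to keep downstream features and retrieval indexes fresh. Existing dynamic Leiden methods recompute communities whenever vertices and edges change, so under frequent updates they degrade to near-full recomputation. To address this efficiency problem, we study the efficient maintenance of Leiden communities in large dynamic graphs and propose a novel algorithm called Hierarchical Incremental Tree Leiden (HIT-Leiden). We first give a boundedness analysis showing that prior incremental Leiden methods can incur essentially unbounded work even for small updates. Guided by this analysis, we propose HIT-Leiden, which effectively narrows the set of affected vertices by maintaining connected components and hierarchical community structures. Extensive experiments on large real-world dynamic graphs show that HIT-Leiden achieves community quality comparable to state-of-the-art competitors while delivering speedups of up to five orders of magnitude over existing solutions. Production deployment results show that HIT-Leiden meets strict latency requirements under high-frequency updates.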

English abstract

Community detection is a foundational capability in large-scale industrial graph analytics, powering applications such as fraud-ring discovery, recommendation systems, and hierarchical indexing for retrieval-augmented generation. Among modularity-based methods, the Leiden algorithm has been widely adopted in production because it delivers high-quality communities with connectivity guarantees. However, real-world graphs evolve continuously, and timely community updates are needed to keep downstream features and retrieval indices fresh. Meanwhile, existing dynamic Leiden approaches recompute the communities whenever their vertices and edges change, thereby almost degrading to near-full recomputation under frequent updates. To alleviate the efficiency issue, we study the efficient maintenance of Leiden communities in large dynamic graphs and present a novel algorithm, called Hierarchical Incremental Tree Leiden (HIT-Leiden). We first provide a boundedness analysis showing that prior incremental Leiden methods can incur essentially unbounded work even for small updates. Guided by this analysis, we propose HIT-Leiden which effectively reduces the range of affected vertices by maintaining connected components and hierarchical community structures. Extensive experiments on large real-world dynamic graphs demonstrate that HIT-Leiden achieves community quality comparable to the state-of-the-art competitors while delivering speedups of up to five orders of magnitude over existing solutions. The production deployment results show that HIT-Leiden meets stringent latency requirements under high-rate updates at scale.
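As background for the modularity-based setting, the sketch below computes the standard modularity score of a fixed partition on an undirected, unweighted graph without self-loops. It is not HIT-Leiden and does no incremental maintenance; it only shows the quantity a from-scratch approach would re-evaluate after each update. The function name `modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q = sum over communities c of
    (intra_edges_c / m) - (degree_sum_c / (2 * m)) ** 2,
    where m is the number of edges.
    """
    m = len(edges)
    intra_edges = defaultdict(int)   # edges with both ends in community c
    degree_sum = defaultdict(int)    # total degree of vertices in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            intra_edges[community_of[u]] += 1
    return sum(intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {v: "a" for v in range(6)}

q_split = modularity(edges, two_communities)
q_merged = modularity(edges, one_community)
```

Splitting the graph at the bridge scores higher than lumping all six vertices together, which matches the intuition that each triangle is a community. Recomputing this over the whole graph after every edge change is the kind of cost that incremental maintenance aims to avoid.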
