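arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

RAG / Retrieval-Augmented Generation

Retrieval-augmented generation, vector retrieval, knowledge-base question answering, and search systems for large language models.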

2026-06-19 to 2026-06-19. Indexed: 4. Sources: cs.IR, cs.CL, cs.AI, cs.DB

2606.19692 2026-06-19 cs.CR cs.DB cs.IR New submission 90%

When Global Gating Is Enough: Admission-Time Hubness Control in Anisotropic Vector Retrieval Systems

Prashant Kumar Pathak, Tarun Kumar Sharma

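Topic match (vector retrieval): Proposes an admission-time control method against the vector-hubness poisoning risk in RAG.

AI summary: To counter the poisoning risk that vector hubness creates in retrieval-augmented generation, the paper proposes admission-time control. Candidates are scored against sentinel queries and hub-like documents are quarantined; a global gate achieves high recall and a low false-positive rate across multiple datasets.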

Abstract

Vector hubness, where a few points become nearest neighbors of many queries, creates a poisoning risk in retrieval-augmented generation (RAG): one injected document can influence unrelated requests. Existing defenses use periodic reverse-kNN scans, leaving an exposure window and repeated corpus-wide work. We study admission-time control, scoring each candidate against sentinel queries and quarantining hub-like documents before insertion. Across two 100,000-document corpora, five encoders, and disjoint attacker and defender query sets, a global gate achieves recall 1.0 at the decisive embedding-space point (>=0.92 across the effective range) and 0.91 +/- 0.07 on HotFlip attacks, with 1% false positives on general documents. A per-topic gate provides no reliable benefit, consistent with anisotropy coupling local and global visibility. Thresholds are maintained incrementally, with corpus-size-independent insertion cost and amortized deletion cost. On HNSW, admission adds about 3.1% to ingestion latency, scoring remains flat to 10^6 vectors, and 1.2% of decisions flip under approximate indexing, none involving attacks. Provenance complements the gate for natural or tight-domain hubs.
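
The admission-time idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, so everything here is an assumption: the hub score is taken to be the fraction of sentinel queries whose top-k a candidate would enter, the `gate` value of 0.5 is arbitrary, and the names `kth_best_scores`, `hub_score`, and `admit` are hypothetical. The toy vectors are chosen by hand so the outcome is easy to verify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def kth_best_scores(corpus, sentinels, k):
    """Per sentinel query: the similarity a newcomer must beat to enter its top-k."""
    thresholds = []
    for q in sentinels:
        sims = sorted((cosine(q, d) for d in corpus), reverse=True)
        thresholds.append(sims[k - 1])
    return thresholds


def hub_score(candidate, sentinels, thresholds):
    """Assumed score: fraction of sentinel queries whose top-k the candidate would enter."""
    hits = sum(1 for q, t in zip(sentinels, thresholds) if cosine(q, candidate) > t)
    return hits / len(sentinels)


def admit(candidate, sentinels, thresholds, gate=0.5):
    """Global gate: admit only if the hub score stays at or below the gate (assumed value)."""
    return hub_score(candidate, sentinels, thresholds) <= gate


# Toy data: three axis-aligned documents and three sentinel queries between the axes.
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentinels = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
thresholds = kth_best_scores(corpus, sentinels, k=1)

hub = [1.0, 1.0, 1.0]      # sits between all queries, so it would rank first for each
normal = [1.0, 0.0, 0.1]   # close to a single axis, enters only one top-1 list

print(hub_score(hub, sentinels, thresholds), admit(hub, sentinels, thresholds))
print(hub_score(normal, sentinels, thresholds), admit(normal, sentinels, thresholds))
```

In this sketch the admission check touches only the sentinel queries and their stored thresholds, so its cost does not grow with the corpus. Keeping the thresholds current as documents are inserted and deleted is the part the paper addresses and this sketch omits.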

2606.19803 2026-06-19 cs.DB cs.AI cs.LG New submission 85%

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

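Affiliation: Portland State University

Topic match (vector retrieval): Discusses fine-grained access control in vector databases, which is relevant to RAG systems.

AI summary: Presents a vision for policy-aware vector search, formalizes the fine-grained access control (FGAC) policy model and the enforcement problem in vector databases, compares enforcement strategies, and identifies future challenges.

Comments: Accepted at SeQureDB 26, SIGMOD 2026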

Abstract

Vector databases are increasingly used in security-sensitive contexts with Retrieval Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension among enforcing FGAC policies correctly, achieving high ANN search recall, and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
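
The tension the abstract describes can be made concrete with a small sketch. The abstract does not name the enforcement strategies it compares, so the two shown here, post-filtering and pre-filtering, are assumed examples drawn from common practice in filtered vector search. The document layout (`id`, `vec`, `allowed`) and both function names are hypothetical, and an exact brute-force scan stands in for an ANN index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def post_filter_search(query, docs, user, k):
    """Take the top-k first, then drop unauthorized hits; may return fewer than k."""
    top = sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in top if user in d["allowed"]]


def pre_filter_search(query, docs, user, k):
    """Restrict to authorized documents first, then take the top-k among them."""
    visible = [d for d in docs if user in d["allowed"]]
    top = sorted(visible, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in top]


docs = [
    {"id": "a", "vec": [1.0, 0.0], "allowed": {"alice"}},
    {"id": "b", "vec": [0.9, 0.1], "allowed": {"alice"}},
    {"id": "c", "vec": [0.8, 0.2], "allowed": {"bob"}},
    {"id": "d", "vec": [0.5, 0.5], "allowed": {"bob"}},
]
query = [1.0, 0.0]

post = post_filter_search(query, docs, "bob", k=2)
pre = pre_filter_search(query, docs, "bob", k=2)
print(post, pre)
```

Both variants return only documents that `bob` may read, but post-filtering loses recall here because the two nearest documents belong to another user. Pre-filtering keeps recall at the cost of searching a per-user subset, which on a real ANN index usually means extra latency or index structure.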

2606.19458 2026-06-19 cs.IR New submission 85%

MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems

Oğuzhan Yenen

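Topic match (vector retrieval): A training-free embedded vector search kernel suited to edge AI.

AI summary: Proposes MonaVec, a training-free, data-oblivious embedded vector search kernel that achieves 4-bit compression through a randomized Hadamard transform and precomputed Lloyd-Max quantization, gives deterministic results in edge and offline settings, and supports single-file deployment.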

Comments: 27 pages, 11 figures. Code and artifacts: https://github.com/mona-hq/monavec (PyPI: monavec; crates.io: monavec-core). Zenodo: doi:10.5281/zenodo.20559587

Abstract

We present MonaVec, a deterministic, embedded vector-search kernel for edge and offline AI -- settings where server infrastructure, network connectivity, and training data are all unavailable. Existing vector-search systems assume a persistent server, gigabytes of RAM, or a training pass over the corpus; MonaVec instead targets the deployment profile of SQLite: one file, one function call, runs anywhere. Its quantization core is training-free by default and data-oblivious: a Randomized Hadamard Transform (RHDH) conditions any input distribution toward N(0,1), so precomputed Lloyd-Max tables quantize to 4 bits (8x smaller) with no learned codebook and no data pass. The index persists as a single .mvec file whose embedded ChaCha20 rotation seed makes results reproducible across architectures and byte-identical within a build -- a determinism guarantee that parallel-build graph libraries cannot offer. On semantic embeddings (AG News, 45K x 1024-dim BGE-M3, cosine), MonaVec 4-bit BruteForce reaches 0.960 Recall@10 in 27 MB -- leading float32 FAISS-IVF and 8-bit usearch on recall -- while trading peak throughput for byte-identical determinism. A single-pass global standardization (fit()) extends the same data-oblivious pipeline to magnitude-sensitive L2 data, and optional IvfFlat and HNSW backends carry it to million-vector corpora. MonaVec is implemented in pure Rust with Python bindings and runtime SIMD dispatch (AVX-512/AVX2/NEON/scalar). It targets on-device RAG, offline agents, and embedded retrieval -- the niche SQLite occupies for relational data: one file, one call, runs anywhere.
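
The quantization pipeline described in the abstract can be sketched roughly as follows. This is not MonaVec's code or API. The sketch assumes a seeded sign flip followed by an orthonormal fast Walsh-Hadamard transform, rescales the input to norm sqrt(n) so that coordinates have unit mean square, and substitutes a uniform 16-level quantizer over [-3, 3] for the precomputed Lloyd-Max tables. Python's `random.Random` stands in for the ChaCha20 seed, and all function names are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def condition(vec, seed):
    """Rescale to norm sqrt(n), apply seeded random signs, then the Hadamard transform."""
    n = len(vec)
    norm = math.sqrt(sum(x * x for x in vec))
    rng = random.Random(seed)  # stand-in for a ChaCha20-derived rotation seed
    signs = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]
    return fwht([s * x * math.sqrt(n) / norm for s, x in zip(signs, vec)])


def quantize_4bit(values, lo=-3.0, hi=3.0):
    """Uniform 16-level quantizer (assumed stand-in for Lloyd-Max tables); codes are 0..15."""
    step = (hi - lo) / 16
    return [min(15, max(0, int((v - lo) / step))) for v in values]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte; len(codes) must be even."""
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


# A single spike is the worst case for a fixed quantizer; the transform spreads it evenly.
spike = [10.0] + [0.0] * 63
conditioned = condition(spike, seed=42)
codes = quantize_4bit(conditioned)
packed = pack_nibbles(codes)
print(len(spike) * 4, "bytes as float32 ->", len(packed), "bytes packed")
```

With 64 dimensions the float32 vector occupies 256 bytes and the packed codes occupy 32, which matches the 8x reduction quoted in the abstract. Because the signs come from a fixed seed, repeating the call with the same seed yields identical bytes.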

2606.09824 2026-06-19 cs.DB Version update 60%

TSseek: Regular Expression-Based Similarity Search for Distributed Time Series Datasets

Xiaoshuai Li, Khalid Alnuaim, Mohamed Y. Eltabakh, Elke A. Rundensteiner

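Topic match (vector retrieval): Time series similarity search; not conventional RAG, but retrieval-related.

AI summary: Proposes the TSseek framework, whose regular-expression query language supports searching by trend, value range, and wildcard patterns, and builds the distributed spatial index TSseek-X for efficient exact matching.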

Comments: Extended version with full ablation studies and additional experiments. v3 corrects bibliographic metadata for several references.

Abstract

Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series. To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs. In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series). Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines all sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. Also, for subsequence workloads, it achieves significant speedups over state-of-the-art subsequence matching engines.
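
The idea of querying a series by trend and value range rather than by exact values can be sketched in a few lines. This is not TSseek's query language or its TSseek-X index. The sketch assumes fixed-width windows, labels each window U, D, or F from the sign of its end-to-end change, and uses Python's `re` module on the resulting trend string in place of the paper's bounding-rectangle translation. The names `to_segments` and `matches` are hypothetical.

```python
import re


def to_segments(series, width):
    """Split into fixed-width windows, keeping each window's trend and value range."""
    segments = []
    for start in range(0, len(series) - width + 1, width):
        window = series[start:start + width]
        delta = window[-1] - window[0]
        trend = "U" if delta > 0 else "D" if delta < 0 else "F"
        segments.append((trend, min(window), max(window)))
    return segments


def matches(series, width, trend_pattern, lo=None, hi=None):
    """Whole-series match: trend string fits the pattern and all values stay in [lo, hi]."""
    segments = to_segments(series, width)
    trends = "".join(trend for trend, _, _ in segments)
    if re.fullmatch(trend_pattern, trends) is None:
        return False
    return all(
        (lo is None or low >= lo) and (hi is None or high <= hi)
        for _, low, high in segments
    )


rise_hold_fall = [1, 2, 3, 3, 3, 3, 3, 2, 1]
steady_rise = [1, 2, 3, 4, 5, 6]

# "U+.D+" reads: one or more rising windows, any single window, one or more falling windows.
print(matches(rise_hold_fall, 3, "U+.D+"))
print(matches(rise_hold_fall, 3, "U+.D+", hi=2))
print(matches(steady_rise, 3, "U+.D+"))
```

The `.` in the pattern plays the role of a wildcard segment, and the `lo` and `hi` bounds play the role of a value-range constraint. Subsequence matching and the distributed index described in the abstract are outside the scope of this sketch.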