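arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI / Large models

RAG / Retrieval-Augmented Generation

Retrieval-augmented generation, vector search, knowledge-base question answering, and search systems for large language models.

2026-06-19 to 2026-06-19, 27 papers included, sources: cs.IR, cs.CL, cs.AI, cs.DB

1. Knowledge-base question answering (12 papers)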

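2606.19396 2026-06-19 q-bio.QM New submission 90%

BioHarness: Substrate-Aware Evidence Assembly for Biomedical Question Answering across Literature, Knowledge Bases, and Biological Atlases

Meng Xiao, Chuan Qin, Jinmiao Chen, Yihang Cheng, Yuanchun Zhou, Hengshu Zhu

Topic match (knowledge-base QA): evidence assembly across literature, knowledge bases, and biological atlases for biomedical question answering.

AI summary: Proposes BioHarness, which uses cascade control to selectively assemble evidence across literature retrieval, knowledge bases, and biological atlases for biomedical question answering, raising the pooled score from 65.9 to 71.0 on 19,302 QA items.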

Comments 14 Pages, 11 Figures, Keywords: biomedical question answering; retrieval-augmented generation; large language models; evidence assembly; biomedical knowledge bases; biological atlases

Abstract

Motivation: Biomedical question answering often requires evidence beyond topically retrieved literature, including gene alias resolution, database identifier normalization, and atlas-derived biological measurements. However, existing retrieval-augmented generation (RAG) systems typically follow a fixed workflow and lack an explicit mechanism for deciding when retrieved text is sufficient, when curated biomedical knowledge is required, or when executable evidence assembly over structured measurements should be invoked. This motivates a substrate-aware large language model (LLM) harness that selectively assembles sufficient evidence across literature, knowledge bases, and biological atlases. Results: We introduce BioHarness, an LLM harness for staged biomedical evidence assembly across literature retrieval, curated biomedical knowledge resources, and atlas-derived structured measurements. BioHarness first attempts to answer from reranked literature evidence and escalates through grounded cascade control to REPL-style evidence assembly only when the current evidence is uncertain, weakly grounded, or substrate-mismatched. Across 19,302 biomedical QA items spanning seven answer formats, BioHarness improves the pooled score from 65.9 to 71.0 over the strongest non-oracle baseline. Ablations, case studies, and backbone-scaling analyses show that these gains arise from repairing evidence-substrate mismatches through reranking, entity grounding, and structured measurement access, rather than from indiscriminately invoking more reasoning steps, retrieving additional literature, or relying on a particular answer-model scale.
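The escalation logic described in this abstract can be pictured as a loop over ordered evidence stages that stops as soon as one stage produces an acceptable answer. The sketch below is only an illustration of that idea, not the authors' implementation: the `StageResult` fields, the `min_confidence` threshold, and the three toy stage functions are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageResult:
    answer: str
    confidence: float   # hypothetical self-estimated confidence in [0, 1]
    grounded: bool      # hypothetical: answer is supported by cited evidence
    substrate_ok: bool  # hypothetical: evidence type fits the question


# A stage maps a question to a result; stages are ordered cheapest first.
Stage = Callable[[str], StageResult]


def run_cascade(question: str, stages: List[Stage],
                min_confidence: float = 0.7) -> Tuple[int, StageResult]:
    """Return (index, result) of the first stage that passes every check.

    If no stage passes, the last stage's result is returned as a fallback.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for index, stage in enumerate(stages):
        result = stage(question)
        accepted = (result.confidence >= min_confidence
                    and result.grounded and result.substrate_ok)
        if accepted:
            break
    return index, result


# Toy stand-ins for the three evidence substrates (not real retrievers).
def literature_stage(question: str) -> StageResult:
    return StageResult("unclear from abstracts", 0.4, True, False)


def knowledge_base_stage(question: str) -> StageResult:
    return StageResult("resolved gene alias", 0.9, True, True)


def atlas_stage(question: str) -> StageResult:
    return StageResult("measured expression value", 0.95, True, True)


stage_index, final = run_cascade(
    "Which gene does this alias refer to?",
    [literature_stage, knowledge_base_stage, atlas_stage],
)
```

In this toy run the literature stage is rejected because its evidence does not fit the question, so the cascade escalates once and stops at the knowledge-base stage without touching the atlas stage.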

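2606.20359 2026-06-19 cs.LG New submission 90%

Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act

Ali Asaria, Tony Salomone, Deep Gandhi

Affiliations: Transformer Lab

Topic match (knowledge-base QA): an SFT+RAG hybrid model for statutory citation.

AI summary: Studies how self-represented tenants, landlords, and help-desk staff can be given correct statutory citations. A four-arm experiment compares fine-tuning, retrieval, and a hybrid of the two; the SFT+RAG hybrid scores highest on exact match (0.481) with no hallucinated citations.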

Abstract

Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act, 2006 (RTA) and its core regulation, asking the operator's question empirically: is fine-tuning enough, or is hybrid retrieval needed? We run a four-arm head-to-head on Qwen2.5-7B-Instruct (base zero-shot, LoRA SFT-only, RAG-only, and an SFT+RAG hybrid), scored on citation exact-match (section+subsection) over a small, human-verification-pending real eval set. The base model cannot cite the RTA and SFT-only mis-recalls sections; retrieval is essential and drives hallucination to zero by construction; and the SFT+RAG hybrid scores highest at 0.481 exact-match with zero hallucinated citations. Its edge comes from SFT making provision selection more robust to the higher-recall candidate sets that hurt zero-shot RAG. Notably, this cheap bge-small hybrid matches or beats a pipeline built on bigger, specialized retrieval models (a larger embedder and a cross-encoder reranker), and a larger/improved training set does not help either: strong statutory-citation performance here does not require specialized retrieval models or more data. The artifact zeroes hallucination and clears the lift-over-base bar but does not reach the aspirational 0.70 exact-match target. All results are on a small, human-verification-pending real eval set and are reported as preliminary.
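The two headline measurements in this abstract, citation exact-match on section plus subsection and the rate of hallucinated citations, can be sketched as a small scoring function. This is a minimal illustration under stated assumptions: the citation format `s. 10(2)`, the regular expression, the definition of "hallucinated" as citing a section number absent from the statute, and all toy section numbers are assumptions made here, not details taken from the paper.

```python
import re
from typing import List, Optional, Set, Tuple

# Assumed citation shape: "s. 10(2)" or "s. 48.1" (section, optional subsection).
CITATION = re.compile(r"\bs\.\s*(\d+(?:\.\d+)?)\s*(?:\((\d+(?:\.\d+)?)\))?")

Citation = Tuple[str, str]  # (section, subsection); subsection may be ""


def parse_citation(text: str) -> Optional[Citation]:
    """Extract the first citation in the text, or None if there is none."""
    match = CITATION.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2) or ""


def score(predictions: List[str], gold: List[Citation],
          known_sections: Set[str]) -> Tuple[float, float]:
    """Return (exact_match_rate, hallucination_rate) over the eval set."""
    exact = 0
    hallucinated = 0
    for text, gold_citation in zip(predictions, gold):
        cited = parse_citation(text)
        if cited == gold_citation:
            exact += 1
        # Assumed definition: a cited section that does not exist in the statute.
        if cited is not None and cited[0] not in known_sections:
            hallucinated += 1
    return exact / len(gold), hallucinated / len(gold)


# Toy data with made-up section numbers, for illustration only.
toy_predictions = [
    "See s. 10(2) of the Act.",
    "This falls under s. 999(1).",
    "I am not sure which provision applies.",
]
toy_gold = [("10", "2"), ("20", "1"), ("30", "")]
toy_sections = {"10", "20", "30"}

exact_rate, hallucination_rate = score(toy_predictions, toy_gold, toy_sections)
```

Restricting the model to citations drawn from retrieved passages is one way the hallucination rate could reach zero "by construction", since every candidate section then exists in the statute.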

2606.19602 2026-06-19 cs.AI New submission 90%

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

Affiliations: Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen; Faculty of Computer Science, University of Duisburg-Essen; Department of Physics, TU Dortmund University; Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University; Advanced Clinical Research Center, Fukushima Medical University; Department of Cardiology and Vascular Medicine, University Hospital Essen

Topic match (knowledge-base QA): proposes ACIE, a clinical information extraction system built on agentic RAG.

AI summary: To address missing metadata in clinical documents, proposes ACIE, an agentic RAG system deployed at University Medicine Essen that reasons over complete patient contexts and grounds answers in cited sources for verification; clinicians accepted 96.5% of extractions across 7,326 judgments.

Abstract

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.

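2606.03367 2026-06-19 cs.IR Version update 85%

Automating Information Extraction and Retrieval for Industrial Spare Parts Pooling

Dyuman Bulloni, Rocco Felici, Oliver Avram, Anna Valente

Topic match (knowledge-base QA): proposes PhRAG, a hybrid retrieval-augmented generation framework for spare-part retrieval.

AI summary: Proposes PhRAG, a hybrid RAG framework that structures heterogeneous spare-part descriptions with named entity recognition into a virtual stock pool and uses generative language models to handle data scarcity and query variability, yielding explainable spare-part retrieval.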

Abstract

Maintenance organizations in manufacturing try to avoid downtime and unnecessary purchasing by reusing existing assets, but the main obstacle is not a lack of parts but a lack of actionable visibility across sites and partners. Inventories are distributed, described with inconsistent naming conventions, and contain duplicates and partially specified references, so the right part often exists somewhere but remains effectively undiscoverable. The paper proposes PhRAG, a hybrid Retrieval-Augmented Generation approach for pooling this fragmented landscape into a Virtual Stock Pool (VSPool) that can be structured and searched as a single resource. Heterogeneous spare part descriptions are structured via Named Entity Recognition (NER) into a shared virtual pool dataset and indexed to support robust retrieval even when users express needs in natural language rather than exact technical specifications. The proposed modular pipeline leverages the multitasking nature of generative language models to cover two dimensions that make industrial parts pooling challenging: (i) unstructured technical specifications from diverse data sources (e.g. new partners, catalogs, marketplace listings) are handled through offline extraction, and (ii) request variability at runtime (references, partial references, specifications, price/condition constraints) is handled through a hybrid RAG-based search engine capable of retrieving relevant components and justifying results. The framework demonstrates the potential of generative approaches compared with traditional NER approaches when data for technical specification extraction is scarce, and it overcomes the opacity of standard information retrieval systems by generating justifications for retrieved components. The project's open-source code is publicly available.

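2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN New submission 80%

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

Masahiro Kato

Affiliations: Mizuho-DL Financial Technology, Co., Ltd.

Topic match (knowledge-base QA): RAG-based economic analysis that retrieves evidence and generates reports.

AI summary: Proposes a RAG-based AI economist agent framework that uses knowledge graphs and large language models for economic scenario analysis; agents plan the analysis, retrieve evidence, select models, and generate reports, improving the coherence and traceability of the resulting economic narratives.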

Abstract

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.20369 2026-06-19 cs.CL New submission 80%

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

Helena Bonaldi, Genoveffa Martone, Marco Guerini

Affiliations: Fondazione Bruno Kessler; Università Cattolica del Sacro Cuore

Topic match (knowledge-base QA): a dataset for training counterspeech models with RAG systems.

AI summary: Introduces the first large-scale, expert-curated, multilingual dialogue dataset covering the overlap of hate and misinformation, with fact-check grounding and span annotations, supporting RAG-based training of more trustworthy counterspeech models.

Abstract

Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

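2606.19598 2026-06-19 cs.RO New submission 80%

Fail-RAG : A Retrieval Augmented Generation Informed Framework for Robot Failure Identification

Ameya Salvi, Jie Hu

Affiliations: Hitachi America, Ltd.

Topic match (knowledge-base QA): proposes the Fail-RAG framework, which uses RAG to detect robot failures.

AI summary: Proposes Fail-RAG, which combines retrieval-augmented generation with vision-language models: failure images and context are embedded and queried against a failure database to detect robot operation failures, improving average detection accuracy by 25 percentage points on warehouse automation tasks.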

Abstract

Industrial automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and a growing labor shortage in manufacturing. An intelligent autonomous robot needs not only to act according to planned motions but also to react to unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected event as a failure and develop methods to detect failures related to robot operations. Rule-based detection methods may break because the form of failures can change with the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework in which failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage points higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.
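The retrieval step described here, embedding an observation and querying it against a failure database by similarity, can be sketched with plain cosine similarity. This is a minimal illustration, assuming that embeddings are already available as lists of floats; the three-dimensional vectors, the failure labels, and the function names are hypothetical, and the abstract does not state which similarity measure the authors use.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_failures(query: Vector, database: List[Tuple[str, Vector]],
                   top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the top_k most similar (label, similarity) database entries."""
    scored = [(label, cosine(query, vector)) for label, vector in database]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy failure database; labels and vectors are made up for illustration.
failure_db = [
    ("dropped_item", [1.0, 0.0, 0.0]),
    ("gripper_miss", [0.0, 1.0, 0.0]),
    ("collision", [0.0, 0.0, 1.0]),
]

top_matches = query_failures([0.9, 0.1, 0.0], failure_db, top_k=2)
```

The retrieved entries would then be passed, together with the observation, to a vision-language model as context for the detailed failure analysis the abstract mentions.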

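2605.26891 2026-06-19 cs.CL Version update 80%

Telenor Nordics Customer Service self-help corpus

Mike Riess

Affiliations: Research and Innovation, Telenor Group

Topic match (knowledge-base QA): builds a multilingual customer service corpus that supports RAG.

AI summary: Presents a multilingual customer service self-help corpus of 1,122 documents in Finnish, Danish, Norwegian, and Swedish, intended to support Nordic NLP and information retrieval research.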

Comments 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152

Abstract

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling 274,599 words and 1,884,833 characters. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/20732652, intended to support reproducible research in Nordic NLP and information retrieval.

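2606.19847 2026-06-19 cs.CL New submission 70%

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; Anhui University

Topic match (knowledge-base QA): fact extraction and hierarchical event structures for memory retrieval.

AI summary: To address coarse-grained storage and unstable updates in existing memory systems, proposes AtomMem, whose Fact Executor extracts high-value atomic facts as an efficient memory representation and organizes them into hierarchical event structures and temporal profiles for value-dense storage and stable evolution; it achieves state-of-the-art results on the LoCoMo benchmark.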

Comments 19 pages, 10 figures, 5 tables

Abstract

Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high-value atomic facts from long-form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

2606.19700 2026-06-19 cs.CL New submission 70%

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

Affiliations: University of Arizona; College of Information Science, University of Arizona; Biosphere 2, University of Arizona; Department of Ecology and Evolutionary Biology, University of Arizona; Department of Environmental Sciences, University of Arizona

Topic match (knowledge-base QA): combines retrieval with a chunking framework for information extraction.

AI summary: Proposes TerraMARS, a pipeline built on a domain-adapted small language model that extracts structured information from Mars science literature in support of terraforming research.

Comments 16 pages, 1 figure, 4 tables

Abstract

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. Achieving this requires comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features, drawn from the scientific literature. This literature contains valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that uses a domain-adapted Small Language Model to answer Mars terraforming-related questions and to convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

2605.27864 2026-06-19 cs.AI Version update 70%

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Di Zhu, Lei Nico Zheng, Zihan Chen

Affiliations: Stevens Institute of Technology; UMass Boston

Topic match (knowledge-base QA): knowledge-graph memory for investment research.

AI summary: Proposes FundaPod, a platform that supports human portfolio managers in making transparent, verifiable fundamental investment decisions through independent multi-persona research, knowledge-graph memory, and post hoc adjudication.

Comments 32 pages; 12 figures

Abstract

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

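2507.00875 2026-06-19 cs.CL cs.HC cs.MA Version update 70%

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Xi Xuan, Chunyu Kit

Affiliations: City University of Hong Kong, Hong Kong SAR, China

Topic match (knowledge-base QA): integrates a legal glossary database with retrieval-augmented generation.

AI summary: To address scarce resources and strict terminology and formatting requirements in English-to-Chinese translation of Hong Kong case law, builds HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus, and proposes TransLaw, a multi-agent framework that decomposes the translation task and integrates a legal glossary database and retrieval-augmented generation; translation quality improves significantly but still falls short of human experts in stylistic naturalness.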

Comments Accepted at ICML 2026 - AI for Law

Abstract

Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology, citation format, and judicial style. We introduce HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus for HK case law, comprising 344 professionally translated judgments (11,099 sentence pairs; 2.1M tokens) spanning 1997-2022. Building on this resource, we propose TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation, and iterative feedback, with four-dimensional expert review covering semantic alignment, terminology, citation, and style. Benchmarking 13 open-source and commercial LLMs, we demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models, with convergence within 3 iterations. Human evaluation by 10 certified legal translators using our proposed Legal ACS metric confirms gains in legal-semantic accuracy, while showing that TransLaw still trails human experts in stylistic naturalness. The dataset and benchmark code are available at https://github.com/xuanxixi/TransLaw.

2. Retrievers (11 papers)

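2606.19960 2026-06-19 cs.IR New submission 90%

Stellar: Scalable Multimodal Document Retrieval for Natural Language Queries

Yuxiang Guo, Zhonghao Hu, Yuren Mao, Yuhang Liu, Congcong Ge, Xiaolu Zhang, Jun Zhou, Yunjun Gao

Topic match (retrievers): proposes the Stellar framework for scalable multimodal document retrieval.

AI summary: Proposes Stellar, which stores token-level document embeddings on disk and dynamically loads candidate embeddings, combining lexical-representation filtering with efficient disk-backed late interaction to cut memory overhead and query latency by 1-2 orders of magnitude while preserving retrieval effectiveness.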

Abstract

Multimodal document retrieval--selecting the most relevant multimodal document from a large corpus to answer a natural language query--plays an essential role in Retrieval-Augmented Generation (RAG) systems. State-of-the-art methods represent each document and query with multiple token-level embeddings and use late interaction to achieve high effectiveness. However, such multi-vector representations incur substantial memory overhead during retrieval, leading to poor scalability and hindering real-world deployment. In this paper, we present Stellar, a scalable multimodal document retrieval framework that stores token-level document embeddings on disk and loads only a small set of candidate embeddings into memory for late interaction. Stellar comprises two key components: (i) Lexical Representation-based Filtering (LRF), which fine-tunes a Multimodal Large Language Model (MLLM) as a sparse encoder to produce high-quality lexical representations, enabling efficient and effective document filtering to significantly reduce the candidate set; (ii) Efficient Disk-backed Late Interaction (DLI), which designs an on-disk token embedding storage layout guided by a balanced clustering algorithm, and dynamically loads only the necessary token embeddings into memory using a simple yet effective cost model. Extensive experiments on four real-world benchmarks and a newly presented large-scale dataset demonstrate that Stellar reduces memory overhead and query latency by 1-2 orders of magnitude compared to existing methods without compromising retrieval effectiveness.
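The two-stage design in this abstract, a lexical filter that shrinks the candidate set followed by late interaction over token embeddings loaded only for those candidates, can be sketched as follows. This is a simplified illustration rather than Stellar itself: the late-interaction score shown is the common MaxSim form (sum over query tokens of the best-matching document token), the sparse index is a plain dictionary of term weights, and `load_from_store` is a hypothetical stand-in for a disk read. The balanced-clustering storage layout and the cost model from the abstract are not modeled.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token keeps its best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve(query_terms: Dict[str, float], query_tokens: List[Vector],
             sparse_index: Dict[str, Dict[str, float]],
             load_embeddings: Callable[[str], List[Vector]],
             n_candidates: int = 2) -> List[Tuple[str, float]]:
    # Stage 1: cheap lexical filter using sparse term weights held in memory.
    def lexical_score(doc_id: str) -> float:
        weights = sparse_index[doc_id]
        return sum(w * weights.get(term, 0.0) for term, w in query_terms.items())

    candidates = sorted(sparse_index, key=lexical_score, reverse=True)[:n_candidates]
    # Stage 2: load token embeddings only for the surviving candidates.
    scored = [(doc_id, maxsim(query_tokens, load_embeddings(doc_id)))
              for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data; the dict stands in for on-disk storage and `loaded` records reads.
sparse_index = {
    "a": {"invoice": 1.0, "total": 0.5},
    "b": {"invoice": 0.8},
    "c": {"cat": 1.0},
}
token_store = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.5, 0.5]],
    "c": [[1.0, 1.0]],
}
loaded: List[str] = []


def load_from_store(doc_id: str) -> List[Vector]:
    loaded.append(doc_id)
    return token_store[doc_id]


results = retrieve({"invoice": 1.0}, [[1.0, 0.0], [0.0, 1.0]],
                   sparse_index, load_from_store)
```

In the toy run, document `c` is filtered out lexically, so its token embeddings are never read; that skipped read is the memory and latency saving the abstract attributes to filtering before late interaction.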

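2606.19719 2026-06-19 cs.IR cs.CL cs.LG New submission 90%

Closing the Calibration Gap in Semantic Caching

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

Affiliations: New York University; Redis

Topic match (retrievers): studies calibration in semantic caching systems and proposes new metrics.

AI summary: To address the gap between offline metrics and deployed performance in semantic caching, proposes the P-CHR AUC and CRR metrics; finds that the calibration gap is governed by the training objective and that model selection is fundamentally a calibration problem.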

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

Abstract

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
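One way to picture a precision versus cache-hit-ratio curve is to sort query pairs by similarity, serve progressively more of them from the cache, and record the precision at each utilization level. The sketch below follows that reading and integrates the curve with the trapezoid rule. It is an assumption-laden illustration, since the abstract does not give the exact definition of P-CHR AUC: the function names, the treatment of each item as its own threshold (ties are not merged), and the toy scores are all introduced here.

```python
from typing import List, Tuple

# Each item: (similarity score, whether serving the cached response is correct).
Scored = Tuple[float, bool]


def precision_chr_points(scored: List[Scored]) -> List[Tuple[float, float]]:
    """Return (cache_hit_ratio, precision) as the threshold is lowered."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points = []
    correct = 0
    for served, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((served / len(ranked), correct / served))
    return points


def trapezoid_area(points: List[Tuple[float, float]]) -> float:
    """Area under the curve between the first and last recorded points."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        total += (x1 - x0) * (y0 + y1) / 2.0
    return total


def operating_point(scored: List[Scored], threshold: float) -> Tuple[float, float]:
    """(cache_hit_ratio, precision) at one fixed deployment threshold."""
    hits = [is_correct for score, is_correct in scored if score >= threshold]
    if not hits:
        return 0.0, 0.0
    return len(hits) / len(scored), sum(hits) / len(hits)


# Toy similarity scores and correctness labels, for illustration only.
toy_pairs = [(0.95, True), (0.90, True), (0.80, False), (0.70, True)]
curve = precision_chr_points(toy_pairs)
curve_area = trapezoid_area(curve)
fixed = operating_point(toy_pairs, threshold=0.85)
```

The contrast between `curve` and `fixed` mirrors the abstract's point: a model can rank pairs well overall yet still behave poorly at the single threshold that a deployed cache actually uses.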

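2606.20113 2026-06-19 cs.CL cs.IR New submission 85%

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

Elroy Galbraith

Affiliations SMG Labs

Topic match Retriever: analysis of tool-intent stabilization in streaming RAG.

AI summary Measures tool-intent stabilization (the point at which a speculative query's retrieval converges to the answer-bearing result) to analyze latency hiding in streaming RAG on the CRAG benchmark; finds that 73.9% of queries admit substantial latency hiding and identifies predictors of early versus late stabilization.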

Abstract

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, epsilon^2=0.04), directly informing when a learned speculative trigger is worth its cost.
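The intuition behind hiding tool latency behind the remaining input can be illustrated with a back-of-the-envelope sketch. This is not the paper's bound H, whose exact form the abstract does not state. It assumes a query of `n_words` words typed at a constant cadence, a speculative query that stabilizes after `stable_at_word` words, and a tool call issued at that moment. Both function names are hypothetical.

```python
def top1_settling_point(top1_per_prefix):
    """Smallest 1-based prefix length from which the top-1 result never changes.

    top1_per_prefix[i] is the top-1 document id retrieved for the first
    i + 1 words of the query.
    """
    final = top1_per_prefix[-1]
    k = len(top1_per_prefix)
    while k > 1 and top1_per_prefix[k - 2] == final:
        k -= 1
    return k


def hidden_latency_fraction(n_words, stable_at_word, tool_latency_s, words_per_s):
    """Share of the tool latency that overlaps with the user's remaining input."""
    remaining_s = max(n_words - stable_at_word, 0) / words_per_s
    return min(1.0, remaining_s / tool_latency_s)


# A 10-word query at 3 words/s with a 600 ms tool call.
print(hidden_latency_fraction(10, 4, 0.6, 3))   # stabilizes early
print(hidden_latency_fraction(10, 9, 0.6, 3))   # stabilizes late
print(top1_settling_point(["d7", "d2", "d5", "d5", "d5"]))
```

Under these assumptions an early-stabilizing query hides the whole 600 ms call, while one that stabilizes on the second-to-last word hides only part of it.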

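2606.20065 2026-06-19 cs.IR cs.CL cs.CY New submission 85%

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

Pratyush Kumar

Affiliations Ranqo

Topic match Retriever: studies brand visibility in AI search engines, involving retrieval and citation mechanisms.

AI summary Analyzes 100K+ prompt responses to propose a way of measuring brand visibility in AI search engines; finds that brand maturity forms a three-tier ladder, and identifies the most-cited content format and the instability of sentiment.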

Comments 14 pages, 4 tables; v1.0 preprint

Abstract

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

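2606.19898 2026-06-19 cs.DB cs.IR New submission 85%

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Qianqian Xiong, Mengxuan Zhang

Topic match Retriever: filtered approximate nearest neighbor search, a core RAG technique.

AI summary Proposes a query-aware routing framework in which a lightweight ML model predicts each candidate method's recall and an offline benchmark table is used to pick the best recall-QPS trade-off; reaches state-of-the-art performance on five unseen datasets.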

Comments 12 pages

Abstract

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall--QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
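The routing step described above can be sketched as a lookup over a benchmark table. The abstract does not say how the recall-QPS trade-off is scored, so this sketch assumes one simple rule: among settings whose predicted and measured recall both meet a target, take the fastest. The method names, parameter strings, and numbers are invented for illustration, and every method with a prediction is assumed to have at least one table row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchRow:
    method: str
    params: str
    recall: float  # measured offline
    qps: float     # measured offline


def route(predicted_recall, table, target_recall=0.9):
    """Pick one (method, params) row for a single query.

    predicted_recall maps method name to the model's per-query recall estimate.
    """
    eligible = [
        row for row in table
        if predicted_recall.get(row.method, 0.0) >= target_recall
        and row.recall >= target_recall
    ]
    if eligible:
        return max(eligible, key=lambda r: r.qps)
    # Fallback: no setting meets the target, so favor recall over speed.
    best_method = max(predicted_recall, key=predicted_recall.get)
    rows = [r for r in table if r.method == best_method]
    return max(rows, key=lambda r: r.recall)


TABLE = [
    BenchRow("prefilter", "exact", 1.00, 800.0),
    BenchRow("filtered_graph", "ef=64", 0.92, 5000.0),
    BenchRow("filtered_graph", "ef=16", 0.80, 9000.0),
    BenchRow("postfilter", "k=100", 0.95, 7000.0),
]
choice = route({"prefilter": 0.99, "postfilter": 0.60, "filtered_graph": 0.95}, TABLE)
print(choice.method, choice.params)
```

In this toy table the post-filter setting is the fastest one that meets the target offline, but the per-query prediction rules it out, which is the case a query-aware router exists to catch.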

2606.19667 2026-06-19 cs.CL New submission 85%

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Kaizhen Tan, Rong Gu, Mingyuan Li

Affiliations Heinz College of Information Systems and Public Policy, Carnegie Mellon University

Topic match Retriever: cache-aware evidence ordering that lowers time-to-first-token in RAG inference.

AI summary Proposes CacheWeaver, a lightweight prompt-layer method that lowers time-to-first-token in RAG inference through cache-aware evidence ordering, without modifying the serving engine or the evidence set.

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.
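The prefix tree and greedy walk can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it works on document ids rather than tokens, scores a branch by how many recently served sequences passed through it, and never evicts old entries. The class and function names are hypothetical.

```python
class PrefixTree:
    """Trie over recently served evidence orderings (document ids)."""

    def __init__(self):
        self.children = {}  # doc_id -> PrefixTree
        self.count = 0      # served sequences that passed through this node

    def insert(self, sequence):
        node = self
        for doc_id in sequence:
            node = node.children.setdefault(doc_id, PrefixTree())
            node.count += 1


def cache_aware_order(tree, retrieved):
    """Reorder `retrieved` so its head follows the most-used cached prefix.

    The evidence set is unchanged; only the order differs. Documents that
    cannot extend a cached prefix keep their retrieval order at the tail.
    """
    remaining = list(retrieved)
    ordered, node = [], tree
    while remaining:
        candidates = [d for d in remaining if d in node.children]
        if not candidates:
            break
        best = max(candidates, key=lambda d: node.children[d].count)
        ordered.append(best)
        remaining.remove(best)
        node = node.children[best]
    return ordered + remaining


tree = PrefixTree()
tree.insert(["d1", "d2", "d3"])
tree.insert(["d1", "d2", "d4"])
print(cache_aware_order(tree, ["d4", "d2", "d1", "d9"]))
```

The new request shares three documents with an earlier one but in a different order. After reordering, its first three documents match a previously served prefix, which is the condition prefix caching needs in order to reuse work.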

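2606.20047 2026-06-19 cs.IR New submission 80%

PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

Topic match Retriever: proposes a submodular context selection method to optimize the context of LLM agents.

AI summary Proposes PACMS, a context selection method based on submodular function maximization that, at prompt assembly time, picks content from conversation turns, memory, and tool outputs by relevance, replacing truncation and improving information retention in long conversations.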

Abstract

Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and, often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval-augmented generation fetches external documents into the prompt but does not arbitrate the agent's already-present pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.
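Selecting from a single candidate pool under a token budget can be illustrated with a greedy sketch. The abstract does not state PACMS's objective, so this sketch substitutes a simple monotone submodular stand-in: the number of distinct query terms covered. It picks items by marginal gain per token. The function name and the data are hypothetical, and query terms are assumed to be lowercase.

```python
def select_context(query_terms, candidates, token_budget):
    """Greedy budgeted selection over one pooled candidate list.

    candidates: list of (item_id, text, token_cost) with token_cost > 0,
    mixing conversation turns, memory entries, and tool outputs.
    """
    query = set(query_terms)
    covered, chosen, spent = set(), [], 0
    pool = list(candidates)

    def gain_per_token(item):
        _, text, cost = item
        new_terms = (set(text.lower().split()) & query) - covered
        return len(new_terms) / cost

    while pool:
        affordable = [c for c in pool if spent + c[2] <= token_budget]
        if not affordable:
            break
        best = max(affordable, key=gain_per_token)
        if gain_per_token(best) == 0:
            break  # nothing left adds coverage
        chosen.append(best[0])
        covered |= set(best[1].lower().split()) & query
        spent += best[2]
        pool.remove(best)
    return chosen


pool = [
    ("turn-1", "user wants to deploy in region eu-west", 8),   # old but relevant
    ("tool-9", "long unrelated log output", 50),                # recent, verbose
    ("turn-7", "deploy finished", 3),                           # recent, partial
]
print(select_context(["deploy", "region", "eu-west"], pool, token_budget=12))
```

Recency truncation with the same budget would keep the newest items and drop `turn-1`. The relevance-driven pick keeps it instead.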

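2606.19911 2026-06-19 cs.AI cs.CL cs.IR New submission 75%

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

Affiliations Carnegie Mellon University; University of California, Berkeley

Topic match Retriever: extends RAG to the retrieval and reuse of agent-generated trajectories.

AI summary Proposes the MATM framework, which stores and retrieves agent trajectories in a shared repository to enable knowledge reuse across heterogeneous agent populations, improving downstream task performance and reducing interaction steps.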

Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
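The producer and consumer roles can be pictured with a toy shared repository. This is only a sketch of the pattern: the class name is hypothetical, and word-overlap (Jaccard) similarity over task descriptions stands in for whatever retriever the framework actually uses.

```python
class TrajectoryStore:
    """Shared repository: producers contribute, consumers retrieve."""

    def __init__(self):
        self._items = []  # (task_description, steps, producer_id)

    def contribute(self, task, steps, producer):
        self._items.append((task, list(steps), producer))

    def retrieve(self, task, k=1):
        """Return the k stored trajectories whose task text is most similar."""
        query = set(task.lower().split())

        def jaccard(item):
            words = set(item[0].lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        return sorted(self._items, key=jaccard, reverse=True)[:k]


store = TrajectoryStore()
store.contribute("put a clean mug in the cabinet",
                 ["find mug", "wash mug", "open cabinet", "place mug"], "agent-a")
store.contribute("book a flight on the airline site",
                 ["open site", "search flights", "pay"], "agent-b")
task, steps, producer = store.retrieve("put a clean plate in the cabinet")[0]
print(producer, steps)
```

A consumer agent would place the retrieved steps in its prompt as a procedural hint. No coordination with the producing agent is involved.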

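2606.17041 2026-06-19 cs.CL cs.IR New submission 75%

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

Affiliations Tsinghua University

Topic match Retriever: a benchmark that includes retrieval and RAG variants.

AI summary Introduces the MetaSyn dataset of 442 expert-curated meta-analyses for evaluating LLM agents across the full retrieval-screening-synthesis pipeline; finds a severe bottleneck at the screening stage in current systems.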

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

Abstract

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
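Why a single end-to-end score hides the bottleneck can be seen by splitting recall by stage. The metric names below are illustrative and are not claimed to match the paper's stage-attributed metrics. The sketch assumes only a gold set of included studies, the retriever's top-K list, and the set the screener kept.

```python
def stage_recalls(gold, retrieved_top_k, screened_in):
    """Attribute recall of gold studies to the retrieval and screening stages."""
    gold = set(gold)
    retrieved = gold & set(retrieved_top_k)   # gold studies that reached screening
    included = retrieved & set(screened_in)   # gold studies that survived screening
    return {
        "retrieval_recall": len(retrieved) / len(gold),
        "screening_recall": len(included) / len(retrieved) if retrieved else 0.0,
        "end_to_end_recall": len(included) / len(gold),
    }


print(stage_recalls(
    gold={"a", "b", "c", "d"},
    retrieved_top_k=["a", "b", "c", "x", "y"],
    screened_in=["a", "x"],
))
```

In this toy case retrieval finds three of the four gold studies but screening keeps only one of them, so the low end-to-end number is mostly a screening failure.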

2606.20235 2026-06-19 cs.IR cs.AI New submission 70%

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, Enhong Chen

Affiliations State Key Lab of Cognitive Intelligence, University of Science and Technology of China

Topic match Retriever: an academic paper search benchmark, involving retrieval.

AI summary Proposes the ScholarQuest benchmark, built from over 1,000 computer science topics and four research intents, with scalable answer construction and a shared retrieval backend, to evaluate LLM agents on academic paper search in open literature environments.

Abstract

Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.
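The two reported metrics can be computed with one small function. This sketch assumes that Recall@All means recall over everything the agent returned, with no cutoff; the abstract does not define it, so treat that reading as an assumption.

```python
def recall_at(ranked, gold, k=None):
    """Fraction of gold papers found in the top k results (all results if k is None)."""
    top = ranked if k is None else ranked[:k]
    return len(set(top) & set(gold)) / len(gold)


ranked = ["p3", "p9", "p1", "p7", "p2"]
gold = {"p1", "p2", "p4"}
print(recall_at(ranked, gold, k=3))
print(recall_at(ranked, gold))
```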

2606.20554 2026-06-19 cs.IR cs.AI New submission 55%

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

Affiliations University of Illinois Urbana-Champaign; Meta MRS

Topic match Retriever: context modeling with graph modeling and semantic tokenization.

AI summary Proposes the G2Rec framework, which unifies graph modeling with semantic tokenization to model user interest context comprehensively and accurately for industrial-scale generative recommendation.

Abstract

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

3. Vector retrieval 4 papers

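2606.19692 2026-06-19 cs.CR cs.DB cs.IR New submission 90%

When Global Gating Is Enough: Admission-Time Hubness Control in Anisotropic Vector Retrieval Systems

Prashant Kumar Pathak, Tarun Kumar Sharma

Topic match Vector retrieval: proposes admission-time control against the poisoning risk of vector hubness in RAG.

AI summary Targets the poisoning risk that vector hubness creates in retrieval-augmented generation; proposes admission-time control that scores documents against sentinel queries and quarantines hub-like documents, with a global gate reaching high recall and a low false-positive rate on two 100,000-document corpora.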

Abstract

Vector hubness, where a few points become nearest neighbors of many queries, creates a poisoning risk in retrieval-augmented generation (RAG): one injected document can influence unrelated requests. Existing defenses use periodic reverse-kNN scans, leaving an exposure window and repeated corpus-wide work. We study admission-time control, scoring each candidate against sentinel queries and quarantining hub-like documents before insertion. Across two 100,000-document corpora, five encoders, and disjoint attacker and defender query sets, a global gate achieves recall 1.0 at the decisive embedding-space point (>=0.92 across the effective range) and 0.91 +/- 0.07 on HotFlip attacks, with 1% false positives on general documents. A per-topic gate provides no reliable benefit, consistent with anisotropy coupling local and global visibility. Thresholds are maintained incrementally, with corpus-size-independent insertion cost and amortized deletion cost. On HNSW, admission adds about 3.1% to ingestion latency, scoring remains flat to 10^6 vectors, and 1.2% of decisions flip under approximate indexing, none involving attacks. Provenance complements the gate for natural or tight-domain hubs.
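Admission-time gating against sentinel queries can be sketched as follows. This illustrates the general idea under stated assumptions and is not the paper's scoring rule: a candidate counts as hub-like when it would enter the current top-k of too large a share of sentinel queries. Only the k best scores per sentinel are kept, so the check does not scan the corpus. The class name, the threshold, and the `trusted` seeding flag are hypothetical, and deletions are not handled.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AdmissionGate:
    def __init__(self, sentinels, k=1, max_hub_fraction=0.4):
        self.sentinels = sentinels
        self.k = k
        self.max_hub_fraction = max_hub_fraction
        # One min-heap per sentinel holding its k best admitted scores.
        self.topk = [[] for _ in sentinels]

    def hub_fraction(self, vec):
        """Share of sentinel queries whose top-k the candidate would enter."""
        wins = 0
        for q, heap in zip(self.sentinels, self.topk):
            if len(heap) < self.k or dot(q, vec) > heap[0]:
                wins += 1
        return wins / len(self.sentinels)

    def admit(self, vec, trusted=False):
        """Return True if admitted, False if quarantined."""
        if not trusted and self.hub_fraction(vec) > self.max_hub_fraction:
            return False
        for q, heap in zip(self.sentinels, self.topk):
            score = dot(q, vec)
            if len(heap) < self.k:
                heapq.heappush(heap, score)
            elif score > heap[0]:
                heapq.heapreplace(heap, score)
        return True


gate = AdmissionGate([(1.0, 0.0), (0.0, 1.0), (0.6, 0.8), (0.8, 0.6)])
gate.admit((0.96, -0.28), trusted=True)   # seed with vetted documents
gate.admit((-0.28, 0.96), trusted=True)
hub_like = (math.sqrt(0.5), math.sqrt(0.5))
print(gate.admit(hub_like), gate.admit((0.28, -0.96)))
```

The trusted seeding step exists because an empty index would otherwise make every first document look like a hub. How a real deployment bootstraps its thresholds is outside what the abstract describes.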

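2606.19803 2026-06-19 cs.DB cs.AI cs.LG New submission 85%

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

Affiliations Portland State University

Topic match Vector retrieval: discusses fine-grained access control in vector databases, relevant to RAG systems.

AI summary Presents a vision for policy-aware vector search: formalizes the fine-grained access control (FGAC) policy model and the enforcement problem in vector databases, compares enforcement strategies, and identifies future challenges.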

Comments Accepted at SeQureDB 26, Sigmod 2026

Abstract

Vector databases are increasingly used in security sensitive contexts with Retrieval Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC) which is required to ensure that data access adheres to user-specific policies is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high ANN search recall and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
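The tension between policy correctness, recall, and latency shows up even in a brute-force toy. The abstract does not name the enforcement strategies it compares, so this sketch uses the two generic options from filtered vector search: filtering before ranking and filtering after it. The data and the policy function are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def post_filter_search(query, docs, allowed, k):
    """Rank everything, take the top k, then drop what the user may not see."""
    top = sorted(docs, key=lambda d: dot(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in top if allowed(d)]


def pre_filter_search(query, docs, allowed, k):
    """Restrict to visible documents first, then rank only those."""
    visible = [d for d in docs if allowed(d)]
    top = sorted(visible, key=lambda d: dot(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in top]


DOCS = [
    {"id": "a", "vec": (1.0, 0.0), "dept": "hr"},
    {"id": "b", "vec": (0.9, 0.1), "dept": "hr"},
    {"id": "c", "vec": (0.8, 0.2), "dept": "eng"},
    {"id": "d", "vec": (0.1, 0.9), "dept": "eng"},
]


def eng_only(doc):
    return doc["dept"] == "eng"


print(post_filter_search((1.0, 0.0), DOCS, eng_only, k=2))
print(pre_filter_search((1.0, 0.0), DOCS, eng_only, k=2))
```

Neither variant leaks a forbidden document, but post-filtering returns nothing here because both top hits are hidden from the user. Pre-filtering returns two results at the cost of evaluating the policy on every document before the search.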

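2606.19458 2026-06-19 cs.IR New submission 85%

MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems

Oğuzhan Yenen

Topic match Vector retrieval: a training-free embedded vector search kernel for edge AI.

AI summary Proposes MonaVec, a training-free, data-oblivious embedded vector search kernel that achieves 4-bit compression via a randomized Hadamard transform and precomputed Lloyd-Max quantization, gives deterministic results in edge and offline settings, and supports single-file deployment.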

Comments 27 pages, 11 figures. Code and artifacts: https://github.com/mona-hq/monavec (PyPI: monavec; crates.io: monavec-core). Zenodo: doi:10.5281/zenodo.20559587

Abstract

We present MonaVec, a deterministic, embedded vector-search kernel for edge and offline AI -- settings where server infrastructure, network connectivity, and training data are all unavailable. Existing vector-search systems assume a persistent server, gigabytes of RAM, or a training pass over the corpus; MonaVec instead targets the deployment profile of SQLite: one file, one function call, runs anywhere. Its quantization core is training-free by default and data-oblivious: a Randomized Hadamard Transform (RHDH) conditions any input distribution toward N(0,1), so precomputed Lloyd-Max tables quantize to 4 bits (8x smaller) with no learned codebook and no data pass. The index persists as a single .mvec file whose embedded ChaCha20 rotation seed makes results reproducible across architectures and byte-identical within a build -- a determinism guarantee that parallel-build graph libraries cannot offer. On semantic embeddings (AG News, 45K x 1024-dim BGE-M3, cosine), MonaVec 4-bit BruteForce reaches 0.960 Recall@10 in 27 MB -- leading float32 FAISS-IVF and 8-bit usearch on recall -- while trading peak throughput for byte-identical determinism. A single-pass global standardization (fit()) extends the same data-oblivious pipeline to magnitude-sensitive L2 data, and optional IvfFlat and HNSW backends carry it to million-vector corpora. MonaVec is implemented in pure Rust with Python bindings and runtime SIMD dispatch (AVX-512/AVX2/NEON/scalar). It targets on-device RAG, offline agents, and embedded retrieval -- the niche SQLite occupies for relational data: one file, one call, runs anywhere.
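The data-oblivious quantization idea can be sketched in pure Python. This is not MonaVec's code or API. It makes three substitutions that should be read as assumptions: Python's seeded `random.Random` replaces the ChaCha20 generator, equal-probability normal quantiles replace a true Lloyd-Max table, and vectors are L2-normalized first with a length that is a power of two. All names are hypothetical.

```python
import bisect
import math
import random
from statistics import NormalDist


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n, h = len(out), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [x * scale for x in out]


def randomized_hadamard(vec, seed):
    """Flip signs with a seeded generator, then rotate. Same seed, same result."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return fwht([s * x for s, x in zip(signs, vec)])


# Fixed 16-level table for N(0,1), computed once with no pass over any data.
_N = NormalDist()
BOUNDARIES = [_N.inv_cdf(i / 16) for i in range(1, 16)]
LEVELS = [_N.inv_cdf((2 * i + 1) / 32) for i in range(16)]


def quantize4(vec, seed=0):
    """Pack a vector into 4-bit codes, two per byte."""
    n = len(vec)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    rotated = randomized_hadamard([x / norm for x in vec], seed)
    # Scale so each coordinate has roughly unit variance before table lookup.
    codes = [bisect.bisect_right(BOUNDARIES, x * math.sqrt(n)) for x in rotated]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, n, 2))


def decode4(packed):
    """Map packed codes back to approximate rotated, rescaled coordinates."""
    out = []
    for byte in packed:
        out += [LEVELS[byte >> 4], LEVELS[byte & 0x0F]]
    return out


v = [0.5, -1.0, 2.0, 0.0, 3.0, -2.5, 1.5, 0.25]
print(len(quantize4(v, seed=7)), quantize4(v, seed=7) == quantize4(v, seed=7))
```

Each coordinate takes half a byte instead of the four bytes of a float32, which is where the 8x figure comes from. Because the table is fixed and the rotation depends only on the seed, indexing needs no training pass.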

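2606.09824 2026-06-19 cs.DB Version update 60%

TSseek: Regular Expression-Based Similarity Search for Distributed Time Series Datasets

Xiaoshuai Li, Khalid Alnuaim, Mohamed Y. Eltabakh, Elke A. Rundensteiner

Topic match Vector retrieval: time series similarity search; not conventional RAG but involves retrieval.

AI summary Proposes the TSseek framework, whose regular-expression query language supports searching for trends, value ranges, and wildcard patterns, and builds the distributed spatial index TSseek-X for efficient exact matching.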

Comments Extended version with full ablation studies and additional experiments. v3 corrects bibliographic metadata for several references

Abstract

Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series. To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs. In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series). Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines all sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. Also, for subsequence workloads, it achieves significant speedups over state-of-the-art subsequence matching engines.
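The mapping of series and query constructs into one space can be illustrated with a small whole-matching sketch. This is not TSseek's query language or index. It assumes fixed-width segments, a construct written as `(trend, low, high)` that a segment satisfies when its trend agrees and its value range fits inside `[low, high]`, and `"*"` as a wildcard for zero or more segments. All names are hypothetical.

```python
def to_segments(series, width):
    """Approximate a series as (trend, min_value, max_value) per segment.

    Adjacent segments share their boundary point.
    """
    segments = []
    for i in range(0, len(series) - 1, width):
        chunk = series[i:i + width + 1]
        delta = chunk[-1] - chunk[0]
        trend = "up" if delta > 0 else "down" if delta < 0 else "flat"
        segments.append((trend, min(chunk), max(chunk)))
    return segments


def matches(segments, pattern):
    """Whole-matching: the pattern must account for every segment."""
    if not pattern:
        return not segments
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        return any(matches(segments[i:], rest) for i in range(len(segments) + 1))
    if not segments:
        return False
    trend, low, high = head
    seg_trend, seg_min, seg_max = segments[0]
    fits = seg_trend == trend and low <= seg_min and seg_max <= high
    return fits and matches(segments[1:], rest)


series = [1, 2, 3, 5, 4, 2, 1]
segments = to_segments(series, width=2)
print(segments)
print(matches(segments, [("up", 0, 10), "*", ("down", 0, 5)]))
```

The pattern reads as "rises somewhere within 0 to 10, then anything, then falls within 0 to 5". A distributed spatial index such as TSseek-X would answer the rectangle tests without scanning every segment, which this sketch does not attempt.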