RAG / Retrieval-Augmented Generation - arXivDaily Topic

2506.20869 2026-06-18 cs.SE cs.AI cs.IR 95%

Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation

Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari, Pekka Abrahamsson

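Affiliations: Faculty of Information Technology and Communication Sciences, Tampere University

Topic match (knowledge-base QA): engineering practice for five domain-specific RAG systems

AI summary: This paper presents five domain-specific RAG applications spanning governance, cybersecurity, agriculture, industrial research, and medical diagnostics. The systems combine multilingual OCR, semantic vector retrieval, and domain-adapted LLMs. They are evaluated on six dimensions, and the authors distill twelve key lessons learned.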

Comments Published in the Proceedings of the 51st Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2025. Lecture Notes in Computer Science, volume 16082, pages 143-158. Springer, 2026

Journal ref LNCS 16082, 143-158, 2026

DOI: 10.1007/978-3-032-04200-2_10

Abstract

Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics. Each system incorporates multilingual OCR, semantic retrieval via vector embeddings, and domain-adapted LLMs, deployed through local servers or cloud APIs to meet distinct user needs. A web-based evaluation involving a total of 100 participants assessed the systems across six dimensions: (i) Ease of Use, (ii) Relevance, (iii) Transparency, (iv) Responsiveness, (v) Accuracy, and (vi) Likelihood of Recommendation. Based on user feedback and our development experience, we documented twelve key lessons learned, highlighting technical, operational, and ethical challenges affecting the reliability and usability of RAG systems in practice.
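
The abstract mentions "semantic retrieval via vector embeddings" without detail. The following is a minimal sketch of that general idea, not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


# Hypothetical mini-corpus touching three of the domains named in the abstract.
docs = [
    "crop rotation improves soil health",
    "phishing emails target employee credentials",
    "municipal budgets require council approval",
]
top = retrieve("how do phishing emails steal credentials", docs, k=1)
print(top)
```

In a RAG system the retrieved passages would then be placed in the LLM prompt as grounding context; that generation step is omitted here.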

2602.20135 2026-06-18 cs.CL cs.AI cs.IR 80%

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

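Affiliations: University of Tehran; Independent Researcher; Amirkabir University of Technology; TEIAS Institute

Topic match (knowledge-base QA): generating multiple-choice questions from a knowledge graph for RAG evaluation

AI summary: KNIGHT builds a topic-specific knowledge graph and reuses it to generate multiple-choice question datasets efficiently, with controllable difficulty. A case study across several domains supports its generation efficiency and quality.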

Comments Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

Journal ref Conference on Parsimony and Learning, Proceedings of Machine Learning Research, 328:989-1024, 2026

Abstract

With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
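
To illustrate how a reusable graph can make multi-hop question generation "a cheap read over the graph", here is a rough sketch under strong assumptions. It is not KNIGHT's algorithm: the triple store, the question template, and the distractor selection are all hypothetical, and no LLM is involved.

```python
# Hypothetical toy graph: (subject, relation) -> object
TRIPLES = {
    ("Marie Curie", "born in"): "Warsaw",
    ("Warsaw", "capital of"): "Poland",
    ("Isaac Newton", "born in"): "Woolsthorpe",
    ("Woolsthorpe", "located in"): "England",
    ("Paris", "capital of"): "France",
}


def two_hop_paths(triples):
    """Yield (start, rel1, middle, rel2, end) chains of exactly two edges."""
    for (start, rel1), middle in triples.items():
        for (subject, rel2), end in triples.items():
            if subject == middle:
                yield start, rel1, middle, rel2, end


def make_mcq(path, triples, n_options=3):
    """Turn a two-hop path into a question with unique answer options."""
    start, rel1, _middle, rel2, answer = path
    # Distractors: other objects in the graph, excluding the correct answer.
    pool = sorted({obj for obj in triples.values() if obj != answer})
    options = sorted([answer] + pool[: n_options - 1])
    question = (
        f"Start from {start}, follow '{rel1}', then '{rel2}'. "
        "Which entity do you reach?"
    )
    return {"question": question, "options": options, "answer": answer}


paths = list(two_hop_paths(TRIPLES))
mcq = make_mcq(paths[0], TRIPLES)
print(mcq["question"])
print(mcq["options"])
```

In this sketch the number of hops acts as a crude difficulty knob: a one-hop question reads a single edge, while a two-hop question requires chaining two. The graph is built once and read many times, which is the cost argument the abstract makes.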

2606.18385 2026-06-18 cs.AI new submission 70%

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

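Affiliations: Vector Institute

Topic match (knowledge-base QA): uses retrieval-augmented generation for evidence-grounded reasoning

AI summary: Proposes the CaVe-VLM-CoT framework, which enforces evidence-grounded reasoning through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier), and introduces CaVeScore, a composite metric within a suite that covers retrieval quality, citation faithfulness, and cross-modal grounding. Reports 87.1% accuracy on ScienceQA and 55.2% on MMMU.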

Abstract

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).
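
The abstract describes CaVeScore as a composite that weights accuracy, citation precision and recall, attribution, and evidence grounding, but it does not give the weights or the exact component definitions. The sketch below only illustrates the general shape of such a metric. The set-based precision and recall, the equal weights, and all sample values are assumptions, not the paper's formula.

```python
def citation_precision_recall(cited, supporting):
    """Set-based precision/recall of cited evidence ids (assumed definition)."""
    cited, supporting = set(cited), set(supporting)
    hits = len(cited & supporting)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(supporting) if supporting else 0.0
    return precision, recall


def composite_score(components, weights):
    """Weighted mean of component scores, each expected in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total


# Hypothetical example: three citations, two of which are truly supporting.
p, r = citation_precision_recall(
    cited=["e1", "e2", "e4"], supporting=["e1", "e2", "e3"]
)

components = {
    "accuracy": 1.0,
    "citation_precision": p,
    "citation_recall": r,
    "attribution": 0.5,   # made-up value for illustration
    "grounding": 0.5,     # made-up value for illustration
}
weights = {name: 1.0 for name in components}  # equal weights are an assumption
score = composite_score(components, weights)
print(round(score, 3))
```

A composite like this can sit well below raw accuracy when citations are weak, which is consistent with the gap between accuracy and CaVeScore reported in the abstract.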

2606.18850 2026-06-18 cs.CL cs.IR new submission 60%

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

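Affiliations: State Key Laboratory of Cognitive Intelligence

Topic match (knowledge-base QA): uses knowledge-graph reasoning rather than conventional RAG.

AI summary: Proposes the ScholarSum framework, in which a hierarchical knowledge graph guides a student to write an initial draft and a teacher-like reviewer iteratively checks and corrects it, targeting scientific summaries that are both fluent and factually consistent.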

Abstract

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.
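
The review loop in the abstract (examine the draft, identify unsupported content, re-retrieve, rewrite, repeat) can be sketched as plain control flow. This is an illustrative toy, not the ScholarSum code from the linked repository: exact-match support checking and word-overlap retrieval stand in for the LLM reviewer and the graph-based retriever, and every function name and sample sentence is hypothetical.

```python
def overlap(a, b):
    """Number of distinct words two sentences share."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def review(draft, evidence):
    """Teacher-like check: return draft sentences with no supporting evidence."""
    return [s for s in draft if s not in evidence]


def rewrite(draft, unsupported, evidence):
    """Replace each unsupported sentence with its closest evidence, or drop it."""
    revised = []
    for sentence in draft:
        if sentence not in unsupported:
            revised.append(sentence)
            continue
        # Targeted re-retrieval: pick the evidence sentence sharing most words.
        best = max(evidence, key=lambda e: overlap(sentence, e))
        if overlap(sentence, best) > 0 and best not in revised:
            revised.append(best)
    return revised


def refine(draft, evidence, max_rounds=3):
    """Iterate review and rewrite until nothing is flagged or rounds run out."""
    for _ in range(max_rounds):
        unsupported = review(draft, evidence)
        if not unsupported:
            break
        draft = rewrite(draft, unsupported, evidence)
    return draft


evidence = [
    "the model is trained on 10k papers",
    "faithfulness improves over the baseline",
]
draft = [
    "the model is trained on 10k papers",
    "faithfulness improves by 50 percent",  # unsupported number
    "authors won an award",                 # unsupported claim
]
final = refine(draft, evidence)
print(final)
```

The `max_rounds` cap is a design choice in this sketch so the loop always terminates even if the reviewer keeps flagging content.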
