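arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Large Language Models / LLM

Large language models, pre-training, instruction tuning, post-training, and language model applications.

2026-06-19 to 2026-06-19; 14 entries; sources: cs.CL, cs.AI, cs.LG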
2606.19700 2026-06-19 cs.CL New submission 85%

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

Affiliations: University of Arizona; College of Information Science, University of Arizona; Biosphere 2, University of Arizona; Department of Ecology and Evolutionary Biology, University of Arizona; Department of Environmental Sciences, University of Arizona

Topic match: Domain LLMs. A domain-adapted small-language-model pipeline for extracting information from Mars science literature.

AI summary: Proposes the TerraMARS pipeline, which uses a domain-adapted small language model to extract structured information from Mars science literature in support of terraforming research.

Comments 16 pages, 1 figure, 4 tables

Abstract

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.
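
The structured-output step described in this abstract can be illustrated with a minimal sketch. It assumes the model returns free text containing one JSON object. The field names (`parameter`, `value`, `unit`) and the sample string are hypothetical and are not taken from the TerraMARS paper.

```python
import json
from typing import Optional

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"parameter": str, "value": (int, float), "unit": str}


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the JSON object embedded in raw model output, or None if invalid."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no JSON object delimiters found
    try:
        record = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON is rejected rather than repaired
    if not isinstance(record, dict):
        return None
    for name, expected in REQUIRED_FIELDS.items():
        value = record.get(name)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return record


# Illustrative input only; not an output of the actual pipeline.
sample = 'Extracted: {"parameter": "surface pressure", "value": 610, "unit": "Pa"}'
record = parse_extraction(sample)
```

Rejecting malformed records outright, instead of attempting repair, is one simple way to keep factual inconsistencies out of downstream models; a real pipeline might log the rejects for review.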

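2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG New submission 80%

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

Affiliations: Leiden University; FutureWhiz

Topic match: Domain LLMs. An adaptive LLM-based tutoring system for high-school students.

AI summary: Proposes a subject-aware prompt routing model built on 14 pedagogical features. Trained in simulation and evaluated in an online A/B test, it switches strategies adaptively in high-school tutoring, improving instructional efficiency and reducing the number of interaction turns.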

Abstract

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).
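
The contrast between a greedy and a stochastic router can be sketched as follows. This is an illustration under stated assumptions, not the paper's router: the abstract does not say how strategies are sampled, so softmax sampling is assumed here, and the per-strategy scores and the `worked_example` strategy name are hypothetical.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw strategy scores into a probability distribution."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp((v - top) / temperature) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def greedy_route(scores):
    """Always pick the highest-scoring prompt strategy."""
    return max(scores, key=scores.get)


def stochastic_route(scores, rng, temperature=1.0):
    """Sample a prompt strategy in proportion to its softmax probability."""
    probs = softmax(scores, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Hypothetical router scores for one tutoring turn.
scores = {"analytical": 0.2, "scaffolding": 0.9, "worked_example": 0.5}
rng = random.Random(0)  # seeded so the run is repeatable
picks = [stochastic_route(scores, rng) for _ in range(1000)]
```

The greedy router returns the same strategy for identical scores, while the stochastic one keeps exploring lower-scored strategies, which is one plausible reading of why sampling could change the conversion rate.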

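2606.19376 2026-06-19 cs.LG cs.AI cs.IR New submission 80%

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

Affiliations: Technical University of Munich; University of Exeter; Horace Greeley High School

Topic match: Domain LLMs. Studies LLM routing to balance cost and service quality.

AI summary: Addresses the tension between LLM inference cost and service quality with SLARouter, an online routing algorithm that learns a cost-optimal policy from sparse, one-sided user feedback. It comes with theoretical guarantees for cost optimality and SLA compliance, and experiments show operating cost reduced by up to 2.2x.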

Comments Preprint. Under review

Abstract

Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.
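
To make the cost-versus-SLA tension concrete, here is a deliberately simple static rule: pick the cheapest model whose estimated satisfaction rate meets the SLA target. This is not SLARouter, which learns its policy online from sparse one-sided feedback and carries theoretical guarantees. The model names, costs, and satisfaction estimates below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_request: float   # arbitrary cost units (assumed)
    est_satisfaction: float   # estimated share of satisfied users, 0..1 (assumed)


def route(options, sla_target):
    """Return the cheapest option meeting the SLA target.

    If no option meets the target, fall back to the option with the
    highest estimated satisfaction.
    """
    eligible = [o for o in options if o.est_satisfaction >= sla_target]
    if eligible:
        return min(eligible, key=lambda o: o.cost_per_request)
    return max(options, key=lambda o: o.est_satisfaction)


options = [
    ModelOption("small", 1.0, 0.82),
    ModelOption("medium", 3.0, 0.91),
    ModelOption("large", 10.0, 0.97),
]
choice = route(options, sla_target=0.90)
```

A static rule like this depends on satisfaction estimates being known in advance; the abstract's point is that in production those estimates must be learned from limited feedback.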

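2605.17443 2026-06-19 cs.CL cs.SD eess.AS Updated version 80%

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Donghyuk Jung, Youngwon Choi

Affiliations: Korea Culture Technology Institute, Republic of Korea; Maum AI Inc., Republic of Korea

Topic match: Domain LLMs. Studies error propagation through ASR-LLM cascades in Korean spoken question answering.

AI summary: Studies how errors propagate through ASR-LLM cascades in Korean spoken question answering. By analyzing downstream semantic failures, it shows error effects that conventional ASR metrics do not fully capture, finds that relative cascade degradation is consistent across LLMs of different performance, identifies single-character ASR errors as a channel for semantic failure, and reports an auxiliary comparison in which a large audio language model outperforms an ASR-LLM pipeline with a matched language model on noisy Korean SQA.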

Comments Preprint. Submitted to APSIPA ASC 2026

Abstract

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.
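
The gap between a conventional ASR metric and downstream meaning can be shown with a small character error rate (CER) sketch. The Korean sentence pair is an illustrative example chosen here, not data from the paper: one substituted syllable turns "apple" into "lion" while the CER stays low.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate over NFC-normalized text (one unit per Hangul syllable)."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "사과를 먹었다"   # "(I) ate an apple"
hypothesis = "사자를 먹었다"  # "(I) ate a lion": one syllable differs
error_rate = cer(reference, hypothesis)
```

Here one of seven characters differs, so the CER is about 0.14 even though the object of the sentence has changed entirely.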

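2604.18105 2026-06-19 eess.AS cs.CL cs.SD Updated version 80%

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

Affiliations: Advanced Intelligent Systems Group, NIO

Topic match: Domain LLMs. Proposes NIM4-ASR, an LLM-based speech recognition framework.

AI summary: Proposes the NIM4-ASR framework, which redesigns the multi-stage training paradigm (pre-training architecture optimization, iterative asynchronous SFT, and ASR-specialized reinforcement learning) and adds production optimizations (noise robustness, streaming inference, and RAG-based hotword customization), reaching state-of-the-art performance with 2.3B parameters.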

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

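2507.00875 2026-06-19 cs.CL cs.HC cs.MA Updated version 80%

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Xi Xuan, Chunyu Kit

Affiliations: City University of Hong Kong, Hong Kong SAR, China

Topic match: Domain LLMs. A multi-agent framework for legal translation.

AI summary: To address the scarce resources and strict terminology and formatting requirements of English-to-Chinese translation of Hong Kong case law, builds HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus, and proposes the multi-agent framework TransLaw. By decomposing the translation task and integrating a legal glossary database and retrieval-augmented generation, TransLaw significantly improves translation quality but still falls short of human experts in stylistic naturalness.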

Comments Accepted at ICML 2026 - AI for Law

Abstract

Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology, citation format, and judicial style. We introduce HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus for HK case law, comprising 344 professionally translated judgments (11,099 sentence pairs; 2.1M tokens) spanning 1997-2022. Building on this resource, we propose TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation, and iterative feedback, with four-dimensional expert review covering semantic alignment, terminology, citation, and style. Benchmarking 13 open-source and commercial LLMs, we demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models, with convergence within 3 iterations. Human evaluation by 10 certified legal translators using our proposed Legal ACS metric confirms gains in legal-semantic accuracy, while showing that TransLaw still trails human experts in stylistic naturalness. The dataset and benchmark code are available at https://github.com/xuanxixi/TransLaw.

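2509.03391 2026-06-19 cs.DL cs.CY Updated version 80%

More Parameters Than Populations: A Systematic Literature Review of Large Language Models within Survey Research

Trent D. Buskirk, Florian Keusch, Leah von der Heyde, Adam Eck

Topic match: Domain LLMs. A systematic review of LLM applications in survey research across three phases.

AI summary: Uses a systematic literature review to assess how LLMs are applied across three phases of survey research (before, during, and after data collection), discusses their potential and pitfalls, and outlines how survey research can contribute to LLM development.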

Comments This working paper is outdated as of June 2026 - please refer to the full version with substantive changes here: https://doi.org/10.31235/osf.io/eubj4_v1 This work was presented at NLPOR 2025 (non-archival): https://openreview.net/forum?id=0Hxhwa56Yg

Abstract

[Working Paper] Survey research has a long-standing history of being a human-powered field, but one that embraces various technologies for the collection, processing, and analysis of various behavioral, political, and social outcomes of interest, among others. At the same time, Large Language Models (LLMs) bring new technological challenges and prerequisites in order to fully harness their potential. In this paper, we report work-in-progress on a systematic literature review based on keyword searches from multiple large-scale databases as well as citation networks that assesses how LLMs are currently being applied within the survey research process. We synthesize and organize our findings according to the survey research process to include examples of LLM usage across three broad phases: pre-data collection, data collection, and post-data collection. We discuss selected examples of potential use cases for LLMs as well as its pitfalls based on examples from existing literature. Considering survey research has rich experience and history regarding data quality, we discuss some opportunities and describe future outlooks for survey research to contribute to the continued development and refinement of LLMs.

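2606.19387 2026-06-19 cs.SE cs.AI New submission 75%

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

You Li, Samuel Mandell, David Z. Pan

Affiliations: The University of Texas at Austin; Fudan University; USA

Topic match: Domain LLMs. LLMs applied to hardware design, though not a general-purpose language model.

AI summary: Proposes a hardware generation framework that combines the creativity of LLMs with the explainability of formal methods, iteratively applying transformation rules to convert a design specification into an RTL program with guaranteed correctness.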

Abstract

Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

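2606.19364 2026-06-19 cs.LG New submission 75%

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

Affiliations: Indian School of Business

Topic match: Domain LLMs. Edge-based prompt compression for cloud LLM inference.

AI summary: Targets the high energy cost of the prompt prefill stage in cloud LLM inference with SPSD, an edge-based pipeline that compresses user prompts using a 4-bit quantized small language model. With response quality kept non-inferior, it saves 99.9 input tokens per distilled call on average, for an estimated net energy saving of 70-270 uWh per call.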

Comments 19 pages, 7 tables, 1 figure, includes appendix

Abstract

The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.
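
The cosine-similarity figures in this abstract (mean, median, share above a 0.70 threshold) rest on a standard computation that can be sketched in a few lines. The abstract does not name the embedding model or say which texts are compared, so the vectors and similarity values below are placeholders, not the paper's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def share_above(values, threshold):
    """Fraction of values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Placeholder similarities for four prompt pairs (not from the paper).
similarities = [0.90, 0.50, 0.71, 0.70]
fraction = share_above(similarities, threshold=0.70)
```

A pair sitting exactly on the threshold is not counted here, which matches the abstract's wording of pairs "above" the reference value.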

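2512.18859 2026-06-19 cs.CL Updated version 75%

Toward Human-Centered AI-Assisted Terminology Work

Antonio San Martin

Affiliations: Université du Québec à Trois-Rivières

Topic match: Domain LLMs. Discusses the use of generative AI in terminology work.

AI summary: Proposes a human-centered AI framework for terminology work: generative AI automates tasks while terminologists' capabilities are augmented and human control is retained, so that terminological data stays accurate and reliable.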

Comments Accepted for publication in the journal Terminology

Abstract

Generative AI is likely to transform terminology work by creating new opportunities for automation. At the same time, it raises concerns about the future of terminologists and terminological resources, as efficiency pressures may encourage excessive automation based on the perception that human expertise can be replaced by AI. However, large language models remain unreliable for terminological purposes due to errors, hallucinations, and various forms of bias, making terminologists indispensable for ensuring the accuracy and reliability of terminological data. This paper argues that human-centered AI, an approach that emphasizes that AI's primary goal should be to contribute to human well-being, provides a framework for maximizing the benefits of generative AI while mitigating its risks. It contends that high levels of automation and meaningful human control are compatible and desirable, and that AI should enhance terminologists' capabilities while preserving their agency and decision-making authority. The implications of AI-assisted terminology work are examined through three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. In particular, the paper examines how AI integration reshapes the role of the terminologist, affects professional values and working conditions, requires the management of AI-generated bias, and calls for the design of AI tools around the terminologist's needs. The paper concludes that a human-centered orientation is necessary to ensure that AI strengthens, rather than undermines, the essential role of terminology work in supporting specialized communication and the accurate transmission of knowledge across languages and cultures.

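2606.19640 2026-06-19 cs.CL cs.AI cs.HC New submission 70%

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

Yunkai Xu, Saeed Abdullah

Affiliations: Pennsylvania State University

Topic match: Domain LLMs. Studies LLM generation and evaluation of multilingual mental health dialogue datasets.

AI summary: Generates clinical dialogues in Mandarin, Bengali, and Hindi by modifying the nationality and language parameters of personas. It finds that adding these parameters alone leads to clinical inconsistency across languages, and that LLMs are inaccurate when judging depression severity in non-English text.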

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

Abstract

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

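2604.23938 2026-06-19 cs.CL Updated version 70%

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

Affiliations: Computational Sciences Center of Excellence

Topic match: Domain LLMs. Uses LLMs for biomedical literature retrieval and synthesis.

AI summary: Proposes TSAssistant, a multi-agent framework that uses a hierarchical instruction architecture and an interactive refinement loop to decompose target safety assessment report generation into specialized subtasks, achieving high reproducibility and evidence traceability.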

Comments Updated with quantitative and expert evaluations

Abstract

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

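2402.14035 2026-06-19 cs.LG cs.AI Updated version 70%

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

Affiliations: Rice University; Google DeepMind; Google Inc; University of California, Davis

Topic match: Domain LLMs. Distills foundation models into compact domain models, covering recommendation and vision.

AI summary: To address the large gaps in capacity, architecture, and modality when distilling foundation models into compact domain models, proposes the DiverseDistill framework, which uses a learnable question-answer mechanism and aligns heterogeneous teacher outputs, recovering 73-114% of the performance gap on recommendation and vision tasks.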

Comments Accepted at the 1st Workshop on Resource-Efficient Learning and Knowledge Discovery (RelKD), KDD 2026

Journal ref Proceedings of the RelKD Workshop at KDD 2026

Abstract

Knowledge distillation from foundation models to compact domain models is challenging due to substantial gaps in capacity, architecture, and modality. For example, in our experiments, distilling from a 76M-parameter language model to a 2M-parameter recommender closes less than 40% of the performance gap between the undistilled student and the teacher. We show that introducing domain-specific experts -- which share the student's architectural characteristics -- alongside the foundation model as a diverse teacher committee significantly improves transfer. However, standard multi-teacher methods fail to exploit this diversity: naively combining heterogeneous teachers can degrade performance below single-teacher distillation. To address this, we propose DiverseDistill, an interactive distillation framework that employs a learnable Question-Answer mechanism to generate teacher-conditioned queries and align heterogeneous teacher outputs into the student's representation space. Unlike methods requiring gradient-based co-optimization or architectural modification of teachers, DiverseDistill operates with frozen teachers using only forward-pass inference through their intermediate layers: no parameter updates, no co-training, and no architectural surgery. A dynamic teacher importance mechanism further reduces training cost by filtering low-relevance teachers per sample (e.g., ~30% fewer forward passes with no quality loss for recommendation tasks), while the entire Distillation Module is discarded after training, adding zero inference overhead. Evaluations on recommendation (38x compression) and vision (3.6x compression) tasks demonstrate that DiverseDistill recovers 73-114% of the teacher-student performance gap, consistently outperforming all single- and multi-teacher baselines.
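
Two of the figures in this abstract can be read through simple ratios. The abstract does not state its formula for gap recovery, so the second expression is an assumed, commonly used definition; the symbols $m_{\text{teacher}}$, $m_{\text{undistilled}}$, and $m_{\text{distilled}}$ (task metric of the teacher, the undistilled student, and the distilled student) are introduced here for illustration. The first expression checks that the stated 38x compression is consistent with the 76M-parameter teacher and 2M-parameter student mentioned for recommendation.

```latex
\[
\text{compression} = \frac{76\,\text{M}}{2\,\text{M}} = 38\times,
\qquad
\text{gap recovered} =
\frac{m_{\text{distilled}} - m_{\text{undistilled}}}
     {m_{\text{teacher}} - m_{\text{undistilled}}} \times 100\%
\]
```

Under this reading, a value above 100% (such as the reported upper end of 114%) would mean the distilled student scores higher than the teacher on the metric.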

2606.20554 2026-06-19 cs.IR cs.AI New submission 60%

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

Affiliations: University of Illinois Urbana-Champaign; Meta MRS

Topic match: Domain LLMs. Generative recommendation, involving language models and user interest modeling.

AI summary: Proposes the G2Rec framework, which unifies graph modeling with semantic tokenization to model user interest context comprehensively and accurately for industrial-scale generative recommendation.

Abstract

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.