Large Language Models / LLM - arXivDaily Topic

2606.19700 2026-06-19 cs.CL New submission 85%

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

Affiliations: University of Arizona; College of Information Science, University of Arizona; Biosphere 2, University of Arizona; Department of Ecology and Evolutionary Biology, University of Arizona; Department of Environmental Sciences, University of Arizona

Topic match (domain LLMs): a domain-adapted small-language-model pipeline for extracting information from Mars science literature.

AI summary: Proposes TerraMARS, a pipeline built around a domain-adapted small language model that extracts structured information from Mars science literature to support terraforming research.

Comments 16 pages, 1 figure, 4 tables

Abstract

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.
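
The abstract mentions a chunking stage and JSON-formatted outputs without giving details. As a rough illustration only, the sketch below shows one common way to split text into overlapping word windows and to check that a model response parses as JSON with a fixed set of keys. The window sizes, the function names, and the `REQUIRED_KEYS` schema are assumptions made for this example, not details from the paper.

```python
import json
from typing import Optional

# Hypothetical schema for one extracted fact; the paper's real schema is not given.
REQUIRED_KEYS = {"parameter", "value", "unit"}


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (a generic chunking sketch)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the model output as a dict if it is valid JSON with the required keys."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
        return None
    return record
```

Rejecting malformed responses at parse time is one simple way to keep invalid records out of downstream tools; it does not address the factual-consistency issue the authors note.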

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG New submission 80%

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

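Affiliations: Leiden University; FutureWhiz

Topic match (domain LLMs): an adaptive LLM-based tutoring system for high-school students.

AI summary: Proposes a subject-aware prompt routing model based on 14 pedagogical features. Trained in simulation and evaluated in an online A/B test, it switches strategies adaptively in high-school tutoring, improving instructional efficiency and reducing the number of interaction turns.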

Abstract

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).
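
The contrast between the greedy and the stochastic router can be sketched in a few lines. The example below is a minimal illustration, assuming the router produces one score per prompting strategy: the greedy variant always takes the top-scoring strategy, while the stochastic variant samples from a softmax over the scores. The score values, the temperature parameter, and the function names are hypothetical; the abstract does not say how the paper's router computes or samples its scores.

```python
import math
import random


def softmax(scores: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Convert raw strategy scores into a probability distribution."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - top) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def greedy_route(scores: dict[str, float]) -> str:
    """Always pick the highest-scoring strategy."""
    return max(scores, key=scores.get)


def stochastic_route(scores: dict[str, float], rng: random.Random,
                     temperature: float = 1.0) -> str:
    """Sample a strategy in proportion to its softmax probability."""
    probs = softmax(scores, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
```

With scores such as `{"analytical": 0.2, "scaffolding": 0.7}`, the greedy router returns `"scaffolding"` on every call, whereas the stochastic router still selects `"analytical"` some of the time.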

2606.19376 2026-06-19 cs.LG cs.AI cs.IR New submission 80%

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

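Affiliations: Technical University of Munich; University of Exeter; Horace Greeley High School

Topic match (domain LLMs): studies LLM routing to optimize cost and service quality.

AI summary: To address the tension between LLM inference cost and service quality, proposes SLARouter, an online routing algorithm that learns a cost-optimal policy from sparse, one-sided user feedback, with theoretical guarantees of cost optimality and SLA compliance. Experiments show cost reductions of up to 2.2x.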

Comments Preprint. Under review

Abstract

Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.
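
To make the cost-versus-SLA trade-off concrete, here is a deliberately simple sketch. It is not the SLARouter algorithm, whose details the abstract does not give. The sketch assumes that the only feedback is whether a user complained about a response, that each model has a fixed per-request cost, and that the most expensive model is the safest fallback. It routes to the cheapest model whose Hoeffding lower confidence bound on the satisfaction rate meets the SLA target. All names and numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    cost: float         # cost per request, in hypothetical units
    trials: int = 0     # requests served by this model so far
    satisfied: int = 0  # requests that drew no complaint

    def lower_bound(self, delta: float) -> float:
        """Hoeffding lower confidence bound on the satisfaction rate."""
        if self.trials == 0:
            return 0.0  # no evidence yet, so nothing is guaranteed
        mean = self.satisfied / self.trials
        radius = math.sqrt(math.log(1.0 / delta) / (2 * self.trials))
        return max(0.0, mean - radius)


def choose_model(models: list[ModelStats], sla_target: float,
                 delta: float = 0.05) -> ModelStats:
    """Cheapest model that provably meets the target, else the costliest one."""
    eligible = [m for m in models if m.lower_bound(delta) >= sla_target]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return max(models, key=lambda m: m.cost)  # assumed strongest fallback


def record_feedback(model: ModelStats, complained: bool) -> None:
    """Update counts from one-sided feedback (only complaints are observed)."""
    model.trials += 1
    if not complained:
        model.satisfied += 1
```

This sketch never explores: a cheap model only becomes eligible once it has somehow accumulated enough traffic. A real online router needs an exploration rule for that, which is one of the places where an algorithm with formal guarantees would differ.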

2606.19387 2026-06-19 cs.SE cs.AI New submission 75%

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

You Li, Samuel Mandell, David Z. Pan

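Affiliations: The University of Texas at Austin; Fudan University; USA

Topic match (domain LLMs): LLMs applied to hardware design rather than general-purpose language modeling.

AI summary: Proposes a hardware generation framework that combines the creativity of LLMs with the interpretability of formal methods, converting design specifications into RTL programs with guaranteed correctness by iteratively applying transformation rules.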

2606.19364 2026-06-19 cs.LG New submission 75%

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

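Affiliations: Indian School of Business

Topic match (domain LLMs): edge-based prompt compression for cloud LLM inference.

AI summary: To address the high energy cost of the prompt prefill stage in cloud LLM inference, proposes SPSD, an edge-based pipeline that compresses user prompts with a 4-bit quantized small language model. It saves an average of 99.9 input tokens per distilled call while keeping response quality non-inferior, for an estimated net energy saving of 70-270 uWh per call.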

Comments 19 pages, 7 tables, 1 figure, includes appendix

Abstract

The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.
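
The non-inferiority claim can be illustrated with a small paired-score check. The sketch below assumes each prompt has one judge score for the distilled path and one for the raw path, and uses a one-sided normal-approximation confidence bound on the mean paired difference. The 1-point margin comes from the abstract; the normal approximation, the 0.05 significance level, and the function name are assumptions, since the abstract does not state the exact test the authors used.

```python
from statistics import NormalDist, mean, stdev


def non_inferior(distilled: list[float], raw: list[float],
                 margin: float = 1.0, alpha: float = 0.05) -> tuple[bool, float]:
    """Check whether distilled scores are non-inferior to raw scores.

    Returns (verdict, lower_bound), where lower_bound is the one-sided
    (1 - alpha) lower confidence bound on mean(distilled - raw).
    """
    if len(distilled) != len(raw) or len(raw) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [d - r for d, r in zip(distilled, raw)]
    standard_error = stdev(diffs) / len(diffs) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    lower_bound = mean(diffs) - z * standard_error
    # Non-inferior if even the pessimistic estimate loses less than the margin.
    return lower_bound > -margin, lower_bound
```

The verdict depends on the lower bound rather than on the mean difference alone, so a small sample with noisy scores can fail the check even when the average scores are nearly equal.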

2606.19640 2026-06-19 cs.CL cs.AI cs.HC New submission 70%

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

Yunkai Xu, Saeed Abdullah

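Affiliations: Pennsylvania State University

Topic match (domain LLMs): studies LLM-generated multilingual mental health dialogue datasets and their evaluation.

AI summary: Generates clinical dialogues in Mandarin, Bengali, and Hindi by modifying the nationality and language parameters of personas. Finds that adding these parameters alone leads to clinical inconsistency across languages, and that LLMs are inaccurate when judging depression severity in non-English text.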

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

Abstract

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.
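
One way to quantify the cross-language inconsistency described above is to compare a judge's severity score for each localized dialogue with its score for the English version of the same persona. The sketch below is an illustrative assumption about how such a comparison could be set up: the tuple layout, the function name, and the use of mean absolute difference are hypothetical, not the paper's evaluation protocol.

```python
from collections import defaultdict
from statistics import mean


def severity_gap_by_language(
    ratings: list[tuple[str, str, float]],
    baseline_language: str = "English",
) -> dict[str, float]:
    """Mean absolute gap between each language's scores and the baseline.

    ratings holds (persona_id, language, severity_score) tuples. Personas
    without a baseline score are skipped.
    """
    baseline = {pid: score for pid, lang, score in ratings
                if lang == baseline_language}
    gaps = defaultdict(list)
    for pid, lang, score in ratings:
        if lang != baseline_language and pid in baseline:
            gaps[lang].append(abs(score - baseline[pid]))
    return {lang: mean(values) for lang, values in gaps.items()}
```

A gap near zero for a language suggests that the judge rates the localized dialogues much as it rates the English ones. A large gap does not by itself show whether the generation step or the judge is at fault.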

2606.20554 2026-06-19 cs.IR cs.AI New submission 60%

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

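Affiliations: University of Illinois Urbana-Champaign; Meta MRS

Topic match (domain LLMs): generative recommendation involving language models and user interest modeling.

AI summary: Proposes G2Rec, a framework that unifies graph modeling with semantic tokenization to model user interest context comprehensively and accurately for industrial-scale generative recommendation.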

Abstract

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.
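
The abstract does not say how the co-engagement graph is built. As one plausible reading, offered purely for illustration, the sketch below links two items whenever the same user has engaged with both and weights each edge by the number of such users. The input layout and the function name are hypothetical, and this is not the G2Rec construction.

```python
from collections import Counter
from itertools import combinations


def co_engagement_edges(histories: dict[str, list[str]]) -> Counter:
    """Count, for each item pair, how many users engaged with both items.

    histories maps a user id to the list of item ids the user interacted with.
    Edge keys are (item_a, item_b) tuples with item_a < item_b.
    """
    edges: Counter = Counter()
    for items in histories.values():
        # De-duplicate so repeated interactions count once per user.
        for a, b in combinations(sorted(set(items)), 2):
            edges[(a, b)] += 1
    return edges
```

Enumerating all pairs is quadratic in each user's history length, so this form only suits small examples. The scalability concerns raised in the abstract are exactly what a naive construction like this runs into.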
