Large Language Models / LLM

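2606.19868 2026-06-19 cs.AI New submission 80%

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

大型语言模型黑盒不确定性估计方法的系统评估

Jiayi Wang, Xu-Yao Zhang

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences

Topic match (Other LLM): Systematic evaluation of black-box uncertainty estimation methods for LLMs.

AI summary: Systematically evaluates 24 black-box uncertainty estimation methods on 4 models and 4 datasets. No single method is best everywhere, but methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods perform well under most conditions.

AI Chinese abstract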

尽管大型语言模型（LLMs）在广泛的任务中展现出强大的能力，但其输出通常仍不可靠，可能包含幻觉，因此不确定性估计（UE）对于构建可信赖的LLMs至关重要。在实践中，许多主流LLMs仅通过受限API访问，此时logits和隐藏状态等内部信号不可用，使得黑盒UE尤为重要。然而，现有关于LLMs黑盒UE的研究在方法论上仍然零散，缺乏统一的实证比较。为填补这一空白，我们系统回顾了黑盒UE方法，并将其分为五类：基于口头化、基于采样、基于解释、多智能体和混合方法。我们进一步构建了统一的评估框架，并在4个模型和4个数据集设置下对24种代表性方法进行了基准测试。结果表明，没有单一方法在所有设置中一致占优。然而，在答案空间中进行推理和比较候选的方法通常有效，而结合多种不确定性信号的混合方法在大多数条件下表现良好。通过发布基准数据和统一评估框架，我们旨在促进可重复比较并支持未来研究，同时我们的实证发现为开发未来LLMs的黑盒UE方法提供了实践指导。

English abstract

Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black-box UE especially important. However, existing work on black-box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black-box UE methods and organize them into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black-box UE methods for LLMs.
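As a rough illustration of the sampling-based category named in the abstract, the sketch below scores uncertainty from repeated answers to the same prompt. It is not one of the 24 benchmarked methods; the function names and the exact-match normalization are assumptions made for this example.

```python
import math
from collections import Counter


def _normalize(samples):
    # Assumption: answers are short strings that can be compared after
    # trimming whitespace and lowercasing. Real methods often cluster
    # answers by meaning instead of exact match.
    return [s.strip().lower() for s in samples]


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    normalized = _normalize(samples)
    total = len(normalized)
    counts = Counter(normalized)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_confidence(samples):
    """Most frequent answer and the fraction of samples that agree with it."""
    normalized = _normalize(samples)
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)
```

Higher entropy means the sampled answers disagree more, which a black-box estimator can read as higher uncertainty without access to logits or hidden states.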

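2606.19735 2026-06-19 cs.AI cs.CV New submission 80%

GLARE: A Natural Language Interface for Querying Global Explanations

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

Affiliations: Oregon State University

Topic match (Other LLM): An LLM-based interface that translates natural language into SQL queries.

AI summary: Proposes GLARE, an LLM-based interactive interface that translates natural language questions into SQL queries to aggregate local explanation data, improving the accessibility and usability of global explanations.

Comments: 16 pages, 2 figures

AI Chinese abstract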

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要，但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案，而不是静态产物，我们提出了一种基于LLM的交互接口，提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者，将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能，而无需向用户暴露低级表示。对于每个查询，接口输出统计增强的自然语言响应，支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明，LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

English abstract

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
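To make the idea of SQL aggregation over local explanation data concrete, here is a minimal sketch of the kind of query such a mediator might emit for a question like "which concepts matter most for the zebra class?". The table name, columns, and values are hypothetical and are not taken from GLARE.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of per-image (local) explanation records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE local_explanations ("
        "image_id INTEGER, predicted_class TEXT, concept TEXT, attribution REAL)"
    )
    rows = [
        (1, "zebra", "stripes", 0.8),
        (1, "zebra", "grass", 0.1),
        (2, "zebra", "stripes", 0.6),
        (2, "zebra", "grass", 0.3),
        (3, "horse", "grass", 0.5),
    ]
    conn.executemany("INSERT INTO local_explanations VALUES (?, ?, ?, ?)", rows)
    return conn


# Hypothetical structured query: aggregate local attributions into a
# class-level (global) view, ranked by mean attribution.
QUERY = """
SELECT concept, AVG(attribution) AS mean_attribution, COUNT(*) AS n
FROM local_explanations
WHERE predicted_class = ?
GROUP BY concept
ORDER BY mean_attribution DESC
"""


def top_concepts(conn, predicted_class):
    return conn.execute(QUERY, (predicted_class,)).fetchall()
```

The user never sees the table layout; only the aggregated rows would be turned back into a natural language answer.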

2606.19727 2026-06-19 cs.CL cs.AI New submission 80%

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM：语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

Affiliations: Shenzhen Technology University; New Delhi Institute of Management; Technische Universität Dresden; Ramakrishna Mission Vivekananda Educational and Research Institute; Indian Institute of Technology; Swami Vivekananda Institute of Technology; GuangDong Engineering Technology Research Center of Edge Intelligence

Topic match (Other LLM): Evaluates how well language models understand global dance cultures.

AI summary: Proposes the NRITYAM benchmark, with 9,260 cultural question-answer pairs across 12 languages, to evaluate language models' cultural understanding of global dance traditions across several model types.

Comments: 18 pages, 12 figures, in ECML_PKDD'26

AI Chinese abstract

语言模型已成为塑造现代工作流程的重要工具。然而，其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距，我们提出NRITYAM，一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对，涵盖12种语言，是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发，他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型，包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准，NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在 https://github.com/niladrighosh03/NRITYAM 获取。

English abstract

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.

2606.19698 2026-06-19 cs.CL New submission 80%

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

情感分析看不到的：衡量客户是否得到帮助以及出了什么问题——基于70,000次客服对话

Jason Potteiger

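Affiliations: Dimension Labs

Topic match (Other LLM): Uses GPT-5.4 to estimate customer satisfaction and flag problems.

AI summary: Uses GPT-5.4 to estimate customer satisfaction and flag concrete problems across 70,450 support conversations. The satisfaction estimates are more accurate than sentiment analysis and reveal customer states and problem causes that sentiment analysis cannot capture.

Comments: 25 pages, 6 figures

AI Chinese abstract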

大多数公司通过情感分析大规模阅读客户支持数据，这种方法衡量的是客户听起来如何，而不是他们对结果是否满意。我们在一个领先的在线筹款平台的70,450次客服对话中测试了一种更丰富的替代方案：除了语气，我们使用GPT-5.4估计每位客户的满意度，并标记他们是否报告了具体问题，然后将这三个读数与客户对对话的1到5星评分进行验证。满意度估计跟踪这些评分的效果远好于情感分析，相关性分别为0.47和0.36，并且标记不满客户时的误报率低得多。结构化阅读还能看到情感分析看不到的东西：语气和满意度在44%的对话中不一致，一个单一的“中性”标签掩盖了从安静满意的客户到安静放弃的客户的一切，而最大的群体是“容忍摩擦”——那些满意但仍然报告可修复问题的客户，这是一个情感分析仪表板无法揭示的长期问题。更广泛的发现是，基于LLM的标注可以捕捉到比客户语言语气多得多的信息，为基于客户状态（他们是否满意）和直接从交互与反馈的原始文本数据中提取的问题原因的新业务指标提供了强大潜力。

English abstract

Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.

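2606.19668 2026-06-19 cs.CL New submission 80%

Code-Switching Reveals Language Anchoring in Multilingual LLMs

代码切换揭示多语言大模型中的语言锚定

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

Affiliations: Chung-Ang University; Adobe Research

Topic match (Other LLM): Studies code-switching and language anchoring in multilingual LLMs.

AI summary: Diagnoses language anchoring in multilingual LLMs via grammar-forced code-switching, proposes the Anchor Bias measure, and designs the CANVAS intervention, which effectively mitigates the question-answering degradation caused by code-switching.

Comments: 36 pages, 13 figures, 27 tables

AI Chinese abstract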

多语言大模型（MLLMs）越来越需要处理代码切换（CS）输入，然而混合语言通常会导致性能相对于源语言或目标语言单语版本下降。为了理解这种退化，我们使用语法强制CS作为受控诊断设置，将CS表示相对于其源和目标对应物进行定位。我们引入锚定偏差（Anchor Bias），一种几何度量，用于量化语言锚定，即CS隐藏状态是否更接近其源语言或目标语言对应物。在不同的MLLMs中，锚定偏差揭示了一致的语法框架效应：源框架CS保持源锚定，而目标框架CS向目标方向移动，并显示出更大的问答（QA）退化。受这种表示模式的启发，我们提出了CANVAS（基于上下文锚定的神经向量对齐引导），一种推理时干预方法，从输入中提取源侧画布，并在预填充期间将目标语言隐藏状态软引导向源锚定。CANVAS在MLLMs和CS条件下一致地恢复了QA F1分数，表明内部锚定信号为缓解CS推理失败提供了可行的目标。

English abstract

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
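The abstract describes Anchor Bias only as a geometric measure of whether a code-switched hidden state sits closer to its source or target counterpart. One simple way to express that idea, offered here as an assumption rather than the paper's definition, is a difference of cosine similarities. The function names and the sign convention are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_bias_sketch(h_cs, h_src, h_tgt):
    """Illustrative score, not the paper's formula.

    Positive: the code-switched state is closer to the source-language state.
    Negative: it is closer to the target-language state.
    """
    return cosine(h_cs, h_src) - cosine(h_cs, h_tgt)
```

Under this convention, a source-framed input that stays source-anchored would score above zero, and a target-ward shift would push the score below zero.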

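2606.19353 2026-06-19 cs.CL cs.LG New submission 80%

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

量化上下文学习中的偶然不确定性以稳健衡量LLM预测置信度

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

Affiliations: POSTECH

Topic match (Other LLM): Quantifies uncertainty in in-context learning to improve confidence estimation.

AI summary: To address the sensitivity of in-context learning (ICL) predictions to prompt design, proposes self-function vectors, grounded in Bayesian views and mechanistic interpretability, to estimate aleatoric uncertainty directly. Also designs a rigorous evaluation protocol and validates the method's reliability on synthetic and real-world datasets, along with its usefulness in applications such as hallucination detection.

Comments: Accepted to ACL 2026

AI Chinese abstract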

上下文学习（ICL）使LLM能够从少量示例中适应新任务，但其可靠性仍存疑虑：预测对提示设计和模型理解上下文的能力高度敏感，使得失败源于数据特性还是模型限制难以区分。不确定性分解——将偶然不确定性从认知不确定性中分离——在此场景中尤为关键，然而现有方法针对标准生成任务设计，未能捕捉ICL的独特动态。为解决此问题，我们引入基于贝叶斯观点和ICL机制可解释性的自函数向量概念。这些向量利用模型内部表示来建模上下文提示中学习的潜在概念，从而在贝叶斯框架内直接估计偶然不确定性，并规避了对脆弱的输入或解码操作的依赖。鉴于缺乏既定基准和合适的评估协议，我们还提出了首个严格的评估协议，其中数据以受控方式被操纵，以便精确量化偶然不确定性并将其与认知不确定性分离。借助这一新的评估框架（最初基于合成任务进行概念开发，随后扩展到真实世界数据集），我们展示了所提出的方法比现有替代方法更可靠地衡量LLM在ICL下做出的预测的不确定性。此外，我们展示了它可作为可信相关应用（如幻觉检测）的实用工具。我们的发现为将不确定性的量化观点与模型行为的机制理解联系起来开辟了新方向。

English abstract

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition, which separates aleatoric from epistemic sources, is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce the concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthiness-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.

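2606.19349 2026-06-19 cs.CL cs.AI New submission 80%

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

查询应置于何处？通过解码动力学揭示并缓解扩散大语言模型中上下文学习的位置偏差

Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia

Affiliations: Southeast University

Topic match (Other LLM): Studies positional bias of in-context learning in diffusion LLMs.

AI summary: Systematically analyzes how query position affects generation quality in diffusion LLMs, finds that it matters as much as the semantic quality of the examples, and proposes Auto-ICL, a training-free adaptive routing strategy based on Average Confidence, to optimize query placement.

Comments: 9 figures, 4 tables

AI Chinese abstract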

尽管上下文学习（ICL）在自回归（AR）大语言模型（LLMs）中已被广泛研究，但其在扩散大语言模型（dLLMs）中的机制仍基本未被探索。与受单向因果掩码限制的AR模型不同，dLLMs本质上利用双向注意力，为查询放置提供了广泛的空间灵活性。不幸的是，当前实践通常继承AR风格的尾随查询模板，往往忽略了结构范式转变。本文通过全面分析揭示了查询位置实际上是dLLMs中的一阶变量。通过经验解耦，我们证明了位置方差对生成质量的影响与示例语义质量相当。在内部，这种位置敏感性源于注意力流中的空间“近因效应”以及解码轨迹中依赖于任务的偏移。为了在没有真实标签的情况下缓解这种不稳定性，我们揭示了传统的单步置信度（$C_{decoded}$）在dLLMs中失效。相反，我们提出了平均置信度（$\overline{C}$），一种跟踪迭代解码过程的新指标。通过建立基础的空间ICL基线，我们引入了Auto-ICL，一种无需训练的自适应路由策略，动态优化查询放置，在异构推理和感知任务中稳健地接近最优性能。

English abstract

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial "Recency Effect" in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.
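The contrast between a single-step confidence and a confidence averaged over the iterative decoding process can be sketched as follows. The abstract does not give the exact definitions of $C_{decoded}$ or $\overline{C}$, so the trajectory format (one list of per-token confidences per denoising step) and the routing rule below are assumptions for illustration only.

```python
def step_confidence(token_confidences):
    """Mean confidence of the tokens predicted at one denoising step."""
    return sum(token_confidences) / len(token_confidences)


def final_step_confidence(trajectory):
    """Single-step view: look only at the last denoising step."""
    return step_confidence(trajectory[-1])


def average_confidence(trajectory):
    """Trajectory view: average the per-step confidence over all steps."""
    return sum(step_confidence(step) for step in trajectory) / len(trajectory)


def pick_placement(trajectories_by_placement):
    """Hypothetical label-free router: choose the query placement whose
    decoding trajectory has the highest average confidence."""
    return max(
        trajectories_by_placement,
        key=lambda name: average_confidence(trajectories_by_placement[name]),
    )
```

Two placements can end with the same final-step confidence while differing across the whole trajectory, which is the situation where an averaged signal can separate them and a single-step one cannot.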

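2606.19346 2026-06-19 cs.CL cs.AI New submission 80%

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

跨语言迁移中语言相关性与任务对齐的解耦

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

Affiliations: Haverford College; Brown University

Topic match (Other LLM): Disentangling task alignment from linguistic relatedness in cross-lingual transfer.

AI summary: By fine-tuning large language models and evaluating zero-shot reading comprehension on Semitic and non-Semitic languages, finds that cross-lingual transfer mainly improves task-format alignment rather than language-specific knowledge.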

2603.25702 2026-06-19 cs.CL Version update 80%

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

S2D2：通过免训练自我推测实现扩散LLM的快速解码

Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

Affiliations: Red Hat AI Innovation; MIT-IBM Watson AI Lab; Iowa State University; Core AI, IBM

Topic match (Other LLM): Decoding acceleration for diffusion LLMs, a language-model method.

AI summary: Proposes S2D2, a training-free self-speculative decoding framework. Because a block-diffusion model becomes autoregressive at block size 1, the same model can serve as both drafter and verifier, improving decoding speed and accuracy without adding training or test-time compute.

Comments: Code is available at https://github.com/phymhan/S2D2

AI Chinese abstract

块扩散语言模型通过结合块级自回归解码与块内并行去噪，为超越自回归生成提供了一条有前景的路径。然而，在实际加速所需的少步数场景中，标准的置信度阈值解码往往脆弱：激进的阈值损害质量，而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外训练，要么增加测试时计算。我们提出S2D2，一种用于块扩散语言模型的免训练自我推测解码框架。我们的关键观察是，当块大小减小到1时，块扩散模型变为自回归模型，从而允许相同的预训练模型同时充当草稿模型和验证模型。S2D2在标准块扩散解码中插入一个推测验证步骤，并使用轻量级路由策略来决定何时验证值得其成本。这产生了一种混合解码轨迹，其中扩散并行提出令牌，而自回归模式充当局部序列级评判器。在三个主流块扩散家族中，S2D2在准确性-速度权衡上持续优于强置信度阈值基线。在SDAR上，我们观察到相比自回归解码高达4.7倍加速，相比调优的动态解码基线高达1.57倍加速，同时准确性提升高达4.5个点。在LLaDA2.1-Mini上，S2D2与内置自校正保持互补，包括在保守设置下比静态基线快4.4倍且准确性略高。

English abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
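The draft-then-verify step can be pictured with a generic greedy speculative check: keep the drafted tokens for as long as the verifier agrees, then substitute the verifier's token at the first disagreement. This is a simplified sketch, not S2D2's implementation or its routing policies. `next_token` is a hypothetical callable standing in for the same model run in autoregressive mode, and a real verifier would score all positions in one forward pass rather than in a loop.

```python
def verify_draft(prefix, draft, next_token):
    """Return the tokens to commit after checking a parallel draft.

    prefix:     tokens already committed
    draft:      tokens proposed in parallel (for example by a diffusion step)
    next_token: hypothetical greedy verifier, maps a token list to one token
    """
    context = list(prefix)
    committed = []
    for token in draft:
        expected = next_token(context)
        if token != expected:
            # First disagreement: commit the verifier's token and stop.
            committed.append(expected)
            break
        committed.append(token)
        context.append(token)
    return committed
```

With a toy verifier that always predicts the current context length, `verify_draft([0, 1], [2, 3, 9, 5], lambda ctx: len(ctx))` keeps `2` and `3`, replaces `9` with `4`, and discards the rest of the draft.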

2603.16606 2026-06-19 cs.CL Version update 80%

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR：跨语言与跨模态句子嵌入，连接大规模多语言文本与语音

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Xiang "Tony" Cao, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

Affiliations: FAIR at Meta

Topic match (Other LLM): A cross-lingual, cross-modal sentence embedding model.

AI summary: Proposes OmniSONAR, which uses progressive training and teacher-student distillation to produce unified semantic embeddings of text, speech, code, and mathematical expressions across thousands of languages. It substantially reduces error rates on cross-lingual retrieval and translation tasks and supports zero-shot speech translation.

AI Chinese abstract

跨语言句子编码器通常只覆盖几百种语言，并且常常为了更强的对齐而牺牲下游质量，限制了它们的采用。我们引入了OmniSONAR，一个新的全语言、跨语言和跨模态句子嵌入模型家族，它原生地将文本、语音、代码和数学表达式嵌入到单一语义空间中，同时在数千种语言（从高资源到极低资源变体）的规模上提供最先进的下游性能。为了在不发生表示崩溃的情况下达到这一规模，我们使用了渐进式训练。我们首先使用LLM初始化的编码器-解码器，结合token级解码、新颖的分裂softmax对比损失和合成硬负样本，为200种语言学习一个强大的基础空间。在此基础上，我们通过两阶段教师-学生编码器蒸馏框架扩展到数千种语言变体。最后，我们通过将177种口语无缝映射到该空间，展示了该空间的跨模态可扩展性。OmniSONAR将200种语言的FLORES数据集上的跨语言相似性搜索错误减半，并在1560种语言的BIBLE基准上将错误减少了15倍。它还实现了强大的翻译性能，在多语言基准上优于NLLB-3B，并在1560种语言到英语的BIBLE翻译上比先前模型（包括更大的LLM）高出15个chrF++点。OmniSONAR在MTEB和XLCoST上也表现强劲。对于语音，OmniSONAR实现了43%更低的相似性搜索错误，并达到了SeamlessM4T语音到文本质量的97%，尽管对于翻译是零样本（仅在ASR数据上训练）。最后，通过训练一个编码器-解码器LM Spectrum，仅使用英语文本处理OmniSONAR嵌入序列，我们为复杂的下游任务解锁了向数千种语言和语音的高性能迁移。

English abstract

Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousand language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.
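A cross-lingual similarity search error of the kind quoted above can be approximated as follows: embed parallel sentences in two languages, retrieve the nearest target for each source sentence, and count how often the retrieved sentence is not the true translation. This sketch uses plain cosine similarity over toy vectors as a simplifying assumption. The benchmarks named in the abstract may use a different scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_search_error(src_embeddings, tgt_embeddings):
    """Fraction of source sentences whose nearest target is not its translation.

    Assumes src_embeddings[i] and tgt_embeddings[i] embed the same sentence
    in two different languages.
    """
    errors = 0
    for i, src in enumerate(src_embeddings):
        nearest = max(
            range(len(tgt_embeddings)),
            key=lambda j: cosine(src, tgt_embeddings[j]),
        )
        if nearest != i:
            errors += 1
    return errors / len(src_embeddings)
```

Halving this error rate, as the abstract reports for FLORES, means half as many source sentences retrieve the wrong translation.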

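2512.03818 2026-06-19 cs.CL Version update 80%

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

改善人机编码对齐：心理学构念识别中提示工程的实证评估

Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura

Affiliations: Department of Educational Psychology, Neag School of Education, University of Connecticut; Department of Psychological Sciences, College of Liberal Arts and Sciences, University of Connecticut

Topic match (Other LLM): Optimizing prompt engineering for LLM construct identification in psychological texts.

AI summary: Presents an empirical framework for optimizing LLM performance in identifying constructs in psychological texts via prompt engineering. Experiments with five prompting strategies find that the construct definition and task framing matter most, and that a few-shot approach combining codebook guidance with automatic prompt engineering comes closest to expert judgment.

Comments: 22 pages, 2 figures

AI Chinese abstract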

由于其架构和庞大的预训练数据，大语言模型（LLMs）表现出强大的文本分类性能。然而，LLM的输出——这里指分配给文本的类别——在很大程度上取决于提示的措辞。尽管关于提示工程的文献正在扩展，但很少有研究关注分类任务，更少有研究涉及心理学等领域，在这些领域中，构念具有精确的、理论驱动的定义，而这些定义可能未在预训练数据中得到充分体现。我们提出了一个实证框架，通过提示工程优化LLM在文本中识别构念的性能。我们实验评估了五种提示策略——代码簿引导的实证提示选择、自动提示工程、角色提示、思维链推理和解释性提示——采用零样本和少样本分类。我们发现，角色、思维链和解释并不能完全解决因措辞不当的提示而导致的性能损失。相反，提示中最有影响力的特征是构念定义、任务框架，以及在较小程度上提供的示例。在三个构念和两个模型中，与专家判断最一致的分类来自结合代码簿引导的实证提示选择和自动提示工程的少样本提示。基于我们的发现，我们建议研究人员生成并评估尽可能多的提示变体，无论是人工编写的、自动生成的，或者理想情况下两者兼有，并根据训练数据集中的实证性能选择提示和示例，在保留集中验证最终方法。该程序提供了一种实用、系统且理论驱动的方法，用于在需要与专家判断对齐的环境中优化LLM提示。

English abstract

Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
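The recommended procedure (score many prompt variants on a training set, keep the best, then confirm it on a holdout set) can be sketched in a few lines. `classify` is a hypothetical callable wrapping whatever model is used, and plain accuracy against expert labels stands in for whichever agreement metric a study would actually report.

```python
def accuracy(classify, prompt, examples):
    """Share of (text, expert_label) pairs the prompt classifies correctly."""
    correct = sum(1 for text, label in examples if classify(prompt, text) == label)
    return correct / len(examples)


def select_prompt(classify, prompts, train, holdout):
    """Pick the prompt with the best training score, then validate it once.

    The holdout examples are never used for selection, so the returned
    holdout score is an honest estimate for the chosen prompt.
    """
    best = max(prompts, key=lambda p: accuracy(classify, p, train))
    return best, accuracy(classify, best, holdout)
```

Keeping selection and validation on separate data matters because choosing among many prompt variants on one dataset tends to overstate how well the winner will generalize.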

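2606.20560 2026-06-19 cs.LG cs.AI New submission 75%

How Transparent is DiffusionGemma?

DiffusionGemma 的透明度如何？

Joshua Engels, Callum McDougall, Bilal Chughtai, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira, Rohin Shah, Neel Nanda

Affiliations: Google

Topic match (Other LLM): Studies the reasoning transparency of DiffusionGemma.

AI summary: Studies DiffusionGemma's reasoning transparency in a continuous latent space by decomposing it into variable transparency and algorithmic transparency. Finds that an interpretable token bottleneck reduces the opaque serial depth to 1.1 times that of Gemma 4, and reveals diffusion-specific phenomena.

Comments: 20 main text pages and 6 pages of references and appendices

AI Chinese abstract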

LLM推理透明度是理解模型决策、减少误用和错位以及调试意外模型行为的关键能力。然而，DiffusionGemma在连续潜空间中执行了更大比例的计算；这是否使其推理透明度降低？我们通过将透明度分解为两个组成部分来研究这个问题：变量透明度，即我们是否理解模型计算状态的中间快照；以及算法透明度，即我们是否能够利用这些快照重建模型得出其输出的过程。直观上，DiffusionGemma的变量透明度较差：其不透明串行深度，即在可解释模型状态之间发生的串行计算量，最初似乎是相应自回归Gemma 4模型的28.6倍。然而，我们表明，我们可以通过一个可解释的令牌瓶颈映射去噪步骤之间流动的信息，且下游性能没有下降。将这些中间状态视为可解释的，将不透明串行深度降至仅为Gemma 4的1.1倍。对于扩散模型来说，算法透明度比自回归模型更难，因为画布中的所有令牌预测在每个去噪步骤中都可能发生变化，这使模型有能力在去噪过程中实现复杂的分布式算法。为了开始弥合这一差距，我们进行了一系列可解释性案例研究，发现了扩散特有现象（如非时序推理、令牌和序列涂抹以及中间上下文推理）的初步证据。最后，我们测试了可监控性，这是透明度的一个关键应用，衡量模型输出是否对下游任务有用。我们发现DiffusionGemma的可监控性与Gemma 4相似。

English abstract

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

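2606.20400 2026-06-19 cs.LG New submission 75%

The Significance of Style Diversity in Annotation-Free Synthetic Data Generation

无标注合成数据生成中风格多样性的重要性

Zahra Abbasiantaeb, Zeno Belligoli, Omar Essam, Mohammad Aliannejadi

Affiliations: University of Amsterdam

Topic match (Other LLM): Uses LLMs to generate synthetic dialogue data to improve intent classification.

AI summary: Proposes a dialogue generation framework that needs no human annotation, uses topic and style attributes to increase diversity, and designs two post-hoc stylization models. Experiments show that style diversity matters more than topic diversity, with performance reaching 93.3% of that obtained with human-annotated data.

AI Chinese abstract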

为意图分类生成高实用性的合成数据通常需要人工标注的种子数据，这在快节奏的工业环境中往往不可用。在本文中，我们提出了一个完全无需人工标注数据、仅依赖意图定义的合成对话生成框架。我们提出的对话生成框架利用两种不同类型的主题和风格属性来提高数据多样性。此外，我们提出了两种新颖的后处理风格化模型，称为Univ和Exam，以将合成的LLM生成的语句转换为更多样化、更接近人类的语言风格。为了提升数据质量，我们利用LLM作为评判的过滤过程。在工业数据集和公开数据集上的实验结果表明，所提出的方法达到了使用人工标注训练数据所获得性能的93.3%。至关重要的是，研究结果揭示，对于合成数据的实用性，风格多样性比主题多样性更为关键，因为它能防止模型学习虚假的风格相关性。此外，研究表明，在生成过程中融入风格属性比后处理风格适应更有效。

English abstract

Generating high-utility synthetic data for intent classification typically requires human-annotated seed data, which is often unavailable in fast-paced industrial settings. In this paper, we propose a framework for synthetic dialogue generation that works entirely without human-annotated data, relying solely on intent definitions. Our proposed dialogue generation framework utilizes two different types of topic and style attributes to improve data diversity. Also, we propose two novel post-hoc stylization models called Univ and Exam to transform synthetic LLM-generated utterances into more varied, human-like linguistic styles. To enhance data quality, we utilize an LLM-as-a-judge filtering process. Experimental results on both industrial and public datasets demonstrate that the proposed approach achieves up to 93.3% of the performance obtained using human-annotated training data. Crucially, the findings reveal that style diversity is more critical than topic diversity for synthetic data utility, as it prevents models from learning spurious stylistic correlations. Furthermore, the study shows that incorporating style attributes during the generation process is more effective than post-hoc style adaptation.

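2606.19831 2026-06-19 cs.CL cs.LG New submission 75%

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

杠杆不等于可达性：语言模型中单神经元操控的控制窗口定律

Hongliang Liu

Affiliations: Palo Alto Networks

Topic match (Other LLM): Studies a control-window theory for single-neuron interventions in language models.

AI summary: Proposes a budget-normalized control-window framework that uses a coherence budget, defined as the ratio of the residual norm to the write norm, to predict when a single-neuron intervention yields coherent behavioral control, and validates its predictive accuracy on 15 neurons.

AI Chinese abstract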

对齐语言模型通过稀疏前馈神经元门控拒绝和语言路由等行为，但尚无理论预测单神经元干预何时连贯地控制行为而非导致输出崩溃。我们开发了一个预算归一化的控制窗口框架用于单神经元操控。沿一个写入方向的剂量简化为一个控制坐标：残差流与写入之间的对齐，该对齐沿着一条通用饱和曲线驱动，以残差范数除以写入范数设定的相干预算为单位。当行为触发点低于崩溃上限时，存在连贯控制。同一坐标控制良性模式切换和拒绝；上限由权重和一次通用前向传播得出，而触发点在 rollout 时测量。在15个保留神经元上，预测上限的平均绝对误差为0.14，在批量层中约为0.07，并且承诺的开启或关闭判定在11个神经元上成立，而多数基线为10/15。关闭情况揭示了三种失败模式而非违反：触发前崩溃、深度不足以传播、或归一化限制了单个神经元能推动的距离。该定律解释了为什么局部梯度归因反直觉地预测控制：真正的控制器偏离读出轴写入，并携带接近零的一阶梯度。由窗口精确化的仅前向对比筛选恢复了归因遗漏的控制器。在拒绝这一最难案例中，干预成功是类型化的而非标量：连贯旁路和严格可操作可达性分离，因此一个神经元可以在流畅、任务相关且无操作内容的文本中翻转拒绝，而真正的可操作可达性仅出现在六个审计的 Llama 枢轴中的三个，且仅在较晚的 rollout 时间范围内。因此，单神经元操控是对可控性的预算化、类型化审计，而非固定剂量的轶事。

English abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, yet no theory predicts when a single-neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget-normalized control-window framework for single-neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten-of-fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.

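2606.19679 2026-06-19 cs.LG cs.AI New submission 80%

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

Masih Eskandar, Miquel Sirera Perelló, Stratis Ioannidis, Jennifer Dy

Affiliation: Department of Electrical and Computer Engineering

Topic match (post-training): a lifelong knowledge-editing method that selects layers dynamically and projects updates onto the null space.

AI summary: Proposes LOKI, which selects layers dynamically with the Hilbert-Schmidt Independence Criterion and projects gradient updates onto the null space of the model weights, enabling lifelong knowledge editing without access to old knowledge; average accuracy improves by 14%.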

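2606.19607 2026-06-19 cs.AI stat.AP New submission 80%

Which Pairs to Compare for LLM Post-Training?

Jiangze Han, Vineet Goyal, Will Ma

Affiliation: Columbia University

Topic match (post-training): studies which comparison pairs to select in preference-based post-training to improve sample efficiency.

AI summary: Studies how to select the most informative comparison pairs in preference-based post-training. It frames comparison curation as a sampling-design problem, derives an optimization criterion from a theoretical analysis of DPO training, and shows experimentally that sample efficiency improves.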

Abstract

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
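For orientation, the per-pair objective that the analysis above builds on can be sketched as follows. This is the generic DPO loss for a single labeled comparison, not the paper's pair-selection design; the function name `dpo_pair_loss`, the default `beta`, and the use of summed sequence log-probabilities as inputs are illustrative assumptions.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one labeled comparison pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model. beta=0.1 is an illustrative default.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # completion over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

A pair with zero margin gives a loss of log 2; the loss falls as the policy separates the chosen completion from the rejected one more strongly than the reference does.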

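2606.19549 2026-06-19 cs.LG New submission 80%

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

Lin Tang, Wei Zhang, Jing Li, Hongyu Chen, Ming Zhao, Yuxuan Wang

Affiliations: Sichuan University; University of Electronic Science and Technology of China

Topic match (post-training): predicts the mergeability of LoRA adapters, which involves model fine-tuning.

AI summary: Proposes MergeProbe, which predicts the mergeability of LoRA adapters from early-training signals and attains the best average and worst-case retention on the MERGE-PEFT benchmark.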

Abstract

Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training -- chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.
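The abstract names cross-task alignment of low-rank updates and gradients as a main signal but does not give a formula. One plausible, minimal proxy is the cosine similarity between two flattened updates, sketched below; the function name `cosine_alignment` and the choice of cosine similarity are assumptions for illustration, not MergeProbe's actual features.

```python
import math

def cosine_alignment(update_a, update_b):
    """Cosine similarity between two flattened adapter updates (or gradients).

    Values near +1 suggest the two tasks push weights the same way,
    values near -1 suggest directly conflicting updates.
    """
    dot = sum(a * b for a, b in zip(update_a, update_b))
    norm_a = math.sqrt(sum(a * a for a in update_a))
    norm_b = math.sqrt(sum(b * b for b in update_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero update carries no directional information
    return dot / (norm_a * norm_b)
```

Under this assumption, a strongly negative score early in training would be a cheap warning sign that merging the two adapters may cause destructive interference.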

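2606.19542 2026-06-19 cs.LG New submission 80%

Tracking Representation Dynamics in Large Language Models with Persistent Homology

Naman Malhotra, Jay Ambadkar, Abhinav Gupta, Kushal Kasivel, Abbas Schwarz, Kamillo Ferry, Anthea Monod

Affiliation: Imperial College London

Topic match (post-training): analyzes topological changes in LLM internal representations during alignment.

AI summary: Analyzes the topology of activation spaces with persistent homology and finds that topological reorganization during alignment occurs mainly early in training, and that different alignment objectives yield distinguishable topological trajectories.

Comments: 29 pages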

Abstract

Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking the topology of activation spaces throughout fine-tuning. Across four transformer language models ranging from 1B to 7B parameters and three alignment objectives corresponding to helpful, harmless, and mixed training data, we find that the majority of topological reorganization occurs during the earliest stages of training. A dense checkpoint analysis reveals a transient peak in topological activity followed by rapid stabilization. We further show that different alignment objectives induce distinguishable topological trajectories, while instruction-tuned and pretrained models exhibit qualitatively different patterns of evolution. Our results suggest that persistent homology provides a complementary perspective on alignment, revealing representation-level changes that are not apparent from behavioral metrics alone.

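2602.09689 2026-06-19 cs.LG Version update 80%

Model soups need only one ingredient

Alireza Abdollahpoorrostam, Nikolaos Dimitriadis, Adam Hazimeh, Pascal Frossard

Affiliations: EPFL; EPFL LTS4

Topic match (post-training): MonoSoup builds a single-checkpoint model soup via SVD.

AI summary: Proposes MonoSoup, which applies SVD to the layer updates of a single checkpoint and automatically reweights the components using entropy-based effective rank, achieving a strong ID-OOD balance without multiple checkpoints.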

Abstract

Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer's update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically re-weigh these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining much of their benefits without their computational overhead.
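The abstract does not state which definition of entropy-based effective rank MonoSoup uses. A common one, assumed here, is the exponential of the Shannon entropy of the singular values normalized to sum to one. The sketch takes a list of singular values (for example from the SVD of one layer's update) and is illustrative only; how the paper turns this quantity into layer-wise coefficients is not shown.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank of a spectrum.

    Normalizes the positive singular values into a probability
    distribution and returns exp(Shannon entropy). Assumed definition;
    the paper's exact variant may differ.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0  # an all-zero update has no directions at all
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A flat spectrum of k equal singular values gives an effective rank of k, while a spectrum dominated by one direction gives a value close to 1, so the quantity summarizes how concentrated a layer's update is.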

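2606.19946 2026-06-19 cs.CL cs.LG New submission 75%

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

Yu Deng

Topic match (post-training): proposes GEMS, a training-free activation-intervention method for multiple semantic directions.

AI summary: Proposes GEMS, which uses geometric constraints (norm-preserving weighted superposition, targeted attention-pathway injection, and real-time orthogonalization) to address distributional deviation and directional interference in training-free multi-directional activation intervention, preserving 98% accuracy on GSM8K.

Comments: 30 pages, 5 figures, 20 tables. Code and logs are available at: https://github.com/LuLu663939/gems-multi-semantic-steering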

Abstract

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
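Two of the geometric ideas named above, orthogonalizing the steering directions and keeping the hidden-state norm fixed, can be illustrated with plain vectors. The sketch below uses textbook Gram-Schmidt and a simple rescaling; it is not the GEMS implementation, it omits attention-pathway injection, and all function names are hypothetical. It assumes the directions are linearly independent and the steered vector is nonzero.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def orthogonalize(directions):
    """Gram-Schmidt: strip from each direction its components along the
    directions already kept, then normalize to unit length."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        n = norm(r)
        if n > 1e-12:  # drop numerically redundant directions
            basis.append([x / n for x in r])
    return basis

def steer_norm_preserving(hidden, directions, weights):
    """Add weighted orthogonalized directions to a hidden state, then
    rescale so the result keeps the original hidden-state norm."""
    steered = list(hidden)
    for w, b in zip(weights, orthogonalize(directions)):
        steered = [x + w * y for x, y in zip(steered, b)]
    scale = norm(hidden) / norm(steered)
    return [x * scale for x in steered]
```

Orthogonalizing first means each weight acts along its own axis, and the final rescaling prevents the added vectors from inflating the activation norm.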

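2606.19528 2026-06-19 cs.LG cs.AI New submission 80%

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

Hassan Dbouk, Matthias Reisser, Prathamesh Mandke, Likhita Arun Navali, Christos Louizos

Affiliation: GitHub

Topic match (pre-training): techniques for reducing peak memory in LoRA fine-tuning of LLMs.

AI summary: To address the memory bottleneck of LoRA fine-tuning of LLMs on edge devices, proposes four complementary techniques (quantization, checkpointing, softmax approximation, and logits masking) that reduce peak memory by up to 26x on Llama-3.2 3B and up to 28x on Qwen-2.5 3B.

Comments: Hassan Dbouk and Matthias Reisser contributed equally to this work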

2602.04396 2026-06-19 cs.LG cs.AI Version update 80%

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

Affiliations: University of Cambridge; Institute of Science and Technology Austria; Lancaster University; Flower Labs

Topic match (pre-training): the LoRDO framework for distributed low-rank optimization with infrequent communication.

AI summary: Proposes LoRDO, a framework that unifies low-rank optimization with infrequent synchronization and restores subspace exploration through a full-rank quasi-hyperbolic update; at model scales of 125M-720M it achieves near-parity with low-rank DDP while reducing communication by about 10x.

Comments: Accepted at ICML 2026

Abstract

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
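The abstract does not define its full-rank quasi-hyperbolic update. For orientation only, the sketch below shows the generic quasi-hyperbolic momentum (QHM) step on a scalar parameter, which interpolates between the raw gradient and a momentum buffer. How LoRDO combines such a step with low-rank projections and infrequent synchronization is not shown; the function name and default hyperparameters are illustrative assumptions.

```python
def qhm_step(param, grad, momentum, lr=0.1, beta=0.9, nu=0.7):
    """One generic quasi-hyperbolic momentum step.

    nu=0 recovers plain SGD (raw gradient only); nu=1 recovers
    momentum SGD with an exponential-moving-average buffer.
    Returns the updated parameter and the updated momentum buffer.
    """
    # Exponential moving average of gradients.
    new_momentum = beta * momentum + (1.0 - beta) * grad
    # Weighted mix of the immediate gradient and the momentum buffer.
    update = (1.0 - nu) * grad + nu * new_momentum
    return param - lr * update, new_momentum
```

The raw-gradient term is what keeps part of each step outside whatever the momentum buffer currently encodes, which is the general intuition behind using a quasi-hyperbolic form to restore exploration.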

2606.19625 2026-06-19 cs.CL cs.LG New submission 75%

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

Affiliations: Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative

Topic match (pre-training): uses training-data attribution to trace the sources of social and STEM reasoning.

AI summary: Using training-data attribution, finds that social reasoning and STEM reasoning in OLMo3-7B draw on different regions of the pretraining corpus, and that the difference is more pronounced at the reasoning level than at the knowledge level.

Comments: Under review at COLM 2026 (Conference)

Abstract

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.

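2606.19376 2026-06-19 cs.LG cs.AI cs.IR New submission 80%

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

Affiliations: Technical University of Munich; University of Exeter; Horace Greeley High School

Topic match (domain-specific LLMs): studies LLM routing to optimize cost and quality of service.

AI summary: To address the tension between LLM inference cost and quality of service, proposes SLARouter, an online routing algorithm that learns a cost-optimal policy from sparse, one-sided user feedback, with theoretical guarantees of cost optimality and SLA compliance; experiments show operating cost reduced by up to 2.2x.

Comments: Preprint. Under review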

Abstract

Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.

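2605.17443 2026-06-19 cs.CL cs.SD eess.AS Version update 80%

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Donghyuk Jung, Youngwon Choi

Affiliations: Korea Culture Technology Institute, Republic of Korea; Maum AI Inc., Republic of Korea

Topic match (domain-specific LLMs): studies error propagation in ASR-LLM cascades for Korean spoken QA.

AI summary: Studies error propagation in ASR-LLM cascades for Korean spoken question answering. By analyzing downstream semantic failures, it shows error effects that conventional ASR metrics do not fully capture, finds that cascade degradation is consistent across LLMs of differing capability, identifies single-character ASR errors as a channel for semantic failure, and shows through an auxiliary comparison that large audio language models outperform ASR-LLM pipelines with a matched language model on noisy Korean SQA.

Comments: Preprint. Submitted to APSIPA ASC 2026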

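2604.18105 2026-06-19 eess.AS cs.CL cs.SD Version update 80%

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

Affiliation: Advanced Intelligent Systems Group, NIO

Topic match (domain-specific LLMs): proposes NIM4-ASR, an LLM-based speech recognition framework.

AI summary: Proposes the NIM4-ASR framework, which redesigns the multi-stage training paradigm (pre-training architecture optimization, iterative asynchronous SFT, and ASR-specialized reinforcement learning) and adds production optimizations (noise robustness, streaming inference, and RAG-based hotword customization), achieving state-of-the-art performance with 2.3B parameters.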

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

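2507.00875 2026-06-19 cs.CL cs.HC cs.MA Version update 80%

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Xi Xuan, Chunyu Kit

Affiliation: City University of Hong Kong, Hong Kong SAR, China

Topic match (domain-specific LLMs): a multi-agent framework for legal translation.

AI summary: To address the scarce resources and the strict terminology and formatting requirements of English-to-Chinese translation of Hong Kong case law, builds HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus, and proposes TransLaw, a multi-agent framework that decomposes the translation task and integrates a legal glossary database and retrieval-augmented generation; translation quality improves significantly but still falls short of human experts in stylistic naturalness.

Comments: Accepted at ICML 2026 - AI for Law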

Abstract

Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology, citation format, and judicial style. We introduce HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus for HK case law, comprising 344 professionally translated judgments (11,099 sentence pairs; 2.1M tokens) spanning 1997-2022. Building on this resource, we propose TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation, and iterative feedback, with four-dimensional expert review covering semantic alignment, terminology, citation, and style. Benchmarking 13 open-source and commercial LLMs, we demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models, with convergence within 3 iterations. Human evaluation by 10 certified legal translators using our proposed Legal ACS metric confirms gains in legal-semantic accuracy, while showing that TransLaw still trails human experts in stylistic naturalness. The dataset and benchmark code are available at https://github.com/xuanxixi/TransLaw.

2509.03391 2026-06-19 cs.DL cs.CY Version update 80%

More Parameters Than Populations: A Systematic Literature Review of Large Language Models within Survey Research

Trent D. Buskirk, Florian Keusch, Leah von der Heyde, Adam Eck

Topic match (domain-specific LLMs): systematic review of LLM applications in survey research across three phases.

AI summary: Through a systematic literature review, assesses how LLMs are applied across three phases of survey research (before, during, and after data collection), discusses their potential and pitfalls, and looks ahead to how survey research can contribute to LLM development.

Comments: This working paper is outdated as of June 2026 - please refer to the full version with substantive changes here: https://doi.org/10.31235/osf.io/eubj4_v1 This work was presented at NLPOR 2025 (non-archival): https://openreview.net/forum?id=0Hxhwa56Yg

Abstract

[Working Paper] Survey research has a long-standing history of being a human-powered field, but one that embraces various technologies for the collection, processing, and analysis of various behavioral, political, and social outcomes of interest, among others. At the same time, Large Language Models (LLMs) bring new technological challenges and prerequisites in order to fully harness their potential. In this paper, we report work-in-progress on a systematic literature review based on keyword searches from multiple large-scale databases as well as citation networks that assesses how LLMs are currently being applied within the survey research process. We synthesize and organize our findings according to the survey research process to include examples of LLM usage across three broad phases: pre-data collection, data collection, and post-data collection. We discuss selected examples of potential use cases for LLMs as well as its pitfalls based on examples from existing literature. Considering survey research has rich experience and history regarding data quality, we discuss some opportunities and describe future outlooks for survey research to contribute to the continued development and refinement of LLMs.

2605.16865 2026-06-19 cs.CL Version update 80%

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

Affiliations: Carnegie Mellon University; Jinesis Lab, University of Toronto & Vector Institute; University of Illinois Urbana-Champaign; Princeton University; Cornell University; The University of Tokyo; RIKEN AIP; Max Planck Institute for Intelligent Systems, Tübingen, Germany; EuroSafeAI

Topic match (instruction tuning): mixed contextual self-distillation for knowledge injection.

AI summary: Proposes MixSD, which mixes tokens from the model's own conditionals to inject knowledge in a way that stays aligned with the model's generation distribution, so that new facts are learned while pretrained capabilities are preserved.

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

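2606.19710 2026-06-19 cs.CL cs.AI New submission 75%

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

Affiliation: Thomas Jefferson High School for Science and Technology

Topic match (instruction tuning): fine-tunes an LLM to improve domain-specific information extraction.

AI summary: Proposes FineREX, a pipeline built on a fine-tuned LLM that extracts entities and relationships from legal documents to construct knowledge graphs, with absolute F1 improvements of 15.50% (entities) and 31.46% (relationships) and a 50% reduction in processing time.

Comments: Code available at https://github.com/ElijahFeldman7/FineREX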

Abstract

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of $512$ text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

1. Other LLM topics (14 papers)

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

GLARE: A Natural Language Interface for Querying Global Explanations

NRITYAM: Language Models Meet Art and Heritage of Dance

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

How Transparent is DiffusionGemma?

The Significance of Style Diversity in Annotation-Free Synthetic Data Generation

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

2. Post-training (6 papers)

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

Which Pairs to Compare for LLM Post-Training?

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

Tracking Representation Dynamics in Large Language Models with Persistent Homology

Model soups need only one ingredient

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

3. Pre-training (3 papers)

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Where Does Social Reasoning Come From? Capability Provenance in Language Models

4. Domain-specific LLMs (5 papers)

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

More Parameters Than Populations: A Systematic Literature Review of Large Language Models within Survey Research

5. Instruction tuning (2 papers)

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs