Large Language Models / LLM - arXivDaily Topic

2606.18431 2026-06-18 cs.LG cs.DC New submission 85%

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

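Affiliations: Cornell University, Computer Science Department; Cornell University, Electrical and Computer Engineering Department; Cornell University, Operations Research and Information Engineering Department; Microsoft Azure System Research; NVIDIA Corporation

Topic match (Other LLM): proposes an LLM inference scheduling framework that optimizes tail latency

AI summary: Length-prediction-based scheduling for LLM inference is fragile under distribution shift and gives limited control over tail latency. The paper proposes a prediction-free, distribution-aware scheduling framework that uses lightweight statistical signals for soft priority boosting, combined with cache-aware preemption, reducing P99 TTLT by up to 35-50% and TTFT by 34-47% across workloads.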

Journal ref Forty-Third International Conference on Machine Learning (2026)

Abstract

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.
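The contrast the abstract draws between mean-centric metrics and tail latency can be illustrated with a small calculation. This is only a sketch: the latency values are invented for illustration, and the nearest-rank percentile is one common definition, not necessarily the one used in the paper.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Multiply before dividing so integer inputs stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical end-to-end latencies in seconds: 98 fast requests and 2 stragglers.
latencies = [1.0] * 98 + [30.0] * 2

mean_latency = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean_latency:.2f}s p50={p50:.1f}s p99={p99:.1f}s")
```

In this toy data the mean and median stay close to the fast requests while P99 is dominated by the stragglers, which is why a scheduler tuned for the mean can still leave the tail uncontrolled.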

2606.18394 2026-06-18 cs.CL New submission 85%

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

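Affiliations: UC San Diego; Zhejiang University; UIUC; Nanjing University; StepFun

Topic match (Other LLM): proposes parallel tree drafting to accelerate LLM speculative decoding

AI summary: Proposes JetFlow, a framework that combines causal parallel draft heads with tree speculative decoding to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup, reaching up to 9.64x speedup on Qwen3 models.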

Abstract

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
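The draft-then-verify loop that the abstract builds on can be sketched for the simplest case: a single linear draft with greedy verification. This is not JetFlow's tree drafting or its draft head; `verify_draft` and `toy_target` are hypothetical names, and a real system would verify all draft positions in one parallel forward pass rather than one call per token.

```python
def verify_draft(prefix, draft, target_next):
    """Greedy verification: keep draft tokens while they match the target model's own choice.

    prefix: tokens generated so far.
    draft: tokens proposed by the drafter.
    target_next: callable returning the target model's next token for a sequence.
    """
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: emit the target's token instead and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def toy_target(sequence):
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (sequence[-1] + 1) % 10

print(verify_draft([1, 2], [3, 4, 9, 6], toy_target))  # two draft tokens accepted, then a correction
```

The output always equals what the target alone would have generated, so a longer accepted prefix translates directly into fewer target steps. That is the quantity a larger draft budget is meant to increase.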

2602.05992 2026-06-18 cs.CL Version update 85%

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

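Affiliations: Nanyang Technological University

Topic match (Other LLM): improves inference scheduling for diffusion language models

AI summary: Fixed block scheduling in diffusion language models ignores semantic difficulty. The paper proposes DSB, a training-free dynamic sliding block method, and DSB Cache, a companion KV-cache mechanism, which together consistently improve generation quality and inference efficiency.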

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

2602.23092 2026-06-18 cs.AI Version update 85%

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

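Affiliations: Southern University of Science and Technology; City University of Hong Kong

Topic match (Other LLM): uses LLMs to design heuristics automatically for solving CVRP; an LLM application

AI summary: Proposes AILS-AHD, which combines an evolutionary search framework with large language models to dynamically generate and optimize ruin heuristics and adds an acceleration mechanism. It outperforms existing solvers on moderate and large-scale CVRP instances and sets new best-known solutions for 8 of 10 instances in the CVRPLib large-scale benchmark.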

Abstract

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

2602.15851 2026-06-18 cs.CL cs.AI Version update 85%

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

David Y. Liu, Aditya Joshi, Paul Dawson

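Affiliations: School of Computer Science and Engineering; School of Arts and Media; University of New South Wales (UNSW)

Topic match (Other LLM): surveys narrative theory-driven LLM story generation and understanding

AI summary: Surveys narrative theory-driven LLM methods for automatic story generation and understanding, analyzes the current state of the field, finds that generation lags behind understanding in theoretical application, post-training methods, non-fiction narratives, and narrative levels, and proposes future directions.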

Comments 31 pages

Abstract

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: (a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

2510.15551 2026-06-18 cs.CL cs.AI cs.LG Version update 85%

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

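Affiliations: Google DeepMind

Topic match (Other LLM): studies cross-lingual gaps in LLMs; an LLM application

AI summary: Argues that the cross-lingual gap stems from the variance of responses in the target language, formalizes the gap in terms of biased and unbiased errors, and uses inference-time ensemble methods to reduce variance, yielding relative gains of 8% to over 50% in source-target transfer scores.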

Comments 30 pages

Abstract

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points yielding relative gains of 8% to over 50% across various LLMs.
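Why ensembling at test time can reduce response variance is easy to see in a toy setting. The abstract does not say which ensemble methods the authors use; majority voting over repeated samples is assumed here purely for illustration, as are the independence and two-answer assumptions behind `majority_accuracy`.

```python
from collections import Counter
from math import comb

def majority_vote(responses):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return Counter(responses).most_common(1)[0][0]

def majority_accuracy(p, n):
    """Probability that more than half of n independent samples are correct.

    Assumes n is odd, each sample is correct with probability p,
    and there are only two candidate answers.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

# Hypothetical answers sampled for one target-language query.
print(majority_vote(["Paris", "Lyon", "Paris"]))

# A single sample that is right 60% of the time versus a 5-sample vote.
print(majority_accuracy(0.6, 1), majority_accuracy(0.6, 5))
```

Under these assumptions the vote helps only when a single sample is right more often than not, which matches the abstract's distinction: ensembling removes variance-driven (unbiased) error but cannot repair a biased answer.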

2510.04120 2026-06-18 cs.CL cs.AI Version update 85%

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

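Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau

Topic match (Other LLM): analysis of LLM metaphor processing mechanisms

AI summary: Uses three methods (geometric probing, context-varying substitution, and syntactic perturbation) to analyze semantic drift, lexical stability, and syntactic sensitivity in LLM metaphor processing, showing that strong behavioral performance may stem from heterogeneous signals.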

Comments Accepted to ACL 2026

Abstract

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

2508.09191 2026-06-18 cs.LG cs.AI Version update 85%

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang

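Affiliations: State Key Laboratory of Cognitive Intelligence; University of Science and Technology of China; College of Intelligence and Computing; iFLYTEK Research

Topic match (Other LLM): proposes the TokenCast framework, which uses an LLM for time series forecasting

AI summary: Proposes TokenCast, a framework that uses a large language model and symbolic discretization to turn continuous time series into tokens aligned with contextual text, enabling context-aware forecasting; experiments support its effectiveness.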

Abstract

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, a large language model (LLM) driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To effectively bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained LLM, further optimized with generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework and highlight its potential as a generative framework for context-aware time series forecasting. The code is available at https://github.com/Xiaoyu-Tao/TokenCast.
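The "values to tokens and back" round trip described above can be sketched with the simplest possible tokenizer. The abstract does not specify TokenCast's discrete tokenizer, so uniform binning is assumed here as a stand-in, and all function names are hypothetical.

```python
def fit_bins(series, num_bins):
    """Return (lowest value, bin width) for uniform bins spanning the series."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # constant series: any positive width works
    return lo, width

def encode(series, lo, width, num_bins):
    """Map each value to a token id in [0, num_bins - 1]."""
    return [min(num_bins - 1, max(0, int((x - lo) / width))) for x in series]

def decode(tokens, lo, width):
    """Map each token id back to the centre of its bin."""
    return [lo + (t + 0.5) * width for t in tokens]

# Hypothetical numerical sequence.
series = [0.0, 2.5, 5.0, 7.5, 10.0]
lo, width = fit_bins(series, num_bins=4)
tokens = encode(series, lo, width, num_bins=4)
restored = decode(tokens, lo, width)

print(tokens)
print(restored)
```

With uniform bins, decoding an in-range value is off by at most half a bin width, so the vocabulary size trades sequence fidelity against how many distinct temporal tokens the language model has to learn.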

2506.15066 2026-06-18 cs.AR cs.MA Version update 85%

ChatModel: Automating Reference Model Design and Verification with LLMs

Jianmin Ye, Tianyang Liu, Qi Tian, Shengchu Su, Zhe Jiang, Xi Wang

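Topic match (Other LLM): uses LLMs to automate reference model design and verification, improving efficiency

AI summary: Proposes the ChatModel platform, which uses design standardization and hierarchical agile modeling to generate reference models automatically with LLMs. Evaluated on 300 designs, it achieves a peak performance improvement of 58.99% and accelerates the design and validation cycle by an average of 7.11x.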

Abstract

As the complexity of integrated circuit designs continues to escalate, functional verification becomes increasingly challenging. Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop. Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. Therefore, we introduce ChatModel, an LLM-aided agile reference model generation and verification platform. ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling. Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation. ChatModel achieved a peak performance improvement of 58.99% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs. Moreover, ChatModel accelerates the reference model design and validation cycles by an average of 7.11x over traditional manual approaches. These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation.

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE 85%

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

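Affiliations: Yokohama National University

Topic match (Other LLM): proposes the OPTS method for optimizing LLM prompt strategies

AI summary: Proposes OPTS, which improves EvoPrompt by explicitly selecting prompt design strategies; in experiments on BIG-Bench Hard, the Thompson sampling-based mechanism achieves the best overall results.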

Comments Accepted to ACL 2025 Findings

DOI: 10.18653/v1/2025.findings-acl.1070

Abstract

Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .
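Explicit strategy selection with Thompson sampling can be sketched as a Beta-Bernoulli bandit. This is a generic textbook formulation, not the OPTS implementation: the strategy names, the binary "did the prompt improve" reward, and the success rates in the simulation are all assumptions made for illustration.

```python
import random

class ThompsonSelector:
    """Beta-Bernoulli Thompson sampling over a fixed set of strategies."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # One [alpha, beta] pair per arm, starting from a uniform Beta(1, 1) prior.
        self.stats = {arm: [1.0, 1.0] for arm in arms}

    def select(self):
        """Sample a plausible success rate per arm and pick the largest."""
        draws = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, arm, improved):
        """Record whether applying the strategy improved the prompt."""
        self.stats[arm][0 if improved else 1] += 1.0

# Hypothetical strategies and hidden success rates, used only to simulate feedback.
true_rates = {"few_shot": 0.2, "chain_of_thought": 0.6, "role_prompt": 0.3}
selector = ThompsonSelector(list(true_rates), seed=0)
environment = random.Random(1)

for _ in range(200):
    arm = selector.select()
    selector.update(arm, environment.random() < true_rates[arm])

print(selector.stats)
```

Sampling from the posterior rather than always taking the current best estimate keeps some exploration alive, which matters when a strategy that looks unhelpful early on turns out to help later in the optimization.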

2412.15557 2026-06-18 cs.SE cs.CL Version update 85%

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

Aaron Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen

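Affiliations: Faculty of Information Technology, Monash University; School of Computing Technologies, RMIT University; School of Science, Computing and Emerging Technologies, Swinburne University of Technology

Topic match (Other LLM): multi-turn testing method for LLM dialogue systems

AI summary: Proposes MORTAR, which uses multi-turn metamorphic relations to generate test cases automatically, mitigating the oracle problem in multi-turn testing of LLM-based dialogue systems; compared with single-turn testing, it reveals more and higher-quality bugs per test case.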

Comments Accepted for publication in IEEE Transactions on Software Engineering (TSE)

DOI: 10.1109/TSE.2026.3701230

Abstract

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.
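How a metamorphic relation sidesteps the oracle problem can be shown with a toy example. The perturbation used here (inserting an irrelevant turn) and both dialogue systems are hypothetical; MORTAR's actual perturbations and relations are not listed in the abstract.

```python
def toy_dialogue_system(turns):
    """Hypothetical system under test: answers the last question from facts stated earlier."""
    facts = {}
    for turn in turns[:-1]:
        if " is " in turn:
            key, value = turn.split(" is ", 1)
            facts[key.strip().lower()] = value.strip().rstrip(".")
    question = turns[-1].lower()
    for key, value in facts.items():
        if key in question:
            return value
    return "unknown"

def forgetful_system(turns):
    """Hypothetical buggy variant: only reads the turn right before the question."""
    return toy_dialogue_system(turns[-2:])

def insert_irrelevant_turn(turns, filler="The weather is sunny."):
    """Dialogue-level perturbation: add a turn that should not change the answer."""
    return turns[:-1] + [filler] + turns[-1:]

def relation_holds(system, turns, perturb):
    """Metamorphic relation: the answer must be the same before and after the perturbation."""
    return system(turns) == system(perturb(turns))

dialogue = ["The capital of France is Paris.", "What is the capital of France?"]
print(relation_holds(toy_dialogue_system, dialogue, insert_irrelevant_turn))
print(relation_holds(forgetful_system, dialogue, insert_irrelevant_turn))
```

The check never needs the correct answer, only consistency between two runs, so the bug in the forgetful system is exposed without a labelled oracle or an LLM judge.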

2506.09822 2026-06-18 cs.CE cs.AI 85%

Superstudent intelligence in thermodynamics

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

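Affiliations: Laboratory of Engineering Thermodynamics (LTD); Visual Information Analysis Research Group (VIA); Machine Learning Research Group (ML)

Topic match (Other LLM): evaluates the o3 model on a thermodynamics exam

AI summary: Shows that OpenAI's o3 model outperformed all students on a university thermodynamics exam, demonstrating machine capability on complex tasks, with implications for engineering education and practice.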

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

DOI: 10.1088/1361-6552/ae35bf

Abstract

In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.

2504.12347 2026-06-18 cs.CL cs.AI cs.CY 85%

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

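Affiliations: Faculty of Information Technology; University of Jyväskylä; Faculty of Humanities and Social Sciences

Topic match (Other LLM): evaluates LLM ability on upper secondary mathematics exams

AI summary: Evaluates the mathematical ability of different large language models on the Finnish matriculation examination, finding that performance improves markedly as models evolve, with some models approaching perfect scores. This shows the rapid progress of LLMs in mathematics and their potential in education.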

2606.19256 2026-06-18 cs.AI New submission 80%

X+Slides: Benchmarking Audience-Conditioned Slide Generation

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

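Affiliations: Shanghai Jiao Tong University; Harbin Institute of Technology; SenseTime

Topic match (Other LLM): benchmark for LLM slide generation

AI summary: Proposes the X+Slides benchmark, which uses a dynamic evaluation framework and audience-specific weights to measure slide generation systems on audience coverage, domain-wise coverage, efficiency, and correctness, revealing that existing systems recover audience-essential information only incompletely.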

Abstract

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A = 0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.
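The idea of scoring the same probes with audience-specific utility weights can be sketched as a weighted coverage ratio. The exact X+Slides formula, including the role of the threshold tau_A, is not given in the abstract; the ratio below and all probe names and weights are assumptions made for illustration.

```python
def audience_coverage(probe_weights, covered):
    """Share of an audience's total utility carried by the probes a deck covers.

    probe_weights: mapping from probe id to that audience's utility weight.
    covered: set of probe ids the slide deck answers.
    """
    total = sum(probe_weights.values())
    if total == 0:
        return 0.0
    return sum(w for probe, w in probe_weights.items() if probe in covered) / total

# Hypothetical weights: the same three probes, valued differently by two audiences.
specialist = {"proof": 0.9, "result": 0.5, "cost": 0.1}
decision_maker = {"proof": 0.1, "result": 0.5, "cost": 0.9}

# Hypothetical deck that covers the proof and the result but omits cost.
deck = {"proof", "result"}

print(round(audience_coverage(specialist, deck), 3))
print(round(audience_coverage(decision_maker, deck), 3))
```

The same deck scores high for one audience and low for the other, which is the effect an audience-agnostic completeness metric cannot capture.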

2606.18946 2026-06-18 cs.CL New submission 80%

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

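Affiliations: Northwestern Polytechnical University; Zhejiang Lab

Topic match (Other LLM): detects LLM-generated text by modeling inter-sentence dependencies

AI summary: For sentence-level AI-generated text detection in human-AI hybrid documents, proposes SenFlow, which models inter-sentence dependencies through graph propagation and CRF decoding and improves average cross-domain Macro-F1 on the MOSAIC benchmark by 4.15 percentage points.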

Comments 16 pages, 4 figures, 9 tables

English abstract

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
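
The abstract contrasts per-sentence classification with linear-chain CRF decoding over the whole sentence sequence. A minimal sketch of that decoding step is standard Viterbi search, shown below. The emission and transition scores are made-up numbers for illustration, not values from SenFlow, and the graph-propagation stage that would produce the emissions is not modeled here.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for one document.

    emissions[t][y]   : score of label y for sentence t (illustrative input)
    transitions[a][b] : score of moving from label a to label b
    """
    n_labels = len(transitions)
    # best[y] = best score of any path that ends in label y at this sentence
    best = list(emissions[0])
    backptr: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, ptr = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda a: best[a] + transitions[a][y])
            ptr.append(prev)
            new_best.append(best[prev] + transitions[prev][y] + emissions[t][y])
        best = new_best
        backptr.append(ptr)
    # Trace back from the best final label.
    y = max(range(n_labels), key=lambda a: best[a])
    path = [y]
    for ptr in reversed(backptr):
        y = ptr[y]
        path.append(y)
    return path[::-1]


# Toy document with three sentences; label 0 = human, 1 = AI (assumed encoding).
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]  # staying in the same label is rewarded
isolated = [max(range(2), key=lambda y: e[y]) for e in emissions]
joint = viterbi_decode(emissions, transitions)
```

In this toy case the isolated classifier flips the weakly scored middle sentence to the AI label, while joint decoding keeps all three sentences as human because the transition scores penalize a one-sentence switch.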

2606.18922 2026-06-18 cs.CL cs.AI New submission 80%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

Affiliations: Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

Topic match (Other LLM): evaluating LLM understanding of negation and figurative language

AI summary: Develops a new annotated dataset to test how well several large language models understand negation in figurative language, and finds that the combination of negation and figurative language challenges the models and that performance depends heavily on prompt style.

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2606.18797 2026-06-18 cs.CL New submission 80%

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

Affiliations: Nanyang Technological University; Technical University of Munich; Alibaba; University of Glasgow; University of Massachusetts Boston

Topic match (Other LLM): LLM-based evaluation metrics for radiology reports

AI summary: Motivated by the clinical-accuracy requirements of radiology report evaluation, studies how well LLM-based metrics separate clinical errors from harmless variation, identifies a discrimination bias, and trains lightweight metrics on synthetic data that outperform larger models in cost-sensitive deployment.

Comments Under Review

English abstract

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance in terms of detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where D-R balance is critical. We will release the dataset and metric.

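2606.18741 2026-06-18 cs.DC New submission 80%

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

Haipeng Yuan, Kaining Zheng, Yongshu Bai, Yuchen Zhang, Yunquan Zhang, Baodong Wu, Xiang Gao, Daning Cheng

Topic match (Other LLM): low-downtime model-parallelism reconfiguration for LLM inference serving

AI summary: Proposes the ReMP framework, which uses techniques such as decoupling topology from runtime state and two-dimensional KV cache migration to adjust the model-parallel topology of an LLM inference service online, cutting reconfiguration downtime from minutes to 1-7 seconds.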

English abstract

Current large language model (LLM) inference systems universally deploy ultra-large-scale models using a combination of Tensor Parallelism (TP) and Pipeline Parallelism (PP). However, existing systems treat the model parallelism topology as a static configuration that cannot be flexibly adjusted at runtime. This rigid design creates a fundamental contradiction with the dynamically changing inference workloads in real-world scenarios. State-of-the-art systems lack online reconfiguration capabilities and can only switch configurations by restarting the service, resulting in several minutes of service interruption, KV cache loss, and prohibitive recomputation overhead. To address this problem, this paper presents ReMP, a runtime model parallelism reconfiguration framework that supports low downtime. ReMP achieves dynamic adjustment through three key techniques: (1) decoupling the model parallelism topology from runtime state to avoid full service reconstruction; (2) designing a two-dimensional KV cache migration mechanism to preserve reusable cache states after TP/PP changes; and (3) implementing end-to-end online reconfiguration. Experiments demonstrate that ReMP can complete most topology switches within 1-7 seconds on models ranging from 7B to 70B parameters, achieving speedups of tens to over a hundred times compared to the restart approach. Moreover, ReMP significantly outperforms fixed configurations under dynamic workloads, delivering superior performance in terms of TTFT, TPOT, and output throughput.

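2606.18677 2026-06-18 cs.LG cs.AI New submission 80%

Bounded Context Management for Tabular Foundation Models on Stream Learning

Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo

Affiliations: Seoul National University; KAIST

Topic match (Other LLM): context management for tabular foundation models in stream learning

AI summary: To address distribution drift in tabular stream learning, proposes the context management policy CURE, which manages the context through uncertainty-gated admission and redundancy-aware eviction and achieves up to 27.0% relative improvement across seven streams.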

Comments Accepted as a spotlight oral (top 5%) at the 2nd ICML Workshop on Foundation Models for Structured Data (FMSD@ICML2026)

English abstract

Tabular stream learning requires predictions on sequentially arriving examples under distribution shift. While standard methods adapt by updating model states, tabular foundation models (TFMs) make predictions conditioned on a labeled context in an in-context manner, making them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future information view that yields three practical requirements for context management: preserve recent examples, retain uncertain examples, and remove redundant examples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy aware Eviction), a context-managing policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE shows up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among other policy variants. Code and datasets are available at https://github.com/morcellinus/CURE-ICML-FMSD.
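
The three requirements named in the abstract (keep recent examples, retain uncertain ones, remove redundant ones) can be illustrated with a small bounded buffer. This is only a sketch under assumptions: the class name `BoundedContext`, the entropy threshold, and the squared-distance redundancy measure are hypothetical choices made here, not details taken from CURE.

```python
import math
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, used here as a stand-in redundancy measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class BoundedContext:
    """Hypothetical bounded context buffer; not the authors' implementation."""

    def __init__(self, capacity: int, min_entropy: float) -> None:
        self.capacity = capacity
        self.min_entropy = min_entropy
        self.items: List[Tuple[List[float], int]] = []  # oldest first

    def offer(self, features: Sequence[float], label: int,
              predicted_probs: Sequence[float]) -> bool:
        # Admission gate: skip examples the model was already confident about.
        if entropy(predicted_probs) < self.min_entropy:
            return False
        self.items.append((list(features), label))
        if len(self.items) > self.capacity:
            self._evict_most_redundant()
        return True

    def _evict_most_redundant(self) -> None:
        # Find the closest pair and drop its older member, so the more
        # recent of two near-duplicates survives.
        best = None
        for i in range(len(self.items)):
            for j in range(i + 1, len(self.items)):
                d = sq_dist(self.items[i][0], self.items[j][0])
                if best is None or d < best[0]:
                    best = (d, i)
        if best is not None:
            self.items.pop(best[1])


ctx = BoundedContext(capacity=2, min_entropy=0.5)
admitted = [
    ctx.offer([0.0, 0.0], 0, [0.5, 0.5]),    # uncertain prediction: admitted
    ctx.offer([5.0, 5.0], 1, [0.99, 0.01]),  # confident prediction: skipped
    ctx.offer([0.1, 0.0], 0, [0.6, 0.4]),    # uncertain: admitted
    ctx.offer([9.0, 9.0], 1, [0.5, 0.5]),    # over capacity: triggers eviction
]
```

The pairwise search is quadratic in the buffer size, which is acceptable for a small context but would need an index or approximate neighbor search for larger ones.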

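2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission 80%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

Affiliations: University of Colorado Boulder

Topic match (Other LLM): evaluating defeasible abductive reasoning in foundation models

AI summary: Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate the creativity and theoretical reasoning of foundation models in defeasible reasoning, and finds that frontier models are far less accurate than a symbolic solver.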

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

English abstract

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

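2606.18383 2026-06-18 cs.LG cs.CL New submission 80%

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Dibyanayan Bandyopadhyay, Asif Ekbal

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Patna

Topic match (Other LLM): certifying SAE-based interpretability of language models

AI summary: Proposes a post-hoc generalization framework that certifies a language model via a sparse proxy (SAE reconstruction), derives an upper bound on expected risk, verifies non-vacuous bounds on models including GPT-2 Small, and shows that deeper layers are easier to certify and that the decomposition distinguishes semantic alignment from statistical sparsity.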

English abstract

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

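2606.18042 2026-06-18 cs.DC New submission 80%

Latency Prediction for LLM Inference on NPU Systems

Juhyun Park, Seungwoo Jeong, Jingyu Lee, Kyungyong Lee

Topic match (Other LLM): predicting LLM inference latency on NPUs

AI summary: Predicting LLM inference latency on NPUs faces undisclosed microarchitectures, unpredictable compiler optimizations, and bucketing-induced non-linear latency; proposes the LENS latency estimator, which composes two end-to-end measurements per bucket to predict latency for arbitrary input-output length combinations, with a mean prediction error of 2.15%.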

Comments 12 pages, 9 figures

English abstract

Deploying Large Language Models (LLMs) requires exploring a large configuration space spanning parallelization strategies, batching techniques, and scheduling policies. Exhaustive measurement across this space is impractical, making latency prediction essential for system optimization. While NPUs have emerged as accelerators designed for LLM inference, no prediction methodology has been established for them. Specifically, applying prior work to LLM inference latency prediction on NPUs faces three challenges: undisclosed microarchitecture of commercial NPUs, unpredictable compiler optimizations, and latency non-linearity induced by bucketing. We present LENS, a latency estimator that predicts NPU inference latency without information on the microarchitecture or compiler, and captures the non-linear latency induced by bucketing. LENS profiles each bucket with two end-to-end (E2E) measurements and composes the results to predict latency for arbitrary input-output length combinations. We validate LENS across NPUs from multiple vendors, several LLMs, and diverse workloads, achieving a mean prediction error of 2.15%. We further compare LENS against two methodologically related baselines, confirming the validity of its approach.
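
One way to read "two end-to-end measurements per bucket, composed into a prediction" is sketched below. It rests on assumptions that the abstract does not state: each bucket is treated as having a fixed prompt-phase cost and a fixed per-token decode cost, both recovered from two runs with different output lengths, and a request pays the per-token cost of whichever bucket its current context length falls into. The names `BucketProfile`, `fit_bucket`, and `predict` are hypothetical, and the timings are invented.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class BucketProfile:
    prefill_s: float    # estimated prompt-phase cost inside this bucket
    per_token_s: float  # estimated cost of one decode step inside this bucket


def fit_bucket(m1: Tuple[int, float], m2: Tuple[int, float]) -> BucketProfile:
    """Fit latency = prefill + per_token * n_out from two (n_out, seconds) runs."""
    (n1, t1), (n2, t2) = m1, m2
    per_token = (t2 - t1) / (n2 - n1)
    return BucketProfile(prefill_s=t1 - per_token * n1, per_token_s=per_token)


def bucket_of(length: int, bounds: List[int]) -> int:
    """Smallest bucket upper bound that can hold `length` tokens."""
    i = bisect_left(bounds, length)
    if i == len(bounds):
        raise ValueError(f"length {length} exceeds the largest bucket")
    return bounds[i]


def predict(in_len: int, out_len: int, bounds: List[int],
            profiles: Dict[int, BucketProfile]) -> float:
    """Compose per-bucket costs as the context grows during decoding."""
    total = profiles[bucket_of(in_len, bounds)].prefill_s
    for step in range(out_len):
        # Context length before producing this token decides the bucket.
        total += profiles[bucket_of(in_len + step, bounds)].per_token_s
    return total


bounds = [128, 256]  # assumed bucket upper bounds, sorted ascending
profiles = {
    128: fit_bucket((1, 0.11), (11, 0.21)),  # invented timings in seconds
    256: fit_bucket((1, 0.22), (11, 0.42)),
}
latency = predict(in_len=126, out_len=4, bounds=bounds, profiles=profiles)
```

With these invented numbers, the request spends three decode steps in the 128 bucket and one in the 256 bucket, so the per-token cost steps up when the context crosses a bucket boundary; this is the kind of non-linearity a single linear fit over all lengths would miss.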

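2606.12629 2026-06-18 cs.LG cs.AI New submission 80%

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Varun Reddy Nalagatla

Affiliations: Amazon Web Services

Topic match (Other LLM): a training-free mechanistic interpretability method for transformers

AI summary: Proposes the Bag of Dims framework, shows that the standard basis of transformer hidden states can serve as a training-free feature basis that encodes semantics through per-dimension sign patterns, and validates it on three models.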

Comments 22 pages, 5 figures, 27 tables

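AI abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, acting as independent binary registers. We validate this Bag of Dims framework on three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with 1 yields 72-93% top-5 next-token accuracy through the LM head, while pure Hamming scoring without any decoder reaches 80-90% top-4096 accuracy. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories from 50 anchors via per-dimension sign consistency (mean AUC 0.80), without any training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in the K and V projections. On the write side, static FFN weight inspection links 20% of features to a single writing neuron (consistency > 0.70; random control: 0%), and by majority vote, coalitions of the top-200 neurons reach consistency > 0.70 on 99.9% of prototypes. Fully unsupervised discovery (random seeds, no labels) scales to 1,500 features on all three models, with 100% yield, 99% sparsity, and pairwise mutual information of 0.0014 bits, confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading throughout the transformer computation path, with no training, no optimization, only one forward pass per vocabulary token, and no GPU-days.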

English abstract

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
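
The abstract describes a feature as a subset of dimensions with a consistent sign pattern, read by counting sign agreements. A minimal sketch of that reading rule follows, assuming plain Python lists in place of real hidden states. The function names, the 0.9 agreement threshold, and the toy vectors are illustrative choices and do not come from the paper.

```python
from typing import Dict, Sequence


def sign_pattern(anchors: Sequence[Sequence[float]],
                 min_agreement: float = 0.9) -> Dict[int, int]:
    """Map each sign-consistent dimension to its dominant sign (+1 or -1)."""
    n = len(anchors)
    pattern: Dict[int, int] = {}
    for d in range(len(anchors[0])):
        frac_pos = sum(1 for v in anchors if v[d] > 0) / n
        if frac_pos >= min_agreement:
            pattern[d] = 1
        elif 1.0 - frac_pos >= min_agreement:
            pattern[d] = -1
    return pattern


def agreement(hidden: Sequence[float], pattern: Dict[int, int]) -> float:
    """Fraction of the feature's dimensions whose sign matches the pattern."""
    if not pattern:
        raise ValueError("pattern has no dimensions")
    hits = sum(1 for d, s in pattern.items() if (hidden[d] > 0) == (s > 0))
    return hits / len(pattern)


# Three toy "anchor" vectors for one category; dimension 2 is inconsistent.
anchors = [[1.0, -2.0, 0.3], [0.5, -0.1, -0.4], [2.0, -3.0, 0.2]]
pattern = sign_pattern(anchors)
match_score = agreement([0.2, -0.9, 5.0], pattern)
mismatch_score = agreement([-0.2, -0.9, 5.0], pattern)
```

Only signs enter the score, so magnitudes such as the 5.0 in the last coordinate have no effect; that mirrors the abstract's claim that signs carry the semantic content while magnitudes carry confidence.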

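2606.08532 2026-06-18 cs.AI New submission 80%

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

Affiliations: Computer Network Information Center, Chinese Academy of Sciences, China

Topic match (Other LLM): LLM-driven hypothesis generation workflow

AI summary: Proposes DN-Hypo-Pipeline, which uses large language models with scientific explanations as prior knowledge to derive new hypotheses from existing literature; in data science modeling, statistical inference and expert evaluation show that it outperforms direct generation, and the performance of algorithms built from the generated hypotheses is validated.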

English abstract

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

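2602.06470 2026-06-18 cs.CL cs.AI Version update 80%

Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

Topic match (Other LLM): proposes the UNO framework for improving LLM systems with user logs

AI summary: Proposes the UNO framework, which distills user logs into rules and preference pairs, handles data heterogeneity with query-and-feedback-driven clustering, and quantifies the cognitive gap between model knowledge and log data to improve LLM system performance.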

English abstract

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .

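2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph Version update 80%

LLM Compression by Block Removal with Constrained Binary Optimization

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

Affiliations: Multiverse Computing; Donostia International Physics Center; Ikerbasque Foundation for Science

Topic match (Other LLM): proposes an LLM compression method based on optimized block removal

AI summary: Formulates block-removal compression of large language models as constrained binary optimization mapped to an Ising glass system, enabling efficient ranking and high-quality removal of non-consecutive blocks; at 50% compression it gains nearly 23 percentage points on MMLU, and it is computationally efficient and broadly applicable.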

Comments 16 pages, 3 figures

English abstract

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations yielding many high-quality, non-trivial solutions beyond those only removing consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is unfeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.
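
To make the constrained binary optimization concrete, the sketch below scores every way of removing exactly k blocks with a quadratic energy and ranks the configurations. The coefficients `h` and `J` are invented for illustration; how the paper derives them from calibration data is not described in the abstract, and exhaustive enumeration is a stand-in for an exact or heuristic solver. A 0/1 formulation like this one is equivalent to an Ising form under the substitution s_i = 2x_i - 1.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def energy(removed: Sequence[int], h: Sequence[float],
           J: Dict[Tuple[int, int], float]) -> float:
    """E(x) = sum_i h_i x_i + sum_{i<j} J_ij x_i x_j, with x_i = 1 if block i is removed."""
    e = sum(h[i] for i in removed)
    e += sum(J.get((i, j), 0.0) for i, j in combinations(sorted(removed), 2))
    return e


def rank_removals(n_blocks: int, k: int, h: Sequence[float],
                  J: Dict[Tuple[int, int], float],
                  top: int = 3) -> List[Tuple[int, ...]]:
    """Enumerate all configurations that remove exactly k blocks, lowest energy first."""
    configs = combinations(range(n_blocks), k)
    return sorted(configs, key=lambda c: energy(c, h, J))[:top]


# Invented coefficients: removing block 0 or 4 alone is costly, and
# removing adjacent middle blocks together carries a pairwise penalty.
h = [5.0, 1.0, 1.2, 1.1, 4.0]
J = {(1, 2): 3.0, (1, 3): 0.1, (2, 3): 2.5}
ranking = rank_removals(n_blocks=5, k=2, h=h, J=J)
```

With these numbers the lowest-energy choice removes blocks 1 and 3 rather than a consecutive pair, which illustrates how pairwise terms can favor the non-consecutive solutions the abstract mentions. Enumeration grows combinatorially, so it is only practical for small n and k.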

2601.14968 2026-06-18 cs.LG cs.AI Version update 80%

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

Topic match (Other LLM): recasting time series classification as a multimodal generative task

AI summary: Recasts time series classification as a multimodal generative task, bridges the modality gap with a discretization module and an alignment projection layer, and uses implicit feature modeling to improve language model performance.

English abstract

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.

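2508.07375 2026-06-18 cs.CL cs.SD eess.AS Version update 80%

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

Affiliations: The Chinese University of Hong Kong; Huawei Technologies

Topic match (Other LLM): interleaved text-speech generation in full-duplex speech language models

AI summary: Proposes TurnGuide, which dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation, addressing the time-alignment problem caused by integrating discrete text tokens into continuous double-channel audio in full-duplex speech language models, and substantially improving semantic coherence and turn-taking performance.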

Comments Interspeech 2026 Long Paper Track

English abstract

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

2512.04144 2026-06-18 cs.AI version update 80%

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

Affiliations * Harvard University; Imperial College London; Northeastern University

Topic match Other LLM: evaluating the ripple effects of language-model unlearning

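AI summary Proposes RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors from a knowledge repository and generates multiple-choice questions, then evaluates the ripple effects of eight unlearning methods on Llama3-8B-Instruct; accuracy drops decay with semantic distance and are consistent across models.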

English abstract

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often propagate to related, unintended areas (e.g., removing virology content may degrade performance on allergies); these side-effects are commonly referred to as the ripple effect. We introduce RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at varying semantic distances. We instantiate this framework using WikiRAG, an open-source RAG system over English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. All eight exhibit accuracy drops that are largest near the unlearned target and decay with semantic distance, each with a distinct propagation profile. We replicate these findings across Mistral-7B, Zephyr-7B, and Yi-34B; cross-model delta curves are nearly identical, suggesting ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages using a four-experiment Mechanical Turk study (5,200+ responses, 61 workers). We release all code, data, and infrastructure.
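The "delta curve" idea in the abstract can be illustrated with a minimal sketch. It assumes each evaluated question is reduced to a pair of (semantic distance bucket, whether the model answered correctly); the function names, the integer distance buckets, and the toy records are all hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


def accuracy_by_distance(records):
    """Return {distance: accuracy} from (distance, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for distance, is_correct in records:
        totals[distance] += 1
        correct[distance] += int(is_correct)
    return {d: correct[d] / totals[d] for d in sorted(totals)}


def delta_curve(base_records, unlearned_records):
    """Accuracy change (unlearned minus base) at each distance present in both."""
    base = accuracy_by_distance(base_records)
    unlearned = accuracy_by_distance(unlearned_records)
    return {d: unlearned[d] - base[d] for d in base if d in unlearned}


# Toy, made-up records for illustration only (not results from the paper).
# Distance 0 is the unlearned target; larger values are semantically farther.
base_records = [(0, True), (0, True), (1, True), (1, True), (2, True), (2, False)]
unlearned_records = [(0, False), (0, False), (1, True), (1, False), (2, True), (2, False)]

curve = delta_curve(base_records, unlearned_records)
for distance, delta in curve.items():
    print(f"distance={distance} accuracy_delta={delta:+.2f}")
```

On this toy input the drop is largest at distance 0 and shrinks to zero at distance 2, which is the qualitative shape the abstract describes. Comparing such curves across base models would be one way to check whether the profile follows the unlearning method rather than the model.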

2510.09905 2026-06-18 cs.AI cs.CL version update 80%

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

Affiliations * Amazon

Topic match Other LLM: study of bias in LLM emotional reasoning

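AI summary Studies how user memory induces systematic bias in LLM emotional reasoning; finds that high-performing models interpret emotions more accurately for users with advantaged backgrounds, and that personalization mechanisms may embed social hierarchies.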

Comments 19 pages 5 figures

English abstract

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion reasoning and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models' emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may reinforce social inequalities. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce demographic profiles' influence on emotional understanding.
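One simple way to quantify the kind of disparity the abstract describes is to score the same scenarios under each user profile and compare per-profile accuracy. The sketch below assumes results are available as (profile, scenario id, correct or not) triples; the profile labels, function names, and toy data are hypothetical and do not reflect the paper's actual metric or findings.

```python
from collections import defaultdict


def accuracy_by_profile(results):
    """Return {profile: accuracy} from (profile, scenario_id, is_correct) triples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for profile, _scenario_id, is_correct in results:
        totals[profile] += 1
        correct[profile] += int(is_correct)
    return {p: correct[p] / totals[p] for p in totals}


def max_disparity(results):
    """Gap between the best-served and worst-served profile."""
    acc = accuracy_by_profile(results)
    return max(acc.values()) - min(acc.values())


# Toy, made-up results: the same four scenarios evaluated under two profiles.
toy_results = [
    ("profile_a", "s1", True), ("profile_a", "s2", True),
    ("profile_a", "s3", True), ("profile_a", "s4", False),
    ("profile_b", "s1", True), ("profile_b", "s2", False),
    ("profile_b", "s3", True), ("profile_b", "s4", False),
]

profile_accuracy = accuracy_by_profile(toy_results)
gap = max_disparity(toy_results)
print(profile_accuracy, gap)
```

Because the scenarios are identical across profiles, any nonzero gap in this setup comes from the profile attached to the prompt rather than from the scenario content.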
