2605.11744
2026-05-13
cs.CL
cs.LG
Training-Inference Consistent Segmented Execution for Long-Context LLMs
Xianpeng Shang, Jiang Li, Zehua Duo, Qianyi Cai, Xiangdong Su
Affiliations
College of Computer Science, Inner Mongolia University, Hohhot 010021, China;
National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Hohhot 010021, China;
Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, Hohhot 010021, China;
Thrust of Artificial Intelligence, The Hong Kong University of Science
AI Summary
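This paper addresses the compute and memory bottlenecks that Transformer-based large language models face in long-context generation, and proposes a segmented execution framework that is consistent between training and inference. During training, the method simulates the segmented execution semantics of inference and restricts gradient propagation to the KV states of the preceding segment only, which keeps training and inference consistent. Experiments show that the method performs close to full-context attention on long-context tasks while outperforming existing efficient-inference methods in latency and memory consumption, significantly improving scalability in ultra-long-context settings.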
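The gradient restriction described in the summary can be pictured as simple bookkeeping over token spans. The sketch below is not the authors' implementation: `plan_segments`, `SegmentStep`, and the fixed segment length are hypothetical names and assumptions introduced here for illustration. It assumes the sequence is cut into equal-length segments and that, for each segment, only the KV states of the immediately preceding segment remain on the gradient path while all older KV states are treated as constants. In an autodiff framework the frozen span would roughly correspond to a stop-gradient on the cached tensors.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Half-open token range [start, end). All names here are hypothetical.
Span = Tuple[int, int]


@dataclass
class SegmentStep:
    index: int       # position of the segment in the sequence
    tokens: Span     # tokens processed in this step
    grad_kv: Span    # earlier KV states that still receive gradients
    frozen_kv: Span  # earlier KV states treated as constants (assumption)


def plan_segments(seq_len: int, seg_len: int) -> List[SegmentStep]:
    """Split a sequence into segments and record which earlier KV spans
    stay on the gradient path (previous segment only, by assumption)."""
    if seq_len <= 0 or seg_len <= 0:
        raise ValueError("seq_len and seg_len must be positive")
    steps: List[SegmentStep] = []
    for index, start in enumerate(range(0, seq_len, seg_len)):
        end = min(start + seg_len, seq_len)
        # Start of the immediately preceding segment (clamped at 0).
        prev_start = max(start - seg_len, 0)
        steps.append(
            SegmentStep(
                index=index,
                tokens=(start, end),
                grad_kv=(prev_start, start),  # empty for the first segment
                frozen_kv=(0, prev_start),    # everything older is frozen
            )
        )
    return steps


if __name__ == "__main__":
    for step in plan_segments(seq_len=10, seg_len=4):
        print(step)
```

Because the same plan can drive both the training loop and the inference loop, the spans a segment sees are identical in the two settings, which is the consistency property the summary emphasizes. How the actual method chooses segment boundaries or what the cache contains is not stated in the summary.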