arXivDaily arXiv每日学术速递 周一至周五更新
2606.10752 2026-06-10 cs.AI 新提交

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

AutoPDE: 通过显式表示的求解器策略实现可靠的智能体PDE求解

Huanshuo Dong, Keyao Zhang, Hong Wang, Zhezheng Hao, Zhiwei Zhuang, Ziyan Liu, Jiacong Wang, Gengyuan Liu, Xin Jin

发表机构 * University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) University of the Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 提出AutoPDE,一种将求解器策略作为显式对象维护的代码智能体,通过PDE分析、数值方法选择和自适应调优三阶段构建策略,在PDE Agent Bench上达到54.5%的通过率,比最强基线提升14.2个百分点。

AI中文摘要

偏微分方程(PDE)的数值求解器是科学和工程中的核心计算工具。构建可靠的PDE求解器不仅需要可执行的代码,还需要一个数值求解器策略——一组关于离散化、稳定化、求解器配置和分辨率控制的决策,这些决策需与PDE结构相匹配。最近基于LLM的编码智能体通过生成和调试求解器实现,开始减轻编程负担。然而,它们通常直接从PDE问题跳到求解器代码,将求解器策略隐含在实现细节中。因此,求解失败的反馈被路由回代码编辑,而不是底层策略,导致数值决策在代码生成前难以检查,且在失败时难以利用数值证据进行修改。为解决这一局限,我们提出AutoPDE,一种在整个求解过程中将求解器策略作为显式表示对象维护的代码智能体:一个独立的、可检查的对象,在编写任何代码之前构建,并在求解失败时可根据数值证据进行修订。AutoPDE通过三个阶段构建和维护该对象,所有阶段均利用可重用的PDE求解技能库:PDE分析识别方程类型和代数结构;数值方法选择选择与分析结果匹配的数值方法,并确定离散化、稳定化和线性求解器;自适应调优运行低成本试算以在规定的精度和运行时间预算下校准分辨率和容差。我们在PDE Agent Bench上评估AutoPDE,实验结果表明,AutoPDE的通过率达到54.5%,比最强基线提高了14.2个百分点。

英文摘要

Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of 54.5%, improving over the strongest baseline by 14.2 percentage points.

2606.10747 2026-06-10 cs.AI 新提交

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

仲裁者代理:持续监控多智能体对话以检测涌现性失调

Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学) University of Turin(都灵大学)

AI总结 提出仲裁者代理,在有限检查预算下实时监控多智能体对话,通过主动检查工具检测失调行为,实验表明能可靠提前发现失调,并分析不同失调类型的检测难度。

Comments AITC 2026

AI中文摘要

随着由多个语言模型代理构建的AI系统变得越来越普遍,它们被越来越多地用于共同决策:讨论、协商并执行共享任务。尽管单个代理在单独测试时可能表现良好,但它们之间的交互方式可能会引发问题。我们引入了仲裁者,一个旨在实时监控多智能体对话并识别哪些参与者可能表现出失调行为的代理。仲裁者在有限的“检查预算”下运行,这意味着它必须谨慎决定如何使用其资源。当它逐步观察对话时,可以选择等待、询问参与者、检查系统提示或推理轨迹等内部信息,或记录可疑行为。最后,它生成一份报告,识别失调的可能来源。我们在五种对话条件下评估仲裁者,范围从风险金融建议模型生物到评估感知和共谋代理,测试了五种能力递增的工具配置和两种骨干模型。我们发现仲裁者能在对话结束前可靠地检测到失调代理,主动检查工具提高了检测准确性和速度。权重引起的失调最难检测,而指令引起的失调即使在被动观察下也能可靠识别。记录工具表现出双重效果,以精度为代价提高了召回率。这些结果表明,持续的、预算感知的监控可以有效捕捉失调,并且监督多智能体系统可能需要将审计者视为过程中的积极参与者。代码可在以下网址获取:https://github.com/aisilab/arbiter。

英文摘要

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, and test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.

2606.10746 2026-06-10 cs.RO 新提交

ros2probe: Non-intrusive, Kernel-selective Observability for Robot Operating System 2 Middleware

ros2probe: 面向机器人操作系统2中间件的非侵入式、内核选择性可观测性

Jisang Yu, Sanghoon Lee, Yeonwoo Choi, Kyung-Joon Park

发表机构 * DGIST(大邱庆北科学技术院)

AI总结 针对ROS 2观测工具因加入DDS域而产生的探针效应(膨胀发现平面、增加反序列化开销、导致丢包偏差),提出ros2probe,通过被动捕获发现包重构通信状态,并利用内核过滤仅提取用户指定主题的包,消除探针效应,保持发现图误差在0.5%以内,无丢包,CPU和内存开销降低最高28倍。

Comments 13 pages, 8 figures, 7 tables

AI中文摘要

机器人操作系统2(ROS 2)是机器人的事实标准中间件框架,它将每个机器人作为节点图运行,节点通过数据分发服务(DDS)——一种发布/订阅底层——进行通信。实时观察这种节点间通信对机器人开发至关重要,但需要付出代价。工具只能通过作为订阅者加入DDS域来接收数据,而发现过程会将其与发布者匹配,因此观察将工具折叠到其所测量的系统中并扰动该系统。我们将这种协议固有的扰动定义为观察者的探针效应。它会膨胀发现平面,增加观察者的反序列化成本,使其报告的丢包与订阅者实际接收的丢包偏离,并在接近饱和时取代订阅者的消息。唯一的逃避方法是被动捕获所有线路流量,但这会丢弃ROS 2消息语义,并且其规模与总流量成正比,而非被观察的流量。我们提出ros2probe,一种非侵入式观察框架,消除了探针效应。它从域中的发现数据包重构完整的ROS 2通信状态,且无带宽成本,然后驱动一个内核级过滤器,仅限用户请求的主题,以最小成本提取这些数据包,并观察真实订阅者接收的内容。其接口和记录与标准ROS 2工具匹配。在三个硬件平台(笔记本电脑、Jetson和树莓派)、两种DDS实现和七种机器人操作工作负载上,ros2probe将发现图保持在未观察系统的0.5%以内,而加入域的工具将发现膨胀高达2.6倍,并在饱和时丢弃订阅者38.5%的消息,而ros2probe无丢包。其丢包报告召回率为1.0,将观察者的CPU和内存开销分别降低高达7倍和28倍,并在现有工具会使系统过载的嵌入式机器人上保持实用性。

英文摘要

Robot Operating System 2 (ROS 2), the de facto standard middleware framework for robots, runs each robot as a graph of nodes communicating over the Data Distribution Service (DDS), a publish/subscribe substrate. Observing this inter-node communication in real time is essential to robot development, yet it has a price. A tool can receive data only by joining the DDS domain as a subscriber that discovery has matched to the publisher, so observing folds the tool into the system it measures and perturbs it. We define this protocol-inherent perturbation as the observer's probe effect. It inflates the discovery plane, adds deserialization cost on the observer, makes the loss it reports diverge from what the subscriber actually received, and near saturation displaces the subscriber's messages. The only escape, capturing all wire traffic passively, discards ROS 2 message semantics and scales with total traffic, not what is observed. We present ros2probe, a non-intrusive observation framework that removes the probe effect. It reconstructs the full ROS 2 communication state from the domain's discovery packets at no bandwidth cost, then drives an in-kernel filter restricted to the topics the user asks for, lifting only those packets at minimal cost and observing what the real subscriber receives. Its interfaces and recordings match the standard ROS 2 tools. Across three hardware platforms (laptop, Jetson, and Raspberry Pi), two DDS implementations, and seven robot-operation workloads, ros2probe holds the discovery graph within 0.5% of an unobserved system, whereas domain-joining tools inflate discovery up to 2.6× and drop 38.5% of the subscriber's messages at saturation while ros2probe drops none. It reports loss with a recall of 1.0, cuts observer CPU and memory by up to 7× and 28×, respectively, and stays practical on the embedded robots where existing tools overload the system.

2606.10743 2026-06-10 cs.RO 新提交

Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization

基于开放世界接触定位的以手为中心的人到机器人轨迹迁移

Yitian Shi, Di Wen, Zhengqi Han, Zicheng Guo, Yu Hu, Edgar Welte, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes

发表机构 * Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院)

AI总结 提出HOWTransfer框架,通过接触定位从人类视频中提取接触感知的机器人轨迹,无需物体特定描述,在多样化操作任务中实现86%的成功率。

AI中文摘要

由于嘈杂的手-物体交互、部分观测下的未知物体以及跨实体差异,从人类视频演示中学习仍然具有挑战性。为了解决这些问题,我们提出了HOWTransfer(Hand-Object Open-World Transfer),这是一个以手为中心的框架,将人类演示提炼为接触感知、分类学信息丰富且多样化的机器人轨迹。HOWTransfer不依赖于物体特定描述、视觉语言查询或显式物体状态跟踪,而是通过推理观测到的手-物体交互线索,恢复时间一致的三维手部运动并定位时间接触区间。然后,利用定位的接触起始点将人类抓取意图重定向到多模态平行颚抓取假设,这些假设沿恢复的手腕轨迹传播以生成机器人可执行的运动。最后,轨迹编辑阶段细化接触对齐,并从单个演示生成多样化的可执行变体。跨多种操作任务的实验表明,HOWTransfer能够实现准确的接触定位和高质量的机器人运动重定向,成功率为86%,在盲选偏好研究中优于遥操作轨迹。

英文摘要

Learning from human video demonstrations remains challenging due to noisy hand-object interactions, unseen objects with partial observation, and cross-embodiment discrepancy. To address these challenges, we present HOWTransfer (Hand-Object Open-World Transfer), a hand-centric framework that distills human demonstrations into contact-aware, taxonomy-informed, and diverse robotic trajectories. Instead of relying on object-specific descriptions, vision-language queries, or explicit object-state tracking, HOWTransfer recovers temporally consistent 3D hand motion and localizes temporal contact intervals by reasoning over observed hand-object interaction cues. The localized contact onsets are then used to retarget human grasp intent into multi-modal parallel-jaw grasp hypotheses, which are propagated along the recovered wrist trajectory to generate robot-executable motions. Finally, a trajectory editing stage refines contact alignment and produces diverse executable variants from a single demonstration. Experiments across diverse manipulation tasks show that HOWTransfer enables accurate contact localization and high-quality robot motion retargeting with 86% success, which is preferred over teleoperated trajectories in a blinded preference study.

2606.10736 2026-06-10 cs.CL cs.AI cs.CY 新提交

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

利用课程先决条件图检测对话式AI交互中的知识缺口

Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出一个流水线,通过少样本文本分类器将学生向对话式AI助教提出的问题映射到课程主题,并利用GPT-4提取的先决条件知识图谱,以检测主题级知识缺口。

Comments Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

AI中文摘要

大型在线课程会产生数千条学生向对话式AI助教提出的问题,但这些交互日志作为诊断信号在很大程度上未被利用。我们提出一个流水线,使用少样本文本分类器,将学生向对话式AI助教提出的问题映射到课程主题,该分类器基于GPT-4提取的课程概念先决条件知识图谱。在研究生级别AI课程的164名学生的1,340个问题事件上评估,我们的分类器在43个标签(42个课程主题加上一个“未知”弃权类别)上达到80.0%的准确率。主题级问题数量与独立期中调查中学生自我报告的难度显著相关(rho = 0.491, p = 0.008, n = 28个主题),提供了趋同证据,表明分类后的问题流反映了真实的主题难度。这些结果表明,映射到课程结构上的对话式AI交互日志携带关于主题级知识缺口的可操作信号,并为教师提供基于课程视角的哪些主题需要关注的视图。

英文摘要

Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

2606.10734 2026-06-10 cs.LG stat.ME stat.ML 新提交

SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors

SPACR: 单次自适应训练的不确定性感知共形回归器

Soundouss Messoudi, Sylvain Rousseau, Sébastien Destercke

发表机构 * Heudiasyc - UMR CNRS 7253, Université de Technologie de Compiègne(法国贡比涅技术大学 - CNRS 7253联合实验室 Heudiasyc)

AI总结 提出SPACR方法,通过可微损失直接训练不确定性感知回归器,联合优化效率和有效性,无需批分割或预定义置信水平,单个模型在推理时支持多置信水平预测区间,实验表明其区间更窄、覆盖-效率权衡更优且计算成本更低。

AI中文摘要

共形预测(CP)为预测模型提供了鲁棒的不确定性保证,但通常事后应用,这导致模型训练与产生高效(即窄)区间的共形目标不一致。我们提出SPACR(单次自适应共形回归器),一种在可微损失内直接训练不确定性感知回归器的新方法。SPACR联合优化效率和有效性,无需在训练期间进行批分割或预定义置信水平。因此,单个SPACR模型在推理时能在多个置信水平下产生有效的预测区间,避免了像DOICR等方法所需的高成本重训练。在多个数据集上的实验表明,与标准CP和DOICR相比,SPACR始终提供更紧的区间和更好的覆盖-效率权衡,同时显著降低计算成本。

英文摘要

Conformal Prediction (CP) provides robust uncertainty guarantees for predictive models, but is typically applied post hoc, which misaligns model training with the conformal goal of producing efficient (i.e., narrow) intervals. We propose SPACR (Single-Pass Adaptive Conformal Regressor), a novel method for directly training uncertainty-aware regressors within a differentiable loss. SPACR jointly optimizes efficiency and validity without batch-splitting or predefined confidence levels during training. As a result, a single SPACR model yields valid prediction intervals at multiple confidence levels during inference, avoiding the costly retraining required by methods like DOICR. Experiments on diverse datasets show that SPACR consistently gives tighter intervals and better coverage-efficiency trade-offs compared to standard CP and DOICR, while significantly reducing computational costs.
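For readers unfamiliar with the post hoc baseline the abstract contrasts against, the following is a minimal sketch of standard split conformal regression with absolute-residual scores. It is not SPACR itself, whose loss is not given in the abstract; the function names `conformal_quantile` and `predict_interval` are hypothetical and chosen for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals.

    `residuals` are absolute errors |y - y_hat| on a held-out calibration set.
    """
    n = len(residuals)
    # 1-indexed order statistic used by split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this confidence level.
        return math.inf
    return sorted(residuals)[rank - 1]


def predict_interval(point_prediction, q):
    """Symmetric interval around a point prediction with half-width q."""
    return point_prediction - q, point_prediction + q


# Usage: calibrate once, then wrap any new point prediction.
calibration_residuals = [0.3, 1.2, 0.7, 0.1, 0.9, 0.4, 1.5, 0.2, 0.6]
q = conformal_quantile(calibration_residuals, alpha=0.5)
lower, upper = predict_interval(10.0, q)
```

Because the quantile is computed after training, the model never sees the interval width during optimization, which is the misalignment the abstract describes.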

2606.10733 2026-06-10 cs.RO 新提交

Pushing the Performance Limits in Autonomous Racing: Continuous Stability-Aware Adaptive Velocity Planning in Formula Student Driverless

推动自动驾驶赛车的性能极限:大学生方程式无人驾驶中的连续稳定性感知自适应速度规划

Tamara Bergerhoff, Sebastian Baader, Pascal Meißner, Frank Deinzer

发表机构 * Center for Artificial Intelligence and Robotics (CAIRO), TUAS Würzburg-Schweinfurt(维尔茨堡-施韦因富特应用科学大学人工智能与机器人中心(CAIRO))

AI总结 提出一种连续稳定性感知自适应速度规划方法,通过推断连续缩放因子生成摩擦图,实现实时最优目标速度计算,在真实赛车上测试圈速提升35%。

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

AI中文摘要

在自动驾驶赛车中,尤其是在大学生方程式无人驾驶等比赛中,精确规划赛车的目标速度对于实现有竞争力的圈速和稳定的驾驶行为至关重要。特别是在高速行驶时,速度规划是一项重大挑战,因为它必须实时进行,同时考虑赛道布局、环境影响、机械公差以及由此产生的控制不准确性。在本文中,我们提出了一种新颖的速度规划方法,能够动态适应这些变化的条件。该方法不是估计物理轮胎-路面摩擦系数,而是从车辆稳定性中间接推断出一个连续缩放因子。该因子不仅反映了有效的轮胎-路面相互作用,还捕捉了控制不准确性的影响。由此,我们生成一个连续的摩擦图,作为稳健、自适应的基础,用于计算考虑车辆和环境限制的最优目标速度。我们提出的方法在一辆真实的大学生方程式赛车上进行了评估,结果显示,圈速在十圈内提高了35%,与非自适应方法相比平均提高了8%。

英文摘要

In autonomous racing, especially in competitions such as Formula Student Driverless, precise planning of the target velocity of a race car is crucial for competitive lap times and stable driving behavior. At high speeds in particular, Velocity Planning (VP) is a significant challenge as it has to be performed in real time, taking into account track layouts, environmental influences, mechanical tolerances, and the resulting control inaccuracies. In this paper, we present a novel approach to VP that dynamically adapts to such changing conditions. Instead of estimating the physical Tire-Road Friction Coefficient (TRFC), a continuous scaling factor is inferred indirectly from vehicle stability. This factor not only reflects the effective tire-road interaction but also captures effects of control inaccuracies. From this, we generate a continuous friction map, which serves as a robust, adaptive basis for computing the optimal target speed, accounting for both vehicle and environmental limits. Our proposed approach was evaluated on a real Formula Student race car, showing a lap time improvement of 35% over ten laps and an average increase of 8% compared to a non-adaptive approach.

2606.10732 2026-06-10 cs.RO 新提交

Vehicle Prediction Model for Enhanced MPC Path Tracking in Formula Student Driverless

面向大学生无人驾驶方程式赛车增强MPC路径跟踪的车辆预测模型

Sebastian Baader, Tamara Bergerhoff, Pascal Meißner, Frank Deinzer

发表机构 * Center for Artificial Intelligence and Robotics (CAIRO), TUAS Würzburg-Schweinfurt(维尔茨堡-施韦因富特应用科学大学人工智能与机器人中心(CAIRO))

AI总结 提出一种结合离线贝叶斯线性回归与在线稀疏高斯过程回归的实时车辆预测模型,将预测精度提升高达57%,并在实际赛车MPC路径跟踪控制器中验证有效性。

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

AI中文摘要

自动驾驶赛车,如大学生无人驾驶方程式赛车,在接近其物理操控极限下运行。由此产生的高度非线性车辆行为增加了路径跟踪的复杂性,尤其是在狭窄赛道上。模型预测控制(MPC)通常用于解决此问题,其性能与底层预测模型的准确性密切相关。本文提出一种新颖的、实时能力强的自动驾驶赛车预测模型,该模型通过结合过去运行和当前驾驶情况的信息来适应变化的条件。我们的模型分为三个连续的子模型:名义运动学自行车模型、离线贝叶斯线性回归(BLR)模型和在线稀疏高斯过程回归(SGPR)模型。所提出的方法能够在不显著增加计算成本的情况下有效整合所有可用数据,确保从运行开始就具有高预测精度和定量不确定性评估。与现有方法相比,预测精度提高了高达57%。此外,我们成功地在基于MPC的路径跟踪控制器中,在真实的大学生方程式赛车上展示了该模型的实际适用性。

英文摘要

Autonomous race cars, such as in Formula Student Driverless, operate close to their physical handling limits. The resulting highly nonlinear vehicle behavior increases the path tracking complexity, especially on narrow tracks. Model Predictive Control (MPC) is commonly used to address this issue, a method whose performance is closely tied to the accuracy of the underlying prediction model. This paper presents a novel, real-time capable prediction model for autonomous race cars that adjusts to changing conditions by combining information from past runs and the current driving situation. Our model is divided into three consecutive submodels: a nominal Kinematic Bicycle Model, an offline Bayesian Linear Regression (BLR) model, and an online Sparse Gaussian Process Regression (SGPR) model. The proposed approach enables efficient integration of all available data without significantly increasing computational cost, ensuring high prediction accuracy and a quantitative uncertainty assessment right from the start of the run. Compared to existing approaches, an improvement in prediction accuracy of up to 57% was achieved. Further, we successfully demonstrated the practical applicability of the model within an MPC-based path tracking controller on a real Formula Student race car.
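The first of the three submodels, the nominal Kinematic Bicycle Model, can be sketched as a single forward-Euler step. This is the common textbook form with the rear axle as reference point; the abstract does not state which reference point, integrator, or parameters the authors use, so the function name `kinematic_bicycle_step` and all argument names are assumptions.

```python
import math


def kinematic_bicycle_step(x, y, yaw, v, steer, accel, wheelbase, dt):
    """Advance a rear-axle kinematic bicycle model by one Euler step.

    x, y      : position in metres
    yaw       : heading in radians
    v         : longitudinal speed in m/s
    steer     : front steering angle in radians
    accel     : longitudinal acceleration in m/s^2
    wheelbase : distance between axles in metres (illustrative value below)
    dt        : step length in seconds
    """
    x_next = x + v * math.cos(yaw) * dt
    y_next = y + v * math.sin(yaw) * dt
    yaw_next = yaw + (v / wheelbase) * math.tan(steer) * dt
    v_next = v + accel * dt
    return x_next, y_next, yaw_next, v_next


# Usage: roll the model forward over a short horizon, as an MPC would.
state = (0.0, 0.0, 0.0, 10.0)
for _ in range(5):
    state = kinematic_bicycle_step(*state, steer=0.05, accel=0.0,
                                   wheelbase=1.53, dt=0.05)
```

In the architecture the abstract describes, the BLR and SGPR submodels would then correct the residual between such a nominal prediction and the measured vehicle response.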

2606.10722 2026-06-10 cs.CL 新提交

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

持续LLM升级:一种用于稠密到稀疏LLM的预测器门控分组级(bank-wise)稀疏训练方案

Ruixuan Huang, Jinyuan Shi, Hantao Huang, Yifan Huang, Ziyi Guan, Hao Zeng, Ian En-Hsu Yen, Minghui Yu

发表机构 * Nanyang Technological University(南洋理工大学) Salesforce AI Huawei Noah's Ark Lab(华为诺亚方舟实验室)

AI总结 提出一种从稠密检查点构建通道稀疏大语言模型的持续训练方法,通过预测器门控稀疏SwiGLU FFN和分组级(bank-wise)top-k规则实现4倍稀疏性,并修复长上下文失败模式。

AI中文摘要

我们研究稠密到稀疏的持续训练,作为从稠密检查点构建通道稀疏大语言模型的一种方式。从Qwen2.5-8B稠密骨干网络开始,我们在32K上下文中继续训练,并在32K阶段引入预测器门控稀疏SwiGLU FFN。对于每个token和层,我们使用低秩预测器生成FFN通道路由logits。然后应用分组级(bank-wise)top-k规则,在每个64通道的分组(bank)中保留16个通道,从而在FFN中间激活中实现4倍稀疏性。与事后稀疏推理方法不同,路由模块被放置在主要语言建模路径上,并在持续训练期间进行优化,使稠密模型能够升级为面向硬件的稀疏模型。我们报告了架构、训练方案、基准性能以及训练经验。我们还识别了RULER-CWE上的层局部长上下文失败模式,并提出了一种单层修复算法,显著改善了受影响长度范围内的性能。

英文摘要

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
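The bank-wise top-k rule described above (keep 16 of every 64 channels) can be sketched in plain Python for a single token. The routing logits would come from the low-rank predictor, which is not reproduced here; the function name `bankwise_topk_mask` and the tie-breaking choice are assumptions for illustration.

```python
def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask keeping the `keep` largest logits in each bank.

    `logits` holds one routing score per FFN channel for a single token.
    Channels are grouped into consecutive banks of `bank_size`.
    """
    if len(logits) % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # Top-`keep` channel indices in this bank; ties go to the lower index
        # because Python's sort is stable.
        top = sorted(bank, key=lambda i: logits[i], reverse=True)[:keep]
        for i in top:
            mask[i] = 1
    return mask


# Usage: two banks of 64 channels, each keeping exactly 16 active channels.
example_logits = [float(i % 7) for i in range(128)]
example_mask = bankwise_topk_mask(example_logits)
active_fraction = sum(example_mask) / len(example_mask)
```

Fixing the number of active channels per bank, rather than globally, gives every bank the same 1-in-4 density, which is consistent with the 4x sparsity and the hardware-oriented framing in the abstract.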

2606.10718 2026-06-10 cs.LG cs.AI 新提交

Transformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition

基于Transformer的脑电情绪识别时空特征学习模型

Xinglong Cui, Dian Gu

发表机构 * Beijing Neurodeep Technology Co., Ltd(北京纽罗德普科技有限公司) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出EEG-TransNet架构,通过局部自注意力块和模糊注意力同步Transformer捕捉脑电信号的时空特征,在三个数据集上优于现有方法。

AI中文摘要

脑电图(EEG)是一种广泛采用的监测大脑活动的技术,因其高时间分辨率和成本效益,为神经状态提供了有价值的见解。为了增强对复杂EEG数据的分析,我们提出了EEG-TransNet,一种旨在捕捉EEG信号的时间、区域和同步特征的架构。EEG-TransNet引入了三个关键模块:1)利用ResNet和基于小波去噪的预处理与特征提取模块,2)用于区域特征学习的局部自注意力块,以及3)用于建模时空依赖性的模糊注意力同步Transformer(FAST)。通过在三个EEG数据集(BETA、SEED和DepEEG)上的大量实验,所提出的模型在不同信号长度下的分类准确性和鲁棒性方面始终优于其他方法。消融研究证实了局部自注意力块在提高性能方面的贡献,并且解码器中引入深度可分离卷积降低了计算复杂度,同时保持了高准确性。EEG-TransNet在受试者间具有最小的性能变化,突显了其作为基于EEG的大脑活动分类和情绪识别任务的鲁棒工具的潜力。

英文摘要

Electroencephalography (EEG) is a widely adopted technique for monitoring brain activity, offering valuable insights into neurological states due to its high temporal resolution and cost-effectiveness. To enhance the analysis of complex EEG data, we propose EEG-TransNet, an architecture designed to capture temporal, regional, and synchronous features of EEG signals. EEG-TransNet introduces three key modules: 1) a preprocessing and feature extraction module leveraging ResNet and wavelet-based denoising, 2) a Local Self-Attention Block for regional feature learning, and 3) a Fuzzy-Attention Synchronous Transformer (FAST) to model spatiotemporal dependencies. Through extensive experiments on three EEG datasets (BETA, SEED, and DepEEG), the proposed model consistently outperforms other methods in terms of classification accuracy and robustness across varying signal lengths. Ablation studies confirm the contribution of the Local Self-Attention Block in improving performance, and the inclusion of depthwise separable convolutions in the decoder reduces computational complexity while maintaining high accuracy. EEG-TransNet's ability to generalize across subjects with minimal performance variation highlights its potential as a robust tool for EEG-based brain activity classification and emotion recognition tasks.

2606.10706 2026-06-10 cs.LG cs.AI 新提交

Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

统一LLM训练中的数据、内存和计算效率:一项综述

Vanessa Schmidt, Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Anke Schmeink

发表机构 * Chair of Information Theory and Data Analytics (INDA), RWTH Aachen University(亚琛工业大学信息理论与数据分析教席) AIT Austrian Institute of Technology GmbH(奥地利技术研究所) Automation and Control Institute, Technische Universität Wien (TUW)(维也纳工业大学自动化与控制研究所)

AI总结 本文从资源约束视角综述大语言模型训练中的数据效率、内存效率和计算预算感知三大瓶颈,强调三者需联合优化而非孤立处理。

Comments Accepted for publication in IEEE Transactions on Artificial Intelligence (TAI)

AI中文摘要

资源约束日益决定了大语言模型(LLM)中可以训练、微调和部署的内容,然而效率通常通过孤立的技术而非作为相互作用的限制系统来研究。本综述采用以约束为中心的视角,围绕三个耦合的瓶颈组织近期进展:数据效率(训练什么)、内存效率(如何适应训练)和计算预算感知(何时何地消耗FLOPs)。在数据轴上,我们回顾了最大化每个token学习量的选择和剪枝方法,从基于学习动态的可扩展代理信号到基于梯度和影响的评分,以及难度感知和课程式策略。我们强调新兴证据表明,不同的“好数据”概念在不同机制中占主导地位,这意味着最优子集取决于任务目标和资源预算,而非普遍适用。在系统方面,我们表明GPU内存(而非原始计算)通常是微调中的主要瓶颈,有效的扩展需要联合减少权重存储、优化器状态和激活内存,而不是孤立地优化任何单一组件。超越内存,我们将训练和推理视为计算主导的过程,其中优化、数据选择和解码必须明确考虑有限的FLOP预算。我们回顾了计算最优分配和停止规则的证据,其中一旦边际性能增益低于预算依赖的阈值,计算应停止或重新分配。总之,这些结果将计算感知的数据选择、缩放定律和自适应推理统一在资源条件决策的共同原则下。

英文摘要

Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.

2606.10705 2026-06-10 cs.LG cs.AI cs.SY eess.SY 新提交

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

事件驱动强化学习实现半导体制造中的长时域控制

Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta

发表机构 * Politecnico di Milano(米兰理工大学) STMicroelectronics(意法半导体)

AI总结 提出事件驱动深度强化学习框架,将半导体制造控制建模为中心化智能体问题,通过事件驱动时序差分方法优化多目标策略,在高保真仿真中显著提升吞吐量和利用率。

AI中文摘要

强化学习有望优化大规模系统中的序贯决策。半导体制造系统是随机且高度约束的环境,其中异构晶圆在广泛的设备网络中经历数百个加工步骤。这些特性产生了复杂、高维的决策问题,具有延迟反馈和长时域要求,使生产计划和控制复杂化。我们提出了一个用于此规模的多目标策略优化的深度强化学习框架。具体来说,我们将控制表述为一个中心化智能体问题,其中核心策略协调系统范围的决策,而系统演化被表示为由离散事件驱动的互联时间过程。相应地,我们开发了一个定制的事件驱动时序差分公式,该公式保持通用性,并可在相关训练设置下与各种策略优化方法集成。我们研究了纳入该框架的几种核心无模型算法,并使用不同工业现实操作场景的高保真仿真评估其有效性。在广泛的验证实验中,在离线和在线设置下训练的智能体在吞吐量和利用率方面显示出显著且一致的提升。我们进一步评估了训练阶段的表现和泛化能力,阐明了替代强化学习公式和算法的相对优势。总体而言,结果支持所提出框架在控制事件驱动复杂自适应系统方面的可扩展性、通用性和可迁移性。

英文摘要

Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.

2606.10701 2026-06-10 cs.CV 新提交

Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

向量地图即语言:迈向统一的遥感向量制图

Yinglong Yan, Yunkai Yang, Haoyi Wang, Wei Fu, Linshan Wu, Honghu Pan, Shaobo Xia, Shanghang Zhang, Hao Chen, Leyuan Fang

发表机构 * School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Geomatics Engineering, Changsha University of Science and Technology(长沙理工大学测绘工程系) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)

AI总结 提出VecLang范式,将多类向量制图重构为结构化文本生成,通过类GeoJSON语言统一表达不同地理实体,并设计渐进式视觉-语言映射框架和层次化向量语言优化方法,实现跨类别、跨数据集和开放词汇的向量地图生成。

AI中文摘要

遥感向量制图旨在从遥感图像中生成地理实体的结构化地图,如建筑物、道路和水体。实践中,向量地图通常包含多个类别层和异构实体结构,需要统一模型满足多样化的制图需求。然而,现有方法通常将向量对象表示为多边形或图,使其仅适用于特定类别:多边形难以捕捉拓扑关系,而图往往模糊实例边界。我们观察到,语言作为人类交流的自然媒介,提供了一种灵活且富有表现力的表示,能够容纳异构地图元素,包括几何、语义和拓扑。受此启发,我们提出向量地图即语言(VecLang),一种统一范式,将多类向量制图重构为结构化文本生成。VecLang将不同地理实体的共同元素编码为类GeoJSON的向量语言,从而在共享文本格式内实现跨类别建模。为了可靠地生成这种语言,我们设计了一个渐进式视觉-语言映射框架,首先定位向量化单元,然后生成结构化地图元素。我们进一步引入层次化向量语言优化,利用强化学习提高语法有效性、内容保真度和地图可执行性。我们还构建了包含54K图像和800K实例的VecMap-Bench,支持标准和泛化设置下的训练与评估。大量实验表明,VecLang能够处理单类和多类向量制图,同时实现强大的跨数据集和开放词汇泛化。模型和数据集已公开于 https://github.com/yyyyll0ss/VecLang。

英文摘要

Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topology. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at https://github.com/yyyyll0ss/VecLang.

2606.10699 2026-06-10 cs.CV cs.AI 新提交

Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

使用YOLOv12模型验证生产线上网线(跳线)中导线的正确颜色顺序

Amin Doroodchi, Danial Soleimany

发表机构 * Computer Department, Islamic Azad University, Beyza Branch(伊斯兰 Azad 大学计算机系,贝兹分校) R&D at Nedaye Sabz Company, Isfahan Branch(Nedaye Sabz 公司研发部,伊斯法罕分公司)

AI总结 针对网线生产中导线颜色顺序检测问题,提出基于YOLOv12的目标检测模型,实现高精度实时验证,减少人工错误。

详情
AI中文摘要

在网络电缆的生产过程中,确保标准连接器内部线对的正确颜色顺序对电缆的最终性能至关重要,因为任何错位或颜色顺序错误都可能导致缺陷产品并造成巨大成本。基于数字显微镜目视检查的传统检测方法通常耗时、繁琐且容易出错。在本研究中,开发了一种基于第十二版YOLO目标检测模型的智能系统,用于识别跳线中导线的位置并验证其正确的颜色顺序。使用的数据集包括从网络连接器显微视图中捕获的2500张图像,其中70%用于训练,15%用于验证,15%用于测试。所提出的模型利用单阶段架构和学习过程中的注意力机制,实现了约98%精度的导线检测。此外,总体平均准确率、分类精度和召回率分别约为95%、99%和98%。结果表明,该系统能够在生产线上可靠地实时验证导线颜色顺序的正确性,无需人工干预,从而减少人为错误并提高制造效率。

英文摘要

In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.
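
The verification step described in this abstract can be sketched as a left-to-right comparison of detected wires against a reference order. This is a minimal illustration only: the `Detection` type, the class names, the confidence threshold, and the use of the T568B pin order as the reference are assumptions made here, not details given in the abstract.

```python
from dataclasses import dataclass

# Assumed reference order (T568B, pins 1 to 8); the abstract does not name the standard.
EXPECTED_ORDER = [
    "white-orange", "orange", "white-green", "blue",
    "white-blue", "green", "white-brown", "brown",
]


@dataclass
class Detection:
    label: str       # predicted wire colour class (hypothetical class names)
    x_center: float  # horizontal centre of the bounding box, in pixels
    score: float     # detector confidence


def sequence_is_correct(detections, expected=EXPECTED_ORDER, min_score=0.5):
    """Return True if confident detections, read left to right, match `expected`."""
    kept = [d for d in detections if d.score >= min_score]
    ordered = sorted(kept, key=lambda d: d.x_center)
    return [d.label for d in ordered] == list(expected)
```

A missing or duplicated wire changes the length of the detected sequence, so it fails the same equality check as a swapped pair.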

2606.10696 2026-06-10 cs.CV 新提交

Don't waste SAM

不要浪费 SAM

Nermeen Abou Baker, Uwe Handmann

发表机构 * Ruhr West University of Applied Sciences - Dept of Computer Science(鲁尔西部应用科学大学计算机科学系)

AI总结 本文评估了SAM在垃圾分割任务中的泛化能力,通过微调SAM-ViT-H模型,在三个数据集上显著提升IoU,表明微调SAM作为基础模型对下游任务至关重要。

Comments Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium)

详情
AI中文摘要

Meta AI 最近发布了 Segment Anything Model (SAM),该模型在各种任务中展示了卓越的零样本图像分割性能,具有显著的准确性。尽管 SAM 无法在多个研究领域提供精确的分割,但它仍然是支持分割流程的宝贵起点,特别是对于需要大量高级技能标注的任务。本研究旨在使用三个垃圾分割数据集评估 SAM 和微调 SAM 模型的泛化能力。尽管这些数据集是从真实场景中捕获的(与 SAM 预训练的数据相同),但它们带来了若干挑战,包括遮挡、可变形物体、透明物体以及易与背景混淆的物体。我们的发现表明,微调的 SAM-ViT-H 模型在 Zerowaste 和 TACO 数据集上优于最先进的方法,IoU 显著提高了 +30,并且非常接近 TrashCan 1.0 的性能水平,仅相差 -1.44。在评估这些流行的垃圾数据集后,很明显,微调 SAM 作为基础模型是为下游垃圾分割任务提供更好泛化能力的关键步骤。因此,SAM 不应被忽视或浪费。

英文摘要

Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline, particularly for tasks that require extensive annotation by highly skilled annotators. This study aims to evaluate the generalization of SAM and fine-tuned SAM models using three waste segmentation datasets. Although they are captured from real scenes, like the data SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state of the art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels on TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.
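
The comparisons in this abstract are stated in IoU (intersection over union). As a reminder of what that metric measures, here is a minimal sketch for binary masks stored as nested lists; the function name and the convention of returning 1.0 for two empty masks are choices made here for illustration.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; this convention is an assumption of the sketch.
    return inter / union if union else 1.0
```

Benchmarks typically average this value over classes or images and report it as a percentage.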

2606.10694 2026-06-10 cs.CL 新提交

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

REAL: 一种增强推理的图框架用于LLM的长期记忆管理

Keer Lu, Liwei Chen, Guoqing Jiang, Zhiheng Qin, Yunhuai Liu, Wentao Zhang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Kuaishou Technology(快手科技) Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University(北京大学前沿交叉学科研究院数据科学中心)

AI总结 提出REAL框架,通过构建时序和置信度感知的有向属性图,采用非破坏性更新和混合束搜索检索,解决LLM长期记忆中的关系缺失、事实覆盖和查询被动问题,平均性能提升22.72%。

详情
AI中文摘要

大型语言模型(LLM)越来越期望与用户进行长时间跨度的交互。然而,由于其有限的上下文窗口,LLM无法保留所有过去的交互,因此长期记忆管理对于存储、更新和检索超出上下文限制的历史信息至关重要。尽管最近的记忆系统试图通过外部存储历史信息来解决这个问题,但现有方法存在三个关键限制:基于平面文本的记忆组织无法捕捉记忆之间的显式关系,结构化记忆系统通常会破坏性地覆盖演变的事实,而当前的检索机制在证据不完整时仍然与查询无关且被动。REAL将长期对话记忆构建为时序和置信度感知的有向属性图,其中每个原子事实都用实体、关系、有效时间区间、置信度分数和探索意图标签表示。在记忆构建过程中,REAL采用非破坏性时序更新策略,保留并行的事实版本及其有效性区间,从而能够忠实地追踪事实的演变。在检索过程中,REAL锚定与查询相关的根实体,解耦其探索意图,并执行语义评估器引导的混合束搜索以提取紧凑的记忆子图。它进一步结合反事实推理来修复不可靠的检索状态,并通过隐式逻辑关系恢复缺失的记忆证据。综合实验表明,REAL在长期记忆性能上显著优于平面文本、基于图和现有记忆基线,平均提升22.72%。

英文摘要

Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72%.
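
The idea of a non-destructive temporal update can be illustrated with a small store that closes the validity interval of an old fact instead of overwriting it. This is a simplified sketch under assumptions made here: the `Fact` and `TemporalMemory` names, integer timestamps, and the restriction to one open version per (subject, relation) pair are all hypothetical and much simpler than the property graph the abstract describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int                 # e.g. a turn index or timestamp
    valid_to: Optional[int] = None  # None means "still valid"
    confidence: float = 1.0


class TemporalMemory:
    def __init__(self):
        self.facts = []

    def update(self, subject, relation, obj, time, confidence=1.0):
        """Record a new fact version; older versions are closed, never deleted."""
        for f in self.facts:
            if f.subject == subject and f.relation == relation and f.valid_to is None:
                if f.obj == obj:
                    return f  # same value is already current, nothing to do
                f.valid_to = time  # close the old version at the update time
        new = Fact(subject, relation, obj, time, None, confidence)
        self.facts.append(new)
        return new

    def query(self, subject, relation, time):
        """Return the fact versions that were valid at `time`."""
        return [
            f for f in self.facts
            if f.subject == subject and f.relation == relation
            and f.valid_from <= time and (f.valid_to is None or time < f.valid_to)
        ]
```

Because old versions stay in the store with their intervals, a question about the past ("where did she live at time 3?") can still be answered after the fact has changed.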

2606.10688 2026-06-10 cs.RO 新提交

Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis

自动驾驶中基于反事实分析的自监督相关性建模

Luca Lusvarghi, Javier Gozalvez, Pablo Urbano Hidalgo

发表机构 * Networked Systems Lab, Universidad Miguel Hernandez de Elche(埃尔切米格尔·埃尔南德斯大学网络系统实验室)

AI总结 提出一种基于反事实分析的自监督方法,用于量化自动驾驶中物体的相关性,实现毫秒级实时估计,并生成相关性热图以辅助感知与规划。

详情
AI中文摘要

自动驾驶依赖于计算密集型的感知管线,以持续检测和跟踪周围环境中的物体。虽然某些物体对于规划安全有效的操作至关重要,但其他物体可能不相关,并且对自动驾驶车辆的驾驶决策没有影响。关注相关物体可以更有效地利用可用计算资源,减少处理延迟,并限制感知噪声的下游传播。在这项工作中,我们提出了一种基于反事实分析的新型自监督方法,以开发相关性模型——一种基于AI的工具,用于量化物体对自动驾驶车辆的相关性。为了展示所提出方法的潜力,我们在选定城市场景中生成的合成因果数据集上训练了相关性模型。结果表明,该相关性模型能够以毫秒级延迟准确估计物体的相关性,从而在高密度场景中实现实时相关性估计。我们还展示了该相关性模型可用于构建相关性热图,为自动驾驶车辆的驾驶策略提供有价值的见解,并可用于主动通知感知和规划任务。我们公开发布了相关性模型和因果数据集。

英文摘要

Autonomous driving relies on computationally intensive perception pipelines to continuously detect and track objects in the surrounding environment. While some objects are key to planning safe and effective maneuvers, others may not be relevant and have no impact on the autonomous vehicle's driving decisions. Focusing on relevant objects allows a more efficient use of available computational resources, reduces processing latencies, and limits the downstream propagation of perception noise. In this work, we propose a novel self-supervised approach based on counterfactual analysis to develop a relevance model - an AI-based tool that quantifies the relevance of objects for an autonomous vehicle. To demonstrate the potential of the proposed approach, we train a relevance model on a synthetic causal dataset generated in a selected urban scenario. Results show that the relevance model is able to accurately estimate the objects' relevance with millisecond-level latency, enabling real-time relevance estimation even in high-density scenarios. We also show that the relevance model can be used to build relevance heatmaps that offer valuable insights into the autonomous vehicle's driving policy and can be used to proactively inform perception and planning tasks. We openly release both the relevance model and the causal dataset.
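
One way to read "counterfactual analysis" here is to remove an object from the scene and measure how much the driving decision changes. The sketch below illustrates that reading only; `toy_planner`, the object fields, and the use of an absolute speed difference as the score are assumptions made for this example, and the abstract does not specify how the paper computes its labels.

```python
def counterfactual_relevance(planner, objects):
    """Score each object by how much the planner's output changes when it is removed."""
    baseline = planner(objects)
    scores = {}
    for i, obj in enumerate(objects):
        without = objects[:i] + objects[i + 1:]
        scores[obj["id"]] = abs(planner(without) - baseline)
    return scores


def toy_planner(objects):
    """Hypothetical stand-in planner: target speed (m/s) limited by gaps to in-lane objects."""
    return min([30.0] + [o["gap_m"] / 2.0 for o in objects if o["in_lane"]])
```

Scores produced this way need no human annotation, which is consistent with the self-supervised framing; a learned model could then be trained to predict them directly instead of re-running the planner per object.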

2606.10684 2026-06-10 cs.LG cs.AI 新提交

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

分工与合作:基于跨智能体学习信号的角色分解多智能体LLM训练

Jaewan Park, Solbee Cho, Jay-Yoon Lee

发表机构 * Seoul National University(首尔大学)

AI总结 提出DAC框架,将多步推理分解为搜索和生成两个子任务,分别由专用智能体处理,并通过跨智能体学习信号解决信用分配问题,在QA基准上超越全参数微调的单体模型。

详情
AI中文摘要

现代语言智能体通过多步推理在知识密集型问答中表现出色。然而,现有方法通常将证据获取和答案生成耦合在单一策略中。这迫使单个模型扮演多个可能冲突的角色,导致策略空间组合爆炸并阻碍高效探索。同时,训练中引入信用分配问题:当生成失败时,检索到足够证据的搜索动作仍可能受到惩罚,反之亦然。我们提出DAC(分工与合作),一个角色分解的多智能体训练框架,将智能体搜索分解为两个合作性子任务,每个子任务由专用智能体处理,并使用角色特定的学习信号进行训练。生成器扮演双重角色:既是答案生成器,也是证据充分性验证器,当检索到的证据不足时放弃回答。该放弃信号被纳入搜索智能体的奖励中,提供结构化的跨智能体学习信号以改进信用分配。相反,搜索器通过硬阳性证据增强向生成器暴露多样且具有挑战性的证据环境,提高其鲁棒性。在通用和多跳问答基准上的实验表明,DAC通过共享骨干网络上的参数高效LoRA模块实现,在性能上优于先前依赖全参数微调单体模型的基线方法。

英文摘要

Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent's reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.

2606.10682 2026-06-10 cs.LG 新提交

PL-KKT-hPINN: Enforcing Nonlinear Equality Constraints on Neural Networks via Piecewise-Linear Projection

PL-KKT-hPINN:通过分段线性投影在神经网络上强制非线性等式约束

Fateme Mohammad Mohammadi, Hector Budman, Joshua L. Pulsipher

发表机构 * Department of Chemical Engineering, University of Waterloo(滑铁卢大学化学工程系)

AI总结 提出PL-KKT-hPINN框架,通过分段线性投影严格强制非线性等式约束,在CSTR案例中保持预测精度的同时大幅降低约束违反,并在小样本下提升鲁棒性。

详情
AI中文摘要

尽管物理信息神经网络(PINN)在过程建模中显示出强大潜力,但物理方程仅在训练期间作为软约束强制执行,因此无法保证推理时的约束满足。我们提出一个称为分段线性Karush--Kuhn--Tucker硬约束PINN(PL-KKT-hPINN)的框架,通过分段线性投影严格强制非线性等式约束。这扩展了KKT-hPINN框架,后者通过Karush--Kuhn--Tucker(KKT)条件精确强制线性等式,该条件与将神经网络输出正交投影到约束可行域相关。该方法在连续搅拌釜反应器(CSTR)案例研究中进行了单输入和双输入情况的演示。结果表明,PL-KKT-hPINN保持了与标准神经网络相当的预测精度,同时实现了显著更低的约束违反。此外,所提出的模型在低数据情况下显示出改进的鲁棒性,在有限的训练样本量下,其RMSE低于无约束神经网络。这些结果表明,PL-KKT-hPINN为非线性化学工程系统的代理建模提供了一种计算高效且物理一致的框架。

英文摘要

While physics-informed neural networks (PINNs) have shown strong potential for process modeling, physical equations are only enforced as soft constraints during training, and thus, they do not guarantee constraint satisfaction at inference. We propose a framework, called piecewise-linear Karush--Kuhn--Tucker hard-constrained PINNs (PL-KKT-hPINNs), that strictly enforces nonlinear equality constraints through piecewise-linear projection. This extends the KKT-hPINN framework, which exactly enforces linear equalities through the Karush--Kuhn--Tucker (KKT) conditions associated with orthogonally projecting neural network outputs onto the constraint feasible region. The method is demonstrated on a continuous stirred-tank reactor (CSTR) case study for both one and two inputs. Results show that PL-KKT-hPINN preserves predictive accuracy comparable to that of a standard neural network while achieving substantially lower constraint violations. In addition, the proposed model shows improved robustness in low-data regimes, yielding lower RMSE than the unconstrained neural network for limited training sample sizes. These results demonstrate that PL-KKT-hPINN provides a computationally efficient and physically consistent framework for surrogate modeling of nonlinear chemical engineering systems.
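
For the linear case that this abstract builds on, the orthogonal projection has a closed form. The derivation below is a standard result written in notation assumed here ($\hat{y}$ for the raw network output, $A y = b$ for the linear equality constraints, with $A$ of full row rank); it is not copied from the paper, whose constraints may also involve the network inputs.

```latex
\begin{aligned}
&\min_{y}\ \tfrac{1}{2}\lVert y - \hat{y}\rVert_2^2 \quad \text{s.t.}\quad A y = b, \\
&\mathcal{L}(y,\lambda) = \tfrac{1}{2}\lVert y - \hat{y}\rVert_2^2 + \lambda^\top (A y - b), \\
&\text{KKT:}\quad y - \hat{y} + A^\top \lambda = 0, \qquad A y = b, \\
&\lambda^\star = (A A^\top)^{-1}(A\hat{y} - b), \qquad
y^\star = \hat{y} - A^\top (A A^\top)^{-1}(A\hat{y} - b).
\end{aligned}
```

Substituting back gives $A y^\star = A\hat{y} - (A\hat{y} - b) = b$, so the projected output satisfies the linear constraints exactly at inference. How the piecewise-linear variant partitions and linearizes a nonlinear constraint is specific to the paper and is not reproduced here.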

2606.10677 2026-06-10 cs.AI cs.CL 新提交

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Infini Memory:用于长期LLM智能体记忆的可维护主题文档

Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

发表机构 * Infinigence AI(无问芯穹) Tsinghua University(清华大学) Shanghai Jiaotong University(上海交通大学)

AI总结 提出Infini Memory架构,将智能体记忆组织为主题文档,通过缓冲合并和迭代检索实现可维护的长期记忆,在MemoryAgentBench上达到64.7%的总体得分。

详情
AI中文摘要

长期LLM智能体需要持久记忆,以跟踪变化的事实并在会话间提供相关证据。现有的记忆系统通常将观察存储为孤立的记录、摘要或索引片段,这使得证据聚合、事实修正和记忆维护变得困难。我们提出Infini Memory,一种可维护的基于文本的持久记忆架构,将智能体记忆视为主题结构化文档。每个主题文档作为一个语义单元,用于收集相关证据、保留元数据并随时间修正事实。新观察首先被暂存在缓冲区中,然后定期合并为连贯的文本上下文。在推理时,一种智能体检索过程允许LLM通过迭代工具调用读取记忆,而不是单次检索步骤。在MemoryAgentBench上,Infini Memory取得了64.7%的总体得分。消融实验表明,主题结构化维护和迭代证据检查改善了长期记忆使用的互补方面。

英文摘要

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves an overall score of 64.7%. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.
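
The stage-then-consolidate flow described in this abstract can be sketched with a small buffer that is flushed into per-topic documents once it fills up. Everything named here (`TopicMemory`, `buffer_size`, the caller-supplied topic key, plain appending as the merge step) is a hypothetical simplification; the abstract does not say how topics are assigned or how text is merged.

```python
from collections import defaultdict


class TopicMemory:
    def __init__(self, buffer_size=3):
        self.buffer_size = buffer_size
        self.buffer = []                    # staged (topic, text) observations
        self.documents = defaultdict(list)  # topic -> consolidated entries

    def observe(self, topic, text):
        """Stage a new observation; consolidate once the buffer is full."""
        self.buffer.append((topic, text))
        if len(self.buffer) >= self.buffer_size:
            self.consolidate()

    def consolidate(self):
        """Move staged observations into their topic documents (here: simple append)."""
        for topic, text in self.buffer:
            self.documents[topic].append(text)
        self.buffer.clear()

    def read(self, topic):
        """Return the current text of one topic document (empty if unknown)."""
        return "\n".join(self.documents.get(topic, []))
```

An agentic retrieval loop, as the abstract describes it, would call something like `read` several times with different topics rather than issuing a single lookup.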

2606.10675 2026-06-10 cs.CL eess.AS 新提交

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

基于自监督表示和学习动态规划的多语言词级强制对齐

Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet

发表机构 * Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology(以色列理工学院电气与计算机工程学院)

AI总结 提出一种结合自监督表示和学习动态规划的多语言词级强制对齐方法,通过融合MMS和UnSupSeg特征并学习词边界概率,在多个语言上超越现有方法。

Comments Interspeech 2026

详情
AI中文摘要

我们提出了一种准确的多语言词级强制对齐方法,包括一个对齐编码器和一个学习对齐解码器。编码器整合两种表示:一种来自大规模多语言语音(MMS)模型,另一种来自自监督音素边界检测器(UnSupSeg)。它学习融合这些表示,并在长时间上下文中估计词边界概率。对齐解码器是一种学习动态规划,它将编码器输出与基于MMS和UnSupSeg表示的段特征相结合,以推断最终词边界。在TIMIT和Buckeye上迭代训练后,所提方法在两个数据集上均优于Montreal Forced Aligner(MFA)和基于MMS的对齐方法。在未见语言(荷兰语、德语和希伯来语)上,所提模型的性能始终优于或与现有对齐方法相当,表明其有潜力在不进行进一步训练的情况下扩展到MMS支持的1100多种语言。

英文摘要

We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic program that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.
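
To make the decoding step concrete, the sketch below shows a plain (not learned) dynamic program that places word boundaries given per-frame boundary scores and a known word count. It is a stand-in for illustration: the minimum word length constraint, the additive scoring, and the function name are assumptions made here, and the paper's decoder additionally learns its scoring from segmental features.

```python
import math


def best_boundaries(scores, n_words, min_len=1):
    """Pick n_words - 1 boundary frames maximizing total score,
    with every word at least `min_len` frames long."""
    T = len(scores)
    k = n_words - 1
    if k == 0:
        return []
    NEG = -math.inf
    # best[j][t]: best total score with boundary j placed at frame t
    best = [[NEG] * T for _ in range(k)]
    back = [[-1] * T for _ in range(k)]
    for t in range(min_len, T):
        best[0][t] = scores[t]
    for j in range(1, k):
        for t in range(T):
            for p in range(0, t - min_len + 1):  # previous boundary position
                if best[j - 1][p] > NEG and best[j - 1][p] + scores[t] > best[j][t]:
                    best[j][t] = best[j - 1][p] + scores[t]
                    back[j][t] = p
    # The last word must also span at least min_len frames.
    ends = [t for t in range(T - min_len + 1) if best[k - 1][t] > NEG]
    if not ends:
        raise ValueError("no segmentation satisfies the length constraint")
    t = max(ends, key=lambda e: best[k - 1][e])
    path = []
    for j in range(k - 1, -1, -1):  # follow back-pointers to recover boundaries
        path.append(t)
        t = back[j][t]
    return path[::-1]
```

In this toy form the scores could be log-probabilities from a boundary classifier; the length constraint is what prevents the decoder from placing two boundaries on adjacent high-scoring frames.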

2606.10671 2026-06-10 cs.CV 新提交

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

FadeMem: 面向自回归视频扩散的距离感知记忆巩固

Yu Lu, Junjie Yang, Piotr Koniusz, YuXin Song, Yi Yang

发表机构 * Zhejiang University(浙江大学) University of New South Wales (UNSW)(新南威尔士大学) Data61/CSIRO Baidu Inc(百度公司)

AI总结 提出FadeMem,一种距离感知的KV记忆巩固机制,在固定缓存预算下将历史KV块组织成时间层次,通过幂律分配实现近密远疏的记忆,提升长视频生成中的主体一致性和时间连贯性。

Comments 11 pages, 4 figures

详情
AI中文摘要

自回归视频生成器通过生成连续的时间片段来合成长时间视频,但其历史KV缓存随视频长度增长。现有的有界缓存方法通过局部窗口、汇合令牌或压缩记忆状态来降低这一成本,但它们通常为历史的不同部分分配固定角色。我们提出FadeMem,一种距离感知的KV记忆巩固机制,在固定缓存预算下将历史KV块组织成时间层次。这一设计受频率依赖的时间衰减启发:细节快速去相关,而粗略场景结构和身份在更长时间内保持有用。在生成过程中,新历史作为细粒度条目插入,而较旧的相邻条目在幂律时间分配调度下逐步合并,从而在单个缓存中产生近密远疏的记忆。无需架构更改,FadeMem保留近期上下文以处理短期动态,并保留紧凑的长程锚点以保持身份和场景连贯性。实验表明,与现有有界缓存策略相比,主体一致性、背景稳定性和时间连贯性均得到提升。

英文摘要

Autoregressive video generators synthesize long videos by generating successive temporal segments, but their historical KV cache grows with video length. Existing bounded-cache methods reduce this cost with local windows, sink tokens, or compressed memory states, yet they usually assign fixed roles to different parts of the history. We propose FadeMem, a distance-aware KV memory consolidation mechanism that organizes historical KV blocks into a temporal hierarchy under a fixed cache budget. This design is motivated by frequency-dependent temporal decay: fine details decorrelate quickly, while coarse scene structure and identity remain useful over longer horizons. During generation, new history is inserted as fine-grained entries, while older adjacent entries are progressively merged under a power-law temporal allocation schedule, yielding a dense-near, sparse-far memory within one cache. Without architectural changes, FadeMem preserves recent context for short-term dynamics and compact long-range anchors for identity and scene coherence. Experiments show improved subject consistency, background stability, and temporal coherence over existing bounded-cache strategies.
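
A dense-near, sparse-far layout under a fixed budget can be illustrated by cutting the history into segments whose length grows with distance from the newest block. The sketch below is one possible reading of a power-law allocation; the function name, the exponent `alpha`, and the rounding rule are assumptions made here, and the paper's actual schedule and merge operator are not given in the abstract.

```python
def power_law_segments(n_blocks, budget, alpha=2.0):
    """Split n_blocks history blocks (index 0 = oldest) into at most `budget`
    contiguous segments whose length grows with distance from the newest block."""
    if n_blocks <= budget:
        return [(i, i + 1) for i in range(n_blocks)]
    # Distances from the present at which segments end, spaced by a power law.
    cuts = sorted({round(n_blocks * (i / budget) ** alpha) for i in range(budget + 1)})
    # Convert distances into chronological [start, end) index ranges.
    segments = [(n_blocks - hi, n_blocks - lo) for lo, hi in zip(cuts, cuts[1:])]
    return segments[::-1]
```

For example, 16 blocks with a budget of 4 and `alpha=2.0` yield segments of length 7, 5, 3, and 1 from oldest to newest; each segment would then be merged into a single cache entry, for instance by averaging its key and value blocks.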

2606.10669 2026-06-10 cs.LG cs.AI cs.CR 新提交

In Defense of Information Leakage in Concept-based Models

为基于概念模型中的信息泄露辩护

Mateo Espinosa Zarlenga


AI总结 本文重新审视基于概念模型中的信息泄露问题,提出良性泄露概念,通过优化训练目标,在概念不完整时利用泄露提升准确性和可干预性。

Comments Accepted as a position paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于概念的模型(CMs)是深度神经网络,其预测基于与人类可理解概念(如“圆形”、“条纹”等)对齐的表示。已有研究表明,这些模型会学习到泄露概念无关信息的表示。传统观点认为,这种泄露是不可取的,应予以消除,因为它会导致模型不可解释。在本文中,我们认为这种关于CMs中泄露的传统观点不仅是不恰当的(因为泄露如何使模型更不可解释的证据往往不明确),而且在常见的现实约束下必然导致不实用的CMs。具体来说,我们认为在概念不完整是常态的现实环境中,为了构建准确且可干预的CMs,某种程度的泄露往往是必要的。为此,我们提出存在所谓的良性泄露,并表明通过重新优化典型的CM训练目标,CMs可以鼓励并利用这种形式的泄露,而不会牺牲准确性或可干预性。

英文摘要

Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concepts (e.g., "round", "stripes", etc.), have been shown to learn representations that leak concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated as it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, as the evidence of how leakage makes a model less interpretable is often inconclusive, but also bound to lead to impractical CMs under common real-world constraints. Specifically, we argue that in real-world settings where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs. To this end, we propose that there is such a thing as benign leakage and show that, by optimizing a reframing of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.

2606.10666 2026-06-10 cs.CV cs.DB 新提交

Analyzing Training-Free Corruption Detection for Object Detection Datasets

分析目标检测数据集的无训练腐败检测

Christian Sieberichs, Simon Geerkens, Thomas Waschulzik, Viswanathan Ramesh, Alexander Braun

发表机构 * University of Applied Sciences Düsseldorf(杜塞尔多夫应用科学大学) Siemens Mobility GmbH(西门子交通有限公司) Goethe University Frankfurt(法兰克福大学)

AI总结 本文研究无训练特征空间方法在目标检测数据集中检测标注错误的应用,发现该方法能可靠暴露语义错误,但位置错误难以检测。

Comments Accepted at DataCV Workshop, Conference on Computer Vision and Pattern Recognition (CVPR) 2026

详情
AI中文摘要

注释错误在计算机视觉数据集中普遍存在,并且会显著降低在其上训练的系统的性能,特别是在目标检测等复杂任务中。存在多种识别注释错误的方法,包括无训练的特征空间方法,这些方法提供了一种快速且可解释的方式来分析注释。然而,对于包含语义和空间信息的目标检测注释,其行为在很大程度上仍未探索。在这项工作中,我们分析了基于特征空间的方法在检测目标检测数据集中的注释错误时的适用性。通过调整现有的特征空间方法,我们表明此类方法可靠地暴露语义错误,而位置错误仍然难以检测。我们使用VOC2012和KITTI,在多个预训练嵌入模型、合成噪声类型(对称、非对称和位置)以及真实世界注释错误上评估了这种行为。所有代码和真实世界腐败数据均可在以下仓库公开获取:https://github.com/ChristianSieberichs/BoundingBox_corruption_detection

英文摘要

Annotation errors are widespread in computer vision datasets and can significantly degrade the performance of systems trained on them, particularly in complex tasks such as object detection. Several approaches exist to identify annotation errors, including training-free feature-space methods which provide a fast and interpretable way to analyze annotations. However, the behavior on object detection annotations, which include semantic and spatial information, remains largely unexplored. In this work we analyze the applicability of feature-space-based approaches for detecting annotation errors in object detection datasets. By adapting an existing feature-space method, we show that such approaches reliably expose semantic mislabels, while positional errors remain difficult to detect. We evaluate this behavior across multiple pretrained embedding models, synthetic noise types (symmetric, asymmetric, and positional), and real-world annotation errors using VOC2012 and KITTI. All code and real-world corruptions are publicly available at the following repository: https://github.com/ChristianSieberichs/BoundingBox_corruption_detection
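
As an illustration of what a training-free feature-space check can look like, the sketch below flags samples whose label disagrees with the majority label of their nearest neighbours in embedding space. This k-nearest-neighbour rule is a generic stand-in chosen here; the abstract does not name the method the authors adapt, and the function name and default `k` are hypothetical.

```python
import math
from collections import Counter


def suspicious_labels(features, labels, k=3):
    """Return indices whose label differs from the majority label of their
    k nearest neighbours (Euclidean distance in feature space)."""
    flagged = []
    for i, f in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(f, features[j]))[:k]
        majority, _ = Counter(labels[j] for j in neighbours).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged
```

A check like this looks only at class labels in feature space, which fits the abstract's observation that semantic mislabels are easier to expose than shifted or resized boxes.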

2606.10657 2026-06-10 cs.CL 新提交

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

我们是在评估知识还是措辞?使用ParaEval减轻MCQA敏感性

João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski, Patrick Gallinari, Loic Barrault

发表机构 * FAIR at Meta(Meta FAIR) Sorbonne Université, CNRS, ISIR, F-75005 Paris, France(索邦大学,法国国家科学研究中心,智能系统与机器人研究所,法国巴黎) Criteo AI Lab, Paris, France(Criteo AI实验室,法国巴黎)

AI总结 针对多选题基准测试对答案措辞敏感的问题,提出ParaEval框架,通过对每个选项使用多种释义并选择最有利的评分,将虚假性能差距从2分以上降至1分以下,从而评估模型真实能力。

详情
AI中文摘要

多选题(MCQA)基准测试是评估预训练大语言模型的标准方法,但其依赖于对数似然评分使得结果不可靠。具体而言,标准评分对答案的确切措辞(表面形式)高度敏感,将模型对特定短语的熟悉程度与其实际能力混为一谈。我们使用一个受控测试床(1B-8B模型,基于相同知识训练)证明了这一缺陷。尽管拥有相同的知识,标准指标错误地报告了超过2分的性能差距。为了解决这个问题,我们提出了ParaEval,一个评估框架,它对每个答案选项使用多个释义来查询模型。通过根据每个模型最有利的措辞进行评分,ParaEval成功地将虚假性能差距降低到1分以下。我们确认这些评估伪影以及ParaEval的改进在前沿的70B和120B开源模型中仍然存在。最终,ParaEval提供了一种稳健且高效的方式来评估真正的底层能力,而不是表面形式的熟悉度。

英文摘要

Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.
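
The scoring rule described in this abstract, taking each option's most favorable phrasing, can be sketched in a few lines. The input format (a mapping from option to a list of log-likelihoods, one per paraphrase) and the function name are assumptions made here; how paraphrases are generated and how likelihoods are normalized is not covered.

```python
def paraeval_predict(option_scores):
    """Pick the option whose best-scoring paraphrase has the highest log-likelihood.

    `option_scores` maps each answer option to a list of log-likelihoods,
    one per paraphrase of that option.
    """
    best = {option: max(scores) for option, scores in option_scores.items()}
    return max(best, key=best.get)
```

With scores `{"A": [-5.0, -1.2], "B": [-2.0, -2.5]}`, scoring only the first phrasing of each option would favor B, while the best-paraphrase rule selects A, which is the kind of surface-form effect the abstract describes.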

2606.10656 2026-06-10 cs.CV 新提交

Envision4D: Envisioning Visual Futures via Feed-forward 4D Gaussian Splatting for Autonomous Driving

Envision4D: 通过前馈4D高斯泼溅展望自动驾驶的视觉未来

Qi Song, Yifei He, Chi Zhang, Zheng Fu, Xuhe Zhao, Mengmeng Yang, Kun Jiang, Rui Huang, Diange Yang

发表机构 * Tsinghua University(清华大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出Envision4D,一种全自监督前馈框架,通过未来姿态预测、层内时间注意力和条件运动提升,实现无位姿的未来外推,在自动驾驶动态场景预测中达到最先进性能。

Comments Project Page: https://maggiesong7.github.io/research/Envision4D/

详情
AI中文摘要

预测动态场景的未来演变在自动驾驶中至关重要。然而,现有的前馈范式主要设计用于插值。当扩展到未来外推时,它们在大位移下会出现重影伪影,并受限于简化的运动假设或严格的未来先验。为了克服这些挑战,我们提出了Envision4D,一种完全自监督的前馈框架,用于无位姿的未来外推。具体来说,我们引入了一个未来姿态预测模块,通过迭代去噪过程推断未来相机参数。此外,为了捕捉非线性动态,我们提出了层内时间注意力,并采用条件运动提升,将高度不确定的外推过程转化为稳健的关系映射。最后,利用渐进式训练策略来稳定无监督运动学习,防止误差累积。大量实验表明,Envision4D实现了最先进的性能,在未来的视图合成中显著优于现有方法。

英文摘要

Forecasting the future evolution of dynamic scenes is crucial in autonomous driving. However, existing feed-forward paradigms are primarily designed for interpolation. When extended to future extrapolation, they suffer from ghosting artifacts under large displacements and are constrained by simplified motion assumptions or strict future priors. To overcome these challenges, we propose Envision4D, a fully self-supervised feed-forward framework for pose-free future extrapolation. Specifically, we introduce a Future Pose Prediction module that infers future camera parameters via an iterative denoising process. Furthermore, to capture non-linear dynamics, we propose In-layer Temporal Attention and employ Conditioned Motion Lifting, which transforms the highly uncertain extrapolation process into robust relational mappings. Finally, a Progressive Training Strategy is utilized to stabilize unsupervised motion learning against error accumulation. Extensive experiments demonstrate that Envision4D achieves state-of-the-art performance, significantly outperforming existing methods in future view synthesis.

2606.10654 2026-06-10 cs.CL 新提交

Speaker Group Encoding in Self-supervised Speech Recognition Models

自监督语音识别模型中的说话人群体编码

Felix Herron, Solange Rossato, Alexandre Allauzen, Benoit Favre, François Portet

发表机构 * MILES Team, LAMSADE, Université Paris Dauphine-PSL, France(法国巴黎多芬纳-PSL大学LAMSADE实验室MILES团队) GETALP Team, LIG, Université Grenoble Alpes, France(法国格勒诺布尔阿尔卑斯大学LIG实验室GETALP团队) NLP team, LIS, Aix-Marseille University, France(法国艾克斯-马赛大学LIS实验室NLP团队)

AI总结 研究自监督语音识别模型如何编码说话人群体信息,发现微调任务和公平性算法对不同类型群体信息的影响不同。

详情
Journal ref
Text, Speech, and Dialogue. TSD 2025. Lecture Notes in Computer Science, vol 16029
AI中文摘要

我们研究了自监督语音识别模型(S3Ms)学习了关于说话人群体(SGs)的哪些信息。我们检查了S3Ms的几种状态:预训练、在说话人识别(SID)上微调、在自动语音识别(ASR)上微调,以及使用公平性增强算法进行ASR微调。我们发现S3Ms编码了关于几个说话人群体类别(SGCs)的信息,包括他们的性别、年龄、方言、种族以及是否为母语者。我们发现,针对SID的微调放大了某些SGCs,即那些方差更偏向语音性质的SGCs,尽管它没有放大其他SGCs,即那些方差更偏向语义性质的SGCs。另一方面,针对ASR的微调丢弃了语音变异的说话人群体信息(SGI),但保留了语义变异的SGI。我们发现,为改善公平性而设计的ASR算法改变了S3Ms中编码SGI的程度;然而,这主要适用于语音变异的SGCs,而对于语义变异的SGCs则不太适用。我们讨论了SGI如何被每一层编码,并识别了负责编码不同SGCs的嵌入子维度。最后,我们讨论了我们的发现如何有助于设计更公平的ASR算法。

英文摘要

We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness-enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including their gender, age, dialect, ethnicity, and whether they are a native speaker. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change the extent to which SGI is encoded in S3Ms; however, this is primarily true for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.

2606.10653 2026-06-10 cs.CV 新提交

STEDiff: Strengthening Text Embedding for Text-to-Image Alignment in Diffusion Model

STEDiff: 增强文本嵌入以提升扩散模型中文本到图像的对齐

Hailan Zhang, Haipeng Liu, Bo Fu, Yang Wang

发表机构 * Hailan Zhang, Haipeng Liu, Yang Wang(未明确机构) Bo Fu(未明确机构)

AI总结 提出训练免费的STEDiff方法,通过利用[EOT]令牌增强子句语义并引入语义增强损失,在文本嵌入空间中改善扩散模型对复杂提示的语义对齐,无需微调或布局先验。

Comments 8 pages, 8 figures, to appear at IJCNN 2026

详情
AI中文摘要

尽管预训练的文本到图像(T2I)生成模型可以产生高质量图像,但由于随机噪声和固有的模型限制,它们常常无法忠实地反映复杂提示的语义意图。这个问题经常表现为模型忽略特定对象或无法正确地将属性绑定到其对应的实体上,这一挑战被称为语义对齐。与依赖计算昂贵的微调或劳动密集的布局先验的现有方法不同,我们提出了STEDiff,一种无需训练的方法,旨在直接在文本嵌入空间中增强语义表示。具体来说,我们引入了一种方法,主要利用[EOT]令牌来增强子句的相关语义,然后替换原始提示中的相应令牌。此外,还引入了一种新颖的语义增强损失来强制执行空间约束,确保每个实体的语义精确映射到其各自的图像区域。在T2I-CompBench上的大量定量和定性评估表明,我们的方法在复杂场景中显著提高了语义一致性和生成完整性。

英文摘要

Although pretrained text-to-image (T2I) generation models can produce high-quality images, they often fail to faithfully reflect the semantic intent of complex prompts due to stochastic noise and inherent model limitations. This issue frequently manifests as the model overlooking specific objects or failing to correctly bind attributes to their corresponding entities, a challenge referred to as semantic alignment. Unlike existing approaches that rely on computationally expensive fine-tuning or labor-intensive layout priors, we propose STEDiff, a training-free method designed to enhance semantic representations directly within the text-embedding space. Specifically, we introduce a method that primarily leverages the [EOT] token to strengthen the relevant semantics of sub-sentences and then replaces the corresponding tokens in the original prompt. Furthermore, a novel semantic enhancement loss is incorporated to enforce spatial constraints, ensuring that the semantics of each entity are precisely mapped to their respective image regions. Extensive quantitative and qualitative evaluations on the T2I-CompBench demonstrate that our method notably improves semantic consistency and generation integrity in complex scenarios.

2606.10651 2026-06-10 cs.CV 新提交

Kwai Keye-VL-2.0 Technical Report

Kwai Keye-VL-2.0 技术报告

Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang

发表机构 * Kuaishou Group(快手集团)

AI总结 提出开源MoE多模态基础模型Keye-VL-2.0,首次将DeepSeek稀疏注意力适配到GQA架构,支持无损256K上下文处理,并通过跨模态多教师策略蒸馏和上下文/视频强化学习解决多任务对齐中的灾难性遗忘,在长视频理解和智能体任务上达到同类最优。

Comments 31 pages, 11 figures

详情
Abstract

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.

2606.10650 2026-06-10 cs.CL cs.AI New submission

Dynamic Linear Attention

Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

Affiliations * The Ohio State University, University of Michigan, ByteDance Seed

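AI summary Proposes DLA, a framework that uses information-aware dynamic state merging and capacity-bounded memory modeling to address the error accumulation caused by fixed merging policies in multi-state linear attention. It outperforms existing methods on 16 datasets.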

Comments Accepted by ICML 2026

Details
Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.
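
Editor's illustration (not from the paper): the abstract does not say how DLA scores information or how it merges states, so the following is only a minimal sketch of the general idea behind a capacity-bounded, chronologically ordered state cache. The `State` class, the `info` score, the additive `merge` rule, and the "merge the adjacent pair with the lowest combined score" policy are all assumptions made for illustration. The additive merge assumes plain, un-gated linear attention, where a state is a sum of per-token terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """One memory state (hypothetical layout, not the paper's)."""
    summary: List[float]  # stand-in for an additive linear-attention state
    info: float           # hypothetical information score for this state
    n_tokens: int         # number of tokens this state summarizes


def merge(a: State, b: State) -> State:
    """Merge two adjacent states by adding their summaries (assumed rule)."""
    return State(
        summary=[x + y for x, y in zip(a.summary, b.summary)],
        info=a.info + b.info,
        n_tokens=a.n_tokens + b.n_tokens,
    )


def append_bounded(cache: List[State], new: State, capacity: int) -> List[State]:
    """Append `new`, then merge low-information neighbors until the cache fits."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = cache + [new]  # copy, so the caller's list is not mutated
    while len(cache) > capacity:
        # Pick the adjacent pair whose combined information score is smallest.
        i = min(range(len(cache) - 1),
                key=lambda j: cache[j].info + cache[j + 1].info)
        # Replace the pair with its merge; chronological order is preserved.
        cache = cache[:i] + [merge(cache[i], cache[i + 1])] + cache[i + 2:]
    return cache


if __name__ == "__main__":
    cache: List[State] = []
    for score in [0.9, 0.1, 0.2, 0.8]:
        cache = append_bounded(cache, State([1.0], score, 1), capacity=3)
    print([s.n_tokens for s in cache])
```

With a capacity of 3, the fourth token forces one merge. The two middle states have the lowest combined score, so they are folded together and the cache ends up summarizing 1, 2, and 1 tokens, while the high-score states at both ends stay at single-token resolution.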