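arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.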

2605.25943 2026-05-26 cs.LG

STaT: Resolving Shape Distortion in Non-Stationary Time Series via Tri-Modal Synergy

Hui Cheng, Jinsheng Guo, Zhenhao Weng, Yan Qiao, Meng Li

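AI summary: Proposes STaT, a multimodal architecture that aligns symbolic, temporal, and textual modalities to reduce shape distortion while lowering average error; across 8 benchmarks it improves magnitude metrics by up to 8.9% and cuts shape distortion by up to 8.5%.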

English abstract

Recent research in time series forecasting frequently investigates the integration of textual and visual modalities with numerical models to better navigate non-stationary environments. Despite delivering solid numerical results, existing multi-modal approaches usually encounter a dilemma: prioritizing the minimization of average errors can result in excessively smooth forecasts that overlook essential fluctuations. To resolve this limitation, we introduce STaT, an innovative multimodal architecture for Symbolic-Temporal-Textual Alignment, which seamlessly unites three synergistic modalities. Specifically, the symbolic modality converts continuous time series into discrete tokens, facilitating the accurate identification of structural patterns and turning points; the temporal modality extracts inherent sequential dependencies; and the textual modality leverages domain semantics to steer the macroscopic forecasting trends. Comprehensive evaluations on eight real-world benchmarks indicate that STaT delivers exceptional performance, enhancing conventional magnitude indicators by up to 8.9% while simultaneously decreasing shape distortion by up to 8.5%.
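
The symbolic modality above maps a continuous series to discrete tokens. The abstract does not say which discretization STaT uses, so the following is only a minimal sketch assuming a SAX-style scheme (z-normalization, piecewise averaging, fixed breakpoints). The name `symbolize`, the four-letter alphabet, and the breakpoints are illustrative assumptions, not the authors' method.

```python
import bisect
import statistics

# Assumed breakpoints: quartiles of a standard normal, giving 4 symbols.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"


def symbolize(series, segment_len):
    """Turn a numeric series into a string of discrete tokens (SAX-style sketch)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series) or 1.0  # avoid dividing by zero on flat input
    z = [(x - mean) / std for x in series]
    tokens = []
    for start in range(0, len(z), segment_len):
        segment = z[start:start + segment_len]
        level = statistics.fmean(segment)  # piecewise aggregate value
        tokens.append(ALPHABET[bisect.bisect_right(BREAKPOINTS, level)])
    return "".join(tokens)
```

A change of token between neighbouring segments is one simple way such a representation could expose a turning point that an averaged numeric forecast would smooth over.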

2605.25942 2026-05-26 cs.CV cs.RO

LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data

Knut Peterson, Zaid Mayers, Azmain Yousuf, Priontu Chowdhury, Asher Zaczepinski, Solmaz Arezoomandan, Reihaneh Maarefdoust, David Han

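AI summary: Introduces the LRDDv3 dataset, comprising 102,532 high-resolution long-range RGB images and 29,630 paired IR images with range information, to support long-range drone detection.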

Comments 8 pages, 5 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)

English abstract

Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset includes comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3/

2605.25941 2026-05-26 cs.CV

Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models

Yiwei Xie, Ping Liu, Zheng Zhang

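AI summary: Identifies a concept-layer topological alignment bottleneck and proposes CLEAR, a separability-driven optimization framework that achieves precise concept erasure in text-to-video diffusion models while preserving generation quality.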

Comments Accepted by ICML 2026

English abstract

Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept-layer alignment leads to more precise concept suppression while preserving overall generative quality.
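
The idea of choosing an erasure layer by concept versus non-target separability can be illustrated with a toy score. The abstract does not give CLEAR's objective, so this sketch assumes a Fisher-style ratio over scalar features per layer; `separability`, `select_layer`, and the layer names are hypothetical.

```python
import statistics


def separability(concept_acts, other_acts):
    """Fisher-style ratio: squared mean gap over summed variances (assumed score)."""
    gap = statistics.fmean(concept_acts) - statistics.fmean(other_acts)
    spread = statistics.pvariance(concept_acts) + statistics.pvariance(other_acts)
    return gap * gap / (spread + 1e-8)


def select_layer(per_layer_acts):
    """per_layer_acts: {layer_name: (concept_acts, other_acts)} of scalar features.

    Returns the layer with the highest separability and all per-layer scores.
    """
    scores = {name: separability(c, o) for name, (c, o) in per_layer_acts.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

In this toy setting a layer where concept and non-target features overlap scores near zero, while a layer where they sit far apart scores high and would be picked for erasure.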

2605.25939 2026-05-26 cs.LG cs.AI

From Latent Space to Training Data: Explainable Specialization in Minimal MLPs

Enrique Alba, Ezequiel Lopez-Rubio

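AI summary: Studies whether training biases make hidden neurons specialize in minimal one-hidden-layer MLPs and whether that specialization improves prototype-based reconstruction of the training data; coverage regularization raises the specialization ratio and lowers reconstruction error, whereas overlap penalties push prototype centers outside the convex hull.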

English abstract

We study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization improves prototype-based reconstruction of the training dataset from the learned weights. We consider Gaussian-activation MLPs of width equal to dataset size and compare three structural losses that respectively encourage coverage of the training samples, separation between neuron-induced prototypes, and low overlap of hidden responses, against the standard fitting baseline. Experiments on uniformly sampled one-dimensional datasets show a stable pattern from N = 3 to N = 100 across 480 controlled runs. Coverage regularization gives the lowest mean reconstruction error at every tested size and raises the prototype-usage specialization ratio relative to the standard baseline, while separation has mixed effects and overlap penalties are systematically harmful. We show that the harm is not an optimization failure: overlap-active approaches fit the data as well as overlap-free ones but route the optimizer to a degenerate equilibrium in which prototype centers are pushed outside the convex hull of the training inputs. Coverage cannot reward this expulsion and acts as an attractor: separation admits it only at large temperature and overlap admits it at the nominal hyperparameter choice. A direct τ-sweep on the separation-only mask and a prototype-position visualization at N = 100 confirm the mechanism. The findings yield a simple design principle for prototype-recoverability-aware training: every repulsive structural loss must be compensated by a compatible attractor, or it will collapse the latent geometry it was meant to refine.
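
The claim that coverage "cannot reward this expulsion" can be made concrete in one dimension. The abstract does not state the loss formulas, so the sketch below assumes a coverage term equal to the mean squared distance from each training sample to its nearest prototype center; both function names are hypothetical.

```python
def coverage_loss(samples, centers):
    """Mean squared distance from each training sample to its nearest prototype center.

    Assumed form of a coverage term; the paper's exact loss is not given in the abstract.
    """
    return sum(min((x - c) ** 2 for c in centers) for x in samples) / len(samples)


def expelled_centers(samples, centers):
    """In 1D the convex hull is the interval [min, max]; return centers outside it."""
    lo, hi = min(samples), max(samples)
    return [c for c in centers if c < lo or c > hi]
```

Under this assumed form, a center that drifts outside the hull leaves some sample without a nearby prototype, so the coverage term rises, which is consistent with coverage acting as an attractor.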

2605.25933 2026-05-26 cs.LG cs.AI

Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data

Nicolas Ricka, Gauthier Pellegrin, Denis A. Fompeyrine, Thomas Rohaly, Leah Enders, Heather Roy

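AI summary: Proposes a machine learning method based on multivariate kernel density estimation that uses heart rate and skin conductance signals, with transfer learning from specific phobia data, to objectively assess PTSD severity; it reaches 86% classification accuracy and a mean absolute error of 5.6.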

Comments Submitted to a peer-reviewed journal, comments welcome

English abstract

Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts. Current clinical assessments of PTSD often rely on subjective evaluations, which can be time-consuming, costly, and prone to human bias. This study proposes a machine learning (ML) approach based on a multivariate kernel density estimation (MKDE) technique for the objective evaluation of PTSD severity. We collected heart rate (HR) and galvanic skin response (GSR) signals as well as PTSD Checklist - Military Version (PCL-M) labels from 21 participants during an immersive simulation. A fear-response model was trained on a public arachnophobia dataset, and predictive features of PTSD were extracted from the fear-response curves estimated on the military dataset. The model achieved an accuracy of 86% in classifying PTSD status, effectively distinguishing participants with and without PTSD (PCL-M threshold of 36). The average mean absolute error (MAE) of the models is 5.6, and it estimated a clinical PTSD severity scale with a mean absolute percentage error of 17%. Our algorithm demonstrates promising potential for enhancing estimation of PTSD severity and follow-up by offering an objective and low-effort evaluation approach using physiology. These findings suggest clinical utility in both screening and follow-up settings.
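
The three figures reported above (accuracy at a PCL-M cut-off, MAE, and mean absolute percentage error) can be computed as in the sketch below. Treating scores at or above the threshold as positive is an assumption, since the abstract only gives the value 36, and the function names are illustrative.

```python
PCL_M_THRESHOLD = 36  # cut-off stated in the abstract


def mae(true, pred):
    """Mean absolute error between true and predicted severity scores."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def mape(true, pred):
    """Mean absolute percentage error; assumes no true score is zero."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(true, pred)) / len(true)


def threshold_accuracy(true, pred, threshold=PCL_M_THRESHOLD):
    """Fraction of participants placed on the correct side of the threshold.

    Assumption: a score >= threshold counts as PTSD-positive.
    """
    hits = sum((t >= threshold) == (p >= threshold) for t, p in zip(true, pred))
    return hits / len(true)
```

Note that a prediction can have a small absolute error and still fall on the wrong side of the cut-off, which is why the classification accuracy and the MAE are reported separately.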

2605.25931 2026-05-26 cs.AI

Explore Before You Solve: The Speed-Depth Trade-off in Epistemic Agents for ARC-AGI-3

Liew Keong Han

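AI summary: A systematic analysis of all 25 public ARC-AGI-3 games finds that each is reachable through non-intelligent strategies; the paper also proposes AERA, a three-phase epistemic agent, and formalizes its performance within a speed-depth trade-off framework.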

Comments 22 pages, 3 figures. Code: https://github.com/farmountain/aera-arc3-paper (CC0)

English abstract

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics; the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed-Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

2605.25928 2026-05-26 cs.CL cs.SD eess.AS

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

Meshal Alamr, Hassan Alqaeri, Abdullah Aldahlawi

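AI summary: For the low-resource Arabic speech diacritization task, fine-tunes the CATT-Whisper multimodal model with regularization (R-Drop consistency regularization, Optuna-optimized hyperparameters, and Focal Loss) and places first in the KSAA-2026 shared task.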

Comments 4 pages, 1 figure. Published in Proceedings of OSACT7 (LREC 2026). Winning system for KSAA-2026 Task 2 on Arabic Speech Diacritization

English abstract

We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
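
The inference step above averages stochastic forward passes at the softmax probability level rather than averaging logits. The sketch below shows that averaging for one character position. How the 200 passes are split across the four checkpoints is not stated, so the even split used in the usage note is an assumption, and `noisy_checkpoint` is a toy stand-in for a real model with dropout enabled.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probabilities(forward_fns, passes_per_model):
    """Average softmax outputs over every stochastic pass of every checkpoint.

    forward_fns: callables returning logits for one position; each call is assumed
    to run with dropout active, so repeated calls give different logits.
    """
    total, count = None, 0
    for forward in forward_fns:
        for _ in range(passes_per_model):
            probs = softmax(forward())
            total = probs if total is None else [a + b for a, b in zip(total, probs)]
            count += 1
    return [v / count for v in total]


def noisy_checkpoint(base_logits, rng, scale=0.5):
    """Toy stand-in for a checkpoint with dropout enabled (hypothetical)."""
    return lambda: [v + rng.gauss(0.0, scale) for v in base_logits]
```

For example, four toy checkpoints with 50 passes each give 200 averaged passes; the predicted diacritic is then the index of the largest averaged probability.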

2605.25924 2026-05-26 cs.CL cs.LG

Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT

Duy Anh Nguyen

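AI summary: Examines how domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus affects transformer-based automated essay scoring (AES) for English proficiency tests; full-corpus DAPT gives mixed results, while targeted DAPT on CEFR-aligned subsets improves in-domain scoring more reliably.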

Comments 16 pages, 3 figures, 10 tables, including references and appendices

English abstract

Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based AES for English proficiency tests. We apply DAPT to three transformer encoders and evaluate them on FCE and IELTS in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and metrics. Further analyses suggest that these mixed effects are partly explained by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT using CEFR-aligned subsets improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1-B2 data. However, these gains do not consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment settings. However, it does not automatically improve transferability across different English proficiency test datasets.

2605.25922 2026-05-26 cs.CV

Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

Xiao Liu, Jiaxiang Liu, Boci Peng, Boren Hu, Yusong Wang, Xiwen Chen, Prayag Tiwari, Liming Zhang, Mingkun Xu

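AI summary: To address the fragility of cross-modal semantic alignment in vision language models under adversarial perturbations, proposes Closed-Loop Bidirectional Prompting, which restores cross-modal agreement through a dynamic feedback loop and uses a Semantic Anchor to constrain cyclic updates, giving instance-adaptive protection; it reaches state-of-the-art robustness and generalization on 11 datasets.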

Comments 24 pages, 8 figures

English abstract

Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

2605.25920 2026-05-26 cs.CL cs.AI

Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song

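AI summary: Proposes LegalSearch-R1, a framework that combines local statute RAG with online search and is trained with reinforcement learning on data spanning multiple amendment periods, to address temporal bias in legal LLMs and the lack of temporal constraints in search agents; it outperforms existing methods on 13 legal tasks.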

Comments Under Review

English abstract

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally-indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.
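
The temporal constraint described above, that the applicable law must match the date of the case, amounts to a lookup over time-indexed statute versions. The sketch below is not LegalSearch-R1's retrieval code; `StatuteVersion`, its fields, and `version_in_force` are hypothetical names used only to illustrate the constraint.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatuteVersion:
    article: str
    text: str
    effective_from: date
    effective_to: Optional[date]  # None means the version is still in force


def version_in_force(versions, article, case_date):
    """Return the version of `article` that applied on `case_date`, or None."""
    for v in versions:
        if v.article != article:
            continue
        # A version stops applying on the day its successor takes effect.
        ended = v.effective_to is not None and case_date >= v.effective_to
        if v.effective_from <= case_date and not ended:
            return v
    return None
```

A retriever that ignores `case_date` and always returns the latest version would apply an amendment retroactively, which is the failure mode the abstract describes.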

2605.25916 2026-05-26 cs.LG cs.DC cs.NI

Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning

Zhen Li, Jun Cai, Chao Yang, Haoran Gao

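AI summary: Proposes an online optimization framework that uses C-MOPPO, a constrained multi-objective deep reinforcement learning algorithm, to jointly manage federated training and inference on resource-constrained edge devices, maximizing inference accuracy while minimizing latency and energy consumption.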

English abstract

Federated edge learning (FEEL) has recently emerged as a promising paradigm for achieving edge intelligence (EI) by enabling collaborative model training across edge devices while protecting data privacy. In this paper, we put forth an online optimization framework that jointly manages federated training and inference on resource-constrained edge devices. We introduce a tandem-queue-inspired conversion mechanism that bridges inference requests and training data, and further incorporate both data and model freshness into the accuracy formulation to capture temporal dynamics in real-world environments. To maximize inference accuracy while minimizing latency and energy consumption, the mode selections, communication, and computation resource allocations of edge devices are jointly optimized. We formulate this optimization as a multi-objective optimization problem, which is NP-hard and further complicated by the online setting. To address these challenges, we transform the problem into a multi-objective Markov decision process (MOMDP) and develop a constrained multi-objective proximal policy optimization (C-MOPPO) algorithm. Specifically, C-MOPPO first learns a set of policies with different preferences across three objectives, then leverages constrained policy optimization to enrich the Pareto front and obtain high-quality, dense solutions. Extensive experiments demonstrate that C-MOPPO achieves well-balanced trade-offs among objectives and significantly outperforms baselines under various system configurations.
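
The Pareto front mentioned above is the set of policies that no other policy beats on all three objectives at once. The sketch below filters a list of (accuracy, latency, energy) outcomes down to that set. It illustrates the concept only and is not the C-MOPPO algorithm; the tuple layout is an assumption.

```python
def dominates(a, b):
    """a, b: (accuracy, latency, energy) tuples.

    Higher accuracy is better; lower latency and energy are better.
    a dominates b if it is no worse on every objective and better on at least one.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A denser front, which the abstract says constrained policy optimization is used to obtain, simply means more such non-dominated trade-off points to choose from.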

2605.25909 2026-05-26 cs.CV

R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction

Denis Gridusov, Maxim Popov, Sergey Kolyubin

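AI summary: Proposes the R5DGS framework, which adds semantic awareness to a 4D Gaussian representation through compact Identity Encoding and a CLIP-based object lookup table, and uses a rigid-body inference constraint that predicts dynamics only for object centroids, yielding an 11 FPS speedup while keeping trajectories plausible.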

Comments Code: https://github.com/be2rlab/r5dgs

English abstract

Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce R5DGS, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
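
The rigid-body constraint above predicts motion only for an object's centroid and moves the object's Gaussians along with it. A minimal sketch of that propagation follows, assuming for simplicity that the object's rotation is a yaw about the z axis; the function names are hypothetical and this is not the R5DGS implementation.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3D vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def propagate(gaussian_means, centroid, new_centroid, yaw):
    """Move every Gaussian with its object.

    Each Gaussian keeps its offset from the old centroid; the offset is rotated
    by the object's yaw and re-attached to the predicted new centroid.
    """
    moved = []
    for m in gaussian_means:
        offset = tuple(mi - ci for mi, ci in zip(m, centroid))
        rx, ry, rz = rotate_z(offset, yaw)
        moved.append((new_centroid[0] + rx, new_centroid[1] + ry, new_centroid[2] + rz))
    return moved
```

Because dynamics are integrated once per object rather than once per Gaussian, the per-step cost scales with the number of objects, which is one plausible source of the reported speedup.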

2605.25903 2026-05-26 cs.CL cs.LG

Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation

Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du

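AI summary: Proposes the Universal Activation Verbalizer (UAV), a framework that uses a shared decoder and lightweight adapters to turn hidden representations from heterogeneous models into natural-language explanations; it supports activation verbalization across model families and scales and is competitive with strong baselines on classification, fact retrieval, and gist summarization.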

Comments 23 pages, 11 figures, 11 tables

English abstract

Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in the decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.
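
The adapter above turns one donor activation into a few soft tokens that live in the decoder's embedding space. The abstract does not describe the adapter's architecture, so this sketch assumes the simplest case, a single linear map followed by a reshape; `make_adapter`, `to_soft_tokens`, and all dimensions are hypothetical.

```python
import random


def make_adapter(d_donor, d_decoder, num_tokens, seed=0):
    """Random linear map from one donor activation to num_tokens * d_decoder values.

    Stands in for trained adapter weights; a real adapter would be learned.
    """
    rng = random.Random(seed)
    rows = num_tokens * d_decoder
    return [[rng.gauss(0.0, d_donor ** -0.5) for _ in range(d_donor)] for _ in range(rows)]


def to_soft_tokens(weights, activation, d_decoder):
    """Project the activation, then split the result into decoder-sized soft tokens."""
    flat = [sum(w * a for w, a in zip(row, activation)) for row in weights]
    return [flat[i:i + d_decoder] for i in range(0, len(flat), d_decoder)]
```

Only the shape of the input side depends on the donor model, which is why a new donor would need a new adapter while the decoder and its LoRA stay frozen, as the adapter-only transfer setting describes.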

2605.25901 2026-05-26 cs.CV cs.RO

AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

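AI summary: Proposes AgentGrounder, a zero-shot 3D visual grounding framework with a two-stage design (offline construction of an object lookup table and an online tool-driven agent); it improves accuracy by 2.5% on ScanRefer and 6.3% on Nr3D.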

Comments Code: https://github.com/be2rlab/AgentGrounder

English abstract

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
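
Acc@0.5 counts a prediction as correct when the 3D intersection-over-union between the predicted and ground-truth boxes is at least 0.5. The sketch below computes it for axis-aligned boxes, which is an assumption made for brevity; the function names and the box layout are illustrative and not taken from the AgentGrounder code.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        inter *= overlap
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return inter / (vol(a) + vol(b) - inter)


def acc_at(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(iou_3d(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Since the boxes come from the offline Object Lookup Table, a grounding error here means the agent picked the wrong table entry rather than drawing a loose box.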

2605.25894 2026-05-26 cs.LG q-fin.ST

Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning

Manuel Noseda, Nathan Soldati, Marco Paina

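AI summary: Combines fundamental metrics, technical indicators, and news sentiment to predict stock price direction on earnings announcement days with LSTM and Transformer models; the Transformer is more sensitive in identifying volatile movements, and news sentiment improves performance.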

English abstract

Predicting stock price movements during Earnings Announcements (EAs) is a significant challenge due to market noise and high-impact price discontinuities. In this study, we evaluate whether pre-announcement news sentiment, firm fundamentals, and recent market dynamics jointly predict the directional price movement of equities on EA days. We construct a multi-modal feature space combining 15 fundamental metrics, 3 price-based technical indicators and sentiment scores derived from financial news articles processed using FinBERT. We compare a Long Short-Term Memory (LSTM) network and a Transformer-based architecture against a logistic regression baseline, and further assess all models with and without sentiment features to quantify their incremental value. Our results indicate that while the LSTM demonstrates higher precision through a conservative safe-bet strategy, the Transformer model exhibits superior sensitivity in identifying volatile movements, achieving a higher macro F1-score, with ablation experiments showing a consistent benefit from incorporating news sentiment.
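
The comparison above rests on macro F1, which averages the per-class F1 scores so that each direction counts equally regardless of how often it occurs. A minimal sketch of the metric, with illustrative labels, is:

```python
def macro_f1(true, pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(true) | set(pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A model that plays safe and mostly predicts one direction can keep precision high on that class while scoring poorly on the other, which pulls its macro F1 down; that is consistent with the contrast the abstract draws between the LSTM and the Transformer.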

2605.25893 2026-05-26 cs.AI

$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi

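AI summary: To address safety monitoring for diffusion large language models, the paper proposes $D^2$-Monitor, a bi-level dynamic monitoring framework based on hesitation-aware routing: a lightweight probe estimates hesitation in real time and triggers a higher-capacity probe, achieving the best balance of performance and efficiency on 3 datasets with at most 0.85M parameters.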

AI中文摘要

尽管扩散大语言模型（D-LLMs）作为自回归大语言模型（AR-LLMs）的替代方案已经出现，但D-LLMs的安全监控在很大程度上仍未得到探索。与AR-LLMs不同，D-LLMs通过多步去噪过程生成文本，暴露了中间隐藏表示，这些表示可能包含标准单步监控设置中无法获得的安全相关信息。受轻量级探针适用于始终在线监控的启发，我们分析了哪些轨迹级信号最能指示此类探针可能遇到困难。我们发现，信息量最大的信号是安全犹豫度：中间隐藏状态反复落在探针决策边界的小范围内。D-LLM轨迹中此类犹豫步的数量能有效预测探针失败，提供了样本难度的代理指标。基于此分析，我们提出了$D^2$-Monitor，一种针对D-LLMs的双层安全监控器。$D^2$-Monitor采用轻量级探针作为始终在线监控器，以联合估计犹豫度并执行基础分类。当犹豫度超过阈值时，激活更具表现力但计算量更大的探针。这种动态路由机制在测试时高效分配监控资源。在4个D-LLM上的3个数据集（WildguardMix、ToxicChat、OpenAI-Moderation）上评估，$D^2$-Monitor以紧凑的参数规模（≤0.85M参数）实现了最先进的性能，并且相对于8个基线方法，在有效性和效率之间取得了最佳权衡。

英文摘要

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory predicts probe failure effectively, providing a proxy for sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
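The hesitation-aware routing described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-step probe scores, the 0.5 decision boundary, the margin, the hesitation threshold, the use of the final step for the base decision, and the `heavy_probe` callable are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def count_hesitation_steps(scores: Sequence[float],
                           boundary: float = 0.5,
                           margin: float = 0.1) -> int:
    """Count denoising steps whose probe score lies within `margin` of the boundary."""
    return sum(1 for s in scores if abs(s - boundary) <= margin)


def route(scores: Sequence[float],
          heavy_probe: Callable[[Sequence[float]], bool],
          boundary: float = 0.5,
          margin: float = 0.1,
          max_hesitation: int = 2) -> Tuple[bool, str]:
    """Return (is_unsafe, which_probe_decided).

    `scores` are hypothetical lightweight-probe outputs, one per denoising step.
    The heavier probe is called only when hesitation exceeds the threshold.
    """
    if count_hesitation_steps(scores, boundary, margin) > max_hesitation:
        return heavy_probe(scores), "heavy"
    # Assumption: the base classification uses the final denoising step.
    return scores[-1] >= boundary, "light"


# Hypothetical stand-in for the more expressive probe: average over the trajectory.
def mean_probe(scores: Sequence[float]) -> bool:
    return sum(scores) / len(scores) >= 0.5


confident = [0.90, 0.95, 0.97, 0.99]        # never near the boundary
wavering = [0.45, 0.55, 0.48, 0.52, 0.70]   # four steps near the boundary

print(route(confident, mean_probe))
print(route(wavering, mean_probe))
```

The design point is that the expensive probe runs only on trajectories the cheap probe flags as difficult, so most inputs pay only the lightweight cost.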

2605.25892 2026-05-26 cs.CV

SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution

SP-MoMamba：基于超像素驱动的状态空间专家混合模型用于高效图像超分辨率

Wenbin Zou, Yawen Cui, Yi Wang, Lap-Pui Chau, Liang Chen, Jinshan Pan, Huiping Zhuang, Guanbin Li

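AI summary: Proposes SP-MoMamba, which uses superpixels to turn rigid scanning into semantic-level interaction and combines a multi-scale superpixel mixture of state space experts with a local spatial modulation expert to achieve efficient, high-fidelity image super-resolution.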

Comments 16 pages, 15 figures

AI中文摘要

状态空间模型（SSM）因其线性复杂度和长程建模能力，已成为高效单图像超分辨率（SR）的强大范式。然而，现有的基于Mamba的方法通常依赖于与数据无关的刚性扫描，将2D图像重塑为固定网格上的1D序列，这不可避免地破坏了空间语义拓扑并引入伪影。受格式塔知觉分组理论的启发，我们提出了SP-MoMamba，一种用于内容感知SR的超像素驱动状态空间专家混合模型。我们的核心思想是通过将超像素视为基本单元，将传统的刚性扫描转化为语义级交互。具体来说，我们引入了超像素驱动状态空间模型（SP-SSM），它将语义同质区域压缩为高阶令牌，以保持全局拓扑一致性。为了解决固定扫描尺度与多样语义粒度之间的冲突，我们开发了多尺度超像素状态空间专家混合（MSS-MoE）。该模块利用动态路由机制自适应地分配尺度特定专家，有效捕捉多尺度纹理，同时减少计算冗余。此外，为了防止全局抽象过程中高频细节的丢失，我们引入了局部空间调制专家（LSME）来补充全局建模，确保锐利边缘和精细结构的精确重建。在标准基准上的大量实验表明，与最先进的高效SR方法相比，SP-MoMamba实现了更优的重建保真度和更有利的效率-性能权衡。

英文摘要

State space models (SSMs) have emerged as a powerful paradigm for efficient single-image super-resolution (SR) due to their linear complexity and long-range modeling capabilities. However, existing Mamba-based methods typically rely on data-agnostic rigid scanning, which reshapes 2D images into 1D sequences over a fixed grid, inevitably disrupting spatial-semantic topology and introducing artifacts. Inspired by the Gestalt perceptual grouping theory, we propose SP-MoMamba, a superpixel-driven mixture of state space experts designed for content-aware SR. Our core idea is to transform the traditional rigid scanning into a semantic-level interaction by treating superpixels as fundamental units. Specifically, we introduce the Superpixel-driven State Space Model (SP-SSM), which compresses semantically homogeneous regions into high-order tokens to preserve global topological consistency. To address the conflict between fixed scanning scales and diverse semantic granularities, we develop the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE). This module utilizes a dynamic routing mechanism to adaptively assign scale-specific experts, effectively capturing multi-scale textures while reducing computational redundancy. Furthermore, to prevent the loss of high-frequency details during global abstraction, we introduce a Local Spatial Modulation Expert (LSME) to complement the global modeling, ensuring a precise reconstruction of sharp edges and fine structures. Extensive experiments on standard benchmarks demonstrate that SP-MoMamba achieves superior reconstruction fidelity and a more favorable efficiency-performance trade-off compared to state-of-the-art efficient SR methods.

2605.25891 2026-05-26 cs.CL cs.AI

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

因果舌结：LLMs 能编码因果方向，但其是/否输出无法表达

Ziyi Ding, Xiao-Ping Zhang

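AI summary: The study finds a mismatch between internal encoding and output in large language models on causal questions: a linear probe can recover the evidence-supported answer from hidden states (accuracy about 0.97), whereas verbal yes/no answers regress to the commonsense answer (accuracy about 0.5), revealing a gap of about +0.5 that the authors call "causal tongue-tie".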

2605.25890 2026-05-26 cs.LG

Merge-Bench: Resolve Merge Conflicts with Large Language Models

Merge-Bench: 使用大型语言模型解决合并冲突

Benedikt Schesch, Michael D. Ernst

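AI summary: The paper builds Merge-Bench, a dataset of 7938 real-world merge conflicts, and trains the LLMergeJ model with Group Relative Policy Optimization (GRPO); on Java programs, the 14B-parameter model outperforms most commercial LLMs, but the best models still correctly resolve fewer than 60% of conflicts.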

Comments 14 pages, 7 figures

AI中文摘要

本文应用机器学习处理版本控制合并这一困难且重要的任务。（1）我们构建了一个数据集Merge-Bench，包含来自1439个GitHub仓库的7938个真实合并冲突片段。真实标注是开发者提交到仓库的合并解决方案。我们的数据集构建方法可扩展到任意数据量，因为无需手动标注。（2）我们训练了一个模型LLMergeJ，用于解决Java程序中的合并冲突。我们的方法使用组相对策略优化（GRPO），一种在线强化学习方法，来训练大型语言模型（LLM）。（3）我们对LLM在解决合并冲突上的性能进行了两次评估。在Java程序上，具有14B参数的LLMergeJ优于3个商业LLM，仅次于Gemini 2.5 Pro。在11种编程语言中，商业LLM的性能在不同语言间基本稳定。最佳模型正确解决的合并冲突不到60%。

英文摘要

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.
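To make the task concrete, the sketch below extracts conflict hunks from a file containing standard Git conflict markers and compares a candidate resolution with the one a developer committed. It is only an illustration under assumptions: the abstract does not state the paper's correctness criterion, so the whitespace-insensitive match here is a placeholder, the function names are hypothetical, and diff3-style base sections are not handled.

```python
# Git's standard conflict markers, built programmatically for readability.
OURS, SEP, THEIRS = "<" * 7, "=" * 7, ">" * 7


def parse_conflict_hunks(text):
    """Return a list of {'ours': ..., 'theirs': ...} dicts, one per conflict hunk."""
    hunks, state, ours, theirs = [], None, [], []
    for line in text.splitlines():
        if line.startswith(OURS):
            state, ours, theirs = "ours", [], []
        elif line.startswith(SEP) and state == "ours":
            state = "theirs"
        elif line.startswith(THEIRS) and state == "theirs":
            hunks.append({"ours": "\n".join(ours), "theirs": "\n".join(theirs)})
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return hunks


def matches_resolution(candidate, developer_resolution):
    """Assumed metric: equal after stripping whitespace and dropping blank lines."""
    def normalize(s):
        return [ln.strip() for ln in s.splitlines() if ln.strip()]
    return normalize(candidate) == normalize(developer_resolution)


# Invented example file with one conflict.
sample = "\n".join([
    "int x = 1;",
    OURS + " HEAD",
    "int y = 2;",
    SEP,
    "int y = 3;",
    THEIRS + " feature",
    "return x + y;",
])

hunks = parse_conflict_hunks(sample)
print(hunks)
print(matches_resolution("  int y = 3;\n", "int y = 3;"))
```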

2605.25888 2026-05-26 cs.LG math.OC

Optimal and Order-optimal Gated Priority-based Greedy Policies for Two-layer Multi-item Order Fulfillment

两层多物品订单履约的最优和阶最优门控优先级贪婪策略

Xi Chen, Yuze Chen, Ziyi Chen, Yuan Zhou

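AI summary: For real-time fulfillment decisions by e-commerce firms in a two-layer distribution network, the paper proposes Gated Priority-based Greedy policies, proves the optimality of their competitive ratios, and validates their performance with numerical experiments.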

AI中文摘要

我们研究当多物品客户订单顺序到达且未来需求未知时，电商企业如何在两层分销网络中做出实时履约决策。核心管理矛盾在于：是否使用稀缺的前端配送中心（FDC）库存来节省当前履约成本，还是保留该库存用于未来可能更有价值的本地服务订单。我们构建了一个对抗性在线模型，包含多个FDC、一个区域配送中心（RDC）、多单位多物品订单以及物品特定且时变的可变成本。理论目标是刻画简单、可解释且可实施的履约规则何时能够达到与最优先知规划者几乎相同的性能。我们提出了一类门控优先级贪婪策略，在时变和时不变成本结构下推导了竞争比保证，并为任何在线算法建立了匹配或接近匹配的下界。数值实验表明，所提策略相对于广义短视和基于预测的基准方法表现强劲。分析提供了管理指导：何时应保护本地库存，何时拆分订单值得承担固定成本负担，以及固定成本和可变成本的相对大小如何决定更复杂优化的价值。

英文摘要

We study how an e-commerce firm should make real-time fulfillment decisions in a two-layer distribution network when multi-item customer orders arrive sequentially and future demand is unknown. The central managerial tension is whether to use scarce front distribution center (FDC) inventory to save current fulfillment cost or preserve that inventory for future orders that may be more valuable to serve locally. We formulate an adversarial online model with multiple FDCs, one regional distribution center (RDC), multi-unit multi-item orders, and item-specific and time-varying variable costs. Our theoretical objective is to characterize when simple, interpretable, and implementable fulfillment rules can perform nearly as well as an optimal clairvoyant planner. We develop a family of Gated Priority-based Greedy policies, derive competitive-ratio guarantees under both time-varying and time-invariant cost structures, and establish matching or near-matching lower bounds for any online algorithm. Numerical experiments show that the proposed policies perform strongly relative to generalized myopic and forecast-based benchmarks. The analysis yields managerial guidance on when local inventory should be protected, when splitting orders is worth the fixed-cost burden, and how the relative magnitudes of fixed and variable costs determine the value of more sophisticated optimization.

2605.25882 2026-05-26 cs.LG

Conformalised imprecise inference for robust extrapolation under limited data

基于共形化的不精确推断在有限数据下的鲁棒外推

Yu Chen, Scott Ferson

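AI summary: Proposes a model-agnostic conformalised imprecise inference framework that, by introducing imprecision and distance awareness, maintains coverage under distribution shift and adaptively widens uncertainty, enabling robust extrapolation under limited data.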

Comments 10 pages, 5 figures

2605.25880 2026-05-26 cs.LG

The Quantization Benefits of Residual-Free Transformers

无残差Transformer的量化优势

Yiping Ji, Mahalakshmi Sabanayagam, Peyman Moghadam, Hemanth Saratchandran, Simon Lucey

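AI summary: By comparing residual and residual-free Transformers, the paper finds that residual connections make activations more non-Gaussian and thereby increase quantization error, whereas residual-free Transformers keep activations near-Gaussian through techniques such as orthogonal initialization and are markedly more robust to low-bit quantization, revealing a trade-off between accuracy and compressibility.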

Comments Under review

AI中文摘要

大规模Transformer的训练和部署日益受到跨加速器传输激活值、梯度和优化器状态的限制。低比特量化提供了一种自然的补救措施，但Transformer的激活值通常具有重尾和异常值主导的特点，使得简单量化损失严重。我们表明，这种困难不仅是量化器的属性，也是架构的属性。具体来说，残差连接在训练过程中可能使Transformer激活值偏离高斯性。通过残差和无残差Transformer之间的受控比较，我们证明这种效应导致残差模型在低精度下量化误差和精度下降显著更高。我们通过超额峰度分析解释这一现象，表明残差混合可以放大非高斯性，而无残差中的密集混合则压缩非高斯性。然后我们展示，使用正交初始化、谱或二阶优化以及注意力温度的深度感知缩放，可以使无残差Transformer可训练。在语言任务中，虽然全精度性能略有下降，但这些模型保持近高斯激活值，并对低比特量化表现出显著改善的鲁棒性。我们的结果揭示了Transformer设计中的精度-可压缩性权衡，并激发了面向量化的基础模型的架构级方法。

英文摘要

Large-scale transformer training and deployment are increasingly constrained by the transfer of activations, gradients, and optimizer states across accelerators. Low-bit quantization offers a natural remedy, but transformer activations are often heavy-tailed and outlier-dominated, making simple quantization highly lossy. We show that this difficulty is not only a property of the quantizer, but also of the architecture. Specifically, residual connections can drive transformer activations away from Gaussianity during training. Using controlled comparisons between residual and residual-free transformers, we demonstrate that this effect leads to substantially higher quantization error and accuracy degradation at low precision in residual models. We explain the phenomenon through an excess kurtosis analysis, showing that residual mixing can amplify non-Gaussianity, whereas dense mixing in residual-free models contracts non-Gaussianity. We then show that residual-free transformers can be made trainable using orthogonal initialization, spectral or second-order optimization, and depth-aware scaling of attention temperature. In language tasks, while there is a small drop in full precision performance, these models retain near-Gaussian activations and exhibit significantly improved robustness to low-bit quantization. Our results identify an accuracy--compressibility trade-off in transformer design and motivate architecture-level approaches to quantization-friendly foundation models.
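The general link between heavy tails and quantization error that the abstract relies on can be seen in a small numerical sketch. It does not reproduce the paper's experiments: the synthetic "activations", the five injected outliers, and the simple symmetric max-scaled 4-bit quantizer are all assumptions chosen for illustration.

```python
import random


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (about 0 for a Gaussian sample)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0


def quantize(xs, bits=4):
    """Symmetric uniform quantizer whose scale is set by the largest magnitude."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / levels
    return [round(x / scale) * scale for x in xs]


def relative_error(xs, bits=4):
    """Squared quantization error divided by the signal energy."""
    q = quantize(xs, bits)
    err = sum((x - y) ** 2 for x, y in zip(xs, q))
    return err / sum(x * x for x in xs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
gaussian = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Same sample with five entries replaced by large outliers (assumed values).
heavy = gaussian[:-5] + [50.0, -50.0, 40.0, -45.0, 60.0]

for name, xs in (("near-Gaussian", gaussian), ("outlier-dominated", heavy)):
    print(name, round(excess_kurtosis(xs), 2), round(relative_error(xs), 4))
```

With max-based scaling, a handful of outliers stretches the quantization grid so far that most ordinary values collapse to zero, which is why the outlier-dominated sample shows both a larger excess kurtosis and a larger relative error.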

2605.25876 2026-05-26 cs.CV

DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation

DyCoRM: 面向文本到图像生成的动态准则感知奖励建模

Jiaying Qian, Ziheng Jia, Qian Zhang, Zicheng Zhang, Jiayi Guo, Junqi Zhang, Guangtao Zhai, Xiongkuo Min

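AI summary: To meet users' need for dynamic, fine-grained evaluation criteria in text-to-image generation, the paper proposes DyCoRM, a dynamic criterion-aware reward model, and builds the dataset DyCoDataset-20K and the benchmark DyCoBench-1K; with criterion-aware preference comparison and the DyCoPick selection method, it presents the first framework for dynamic, fine-grained reward modeling.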

AI中文摘要

随着文本到图像（T2I）生成的持续进步，生成高质量图像变得越来越容易；因此，用户需求转向更符合其特定要求的图像。由于奖励模型在评估生成图像是否符合用户偏好方面扮演着越来越重要的角色，这一趋势为奖励建模带来了一个重要挑战：奖励模型不应仅依赖静态和通用的评价维度，而应考虑用户评估生成图像是否满足其特定要求时所用的任务相关且细粒度的准则。为应对这一挑战，我们提出了DyCoRM，一种动态的、准则感知的奖励模型，它能够基于任务相关准则并进行准则感知的偏好比较。为支持这一设定，我们构建了DyCoDataset-20K，提供动态准则及准则级标注，并进一步推导出DyCoBench-1K，一个用于在动态准则下系统评估奖励模型的基准。我们还引入了DyCoPick，它将准则感知奖励建模应用于T2I图像选择。我们的贡献建立了首个用于动态和细粒度评估以及在T2I生成中实际应用的奖励建模框架。

英文摘要

With the continued advancement of text-to-image (T2I) generation, producing high-quality images is becoming increasingly attainable; consequently, user demands are shifting toward images that better satisfy their specific requirements. As reward models play an increasingly important role in assessing whether generated images align with user preference, this trend introduces an important challenge for reward modeling: rather than relying solely on static and general evaluation dimensions, reward models should account for the task-relevant and fine-grained criteria through which users assess whether generated images meet their specific requirements. To address this challenge, we propose DyCoRM, a dynamic, criterion-aware reward model that grounds task-relevant criteria and performs criterion-aware preference comparison. To support this setting, we construct DyCoDataset-20K, which provides dynamic criteria together with criterion-level annotations, and further derive DyCoBench-1K, a benchmark for systematically evaluating reward models under dynamic criteria. We further introduce DyCoPick, which applies criterion-aware reward modeling to selecting T2I images. Our contributions establish the first reward modeling framework for dynamic and fine-grained evaluation and practical application in T2I generation.

2605.25874 2026-05-26 cs.CV

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

WBench：面向交互式视频世界模型评估的综合多轮基准

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding

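AI summary: Proposes WBench, a comprehensive multi-turn benchmark with five dimensions, 289 test cases, and 1,058 interaction turns for systematically evaluating interactive world models, and finds that existing models perform unevenly across dimensions.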

Comments Technical report of WBench. Homepage: https://meituan-longcat.github.io/WBench/

AI中文摘要

交互式世界模型正在快速发展，但现有基准仅覆盖部分所需能力，缺乏统一标准进行系统评估。为填补这一空白，我们引入了WBench，一个全面的多轮基准，用于沿五个维度（视频质量、设置遵循、交互遵循、一致性和物理合规性）评估交互式世界模型。WBench包含289个测试用例和1,058个交互轮次，每个用例指定一个世界设置和多轮交互序列，涵盖多样场景、风格、主体以及第一人称和第三人称视角，同时包括四种交互类型：导航、主体动作、事件编辑和视角切换。对于导航，WBench统一了文本、6自由度姿态和离散动作控制，使得具有不同原生输入接口的模型都能被评估。评估使用22个自动子指标，结合了专业视觉模型和大规模多模态模型，所有指标均通过人类判断进行验证。在20个最先进的模型上，我们发现没有单个模型在所有维度上都表现良好。我们提供了关于每个模型特征性优势、劣势和开放挑战的详细诊断见解。代码和数据可在 https://github.com/meituan-longcat/WBench 获取。

英文摘要

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.

2605.25869 2026-05-26 cs.CL

Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation

通过类型化记忆表示缓解长期智能体中的来源-角色崩溃

Zhengda Jin, Bingbing Wang, Jing Li, Ruifeng Xu, Min Zhang

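AI summary: Proposes MemIR, a typed memory intermediate representation that enforces source monitoring through structural constraints, addressing the provenance-role collapse caused by unstructured storage in long-term agents; it outperforms existing baselines on LoCoMo and BEAM-100K.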

AI中文摘要

长期记忆对于持久化LLM智能体至关重要，但现有架构将历史交互存储为非结构化的平面文本。这种无约束存储会导致来源-角色崩溃，即智能体出现来源监控错误的关键失效模式。为了在架构层面解决这一认知脆弱性，我们提出MemIR，一种类型化记忆中间表示，将来源监控作为结构约束来操作化。MemIR将长期记忆写入基础原子，这些原子分离原始证据、检索线索和承载真相的声明，事实授权仅限于受支持的声明原子。然后，它应用多路径原子投影和来源范围利用，将异构检索结果转化为以声明为中心的候选包，以及用于答案生成的归一化事实接口。在LoCoMo和BEAM-100K上的实验表明，MemIR持续优于现有记忆基线，特别是在需要来源追踪、时间锚定和碎片证据聚合的任务上。

英文摘要

Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long-term memory into grounded atoms that separate raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi-route atomic projection and provenance-scoped utilization to transform heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM-100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.

2605.25866 2026-05-26 cs.LG cond-mat.mtrl-sci physics.class-ph

UNATE: UNsupervised ATomic Embedding for crystal structures property prediction

UNATE：用于晶体结构性质预测的无监督原子嵌入

Laura Solà-Garcia, Àlex Solé, Javier Ruiz-Hidalgo

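AI summary: Proposes the UNATE framework, which learns robust atomic representations from unlabeled crystal structures via an unsupervised denoising autoencoder and self-supervised contrastive learning for downstream property prediction, with gains of up to 10% when labeled data are limited.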

2605.25864 2026-05-26 cs.LG cs.CL

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

当自我信念误导：面向可验证奖励的强化学习的主动标签获取

Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin

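AI summary: Proposes the RLAVR framework, which actively acquires a small number of ground-truth labels and combines them with pseudo-labels, using the CAG metric and the CARE policy to stabilize training and improve performance under limited annotation budgets.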

AI中文摘要

大型语言模型（LLM）通过可验证奖励的强化学习（RLVR）在推理能力上取得了显著进展。然而，RLVR本质上依赖于真实标签进行奖励计算，而在实际场景中获取这些标签通常成本高昂。虽然无监督的RLVR范式试图通过训练伪标签来规避这一问题，但它们极易发生训练崩溃。此外，不同样本往往具有不同的标注价值。在本文中，我们提出了主动可验证奖励的强化学习（RLAVR），它主动获取少量选定样本的真实标签，并将其与伪标签相结合，从而稳定训练动态并在有限标注预算下提高性能。为了识别有价值的样本，我们提出了纠正优势差距（CAG）指标，并分析了样本级别的监督价值。在此基础上，我们引入了用于RLAVR的纠正感知可靠性估计（CARE），它将理想的CAG准则转化为实用的预查询获取策略，以显著提高训练稳定性。跨不同领域、模型家族和模型规模的大量实验证明了我们方法的有效性和通用性。我们的代码可在https://github.com/Lumina04/CARE获取。

英文摘要

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.

2605.25860 2026-05-26 cs.CV

SAM3-Assisted Training of Lightweight YOLO Models for Precision Pig Farming

SAM3辅助训练的轻量级YOLO模型用于精准养猪

Marcos Vinicius Mendes Faria, Thiago Borges Pereira, Isabella C. F. S. Condotta, Thiago Meireles Paixão, Francisco de Assis Boldt

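AI summary: Proposes using SAM 3 to automatically generate pseudo-labels for training YOLOv8 detectors without manual annotation, reaching 79.4% mAP on the PigLife dataset with inference about 200 times faster than the teacher model.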

Comments Accepted for publication at the IEEE Sensors Applications Symposium (SAS 2026)

AI中文摘要

基于深度学习的物体检测彻底改变了精准畜牧业（PLF），但仍存在一个关键障碍：高性能基础模型（如SAM 3）计算量过大，无法在边缘部署，而轻量级模型（如YOLO）需要大量人工标注。本文提出了一种全自动知识蒸馏流程，利用Segment Anything Model 3（SAM 3）生成零样本伪标签，用于训练高效的YOLOv8检测器。通过将SAM 3视为离线自动标注器，消除了手动标注瓶颈，生成的模型能够在资源受限的硬件上实现实时推理。我们在PigLife数据集上系统评估了该方法，将SAM 3监督模型与人工标注基线进行了比较。结果表明，无需人工干预，SAM 3训练的YOLOv8m平均精度（mAP）达到79.4%，同时推理延迟比教师模型降低约200倍。此外，分层分析显示，在低遮挡场景下，自动流程的检测率与人工基准相当（AP50 > 99%）。这些发现表明，基础模型可以作为有效的零标注成本监督器，为智慧农业提供可扩展的边缘计算解决方案。

英文摘要

Deep learning-based object detection has revolutionized Precision Livestock Farming (PLF), yet a critical barrier remains: high-performance Foundation Models (such as SAM 3) are too computationally intensive for edge deployment, while lightweight models (like YOLO) require prohibitive manual annotation efforts. This work proposes a fully automated knowledge distillation pipeline that leverages the Segment Anything Model 3 (SAM 3) to generate zero-shot pseudo-labels for training efficient YOLOv8 detectors. By treating SAM 3 as an offline auto-annotator, we eliminate the manual labeling bottleneck, producing models capable of real-time inference on resource-constrained hardware. We systematically evaluate this approach on the PigLife dataset, comparing SAM 3-supervised models against human-annotated baselines. Results demonstrate that a SAM 3-trained YOLOv8m achieves a mean Average Precision (mAP) of 79.4% without human intervention, while reducing inference latency by approximately 200$\times$ compared to the teacher model. Furthermore, stratified analysis reveals that in low-occlusion scenarios, the automated pipeline achieves detection rates comparable to human benchmarks ($AP_{50} > 99\%$). These findings indicate that foundation models can serve as effective, zero-annotation-cost supervisors, enabling scalable edge computing solutions for smart agriculture.
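One step such an auto-annotation pipeline needs is turning a teacher model's output into detector training labels. The sketch below assumes the teacher yields a binary instance mask and converts it to a line in the common YOLO text-label format (`class x_center y_center width height`, all normalized to the image size). The function name is hypothetical, and whether the authors derive boxes from masks in this way is an assumption, not something the abstract states.

```python
def mask_to_yolo_line(mask, class_id=0):
    """Convert a binary mask (list of rows of 0/1) to one YOLO-format label line.

    Returns None when the mask contains no foreground pixels.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows:
        return None
    # Tight bounding box around the foreground, with exclusive upper bounds.
    x_min, x_max = cols[0], cols[-1] + 1
    y_min, y_max = rows[0], rows[-1] + 1
    x_center = (x_min + x_max) / 2 / width
    y_center = (y_min + y_max) / 2 / height
    box_w = (x_max - x_min) / width
    box_h = (y_max - y_min) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"


# Invented 4x8 mask with a foreground block in rows 1-2, columns 2-5.
toy_mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(mask_to_yolo_line(toy_mask))
```

Because labeling happens offline, the cost of running the large teacher is paid once per image, while only the small detector runs at inference time.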

2605.25854 2026-05-26 cs.AI

From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

从核算到协调：面向数据中心调度的虚拟水感知电-算-水关联框架

Haiyang You, Chengwei Lou, Jin Zhao, Yue Zhou, Lu Zhang, Jin Yang

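AI summary: Proposes a differentiable optimization framework that internalizes virtual water impacts into power system dispatch and learns coordination policies end to end with deep learning, reducing freshwater withdrawals by about 3-5% on the IEEE 30-bus and 118-bus systems.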

AI中文摘要

数据中心的扩张驱动了电力需求的持续增长以及发电站点相关的取水量增加。这些取水发生在发电站点，并根据网络潮流虚拟分配给负荷。因此，特定负荷的实际水足迹随发电调度和网络条件动态变化。现有方法通常依赖静态统计核算来量化这些水足迹。然而，这种静态方法无法捕捉调度优化和负载迁移如何动态影响取水。结果，静态统计核算方法仍与优化过程脱节，无法指导负载迁移或电力调度以缓解水压力。为解决这一局限，本文开发了一个可运行的电-算-水（ECW）关联框架，将虚拟水影响直接内化到电力系统调度中。该框架将调度优化表示为嵌入深度学习架构中的可微优化层，能够在保持运行可行性的同时实现协调策略的高效端到端学习。结合不动点协调，该框架确保了虚拟水归因与物理发电侧取水之间的一致性。在IEEE 30节点和118节点测试系统上的案例研究展示了可靠的收敛性、精确的功率-水一致性，以及在受水约束条件下发电相关淡水取水减少约3-5%。

英文摘要

The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a specific load varies dynamically with generation dispatch and network conditions. Existing approaches typically rely on static statistical accounting to quantify these water footprints. However, such static methods fail to capture how dispatch optimization and workload relocation dynamically affect water withdrawals. As a result, static statistical accounting approaches remain decoupled from the optimization process, rendering them incapable of guiding workload relocation or power dispatch to mitigate water stress. To address this limitation, this paper develops an operational electricity-computation-water (ECW) nexus framework that internalizes virtual water impacts directly into power system dispatch. The framework represents dispatch optimization as a differentiable optimization layer embedded within a deep learning architecture, enabling efficient end-to-end learning of coordination policies while preserving operational feasibility. Combined with fixed-point coordination, the framework enforces consistency between virtual water attribution and physical generation-side withdrawals. Case studies on the IEEE 30-bus and 118-bus test systems demonstrate reliable convergence, exact power-water consistency, and reductions of approximately 3-5% in generation-related freshwater withdrawals under water-constrained conditions.

2605.25851 2026-05-26 cs.RO

RePlan-Bot: Multi-Level Replanning for Embodied Instruction Following

RePlan-Bot：面向具身指令跟随的多级重规划

Xicheng Gong, Guozheng Sun, Peiran Xu, Yadong Mu

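AI summary: Proposes RePlan-Bot, which uses multi-level continuous replanning (a high-level LLM auditor, commonsense-guided search, and a lightweight ViT corrector) to address long-horizon planning and irreversible state changes in embodied instruction following, achieving the best performance on the ALFRED benchmark.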

Comments 10 pages

AI中文摘要

具身指令跟随（EIF）要求智能体在交互式3D环境中理解和执行复杂的自然语言命令。尽管近期取得了进展，现有方法在长时规划和应对不可逆状态变化方面常常失败，导致任务成功率低。为解决这些挑战，我们引入了RePlan-Bot，一种新颖的EIF智能体，它在任务执行过程中执行多级、连续的重规划。RePlan-Bot集成了一个高层基于LLM的审计器，用于根据环境反馈动态调整子目标；一个基于常识引导的搜索机制，基于多层实例地图实现精确且结构化的对象定位；以及一个轻量级基于ViT的校正器，用于预先修复有风险的低层动作。在ALFRED基准上的评估显示，RePlan-Bot在已知和未知环境中均达到了最先进的性能，展示了卓越的适应性和可靠性。

英文摘要

Embodied instruction following (EIF) requires agents to understand and execute complex natural language commands within interactive 3D environments. Despite recent advances, existing methods often fail in long-horizon planning and handling irreversible state changes, resulting in low task success rates. To address these challenges, we introduce RePlan-Bot, a novel EIF agent that performs multi-level, continuous replanning throughout task execution. RePlan-Bot integrates a high-level LLM-based auditor for dynamic sub-goal adjustments guided by environmental feedback, a commonsense-guided search mechanism based on a multi-layered instance map for precise and structured object localization, and a lightweight ViT-based corrector to preemptively fix risky low-level actions. Evaluated on the ALFRED benchmark, RePlan-Bot achieves state-of-the-art performance in both seen and unseen environments, demonstrating superior adaptability and reliability.
