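arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.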

2605.26567 2026-05-27 cs.AI

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX: 将可执行指南中的决策逻辑内化到大型语言模型中以进行临床推理

Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang, Juexiao Zhou, Hao Peng, Yue Guo

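Affiliations: University of Illinois Urbana-Champaign; The Chinese University of Hong Kong, Shenzhen; Albert Einstein College of Medicine

AI summary: Proposes a training pipeline that converts clinical practice guidelines into executable decision logic and uses it to generate factual and counterfactual question-answering data. Fine-tuning a medical large language model on this data yields MedGuideX, which achieves a 10.28% relative improvement in average accuracy across four clinical reasoning benchmarks; physician evaluation also rates its reasoning steps higher.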

AI中文摘要

临床实践指南（CPGs）编码了基于证据的决策逻辑，临床医生通过评估患者变量、条件标准和推荐规则来应用这些逻辑。然而，现有方法通常将CPGs作为自由文本训练数据或检索源，未能充分利用其程序性决策结构。为了更好地利用这种结构，我们引入了一个基于指南的训练流程，将CPG推荐转化为可执行的临床决策逻辑，并利用它生成事实性和反事实性的问答数据。这些数据教会模型既支持指南推荐的决策，也了解在不同患者条件下决策如何变化。在生成的医学数据上对医学大语言模型进行后训练，得到MedGuideX。在四个临床推理基准上，MedGuideX的平均准确率相对提高了10.28%。医生评估进一步表明，MedGuideX能更好地恢复临床医生撰写的推理步骤，并在忠实性、有效性、完整性和清晰度方面产生医生偏好的推理依据。总体而言，我们的结果表明，来自CPGs的可执行决策逻辑可以转化为可扩展的监督信号，用于构建可靠的医学大语言模型。

English abstract

Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. These data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician-authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.
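
As a rough illustration of how executable decision logic can produce both factual and counterfactual question-answer pairs, the sketch below encodes one toy rule and edits one patient variable at a time. The rule, its thresholds, the variable names (`age_years`, `risk_score`), and the helpers `decide`, `make_qa`, and `make_counterfactuals` are all hypothetical; none of them is taken from the paper or from any clinical guideline.

```python
# Toy rule for illustration only: thresholds and wording are invented,
# not clinical guidance and not the paper's rule format.
RULE = {
    "conditions": {
        "age_years": lambda value: value >= 65,
        "risk_score": lambda value: value >= 2,
    },
    "if_met": "recommend treatment A",
    "otherwise": "do not recommend treatment A",
}


def decide(rule, patient):
    """Execute the rule: every condition must hold for the recommendation."""
    met = all(check(patient[name]) for name, check in rule["conditions"].items())
    return rule["if_met"] if met else rule["otherwise"]


def make_qa(rule, patient):
    """Build one question-answer pair whose answer comes from running the rule."""
    facts = ", ".join(f"{key}={value}" for key, value in sorted(patient.items()))
    return {
        "question": f"Patient: {facts}. What is the decision?",
        "answer": decide(rule, patient),
    }


def make_counterfactuals(rule, patient, alternatives):
    """Change one variable at a time; keep only edits that flip the decision."""
    base_decision = decide(rule, patient)
    pairs = []
    for name, new_value in alternatives.items():
        edited = dict(patient, **{name: new_value})
        if decide(rule, edited) != base_decision:
            pairs.append(make_qa(rule, edited))
    return pairs


patient = {"age_years": 70, "risk_score": 3}
factual = make_qa(RULE, patient)
counterfactual = make_counterfactuals(RULE, patient, {"age_years": 50, "risk_score": 4})
```

Keeping only the edits that flip the decision is one possible way to focus counterfactual data on how decisions change; the abstract does not say how the paper selects its counterfactuals.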

2605.26562 2026-05-27 cs.LG

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

超越整体模型：深度多变量时间序列预测的系统性组件级基准测试

Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

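Affiliations: Shanghai University of Finance and Economics; Key Laboratory of Interdisciplinary Research of Computation and Economics

AI summary: Proposes the TSCOMP benchmark, which uses orthogonal experiments to decompose deep forecasting methods into their core components, reveals how effective each component is, and builds a performance corpus that enables zero-shot model construction, outperforming hand-crafted complex architectures.

Comments: Accepted by KDD 2026 Datasets and Benchmarks Track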

DOI: 10.1145/3770855.3817551

AI中文摘要

虽然先前在多变量时间序列预测中的研究集中于开发复杂的整体模型，但本工作倡导转向对其影响的细粒度、组件级理解。我们提出TSCOMP，这是第一个大规模基准，系统地将深度预测方法分解为其核心、细粒度的组件——涵盖序列预处理、编码策略、包括特定和大规模时间序列模型的网络架构以及优化方法。通过使用约束正交实验设计和广泛评估，我们进行多视角分析，揭示组件在不同骨干网络、数据特征及其交互中的有效性。除了提供见解外，该基准建立了一个包含超过20,000个模型-数据集评估的细粒度性能语料库，支持自动组件选择的学习，从而在新数据集上实现零样本模型构建。我们的实验表明，尽管简单，但基于语料库的方法始终优于最先进的方法，验证了我们评估设计的合理性，并确认系统性组件选择超越了手工设计的复杂架构。所有代码和性能语料库均可在 https://github.com/SUFE-AILAB/TSCOMP 公开获取。

English abstract

While previous research in multivariate time series forecasting has focused on developing complex holistic models, this work advocates for a shift toward a granular, component-level understanding of their impacts. We propose TSCOMP, the first large-scale benchmark that systematically deconstructs deep forecasting methods into their core, fine-grained components--spanning series preprocessing, encoding strategies, network architectures including specific and large time-series models, and optimization methods. Using constrained orthogonal experimental design and extensive evaluations, we conduct multi-view analyses that reveal component effectiveness across different backbones, data characteristics, and their interactions. Beyond providing insights, this benchmark establishes a fine-grained performance corpus comprising over 20,000 model-dataset evaluations, which supports the learning of automated component selection, enabling zero-shot model construction on new datasets. Our experiments demonstrate that the corpus-driven approach, despite its simplicity, consistently outperforms state-of-the-art methods, validating the soundness of our evaluation design and confirming that systematic component selection surpasses manually designed complex architectures. All code and the performance corpus are publicly available at https://github.com/SUFE-AILAB/TSCOMP.

2605.26560 2026-05-27 cs.CL cs.AI

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

可靠提取临床随访指令：一种混合神经符号管道

Michal Laufer, Yehudit Aperstein, Alexander Apartsin

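Affiliations: Bar-Ilan University; Afeka College of Engineering; Holon Institute of Technology

AI summary: Proposes a hybrid neural-symbolic pipeline that combines BioBERT entity extraction with deterministic date arithmetic, achieving near-perfect F1 on (action, date) pair extraction from synthetic outpatient notes and outperforming direct generation.

Comments: 17 pages, 5 figures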

AI中文摘要

目标。门诊笔记携带随访指令，将动作与未来时间配对（“两周内进行脑部MRI”）。提取（动作，日期）对支持调度和审计，但生成式提取器会错过日期，因为链接和算术在解码中是隐式的。我们测试了一种混合神经符号管道与直接生成方法的对比。方法。我们定义了TestSpecification和TimeSpecification实体以及ScheduledFor关系。BioBERT提供BIO标注和双仿射链接器；实体通过28动作本体规范化，时间通过确定性方式归一化为天数偏移。我们在一个包含2000份笔记的合成门诊语料库上评估，采用动作不相交划分（18个训练，6个OOV测试），与零样本GPT-4o-mini和LoRA微调LLaMA-3 8B对比，使用笔记级bootstrap 95%置信区间。结果。在259份笔记的已知和OOV划分上，混合管道实现了测试时间对F1分别为0.997和0.986，MAE为0.00天。基线达到了高动作F1（LLaMA-3 0.992；GPT-4o-mini 0.963已知），但对F1保持在0.51-0.57（LLaMA-3）和0.53（GPT-4o-mini），置信区间与混合管道不重叠。结论。将学习的实体提取与确定性日期算术分离在此基准上优于生成方法，泛化到未见动作，并暴露了失败模式。迁移到真实电子健康记录笔记是下一步验证；初步的现实性检查见局限性。

English abstract

Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.
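
The deterministic half of such a pipeline can be pictured as plain arithmetic over a parsed time expression. The sketch below is a minimal, assumed version: the regular expression, the unit table (months approximated as 30 days and years as 365), the number-word table, and the function name `to_day_offset` are illustrative choices, not the authors' implementation.

```python
import re

# Hypothetical unit table; months and years are coarse approximations.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}

# Hypothetical number-word table for small counts.
NUMBER_WORDS = {
    "a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Matches phrases such as "in two weeks" or "in 3 months".
PATTERN = re.compile(
    r"\bin\s+(\d+|[a-z]+)\s+(day|week|month|year)s?\b", re.IGNORECASE
)


def to_day_offset(time_expression):
    """Return the day offset for a relative time phrase, or None if unparsed."""
    match = PATTERN.search(time_expression)
    if match is None:
        return None
    count_text = match.group(1).lower()
    unit = match.group(2).lower()
    if count_text.isdigit():
        count = int(count_text)
    else:
        count = NUMBER_WORDS.get(count_text)
    if count is None:
        return None  # Unknown quantity word such as "several".
    return count * UNIT_DAYS[unit]
```

Returning `None` for unparsed phrases, rather than guessing, keeps failures visible, which fits the abstract's point that separating extraction from arithmetic exposes failure modes.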

2605.26559 2026-05-27 cs.LG cs.AI econ.EM

Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

审计与修复离散选择中表格基础模型的经济有效性

Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, Zexin Zhuang

Affiliations: University of California, Berkeley, CA, USA; Duke University, Durham, NC, USA; Northeastern University, Boston, MA, USA; University of Illinois Urbana-Champaign, IL, USA; Southern Methodist University, Dallas, TX, USA

AI summary: Proposes a two-stage adapter that embeds tabular foundation model predictions within a utility-maximization framework, improving choice prediction accuracy while guaranteeing economic consistency.

Comments: 5 pages, 1 table. Accepted at the FMSD Workshop, ICML 2026

AI中文摘要

表格基础模型在选择预测任务上取得了很高的准确率，但其预测常常违反这些任务所需的经济逻辑：提高价格有时会增加预测需求，隐含的支付意愿估计经常为负或不合理。我们提出了一种两阶段适配器，将基础模型预测嵌入效用最大化框架。在第一阶段，我们估计一个标准选择模型，其参数受经济理论约束。在第二阶段，我们冻结这些参数，并训练一个校正项，将基础模型的预测作为附加信息纳入。结果模型继承了基础模型的精度提升，同时保证了政策扰动下价格-需求的单调关系，并产生可解析计算的权衡指标。在两个交通数据集上，适配器在保持完美经济一致性的同时，相比标准logit模型恢复了高达13个百分点的准确率，这是原始基础模型或传统蒸馏都无法实现的。

English abstract

Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price sometimes increases predicted demand, and implied willingness-to-pay estimates are frequently negative or implausible. We propose a two-stage adapter that embeds foundation model predictions within a utility-maximization framework. In the first stage, we estimate a standard choice model whose parameters are constrained to obey economic theory. In the second stage, we freeze those parameters and train a correction term that incorporates the foundation model's predictions as additional information. The result is a model that inherits the foundation model's accuracy gains while guaranteeing monotonic price-demand relationships under policy perturbation and producing analytically computable trade-off measures. On two transportation datasets, the adapter recovers up to 13 percentage points of accuracy over a standard logit model while maintaining perfect economic consistency, something neither the raw foundation models nor conventional distillation achieve.

2605.26554 2026-05-27 cs.LG cs.AI

Linear and Neural Dueling Bandits with Delayed Feedback

线性与神经延迟反馈的对抗性赌博机

Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong, Zhi Hong, Zhiyong Wang, Zhongxiang Dai

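Affiliations: The Chinese University of Hong Kong, Shenzhen; The Chinese University of Hong Kong

AI summary: For contextual dueling bandits with stochastic delayed feedback, proposes a linear algorithm (LDB-DF) and a neural algorithm (NDB-DF) that integrate an inverse probability weighting (IPW) mechanism directly into the loss function for unbiased correction, with an O(d*sqrt(T)) regret bound in the linear setting and sub-linear guarantees in the neural setting.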

AI中文摘要

上下文对抗性赌博机构成了基于偏好的决策制定的基石，在推荐系统和大语言模型对齐中有关键应用。然而，标准算法依赖于即时反馈的理想化假设，这一条件在现实场景（如提示优化）中经常被违反。这种设置带来了独特的理论挑战：与线性赌博机不同，对抗性赌博机估计量缺乏闭式解，使得标准加权技术的朴素适应产生偏差。为解决这一问题，我们形式化了具有随机延迟反馈的上下文对抗性赌博机问题，并提出了两种新颖算法：线性延迟反馈对抗性赌博机（LDB-DF）和神经延迟反馈对抗性赌博机（NDB-DF）。我们方法的核心是一种新颖的估计量，它将逆概率加权（IPW）机制直接集成到损失函数中，确保对延迟或缺失反馈的无偏校正。我们提供了全面的理论分析，为线性设置建立了O(d*sqrt(T))的遗憾界，并为神经设置建立了次线性保证。在模拟和真实数据集上的大量实验证明了我们提出方法的有效性。

English abstract

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed methods.
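
One way to read "integrating IPW into the loss" is to divide each observed duel's log-loss by the probability that its feedback has arrived, so that the weighted sum matches the full-feedback loss in expectation. The sketch below assumes a linear logistic (Bradley-Terry style) preference model and known observation probabilities; the function name, the tuple layout, and the L2 regularizer are assumptions, and the paper's actual estimator may differ.

```python
import math


def ipw_logistic_loss(theta, duels, reg=1.0):
    """IPW-weighted negative log-likelihood for a linear preference model.

    Each duel is a tuple (x_first, x_second, first_won, observed, observe_prob):
      observed      -- whether the feedback has arrived so far
      observe_prob  -- assumed-known probability that it has arrived so far
    """
    # L2 regularizer on the parameter vector (an assumed choice).
    loss = 0.5 * reg * sum(weight * weight for weight in theta)
    for x_first, x_second, first_won, observed, observe_prob in duels:
        if not observed:
            continue  # Missing feedback adds nothing; observed terms are up-weighted.
        score = sum(
            weight * (a - b) for weight, a, b in zip(theta, x_first, x_second)
        )
        signed = score if first_won else -score
        # -log(sigmoid(signed)), written in a numerically stable form.
        if signed > 0:
            nll = math.log1p(math.exp(-signed))
        else:
            nll = -signed + math.log1p(math.exp(signed))
        loss += nll / observe_prob
    return loss
```

Because the weights sit inside the loss, any generic minimizer (gradient descent, for instance) can be applied without a closed-form solution, which is the difficulty the abstract points to.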

2605.26546 2026-05-27 cs.AI

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer: 通过在线探索加速移动GUI代理的端侧推理

Runxi Huang, Liyu Zhang, Shengzhong Liu, Xiaomin Ouyang

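Affiliations: Hong Kong University of Science and Technology; Shanghai Jiao Tong University

AI summary: Proposes the MobileExplorer framework, which probes UI elements in parallel through online exploration and records the traces as structured memory, combined with a two-level rollback mechanism, to reduce reasoning steps and latency and make on-device deployment of mobile GUI agents more efficient.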

AI中文摘要

移动图形用户界面（GUI）代理使AI模型能够代表用户自主操作智能手机。然而，现有系统主要关注优化任务准确性，并依赖云端模型进行推理，这引入了隐私问题和网络依赖延迟。因此，移动GUI代理的完全端侧部署仍未被充分探索。我们提出MobileExplorer，一种通过在线探索加速基于视觉的移动GUI代理端侧推理的新框架。关键思想是利用视觉语言模型（VLM）较长的每步推理时间，对UI元素进行轻量级并行探索。在模型推理期间，代理主动探测语义相关的UI元素，并将这些探索轨迹记录为结构化记忆。为确保在真实移动环境中可靠执行，我们设计了两级回滚机制，当快速但简单的回溯策略失败时，能够稳健地恢复初始UI状态。收集的探索轨迹随后被总结为简洁的上下文提示，并注入到提示中，以增强后续推理步骤。我们在多个现成设备上使用AndroidWorld基准测试以及新设计的更复杂任务和动态端侧环境评估了MobileExplorer。MobileExplorer将平均推理步骤数和端到端延迟减少了23%，同时将任务成功率维持或提高了高达5%。真实世界中MobileExplorer性能的视频演示可在https://youtu.be/thK7MJmdlvM获取。

English abstract

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM.

2605.26543 2026-05-27 cs.AI cs.LG

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

PolyFusionAgent: 用于聚合物性能预测和逆向设计的多模态基础模型与自主AI助手

Manpreet Kaur, Xingying Zhang, Qian Liu

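Affiliations: Department of Applied Computer Science, The University of Winnipeg; Department of Mechanical Engineering, University of Manitoba

AI summary: Proposes the PolyFusionAgent framework, which couples the multimodal polymer foundation model PolyFusion with the tool-augmented design agent PolyAgent. It learns a shared latent space by aligning multimodal views (sequence, topology, 3D geometry, and fingerprints), supports thermophysical property prediction and inverse design of chemically valid, structurally novel polymers, and closes the design loop with evidence retrieved from the literature.

Comments: 23 pages, 5 figures, 2 tables; Supplementary material included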

AI中文摘要

聚合物的发现对于从能量存储到生物医学等领域至关重要，但受到天文数字般的化学设计空间以及结构、性能和先验知识的碎片化表示的阻碍。这种碎片化使得许多AI模型与物理和实验现实脱节，限制了它们支持直接可操作设计决策的能力。在这里，我们介绍PolyFusionAgent，一个交互式框架，将多模态聚合物基础模型（PolyFusion）与工具增强、基于文献的设计代理（PolyAgent）相结合。PolyFusion对齐互补的聚合物视图，包括序列、拓扑、3D几何和指纹，跨越数百万种聚合物，学习一个跨化学和数据体系可迁移的共享潜在空间，改进了热物理性能预测，并实现了超出参考设计空间的化学有效、结构新颖聚合物的性能条件生成。PolyAgent通过将预测和逆向设计与从聚合物文献中检索证据联系起来，在一个工作流中提出、评估和情境化假设，从而闭合设计循环。PolyFusionAgent共同实现了交互式、证据关联的聚合物发现，结合了大规模表示学习、多模态化学知识和可验证的科学推理。

English abstract

Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

2605.26538 2026-05-27 cs.CV

Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer

调度式风格注入：在免训练扩散风格迁移中扩展风格-内容帕累托前沿

Amey Sunil Kulkarni

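Affiliations: Independent Researcher

AI summary: Systematically explores scheduling along four dimensions (style injection strength across layers and across timesteps, and ControlNet geometric conditioning along both axes) and finds that decreasing schedules, with stronger structural injection in shallower layers and earlier timesteps, outperform increasing ones, and that cosine and square-root timestep schedules outperform linear. Combining gamma scheduling with ControlNet conditioning, which are nearly independent, expands the Pareto frontier and yields a 6.1% relative improvement in ArtFID.

Comments: Accepted to CVPR NTIRE 2026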

AI中文摘要

基于预训练扩散模型的风格迁移已取得快速进展，但一个核心问题仍未充分探索：模型中风格注入应在何处最强？领先的免训练方法StyleID在所有层和时间步上统一使用单个全局参数（gamma），这强制了风格质量与内容保留之间的固定权衡。我们证明这种权衡是不必要的刚性。我们系统地探索了四个控制维度：跨解码器层改变风格注入强度、跨去噪时间步改变强度，以及沿两个轴调度ControlNet几何条件。模式在所有地方一致：递减调度（在较浅层和较早时间步注入更强的结构信号）可靠地优于反向调度。除方向外，调度形状也很重要：余弦和平方根时间步调度优于线性。最重要的是，我们发现gamma调度和ControlNet条件几乎独立。由此产生的组合配置扩展了帕累托前沿，与任何单一基线设置相比，提供了风格保真度和内容保留之间的优越权衡。我们最佳的平衡配置实现了ArtFID 27.036，而StyleID为28.801——相对改进6.1%，在整个风格-内容权衡前沿上具有一致的增益。结果在35种配置（总计超过28,000张风格化图像）上使用四种互补指标进行了验证。这些发现在SD骨干网络上具有相同的排名顺序。所有修改都是免训练、无参数的，仅需几行调度代码；代码可在https://github.com/ameyskulkarni/scheduled_style_injection获取。

English abstract

Style transfer with pre-trained diffusion models has advanced rapidly, but a core question remains underexplored: where in the model should style injection be strongest? StyleID, the leading training-free method, uses a single global parameter (gamma) uniformly across all layers and timesteps, which forces a fixed tradeoff between style quality and content preservation. We show this tradeoff is unnecessarily rigid. We systematically explore four dimensions of control: varying style injection strength across decoder layers, across denoising timesteps, and scheduling ControlNet geometric conditioning along both axes. The pattern is consistent everywhere: decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, reliably outperform the reverse. Beyond direction, schedule shape matters: cosine and square-root timestep schedules outperform linear. Most importantly, we find that gamma scheduling and ControlNet conditioning are nearly independent. The resulting combined configurations expand the Pareto frontier, offering superior tradeoffs between style fidelity and content preservation compared to any single baseline setting. Our best balanced configuration achieves ArtFID of 27.036 versus StyleID's 28.801 - a 6.1% relative improvement, with consistent gains across the full style-content tradeoff frontier. Results are validated across 35 configurations totaling over 28,000 stylized images using four complementary metrics. These findings generalize across SD backbones with identical rank ordering. All modifications are training-free, parameter-free, and require only a few lines of scheduling code; code is available at https://github.com/ameyskulkarni/scheduled_style_injection.
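
To make the schedule shapes concrete, the sketch below interpolates a strength value from a starting level to an ending level over a sequence of steps (layers or timesteps) with a linear, cosine, or square-root shape. It is an assumed illustration: the function name `schedule_value`, its parameters, and the exact formulas are hypothetical and not taken from the linked repository, and which quantity is scheduled and which end counts as "stronger" follow the paper's definitions, which the abstract does not spell out.

```python
import math


def schedule_value(step, num_steps, start, end, shape="cosine"):
    """Interpolate from `start` (step 0) to `end` (last step) with a given shape."""
    # Normalized progress in [0, 1]; a single-step schedule stays at `start`.
    progress = step / (num_steps - 1) if num_steps > 1 else 0.0
    if shape == "linear":
        blend = progress
    elif shape == "cosine":
        blend = 0.5 * (1.0 - math.cos(math.pi * progress))
    elif shape == "sqrt":
        blend = math.sqrt(progress)
    else:
        raise ValueError(f"unknown shape: {shape}")
    return start + (end - start) * blend


# Example: a decreasing cosine schedule over five steps.
cosine_values = [schedule_value(i, 5, 1.0, 0.2, "cosine") for i in range(5)]
```

Setting `start` above `end` gives a decreasing schedule; the three shapes share the same endpoints and differ only in how quickly they move between them.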

2605.26537 2026-05-27 cs.CL

Conceptual Steganography

概念隐写术

Zhejian Zhou, Jonathan May

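Affiliations: University of Southern California Information Sciences Institute

AI summary: Proposes conceptual steganography, which embeds information through patterns of high-level reasoning behavior in the chain of thought rather than through lexical choice. Experiments show it is more robust to paraphrase defenses than standard keyword approaches and does not affect reasoning utility.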

AI中文摘要

语言模型（LM）发出的思维链（CoT）驱动了其大部分能力。然而，承载有用推理的同一序列也可以隐蔽地传递信息：一个未对齐的模型可能在其CoT中嵌入隐蔽信息，从而逃避人类监督，这种隐写术形式被称为编码推理。先前的LM隐写术方案在词元或词汇空间操作，而内容保留释义器是近期工作中规范且有效的防御手段。我们引入了概念隐写术，其中CoT的每一步通过高级推理行为模式而非词汇选择来携带信息。在四个模型家族和两个推理领域中，这种后门通信渠道被证明比标准关键词方法对强释义防御具有更一致的鲁棒性，并且将信息编码到CoT中不会影响其在推理过程中的效用。在提高对这一新风险的认识后，我们进一步证明，一个策略感知的释义器可以关闭大部分渠道，强调了确保野外忠实LLM推理的新挑战和推荐防御措施。

English abstract

Language Models (LMs) emit Chains-of-Thought (CoTs) that drive much of their capability. However, the same sequence that carries useful reasoning can also covertly convey messages: a misaligned model may embed covert information in its CoT that slips through human supervision, a form of steganography known as encoded reasoning. Prior LM steganography schemes operate in the token or lexical space, and a content-preserving paraphraser is the canonical and effective defense in recent work. We introduce conceptual steganography, in which each step of a CoT carries information through patterns of high-level reasoning behavior, rather than through lexical choice. Across four model families and two reasoning domains, this backdoor communication channel is shown to be consistently more robust to a strong paraphrase defense than standard keyword approaches, and the encoding of information into CoTs does not affect their utility in the reasoning process. Having raised awareness of this new risk, we then demonstrate that a strategy-aware paraphraser can close much of the channel, highlighting new challenges and recommended defenses for ensuring faithful LLM reasoning in the wild.

2605.26535 2026-05-27 cs.LG cs.AI cs.CV cs.NA math.NA

Recursive Flow Matching

递归流匹配

Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu

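Affiliations: University of California, San Diego; University of Michigan

AI summary: Proposes the Recursive Flow Matching (RecFM) framework, which uses a self-consistency constraint to align trajectories across discretization scales, enabling high-fidelity one-step or few-step (2-4 step) dynamic generation. On scientific benchmarks it is up to 20 times faster than leading diffusion-based emulators while improving predictive accuracy.

Comments: Project page: https://jhhuangchloe.github.io/RecFM/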

AI中文摘要

生成模型已成为解决物理系统和建模复杂时空动态的强大范式。然而，在不产生高计算成本的情况下实现高物理精度仍然是一个基本挑战，因为现有方法面临关键的速度-保真度权衡。在这项工作中，我们引入了递归流匹配（RecFM），一个用于预测复杂时空动态的生成框架。RecFM强制执行自一致性以对齐跨离散化尺度的轨迹，减少离散化误差并改善基于物理任务的各种指标。据我们所知，这是第一种在科学系统中实现高保真单步和少步（2-4步）动态生成的方法，其性能可与最先进的多步求解器相媲美。在具有挑战性的科学基准测试中，RecFM相比领先的扩散模拟器实现了高达20倍的加速，同时提高了预测精度。此外，与普通流匹配相比，RecFM将均方误差降低了超过15%，为实时科学模拟提供了一种可扩展且高效的解决方案。

English abstract

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20× speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

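Affiliations: School of Computing, Telkom University; Faculty of Engineering and Technology, School of Computing and Artificial Intelligence

AI summary: Proposes a decoupled, edge-deployable pipeline that combines a YOLO26-x-obb detector, a deterministic encoding module, and a QLoRA fine-tuned Qwen-2.5-1.5B model to localize wind turbine blade defects and generate structured reports, substantially outperforming a zero-shot VLM baseline on BLEU-4, hallucination rate, and expert score.

Comments: 23 pages, 6 figures, 9 equations, and 6 tables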

AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成；在当前的实践中，这些任务被分开处理，语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道，由三个组件组成，每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器，在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块，将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型，通过量化低秩适应（QLoRA）在947个合成生成的维护报告上进行适配，从该提示生成结构化的JSON报告。检索增强微调（RAFT）进一步将每个建议基于索引的维护程序。五项消融实验，通过BLEU-4、ROUGE-L、幻觉率（HR）和LLM-as-a-Judge评分标准，将该管道与单一视觉-语言模型（VLM）基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10，而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下，QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明，具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

English abstract

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 = 0.41, HR = 4%, and Expert Score = 8.6/10, compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
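
The Bridge step, a deterministic mapping from box coordinates to grid-referenced tokens, can be sketched in a few lines. Everything specific here is an assumption made for illustration: the 4x4 grid, the `B3`-style cell labels, the use of the box center, the detection dictionary keys, and the prompt wording are hypothetical and not the paper's encoding.

```python
def grid_token(center_x, center_y, image_width, image_height, rows=4, cols=4):
    """Map a box center to a cell label such as 'B3' (row letter, 1-based column)."""
    # Clamp so that points on or beyond the image border land in an edge cell.
    col = min(max(int(center_x / image_width * cols), 0), cols - 1)
    row = min(max(int(center_y / image_height * rows), 0), rows - 1)
    return f"{chr(ord('A') + row)}{col + 1}"


def build_prompt(detections, image_width, image_height):
    """Render each detection as one line of a plain-text prompt."""
    lines = ["Detected defects:"]
    for detection in detections:
        cell = grid_token(
            detection["cx"], detection["cy"], image_width, image_height
        )
        lines.append(
            f"- {detection['label']} at <{cell}> "
            f"(confidence {detection['score']:.2f})"
        )
    return "\n".join(lines)


prompt = build_prompt(
    [{"label": "crack", "cx": 60, "cy": 30, "score": 0.913}], 100, 100
)
```

Because the mapping has no learned parameters, the same box always yields the same token, so any error in the final report can be traced to either the detector or the language model.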

2605.26530 2026-05-27 cs.AI

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

哪些变化重要？通过相关性敏感评估和求解器基础推理实现可信赖的法律AI

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

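Affiliations: National University of Singapore; Griffith University

AI summary: Formulates the legal-relevance-sensitive evaluation problem, introduces a unified evaluation suite, and designs LexGuard, an adversarial multi-agent framework grounded in formal reasoning, to improve legal AI's calibrated sensitivity to legally relevant changes.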

AI中文摘要

法律推理需要区分重要的变化和不重要的变化。法律AI应在法律无关的扰动下保持稳定，但当扰动改变法律实质要点时应发生变化。我们将这一要求形式化为法律相关性敏感评估问题：LLM应仅对法律相关的变化敏感。我们引入了一个统一的评估套件，涵盖司法公平性、鲁棒性和法规混淆场景中的应变化和不应变化评估。我们的评估表明，现有的法律LLM系统性地对法律无关的变化敏感，并且常常无法区分相关的法律要素和法规规则。为了缓解这些失败，我们提出了LexGuard，一个基于形式推理的对抗多智能体框架。LexGuard将法规形式化为可执行约束，使用对抗智能体提取竞争的事实-法规论点，并调用SMT求解器验证法律满足性和逻辑一致性。实验表明，LexGuard通过减少对操纵性框架的脆弱性、改善相似法规之间的区分、限制法律无关属性的影响以及增加良性重述下的一致性，提高了法律推理的可靠性。我们表明，法律可信赖性不仅需要准确性，还需要对法律实质性变化的校准敏感性。

English abstract

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

2605.26526 2026-05-27 cs.LG cs.CR

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

开源权重大语言模型微调防御易受简单攻击

Kevin Kuo, Chhavi Yadav, Virginia Smith

Affiliations: Carnegie Mellon University; Simons Institute, UC Berkeley

AI summary: Finds that defenses for open-weight large language models are susceptible to low-cost attacks such as abliteration and prefilling, and proposes abliteration-resistant tuning (ART), which reduces attack success rates by 10%-20%.

Comments: Main body: 9 pages, 3 figures

AI中文摘要

近期针对开源权重大语言模型（LLMs）的防御措施旨在防止对抗性使用。这些防御措施基于一个假设：新的有害行为是通过微调学习到的，而不是通过越狱模型诱发的。然而，预训练的LLMs已经在许多领域编码了大量有害知识，这引发了一个重要问题：对手能否越狱受保护的模型，在不进行任何微调的情况下实现有害使用？在本文中，我们表明开源权重防御措施容易受到更简单的策略攻击，这些策略虽然众所周知，但尚未针对这些防御措施进行系统评估。具体来说，我们评估了两种低成本攻击——abliteration和prefilling——它们不依赖于基于梯度的优化。在三个有害性评估基准（BeaverTails、HarmBench和AdvBench）上，这些攻击将针对受保护开源权重模型的攻击成功率从低于10%提高到16%-96%的范围。为了缓解这一漏洞，我们引入了abliteration-resistant tuning (ART)，它将基于abliteration的目标纳入训练。ART可以叠加到现有防御措施上，并将abliteration、prefilling及其组合的成功率降低10%-20%。这些发现表明，开源权重模型的攻击面比先前描述的要更广，并且对防御措施的评估应包含更多样化的攻击策略，而不仅仅是对抗性微调。

English abstract

Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. Yet, pretrained LLMs already encode substantial harmful knowledge across many domains, which raises an important question: can an adversary jailbreak safeguarded models to achieve harmful usage without fine-tuning at all? In this paper, we show that open-weight safeguards are susceptible to simpler strategies that, despite being well known, have not been systematically evaluated against these safeguards. Specifically, we evaluate two low-cost attacks--abliteration and prefilling--that do not rely on gradient-based optimization. Across three harmfulness evaluation benchmarks (BeaverTails, HarmBench, and AdvBench), these attacks increase attack success rates against safeguarded open-weight models from below 10% to a range of 16%-96%. To mitigate this vulnerability, we introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training. ART can be layered onto existing defenses and reduces the success rates of abliteration, prefilling, and their combination by 10%-20%. These findings indicate that the attack surface for open-weight models is broader than previously characterized, and that evaluations of safeguarding defenses should incorporate a more diverse set of attack strategies beyond adversarial fine-tuning.

2605.26525 2026-05-27 cs.CV cs.AI

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

ReCA: 通过递归上下文分配实现多镜头长视频外推

Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang

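Affiliations: Monash University; Tongyi Lab, Alibaba Group; Zhejiang University; University of Queensland

AI summary: Targets the context-allocation bottleneck in multi-shot video extrapolation with a recursive context allocation framework that uses hierarchical decomposition and structured state propagation to improve the consistency and quality of long video generation.

Comments: Project page: https://reca.vmv.re, Code: https://github.com/ali-vilab/ReCA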

AI中文摘要

分钟级电影式视频生成是生成式视频模型的核心挑战。现有范式仅解决该挑战的片段：单镜头外推保留锚点但缺乏电影结构，而多镜头叙事施加结构却可自由创造视觉状态而非延续观察到的状态。我们定义多镜头视频外推（MSVE）任务，该任务将观察到的帧或片段扩展为一系列具有电影结构的镜头，同时保留锚点状态并推进叙事意图。该设置受限于短视频模型的每次调用生成预算。我们识别出三个耦合瓶颈：（1）全局规划器从完整剧本中过度指定不支持的细节；（2）镜头级提示在携带完整故事时稀释任务相关状态；（3）时间链将生成帧转变为有损记忆，其中身份、场景、对象和动作状态衰减。MSVE揭示长视频失败不仅是上下文长度的限制，更是上下文分配失败。我们提出递归上下文分配（ReCA），一种推理时框架，在规划和生成之间分层分配上下文。ReCA递归地将MSVE分解为上下文有界子问题，在叶节点调用冻结生成器，并跨时间传播结构化状态更新。为评估该设置，我们进一步提出MSVE-Bench和NB-Q，一种源接地协议，带有专为3至5分钟长视频生成设计的提示，该场景未被现有短视频基准覆盖。与先前方法相比，ReCA在最强竞争控制器上将平均归一化分数提高8%至16%，并将多镜头一致性指标提高28%至43%。查看项目页面：https://reca.vmv.re。

English abstract

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.

2605.26524 2026-05-27 cs.CV cs.AI

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

CmIVTP：面向海事智能的基于跨模态交互的船舶轨迹预测

Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao

Affiliations: Department of Logistics and Maritime Studies, The Hong Kong Polytechnic University; Research Centre for ESG Advancement (RCESGA), The Hong Kong Polytechnic University; School of Navigation, Wuhan University of Technology

AI summary: To address inaccurate vessel trajectory prediction caused by the limits of single-source data, the paper proposes CmIVTP, a cross-modal interaction framework that fuses AIS and CCTV data and uses a target-aware scene encoder and a cross-modal interaction transformer to achieve high-accuracy prediction.

AI Chinese abstract

海事智能交通系统（MITS）对于确保繁忙水域的航行安全和效率至关重要。然而，由于单源数据的局限性，准确的船舶轨迹预测仍然具有挑战性。自动识别系统（AIS）数据对于小型船舶通常稀疏或不可用，而仅靠闭路电视（CCTV）数据无法完全捕捉动态船舶行为。为缓解这些挑战，我们提出了一种基于跨模态交互的船舶轨迹预测（称为CmIVTP）框架，以建模船舶动力学与环境约束之间的复杂交互。具体地，我们引入了一个目标感知场景编码器来提取场景语义特征，有效捕捉船舶-环境交互并提高轨迹预测精度。此外，我们提出了一个跨模态交互变换器，它集成了AIS衍生的运动特征、基于CCTV的环境特征和场景表示。它利用跨模态注意力机制同时捕捉模态内语义和模态间交互，确保动态一致且环境可行的预测。此外，我们通过将历史AIS轨迹聚类为代表性运动模式构建了船舶群体轨迹库，为候选轨迹生成提供了一种高效且可扩展的方法。另外，我们引入了海事多模态数据集增强版（名为Maritime-MmD$^+$），这是一个同步AIS数据和CCTV视频数据的大规模数据集，为多模态轨迹预测研究提供了有力支持。大量实验表明，CmIVTP在多模态驱动的船舶轨迹预测基准上取得了更好的性能。本工作的代码资源可在https://github.com/LouisYxLu/CmIVTP获取。

English abstract

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code for this work is available at https://github.com/LouisYxLu/CmIVTP.

2605.26520 2026-05-27 cs.CV cs.AI

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch: 一种具有自校正视觉草图和逐步奖励的交错推理模型

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

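Affiliations: Shanghai Jiao Tong University; SenseTime Research; Shandong Normal University

AI summary: To address the limits of the text-centric paradigm in long-horizon visual reasoning with vision-language models, the paper proposes InterSketch, which strengthens interleaved visual-textual chain-of-thought through self-correction and stepwise reward mechanisms and outperforms proprietary models such as Gemini-3-Pro on visual reasoning benchmarks.

AI Chinese abstract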

尽管视觉-语言模型（VLM）已展现出多轮视觉推理能力，但其推理轨迹仍相对浅层且以文本为中心，限制了其在复杂视觉挑战中的适用性。相比之下，人类思维通常涉及长程推理，并伴有交错的视觉-文本思维链（VT-CoT）。为弥合这一差距，我们引入InterSketch，一种交错推理模型，通过自校正和逐步奖励机制增强VT-CoT能力。InterSketch使用外部工具动态生成中间视觉草图，并将其与文本推理交错进行，从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体而言，在第一个冷启动阶段，我们提出了一个合成的高质量交错VT-CoT数据集，并引入反思机制，使模型具备多轮交错推理和自校正能力。在后续的强化学习（RL）阶段，我们设计了一种逐步奖励机制，以缓解长程推理中仅端到端监督固有的奖励信号稀疏性问题。在视觉推理基准上的大量实验证明了InterSketch的有效性，其性能甚至超越了Gemini-3-Pro等专有模型。

English abstract

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

2605.26514 2026-05-27 cs.CV cs.AI cs.LG

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

CSV-ViT: 一种使用可变大小皮层超顶点的视觉Transformer用于阿尔茨海默病病理检测

Geonwoo Baek, Ikbeom Jang

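Affiliations: Department of Computer Science and Engineering; Hankuk University of Foreign Studies

AI summary: Proposes an ROI-preserving, vertex-based, variable-sized patching of the cortical surface (cortical supervertices) and a Vision Transformer that tolerates variable-sized patches (CSV-ViT); it outperforms existing surface-based models on three classification tasks: Alzheimer's disease diagnosis, amyloid positivity, and tau positivity.

AI Chinese abstract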

确认阿尔茨海默病（AD）通常依赖于正电子发射断层扫描（PET），该方法仍然昂贵且有创，这促使了基于结构MRI的预筛查的使用。在非欧几里得流形，特别是大脑皮层表面上的深度学习，由于数据的球形拓扑结构面临重大挑战。最近的表面模型已经能够从皮层表面数据中学习；然而，施加基于面的均匀补丁通常会导致补丁边界处的重复顶点。一般来说，许多基于表面的模型对感兴趣区域（ROI）的感知有限，这可能导致非皮层区域（如内侧壁）被包含在内。我们提出了一种皮层表面分块方法，该方法执行保留ROI的、基于顶点的、可变大小的补丁划分。我们将这些皮层表面补丁称为皮层超顶点（CSV）。基于这种表示，我们设计了CSV视觉Transformer（CSV-ViT），这是一种可变大小补丁容忍的视觉Transformer，使用填充和掩码感知的补丁嵌入。我们使用T1加权MRI，并通过将AD相关状态分类为三个类别来评估我们的框架：AD诊断、淀粉样蛋白阳性和tau蛋白阳性。在实验中，CSV-ViT取得了比最近基于表面的模型更高的分类性能。结果表明，所提出的CSV-ViT可能支持在PET或脑脊液确认之前基于MRI的AD相关状态预测。

English abstract

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.
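
To make the idea of padding plus mask-aware handling of variable-sized patches concrete, here is a minimal sketch. It is an assumption-laden stand-in, not the CSV-ViT embedding: the paper presumably uses a learned projection, whereas this example uses a masked mean over scalar vertex features, and the function names are hypothetical.

```python
from typing import List, Tuple

def pad_patches(patches: List[List[float]], pad_value: float = 0.0
                ) -> Tuple[List[List[float]], List[List[int]]]:
    """Pad every patch to the longest patch and return a 0/1 validity mask."""
    max_len = max(len(p) for p in patches)
    padded, mask = [], []
    for p in patches:
        n_pad = max_len - len(p)
        padded.append(list(p) + [pad_value] * n_pad)
        mask.append([1] * len(p) + [0] * n_pad)
    return padded, mask

def masked_mean(padded: List[List[float]], mask: List[List[int]]) -> List[float]:
    """Average only the real vertices of each patch, ignoring padded slots."""
    return [sum(v * m for v, m in zip(row, mrow)) / sum(mrow)
            for row, mrow in zip(padded, mask)]

# Three hypothetical patches with 2, 4, and 1 vertices.
patches = [[2.0, 4.0], [1.0, 2.0, 3.0, 6.0], [5.0]]
padded, mask = pad_patches(patches)
pooled = masked_mean(padded, mask)
print(pooled)
```

Without the mask, the padded zeros would pull the averages of the shorter patches toward zero, which is the distortion a mask-aware embedding is meant to avoid.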

2605.26513 2026-05-27 cs.CV

Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression

Re-M3Dr：重新平衡的多模态均值偏差回归

Haojie Yin, Chengcheng Feng, Tianyi Liu, Tianqi Zhang, Kaizhu Huang

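Affiliations: Duke Kunshan University, China; Xi'an Jiaotong-Liverpool University, China; Soochow University

AI summary: To address multimodal medical image fusion performing worse than unimodal models, the paper proposes the Re-M3Dr framework, which uses adaptive-margin supervised contrastive learning and sharpness-aware gradient modulation for multimodal Mean Deviation regression and reduces mean squared error by an average of 29% on clinical datasets.

AI Chinese abstract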

均值偏差（MD）是评估眼科视野损失的关键指标。虽然以往的工作仅关注从光学相干断层扫描（OCT）预测MD，但直观上假设将OCT与另一种眼底摄影（FP）成像结合可以提高性能，因为两种眼科医学成像提供了互补信息。当应用复杂的多目标优化时，这一点尤其值得期待，正如常见的多模态分类中所记载的那样。令人惊讶的是，我们的研究表明，在这种医学成像场景中，多模态融合的性能不如单模态模型。通过详细分析，我们确定根本原因是数据分布和模态学习冲突之间的耦合不平衡。这种不平衡扭曲了优化景观，导致训练不稳定。为了解决这一挑战，我们提出了重新平衡的多模态均值偏差回归（Re-M3Dr）方法，这是一种新颖的多模态回归框架。我们通过自适应边界的监督对比学习增强单模态表示。然后，我们的框架通过锐度感知梯度调制稳定联合优化。在公共和私人临床数据集上的实验结果表明，与最先进的多模态学习方法相比，均方误差平均降低29%，证明了Re-M3Dr的优越性。代码可在补充材料中获得。

English abstract

Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with a second imaging modality, fundus photography (FP), could improve performance, as the two ophthalmic imaging modalities provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than unimodal models. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representations through adaptive-margin supervised contrastive learning. Then, our framework stabilizes the joint optimization with sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show an average 29% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.

2605.26509 2026-05-27 cs.LG math.PR stat.CO

SIKA-GP: Accelerating Gaussian Process Inference with Sparse Inducing Kernel Approximations for Bayesian Deep Learning

SIKA-GP：利用稀疏诱导核近似加速贝叶斯深度学习中的高斯过程推断

Wenyuan Zhao, Rui Tuo, Chao Tian

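Affiliations: Department of Electrical & Computer Engineering, Texas A&M University, College Station, US; Department of Industrial & Systems Engineering, Texas A&M University, College Station, US

AI summary: Proposes SIKA-GP, which uses sparse inducing kernel approximations built on a dyadic ordered template basis to reduce the complexity dependence of Gaussian process inference on the number of inducing points to $O(\log M)$, enables efficient tensorized GPU computation, and embeds naturally into Bayesian neural networks, significantly speeding up training and inference on vision and Transformer-based language benchmarks without sacrificing predictive performance.

Comments: 20 pages, 8 figures; accepted to the International Conference on Machine Learning (ICML) 2026

AI Chinese abstract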

高斯过程（GPs）为不确定性估计提供了原则性的贝叶斯框架，但其计算复杂度严重限制了在大规模数据集上的可扩展性。我们提出SIKA-GP，该方法使用基于二元有序模板基的稀疏诱导核近似来加速GP推断，对诱导点数量的复杂度依赖仅为${O}(\log M)$。我们的方法从稀疏激活基构建紧凑且表达力强的核表示，从而实现高效的张量化GPU计算，并与现代大规模模型无缝集成。SIKA-GP可以自然地嵌入具有稀疏激活的贝叶斯神经网络（BNNs）中，在训练和推断中均实现显著加速，且不牺牲预测性能。该方法自然地扩展到深度特征学习，解决了深度架构和高维特征表示带来的可扩展性挑战。在视觉和基于Transformer的语言基准上的实验结果表明，我们的方法始终提供快速且准确的GP模型，为可扩展核学习提供了一条原则性路径。

English abstract

Gaussian processes (GPs) provide a principled Bayesian framework for uncertainty estimation, but their computational complexity severely limits scalability to large datasets. We propose SIKA-GP, which accelerates GP inference using sparse inducing kernel approximations based on a dyadic ordered template basis, incurring only ${O}(\log M)$ complexity dependence on the number of inducing points. Our approach constructs compact and expressive kernel representations from sparsely activated bases, enabling efficient tensorized GPU computation and seamless integration with modern large-scale models. SIKA-GP can be naturally embedded into Bayesian neural networks (BNNs) with sparse activations, yielding significant speedups in both training and inference without sacrificing predictive performance. The method naturally extends to deep feature learning, addressing the scalability challenges introduced by deep architectures and high-dimensional feature representations. Empirical results on vision and transformer-based language benchmarks demonstrate that our approach consistently delivers fast and accurate GP models, providing a principled path toward scalable kernel learning.

2605.26503 2026-05-27 cs.CV

Uncertainty-Aware Gaussian Map for Vision-Language Navigation

面向视觉-语言导航的不确定性感知高斯地图

Jianzhe Gao, Rui Liu, Yuxuan Xu, Tongtong Cao, Yingxue Zhang, Zhanguang Zhang, Sida Peng, Yi Yang, Wenguan Wang

Affiliations: The State Key Lab of Brain-Machine Intelligence; Department of Foundation Model, 2012 Labs, Huawei; Noah's Ark Lab, 2012 Labs, Huawei; School of Software Technology, Zhejiang University

AI summary: Proposes an uncertainty-aware Gaussian map that explicitly models three kinds of perceptual uncertainty (geometric, semantic, and appearance) and integrates them into the observation space to make the agent's decisions in vision-language navigation more reliable.

AI Chinese abstract

视觉-语言导航（VLN）要求智能体按照自然语言指令在3D环境中导航。在导航过程中，现有智能体通常遇到感知不确定性，例如缺乏可靠定位的证据或空间线索解释的模糊性，但在预测动作时通常忽略此类信息。在这项工作中，我们显式建模三种形式的感知不确定性（即几何、语义和外观不确定性），并将其整合到智能体的观测空间中，以实现知情决策。具体来说，我们的智能体首先构建一个语义高斯地图（SGM），由从全景观测初始化的可微3D高斯原语组成，编码环境的几何结构和语义内容。在SGM之上，通过高斯位置和尺度的变分扰动估计几何不确定性，以评估结构可靠性；通过扰动高斯语义属性捕获语义不确定性，以揭示模糊解释；通过Fisher信息刻画外观不确定性，该信息衡量渲染观测对高斯级变化的敏感性。这些不确定性被纳入SGM，将其扩展为统一的3D价值地图，将其作为支持可靠导航的可供性和约束。在多个VLN基准上的综合评估显示了我们的智能体的有效性。

English abstract

Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent's observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.

2605.26501 2026-05-27 cs.CV cs.AI

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

揭示视觉-语言模型的脆弱性：通过纹理约束扰动和跨模态优化的多模态对抗协同

Xiang Fang, Wanlong Fang, Changshuo Wang

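Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore; University College London

AI summary: Proposes a multi-modal adversarial synergy framework that jointly optimizes a texture-constrained universal adversarial perturbation and a learnable text prompt perturbation in a black-box setting, revealing the fragility of vision-language models under multi-modal attacks.

Comments: Published in AAAI 2026

AI Chinese abstract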

大型视觉-语言模型（LVLMs）通过整合视觉和文本输入，在图像描述和视觉问答等任务中表现出色，改变了多模态理解。然而，它们对抗攻击的鲁棒性，特别是利用两种模态的攻击，仍未被充分探索，这给自动驾驶和内容审核等关键应用带来了风险。现有攻击集中于单一模态或需要不切实际的白盒访问，限制了其现实相关性。在本文中，我们引入了多模态对抗协同（MMAS），这是一个开创性的框架，用于针对LVLMs构建通用的黑盒多模态攻击。MMAS同时生成纹理尺度约束的通用对抗扰动用于图像，以及可学习的提示扰动用于文本，仅通过模型查询进行联合优化。图像扰动利用基于小波的纹理约束确保在各种视觉输入中的不可感知性和鲁棒性。文本扰动在嵌入空间中受L范数约束，在保持语义连贯性的同时将输出导向目标。一种新颖的跨模态正则化项对齐扰动的梯度方向，增强了它们在任务和模型间的协同影响和可迁移性。大量实验表明，我们提出的攻击在主流LVLMs上具有强大的通用对抗能力。

English abstract

Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack against prevalent LVLMs.

2605.26500 2026-05-27 cs.CV

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

面向视觉-语言导航的开放集语义分组3D高斯地图

Jianzhe Gao, Rui Liu, Wenguan Wang

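Affiliations: The State Key Lab of Brain-Machine Intelligence

AI summary: Proposes a 3D Gaussian Map to represent the environment: an egocentric scene map is built online and enriched with geometric and semantic information through an open-set semantic grouping operation, and a multi-level action prediction strategy is designed on top of it; effectiveness is validated on three public benchmarks.

AI Chinese abstract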

视觉-语言导航（VLN）要求智能体基于自然语言指令遍历复杂的3D环境，这需要对场景有透彻的理解。现有工作为智能体配备了各种场景表示以增强空间感知，但往往忽略了VLN场景中复杂的3D几何和丰富的语义，限制了在多样化和未见环境中的泛化能力。为应对这些挑战，本文提出一种3D高斯地图，将环境表示为一组可微分的3D高斯，并据此开发了用于VLN的导航策略。具体地，通过从稀疏伪激光雷达点云初始化3D高斯来在线构建自中心场景地图，为场景理解提供信息丰富的几何先验。每个高斯基元进一步通过开放集语义分组操作得到增强，该操作基于3D高斯在开放世界中属于对象实例或材质类别的成员关系对其进行分组，形成统一的3D高斯地图。基于该地图，设计了多层级动作预测策略，结合多粒度的空间-语义线索，辅助智能体进行决策。在三个公开基准（即R2R、R4R和REVERIE）上进行的大量实验验证了我们方法的有效性。

English abstract

Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, an Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.

2605.26498 2026-05-27 cs.CL

Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation

Verilog-Evolve：反馈驱动与技能演进的Verilog生成

Zehua Pei, Hui-Ling Zhen, Yu Zhang, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

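Affiliations: The Chinese University of Hong Kong; Huawei Technologies Co., Ltd

AI summary: Proposes the Verilog-Evolve framework, which iteratively refines Verilog code using executable feedback (functional simulation, Yosys synthesis, an ABC timing proxy, and more) and improves generation quality through cross-session skill evolution; experiments on VerilogEval and mixed-precision GEMM tasks show higher functional success rates and more downstream-friendly RTL.

AI Chinese abstract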

大型语言模型（LLMs）改进了从自然语言规范生成Verilog的过程，但大多数流水线仍将生成视为孤立的采样后功能检查。这对于实际的RTL设计是不够的，因为有用的Verilog必须正确、可综合、考虑时序，并对下游硬件目标友好。我们提出Verilog-Evolve，一个用于版本化Verilog细化和跨会话技能演进的反馈驱动框架。对于每个任务，Verilog-Evolve生成多样化的次要候选，通过功能仿真、Yosys综合、ABC时序代理以及可选的GEMM指标的可执行反馈进行评估，然后在可配置评分下将最佳候选提升为主要版本。为了跨任务改进，系统维护模块化技能指导，根据任务和反馈上下文检索技能，并通过创建/改进/跳过决策和验证器报告从记录的历史中演进候选技能。在VerilogEval和混合精度GEMM任务上的实验表明，Verilog-Evolve提高了最终功能成功和晋升稳定性，同时在开源综合、时序代理和网表级GEMM目标下生成更下游友好的RTL。验证门控的技能演进进一步提高了GEMM下游质量，并在评估的技能模式中实现了最佳下游分数和GEMM保留通过率。

English abstract

Large language models (LLMs) have improved Verilog generation from natural-language specifications, but most pipelines still treat generation as isolated sampling followed by functional checking. This is insufficient for practical RTL design, where useful Verilog must be correct, synthesizable, timing-conscious, and friendly to downstream hardware objectives. We present Verilog-Evolve, a feedback-driven framework for versioned Verilog refinement and cross-session skill evolution. For each task, Verilog-Evolve generates diverse minor candidates, evaluates them with executable feedback from functional simulation, Yosys synthesis, ABC timing proxy, and optional GEMM metrics, then promotes the best candidate into a major version under configurable scoring. To improve across tasks, the system maintains modular skill guidance, retrieves skills according to task and feedback context, and evolves candidate skills from logged histories through create/improve/skip decisions and verifier reports. Experiments on VerilogEval and mixed-precision GEMM tasks show that Verilog-Evolve improves final functional success and promotion stability while producing more downstream-friendly RTL under open-source synthesis, timing-proxy, and netlist-level GEMM objectives. Validation-gated skill evolution further improves GEMM downstream quality and achieves the best downstream score and GEMM held-out pass rate among the evaluated skill modes.
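
The step of promoting the best minor candidate under configurable scoring can be sketched as follows. This is an illustrative guess at the shape of such a selection step, not the authors' code: the `Candidate` fields, the weight names, and the rule that failing candidates are never promoted are all assumptions, and no real simulator or synthesis tool is invoked.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Candidate:
    name: str
    passes_simulation: bool   # result of a functional testbench (hypothetical)
    synthesizes: bool         # whether synthesis completed (hypothetical)
    cell_count: int           # area proxy taken from a synthesis report
    delay_ns: float           # timing-proxy estimate

def score(c: Candidate, weights: Dict[str, float]) -> float:
    """Higher is better; candidates that fail a hard check score -inf."""
    if not (c.passes_simulation and c.synthesizes):
        return float("-inf")
    return -(weights["area"] * c.cell_count + weights["delay"] * c.delay_ns)

def promote(cands: List[Candidate], weights: Dict[str, float]) -> Optional[Candidate]:
    """Return the best-scoring candidate, or None if every candidate fails."""
    best = max(cands, key=lambda c: score(c, weights))
    return best if score(best, weights) > float("-inf") else None

minors = [
    Candidate("v1.1", False, True, 80, 2.0),   # smallest, but functionally wrong
    Candidate("v1.2", True, True, 120, 3.0),
    Candidate("v1.3", True, True, 100, 4.0),
]
area_first = promote(minors, {"area": 1.0, "delay": 10.0})
timing_first = promote(minors, {"area": 1.0, "delay": 30.0})
print(area_first.name, timing_first.name)
```

Changing only the weights switches the promoted version from `v1.3` to `v1.2`, which is the practical meaning of "configurable" in this toy setting.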

2605.26496 2026-05-27 cs.LG cs.AI

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Dense2MoE：通过统一剪枝和升级推动设备端LLM的帕累托前沿

Fengfa Li, Hongjin Ji, Yifeng Ding, Lei Ren, Chen Wei

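Affiliations: Li Auto; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the Dense2MoE framework, which unifies pruning and upcycling through Layer Fusion UpCycling (LF-UC) to efficiently convert dense LLMs into on-device MoE models, achieving a better Pareto frontier between inference latency and accuracy.

Comments: 19 pages

AI Chinese abstract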

混合专家（MoE）架构对于资源受限的设备端部署极具前景，但从头训练这些模型成本高昂。当前方法试图通过将密集模型升级为MoE来缓解这一问题，然而它们常常引入参数冗余，降低推理效率。另一方面，标准层剪枝减少了冗余，但不可避免地损害模型准确性。为解决这一困境，我们提出Dense2MoE，一种通过层融合升级（LF-UC）统一剪枝和升级的新框架。在硬件Roofline理论指导下，Dense2MoE通过剪枝来自冗余层的带宽密集型注意力模块，同时将其多层感知机（MLP）重新用作MoE专家，系统地克服了推理内存墙。这种结构创新保留了模型的核心能力，并通过选择性令牌路由严格限制活跃参数。借助适度的持续预训练预算，Dense2MoE高效地将公开可用的密集LLM转换为设备端就绪的MoE模型。大量实验表明，Dense2MoE显著推进了设备端推理延迟与模型准确性的帕累托前沿，优于密集基线、最先进的压缩方法和标准升级方法。

English abstract

The Mixture-of-Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from scratch incurs prohibitive costs. Current methods attempt to alleviate this by upcycling dense models into MoEs; however, they often introduce parameter redundancy that degrades inference efficiency. Alternatively, standard layer pruning mitigates redundancy but inevitably compromises model accuracy. To resolve this dilemma, we propose Dense2MoE, a novel framework that unifies pruning and upcycling through Layer Fusion UpCycling (LF-UC). Guided by hardware Roofline theory, Dense2MoE systematically overcomes the inference memory wall by pruning bandwidth-heavy attention modules from redundant layers while repurposing their Multi-Layer Perceptrons (MLPs) into MoE experts. This structural innovation preserves the model's core capabilities and strictly limits active parameters via selective token routing. With a modest continual pre-training budget, Dense2MoE efficiently converts publicly available dense LLMs into on-device-ready MoE models. Extensive experiments demonstrate that Dense2MoE significantly advances the Pareto frontier for on-device inference latency versus model accuracy, outperforming dense baselines, state-of-the-art compression, and standard upcycling methods.

2605.26494 2026-05-27 cs.AI cs.CL cs.LG

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2系列：小激活释放最大现实智能

MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge

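Affiliations: MiniMax

AI summary: Introduces the MiniMax-M2 series of Mixture-of-Experts language models, which reach frontier performance with a small number of activated parameters; the core components are agent-driven data pipelines, the scalable reinforcement learning system Forge, and the self-evolving M2.7 checkpoint.

Comments: Technical Report. 35 pages, 10 figures, 4 tables

AI Chinese abstract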

我们介绍了MiniMax-M2系列，这是一个基于“小激活可以释放最大现实智能”原则构建的混合专家语言模型家族。旗舰版M2总参数量为229.9B，每个token仅激活9.8B参数。M2系列专为智能体部署而端到端设计，包含三个组成部分：(i) 智能体驱动数据管道，生成大规模、可验证的轨迹，涵盖智能体编码和智能体协作，每个轨迹都基于可执行工作空间和与工件对齐的奖励；(ii) Forge，一个可扩展的智能体原生强化学习系统，适应长程智能体轨迹，并配有窗口FIFO调度、前缀树合并、推理优化以及支持白盒和黑盒智能体的干净训练-推理-智能体解耦；(iii) 最新的M2.7检查点向自我进化迈出了早期一步——自主调试训练运行并修改其自身框架。从M2到M2.7，这种组合将小激活足迹转化为智能体编码、深度搜索、办公任务和推理基准上的前沿性能。

English abstract

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint, which takes an early step toward self-evolution by autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.
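
The two parameter counts quoted in this abstract fix the activation ratio, which the short calculation below makes explicit. Only the two figures come from the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract, in billions of parameters.
total_params_b = 229.9
active_params_b = 9.8

# Fraction of the model's parameters used for each token.
active_fraction = active_params_b / total_params_b
# How many times larger the full model is than the per-token active set.
sparsity_factor = total_params_b / active_params_b

print(f"active fraction: {active_fraction:.2%}")
print(f"total / active: {sparsity_factor:.1f}x")
```

So roughly 4.3% of the parameters are active per token, or about one in every 23.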

2605.26492 2026-05-27 cs.CL cs.AI cs.LG

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

灯塔中的伊莱亚斯，再次？诊断LLM故事中的低多样性

Sil Hamilton, David Mimno

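Affiliations: Department of Information Science

AI summary: By sampling 20,000 stories, the study finds that LLM-generated stories contain highly repetitive vocabulary (such as Elias and lighthouse) and that these words come from preference data rather than pre-training data, suggesting that combining small datasets with strong alignment algorithms may have a disproportionate effect on diversity.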

2605.26491 2026-05-27 cs.LG cs.CV

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

超越成对偏好：扩散模型的列表级奖励感知对齐

Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

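Affiliations: Caltech; Stanford University

AI summary: Proposes Diffusion LAIR, a reward-aware listwise optimization method that uses continuous reward scores and all candidate images at once to optimize diffusion models, outperforming pairwise preference baselines on tasks such as text-to-image generation.

AI Chinese abstract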

偏好优化已成为从人类反馈中进行在线强化学习（RLHF）的一种高效替代方案，用于对齐文本到图像扩散模型。然而，现有方法大多将监督简化为二元成对比较。当训练数据自然包含同一提示的多个候选图像，并且连续奖励分数能提供比单一赢家-输家标签更丰富的信息时，这种成对简化具有局限性。为解决这些局限性，我们提出了Diffusion LAIR，一种用于扩散模型的奖励感知列表级偏好优化方法。对于每个提示，LAIR将一组候选图像的奖励分数转换为居中优势权重，然后在隐式奖励上优化优势加权回归目标，隐式奖励定义为当前模型相对于固定参考模型的去噪损失改进，并带有二次惩罚以正则化隐式奖励的幅度。所得目标同时使用所有候选图像而非选择成对，并通过显式控制隐式奖励的幅度保持保守性。LAIR目标在隐式奖励空间中具有有界闭式最优解，阐明了正则化强度如何控制偏好更新的幅度。实验表明，Diffusion LAIR在SD1.5和SDXL上，在文本到图像生成、组合生成和图像编辑基准测试中均优于强偏好优化基线。

English abstract

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
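
One way to read the objective described in this abstract is sketched below. The exact loss in the paper may differ; this sketch assumes a per-candidate term of the form $-A_i s_i + \tfrac{\beta}{2} s_i^2$, where $A_i$ is the centered advantage, $s_i$ is the implicit reward, and $\beta$ is the regularization strength. Under that assumption the minimizer is $s_i^* = A_i / \beta$, which is bounded whenever the rewards are. All function names are hypothetical.

```python
from typing import List

def centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group mean so advantages sum to zero within a prompt."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def listwise_loss(implicit: List[float], adv: List[float], beta: float) -> float:
    """Assumed form: advantage-weighted term plus a quadratic penalty."""
    terms = [-a * s + 0.5 * beta * s * s for a, s in zip(adv, implicit)]
    return sum(terms) / len(terms)

def closed_form_optimum(adv: List[float], beta: float) -> List[float]:
    """Minimizer of the assumed loss: each implicit reward equals A_i / beta."""
    return [a / beta for a in adv]

rewards = [4.0, 2.0, 0.0, 2.0]      # hypothetical scores for four candidates
adv = centered_advantages(rewards)
best = closed_form_optimum(adv, beta=4.0)
loss_at_best = listwise_loss(best, adv, beta=4.0)
loss_perturbed = listwise_loss([s + 0.1 for s in best], adv, beta=4.0)
print(adv, best, loss_at_best < loss_perturbed)
```

A larger `beta` shrinks every optimal implicit reward toward zero, which matches the abstract's remark that the regularization strength controls the magnitude of the preference update.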

2605.26489 2026-05-27 cs.LG

The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

奇异分布的稳定性：语言模型预训练两阶段动力学的谱视角

Hongtao Zhang, Wenjie Zhou, Chenxi Jia, Wei Chen, Xueqi Cheng

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of Mathematics, Southeast University, Nanjing, China

AI summary: The paper identifies the early stabilization of the singular value spectrum during language model pre-training (SoSD), shows that it is synchronized with the slow-descent phase, and uses theoretical analysis to show that growing weight norms trigger an SoSD threshold that bounds the subsequent rate of loss decrease.

AI Chinese abstract

大型语言模型预训练通常表现出两阶段轨迹：快速的初始损失下降，随后是长时间的缓慢改善。我们识别出一个潜在的谱现象——奇异分布的稳定性（SoSD），其中迹归一化的奇异值谱早期就稳定下来，即使参数矩阵继续演化。我们证明，SoSD与慢下降阶段之间的同步在不同架构（GPT-2、LLaMA）和设置中广泛存在，包括各种调度（Step-wise、WSD、Cosine Decay）、权重衰减和优化器（AdamW、Muon）。通过分析一个简化的Transformer，我们证明权重范数的增长不可避免地会引发早期的SoSD阈值，之后损失下降速率在理论上受限于奇异分布的变化。我们进一步解释了WSD和Muon等策略通过调节SoSD尺度来影响预训练动态，从而为理解高效预训练动力学提供了谱视角。

Abstract

Large language model pre-training typically exhibits a two-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement. We identify an underlying spectral phenomenon, Stability of Singular Distribution (SoSD), where the trace-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve. We demonstrate that synchronization between SoSD and the slow-descent regime is widely observed across diverse architectures (GPT-2, LLaMA) and settings, including various schedules (Step-wise, WSD, Cosine Decay), weight decays, and optimizers (AdamW, Muon). By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre-training dynamics.
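
To make the central quantity concrete, here is a minimal sketch that assumes "trace-normalized" means dividing the singular values by their sum (the trace norm). The abstract does not define the normalization, so this reading, the L1 distance used in `spectrum_shift`, and both function names are assumptions made for illustration.

```python
import numpy as np

def trace_normalized_spectrum(weight):
    """Singular values divided by their sum, so the entries add up to 1."""
    s = np.linalg.svd(weight, compute_uv=False)
    return s / s.sum()

def spectrum_shift(w_old, w_new):
    """L1 distance between two normalized spectra (an illustrative choice)."""
    diff = trace_normalized_spectrum(w_new) - trace_normalized_spectrum(w_old)
    return float(np.abs(diff).sum())

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))

# Rescale the matrix and rotate it by a random orthogonal factor: the entries
# change a lot, but the normalized singular value distribution does not.
q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
w_scaled_rotated = 3.0 * w @ q

shift = spectrum_shift(w, w_scaled_rotated)
param_change = float(np.linalg.norm(w_scaled_rotated - w))
```

Here `shift` stays at numerical zero while `param_change` is large, a toy version of a spectrum that is stable even as the parameter matrix keeps evolving.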


2605.26486 2026-05-27 cs.CV

LongCat-Video-Avatar 1.5 Technical Report

Meituan LongCat Team, Xunliang Cai, Meng Cheng, Feng Gao, Zhe Kong, Jiamu Li, Le Li, Weiheng Li, Hongyu Liu, Shuai Tan, Xiaoming Wei, Tianyu Yang, Yong Zhang

Affiliations: Meituan LongCat Team

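AI summary: This paper presents LongCat-Video-Avatar 1.5, an open framework that achieves accurate lip synchronization, full-body temporal stability, and long-video generation by upgrading the audio encoder, refining the training strategy, curating data, and applying RLHF training. On the authors' benchmark it matches or exceeds the performance of commercial systems.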

Comments Homepage: https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/ Github: https://github.com/meituan-longcat/LongCat-Video


Abstract

Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.


2605.26485 2026-05-27 cs.CV cs.CL

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

Affiliations: CUHK MMLab; SJTU; NTU; McMaster; CityUHK; JUFE

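AI summary: Proposes the OmniInteract benchmark, which evaluates the real-time interaction ability of omnimodal large models through online inference over audio-visual streams, and finds that existing models remain weak in streaming interaction.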


Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
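
The slot structure described above (a trigger, a response window, and a target answer) might be represented roughly as follows. This is a hypothetical sketch: the field names, the example values, and the simple `in_window` check are assumptions, and it does not implement IA-QTF1 or any other metric from the benchmark, whose definitions the abstract does not give.

```python
from dataclasses import dataclass

@dataclass
class ResponseSlot:
    """One temporally grounded response slot (field names are hypothetical)."""
    trigger_time: float   # seconds into the stream when the trigger occurs
    window_start: float   # earliest acceptable response time, in seconds
    window_end: float     # latest acceptable response time, in seconds
    target_answer: str
    kind: str             # "1Q1A" or "1QnA"

def in_window(slot, response_time):
    """True if a response at `response_time` falls inside the slot's window."""
    return slot.window_start <= response_time <= slot.window_end

# Made-up example slot; the values are not taken from the dataset.
slot = ResponseSlot(trigger_time=12.0, window_start=12.0, window_end=17.0,
                    target_answer="Turn off the stove.", kind="1Q1A")
on_time = in_window(slot, 14.5)
too_late = in_window(slot, 20.0)

# Slot counts reported in the abstract: 1Q1A and 1QnA slots sum to the total.
ONE_Q_ONE_A, ONE_Q_N_A = 1062, 368
total_slots = ONE_Q_ONE_A + ONE_Q_N_A
```

A model that answers correctly but outside the window would presumably be penalized on timing, which is why the window is kept separate from the target answer here.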

