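arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.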

2606.01038 2026-06-02 cs.RO

Robust Integrated Planning and Control for Quadrotors in Dynamic Environments via NMPC with CBF Penalties

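Zeinab Shayan, Mohammadreza Izadi, Reza Faieghi

Affiliations: Autonomous Vehicles Laboratory, Department of Aerospace Engineering, Toronto Metropolitan University

AI summary: Proposes a robust integrated planning and control strategy that embeds control barrier functions as exponential penalties in nonlinear model predictive control, adds a high-gain disturbance observer and a Kalman filter for robustness, and achieves safe obstacle avoidance in dynamic environments.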

Comments Accepted to Conference on Robots and Vision (CRV 2026), Vancouver, Canada

DOI: 10.21428/d82e957c.79116e3d

Abstract

This paper presents a new robust integrated planning and control (IPC) strategy for multirotor uncrewed aerial vehicles. We propose a nonlinear model predictive control (NMPC) formulation that embeds control barrier functions (CBFs) as exponential penalties, improving feasibility while ensuring smooth obstacle avoidance under tight input bounds. The penalty weights provide a practical tuning knob to trade off tracking accuracy against avoidance aggressiveness. We enhance the system robustness by employing a high-gain disturbance observer (HGDO) to estimate and compensate for external disturbances. We also incorporate a Kalman filter (KF) for computationally efficient, real-time prediction of obstacle motion, enabling avoidance of moving obstacles. Comparative studies against both conventional NMPC and NMPC with hard CBF constraints, validated in Gazebo and hardware experiments, demonstrate superior feasibility, safety, and robustness. To the best of our knowledge, this is the first hardware-validated NMPC-CBF IPC framework, offering a practical step toward safe quadrotor deployment in dynamic environments.
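
The abstract does not give the penalty's functional form, so the following is only a minimal sketch of how a CBF condition could enter an NMPC stage cost as an exponential penalty. The distance-based barrier, the discrete-time CBF residual, and the parameter names `gamma`, `weight`, and `sharpness` are illustrative assumptions, not the authors' formulation.

```python
import math


def barrier(pos, obs, radius):
    """Assumed barrier h = ||pos - obs||^2 - radius^2; h >= 0 means outside the obstacle."""
    return sum((p - o) ** 2 for p, o in zip(pos, obs)) - radius ** 2


def cbf_residual(pos_k, pos_next, obs_k, obs_next, radius, gamma):
    """Discrete-time CBF residual h(x_{k+1}) - (1 - gamma) * h(x_k); >= 0 when satisfied."""
    h_k = barrier(pos_k, obs_k, radius)
    h_next = barrier(pos_next, obs_next, radius)
    return h_next - (1.0 - gamma) * h_k


def cbf_penalty(residual, weight, sharpness):
    """Soft penalty that grows exponentially as the CBF residual becomes negative."""
    return weight * math.exp(-sharpness * residual)


def stage_cost(pos_k, pos_next, ref_next, obs_k, obs_next,
               radius=0.5, gamma=0.3, weight=1.0, sharpness=2.0):
    """Tracking error plus the CBF penalty for one step of the prediction horizon."""
    tracking = sum((p - r) ** 2 for p, r in zip(pos_next, ref_next))
    residual = cbf_residual(pos_k, pos_next, obs_k, obs_next, radius, gamma)
    return tracking + cbf_penalty(residual, weight, sharpness)
```

In this sketch, raising `weight` makes avoidance more aggressive at the expense of tracking, which mirrors the tuning knob the abstract describes. The predicted obstacle positions `obs_k` and `obs_next` would come from a motion predictor such as the Kalman filter mentioned above.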

2606.01036 2026-06-02 cs.RO

Position: Good Embodied Reward Models Need Bad Behavior Data

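Ran Tian, Yilin Wu, Andrea Bajcsy

AI summary: Argues that reliable embodied reward models require the community to invest in "bad" robot data (failed, suboptimal, error-prone, and even hazardous behaviors), and shows experimentally that even a small amount of real bad-behavior data improves alignment with human preferences.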

Comments This position paper has been accepted by the ICML 2026 position track as a spotlight paper

Abstract

This position paper argues that to obtain reliable embodied reward models, the community must invest in "bad" robot data: failed, suboptimal, error-prone, and even hazardous behaviors. While reward models are central to any foundation model's lifecycle, today's embodied reward models are trained primarily on successful behaviors. We analyze three state-of-the-art embodied reward models and find that they systematically over-reward behaviors that real human evaluators would penalize, including unsafe interactions, poor execution, and shortcut strategies that only superficially satisfy tasks. We attribute these failures to a key data gap: the scarcity of negative embodied data, which is costly to collect and often filtered out or withheld in existing robotics datasets. Furthermore, we show that even modest exposure to real bad behavior data can improve alignment with human preferences and reduce costly false positives. We therefore call on the embodied AI community to curate and release their bad robot data, build synthetic bad data generation engines, develop more decentralized physical evaluation systems, and design benchmarks for fine-grained embodied reward model evaluations.

2606.01034 2026-06-02 cs.CL stat.ME

A Finite-Calibration Regime Map for LLM Judge Panels

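Bin Zhu, Yanghui Rao

Affiliations: School of Computer Science and Engineering

AI summary: Studies the calibration tradeoff between low-dimensional stackers and joint output tables for LLM judge panels under a finite human-label budget, proposes Finite-Calibration Panel Selection, and finds that most judge outputs are additive or redundant.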

Comments Work in Progress

Abstract

We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset-budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.
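
As a rough illustration of why joint tables pay for cell counts and unseen patterns, the sketch below counts the cells of a joint output table and measures the unseen mass of a test set. Binary judge verdicts and the helper names `joint_table_cells` and `unseen_mass` are assumptions made for illustration; the abstract does not describe the paper's estimator.

```python
def joint_table_cells(num_judges, levels=2):
    """Number of cells in a joint table when each judge emits one of `levels` verdicts."""
    return levels ** num_judges


def unseen_mass(calibration_patterns, test_patterns):
    """Fraction of test verdict patterns that never appeared in the calibration set."""
    seen = set(calibration_patterns)
    unseen = sum(1 for pattern in test_patterns if pattern not in seen)
    return unseen / len(test_patterns)


# Toy verdict patterns for a three-judge panel (hypothetical data).
calibration = [(1, 1, 0), (0, 0, 0), (1, 1, 1)]
test = [(1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 0, 1)]

print(joint_table_cells(7))             # 128 cells for seven binary judges
print(unseen_mass(calibration, test))   # 0.5
```

With a small label budget, most of those 128 cells receive few or no calibration labels, which is the regime where a low-dimensional stacker would be expected to win.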

2606.01033 2026-06-02 cs.AI

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

Bohan Yang, Yijun Gong, Zhi Zhang, Ge Zhang, Wenpeng Xing, Meng Han

Affiliations: Binjiang Institute of Zhejiang University; Beijing Normal-Hong Kong Baptist University; Zhejiang University; GenTel.io; Great Bay University

AI summary: Proposes TriLens, which reads the logit-lens output entropy of the multi-head self-attention, feed-forward, and residual-stream outputs at every Transformer layer to build a compact 3L-dimensional trajectory for detecting LLM hallucinations.

Abstract

When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different internal pathways may remain uncertain, disagree in how quickly they sharpen, or commit to competing continuations before the output is produced. We introduce TriLens, a white-box detector that turns this intuition into a compact representation: at every layer, it reads the multi-head self-attention output, the feed-forward output, and the residual stream through the model's own logit lens, then records only the entropy of each readout. The resulting 3L-dimensional trajectory describes how certainty forms across depth and across modules, without storing high-dimensional hidden states or sampling multiple generations. This simple signal yields a strong detector across instruction-tuned LLMs and QA benchmarks, and our analyses show that the three module-wise entropy trajectories provide complementary evidence. TriLens suggests that hallucination detection can benefit from tracking how internal computation settles, not only what the final layer predicts.
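
The feature construction described above can be sketched in a few lines. This is a toy version under stated assumptions: hidden vectors and the unembedding matrix are plain Python lists, the final layer normalization that a logit lens usually applies is omitted, and the names `logit_lens_entropy` and `trilens_features` are hypothetical rather than taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logit_lens_entropy(hidden, unembed):
    """Project a hidden vector through the unembedding rows and return the readout entropy."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return entropy(softmax(logits))


def trilens_features(layers, unembed):
    """Return a 3L-dimensional list: one entropy per module (attn, ffn, resid) per layer."""
    features = []
    for layer in layers:
        for module in ("attn", "ffn", "resid"):
            features.append(logit_lens_entropy(layer[module], unembed))
    return features
```

A downstream classifier would then be trained on these 3L numbers per example; the abstract does not say which classifier the authors use.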

2606.01028 2026-06-02 cs.LG

MedGym: A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

Yuepeng Wang, Ken Kawano, Yongqi Zhou, Yoshihiko Fujisawa, Richard Weiss, Akifumi Wachi, Katsuki Fujisawa, Ying Chen, Mehrshad Sadria, Xin Liu, Kyoung-Sook Kim, Xiao Hu, Sebastien Gros, Xun Shen

Affiliations: Tokyo University of Agriculture and Technology; Institute of Science Tokyo; National University of Singapore; LY Corporation; Altos Labs, Inc.; National Institute of Advanced Industrial Science and Technology (AIST); Emory University; Norwegian University of Science and Technology

AI summary: Proposes the MedGym benchmark, which builds configurable medical RL environments with a continuous-time framework and physics-informed neural networks, supports comparing discrete-time and continuous-time methods under irregular treatment intervals, and evaluates clinical criteria such as personalization and trajectory-level safety.

Abstract

Medical treatment recommendation poses several challenges to reinforcement learning (RL): patient physiology evolves in continuous time, measurements and interventions are performed at irregular intervals, and treatment effects vary substantially across individuals. Existing RL formulations and simulated environments, however, are based on discrete-time MDP or POMDP abstractions with fixed or pre-specified decision intervals. Thus, it remains difficult to evaluate whether RL methods can handle time-interval-dependent disease progression, personalized treatment response, and safety between consecutive measurement points. To address this gap, we introduce MedGym, a benchmark environment for dynamic treatment recommendation. MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. Besides, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment. By providing a standardized and configurable benchmark for continuous-time dynamic treatment, MedGym aims to facilitate more realistic and informative evaluation of medical RL methods.

2606.01027 2026-06-02 cs.RO

$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo

Affiliations: Shanghai Innovation Institute; AGIBOT Finch

AI summary: Proposes $\tau_0$-WM, a unified video-action world model that integrates policy learning, video prediction, and action evaluation on a shared video diffusion backbone, and outperforms baselines on long-horizon and fine-grained manipulation tasks.

Comments Our project homepage: https://finch.agibot.com/research/tau0-wm

Abstract

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $\tau_0$-World Model ($\tau_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $\tau_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately 27,300 hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $\tau_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $\tau_0$-WM shows superior performance over other relevant baselines.

2606.01026 2026-06-02 cs.CL

Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models

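Longxuan Yu, Shaorong Zhang, Yu Fu, Hui Liu, Yue Dong, Greg Ver Steeg

Affiliations: University of California, Riverside; Microsoft

AI summary: Addresses the unused ability of masked diffusion language models to revise visible tokens during denoising by proposing D3IM, a sampler that needs no extra modules, and SCOPE, a lightweight post-training method, which together substantially improve math and code generation performance.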

Comments 8 pages, 2 figures, 10 tables

Abstract

Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or learned mechanisms to revise committed tokens, or remask them back to [MASK] before re-predicting; a principled sampler that directly revises visible tokens without auxiliary modules remains underexplored. We introduce D3IM, a parameter-free sampler derived as a corrector-style reverse update that permits direct visible-to-visible revision without additional modules or auxiliary passes. D3IM also reveals a model-side obstacle we term preservation bias: the model tends to reproduce its own wrong committed tokens rather than correct them. We address this with SCOPE (Self-Conditioned On Prediction Errors), a lightweight post-training procedure that simulates D3IM's sampling process. On LLaDA-8B at 64 denoising steps, SCOPE+D3IM improves over the original LLaDA-8B with standard unmasking by +13.0 on GSM8K (68.3%), +4.8 on MATH-500 (23.6%), +15.3 on HumanEval (29.3%), and +10.4 on MBPP (30.8%), with gains that increase as more denoising steps are used on math and HumanEval.

2606.01024 2026-06-02 cs.CL cs.AI

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

Longxuan Yu, Yunshu Wu, Yu Fu, Siheng Xiong, Rob Brekelmans, Hui Liu, Yue Dong, Greg Ver Steeg

Affiliations: University of California, Riverside; Georgia Institute of Technology; Microsoft

AI summary: Lightly adapts a pretrained masked diffusion language model (LLaDA-8B-Instruct) with Discrete Stochastic Localization (DSL) to support denoising in continuous embedding space, producing high-quality summaries at low step counts while avoiding the length-quality tradeoff.

Comments 8 pages, 4 figures, 28 tables

Abstract

Discrete Masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue-pretrain for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (<=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.

2606.01022 2026-06-02 cs.CV cs.AI

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng

Affiliations: School of Computer Science & Zhiyuan College; Shanghai Jiao Tong University; Kuaishou Technology

AI summary: Proposes the ProductWebGen benchmark for evaluating how well multimodal generative models produce consistent product showcase webpages from product images and instructions, and compares an editing-based workflow with a unified-model-based workflow.

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3817507

Abstract

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation -- one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.

2606.01021 2026-06-02 cs.CV

Learning Neural Deformation Representation for 4D Dynamic Shape Generation

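Gyojin Han, Jiwan Hur, Jaehyun Choi, Junmo Kim

Affiliations: Korea Advanced Institute of Science and Technology

AI summary: Proposes a new neural deformation representation combined with conditional neural signed distance fields, yielding a 4D representation architecture that disentangles the motion and shape latent spaces, and uses a diffusion model to generate 4D dynamic shapes with high quality and temporal consistency.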

Comments ECCV 2024

Abstract

Recent developments in 3D shape representation opened new possibilities for generating detailed 3D shapes. Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. HyperDiffusion, a previous work on 4D generation, proposed a method of directly generating the weight parameters of 4D occupancy fields but suffered from low temporal consistency and slow rendering speed due to motion representation that is not separated from the shape representation of 4D occupancy fields. Therefore, we propose a new neural deformation representation and combine it with conditional neural signed distance fields to design a 4D representation architecture in which the motion latent space is disentangled from the shape latent space. The proposed deformation representation, which works by predicting skinning weights and rigid transformations for multiple parts, also has advantages over the deformation modules of existing 4D representations in understanding the structure of shapes. In addition, we design a training process of a diffusion model that utilizes the shape and motion features that are extracted by our 4D representation as data points. The results of unconditional generation, conditional generation, and motion retargeting experiments demonstrate that our method not only shows better performance than previous works in 4D dynamic shape generation but also has various potential applications.

2606.01020 2026-06-02 cs.AI cs.LG

Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation

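Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury, Mrinmaya Sachan

Affiliations: ETH Zurich; ETH AI Center

AI summary: Proposes LFTutor, an intelligent tutoring system that uses large language models together with Socratic questioning and critical argumentation principles to help laypeople learn to identify logical fallacies, and significantly outperforms baseline models.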

Comments This paper has been accepted to Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Long Paper), Main Conference

Journal ref Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026

2606.01019 2026-06-02 cs.CL cs.AI

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

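Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard

Affiliations: Thoughtworks; Nvidia

AI summary: Proposes Hybrid Verified Decoding, which predicts the accepted length of a cached draft and dynamically chooses between cache verification and a model-based drafter, achieving a 2.73x average speedup on agentic workflows.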

Abstract

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.
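
The selection idea can be sketched as a comparison of expected tokens per unit of cost. Everything specific here is an assumption for illustration: the cost model, the default costs, and the function names are hypothetical, and in the paper the predicted cache acceptance length comes from a learned predictor that the abstract does not describe.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target model's own tokens."""
    count = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        count += 1
    return count


def tokens_per_cost(expected_accept, draft_cost, verify_cost=1.0):
    """Expected tokens gained per unit cost; verification always adds one target token."""
    return (expected_accept + 1.0) / (draft_cost + verify_cost)


def choose_draft_source(predicted_cache_accept, expected_model_accept,
                        cache_draft_cost=0.0, model_draft_cost=0.5):
    """Pick the draft source with the higher estimated payoff for this decoding step."""
    cache_payoff = tokens_per_cost(predicted_cache_accept, cache_draft_cost)
    model_payoff = tokens_per_cost(expected_model_accept, model_draft_cost)
    return "cache" if cache_payoff >= model_payoff else "model"
```

Under these assumed costs, a cache draft predicted to yield about five accepted tokens beats a model drafter expected to yield three, while a cache match predicted to yield almost nothing is skipped in favor of the model drafter.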

2606.01016 2026-06-02 cs.CL cs.AI eess.AS

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

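Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

Affiliations: Shenzhen International Graduate School, Tsinghua University; Department of Electronic Engineering, Tsinghua University; JD AI Research

AI summary: To address the bias toward high-resource languages, the lack of semantic reasoning, and the neglect of dialects in existing speech evaluation benchmarks, proposes PolySpeech-100, which uses a hybrid construction pipeline to cover 110 linguistic variants and evaluates 22 models, finding that open-source end-to-end models outperform cascade systems on heavy dialects while chain-of-thought prompting degrades performance in zero-shot settings.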

Comments 19 pages, 13 figures, KDD 2026

Abstract

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess "native-level" speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

2606.01015 2026-06-02 cs.RO cs.AI cs.NI cs.SY eess.SY

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

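Ranulfo Bezerra, Satoshi Tadokoro, Kazunori Ohno

Affiliations: Tohoku University

AI summary: Surveys the current state of the convergence of artificial intelligence, the Internet of Things, and robotics, proposes a modular system architecture, and highlights the role of small language models (SLMs) and large language models (LLMs) in distributed cognition and autonomous decision-making, offering a conceptual and technical roadmap for next-generation connected robotics and physical AI ecosystems.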

Comments 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal

Journal ref IEEE Internet of Things Journal, vol. 13, no. 10, pp. 20398-20412, May 15, 2026

DOI: 10.1109/JIOT.2026.3670191

Abstract

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.

2606.01014 2026-06-02 cs.CV cs.AI

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

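Gyojin Han, Junmo Kim

Affiliations: School of Electrical Engineering, KAIST

AI summary: Proposes a cross-axis feature fusion architecture and an auxiliary task in which a joint-anchored transformer predicts joint-wise motion differences, enabling text-driven 3D human motion editing and achieving state-of-the-art performance on the MotionFix dataset.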

Comments CVPR 2026

Abstract

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.
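The auxiliary regression target named in the abstract, the Soft-DTW distance, can be illustrated with a minimal sketch. It assumes scalar per-frame values for a single joint and a squared-error frame cost; the names `soft_min` and `soft_dtw` and the choice of `gamma` are hypothetical and are not taken from the paper, which works on joint rotations.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    lowest = min(values)
    return lowest - gamma * math.log(
        sum(math.exp(-(v - lowest) / gamma) for v in values)
    )


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two scalar sequences (squared-error cost)."""
    inf = float("inf")
    n, m = len(x), len(y)
    # r[i][j] is the soft alignment cost of x[:i] against y[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma
            )
    return r[n][m]


# Hypothetical single-joint trajectories: an unchanged joint and an edited one.
unchanged = soft_dtw([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], gamma=0.01)
edited = soft_dtw([0.0, 1.0, 2.0], [0.0, 3.0, 2.0], gamma=0.01)
```

With a small `gamma` the value approaches the classical DTW cost, so a joint whose source and target trajectories coincide scores near zero, while an edited joint scores higher. That contrast is the kind of per-joint signal such a regression target could provide.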


2606.01012 2026-06-02 cs.AI cond-mat.mtrl-sci

Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach

An Vuong, Minh-Hao Van, Chen Zhao, Xintao Wu

Affiliations: University of Arkansas; Baylor University

AI summary: Proposes a multimodal learning approach that jointly models the interfaces between different material layers to predict the properties arising from vertical stacking under given configurations; experiments demonstrate its effectiveness and efficiency.

Comments: Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Abstract

AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real-world applications. Research on bilayer vdWs materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthesized experimentally and the increasing utilization of high-throughput computing technology has constructed several computational two-dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored, necessitating further research studies. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at https://github.com/AnVuong123/bimat_ml.


2606.01009 2026-06-02 cs.SD

MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

Augusto Camargo, Marcelo Finger

Affiliations: Instituto de Ciências Matemáticas e de Computação, University of São Paulo, Brazil

AI summary: Proposes MelT, a framework that formulates the Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) as dense General Matrix Multiplication (GEMM) operations to obtain a single-stage audio frontend that replaces the conventional STFT+Mel pipeline, achieving up to 3.75x faster inference and a 3.52x reduction in energy consumption across a range of accelerators.

Abstract

Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.
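The core idea in the abstract, precomputing Mel-spaced NDFT bases and applying them to frames with one dense matrix product, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the HTK-style mel formula, the absence of a window function, the plain-Python `matmul` standing in for a hardware GEMM, and all function names are assumptions made here.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_frequencies(n_bins, f_min, f_max):
    """Centre frequencies spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * k / (n_bins - 1)) for k in range(n_bins)]


def ndft_basis(freqs, frame_len, sample_rate):
    """Precompute real and imaginary basis rows, one row per frequency."""
    cos_rows, sin_rows = [], []
    for f in freqs:
        w = 2.0 * math.pi * f / sample_rate
        cos_rows.append([math.cos(w * n) for n in range(frame_len)])
        sin_rows.append([-math.sin(w * n) for n in range(frame_len)])
    return cos_rows, sin_rows


def matmul(a, b):
    """Dense product of a (m x k) and b (k x n), both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def mel_ndft_power(frames, cos_rows, sin_rows):
    """Return power[bin][frame] from two dense products over all frames."""
    x = [list(col) for col in zip(*frames)]  # frame_len x n_frames
    re = matmul(cos_rows, x)
    im = matmul(sin_rows, x)
    return [[r * r + i * i for r, i in zip(rr, ii)] for rr, ii in zip(re, im)]


# A pure tone at the fourth mel-spaced frequency should peak in bin 3.
sample_rate, frame_len = 16000, 256
freqs = mel_frequencies(8, 300.0, 4000.0)
cos_rows, sin_rows = ndft_basis(freqs, frame_len, sample_rate)
tone = [math.cos(2.0 * math.pi * freqs[3] * n / sample_rate) for n in range(frame_len)]
power = mel_ndft_power([tone], cos_rows, sin_rows)
peak_bin = max(range(len(freqs)), key=lambda k: power[k][0])
```

Because the basis depends only on the chosen frequencies, frame length, and sample rate, it is built once. Every batch of frames then costs only the two matrix products, which is the operation accelerators are designed to run fast.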


2606.01007 2026-06-02 cs.LG cs.AI

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao, Yong Jiang, Qing Li

Affiliations: Tsinghua Shenzhen International Graduate School; Pengcheng Laboratory

AI summary: Proposes the Task-Aware Coactivation Grouping (TACG) framework, which optimizes expert placement using task-specific co-activation patterns, and introduces Generic Expert Shared Replication (GESR) to handle online load skew; across three MoE models it reduces communication cost by 31.39% on average while maintaining a Jain fairness index of 0.9975.

Abstract

Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose Task-Aware Coactivation Grouping (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce Generic Expert Shared Replication (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, consistently outperforming strong baselines.
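For readers unfamiliar with the load-balance metric quoted above, the Jain fairness index of per-GPU loads x_1..x_n is (sum x_i)^2 / (n * sum x_i^2). It equals 1 for perfectly even load and 1/n when a single GPU carries everything. A minimal sketch follows; the function name and the example loads are hypothetical and are not taken from the paper.

```python
def jain_fairness(loads):
    """Jain fairness index of non-negative per-device loads."""
    total = sum(loads)
    squares = sum(v * v for v in loads)
    if squares == 0:
        # All-zero load is treated as balanced (a convention chosen here).
        return 1.0
    return total * total / (len(loads) * squares)


even = jain_fairness([1.0, 1.0, 1.0, 1.0])    # perfectly balanced
skewed = jain_fairness([1.0, 0.0, 0.0, 0.0])  # one device does all the work
```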


2606.01006 2026-06-02 cs.CV

Automated Erythrocyte Detection and Tracking for Retinal Blood Flow Quantification in Erythrocyte-Mediated Angiography

Chiao-Yi Wang, Havish S Gadde, Yi-Ting Shen, Saige M. Oechsli, Osamah Saeedi, Yang Tao

Affiliations: Department of Bioengineering, University of Maryland, College Park, MD 20742, USA; Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA

AI summary: Proposes EMTrack, a framework that uses a flow-context module and a topology-aware tracking strategy for automated erythrocyte detection and tracking in retinal blood flow quantification, and outperforms baseline methods on the new RBF-EMA dataset.

Abstract

Capillary-level retinal blood flow (RBF) has strong potential as a biomarker for various ocular diseases. However, modalities for measuring capillary-level RBF remain limited. Erythrocyte-mediated angiography (EMA), an emerging imaging technique, enables capillary-level RBF measurement by visualizing individual erythrocytes, yet automated erythrocyte detection and tracking, which are essential for quantifying blood flow, remain largely unexplored. To address this gap, we propose EMTrack, a novel framework featuring a flow-context module for erythrocyte detection that distinguishes moving from paused cells and a topology-aware tracking strategy that enables tracking under large inter-frame displacements and substantial motion variations. In addition, we establish RBF-EMA, a new EMA dataset with comprehensive erythrocyte detection and tracking annotations. Experimental results demonstrate that our method outperforms baseline methods both quantitatively and qualitatively on detection and tracking tasks in the RBF-EMA dataset. Moreover, RBF quantification results highlight the strong potential of our framework for automated retinal blood flow measurement.


2606.01000 2026-06-02 cs.LG cs.CL

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

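Arda Uzunoglu, Alvin Zhang, Daniel Khashabi

Affiliations: University of Washington

AI summary: Proposes trust functions that assign trust scores to weak labels and filter the weak supervision accordingly, achieving near-lossless weak-to-strong generalization across multiple domains, with gains that can be amplified through iterative chaining.

Comments: ICML 2026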

2606.00999 2026-06-02 cs.CV

SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation

Aditya Makineni, Qing Tian

Affiliations: Department of Computer Science, University of Alabama at Birmingham

AI summary: Proposes SWARD, a framework that uses multi-scale windowed attention distillation and prototype discriminative regularization to bridge the representational gap between transformer teachers and CNN students, enabling knowledge distillation for cross-architectural semantic segmentation.

Abstract

Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based and encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student's feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student's reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.


2606.00998 2026-06-02 cs.RO

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

Beining Han, Yu-Wei Chao, Erwin Coumans, Clemens Eppner, Balakumar Sundaralingam, Jia Deng, Stan Birchfield, Adithyavairavan Murali

Affiliations: NVIDIA; Princeton University

AI summary: Proposes a diffusion-based cross-embodiment 6-DOF grasping method that encodes the gripper with a swept-volume heuristic and is trained on 2 billion grasps, achieving zero-shot generalization to novel objects, scenes, and gripper morphologies.

Abstract

We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model to generalize not only to novel objects and scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion-model-based generative 6-DOF grasping models to condition on an additional gripper representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 2 billion grasps. In simulation experiments, our model achieves better zero-shot generalization to novel real-world grippers and objects than baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our swept-volume gripper representation and our procedural gripper training dataset. Lastly, we show zero-shot generalization to novel real-world grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.


2606.00997 2026-06-02 cs.CL

Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

Lin Yao

Affiliations: School of Computer Science, Shanghai Jiao Tong University; Zhongguancun Academy

AI summary: Studies how reveal order affects likelihood in order-agnostic language models (OALMs), proposes a diagnostic based on confidence variance, and proves a uniform-spreading theorem to guide the comparison of decoding paths.

Abstract

Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. In LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the resulting deviation motivates $\mathrm{Var}(\log q_t)$ as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.
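The diagnostic $\mathrm{Var}(\log q_t)$ can be computed directly from a confidence trace. The sketch below assumes that `q_t` is the model's probability for the token revealed at step `t` and uses the population variance; both are assumptions made here, since the abstract does not fix either detail. The two example traces are hypothetical and were chosen to have the same total log-likelihood.

```python
import math


def log_confidence_variance(confidences):
    """Population variance of log q_t over a decoding path."""
    logs = [math.log(q) for q in confidences]
    mean = sum(logs) / len(logs)
    return sum((v - mean) ** 2 for v in logs) / len(logs)


def total_log_likelihood(confidences):
    return sum(math.log(q) for q in confidences)


# Two hypothetical traces with the same product of confidences (0.5 ** 4).
uniform_trace = [0.5, 0.5, 0.5, 0.5]
uneven_trace = [1.0, 0.25, 1.0, 0.25]

var_uniform = log_confidence_variance(uniform_trace)
var_uneven = log_confidence_variance(uneven_trace)
```

The two traces are indistinguishable by likelihood alone, yet the uneven one has a clearly larger variance. That is the situation in which the abstract suggests reporting mean confidence and confidence variance together.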


2606.00994 2026-06-02 cs.CL

A Registry-Bound LLM Pipeline for Evidence-Grounded Trait Extraction across Tropical Plants, Aquatic Species, and Exotic Pets

Jeff Wang

Affiliations: NEXLY LLC, United States

AI summary: Proposes a registry-bound large language model pipeline that uses four mechanisms (a controlled-vocabulary registry, per-row evidence quotes, confidence labels, and multi-version preservation) to extract evidence-grounded structured trait records at scale from the Tropical Species Encyclopedia, persisting records for 409,820 of 409,880 species (99.985%), 81.57% of them at high confidence.

Comments: 33 pages, 6 figures; methodology paper

Abstract

We describe a registry-bound large-language-model extraction pipeline producing evidence-grounded structured trait records at scale, on cultivated tropical plant, aquatic, and pet species. Four mechanisms render LLM-derived rows auditable: a versioned 39-key closed-vocabulary trait registry constraining every admitted value to a typed schema; a per-row verbatim evidence quote tying each value to source text; a per-row confidence label (high or medium; low dropped pre-persist); and multi-version preservation. Applied to 409,880 publishable species from the Tropical Species Encyclopedia, the pipeline executed 706,220 runs and persisted 5,489,881 trait records across 409,820 species (99.985%), 81.57% at high confidence. We report three validation layers in descending evidentiary strength: at full population, 90.12% of 5,427,588 evidence-bearing rows have their quote as a verbatim source substring (93.49% excluding one compliance meta-trait); a quote-supports-value audit on n=100 stratified non-red-zone rows yielded 100/100 (lower bound 96.30%); face-validity on n=50 red-zone rows yielded 50/50 Accept (lower bound 92.86%). Per-record correctness is not claimed; 100% pending human curation. The contribution is the four-mechanism framework.
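The abstract does not say how the audit lower bounds were computed. As an inference rather than a statement of the author's method, both reported values (96.30% for 100/100 and 92.86% for 50/50) are numerically consistent with the lower end of a Wilson score interval at z = 1.96. The sketch below shows that calculation; the function name is hypothetical.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


quote_audit = wilson_lower_bound(100, 100)  # quote-supports-value audit
face_validity = wilson_lower_bound(50, 50)  # red-zone face-validity audit
```

When every audited row passes, the bound reduces to n / (n + z^2). This explains why a perfect score on 50 rows supports a weaker claim than a perfect score on 100 rows.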


2606.00991 2026-06-02 cs.AI

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

Siyan Li, Zehao Wang, Jiachen Li, Kanok Boriboonsomsin, Matthew J. Barth, Guoyuan Wu

Affiliations: Bourns College of Engineering, Center for Environmental Research and Technology, University of California at Riverside, CA, USA

AI summary: Surveys applications of large language models (LLMs) and multi-modal large language models (MM-LLMs) in transportation systems management and operations (TSMO) across three domains (operations and services, mobility and fleet services, and data, modeling, and decision support), and identifies challenges such as data heterogeneity, real-time inference, and explainability, along with future directions.

Comments: Preprint version

Abstract

Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various sensor streams, incident reports, traveler feedback, and visual observations. Large language models (LLMs), including emerging multi-modal large language models (MM-LLMs), provide a new mechanism for integrating these structured and unstructured inputs into operator-facing decision support. This survey paper reviews LLM- and MM-LLM-based applications in TSMO across three domains: transportation operations & services (supply), mobility & fleet services (demand), and data, modeling & decision support. Using a PRISMA-guided screening process, we synthesize current studies while distinguishing operationally oriented applications from prototype and emerging concepts. We further identify recurring challenges in data heterogeneity, real-time inference, explainability, multi-modal fusion, and governance. Finally, we outline existing gaps and future directions in localized adaptation, edge deployment, benchmarking, and cross-agency collaboration. Overall, LLM-based systems appear most promising as a decision-support layer, with MM-LLMs offering particular value when heterogeneous text, visual, and sensor inputs must be integrated.


2606.00990 2026-06-02 cs.RO

OSCAR: Obstacle Survival Curves for Adaptive Robot Navigation

Hshmat Sahak, Aoran Jiao, Nicholas Rhinehart, Tim Barfoot

Affiliations: University of Toronto

AI summary: Proposes OSCAR, a framework that uses survival models to learn obstacle clearance-time distributions and a graph planner that dynamically adjusts the wait-versus-reroute threshold to reduce navigation time.

Comments: 8 pages main text, appendices included

Abstract

A mobile robot following a graph of known routes can make costly navigation errors when a temporary obstacle blocks a critical edge: waiting too long behind a parked cart wastes time, but immediately rerouting around a person who would move in a few seconds is also inefficient. Standard reactive obstacle avoidance addresses local motion around obstacles, while fixed wait-or-reroute rules ignore how long different obstacle types tend to persist. We propose OSCAR: an adaptive survival-modeling framework for graph-based navigation with temporary blockages. Assuming obstacle class labels are available at encounter time, the robot learns class-conditioned residual clearance-time distributions from online experience, including right-censored observations when it reroutes before observing clearance. These survival models are integrated into a time-dependent graph planner that maintains obstacle memory and computes a patience threshold at each blocked edge: how long to wait before taking an alternate route. The method continuously updates its clearance estimates across episodes and uses them to balance waiting against rerouting. We evaluate the approach in simulation and on a real mobile robot in a university atrium with obstacles including people, chairs, bins, and tubes. In simulation, the learned policy's time-to-goal converges to within 1% of an oracle with access to ground-truth clearance distributions after fewer than 20 observations per obstacle class, outperforming all heuristic baselines. Real-world deployment confirms that the policy improves online, adapting its patience thresholds from experience across 50 navigation episodes.
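The wait-versus-reroute trade-off can be made concrete with a small sketch. It is an illustration only and makes several assumptions that are not confirmed by the abstract: a Kaplan-Meier estimator as the survival model, a single fixed `detour_cost` for the alternate route, and a cost equal to the expected waiting time plus the detour if the obstacle has not cleared. All function names are hypothetical.

```python
def kaplan_meier(durations, cleared):
    """Kaplan-Meier survival curve as [(time, S(time))] at each clearance time.

    cleared[i] is False for right-censored waits (the robot rerouted first).
    """
    samples = sorted(zip(durations, cleared))
    at_risk = len(samples)
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        events = 0
        leaving = 0
        while i < len(samples) and samples[i][0] == t:
            events += 1 if samples[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def expected_delay(curve, patience, detour_cost):
    """Expected wait up to `patience`, plus the detour if still blocked."""
    waited, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t > patience:
            break
        waited += prev_s * (t - prev_t)  # area under the step survival curve
        prev_t, prev_s = t, s
    waited += prev_s * (patience - prev_t)
    return waited + prev_s * detour_cost


def best_patience(curve, detour_cost):
    """Pick the candidate threshold (0 or a clearance time) with least delay."""
    candidates = [0.0] + [t for t, _ in curve]
    return min(candidates, key=lambda p: expected_delay(curve, p, detour_cost))


# Hypothetical waits in seconds for one obstacle class; one is censored.
curve = kaplan_meier([2, 4, 4, 6], [True, True, False, True])
short_detour = best_patience(curve, detour_cost=3)
long_detour = best_patience(curve, detour_cost=20)
```

With a cheap detour the sketch prefers to reroute immediately. With an expensive detour it prefers to wait until the longest observed clearance time. This matches the qualitative behaviour the abstract describes for a patience threshold.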


2606.00988 2026-06-02 cs.LG

Data Enrichment for Symbolic Regression Using Diffusion Models

Simon De Reuver, Tamas Kristof Toth, Teddy Lazebnik

Affiliations: Department of Computing; Jönköping University; Department of Information Science; University of Haifa

AI summary: Proposes a physics-guided latent diffusion framework that enriches sparse observations with physically constrained synthetic data, improving the reliability of equation discovery by symbolic regression on sparse, noisy, or incomplete data.

Abstract

Symbolic regression (SR) offers a route to scientific discovery by converting observations into interpretable governing equations. However, despite its promise, its reliability degrades sharply when spatiotemporal measurements are sparse, noisy, or physically incomplete, as commonly occurs in practice. Data enrichment (DE) has been shown to mitigate this limitation, yet additional samples can mislead equation discovery unless they preserve the physical structure of the target system. This implicit requirement of DE demands narrow domain expertise as well as technical fluency, which severely limits its practical usefulness. In this study, we introduce a physics-guided latent diffusion framework for DE in support of downstream SR models. The proposed framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to complete sparse observations with synthetic fields constrained by governing relations. We evaluate the approach on heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential, using GPLearn, DEAP, and PySR as downstream SR backends. Our results reveal that physics-corrected enrichment consistently improves recovery in sparse regimes across physical dynamics and SR models. These results show that generative enrichment can strengthen equation discovery without additional domain expertise.


2606.00987 2026-06-02 cs.CV cs.AI

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence (TeleAI); China Telecom; School of Artificial Intelligence, Optics and Electronics (iOPEN); Northwestern Polytechnical University

AI summary: Introduces the multi-temporal referring segmentation task, builds the first benchmark, MTRefSeg-21K, with the automated data construction pipeline CRAFT-Agent, and designs MTRefSeg-R1, a change-aware LVLM framework trained in two stages that outperforms existing baselines.

Abstract

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.


2606.00986 2026-06-02 cs.LG

Profiling Privacy Preservation Against Gradient Inversion Attacks in Tabular Federated Learning

Ivo Osterberg Nilsson, Maximilian Birr Engvall, Viktor Valadi, Teddy Lazebnik

Affiliations: Department of Computing; Jönköping University; Scaleout Systems; University of Haifa

AI summary: Evaluates how well gradient inversion attacks recover tabular data across federated learning protocols, client batch sizes, training stages, attacker assumptions, model architectures, and task types; finds that small-batch updates are the most vulnerable, that the FT-Transformer architecture is harder to invert than one-hot baselines, and that aggregate reconstruction accuracy can overstate complete-record recovery.

Abstract

Federated learning (FL) enables multiple data holders to train machine learning models collaboratively without centralizing raw data, making it useful in privacy sensitive domains such as healthcare and institutional data sharing. FL keeps data local to clients while communicating only model updates, such as gradients or model deltas. Nevertheless, these updates can expose private client data through gradient inversion attacks (GIAs). We study this risk for tabular FL under an honest-but-curious server threat model across FL protocols, client batch sizes, training stages, attacker assumptions, model architectures, and binary classification, multiclass classification, and regression tasks. We use MIMIC-IV and complementary benchmark datasets. Our evaluation distinguishes numerical and categorical recovery, baseline recoverability, feature level recovery, and exact match rate (EMR). We evaluate FedSGD gradients and FedAvg model deltas with an exposure aligned protocol, comparing attacked models after matched client data exposure rather than matched communication rounds. We compare multilayer perceptron (MLP), ResNet, and FT-Transformer models, and isolate architecture effects through an MLP grid over width, depth, activation, normalization, and dropout. The results show that small client batches and updates representing few distinct records are most vulnerable. Larger local batches and stronger aggregation reduce reconstruction but do not eliminate leakage. FT-Transformer is consistently harder to invert than one-hot baselines, while reconstructability also varies substantially within the MLP family. These findings identify architecture as a practical privacy variable in tabular FL. We also show that aggregate reconstruction accuracy can overstate complete record recovery in sparse data, making EMR and baseline comparisons essential.
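The closing point, that aggregate reconstruction accuracy can overstate complete record recovery, is easy to see on a toy example. The sketch below assumes exact equality as the match criterion, which suits categorical features (numerical features would need a tolerance). The records and function names are hypothetical and are not drawn from the paper's data.

```python
def feature_accuracy(true_rows, recon_rows):
    """Fraction of individual feature values reconstructed correctly."""
    matches = sum(
        t == r
        for true_row, recon_row in zip(true_rows, recon_rows)
        for t, r in zip(true_row, recon_row)
    )
    total = sum(len(row) for row in true_rows)
    return matches / total


def exact_match_rate(true_rows, recon_rows):
    """Fraction of records reconstructed correctly in every feature."""
    exact = sum(
        tuple(true_row) == tuple(recon_row)
        for true_row, recon_row in zip(true_rows, recon_rows)
    )
    return exact / len(true_rows)


true_rows = [("A", "x", 1, 0), ("B", "y", 0, 0), ("A", "y", 1, 1)]
recon_rows = [("A", "x", 1, 0), ("B", "x", 0, 0), ("A", "y", 0, 1)]

aggregate = feature_accuracy(true_rows, recon_rows)  # 10 of 12 features match
emr = exact_match_rate(true_rows, recon_rows)        # 1 of 3 records matches
```

Here ten of twelve feature values are correct, but only one of the three records is recovered in full, so the aggregate figure alone would suggest far more leakage of whole records than actually occurred.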


2606.00985 2026-06-02 cs.RO

Make Your VLA More Robust Without More Data By Interleaving Motion Planning


Dan BW Choe, Sundhar Vinodh Sangeetha, Samuel Coogan, Shreyas Kousik

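Affiliation: Georgia Institute of Technology

AI summary: Proposes MPVI, a framework that interleaves model-based motion planning with vision-language-action models. VLM-based completion checking and proprioceptive triggers provide reliable switching between the two, improving robustness on long-horizon mobile manipulation tasks without additional training and raising task progress by 113% on the BEHAVIOR-1K benchmark.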

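Details

AI abstract (translated from Chinese)

Vision-language-action (VLA) models have made notable progress in mobile manipulation, yet they still perform poorly on long-horizon tasks. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across long sequences of spatially distributed subtasks, and (2) early execution errors accumulate quickly over the task horizon. These challenges persist even after finetuning on large-scale human-teleoperated mobile manipulation data, which suggests that more data alone may not solve the problem. To address them, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The integration uses open-vocabulary object detection, frontier exploration, and motion planning to localize and navigate to distant or occluded target objects in cluttered scenes. Such integration is non-trivial, however, because it requires reliable switching between modules; we demonstrate one workable approach based on VLM completion checking with proprioceptive triggers. We evaluate our method on the BEHAVIOR-1K benchmark and show a 113% improvement in task progress over a top end-to-end VLA baseline. More details are available on the project page: https://mpvi.netlify.app/.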

English abstract

Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite finetuning on large-scale human-teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration, and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate a 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.
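The switching idea in the abstract can be pictured as a two-mode state machine. The sketch below is only an illustration under stated assumptions, not the MPVI implementation: the mode names, the `Observation` fields, and the rule that the VLM is queried only after a proprioceptive trigger fires are all hypothetical, since the abstract does not give these details.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Tuple

class Mode(Enum):
    PLANNER = auto()  # model-based motion planner drives the robot
    VLA = auto()      # vision-language-action policy drives the robot

@dataclass
class Observation:
    at_target: bool        # hypothetical: planner has reached the subtask target
    gripper_settled: bool  # hypothetical proprioceptive trigger

def next_mode(
    mode: Mode,
    obs: Observation,
    vlm_says_done: Callable[[], bool],
) -> Tuple[Mode, bool]:
    """Return (new_mode, subtask_finished) for one control step."""
    if mode is Mode.PLANNER:
        # Hand over to the VLA once navigation has reached the target.
        return (Mode.VLA, False) if obs.at_target else (Mode.PLANNER, False)
    # In VLA mode, call the (assumed expensive) VLM check only when the
    # proprioceptive trigger fires, then return control to the planner.
    if obs.gripper_settled and vlm_says_done():
        return Mode.PLANNER, True
    return Mode.VLA, False

# Example step: the VLA is active, the trigger fired, and the VLM confirms.
print(next_mode(Mode.VLA, Observation(False, True), lambda: True))
```

Gating the VLM query behind a cheap proprioceptive signal is one plausible design choice here: it avoids calling a large model on every control step, at the cost of missing completions that the trigger does not detect.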

