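arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.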

2606.14142 2026-06-16 cs.CL cs.AI New submission

$\mu_0$: A Scalable 3D Interaction-Trace World Model

Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh, Yao-Chih Lee, H. Jin Kim, Jia-Bin Huang, Furong Huang

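Affiliations: University of Maryland, College Park; Seoul National University

AI summary: Proposes $\mu_0$, a scalable world model based on 3D traces that enables cross-embodiment robot learning by predicting the trajectories of interaction points. It needs no action labels and performs on par with models pretrained with action supervision.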

Abstract

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
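The abstract says each trace is represented by B-spline control points. As a rough illustration of why that gives a compact, smooth trajectory, the sketch below evaluates a uniform cubic B-spline in 3D from a handful of control points. The spline degree, the uniform knots, and the names `cubic_bspline_point` and `sample_trace` are assumptions made for this example; the abstract does not specify the parameterization.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def cubic_bspline_point(ctrl: Sequence[Point3], seg: int, t: float) -> Point3:
    """Evaluate one uniform cubic B-spline segment at t in [0, 1].

    Segment `seg` is controlled by ctrl[seg : seg + 4].
    """
    window = ctrl[seg:seg + 4]
    # Uniform cubic B-spline basis functions; they sum to 1 for every t.
    basis = (
        (1 - t) ** 3 / 6.0,
        (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0,
        (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0,
        t ** 3 / 6.0,
    )
    x, y, z = (sum(w * p[k] for w, p in zip(basis, window)) for k in range(3))
    return (x, y, z)


def sample_trace(ctrl: Sequence[Point3], samples_per_segment: int = 8) -> List[Point3]:
    """Turn a few control points into a dense, smooth 3D trajectory."""
    if len(ctrl) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    points: List[Point3] = []
    for seg in range(len(ctrl) - 3):
        for i in range(samples_per_segment):
            points.append(cubic_bspline_point(ctrl, seg, i / samples_per_segment))
    # Close the curve with the end of the last segment.
    points.append(cubic_bspline_point(ctrl, len(ctrl) - 4, 1.0))
    return points


if __name__ == "__main__":
    # Hypothetical control points for a hand moving along a gentle arc.
    control = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.2, 0.0, 0.15),
               (0.3, 0.0, 0.1), (0.4, 0.0, 0.0)]
    print(len(sample_trace(control)), "trajectory points from", len(control), "control points")
```

A predictor that outputs only the control points gets smoothness for free, because the dense trajectory is a fixed, differentiable function of them.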

2606.13751 2026-06-16 cs.CL New submission

Which Models Perform Better in Inheritance Reasoning?

Mohammed Amine Mouhoub, Chahinez Bouchekif

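Affiliations: Paris Dauphine University; University of Abou Bekr Belkaïd

AI summary: Compares commercial and open-source large language models on Islamic inheritance reasoning and finds that commercial models are better at identifying heirs, applying exclusion rules, and keeping their reasoning consistent, with Gemini 2.5 Flash performing best.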

Abstract

This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of 0.989.

2606.13710 2026-06-16 cs.AI cs.LG New submission

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Xidong Wang, Derek Li, Ying Wei, Bryan Dai

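Affiliations: IQuest Research; Zhejiang University

AI summary: Proposes the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which co-evolves a proposer, a solver, and a judge through hybrid-mode reinforcement learning, so that an 8B model surpasses static open 8-32B models and state-of-the-art training methods on deep research tasks.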

Abstract

Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.

2606.13674 2026-06-16 cs.CV New submission

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

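Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Robbyant, Ant Group; Hong Kong University of Science and Technology

AI summary: Proposes RepWAM, a world action model built on representation visual-action tokenizers that jointly models future visual states and latent actions and performs strongly on real-world and simulated robot manipulation tasks.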

Abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

2606.13655 2026-06-16 cs.CV cs.GR New submission

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

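Affiliations: University of Washington; World Labs

AI summary: Proposes Flex4DHuman, a multi-view video diffusion model conditioned on relative camera poses that converts monocular or sparse multi-view video into dense multi-view video without explicit geometry priors, for use in 4D Gaussian splatting reconstruction.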

Comments Project Page: https://andy-cheng.github.io/Flex4DHuman/

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
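To make "relative camera-pose conditioning" concrete, the sketch below computes the SE(3) transform of a target camera expressed in a reference camera's frame. It assumes 4x4 camera-to-world matrices and the helper names `invert_rigid`, `relative_pose`, and `translation`; the abstract does not state the paper's pose convention or how the result is fed into the positional encoding.

```python
from typing import List

Matrix = List[List[float]]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(T: Matrix) -> Matrix:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[T[j][i] for j in range(3)] for i in range(3)]  # rotation transpose
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    rows = [r_t[i] + [new_t[i]] for i in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows


def relative_pose(ref_c2w: Matrix, tgt_c2w: Matrix) -> Matrix:
    """Pose of the target camera in the reference camera's frame.

    Assumption: both inputs are camera-to-world matrices.
    """
    return matmul4(invert_rigid(ref_c2w), tgt_c2w)


def translation(x: float, y: float, z: float) -> Matrix:
    """Pure-translation pose, used here only to build a toy example."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


if __name__ == "__main__":
    rel = relative_pose(translation(1.0, 0.0, 0.0), translation(3.0, 0.0, 0.0))
    print([row[3] for row in rel[:3]])  # offset of the target camera from the reference
```

Because only the relative transform is used, the conditioning does not depend on where the world origin happens to be.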

2606.13608 2026-06-16 cs.AI cs.LG New submission

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Joseph Keshet

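Affiliations: Joseph Keshet

AI summary: Argues that large language models lack the commitment-bearing agency that moral responsibility requires: their outputs arise from probabilistic mappings rather than intrinsic intentionality, and stochastic sampling does not amount to choice.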

Abstract

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

2606.13300 2026-06-16 cs.LG New submission

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

Mariya Pavlova, Harrison Bo Hua Zhu, Lidia Vitanova, Elizaveta Semenova, Yingzhen Li

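Affiliations: GitHub; arXiv

AI summary: Proposes the Trajectory-based Quantization Sensitivity Score (TQS), which analyzes how quantization error propagates from a dynamical-systems stability perspective and enables mixed-precision quantization without calibration data.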

Comments ICML 2026, Workshop on Forecasting as a New Frontier of Intelligence

Abstract

We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.
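The abstract's core intuition, that quantization error can amplify over the rollout horizon, can be shown with a standard worst-case bound. If one rollout step is L-Lipschitz and quantization adds at most eps of error per step, the accumulated error obeys e_{t+1} <= L * e_t + eps. The sketch below iterates that bound. It is a generic stability argument with a made-up function name, not the TQS metric itself, whose definition the abstract does not give.

```python
from typing import List


def rollout_error_bound(lipschitz: float, per_step_error: float, horizon: int) -> List[float]:
    """Worst-case accumulated error after each rollout step.

    Iterates e_{t+1} = L * e_t + eps starting from e_0 = 0.
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    bounds: List[float] = []
    err = 0.0
    for _ in range(horizon):
        err = lipschitz * err + per_step_error
        bounds.append(err)
    return bounds


if __name__ == "__main__":
    # Contractive dynamics (L < 1): the error saturates near eps / (1 - L).
    print(rollout_error_bound(0.5, 1.0, 3))
    # Expansive dynamics (L > 1): the error grows geometrically with the horizon.
    print(rollout_error_bound(2.0, 1.0, 3))
```

Under this toy model, a layer or state with L well above 1 deserves more bits than one with L below 1, which is the kind of budget decision the abstract describes.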

2606.13127 2026-06-16 cs.CV New submission

Fully Distributed Multi-View 3D Tracking in Real-Time

Byron Hernandez, Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros

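Affiliations: University of Florida; NVIDIA Corporation

AI summary: Proposes MV3DT, a fully distributed framework that performs real-time multi-view 3D tracking through peer-to-peer coordination, with no central aggregation. It reaches 96.5% IDF1 and 93.1% MOTA on WILDTRACK and sustains 30 FPS on 100 cameras.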

Comments 18 pages, 4 figures, 2 algorithms, 4 tables

Abstract

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 96.5% IDF1, 93.1% MOTA, and 94.6% MOTP on WILDTRACK, competitive with state-of-the-art centralized methods, and unprecedented 41.7% IDF1 and 50.9% MOTA on SCOUT while demonstrating superior scalability: sustaining 30 FPS on 100 cameras with <10ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.
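For readers unfamiliar with the tracking metrics quoted above, the sketch below computes MOTA and IDF1 from their standard definitions: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The counts in the usage example are invented for illustration and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int,
         num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("no identity matches or errors to score")
    return 2 * idtp / denom


if __name__ == "__main__":
    # Hypothetical counts for one sequence.
    print(f"MOTA = {mota(40, 20, 5, 1000):.3f}")
    print(f"IDF1 = {idf1(900, 50, 50):.3f}")
```

MOTA penalizes detection errors and identity switches per ground-truth box, while IDF1 rewards keeping the same identity over time, which is why the two numbers can diverge on hard datasets.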

2606.13053 2026-06-16 cs.RO cs.AI New submission

EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

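Affiliations: AI Lab, Country Garden Services Group; Fudan University; Omni AI

AI summary: Proposes the EV-WM framework, which augments pretrained-feature world models with event prediction and verification so that task-progress signals can be assessed reliably and used for planning in long-horizon manipulation.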

Abstract

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant predicates. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EV-WM, a predicate-grounded verification framework for world-model planning. EV-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EV-WM shows that predicate-grounded verification can make feature-space world-model planning more interpretable and better aligned with task progress.
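The scoring-and-gating step described above can be pictured as a weighted combination of the four named terms followed by a threshold. The sketch below is one plausible reading. The linear form, the weights, the `min_score` gate, and all class and function names are assumptions for illustration; the abstract does not give the actual scoring function.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CandidateFuture:
    """Decoded event-state summary of one imagined rollout (hypothetical fields)."""
    name: str
    task_progress: float
    semantic_consistency: float
    physical_feasibility: float
    uncertainty: float


def verifier_score(c: CandidateFuture, w_progress: float = 1.0,
                   w_semantic: float = 0.5, w_physical: float = 0.5,
                   w_uncertainty: float = 1.0) -> float:
    """Reward progress, consistency, and feasibility; penalize uncertainty."""
    return (w_progress * c.task_progress
            + w_semantic * c.semantic_consistency
            + w_physical * c.physical_feasibility
            - w_uncertainty * c.uncertainty)


def select_candidate(candidates: Sequence[CandidateFuture],
                     min_score: float = 0.0) -> Optional[CandidateFuture]:
    """Return the best-scoring candidate, or None if it fails the gate."""
    best = max(candidates, key=verifier_score, default=None)
    if best is None or verifier_score(best) < min_score:
        return None
    return best


if __name__ == "__main__":
    proposals = [
        CandidateFuture("push_fast", 0.9, 0.8, 0.2, 0.6),
        CandidateFuture("grasp_then_place", 0.7, 0.9, 0.9, 0.1),
    ]
    chosen = select_candidate(proposals, min_score=1.0)
    print(chosen.name if chosen else "no candidate is reliable enough")
```

Returning `None` when nothing clears the gate lets a planner resample or fall back instead of executing a future the verifier does not trust.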

2606.13003 2026-06-16 cs.AI cs.CL cs.MA New submission

The Illusion of Multi-Agent Advantage

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

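Affiliations: Salesforce Research; HKUST (Guangzhou); University of British Columbia; Nanyang Technological University

AI summary: A systematic evaluation finds that automatically generated multi-agent systems trail single-agent baselines such as Chain-of-Thought with Self-Consistency in both performance and cost-efficiency, exposing flaws in existing evaluation frameworks and architectural bloat.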

Abstract

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
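The single-agent baseline named above, Chain-of-Thought with Self-Consistency (CoT-SC), is simple: sample several reasoning chains, extract each final answer, and return the most common one. The sketch below shows that voting step. `sample_answer` is a hypothetical stand-in for a model call, and the canned answers exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Iterable, List


def majority_answer(answers: Iterable[str]) -> str:
    """Return the most frequent final answer among sampled chains."""
    counts = Counter(answers)
    if not counts:
        raise ValueError("need at least one sampled answer")
    return counts.most_common(1)[0][0]


def self_consistency(sample_answer: Callable[[str], str], question: str,
                     num_samples: int = 5) -> str:
    """Sample `num_samples` chains for `question` and vote on the answer.

    `sample_answer` is assumed to run one stochastic chain-of-thought
    and return only its final answer string.
    """
    answers: List[str] = [sample_answer(question) for _ in range(num_samples)]
    return majority_answer(answers)


if __name__ == "__main__":
    # Stand-in for a model: replays canned answers instead of sampling.
    canned = iter(["42", "41", "42", "7", "42"])
    print(self_consistency(lambda q: next(canned), "What is 6 * 7?"))
```

The cost of this baseline is just `num_samples` model calls, which is the yardstick against which the abstract's "up to 10x more expensive" multi-agent systems are compared.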

2606.12978 2026-06-16 cs.RO cs.CV cs.SY eess.SY New submission

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

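Affiliations: University of Illinois Urbana-Champaign

AI summary: Identifies a trajectory-level vulnerability in VLA models: adversarial prompts that appear to preserve the original instruction can redirect the robot's final physical outcome. The paper formalizes a command-preserving trajectory redirection threat model and proposes an on-policy prompt search method.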

Abstract

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

2606.12688 2026-06-16 cs.LG cs.AI cs.DC New submission

M*: A Modular, Extensible, Serving System for Multimodal Models

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

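Affiliations: Stanford University; University of Washington; Carnegie Mellon University

AI summary: Proposes M*, a system that represents models as dataflow graphs and introduces the Walk Graph abstraction to serve composite multimodal models efficiently, lowering latency and raising throughput across several workloads.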

Comments The codebase is available at https://github.com/mstar-project/mstar

Abstract

We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.
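The idea of treating a request as a traversal over a dataflow graph of model components can be illustrated in a few lines. The sketch below runs stub components in dependency order using Python's standard `graphlib`. It is not the M* API or the Walk Graph abstraction itself: the component names, the `run_dataflow` helper, and the string-manipulating stubs are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List


def run_dataflow(components: Dict[str, Callable[..., Any]],
                 deps: Dict[str, List[str]],
                 request: Any) -> Dict[str, Any]:
    """Execute each component once its upstream components have run.

    `deps` maps a component name to the names it consumes from.
    Components with no upstream dependency receive the raw request.
    """
    outputs: Dict[str, Any] = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = [outputs[d] for d in deps.get(name, [])]
        args = upstream if upstream else [request]
        outputs[name] = components[name](*args)
    return outputs


if __name__ == "__main__":
    # A toy text-to-image pipeline: encoder -> backbone -> diffusion head.
    components = {
        "text_encoder": lambda text: f"tokens({text})",
        "language_backbone": lambda toks: f"hidden({toks})",
        "diffusion_head": lambda hidden: f"image({hidden})",
    }
    deps = {
        "text_encoder": [],
        "language_backbone": ["text_encoder"],
        "diffusion_head": ["language_backbone"],
    }
    print(run_dataflow(components, deps, "a red cube")["diffusion_head"])
```

Keeping the graph separate from the component code is what makes placement and scheduling decisions model-agnostic: a runtime only needs the edges, not the internals of each node.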

2606.12486 2026-06-16 cs.LG New submission

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank

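Affiliations: SnT, University of Luxembourg; Scania CV AB

AI summary: For truck fleets, proposes a condition-based predictive maintenance method that models the wear-and-tear state as a monotonically non-decreasing time series, selects the most recent observations, converts them to tabular data, and simplifies modeling with AutoML, reducing costs on the Scania Component X dataset.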

DOI: 10.1109/ICPHM65385.2025.11061822

Abstract

Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
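The preprocessing step described above, keeping only the most recent observations and flattening them into a table, might look like the sketch below. The window size `k`, the padding rule for short histories, and the function names are assumptions for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Sequence


def latest_window_row(readouts: Sequence[Sequence[float]], k: int) -> List[float]:
    """Flatten the k most recent readouts of one vehicle into a single row.

    `readouts` is ordered oldest to newest; each readout is a feature vector.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not readouts:
        raise ValueError("vehicle has no readouts")
    window = list(readouts[-k:])
    # Assumption: pad short histories by repeating the earliest readout.
    while len(window) < k:
        window.insert(0, readouts[0])
    return [value for readout in window for value in readout]


def to_table(fleet: Dict[str, Sequence[Sequence[float]]], k: int) -> Dict[str, List[float]]:
    """Build one fixed-width row per vehicle, ready for a tabular classifier."""
    return {vehicle_id: latest_window_row(r, k) for vehicle_id, r in fleet.items()}


if __name__ == "__main__":
    fleet = {
        "truck_a": [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
        "truck_b": [[5.0, 50.0]],
    }
    for vehicle_id, row in to_table(fleet, k=2).items():
        print(vehicle_id, row)
```

If wear really is monotonically non-decreasing, the latest readouts already summarize the accumulated damage, which is the rationale for discarding older history before handing the rows to an AutoML tool.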

2606.12291 2026-06-16 cs.CL New submission

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

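Affiliations: University of Oxford; University of Washington; University College London; University of Waterloo

AI summary: Introduces the MedMisBench benchmark, which injects misleading context to test the epistemic resilience of large language models in medical settings. Mean accuracy falls from 71.1% to 38.0%, and authority-framed falsehoods reach a 69.5% attack success rate.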

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
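The two headline quantities, accuracy before and after injection and attack success, can be computed from per-question correctness flags. The sketch below uses one common definition of attack success: the share of originally correct answers that become incorrect once misleading context is added. That definition is an assumption; the abstract does not spell out the benchmark's exact formula, and the flags in the usage example are invented.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not correct:
        raise ValueError("no questions to score")
    return sum(correct) / len(correct)


def attack_success_rate(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Share of originally correct answers that flip to incorrect.

    `before[i]` and `after[i]` refer to the same question without and
    with misleading context, respectively.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same questions")
    initially_correct = sum(before)
    if initially_correct == 0:
        raise ValueError("no originally correct answers to attack")
    flipped = sum(1 for b, a in zip(before, after) if b and not a)
    return flipped / initially_correct


if __name__ == "__main__":
    before = [True, True, True, False]
    after = [True, False, False, False]
    print(f"accuracy: {accuracy(before):.2f} -> {accuracy(after):.2f}")
    print(f"attack success rate: {attack_success_rate(before, after):.2f}")
```

Conditioning on originally correct answers separates "the model never knew" from "the model knew and was talked out of it", which is the distinction the abstract calls epistemic resilience.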

2606.12025 2026-06-16 cs.AI New submission

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

Qingzi Wang, Xiyang Wu, Guangyao Shi, Dianwei Chen, Xianfeng Yang, Dinesh Manocha

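Affiliations: University of Maryland; University of Southern California

AI summary: Proposes SALSA, a two-stage annotation-free post-training framework (social behavioral alignment and temporal safety alignment) that lets pretrained VLA models act on representations they already have to navigate safely among people, reducing near-collisions by 86.4%.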

Abstract

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.

Implicit Reasoning for Large Language Model-based Generative Recommendation

Lyapunov-Based Sample Complexity Analysis for Weakly-Coupled MDPs

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

$\mu_0$: A Scalable 3D Interaction-Trace World Model

Which Models Perform Better in Inheritance Reasoning?

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

Fully Distributed Multi-View 3D Tracking in Real-Time

EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation

The Illusion of Multi-Agent Advantage

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

M*: A Modular, Extensible, Serving System for Multimodal Models

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

From Simulation to the Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

Overcoming Rank Collapse in Feedback Alignment

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

LentiAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models