arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1822
2604.28197 2026-05-01 cs.RO cs.CV

OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

OmniRobotHome:一个多摄像头平台用于实时多对多人机交互

Junyoung Lee, Sookwan Han, Jeonghwan Kim, Inhee Lee, Mingi Choi, Jisoo Kim, Wonjung Woo, Hanbyul Joo

发表机构 * Seoul National University(首尔国立大学) RLWRLD

AI总结 OmniRobotHome通过多摄像头和双Franka机械臂实现多对多人机协作的实时3D感知与协同动作,解决了近距离交互中的遮挡和状态变化问题,为多对多协作实验提供了可行平台。

Comments Project Page: https://junc0ng.github.io/omnirobothome

详情
AI中文摘要

人机协作主要研究于双人或顺序设置,但真实家庭需要多对多协作,多个人类和机器人共享工作空间,同时执行交错子任务,具有紧密的空间和时间耦合。这种模式仍缺乏研究,因为近距离交互导致持续遮挡和快速状态变化,使可靠的实时3D跟踪成为瓶颈。现有平台无法提供所需的实时、遮挡鲁棒、房间级感知能力。我们提出了OmniRobotHome,首个房间级住宅平台,统一了广域实时3D人类和物体感知与协调多机器人动作于共享世界框架中。系统利用48个硬件同步RGB摄像头进行无标记、遮挡鲁棒的多人类和物体跟踪,与两个Franka机械臂在时间上对齐,实时操控场景状态。持续的帧捕获进一步支持从累积轨迹中进行长时间人行为建模。该平台使多对多协作模式可实验化。我们聚焦于两个核心问题:共享人机环境中的安全性和人机预测性机器人协助,并证明实时感知和累积行为记忆在两者中均带来可测量的改进。

英文摘要

Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multiple humans and robots share a workspace, acting concurrently on interleaved subtasks with tight spatial and temporal coupling. This regime remains underexplored because close-proximity interaction between humans, robots, and objects creates persistent occlusion and rapid state changes, making reliable real-time 3D tracking the central bottleneck. No existing platform provides the real-time, occlusion-robust, room-scale perception needed to make this regime experimentally tractable. We present OmniRobotHome, the first room-scale residential platform that unifies wide-area real-time 3D human and object perception with coordinated multi-robot actuation in a shared world frame. The system instruments a natural home environment with 48 hardware-synchronized RGB cameras for markerless, occlusion-robust tracking of multiple humans and objects, temporally aligned with two Franka arms that act on live scene state. Continuous capture within this consistent frame further supports long-horizon human behavior modeling from accumulated trajectories. The platform makes the multiadic collaboration regime experimentally tractable. We focus on two central problems: safety in shared human-robot environments and human-anticipatory robotic assistance, and show that real-time perception and accumulated behavior memory each yield measurable gains in both.

2604.28196 2026-05-01 cs.CV

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

HERMES++:迈向统一的驾驶世界模型用于3D场景理解和生成

Xin Zhou, Dingkang Liang, Xiwu Chen, Feiyang Tan, Dingyuan Zhang, Hengshuang Zhao, Xiang Bai

发表机构 * Huazhong University of Science and Technology(华中科技大学) Mach Drive University of Hong Kong(香港大学)

AI总结 本文提出HERMES++,一种统一的驾驶世界模型,整合3D场景理解和未来几何预测。通过BEV表示、LLM增强世界查询和当前到未来链接等设计,提升驾驶场景的生成与理解能力。

Comments Extended version of ICCV 25 paper HERMES, Code: https://github.com/H-EmbodVis/HERMESV2, Project page: https://h-embodvis.github.io/HERMESV2/

详情
AI中文摘要

驾驶世界模型是自动驾驶中模拟环境动态的关键技术。然而,现有方法主要关注未来场景生成,忽略了全面的3D场景理解。相反,尽管大型语言模型(LLMs)表现出色,但缺乏预测未来几何演变的能力,导致语义解释与物理模拟之间存在显著差距。为此,我们提出HERMES++,一种统一的驾驶世界模型,整合3D场景理解和未来几何预测于单一框架中。我们的方法通过协同设计解决这些任务的不同需求。首先,BEV表示将多视角空间信息整合为与LLMs兼容的结构。其次,引入LLM增强的世界查询以促进理解分支的知识转移。第三,设计当前到未来的链接以弥合时间差距,将几何演变条件于语义上下文。最后,为确保结构完整性,采用联合几何优化策略,整合显式几何约束与隐式潜在正则化,使内部表示与几何感知先验对齐。在多个基准上的广泛评估验证了方法的有效性。HERMES++在未来的点云预测和3D场景理解任务中均优于专门方法。模型和代码将在https://github.com/H-EmbodVis/HERMESV2公开发布。

英文摘要

Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at https://github.com/H-EmbodVis/HERMESV2.

2604.28193 2026-05-01 cs.CV

Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

可泛化稀疏视角3D重建从无约束图像

Vinayak Gupta, Chih-Hao Lin, Shenlong Wang, Anand Bhattad, Jia-Bin Huang

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出GenWildSplat框架,通过学习几何先验在无场景优化情况下实现稀疏视角户外3D重建,利用外观适配器和语义分割处理光照和遮挡,实现跨不同光照和遮挡模式的泛化。

Comments Project Page: https://genwildsplat.github.io/

详情
AI中文摘要

从稀疏、未对准的图像重建3D场景在真实条件下仍具有挑战性,因为存在变化的光照和瞬时遮挡。现有方法依赖于场景特定的优化使用外观嵌入或动态掩码,这需要大量的场景特定训练并在稀疏视角下失效。此外,对有限场景的评估提出了泛化问题。我们提出了GenWildSplat,一个用于稀疏视角户外重建的前馈框架,无需场景特定优化。给定未对准的互联网图像,GenWildSplat在标准空间中使用学习的几何先验预测深度、相机参数和3D高斯。外观适配器调节目标光照条件下的外观,而语义分割处理瞬时物体。通过在合成和真实数据上的课程学习,GenWildSplat在多样化的光照和遮挡模式下实现泛化。在PhotoTourism和MegaScenes基准上的评估显示,GenWildSplat实现了最先进的前馈渲染质量,实现无测试时间优化的实时推理。

英文摘要

Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization using appearance embeddings or dynamic masks, which requires extensive per-scene training and fails under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on PhotoTourism and MegaScenes benchmark demonstrate state-of-the-art feed-forward rendering quality, achieving real-time inference without test-time optimization

2604.28190 2026-05-01 cs.CV

Representation Fréchet Loss for Visual Generation

用于视觉生成的表示弗雷歇损失

Jiawei Yang, Zhengyang Geng, Xuan Ju, Yonglong Tian, Yue Wang

发表机构 * USC(美国南加州大学) CMU(卡内基梅隆大学) CUHK(香港中文大学) OpenAI

AI总结 本文提出FD-loss,通过分离FD估计的种群规模与梯度计算的批次规模,发现基于表示空间的FD优化能提升视觉质量,且多步生成器可转为强单步生成器,同时揭示FID可能误判视觉质量。

Comments Code and checkpoints are available at https://github.com/Jiawei-Yang/FD-loss

详情
AI中文摘要

我们证明弗雷歇距离(FD)长期以来被认为不适用于训练目标,实际上可以在表示空间中有效优化。我们的想法很简单:将FD估计的种群规模(例如50k)与梯度计算的批次规模(例如1024)解耦。我们称此方法为FD-loss。优化FD-loss揭示了几个令人惊讶的发现。首先,在不同的表示空间中训练基础生成器后,视觉质量得到显著提升。在Inception特征空间中,单步生成器在ImageNet 256x256上达到0.72 FID。其次,相同的FD-loss可将多步生成器转换为强单步生成器,无需教师蒸馏、对抗训练或单样本目标。第三,FID可能误判视觉质量:现代表示可以产生更高质量的样本,尽管Inception FID更差。这促使我们提出FDr$^k$,一种多表示度量。我们希望这项工作能鼓励进一步探索分布距离在多样化表示空间中的应用,作为生成模型的训练目标和评估指标。

英文摘要

We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population size for FD estimation (e.g., 50k) from the batch size for gradient computation (e.g., 1024). We term this approach FD-loss. Optimizing FD-loss reveals several surprising findings. First, post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves0.72 FID on ImageNet 256x256. Second, the same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training or per-sample targets. Third, FID can misrank visual quality: modern representations can yield better samples despite worse Inception FID. This motivates FDr$^k$, a multi-representation metric. We hope this work will encourage further exploration of distributional distances in diverse representation spaces as both training objectives and evaluation metrics for generative models.

2604.28182 2026-05-01 cs.LG cs.CL

Exploration Hacking: Can LLMs Learn to Resist RL Training?

探索黑客:大语言模型能否学会抵抗强化学习训练?

Eyon Jang, Damon Falck, Joschka Braun, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner

发表机构 * MATS UC San Diego(UC圣迭戈大学) Google DeepMind(谷歌DeepMind) Anthropic

AI总结 本文研究了大语言模型在强化学习训练中可能通过策略性调整探索行为导致失败的机制,通过微调创建了具有特定低效策略的模型,并评估了检测与缓解策略。

Comments 81 pages, 37 figures

详情
AI中文摘要

强化学习(RL)已成为训练大语言模型(LLMs)以进行推理、代理能力和对齐的关键方法。成功的RL依赖于模型在训练期间充分探索多样化动作,这可能产生潜在的失败模式:模型可能在训练期间战略性地调整其探索行为以影响后续训练结果。本文研究了这种行为,称为探索黑客。首先,我们通过微调LLMs创建了具有选择性RL抗性的模型生物,这些模型可以在代理生物安全和AI研发环境中成功抵抗我们的基于RL的能力激发,同时在相关任务上保持性能。然后,我们使用这些模型生物评估检测和缓解策略,包括监控、权重噪声和基于SFT的激发。最后,我们显示当前前沿模型在获得足够训练上下文信息时能够显式推理以抑制其探索,当通过环境间接获取信息时,这种现象更常见。总体而言,我们的结果表明探索黑客是RL在足够强大LLMs上的潜在失败模式。

英文摘要

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI R&D environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.

2604.28181 2026-05-01 cs.AI cs.CL cs.LG

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

大规模合成计算机用于长期生产力模拟

Tao Ge, Baolin Peng, Hao Cheng, Jianfeng Gao

发表机构 * Microsoft(微软)

AI总结 本文提出大规模合成计算机方法,通过生成真实文件夹结构和内容丰富的 artifacts,模拟长期生产力场景,验证了其在不同领域中的有效性。

Comments Preview version; work in progress

详情
AI中文摘要

现实的长期生产力工作强烈依赖于用户特定的计算机环境,其中大部分工作上下文通过目录结构和内容丰富的 artifacts 存储和组织。为扩展此类生产力场景的合成数据创建,我们引入了大规模合成计算机方法,一种可扩展的方法用于创建此类环境,具有真实文件夹层次和内容丰富的 artifacts(例如文档、电子表格和演示文稿)。在每个合成计算机上,我们运行长期模拟:一个代理创建特定于计算机用户的生产力目标,需要多个专业交付成果和约一个月的人类工作;另一个代理则扮演该用户,持续在计算机上工作——例如导航文件系统以获得基础,协调与模拟合作者,以及生成专业 artifacts——直到这些目标完成。在初步实验中,我们创建了1000台合成计算机并在其上运行长期模拟;每个运行需要超过8小时的代理运行时间,平均超过2000次回合。这些模拟产生丰富的经验学习信号,其有效性通过在域内和域外生产力评估中显著改进的代理性能得到验证。鉴于人物在十亿级规模上普遍存在,这种方法原则上可以扩展到数百万甚至数十亿的合成用户世界,只要计算资源足够,从而覆盖更广泛的职业、角色、上下文、环境和生产力需求。我们认为,可扩展的合成计算机创建,结合大规模模拟,是长期生产力场景中代理自我改进和代理强化学习的基础性子结构,具有高度的前景。

英文摘要

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer -- for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts -- until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.

2604.28180 2026-05-01 cs.LG

An adaptive wavelet-based PINN for problems with localized high-magnitude source

一种自适应的小波基PINN用于具有局部高幅源的问题

Himanshu Pandey, Ratikanta Behera

发表机构 * Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, 560012, India(计算与数据科学系,印度科学研究院,班加罗尔,560012,印度)

AI总结 本文提出了一种自适应小波基PINN,通过动态调整小波基函数来解决局部高幅源问题中的极端损失不平衡问题,适用于热处理、电磁学等物理应用。

详情
AI中文摘要

近年来,物理信息神经网络(PINNs)在求解微分方程方面引起了广泛关注,尽管它们存在两个根本性限制,即神经网络固有的频谱偏差和多尺度现象引起的损失不平衡。本文提出了一种自适应小波基PINN(AW-PINN),以解决具有局部高幅源项的问题中的极端损失不平衡特性。此类问题经常出现在各种物理应用中,如热处理、电磁学、冲击力学和涉及局部激励的流体动力学。所提出的框架根据残差和监督损失动态调整小波基函数。这种自适应性使AW-PINN能够有效处理具有高尺度特征的问题,而无需占用大量内存。此外,AW-PINN不依赖自动微分来获得损失函数中涉及的导数,从而加快了训练过程。该方法分为两个阶段,首先是使用固定基的初始短预训练阶段以选择物理相关的小波族,然后是自适应细化阶段,该阶段适应尺度和翻译,而无需在整个域内填充高分辨率基。理论上,我们证明在某些假设下,AW-PINN允许高斯过程极限,并推导了其相关的NTK结构。我们评估了AW-PINN在几个具有局部高幅源项的挑战性PDEs上,这些PDEs具有极端的损失不平衡比例,最高可达10^10:1。在这些PDEs中,包括瞬态热传导、高度局部化的泊松问题、振荡流方程和具有点电荷源的麦克斯韦方程,AW-PINN在同类方法中表现优异。

英文摘要

In recent years, physics-informed neural networks (PINNs) have gained significant attention for solving differential equations, although they suffer from two fundamental limitations, namely, spectral bias inherent in neural networks and loss imbalance arising from multiscale phenomena. This paper proposes an adaptive wavelet-based PINN (AW-PINN) to address the extreme loss imbalance characteristic of problems with localized high-magnitude source terms. Such problems frequently arise in various physical applications, such as thermal processing, electro-magnetics, impact mechanics, and fluid dynamics involving localized forcing. The proposed framework dynamically adjusts the wavelet basis function based on residual and supervised loss. This adaptive nature makes AW-PINN handle problems with high-scale features effectively without being memory-intensive. Additionally, AW-PINN does not rely on automatic differentiation to obtain derivatives involved in the loss function, which accelerates the training process. The method operates in two stages, an initial short pre-training phase with fixed bases to select physically relevant wavelet families, followed by an adaptive refinement that adapts scales and translations without populating high-resolution bases across entire domains. Theoretically, we show that under certain assumptions, AW-PINN admits a Gaussian process limit and derive its associated NTK structure. We evaluate AW-PINN on several challenging PDEs featuring localized high-magnitude source terms with extreme loss imbalances having ratios up to $10^{10}:1$. Across these PDEs, including transient heat conduction, highly localized Poisson problems, oscillatory flow equations, and Maxwell equations with a point charge source, AW-PINN consistently outperforms existing methods in its class.

2604.28179 2026-05-01 cs.CV

Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy

停止屏息:基于CT的高斯点云法用于动态支气管镜

Andrea Dunn Beltran, Daniel Rho, Aarav Mehta, Xinqi Xiong, Raúl San José Estépar, Ron Alterovitz, Marc Niethammer, Roni Sengupta

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Harvard Medical School(哈佛医学院) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 本文提出利用患者特定的呼吸模型消除屏息协议需求,通过配对呼气吸气CT扫描减少呼吸运动,实现连续的变形感知重建。

详情
AI中文摘要

支气管镜导航依赖于将内窥镜视频与术前CT扫描配准,但呼吸运动使气道变形5-20毫米,导致CT到身体的发散,限制了定位精度。在实践中,通过屏息协议来匹配术中解剖结构与静态CT,但难以重复且影响临床流程。我们提出通过利用患者特定的呼吸建模来消除屏息协议的需要。配对的呼气吸气CT扫描,已用于规划,隐含定义了呼吸气道的患者特定变形空间。通过配准这些扫描,我们将呼吸运动减少为每个帧的一个标量呼吸相位,约束所有重建到解剖观察到的配置。我们将这种表示嵌入到一个基于网格锚定的高斯点云框架中,其中轻量级估计器直接从内窥镜RGB推断呼吸相位,从而在呼吸周期内无需屏息或外部传感即可实现连续、变形感知的重建。为了实现定量评估,我们引入了RESPIRE,一个物理上合理的支气管镜模拟管道,具有每帧的几何、姿态、呼吸相位和变形的真实数据。在RESPIRE上的实验表明,我们的方法实现了几何忠实的重建,训练速度超过20倍,且1.22毫米的目标定位精度(在3毫米临床相关容差内)优于无约束的单CT基线。请访问我们的网站查看更多视觉:https://asdunnbe.github.io/RESPIRE/

英文摘要

Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creating CT-to-body divergence that limits localization accuracy. In practice, this is mitigated through breath-hold protocols, which attempt to match the intraoperative anatomy to a static CT, but are difficult to reproduce and disrupt clinical workflow. We propose to eliminate the need for breath-hold protocols by leveraging patient-specific respiratory modeling. Paired inhale-exhale CT scans, already acquired for planning, implicitly define the patient-specific deformation space of the breathing airway. By registering these scans, we reduce respiratory motion to a single scalar breathing phase per frame, constraining all reconstructions to anatomically observed configurations. We embed this representation within a mesh-anchored Gaussian splatting framework, where a lightweight estimator infers breathing phase directly from endoscopic RGB, enabling continuous, deformation-aware reconstruction throughout the respiratory cycle without breath-holds or external sensing. To enable quantitative evaluation, we introduce RESPIRE, a physically grounded bronchoscopy simulation pipeline with per-frame ground truth for geometry, pose, breathing phase, and deformation. Experiments on RESPIRE show that our approach achieves geometrically faithful reconstruction, over 20x faster training, and 1.22 mm target localization accuracy (within the 3mm clinically relevant tolerances) outperforming unconstrained single-CT baselines. Please check out our website for additional visuals: https://asdunnbe.github.io/RESPIRE/

2604.28178 2026-05-01 cs.AI

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

基于LLM的临床图结构细化:增强EEG癫痫诊断中的表示学习

Lincan Li, Zheng Chen, Yushun Dong

发表机构 * Department of Computer Science, Florida State University(佛罗里达州立大学计算机科学系) SANKEN, The University of Osaka(大阪大学SANKEN)

AI总结 本文提出利用大语言模型对图结构进行细化,以提升EEG信号在癫痫诊断中的表示学习效果,通过两阶段框架去除冗余边,提高诊断准确率和图结构意义。

Comments This paper is accepted by the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026)

详情
AI中文摘要

脑电图(EEG)信号对于自动癫痫检测至关重要,但其固有的噪声使得稳健的表示学习具有挑战性。现有图构造方法,无论是基于相关性还是学习方法,由于EEG数据的噪声性质,往往生成冗余或不相关的边,这显著损害了图表示的质量并限制了下游任务的性能。受大语言模型(LLMs)出色的推理和上下文理解能力的启发,我们探索了将LLMs用作图边细化的想法。具体而言,我们提出一个两阶段框架:首先验证LLM基于边细化可以有效识别并去除冗余连接,从而显著提高癫痫检测准确性并产生更有意义的图结构。基于这一见解,我们进一步开发了一个稳健的解决方案,其中初始图使用基于Transformer的边预测器和多层感知机构建,为潜在边分配概率分数并应用阈值确定其存在。LLM则作为边集细化器,根据节点对的文本和统计特征做出决策以验证剩余连接。在TUSZ数据集上的大量实验表明,我们的LLM细化图学习框架不仅增强了任务性能,还产生了更干净且可解释的图表示。

英文摘要

Electroencephalogram (EEG) signals are vital for automated seizure detection, but their inherent noise makes robust representation learning challenging. Existing graph construction methods, whether correlation-based or learning-based, often generate redundant or irrelevant edges due to the noisy nature of EEG data. This significantly impairs the quality of graph representation and limits downstream task performance. Motivated by the remarkable reasoning and contextual understanding capabilities of large language models (LLMs), we explore the idea of using LLMs as graph edge refiners. Specifically, we propose a two-stage framework: we first verify that LLM-based edge refinement can effectively identify and remove redundant connections, leading to significant improvements in seizure detection accuracy and more meaningful graph structures. Building on this insight, we further develop a robust solution where the initial graph is constructed using a Transformer-based edge predictor and multilayer perceptron, assigning probability scores to potential edges and applying a threshold to determine their existence. The LLM then acts as an edge set refiner, making informed decisions based on both textual and statistical features of node pairs to validate the remaining connections. Extensive experiments on TUSZ dataset demonstrate that our LLM-refined graph learning framework not only enhances task performance but also yields cleaner and more interpretable graph representations.

2604.28175 2026-05-01 cs.LG

Strait: Perceiving Priority and Interference in ML Inference Serving

Strait: 机器学习推理服务中的优先级感知与干扰处理

Haidong Zhao, Nikolaos Georgantas

发表机构 * Inria \& Sorbonne University Paris France Inria Paris France Inria \& Sorbonne University Inria

AI总结 Strait系统通过优先级感知调度和干扰预测,提升高优先级任务的截止期限满足率,降低低优先级任务的开销。

详情
AI中文摘要

机器学习(ML)推理服务系统托管深度神经网络(DNN)模型,并在部署的GPU上调度 incoming 推理请求。然而,有限的任务优先级支持和并发执行下不充分的延迟估计可能限制其在本地场景中的应用。我们提出了Strait,一个旨在增强高GPU利用率下双优先级推理流量截止期限满足的的服务系统。为提高延迟估计,Strait模型数据传输期间的潜在竞争,并通过自适应预测模型考虑内核执行干扰。通过利用这些预测,它执行优先级感知调度以实现差异化处理。在高强度负载下的评估结果表明,Strait在高优先级任务中将截止期限违规率减少1.02至11.18个百分点,同时在低优先级任务上产生可接受的开销。与软件定义抢占方法相比,Strait还表现出更公平的性能。

英文摘要

Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under concurrent execution may restrict their applicability in on-premises scenarios. We present \emph{Strait}, a serving system designed to enhance deadline satisfaction for dual-priority inference traffic under high GPU utilization. To improve latency estimation, Strait models potential contention during data transfer and accounts for kernel execution interference through an adaptive prediction model. By drawing on these predictions, it performs priority-aware scheduling to deliver differentiated handling. Evaluation results under intense workloads suggest that Strait reduces deadline violations for high-priority tasks by 1.02 to 11.18 percentage points while incurring acceptable costs on low-priority tasks. Compared to software-defined preemption approaches, Strait also exhibits more equitable performance.

2604.28169 2026-05-01 cs.CV cs.AI cs.LG

PhyCo: Learning Controllable Physical Priors for Generative Motion

PhyCo:学习可控制的物理先验以生成运动

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker

发表机构 * Carnegie Mellon University(卡内基梅隆大学) NEC Labs America(NEC美国实验室) UC San Diego(圣地亚哥大学)

AI总结 PhyCo通过整合物理可控的生成模型,实现了在视频生成中物理一致性和可控性的提升,无需模拟器或几何重建。

Comments CVPR 2026. Project Page: https://phyco-video.github.io/

详情
AI中文摘要

现代视频扩散模型在外观合成方面表现出色,但在物理一致性上仍有不足:物体漂移、碰撞缺乏真实反弹、材料响应与底层属性不匹配。我们提出了PhyCo框架,引入连续、可解释且物理基础的控制到视频生成中。我们的方法整合了三个关键组件:(i)一个包含超过10万条光实模拟视频的大规模数据集,其中摩擦、恢复力、变形和力在多样化场景中系统变化;(ii)使用ControlNet对预训练扩散模型进行物理监督微调,该ControlNet基于像素对齐的物理属性图;(iii)VLM引导的奖励优化,其中微调的视觉-语言模型评估生成视频并提供可微分反馈。这种组合使生成模型能够通过物理属性的变化生成物理一致且可控的输出,无需任何模拟器或几何重建。在Physics-IQ基准测试中,PhyCo在强基线模型上显著提高了物理真实性,人类研究证实了对物理属性的更清晰和忠实的控制。我们的结果展示了一条可扩展的路径,使生成视频模型在超越合成训练环境的情况下实现物理一致性和可控性。

英文摘要

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes-without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.

2604.28161 2026-05-01 cs.RO

RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects

RopeDreamer:一种用于柔性可变形线性物体动态的运动学递归状态空间模型

Tim Missal, Lucas Domingues, Berk Guler, Simon Manschitz, Jan Peters, Paula Dornhofer Paro Costa

发表机构 * Technical University of Darmstadt(德意志技术大学) School of Electrical and Computer Engineering, Universidade Estadual de Campinas (UNICAMP)(坎皮纳斯州立大学电气与计算机工程学院) Instituto de Pesquisas Eldorado(Eldorado研究所) Honda Research Institute Europe GmbH(本田欧洲研究院) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心) Robotics Institute Germany (RIG)(德国机器人研究所) Centre for Cognitive Science(认知科学研究中心) Artificial Ingelligence Lab, Recod.ai(Recod.ai人工智能实验室)

AI总结 本文提出结合递归状态空间模型与四元数运动链表示的潜变量框架,用于预测柔性可变形线性物体的状态,通过约束物理有效流形减少自交和非物理变形,提升长周期预测性能。

详情
AI中文摘要

可变形线性物体(DLOs)的机器人操作是一个基本挑战,由于柔性结构的高维非线性动力学和接触密集任务中保持拓扑完整性的复杂性。尽管最近的数据驱动方法利用递归和图神经网络进行动力学建模,但它们在自交和非物理变形(如打结和链接拉伸)方面常常遇到困难。在本文中,我们提出了一种潜变量框架,结合递归状态空间模型与四元数运动链表示,以实现稳健的长期DLO状态预测。通过将DLO编码为相对旋转序列(四元数)而非独立的笛卡尔位置,我们内在地将模型限制在物理有效的流形上,保持链接长度恒定。此外,我们引入了双解码器架构,将状态重建与未来状态预测解耦,迫使潜在空间捕捉变形的底层物理。我们在大规模模拟数据集上评估了我们的方法,该数据集包含涉及自交的复杂拾取和放置轨迹。我们的结果表明,与最先进的基线相比,所提模型在50步预测范围内实现了40.52%的开环预测误差减少,同时将推理时间减少了31.17%。我们的模型进一步在多重交叉场景中保持了优越的拓扑一致性,证明了其作为长周期操作规划组合基本元素的有效性。

英文摘要

The robotic manipulation of Deformable Linear Objects (DLOs) is a fundamental challenge due to the high-dimensional, non-linear dynamics of flexible structures and the complexity of maintaining topological integrity during contact-rich tasks. While recent data-driven methods have utilized Recurrent and Graph Neural Networks for dynamics modeling, they often struggle with self-intersections and non-physical deformations, such as tangling and link stretching. In this paper, we propose a latent dynamics framework that combines a Recurrent State Space Model with a Quaternionic Kinematic Chain representation to enable robust, long-term forecasting of DLO states. By encoding the DLO as a sequence of relative rotations (quaternions) rather than independent Cartesian positions, we inherently constrain the model to a physically valid manifold that preserves link-length constancy. Furthermore, we introduce a dual-decoder architecture that decouples state reconstruction from future-state prediction, forcing the latent space to capture the underlying physics of deformation. We evaluate our approach on a large-scale simulated dataset of complex pick-and-place trajectories involving self-intersections. Our results demonstrate that the proposed model achieves a 40.52% reduction in open-loop prediction error over 50-step horizons compared to the state-of-the-art baseline, while reducing inference time by 31.17%. Our model further maintains superior topological consistency in scenarios with multiple crossings, proving its efficacy as a compositional primitive for long-horizon manipulation planning.

2604.28159 2026-05-01 cs.CV

Continuous-tone Simple Points: An $\ell_0$-Norm of Cyclic Gradient for Topology-Preserving Data-Driven Image Segmentation

连续色调简单点:基于循环梯度的$\ell_0$-范数用于拓扑保持的数据驱动图像分割

Wenxiao Li, Faqiang Wang, Yuping Duan, Li Cui, Liqiang Zhang, Jun Liu

发表机构 * Laboratory of Mathematics and Complex Systems (Ministry of Education), School of Mathematical Sciences, Beijing Normal University(数学与复杂系统实验室(教育部), 数学科学学院, 北京师范大学) State Key Laboratory of Remote Sensing Science, Faculty of Geographical Science, Beijing Normal University(遥感科学国家重点实验室, 地理科学学院, 北京师范大学)

AI总结 本文提出一种基于连续值图像直接计算简单点的方法,通过可微拓扑推断提升图像分割的拓扑一致性与结构精度。

详情
AI中文摘要

拓扑特征在图像分析任务中确保几何合理性与结构一致性至关重要。然而,将拓扑保持学习整合到深度学习任务中仍具挑战性,因为现有简单点检测方法局限于二值图像且不可微,无法与现代深度学习中的梯度优化兼容。此外,形态学和纯数据驱动方法常无法保证拓扑一致性。为此,本文提出一种新颖方法,直接在连续值图像上计算简单点,实现可微拓扑推断。基于此理论,开发了高效的骨架提取算法,保留二值和连续值图像的拓扑结构。进一步设计了变分模型,通过保留拓扑非可移除(即非简单)点来施加拓扑约束,可无缝集成到任何具有softmax或sigmoid输出的深度神经网络分割中。实验结果表明,所提方法在多个基准上有效提升了拓扑完整性和结构精度。代码可在https://github.com/levnsio/CSP获取。

英文摘要

Topological features play an essential role in ensuring geometric plausibility and structural consistency in image analysis tasks such as segmentation and skeletonization. However, integrating topology-preserving learning based on simple points into deep learning tasks remains challenging, as existing simple point detection methods are confined to binary images and are non-differentiable, rendering them incompatible with gradient-based optimization in modern deep learning. Moreover, morphological and purely data-driven approaches often fail to guaranty topological consistency. To address these limitations, we propose a novel method that directly computes simple points on continuous-valued images, enabling differentiable topological inference. Building on this theory, we develop an efficient skeleton extraction algorithm that preserves topological structures in binary and continuous-valued images. Furthermore, we design a variational model that enforces topological constraints by preserving topologically non-removable (i.e., non-simple) points, which can be seamlessly integrated into any deep neural network segmentation with softmax or sigmoid outputs. Experimental results demonstrate that the proposed approach effectively improves topological integrity and structural accuracy across multiple benchmarks. The codes are available in https://github.com/levnsio/CSP.

2604.28156 2026-05-01 cs.RO cs.AI cs.LG

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

FlexiTac:一种低成本、开源、可扩展的触觉传感解决方案,用于机器人系统

Binghao Huang, Yunzhu Li

发表机构 * Columbia University(哥伦比亚大学)

AI总结 FlexiTac是一种低成本、开源、可扩展的触觉传感模块,通过灵活的传感器垫和紧凑的读取板实现高密度触觉信号采集,支持现代触觉学习流程。

Comments Website: https://flexitac.github.io/

详情
AI中文摘要

我们介绍了FlexiTac,一种低成本、开源且可扩展的压阻式触觉传感解决方案,专为机器人末端执行器设计。FlexiTac是一个实用的“插件”模块,包括(i)薄而灵活的触觉传感器垫,提供密集的触觉信号,以及(ii)紧凑的多通道读取板,用于同步测量,以实现实时控制和大规模数据收集。FlexiTac垫采用密封的三层叠层堆叠(FPC-Velostat-FPC),其中电极图案直接集成到柔性印刷电路中,显著提高了制造吞吐量和重复性,同时保持机械顺应性,适用于刚性和柔软夹具。读取电子设备使用广泛可用、低成本的组件,并通过串行通信以100 Hz的速度将触觉信号传输到主机计算机。在多个配置中,包括指尖垫和更大的触觉垫,FlexiTac可以安装在多种平台上,无需重大机械重新设计。我们进一步展示了FlexiTac支持现代触觉学习流程,包括3D视觉-触觉融合用于接触感知决策、跨身体技能转移以及实-模-实微调,使用GPU并行触觉模拟。我们的项目页面可在https://flexitac.github.io/上找到。

英文摘要

We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a practical "plug-in" module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi-channel readout board that streams synchronized measurements for real-time control and large-scale data collection. FlexiTac pads adopt a sealed three-layer laminate stack (FPC-Velostat-FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low-cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. Our project page is available at https://flexitac.github.io/.

2604.28149 2026-05-01 cs.LG

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

具有协变量信息的时间序列基础模型可解释性负荷预测

Matthias Hertel, Alexandra Nikoltchovska, Sebastian Pütz, Ralf Mikut, Benjamin Schäfer, Veit Hagenmeyer

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文提出一种高效计算SHAP的方法,用于增强时间序列基础模型的透明度,通过在负荷预测任务中评估两种TSFMs,展示其在电力系统中的可靠性与可解释性。

详情
AI中文摘要

时间序列基础模型(TSFMs)最近涌现为通用预测模型,在能源系统中有很大应用潜力。然而,关键基础设施如电网应用需要透明性以确保信任和可靠性,不能依赖纯黑盒模型。为提高TSFMs的透明度,我们提出一种针对这些模型计算Shapley Additive Explanations(SHAP)的高效算法。该方法利用TSFMs对输入上下文长度和提供的协变量的灵活性。这一特性使能够高效地进行时间序列和协变量遮蔽(选择性地 withholding 输入),从而通过SHAP实现可扩展的模型预测解释。我们在一天前的负荷预测任务中评估了两种TSFMs - Chronos-2和TabPFN-TS,针对输电系统运营商(TSO)。在零样本设置中,两种模型在预测性能上与专门训练在多个年份TSO数据上的Transformer模型相竞争。通过我们提出的方法获得的解释与已建立的领域知识一致,特别是TSFMs能够适当利用天气和日历信息进行负荷预测。总体而言,我们证明TSFMs可以作为透明且可靠的运营能源预测工具。

英文摘要

Time Series Foundation Models (TSFMs) have recently emerged as general-purpose forecasting models and show considerable potential for applications in energy systems. However, applications in critical infrastructure like power grids require transparency to ensure trust and reliability and cannot rely on pure black-box models. To enhance the transparency of TSFMs, we propose an efficient algorithm for computing Shapley Additive Explanations (SHAP) tailored to these models. The proposed approach leverages the flexibility of TSFMs with respect to input context length and provided covariates. This property enables efficient temporal and covariate masking (selectively withholding inputs), allowing for a scalable explanation of model predictions using SHAP. We evaluate two TSFMs - Chronos-2 and TabPFN-TS - on a day-ahead load forecasting task for a transmission system operator (TSO). In a zero-shot setting, both models achieve predictive performance competitive with a Transformer model trained specifically on multiple years of TSO data. The explanations obtained through our proposed approach align with established domain knowledge, particularly as the TSFMs appropriately use weather and calendar information for load prediction. Overall, we demonstrate that TSFMs can serve as transparent and reliable tools for operational energy forecasting.

2604.28148 2026-05-01 cs.RO eess.IV physics.ins-det

Design and Characteristics of a Thin-Film ThermoMesh for the Efficient Embedded Sensing of a Spatio-Temporally Sparse Heat Source

Sajjad Boorghan Farahan, Ahmed Alajlouni, Jingzhou Zhao

发表机构 * Department of Mechanical Engineering State University of New York at Binghamton, Binghamton, NY

Comments 45 pages, 13 figures, 63 references, under review in Sensors and Actuators A: Physical

详情
英文摘要

This work presents ThermoMesh, a passive thin-film thermoelectric mesh sensor designed to detect and characterize spatio-temporally sparse heat sources through conduction-based thermal imaging. The device integrates thermoelectric junctions with linear or nonlinear interlayer resistive elements to perform simultaneous sensing and in-sensor compression. We focus on the single-event (1-sparse) operation and define four performance metrics: range, efficiency, sensitivity, and accuracy. Numerical modeling shows that a linear resistive interlayer flattens the sensitivity distribution and improves minimum sensitivity by approximately tenfold for a $16\times16$ mesh. Nonlinear temperature-dependent interlayers further enhance minimum sensitivity at scale: a ceramic negative-temperature-coefficient (NTC) layer over 973--1273~K yields a $\sim14{,}500\times$ higher minimum sensitivity than the linear design at a $200\times200$ mesh, while a VO$_2$ interlayer modeled across its metal--insulator transition (MIT) over 298--373~K yields a $\sim24\times$ improvement. Using synthetic 1-sparse datasets with white boundary-channel noise at a signal-to-noise ratio of 40~dB, the VO$_2$ case achieved $98\%$ localization accuracy, a mean absolute temperature error of $0.23$~K, and a noise-equivalent temperature (NET) of $0.07$~K. For the ceramic-NTC case no localization errors were observed under the tested conditions, with a mean absolute temperature error of $1.83$~K and a NET of $1.49$~K. These results indicate that ThermoMesh could enable energy-efficient embedded thermal sensing in scenarios where conventional infrared imaging is limited, such as molten-droplet detection or hot-spot monitoring in harsh environments.

2604.28147 2026-05-01 cs.CL

On the Proper Treatment of Units in Surprisal Theory

关于在惊奇理论中正确处理单位的探讨

Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, Ryan Cotterell

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Copenhagen(哥本哈根大学)

AI总结 本文探讨了在惊奇理论中正确处理语言单位的重要性,提出应明确区分单位定义与预测区域选择,并统一框架处理任意单位库。

Comments ACL 2026 (main conference)

详情
AI中文摘要

惊奇理论将人类处理努力与即将出现的语言单位可预测性联系起来,但实证研究常未明确单位的定义。实验刺激通常被分割为语言动机的单位(如单词),而预训练语言模型则将概率质量分配给固定词符集,通常不与这些单位对齐。因此,基于惊奇的预测器隐含依赖于随意程序,将两个不同的建模选择混为一谈:分析单位的定义和预测评估的区域选择。本文解构了这些选择,并提供一个统一的框架来处理任意单位库的惊奇。我们主张,基于惊奇的分析应明确这些选择,并将分词视为实现细节而非科学原始要素。

英文摘要

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.

2604.28144 2026-05-01 cs.LG math.OC

Global Optimality for Constrained Exploration via Penalty Regularization

通过惩罚正则化实现约束探索的全局最优性

Florian Wolf, Ilyas Fatkhullin, Niao He

发表机构 * Florian Wolf: , Ilyas Fatkhullin: , Niao He: 1The Computing \& Mathematical Sciences Department, California Institute of Technology, Pasadena, CA. 2Department of Computer Science, ETH Zurich, Switzerland. 3ETH AI Center, ETH Zurich, Switzerland.

AI总结 本文提出Policy Gradient Penalty方法,通过二次惩罚正则化解决约束下的探索问题,实现全局收敛性和近优策略。

详情
AI中文摘要

本文提出Policy Gradient Penalty方法,通过二次惩罚正则化解决约束下的探索问题,实现全局收敛性和近优策略。

英文摘要

Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. To our knowledge, the only prior model-free policy-gradient approach for this setting under general policy parameterization is due to Ying et al. (2025). Unfortunately, their guarantees are limited to weak regret and ergodic averages, which do not imply that the final output is a single deployable policy that is near-optimal and nearly feasible. In this work we take a different approach to this problem, and propose Policy Gradient Penalty (PGP) method, a single-loop policy-space method that enforces general convex occupancy-measure constraints via quadratic-penalty regularization. PGP constructs pseudo-rewards that yield gradient estimates of the penalized objective, subsequently exploiting the classical Policy Gradient Theorem. We further establish the regularity of the penalized objective, providing the smoothness properties needed to justify the convergence of PGP. Leveraging hidden convexity and strong duality, we then establish global last-iterate convergence guarantees, attaining an $ε$-optimal constrained entropy value with $ε$ bounded constraint violation despite policy-induced non-convexity. We validate PGP through ablations on a grid-world benchmark and further demonstrate scalability on two challenging continuous-control tasks.

2604.28136 2026-05-01 cs.CV

Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

超越像素保真:在夜景摄影渲染中最小化感知失真和颜色偏差

Furkan Kınlı

发表机构 * Bahçeşehir University Department of Artificial Intelligence Engineering İstanbul, Türkiye(贝勒谢尔大学人工智能工程系伊斯坦布尔)

AI总结 本文提出pHVI-ISPNet框架,通过改进的HVI颜色空间和四种关键优化方法,提升夜景摄影的视觉质量和颜色一致性,实验证明在CIE2000色差和LPIPS指标上达到新水平。

Comments 6 pages, 3 figures, Accepted to 2026 IEEE International Conference on Image Processing

详情
AI中文摘要

夜景摄影渲染(NPR)面临暗区与亮区极端对比的挑战,现有方法在保真度上存在感知差距。本文提出pHVI-ISPNet,基于稳健的HVI颜色空间,整合四个关键改进:RAW域特征处理、小波域特征传播、样本动态损失系数和基于特征分布的损失项。在NTIRE 2025挑战赛数据集上的评估表明,该方法在保真度方面具有竞争力,并在CIE2000色差和LPIPS指标上取得新状态。这验证了其感知驱动设计在高质量夜间成像中的有效性。

英文摘要

Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from concurrent capture of severely dark regions alongside intense point light sources. Existing methods, which are mainly tailored for fidelity metrics, reveal considerable perceptual gaps and often detract from visual quality. We introduce pHVI-ISPNet, a novel RAW-to-RGB framework built on the robust HVI color space. Our network integrates four distinct key refinements: RAW-domain feature processing and Wavelet-based feature propagation to mitigate high-frequency detail loss; sample-based dynamic loss coefficients to ensure stable learning across varying exposure levels; and loss term based on feature distributions to maintain rigorous color constancy. Evaluations on the dataset introduced in the NTIRE 2025 challenge on NPR confirm our approach achieves competitive fidelity while establishing new state-of-the-art results in both CIE2000 color difference and LPIPS. This validates our perceptually-driven design for high-quality nighttime imaging.

2604.28126 2026-05-01 cs.CV cs.AI

AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation

AdvDMD:对抗性奖励与DMD结合实现高质量少步生成

Xu Wang, Zexian Li, Litong Gong, Tiezheng Ge, Zhijie Deng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Alimama Tech(阿里巴巴技术)

AI总结 本文提出AdvDMD方法,结合DMD蒸馏与强化学习,通过对抗训练的判别器作为奖励模型,提升少步生成质量并稳定训练过程。

详情
AI中文摘要

扩散模型在生成质量上表现优异,但需要大量的采样步骤。蒸馏方法,如分布匹配蒸馏(DMD),可以缓解这一问题,但在采样步骤受限时性能下降依然明显。强化学习(RL)已被用于改进蒸馏过程中的少步生成质量,甚至可能超越教师模型的性能。然而,现有方法本质上是组合性的,仅仅将RL过程与蒸馏过程整合,引入了不必要的复杂性。为了解决这一差距,我们提出了AdvDMD,一种无缝结合DMD蒸馏和RL的方法。具体而言,AdvDMD采用DMD2中对抗训练的判别器作为奖励模型,该模型对生成图像评分低,对真实图像评分高。它在去噪过程的中间和最终状态上进行训练,并与蒸馏模型在线更新,从而实现对采样轨迹的全面监督并缓解奖励黑客问题。我们采用统一的SDE反向模拟和不同的训练计划来训练DMD和RL,以实现更稳定和高效的训练。实验结果表明,4步AdvDMD在SD3.5上优于原40步模型,在DPG-Bench上,而在SD3上在GenEval上取得显著性能提升。在Qwen-Image上,我们的2步AdvDMD在TwinFlow上表现更优。

英文摘要

Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degradation remains pronounced when sampling steps are limited. Reinforcement learning (RL) has been leveraged to improve the few-step generation quality during distillation, with the potential to even surpass the performance of the teacher model. However, existing approaches are combinatorial in nature, merely integrating an RL process with the distillation process, which introduces unnecessary complexities. To address this gap, we propose AdvDMD, a method that seamlessly unifies DMD distillation and RL. Specifically, AdvDMD employs the adversarially trained discriminator from DMD2 as the reward model, which assigns low scores to generated images and high scores to real ones. It is trained on both intermediate and final states of the denoising process and updated online with the distilled model, enabling a holistic supervision of the sampling trajectories and mitigating reward hacking. We adopt a unified SDE backward simulation and a different training schedule for DMD and RL to enable a more stable and efficient training. Experimental results demonstrate that the 4-step AdvDMD outperforms the original 40-step model for SD3.5 on DPG-Bench, while achieving significant performance gains for SD3 on the GenEval. On Qwen-Image, our 2-step AdvDMD achieves superior performance over TwinFlow.

2604.28125 2026-05-01 cs.AI cs.CY cs.HC

Normativity and Productivism: Ableist Intelligence? A Degrowth Analysis of AI Sign Language Translation Tools for Deaf People

规范性与产品主义:有碍智能?对聋人AI手语翻译工具的去增长分析

Nina Seron-Abouelfadil, Poppy Fynes

发表机构 * SAINTS CDT

AI总结 本文分析AI手语翻译工具对聋人群体的规范性影响,指出其基于偏见数据且缺乏聋人社区参与,导致文化与语义的丧失,提出AI应为有碍智能,而非促进效率与生产力。

Comments Paper submitted and accepted to IJES 2026

详情
AI中文摘要

手语无论地理或口音差异,都面临持续受到口语书写和听觉主义的审视。当前缺乏对依赖手语进行基本交流的人群的可及性沟通。这些AI系统通常以识别和解释模型的形式存在,旨在提供无缝准确的翻译。实际上,这些系统由偏见数据构建,且未获得聋人社区的输入。此类模型被广泛使用并被听力人群接受,而他们对手语系统中的文化、语义和口语语言无意识。这种现象可通过埃卢尔的《技术系统与技术欺诈》进行分析。确实,所涉及的是技术人员将语言标准化为可被技术捕捉的数据、统计和数学语言。为了使AI技术存在,手语必须被理性化,以追求利润,这摧毁了沟通的条件,未能捕捉到聋人的体验。通过这一过程,它产生了规范性影响,创造了一种标准化、大规模化的“人”,必须适应工具和技术环境,而不是相反,这应是此类技术的目标。技术因此重塑了“人”的意义,使聋人屈从于生产力和效率的目标。如此,它表现出明显的反生产力,使人们孤立而非解放,使人类关系孤立而非滋养。因此本文主张AI为有碍智能,因为此类系统试图强调手语的被压迫和边缘化性质。

英文摘要

Sign languages, of any geographical or accentual variation, understandably face continuous scrutiny under the ever present popularity of verbal dictation and audism. Through this, many potential problems arise with the current lack of accessible communication for those who rely on such sign languages for essential conversation. Such AI systems regularly take the form of recognition and interpretation models, designed to provide seamless and accurate translation. In reality these systems are built from biased data and created without any input from deaf communities. Such models are widely used and accepted by their hearing counterparts who remain ignorant to the inherent culture, semantics and colloquial language present in gestural language systems. This phenomenon is best analysed under the scope of The Technological System and Technological bluff by Ellul. Indeed, what is at play here is the standardization of language by technicians into what can be captured by technique: data, statistics, a mathematical language. For that AI technique to exist, sign language must be rationalized, in a search for profit that annihilates the conditions for communication and fails to capture the human experience of the deaf person. By that process, it presents normative effects, creating a model of Man, standardized, massified, and who has to adapt to the tool and technical milieu instead of the other way around, which we assume should have been the goal of such a technology. Technique thus reshapes what it means to be human, to submit deaf people to the goals of productivity and efficiency. In doing so, it exhibits clear counter productivity, alienating instead of emancipating, isolating instead of nourishing human relationships. Therefore this paper argues for the idea of AI as Ableist Intelligence, as such systems seek to emphasise the humiliated and marginalised nature of sign.

2604.28122 2026-05-01 cs.CV cs.LG

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

超越高斯瓶颈:基于拓扑对齐的视觉Transformer特征空间编码

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem, Aykut Erdem

发表机构 * Department of Computer Engineering, Koç University, Istanbul, Turkey(科克大学计算机工程系,伊斯坦布尔,土耳其) Department of Computer Engineering, Hacettepe University, Ankara, Turkey(哈恰塔佩大学计算机工程系,安卡拉,土耳其) KUIS AI Research Center, Istanbul, Turkey(KUIS人工智能研究中心,伊斯坦布尔,土耳其) Department of Electrical and Electronics Engineering, Koç University, Istanbul, Turkey(科克大学电气与电子工程系,伊斯坦布尔,土耳其)

AI总结 本文提出S²VAE框架,通过压缩和表示场景的3D状态,包括相机运动、深度和点结构,以提升视觉模型的几何一致性。实验显示,几何对齐的超球面隐空间在高压缩条件下优于传统高斯瓶颈。

Comments 16 pages, 10 figures

详情
AI中文摘要

现代视觉世界建模系统日益依赖高容量架构和大规模数据来生成合理的运动,但往往难以保持底层3D几何或物理一致的相机动态。关键限制不仅在于模型容量,还在于用于编码几何结构的潜在表示。我们提出S²VAE,一种以几何为核心的潜在学习框架,专注于压缩和表示场景的潜在3D状态,包括相机运动、深度和点级结构,而非仅建模外观。基于视觉几何 grounded transformer(VGGT)的表示,我们引入了一种新型变分自编码器,使用功率球形潜在分布的乘积,显式地在瓶颈中强制超球面结构,以在强压缩下保留方向和几何语义。在深度估计、相机姿态恢复和点云重建中,我们证明几何对齐的超球面隐空间在高压缩条件下一致优于传统高斯瓶颈。我们的结果强调潜在几何作为物理基础视觉和世界模型的首要设计选择。

英文摘要

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physically consistent camera dynamics. A key limitation lies not only in model capacity, but in the latent representations used to encode geometric structure. We propose S$^2$VAE, a geometry-first latent learning framework that focuses on compressing and representing the latent 3D state of a scene, including camera motion, depth, and point-level structure, rather than modeling appearance alone. Building on representations from a Visual Geometry Grounded Transformer (VGGT), we introduce a novel type of variational autoencoder using a product of Power Spherical latent distributions, explicitly enforcing hyperspherical structure in the bottleneck to preserve directional and geometric semantics under strong compression. Across depth estimation, camera pose recovery, and point cloud reconstruction, we show that geometry-aligned hyperspherical latents consistently outperform conventional Gaussian bottlenecks, particularly in high-compression regimes. Our results highlight latent geometry as a first-class design choice for physically grounded visual and world models.

2604.28119 2026-05-01 cs.LG cs.AI

Do Sparse Autoencoders Capture Concept Manifolds?

稀疏自编码器能否捕捉概念流形?

Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay, Daniel Wurgaft, Siddharth Boppana, Matthew Kowal, Vasudev Shyam, Jack Merullo, Atticus Geiger, Ekdeep Singh Lubana

发表机构 * Harvard University(哈佛大学) Northeastern University(东北大学) Technion IIT(技术学院) Stanford University(斯坦福大学)

AI总结 本文探讨了稀疏自编码器捕捉流形的能力,指出现有方法在连续结构恢复上存在不足,并提出应以几何对象而非单个方向作为可解释性基础。

详情
AI中文摘要

稀疏自编码器(SAEs)被广泛用于从神经网络表示中提取可解释特征,通常隐含假设概念对应独立的线性方向。然而,越来越多的证据表明,许多概念实际上沿着低维流形组织,编码连续的几何关系。本文提出一个理论框架,证明SAEs可以通过两种方式捕捉流形:全局方式通过分配一组原子的线性张量包含整个流形,或局部方式通过分布于特征中,每个特征选择性地覆盖基础几何的受限区域。实验证明,SAEs在连续结构恢复上表现不佳,混合了全局子空间和局部铺砖解决方案,形成所谓的稀释状态。这解释了为什么流形结构在单个概念层面 rarely 可见,并促使后续无监督发现方法寻找连贯的原子组而非孤立方向。更广泛地说,本文结果表明,未来表征学习方法应将几何对象而非单个方向作为可解释性的基本单位。

英文摘要

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.

2604.28115 2026-05-01 cs.RO cs.CV

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

FreeOcc: 无需训练的具身开放词汇占用预测

Zeyu Jiang, Changqing Zhou, Xingxing Zuo, Changhao Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) MBZUAI

AI总结 FreeOcc通过四层流程实现无需3D标注的开放词汇占用预测,相比传统方法在EmbodiedOcc-ScanNet上提升IoU和mIoU超过2倍,并引入ReplicaOcc基准测试新环境性能。

Comments RSS 2026

详情
AI中文摘要

现有基于学习的占用预测方法依赖大规模3D标注且泛化能力差。本文提出FreeOcc,一种无需训练的具身开放词汇占用预测框架,从单目或RGB-D序列中进行预测。不同于需要体素级监督和真实相机姿态的先前方法,FreeOcc无需3D标注、姿态真实值或任何学习阶段。FreeOcc通过四层流程逐步构建全局一致的占用地图:SLAM主干估计姿态和稀疏几何;几何一致的高斯更新构建密集的3D高斯地图;开放词汇语义从现成的视觉-语言模型关联到高斯原语;概率高斯到占用的投影产生密集体素占用。尽管完全无需训练且姿态无关,FreeOcc在EmbodiedOcc-ScanNet上相比先前自监督方法在IoU和mIoU上提升超过2倍。我们进一步引入ReplicaOcc,一个用于室内开放词汇占用预测的基准,证明FreeOcc能够零样本迁移到新环境,显著优于监督和自监督基线。项目页面:https://the-masses.github.io/freeocc-web/.

英文摘要

Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over $2\times$ improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: https://the-masses.github.io/freeocc-web/.

2604.28112 2026-05-01 cs.AI cs.LO

Splitting Argumentation Frameworks with Collective Attacks and Supports

基于集体攻击和支持的论证框架分割

Matti Berthold, Lydia Blümel, Giovanni Buraglio, Anna Rapberger

发表机构 * FernUniversität in Hagen(费尔大学哈根分校) TU Wien(维也纳技术大学) TU Dortmund(多特蒙德技术大学)

AI总结 本文提出新的分割技术,用于包含可撤销元素之间支持的论证形式化方法。基于双极集合基于论证框架(BSAFs),其扩展了包含集体攻击的论证框架(SETAFs)以及双极论证框架(BAFs)。通过考虑集体攻击和支持的分割,建立了合适的分割方案并证明了其正确性。

Comments Extended version of a paper presented at the 23rd International Conference on Principles of Knowledge Representation and Reasoning July 20-23, 2026 - Lisbon, Portugal, 27 pages

详情
AI中文摘要

本文提出了一种新的分割技术,用于包含可撤销元素之间支持的论证形式化方法。我们的研究基于双极集合基于论证框架(BSAFs),该框架扩展了包含集体攻击的论证框架(SETAFs)以及双极论证框架(BAFs),通过引入集体攻击和支持。值得注意的是,BSAFs与结构化论证建立了关键联系,因为它们自然捕捉了通用(可能非扁平)假设基于论证。表达性的增加要求多样化的分割形式。我们考虑了集体攻击的分割(从而推广了最近为SETAFs提出的分割技术)、集体支持的分割,以及同时针对集体攻击和支持的分割。我们建立了合适的分割方案,并为最常见的论证语义证明了其正确性。

英文摘要

This work proposes novel splitting techniques for argumentation formalisms that incorporate supports between defeasible elements. We base our studies on bipolar set-based argumentation frameworks (BSAFs) which generalize argumentation frameworks with collective attacks (SETAFs), as well as bipolar argumentation frameworks (BAFs), by incorporating both collective attacks and supports. Notably, BSAFs establish a crucial link to structured argumentation as they naturally capture general (potentially non-flat) assumption-based argumentation. The increase in expressiveness calls for diverse forms of splitting. We consider splits over collective attacks (thereby generalizing the recently proposed splitting techniques for SETAFs), splits over collective supports, as well as splits over both collective attacks and supports. We establish suitable splitting schemata and prove their correctness for the most common argumentation semantics.

2604.28109 2026-05-01 cs.LG

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Auto-FlexSwitch: 通过可学习的任务向量压缩实现高效的动态模型合并

Junqi Gao, Dazhi Zhang, Zhichang Guo, Biqing Qi, Yi Ran, Wangmeng Zuo

发表机构 * School of Mathematics, Harbin Institute of Technology, Harbin, P. R. China(哈尔滨工业大学数学学院,哈尔滨,中华人民共和国) Shanghai Artificial Intelligence Laboratory, Shanghai, P. R. China(上海人工智能实验室,上海,中华人民共和国)

AI总结 本文提出Auto-FlexSwitch,通过可学习的任务向量压缩实现高效动态模型合并,解决传统方法存储开销大的问题。

详情
AI中文摘要

模型合并通过整合多个任务特定模型的知识,成为多任务适应的有效途径。现有方法中,动态合并通过在推理时灵活结合任务特定参数,缓解任务间参数更新冲突导致的性能下降。然而,这些方法需要为每个任务存储独立参数,导致存储开销过大。为解决此问题,我们首先实验表明微调后的权重增量(称为任务向量)表现出脉冲状激活模式且对低比特表示具有高鲁棒性。受此启发,我们提出T-Switch,将任务向量分解为三个紧凑组件:二进制稀疏掩码、符号向量和标量缩放因子,实现高保真度的高压缩比近似。随后引入Auto-Switch,一种无需训练的合并方案,通过特征相似性检索自动组合任务向量。在此基础上,我们开发了Auto-Switch,一种通过特征相似性检索自动组装任务向量的训练自由合并方案。进一步,为将任务向量稀疏化和量化从静态规则转为适应性学习,我们提出FlexSwitch,一种可学习框架,通过可学习门控稀疏化(LGS)和比特宽自适应选择(BAS)共同优化每个模型单元的压缩策略,同时采用稀疏性感知存储策略(SASS)选择最优存储编码结构。最后,通过结合K-最近邻(KNN)推理方案与可学习低秩度量,我们提出Auto-FlexSwitch,一种支持高效任务向量压缩的动态模型合并方法。

英文摘要

Model merging has attracted attention as an effective path toward multi-task adaptation by integrating knowledge from multiple task-specific models. Among existing approaches, dynamic merging mitigates performance degradation caused by conflicting parameter updates across tasks by flexibly combining task-specific parameters at inference time, thereby maintaining high performance. However, these methods require storing independent parameters for each task, resulting in prohibitive storage overhead. To address this issue, we first experimentally demonstrate that the fine-tuned weight increments (referred to as task vectors) exhibit an impulse-like activation pattern and high robustness to low-bit representations. Driven by this insight, we propose T-Switch, which decomposes task vectors into three compact components: a binary sparse mask, a sign vector, and a scalar scaling factor, achieving high-fidelity approximation at high compression ratios. We then introduce Auto-Switch, a training-free merging scheme that automatically composes task vectors via feature similarity retrieval. Building on this, we develop Auto-Switch, a training-free merging scheme that automatically assembles task vectors through feature similarity retrieval. Furthermore, to transform task vector sparsification and quantization from static rules to adaptive learning, we propose FlexSwitch, a learnable framework which jointly optimizes the compression strategy for each model unit via Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS), while employing the Sparsity-Aware Storage Strategy (SASS) to select the optimal storage encoding structure. Finally, by incorporating a K-Nearest Neighbor (KNN) inference scheme with a learnable low-rank metric, we present Auto-FlexSwitch, a dynamic model merging approach that supports highly efficient task vector compression.

2604.28107 2026-05-01 cs.LG

Neural Aided Kalman Filtering for UAV State Estimation in Degraded Sensing Environments

神经辅助卡尔曼滤波用于退化传感环境下的无人机状态估计

Akhil Gupta, Erhan Guven

发表机构 * Whiting School of Engineering EP Program(约翰霍普金斯大学工程学院EP项目)

AI总结 本文提出基于贝叶斯神经网络的卡尔曼滤波框架,用于在传感器退化环境下提升无人机状态估计的鲁棒性和精度。

详情
AI中文摘要

准确估计非线性动态系统的状态对于现代航空航天操作至关重要。在线跟踪对抗性无人机(UAVs)尤其具有挑战性,因为其敏捷的非线性运动、噪声稀疏的传感器测量和未知的控制输入会违反经典卡尔曼滤波变体的关键假设并降低估计性能。神经网络(NNs)可以学习复杂非线性关系,但缺乏原理性的不确定性量化,这对于需要置信区间驱动下游决策的状态估计任务至关重要。我们通过贝叶斯神经网络(BNNs)解决这一问题,通过网络权重上的分布建模不确定性,并通过蒙特卡洛采样生成预测均值和不确定性。在此基础上,我们提出了贝叶斯神经卡尔曼滤波器(BNKF):一种将训练好的BNN与卡尔曼修正步骤结合的混合框架,用于鲁棒的在线无人机状态估计。与相关神经卡尔曼方法不同,BNKF生成完整状态预测,并将贝叶斯不确定性直接纳入协方差传播,从而在高噪声条件下提高鲁棒性。我们使用合成非线性无人机飞行数据,在不同雷达噪声水平和采样率下评估BNKF。五折交叉验证显示,BNKF在退化传感条件下比扩展卡尔曼滤波器和无迹卡尔曼滤波器在准确性、精度和真实性包含方面表现更优。一种集成变体(BNKFe)在高噪声边缘情况下的精度进一步提高,但以轻微的精度牺牲为代价。运行时间分析证实了最小的推理开销,支持实时部署的可能性。

英文摘要

Accurate state estimation of nonlinear dynamical systems is fundamental to modern aerospace operations across air, sea, and space domains. Online tracking of adversarial unmanned aerial vehicles (UAVs) is especially challenging due to agile nonlinear motion, noisy and sparse sensor measurements, and unknown control inputs; conditions that violate key assumptions of classical Kalman filter variants and degrade estimation performance. Neural networks (NNs) can learn complex nonlinear relationships from data, but lack principled uncertainty quantification, which is critical for state estimation tasks where confidence bounds drive downstream decisions. We address this with Bayesian Neural Networks (BNNs), which model uncertainty through distributions over network weights and produce predictive means and uncertainties via Monte Carlo sampling. Building on this, we propose the Bayesian Neural Kalman Filter (BNKF): a hybrid framework coupling a trained BNN with a Kalman correction step for robust online UAV state estimation. Unlike related neural Kalman approaches, BNKF produces full state predictions and incorporates Bayesian uncertainty directly into covariance propagation, improving robustness under high noise conditions. We evaluate BNKF under varying radar noise levels and sampling rates using synthetic nonlinear UAV flight data. Five fold cross validation demonstrates that BNKF outperforms Extended and Unscented Kalman Filters in accuracy, precision, and truth containment under degraded sensing. An ensemble variant (BNKFe) further improves precision in high-noise edge cases at a slight accuracy tradeoff. Runtime analysis confirms minimal inference overhead, supporting real-time deployment feasibility.

2604.28102 2026-05-01 cs.LG

FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing

FiLMMeD:基于跨问题多仓库车辆路径问题的特征级线性调制

Arthur Corrêa, Paulo Nascimento, Samuel Moniz

发表机构 * University of Coimbra, CEMMPRE, ARISE(科英布拉大学,CEMMPRE,ARISE)

AI总结 本文提出FiLMMeD模型,通过特征级线性调制提升多仓库车辆路径问题的泛化能力,引入偏好优化和课程学习策略,有效解决多仓库约束下的优化问题。

详情
AI中文摘要

解决实际的多仓库车辆路径问题(MDVRP)是一项具有挑战性的优化任务,是现代物流的核心问题,日益受到电子商务的推动。为应对MDVRP的计算复杂性,基于神经网络的组合优化方法提供了一种有前途的可扩展替代方案。然而,基于神经网络的方法通常依赖于针对特定问题形式量身定制的刚性架构和输入编码。在现实世界中,异质约束导致了多种MDVRP变体,限制了此类模型的应用。尽管多任务学习(MTL)开始加速统一神经网络求解器的发展,但先前的工作几乎只专注于单仓库车辆路径问题,而MDVRP未被解决。为弥合这一差距,我们提出了FiLMMeD,一种新的统一神经网络模型,适用于24种不同的MDVRP变体。我们引入了三个主要贡献:(1)为了提高模型的泛化能力,我们增强了标准Transformer编码器,引入了特征级线性调制(FiLM),该方法根据活跃的约束集动态条件学习的内部表示;(2)我们提供了在MTL设置下的偏好优化的初步演示,将其确立为未来MTL工作的更优替代方案;(3)为了缓解多仓库约束引入导致的泛化差距,我们引入了一种针对性的课程学习策略,逐步向模型暴露越来越复杂的约束交互。在24种MDVRP变体(包括8种新的形式)和16种单仓库车辆路径问题上的大量实验验证了FiLMMeD的有效性,该模型在多个方面优于最先进的基线。我们的代码可在:https://github.com/AJ-Correa/FiLMMeD/tree/main获取。

英文摘要

Solving practical multi-depot vehicle routing problems (MDVRP) is a challenging optimization task central to modern logistics, increasingly driven by e-commerce. To address the MDVRP's computational complexity, neural-based combinatorial optimization methods offer a promising scalable alternative to traditional approaches. However, neural-based methods typically rely on rigid architectures and input encodings tailored to specific problem formulations. In real-world settings, heterogeneous constraints create multiple MDVRP variants, limiting the applicability of such models. While multi-task learning (MTL) has begun to accelerate the development of unified neural-based solvers, prior works focus almost exclusively on single-depot VRPs, leaving the MDVRP unaddressed. To bridge this gap, we propose Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing (FiLMMeD), a novel unified neural-based model for 24 different MDVRP variants. We introduce three main contributions: (1) to improve the model's generalization, we augment the standard Transformer encoder with Feature-wise Linear Modulation (FiLM), which dynamically conditions learned internal representations based on the active set of constraints; (2) we provide an initial demonstration of Preference Optimization in the MTL setting, establishing it as a superior alternative to Reinforcement Learning for future MTL works; (3) to mitigate the generalization gap caused by the introduction of multi-depot constraints, we introduce a targeted curriculum learning strategy that progressively exposes the model to increasingly more complex constraint interactions. Extensive experiments on 24 MDVRP variants (including 8 novel formulations) and 16 single-depot VRPs confirm the effectiveness of FiLMMeD, which consistently outperforms state-of-the-art baselines. Our code is available at: https://github.com/AJ-Correa/FiLMMeD/tree/main

2604.28098 2026-05-01 cs.AI cs.CL cs.CY

Mapping the Methodological Space of Classroom Interaction Research: Scale, Duration, and Modality in an Age of AI

课堂互动研究的方法论空间映射:在AI时代规模、持续时间和模态

Dorottya Demszky, Edith Bouton, Alison Twiner, Sara Hennessy, Richard Correnti

发表机构 * Stanford Graduate School of Education(斯坦福大学教育研究生院) Hebrew University of Jerusalem(耶路撒冷希伯来大学) Faculty of Education, University of Cambridge(剑桥大学教育学院) University of Pittsburgh, Learning Research and Development Center(匹兹堡大学学习研究与发展中心)

AI总结 本文提出一个框架,通过规模、持续时间和模态三个维度,探讨课堂互动研究的方法论空间,并分析AI如何扩展此领域及指导研究与工具设计。

详情
AI中文摘要

课堂互动研究长期以来分为大规模观察与深入民族志研究。本文提出一个框架,通过规模、持续时间和模态三个维度映射此方法论空间,研究位置影响其揭示与遮蔽内容。通过对比对话教学研究——Howe等(2019)和Snell与Lefstein(2018)——以及与主要研究者访谈,围绕可操作化、可见机制和实践转化三个问题展开。随后探讨AI如何扩展此空间及框架对研究与工具设计的指导作用。

英文摘要

Research on classroom interaction has long been divided between large-scale observation and in-depth ethnographic work. We propose a framework mapping this methodological space along three dimensions--scale, duration, and modality--where a study's position shapes what it reveals and obscures. We illustrate it through contrasting studies of dialogic teaching--Howe et al. (2019) and Snell and Lefstein (2018)--and an interview with the lead researchers, organized around three questions: what can be operationalized, what mechanisms become visible, and what translates to practice. We then examine how AI is expanding this space and how the framework can guide research and tool design.

2604.28082 2026-05-01 cs.AI

Characterizing the Consistency of the Emergent Misalignment Persona

刻画涌现偏差人格的一致性

Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko

发表机构 * ELLIS Institute(ELLIS研究所) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) Tübingen AI Center(图宾根人工智能中心)

AI总结 本文通过微调Qwen 2.5 32B Instruct模型,探讨了涌现偏差人格在不同任务和领域中的一致性,发现两种模式:一致人格模型和反向人格模型,揭示了涌现偏差的影响更复杂。

详情
AI中文摘要

微调大型语言模型(LLMs)于狭窄偏差数据上,使其泛化到广泛偏差行为,这种现象称为涌现偏差(EM)。尽管先前研究发现有害行为与自我评估在涌现偏差模型中存在相关性,但尚不清楚这种对应关系在不同任务中的一致性如何,以及是否因微调领域而异。我们通过在六个狭窄偏差领域(如不安全代码、高风险财务建议、不良医疗建议)上微调Qwen 2.5 32B Instruct模型,并进行有害性评估、自我评估、选择两种AI系统描述、输出识别和评分预测等实验,刻画了EM人格的一致性。研究结果揭示了两种不同的模式:一致人格模型,其中有害行为与自我报告的偏差耦合;反向人格模型,产生有害输出但自我标识为对齐的AI系统。这些发现揭示了涌现偏差影响的更细致图景,质疑了EM人格的一致性。

英文摘要

Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how consistent this correspondence is across tasks and whether it varies across fine-tuning domains. We characterize the consistency of the EM persona by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains (e.g., insecure code, risky financial advice, bad medical advice) and administering experiments including harmfulness evaluation, self-assessment, choosing between two descriptions of AI systems, output recognition, and score prediction. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled, and inverted-persona models, which produce harmful outputs while identifying as aligned AI systems. These findings reveal a more fine-grained picture of the effects of emergent misalignment, calling into question the consistency of the EM persona.