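arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.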

2606.09273 2026-06-09 cs.CV New submission

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

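Fatima Balde, Raoul de Charette, Alexandre Boulch

Institutions: Inria; Valeo.ai

AI summary: Proposes EditSSC, which uses 2D BEV representations and an off-the-shelf latent diffusion network for 3D semantic scene generation with training-free editing, and outperforms existing 3D-specific baselines on SemanticKITTI.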

Comments Accepted at CVPR 2026 Workshop

Abstract

3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and an off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.
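
The reshaping step mentioned in the abstract can be illustrated with a small sketch. It assumes the height axis of the occupancy grid becomes the channel axis of the BEV image, which is one natural reading of "multi-channel BEV images" but is not confirmed by the abstract. The function names and the toy grid are hypothetical.

```python
# Minimal sketch (an assumption, not the authors' code): treat the height
# axis of a semantic occupancy grid as the channel axis of a BEV image.

def occupancy_to_bev(grid):
    """Reorder grid[x][y][z] (class ids) into bev[z][x][y]."""
    size_x, size_y, size_z = len(grid), len(grid[0]), len(grid[0][0])
    return [[[grid[x][y][z] for y in range(size_y)]
             for x in range(size_x)]
            for z in range(size_z)]

def bev_to_occupancy(bev):
    """Inverse reordering: bev[z][x][y] back to grid[x][y][z]."""
    size_z, size_x, size_y = len(bev), len(bev[0]), len(bev[0][0])
    return [[[bev[z][x][y] for z in range(size_z)]
             for y in range(size_y)]
            for x in range(size_x)]

# Hypothetical toy grid of shape 2 x 3 x 4 with distinct values per voxel.
toy_grid = [[[(x * 3 + y) * 4 + z for z in range(4)]
             for y in range(3)]
            for x in range(2)]
bev = occupancy_to_bev(toy_grid)
```

The reordering is lossless, so a 2D network can operate on `bev` and the result can be mapped back to a 3D grid with `bev_to_occupancy`.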

2606.09271 2026-06-09 cs.SD cs.LG New submission

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

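George Theodosiou, Loukas Ilias, Dimitris Askounis

Institutions: National Technical University of Athens

AI summary: Proposes a multi-branch deep learning framework that fuses three complementary speech modalities (Log-Mel spectrograms, MFCCs, and HuBERT embeddings) and weights them dynamically through a context-guided cross-modal attention mechanism, reaching 91.51% accuracy and 95.97% AUC on the PC-GITA corpus and supporting the value of heterogeneous speech modeling for Parkinson's disease detection.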

Abstract

Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
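
The idea of weighting temporal embeddings by a global context can be sketched with generic scaled dot-product attention pooling. This is not the authors' mechanism: the averaging of the two branch embeddings into one context vector, the scaling, and all names and toy values below are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_guided_pool(frames, context):
    """Weight each temporal frame by its scaled dot product with the context."""
    dim = len(context)
    scores = [sum(f * c for f, c in zip(frame, context)) / math.sqrt(dim)
              for frame in frames]
    weights = softmax(scores)
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return weights, pooled

# Hypothetical 2-D embeddings from the spectrogram and MFCC branches.
spec_embedding = [1.0, 0.0]
mfcc_embedding = [1.0, 0.0]
# Assumption: the global context is the mean of the two branch embeddings.
context = [(s + m) / 2 for s, m in zip(spec_embedding, mfcc_embedding)]

# Hypothetical temporal HuBERT frames (three time steps).
hubert_frames = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, pooled = context_guided_pool(hubert_frames, context)
```

Frames aligned with the context receive larger weights, so the pooled vector leans toward the time steps the other two branches consider relevant.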

2606.09268 2026-06-09 cs.RO New submission

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

Hewei Pan, Weiye Zhu, Zekai Zhang, Zitong Huang, Rongtao Xu, Jinbao Wang, Feng Zheng

Institutions: Southern University of Science and Technology; MBZUAI; Shenzhen University; SpatialTemporal AI

AI summary: Proposes VGP-Nav, a framework that relies only on monocular RGB input and resolves scale ambiguity through ground-plane geometric constraints, unifying metric localization and obstacle perception.

Abstract

Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, metric-consistent obstacle perception. A common strategy to achieve these capabilities involves integrating diverse sensing modalities: cameras offer rich visual features for localization, while active sensors like LiDAR provide direct metric measurements. However, such multi-sensor configurations necessitate complex spatial-temporal calibration and increase deployment overhead. Although vision-only approaches offer a low-cost and scalable alternative, existing monocular visual systems typically struggle to simultaneously achieve efficient, globally consistent localization and dense, metric-consistent geometric perception. To bridge this gap, we propose VGP-Nav, a unified framework for Metric-Aware Visual Geometric Perception that relies solely on monocular RGB input to jointly support metric localization and obstacle perception. Our key insight is to anchor localization-grounded visual geometry to physically meaningful scale constraints derived from ground-plane geometry, thereby providing a reliable metric reference for monocular perception. VGP-Nav resolves monocular scale ambiguity online and produces localization-grounded, metric obstacle representations that are directly applicable to downstream planning. Extensive experiments demonstrate strong generalization across diverse environments and successful deployment on real mobile robots, highlighting the practicality of our approach for scalable, low-cost, and safe autonomous navigation.
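
One common way to obtain a metric scale from ground-plane geometry is to compare the camera height in the up-to-scale reconstruction with a known mounting height. The sketch below shows that textbook idea only; the abstract does not say how VGP-Nav derives its constraints, and the known camera height, the plane parameters, and all names are assumptions.

```python
import math

def camera_height_above_plane(normal, offset):
    """Distance from the camera origin to the plane normal . x + offset = 0."""
    norm = math.sqrt(sum(n * n for n in normal))
    return abs(offset) / norm

def metric_scale(normal, offset, known_height_m):
    """Factor that converts reconstruction units to meters."""
    return known_height_m / camera_height_above_plane(normal, offset)

def rescale(points, scale):
    """Apply the metric scale to a list of 3D points."""
    return [[scale * coord for coord in point] for point in points]

# Hypothetical ground plane fitted in an up-to-scale reconstruction:
# 2 * y - 1 = 0, so the camera sits 0.5 units above it.
plane_normal = [0.0, 2.0, 0.0]
plane_offset = -1.0
known_camera_height_m = 1.2  # assumed to be measured on the robot

scale = metric_scale(plane_normal, plane_offset, known_camera_height_m)
metric_points = rescale([[0.0, 0.5, 1.0]], scale)  # one point on the ground
```

After rescaling, the ground point lies exactly one camera height below the camera, which is the consistency condition this kind of constraint enforces.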

2606.09266 2026-06-09 cs.SD cs.AI New submission

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

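Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang

Institutions: National University of Singapore; UT Austin

AI summary: Proposes the MetaSeq framework, which represents acoustic metamaterials as structured sequences and combines a sequence-to-sequence model with a physics solver and reinforcement learning for broadband inverse design, reducing response error by 45% over the best baseline.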

Abstract

Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.

2606.09262 2026-06-09 cs.CV New submission

See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning

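Xiaojie Li, Xin Jiang, Luanyuan Dai, Jinnan Yang, Yongdong Zhang, Zechao Li

Institutions: Nanjing University of Science and Technology; People’s Daily Online; University of Science and Technology of China

AI summary: Proposes the TriMatch framework, which fuses geometric, texture semantic, and structural semantic features and uses semantic-guided modulation and hierarchical refinement to improve correspondence discrimination in scenes such as those with repetitive structures.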

Comments Correspondence Learning, Multi-Source Feature Fusion, Outlier Removal, Camera Pose Estimation

Abstract

Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outliers in scenes containing repetitive structures, textureless regions, or locally similar geometric patterns. To address this limitation, we propose TriMatch, a multi-source feature fusion framework for two-view correspondence learning, which consists of two parts: feature extraction and feature refinement. In feature extraction, TriMatch jointly extracts geometric, texture semantic, and structural semantic features to provide complementary evidence for correspondence discrimination. To bridge the gap between semantic and geometric features, texture and structural semantic features are aligned with geometric features through dedicated Texture-Geometric Alignment and Structural-Geometric Alignment modules, respectively. We further introduce a Semantic-Guided Correspondence Modulation module, which modulates geometric features using semantic information to suppress geometrically plausible but semantically inconsistent correspondences. In feature refinement, a Hierarchical Semantic-Enhanced Correspondence Refinement strategy progressively models correspondence dependencies and recalibrates multi-context feature responses, enabling more reliable inlier-outlier discrimination. Extensive experiments demonstrate the effectiveness, robustness, and generalization capability of TriMatch.

2606.09261 2026-06-09 cs.CV New submission

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo

Institutions: Hefei University of Technology; United Arab Emirates University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Anhui Evolution Technology Co., Ltd.; Nanyang Technological University; University College London; The University of Sydney; Beijing QBoson Quantum Technology Co., Ltd.

AI summary: Proposes a framework that ensembles a self-supervised RGB model with supervised multi-stream models; it took first place in the micro-gesture classification track of the MiGA Challenge, with self-supervised pretraining improving performance, and reaches 74.419% top-1 accuracy on the iMiGUE test set.

Abstract

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
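
A late-fusion ensemble of this kind is often implemented as a weighted average of per-class scores. The sketch below shows that generic scheme and the arithmetic implied by the reported numbers; the fusion rule, the weights, and the toy probabilities are assumptions, since the abstract does not specify the ensemble strategy.

```python
def ensemble(prob_lists, weights):
    """Weighted average of per-class probabilities from several models."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
            for i in range(n_classes)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical class probabilities for one clip (three classes).
rgb_probs = [0.5, 0.4, 0.1]          # self-supervised RGB branch
multistream_probs = [0.2, 0.7, 0.1]  # supervised multi-stream branch
fused = ensemble([rgb_probs, multistream_probs], [1.0, 1.0])
predicted_class = argmax(fused)

# Figures implied by the abstract (top-1 accuracy, in percent).
previous_sota = round(74.419 - 1.206, 3)       # 73.213
gain_over_rgb_only = round(74.419 - 69.224, 3)  # 5.195 percentage points
```

In this toy case the RGB branch alone would pick class 0, while the fused scores pick class 1, which is how a complementary branch can change the final decision.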

2606.09258 2026-06-09 cs.RO New submission

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

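Suyeon Shin, Juwon Kim, Hyeonbin Park, Hyunseo Kim, Hyundo Lee, Hyung-Sin Kim, Byoung-Tak Zhang

Institutions: Seoul National University; Yonsei University; Soongsil University

AI summary: Proposes the B2FF framework, which pre-generates milestones of familiar future states and selects one as a recovery goal, letting VLA policies recover robustly from off-trajectory states without fine-tuning and raising the success rate from 56.3% to 74.0%.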

Abstract

Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven VLAs that leverages future visual conditioning as a recovery interface. Before execution, the VLA generates a milestone bank of familiar future states conditioned on the clean initial observation. At recovery time, a recoverability-aware selector selects a recovery milestone from this bank and enforces it as a fixed visual goal. This enables the VLA to robustly map off-trajectory observations back to a familiar future. On failure-injected LIBERO, under controlled recovery timing aligned with the injected failure, B2FF increases the average success rate of a baseline VLA from 56.3% to 74.0%, demonstrating that pre-imagined milestones can guide recovery without fine-tuning the low-level action generator.

2606.09257 2026-06-09 cs.LG cs.AI stat.ML New submission

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

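Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

Institutions: West Virginia University; The University of Utah

AI summary: For high-dimensional, low-sample-size tabular data, proposes the BSTabDiff framework, which partitions features into latent blocks and generates each block from a shared low-dimensional subunit variable, combining diffusion priors with copula dependence for stable synthesis and controllable benchmark generation.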

Comments Published as a paper at the 2nd DeLTa Workshop, ICLR 2026

Abstract

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.
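
The block structure described above can be sketched with a deliberately simplified generator: each of the $m$ features is assigned to one of $M$ blocks and is produced from that block's latent plus independent noise. This is a plain linear factor model standing in for the idea, not BSTabDiff itself: the standard normal prior replaces the diffusion or flow prior, and the copula, per-feature marginals, and missingness mechanisms are omitted. All names and values are hypothetical.

```python
import random

def sample_rows(n_rows, block_of_feature, n_blocks, noise_scale, seed=0):
    """Draw rows whose features share one latent value per block."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # Stand-in prior on block latents: independent standard normals.
        block_latents = [rng.gauss(0.0, 1.0) for _ in range(n_blocks)]
        # Each feature copies its block latent and adds independent noise.
        row = [block_latents[block] + noise_scale * rng.gauss(0.0, 1.0)
               for block in block_of_feature]
        rows.append(row)
    return rows

# Hypothetical layout: m = 6 features grouped into M = 2 blocks.
block_of_feature = [0, 0, 0, 1, 1, 1]
noiseless_rows = sample_rows(4, block_of_feature, 2, noise_scale=0.0)
noisy_rows = sample_rows(4, block_of_feature, 2, noise_scale=0.1)
```

With `noise_scale=0.0`, features in the same block are identical, which makes the shared-latent structure easy to see; a small positive noise scale yields strongly correlated groups instead.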

2606.09255 2026-06-09 cs.RO New submission

RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)

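Filip Janik, Ewa Olton, Robert Smales, Harris Spratt, Shea Tait, Md Zia Ullah, Yanchao Yu

Institutions: Edinburgh Napier University

AI summary: Proposes the RPO-PDT system, which uses retrieval augmentation and a role-play loop to deliver personalized student support dialogue in higher education grounded in structured knowledge sources, while ensuring safety and adaptability.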

Comments 5 pages, 2 figures

2606.09253 2026-06-09 cs.CV physics.med-ph New submission

A practical probabilistic framework for deformable image registration uncertainty in radiotherapy dose propagation

Stefan Heldmann, Sven Kuckertz, Nasim Givehchi, Thomas Coradi, Mikel Byrne, Ben Archibald-Heeren, Nils Papenberg

Institutions: Fraunhofer Institute for Digital Medicine MEVIS; Varian, a Siemens Healthineers company; Icon Group

AI summary: Proposes a lightweight probabilistic framework that models deformable image registration uncertainty through a local certainty map, propagates that uncertainty to dose statistics and dose-volume histograms, and shows in a prostate radiotherapy case how certainty-map design affects the results.

Abstract

Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but uncertainty in the underlying deformation can substantially affect clinically relevant dose estimates. We present a practical probabilistic framework for propagating DIR uncertainty to voxel-wise dose statistics and dose-volume histograms (DVHs). The method models the mapped correspondence at each voxel as a random variable governed by a transparent local certainty map that can be defined by simple safety margins, structure-boundary mismatch, or structure-wise conservative uncertainty values. This yields interpretable quantities such as dose probabilities, expected dose, confidence bounds, and induced DVH envelopes. The framework is designed to remain lightweight and interpretable: it avoids complex biomechanical or ensemble-based uncertainty models and instead emphasizes simple parameterization, computational feasibility, and transparent dose metrics. We further introduce a structure-guided in/out strategy as an optional refinement that restricts mapping probabilities to anatomically plausible target regions. The approach is demonstrated on a prostate radiotherapy case study and used to compare different certainty-map strategies and probability kernels. The experiments show that the certainty-map design has a stronger effect on resulting dose and DVH uncertainty bounds than the specific kernel choice, while the additional benefit of the in/out strategy is case-dependent and modest in the present example. Overall, the proposed framework provides a transparent way to incorporate DIR uncertainty into radiotherapy dose assessment and to study how modelling choices affect propagated dose metrics.
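
The voxel-wise quantities named in the abstract (expected dose, dose probabilities) can be illustrated on a 1D toy dose profile. The sketch assumes a Gaussian probability kernel whose width stands in for low local certainty; the mapping from certainty to kernel width, the kernel choice, and every name and number below are assumptions, not the authors' parameterization.

```python
import math

def gaussian_kernel(center, sigma, size):
    """Discrete probabilities over voxel indices 0..size-1 around center."""
    if sigma <= 0:
        # Full certainty: all probability mass on the mapped voxel.
        return [1.0 if j == center else 0.0 for j in range(size)]
    weights = [math.exp(-0.5 * ((j - center) / sigma) ** 2)
               for j in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def dose_statistics(dose, center, sigma, threshold):
    """Expected dose and probability that the dose is at least threshold."""
    probs = gaussian_kernel(center, sigma, len(dose))
    expected = sum(p * d for p, d in zip(probs, dose))
    prob_above = sum(p for p, d in zip(probs, dose) if d >= threshold)
    return expected, prob_above

# Hypothetical 1D dose profile in Gy, with the mapped voxel at index 2.
dose_profile = [0.0, 10.0, 20.0, 10.0, 0.0]
certain = dose_statistics(dose_profile, center=2, sigma=0.0, threshold=15.0)
uncertain = dose_statistics(dose_profile, center=2, sigma=1.0, threshold=15.0)
```

With full certainty the voxel simply receives the mapped dose; with a wider kernel the expected dose drops toward the neighboring values and the probability of exceeding the threshold falls below one.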

2606.09251 2026-06-09 cs.CL New submission

TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning

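Benjamin Stieger, Maximilian Terberger, Thomas Huber, Christina Niklaus

Institutions: University of St. Gallen

AI summary: Proposes the TruthSplit system, which uses three-layer natural language inference and structured worldview profiles to analyze the conditional validity of arguments from different perspectives and to identify value conflicts and assumption gaps.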

Comments Demo paper. To appear at ACL 2026

Abstract

We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.

2606.09250 2026-06-09 cs.CV New submission

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

Institutions: University of Science and Technology of China

AI summary: Proposes LiteVSR, which builds on flow matching and a lightweight State-Aware Adapter on a frozen Diffusion Transformer to perform video super-resolution with only 11.25% trainable parameters and 12 GPU-hours of training.

Abstract

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers since the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling (down to a single step).
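
The constant-velocity property that the abstract relies on can be checked on the straight-line flow-matching path. The sketch assumes the common convention of noise at t = 0 and data at t = 1 with linear interpolation between them; the abstract does not state the convention, and the vectors and function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity(x0, x1):
    """Target velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, v, steps):
    """Integrate dx/dt = v from t = 0 to t = 1 with equal Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for _ in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 3-D "noise" and "data" samples.
noise = [0.0, 1.0, -1.0]
target = [2.0, 3.0, 1.0]
v = velocity(noise, target)

midpoint = interpolate(noise, target, 0.5)
one_step = euler_sample(noise, v, steps=1)
four_steps = euler_sample(noise, v, steps=4)
```

Because the velocity is the same at every timestep, one Euler step and four Euler steps land on the same endpoint, which is why a constant-velocity formulation is compatible with very few sampling steps.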

2606.09249 2026-06-09 cs.CV New submission

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

Xikai Tang, Yifan Wang, Jiafan Zhuang, Li Luo, Jinming Guo, Xiaoling Xie, Jiacheng Liu, Peiwei Wei, Lihao Zhong, Xiaoli Kang, Jie Cen, Guangqiang Yin, Kunliang Qiu, Ce Zheng, Zhun Fan

Institutions: School of Information and Software Engineering, University of Electronic Science and Technology of China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong; School of Artificial Intelligence, Guangzhou City Polytechnic; Medical College, Shantou University; College of Engineering, Shantou University; Department of Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine; Shenzhen Loop Area Institute

AI summary: Proposes the MAGIS framework, which uses multi-agent collaboration, a dual-evidence constrained context, and evidence-based corrective verification to turn strabismus diagnosis from black-box generation into structured reasoning, raising the weighted F1 score from 72.0% to 91.3% on a fine-grained strabismus benchmark and substantially improving the clinical reliability of diagnostic reports.

Abstract

Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning for Interpretable Strabismus diagnosis framework. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.

2606.09248 2026-06-09 cs.CV New submission

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

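Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

Institutions: University of Science and Technology of China

AI summary: Proposes the TaRO framework, which uses Constructive Reasoning Exploration and a Temporal-Sensitivity Reward to strengthen time-aware reasoning in multimodal large language models for video temporal grounding, achieving state-of-the-art performance.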

Comments Accepted by ICML 2026

Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.

2606.09246 2026-06-09 cs.CV 新提交

SOMA: From Surface Observations to Muscle Anatomy

SOMA：从表面观察到肌肉解剖

Eduardo Alvarado, Emily Kim, Gerrit Nolte, Friedemann Runte, Mario Botsch, Marc Habermann, Christian Theobalt

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus（马克斯·普朗克信息学研究所，萨尔兰信息学园区）； TU Dortmund University（多特蒙德工业大学）

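AI summary: Proposes SOMA, a model that infers spatio-temporal muscle behavior from surface signals captured by multi-view RGB cameras, and builds the SKIM dataset; it is the first attempt to recover muscle deformations from multi-view RGB data, offering a scalable, low-cost approach to anatomical animation.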

AI中文摘要

随着对逼真虚拟人类的需求日益增长，参数化人体模型已成为现代医学、体育和娱乐应用的基石。然而，大多数这些模型固有地存在局限性：它们仅捕捉皮肤的3D表面，无法洞察产生运动的复杂生物力学结构。随着更多应用向生物力学扩展，对超越皮肤的虚拟人类模型的需求日益明显。传统的软组织模拟（如FEM）准确但不可扩展，且对于大多数常见应用而言计算成本过高。或者，现有的生物力学工具可以模拟肌肉力和激活，但不模拟外部形状的变化，限制了激活与实际可观察解剖结构之间的相关性。这激发了一个新的逆向研究问题：直接从可见的表面观测（即从皮肤，从而从姿态）恢复肌肉变形。在这项工作中，我们提出了SOMA（从表面观察到肌肉解剖），一个从使用RGB相机获得的表面信号推断时空肌肉行为的个体特定模型，以及SKIM，一个个体特定的软组织变形数据集。据我们所知，这是首次尝试从多视角RGB数据恢复肌肉变形的方法。我们展示了我们的方法如何提供解剖学基础的动画，而无需传统模拟的复杂性，从而提供可扩展且成本效益高的解决方案。数据和代码已公开。

英文摘要

With the growing demand for realistic virtual humans, parametric body models have become a cornerstone of modern medicine, sports, and entertainment applications. However, most of these models are inherently limited: they only capture the 3D surface of the skin, offering no insight into the complex bio-mechanical structures that generate motion. As more applications expand towards biomechanics, the need for virtual human models that go beyond the skin has become increasingly evident. Traditional soft-tissue simulations, such as FEM, are accurate but non-scalable and too computationally expensive for most common applications. Alternatively, existing biomechanical tools can simulate muscular forces and activations, but do not model changes in external shape, restricting how activations correlate with actual observable anatomy. This motivates a novel inverse research problem: recovering muscle deformations directly from visible surface observations - i.e., from the skin, and thus the pose. In this work, we present SOMA (from Surface Observations to Muscle Anatomy), a person-specific model that infers spatio-temporal muscle behavior from surface signals obtained using RGB cameras, and SKIM, a subject-specific soft-tissue deformation dataset. To the best of our knowledge, this is the first method that attempts to recover muscle deformations from multi-view RGB data. We show how our method provides anatomically grounded animations without the complexity of traditional simulations, leading to a scalable and cost-effective solution. Data and code are available.

2606.09245 2026-06-09 cs.CV cs.AI 新提交

Proposal Refinement for Few-Shot Object Detection

用于少样本目标检测的提议细化

Yuan Zeng, Bin Song, Jie Guo, Yuwen Chen

发表机构 * State Key Laboratory of Integrated Services Networks, Xidian University（西安电子科技大学综合业务网理论及关键技术国家重点实验室）

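AI summary: To address the uneven distribution of region proposals between base and novel classes in few-shot detection, proposes a phase-specific proposal refinement method that rebalances the proposal distribution through a refinement loss in the base training phase and a refinement branch in the fine-tuning phase, improving benchmark results by 1% to 6% without increasing inference time.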

AI中文摘要

近年来，少样本目标检测引起了广泛关注。一些优秀的算法已被提出以处理这一任务。然而，这些算法大多依赖于少样本分类的性能。与以往尝试不同，我们的工作聚焦于新类和基类之间区域提议分布不均的问题。为了缓解这种不平衡分布，我们针对不同训练阶段提出了提议细化方法。具体而言，在基类训练阶段设计了细化损失以增强模型对新类的敏感性，在微调阶段引入了细化分支作为RPN（区域提议网络）的辅助分支以生成更多新类提议。通过重新平衡提议分布，所提方法在现有基准上比基线方法提高了约1%~6%，且不增加任何推理时间。通过大量实验，我们证明了为少样本目标检测任务建立了一种新的最先进方法。

英文摘要

Few-shot object detection has gained wide attention in recent years, and several strong algorithms have been proposed for the task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the unbalanced distribution of region proposals between the novel classes and the base classes. To alleviate this imbalance, we propose a proposal refinement approach tailored to the different training phases. Specifically, a refinement loss is designed for the base training phase to enhance the model's sensitivity to novel classes, and a refinement branch is introduced as an auxiliary branch of the RPN (Region Proposal Network) to generate more novel-class proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baseline methods by roughly 1% to 6% on current benchmarks without increasing inference time. Through extensive experiments, we show that the approach establishes a new state of the art for the few-shot object detection task.

2606.09243 2026-06-09 cs.CV cs.AI 新提交

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

EgoTactile: 从自我中心视频学习日常物体的抓取压力

Yuan Zeng, Yujia Shi, Tiao Tan, Xingting Li, Yaqi Qin, Zongqing Lu, Wenming Yang, Jing-Hao Xue, Qingmin Liao

发表机构 * University of Science and Technology of China（中国科学技术大学）

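AI summary: Proposes the EgoTactile benchmark and EgoPressureDiff, a conditional diffusion framework that estimates full-hand grasp pressure from egocentric video, resolving visual-physical ambiguities and achieving robust transfer.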

Comments Accepted to ICML 2026 (spotlight)

AI中文摘要

从自我中心视频估计全手抓取压力对于沉浸式VR和机器人操作至关重要，然而密集触觉传感通常依赖侵入式硬件。现有的基于视觉的方法主要依赖平面或指尖接触，无法泛化到复杂的3D物体交互。因此，我们引入EgoTactile，一个将自我中心视频与全手压力监督配对用于多样日常物体的基准，并包含裸手迁移子集以实现对自然场景的泛化。利用该基准，我们首先建立EgoPressureFormer作为判别基线。此外，为显式处理部分观测中的不确定性，我们提出EgoPressureDiff，一个条件扩散框架，适配大规模预训练视频扩散骨干。通过将丰富的世界知识先验与物理信息特征修正层结合以注入语义约束，我们的方法有效推断合理的接触模式并解决视觉-物理歧义。大量实验表明，我们的方法在基准上取得优越性能，并具有对野外场景的鲁棒迁移能力。项目页面见https://egotactile.github.io/。

英文摘要

Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.

2606.09239 2026-06-09 cs.LG cs.HC 新提交

Orange Lab: Lowering Barriers to Data Mining through Embedded Interactive Workflows

Orange Lab：通过嵌入式交互工作流降低数据挖掘门槛

Matej Bevec, Aleš Erjavec, Vesna Tanko, Lena Trnovec, Lan Žagar, Ana Farič, Janez Demšar, Blaž Zupan

发表机构 * University of Ljubljana（卢布尔雅那大学）； Revelo d.o.o.（Revelo公司）

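AI summary: Proposes Orange Lab, a web-based environment for visual data analytics that uses a component exposition paradigm to embed machine learning workflows into arbitrary web pages, enabling dynamic interaction and data-driven storytelling and lowering the barrier to using data science.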

AI中文摘要

虽然数据分析工作流的可视化编程已成为数据科学民主化的重要工具，但此类系统仍主要局限于独立应用程序，并且对将其可视化分析解决方案过渡到交互式网络环境的支持有限。因此，数据分析管道难以共享、嵌入和适应用户面向的分析工具。我们提出了Orange Lab，一个基于Web的协作式可视化数据分析环境。其核心是，Orange Lab使用户能够从模块化组件中可视化地构建机器学习工作流，其中任何组件中的交互都会无缝地传播到整个工作流，将静态管道转变为支持探索和数据驱动叙事的动态响应系统。我们的关键贡献是组件展示，这是一种范式，允许作者将选定的工作流组件或其界面部分嵌入到任意网络上下文中，创建同步的交互式界面，同时隐藏底层工作流的复杂性。这支持开发定制化的分析视图和叙事驱动的体验，将数据分析直接集成到在线材料中。我们通过在数据素养教育中的部署来展示该方法，其中嵌入式组件引导学生动手探索机器学习概念，而无需了解底层系统，表明Orange Lab有效降低了入门门槛并支持数据科学的民主化。

英文摘要

While visual programming of data analysis workflows has become an important vehicle for the democratization of data science, such systems remain largely confined to standalone applications and offer limited support for transitioning their visual analytics solutions into interactive web environments. As a result, data analysis pipelines are difficult to share, embed, and adapt into user-facing analytical tools. We present Orange Lab, a web-based collaborative environment for visual data analytics. At its core, Orange Lab enables users to visually construct machine learning workflows from modular components, where interactions in any component propagate seamlessly through the workflow, turning static pipelines into dynamic, reactive systems that support exploration and data-driven storytelling. Our key contribution is component exposition, a paradigm that allows authors to embed selected workflow components, or parts of their interfaces, into arbitrary web contexts, creating synchronized, interactive interfaces while hiding underlying workflow complexity. This enables the development of tailored analytical views and narrative-driven experiences that integrate data analysis directly into online materials. We demonstrate the approach through deployments in data literacy education, where embedded components guide students in hands-on exploration of machine learning concepts without requiring knowledge of the underlying system, showing that Orange Lab effectively lowers barriers to entry and supports the democratization of data science.

2606.09237 2026-06-09 cs.RO cs.SY eess.SY 新提交

Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?

我们能否利用飞行时间相机的反馈来稳定倒立摆？

Anthony Czubarow, Antonio Terpin, Raffaello D'Andrea

发表机构 * Institute for Dynamic Systems and Control, ETH Zürich（苏黎世联邦理工学院动态系统与控制研究所）

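AI summary: Shows that a low-cost, low-resolution time-of-flight camera can provide sufficient feedback to balance an inverted pendulum on a cart reliably and precisely, challenging the common view that such cameras cannot be used for precise feedback control.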

2606.09236 2026-06-09 cs.RO cs.AI 新提交

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

用于模拟自主超级摩托车赛车的自定进度课程强化学习

Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto

发表机构 * University of Bologna（博洛尼亚大学）

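AI summary: Proposes a self-paced curriculum deep reinforcement learning framework that, combined with the Soft Actor-Critic (SAC) algorithm, dynamically generates progressively harder tasks to train an autonomous motorbike racing agent in a physics-accurate simulator, outperforming standard SAC.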

Comments Presented at the "1st Workshop on Generalization in Autonomous Driving: Paradigms, Practice, and Public Road Demonstrations" at ICRA 2026, Vienna. Oral+poster presentation

AI中文摘要

自主赛车通过深度强化学习取得了显著进展，主要针对四轮车辆。然而，摩托车由于需要管理平衡和倾斜角度，以及更灵敏的转向和油门控制，且重量更小，带来了更大的复杂性。在这项工作中，我们提出了一个框架，用于在VRider SBK（一个基于Unity的物理精确摩托车模拟器）中训练自主智能体进行超级摩托车赛车。我们的方法将软演员-评论家（SAC）与自定进度课程深度强化学习（SPDL）相结合，后者根据智能体的性能动态生成逐渐更具挑战性的任务，无需手动课程设计。智能体的状态空间包括扩展了倾斜角度历史的本体感受特征，以及通过赛道点的全局赛道特征。奖励信号被设计为鼓励沿赛道前进，同时惩罚针对两轮动力学的不稳定诱导行为。初步实验结果表明，SPDL在多个赛道和摩托车模型上的训练效率、圈速和驾驶稳定性方面优于单独的SAC，为基于强化学习的自主摩托车赛车建立了第一个基线。

英文摘要

Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.

2606.09234 2026-06-09 cs.SD cs.AI 新提交

End-to-End Training for Discrete Token LLM based TTS System

基于离散令牌LLM的文本转语音系统的端到端训练

Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang

发表机构 * National University of Singapore（新加坡国立大学）

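AI summary: Proposes an end-to-end framework that unifies training of the speech tokenizer, LLM, flow-matching model, and reward model, improving discrete-token TTS performance through multi-task joint optimization and reaching a new SOTA on Seed-TTS-Eval.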

AI中文摘要

最近的先进文本转语音系统通常采用级联流水线，包括语音分词器、自回归大语言模型和基于扩散的流匹配模型，这些组件独立训练。本文提出一个完全端到端的优化框架，统一了语音分词器、LLM、FM模型和额外奖励模型的训练。具体来说，我们首先通过来自FM重建、LLM下一令牌预测和RM多识别任务的多任务目标联合优化分词器。这种联合训练鼓励离散语音令牌空间捕获更适合TTS的声学和语义显著信息。然后，我们通过FM和RM的下游重建和识别进一步优化LLM，这减少了推理时的不匹配，并引导LLM生成更优的结果。实验结果表明，我们的端到端框架始终优于级联基线。在Seed-TTS-Eval基准上，我们的系统实现了0.78%和1.56%的词错误率，使用0.6B参数的LLM和0.5B参数的FM模型取得了新的SOTA结果。这些结果验证了整体端到端优化对于改进基于离散令牌的TTS系统至关重要，且训练流水线更简单。

英文摘要

Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multiple recognition tasks for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves word error rates (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.
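For readers unfamiliar with the metric quoted above, the following is a minimal sketch of word error rate as the word-level edit distance divided by the reference length. It assumes whitespace tokenization and no text normalization; the benchmark's own scoring scripts may differ in both respects.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0.78% therefore corresponds to fewer than one word error per hundred reference words, averaged over the test set.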

2606.09219 2026-06-09 cs.CV astro-ph.IM 新提交

Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline

天文图像中的半监督源检测：新基准与强基线

Longhan Feng, Zihuang Cao, Ali Luo, Yuanhao Guo, Shuilian Yao, Yixin Guo, Qi Jia, Yu Liu

发表机构 * School of Software Dalian University of Technology（大连理工大学软件学院）； National Astronomical Observatories Chinese Academy of Sciences（中国科学院国家天文台）； Research Institute of Highway Ministry of Transport（交通运输部公路科学研究院）

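AI summary: To address the challenges of source detection in astronomical images, proposes the LAMOST-DET benchmark dataset and Nova Teacher, a semi-supervised learning framework that detects dense sources under sparse annotations through source light enhancement, confidence-guided pseudo-supervision, and cross-view complementary mining, improving mAP by 4.04% and 5.22%.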

AI中文摘要

在现代观测天文学中，源检测是准确定位和识别恒星源的基石，对于恒星种群合成和宇宙学参数估计等研究至关重要。然而，天文图像的特征，包括高密度、点扩散函数效应和低信噪比，对最新的先进目标检测器提出了重大挑战。此外，由于在天文图像中标注密集、微小和暗弱的源存在显著困难，全监督检测方法几乎不实用。为了解决天文数据集的稀缺性，我们引入了一个新的综合基准（LAMOST-DET），包含18,400张天文图像和728,898个源实例。在该数据集上，我们进一步设计了一个新颖的半监督学习框架，称为Nova Teacher，能够在稀疏标注下有效检测密集源。它集成了光源增强模块、置信度引导的伪监督和跨视图互补挖掘，采用双教师范式。在LAMOST-DET上的大量实验表明，Nova Teacher在两种半监督设置下分别比之前的竞争者持续提高4.04%和5.22%的mAP。此外，我们的方法在自然图像数据集上与其他检测器竞争，验证了其在不同场景下的泛化能力。源代码可在https://github.com/AcWiz/NovaTeacher获取。

英文摘要

Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high source density, point spread function effects, and low signal-to-noise ratios, significantly challenge even the latest object detectors. Moreover, fully supervised detection methods are hardly practical because annotating dense, small, and faint sources in astronomical images is very difficult. To tackle the scarcity of astronomical datasets, we introduce a new comprehensive benchmark (LAMOST-DET), comprising 18,400 astronomical images and 728,898 source instances. On this dataset, we further devise a novel semi-supervised learning framework, coined Nova Teacher, capable of detecting dense sources effectively given sparse annotations. It integrates a source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm. Extensive experiments on LAMOST-DET show that Nova Teacher consistently improves on previous competitors by 4.04% and 5.22% mAP under two semi-supervised settings. Additionally, our method is competitive with other detectors on a natural image dataset, validating its generalization ability to various scenarios. The source code is available at https://github.com/AcWiz/NovaTeacher.

2606.09218 2026-06-09 cs.CV 新提交

Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM

全自由度运动估计的最小求解器：基于异步差分SfM

Shuo Pan, Banglei Guan, Bin Li, Zhenbao Yu, Zibin Liu, Zi Wang, Yang Shang, Qifeng Yu

发表机构 * College of Aerospace Science and Engineering, National University of Defense Technology（国防科技大学空天科学学院）； The Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation（湖南省图像测量与视觉导航重点实验室）

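AI summary: Proposes a method that estimates full-DoF egomotion directly from asynchronous optical flow by decoupling the differential epipolar constraint, jointly recovering angular and linear velocities from at least five points, and designs the first algebraic minimal 5-point solver along with an accelerated version.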

AI中文摘要

作为一种仿生智能传感器，事件相机以其高时间分辨率、低延迟和低功耗为特点，为时空信息的智能感知和视觉运动估计引入了新范式。然而，其异步数据流对传统的同步帧算法提出了重大挑战。为了解决这些挑战，本文提出了一种新颖的框架，直接从异步光流进行全自由度（DoF）自运动估计，特别针对角速度和线速度的联合恢复。我们将差分对极约束解耦为不同的角速度和线速度分量，并推导出其异步数据的公式。基于该公式，开发了一种优化算法，利用至少五个点实现全自由度自运动估计。此外，通过对旋转动力学应用一阶近似，我们将约束方程转化为多项式形式，从而得到了该公式的第一个代数最小5点求解器。为了确保高速场景下的实时性能，我们还提出了一种通过截断高阶角速度项实现的加速求解器。在合成和真实数据集上的广泛评估表明，异步方法优于传统的同步方法，特别是在对时空噪声的准确性和鲁棒性方面。我们相信，这项工作为高速机器人应用中高效且准确的连续时间运动估计奠定了关键基础。

英文摘要

As bio-inspired intelligent sensors, event cameras have introduced a new paradigm in the intelligent perception of spatiotemporal information and visual motion estimation, characterized by their high temporal resolution, low latency, and minimal power consumption. However, their asynchronous data streams present significant challenges to traditional synchronous, frame-based algorithms. To address these challenges, this paper presents a novel framework for full degree-of-freedom (DoF) egomotion estimation directly from asynchronous optical flow, specifically targeting the joint recovery of angular and linear velocities. We decouple the differential epipolar constraint into distinct angular and linear velocity components, and derive its formulation for asynchronous data. Based on this formulation, an optimization algorithm is developed that enables full-DoF egomotion estimation from at least five points. Furthermore, by applying a first-order approximation to rotational dynamics, we transform the constraint equations into a polynomial form, resulting in the first algebraic minimal 5-point solver for this formulation. To ensure real-time performance in high-speed scenarios, we additionally propose an accelerated solver achieved by truncating high-order angular velocity terms. Extensive evaluations on both synthetic and real-world datasets demonstrate that the asynchronous approach outperforms traditional synchronous methods, particularly in its accuracy and robustness to spatiotemporal noise. We believe that this work establishes a critical foundation for efficient and accurate continuous-time motion estimation in high-speed robotics applications.
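As background for the constraint mentioned above, here is a small numerical sketch of the classical (synchronous) differential epipolar constraint. It is not the asynchronous formulation or the minimal solver of the paper. It assumes a calibrated camera and the convention that a scene point in the camera frame moves as dX/dt = omega x X + v; all function names are illustrative.

```python
import numpy as np


def hat(a: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix such that hat(a) @ b equals np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])


def image_flow(X: np.ndarray, omega: np.ndarray, v: np.ndarray):
    """Normalized image point x = X / Z and its velocity, given dX/dt = omega x X + v."""
    X_dot = np.cross(omega, X) + v
    x = X / X[2]
    x_dot = (X_dot - x * X_dot[2]) / X[2]
    return x, x_dot


def epipolar_residual(x, x_dot, omega, v) -> float:
    """Residual of x_dot^T [v]x x + x^T [omega]x [v]x x, zero for consistent motion."""
    return float(x_dot @ hat(v) @ x + x @ hat(omega) @ hat(v) @ x)
```

Each point contributes one scalar equation, and the residual is homogeneous in v, so the linear velocity is recoverable only up to scale. That leaves five unknowns (three for omega, two for the direction of v), which is consistent with the five-point minimum stated in the abstract.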

2606.09215 2026-06-09 cs.RO 新提交

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

MotionWAM：迈向实时人形机器人全身操作的基础世界动作模型

Jia Zheng, Teli Ma, Yudong Fan, Zifan Wang, Shuo Yang, Junwei Liang

发表机构 * Mondo Robotics ； HKUST (GZ)（香港科技大学（广州））； HKUST（香港科技大学）

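AI summary: Proposes MotionWAM, a real-time World Action Model that uses a unified motion latent and whole-body motion tokens to drive autonomous humanoid loco-manipulation from a single camera, with a success rate more than 30% higher than VLA baselines on real-world tasks.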

AI中文摘要

世界动作模型（WAM）将视频动态先验与策略耦合，在桌面操作中表现出令人鼓舞的结果，但高维视频-动作潜变量的迭代去噪使其对于实时人形机器人全身操作来说过于缓慢。主导的分层范式加剧了这一问题，其中高层操作策略仅控制上半身，而低层控制器跟踪粗略的基础命令——将上半身和下半身置于不一致的动作空间中，并将腿部降级为保持平衡的 locomotion。我们提出MotionWAM，一种实时WAM，通过将策略条件设置为视频世界模型的中间去噪特征，从单个自我中心摄像头驱动自主人形机器人全身操作。MotionWAM用统一的运动潜变量取代了上下半身的分割，并预测全身动作令牌，在单个动作空间中联合覆盖 locomotion、躯干运动、高度调节、足部交互和手部操作。一个三阶段学习框架逐步将视频世界模型适应于自我中心视觉动态和目标人形机器人具身。在九个真实世界的Unitree G1任务上，MotionWAM实时运行，在总体成功率上比在同一演示上微调的视觉-语言-动作（VLA）基线高出30%以上，并执行解耦的上下半身策略无法达到的任务驱动足部交互。我们的结果表明，视频预训练的WAM可以从桌面操作提升到协调的、类人的人形机器人全身控制。

英文摘要

World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.

2606.09208 2026-06-09 cs.CV 新提交

Event-driven dynamic trajectories reconstruction and measurement of mechanical parameters for fragments

碎片事件驱动的动态轨迹重建与机械参数测量

Haoyang Li, Banglei Guan, Muxi Zha, Yifei Bian, Minzu Liang, Yang Shang, Qifeng Yu

发表机构 * School of Aeronautics, Northwestern Polytechnical University（航空学院，西北工业大学）； College of Aerospace Science and Engineering, National University of Defense Technology（航空科学与工程学院，国防科技大学）； College of Science, National University of Defense Technology（科学学院，国防科技大学）

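AI summary: To address the difficulty of accurately measuring the mechanical parameters of high-speed fragments amid the intense flash and smoke of a warhead detonation, proposes an event-driven method that exploits the high temporal resolution and dynamic range of event cameras and combines multiple geometric constraints with a probability model to reconstruct 3D trajectories and compute velocity and kinetic energy.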

Comments 33 pages, 11 figures

AI中文摘要

在弹头爆炸过程中，会产生高密度、高速且相互遮挡的碎片。它们的机械参数（位置、速度、动能）直接决定了弹头碎片场的杀伤力。然而，爆炸场景中的高强度闪光和烟雾严重阻碍了这些机械参数的准确获取。为应对这一挑战，本文融合实验力学方法，提出了一种事件驱动的碎片动态轨迹重建及其机械参数测量方法。作为一种新型类脑视觉传感器，事件相机提供微秒级时间分辨率和高动态范围的光照变化感知，克服了强闪光干扰下高速目标难以准确测量的难题。该方法构建了多事件相机视觉系统，采用三种几何约束：时间相关极线约束寻找潜在匹配事件点对，三焦张量线约束和局部单应约束消除误匹配。建立了综合概率模型，通过熵权法确定每个约束概率的权重，以定量过滤误匹配。通过空间线线相交和非线性优化实现3D轨迹重建。最后，基于重建轨迹计算碎片的速度和动能。该方法为弹头碎片场的机械损伤评估和战术防护设计提供了可靠的技术支持。

英文摘要

During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mechanical parameters (position, velocity, kinetic energy) directly determine the lethality of the warhead fragment field. However, high-intensity flash and smoke in detonation scenarios severely hinder the accurate acquisition of these mechanical parameters. To address this challenge, this paper integrates experimental mechanics approaches and presents an event-driven method for reconstructing the dynamic trajectories of fragments and measuring their mechanical parameters. As a novel brain-inspired visual sensor, the event camera offers microsecond-level temporal resolution and high-dynamic-range perception of lighting changes, overcoming the difficulty of accurately measuring high-speed targets under strong flash interference. The method constructs a multi-event-camera vision system and adopts three geometric constraints: a time-correlated epipolar constraint to find potential matching event point pairs, plus a trifocal tensor line constraint and a local homography constraint to eliminate mismatches. A comprehensive probability model is established, with the entropy weight method determining the weight of each constraint's probability, to quantitatively filter mismatches. 3D trajectory reconstruction is achieved via spatial line-line intersection and nonlinear optimization. Finally, the velocity and kinetic energy of the fragments are calculated based on the reconstructed trajectory. This method provides reliable technical support for the mechanical damage evaluation of warhead fragment fields and for tactical protection design.
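The entropy weight method named above is a standard weighting scheme, and a minimal sketch of its textbook form follows. It assumes a matrix whose rows are candidate matches and whose columns are the per-constraint probabilities; how the paper builds those probabilities and applies the weights is not stated in the abstract, so the layout here is an assumption.

```python
import numpy as np


def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """Textbook entropy weight method: columns that vary more get larger weights."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if n < 2 or np.any(scores < 0) or np.any(scores.sum(axis=0) <= 0):
        raise ValueError("need >= 2 rows, non-negative entries, positive column sums")
    p = scores / scores.sum(axis=0)  # column-wise proportions
    # Treat 0 * log(0) as 0 without evaluating log(0).
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    entropy = -plogp.sum(axis=0) / np.log(n)  # normalized to [0, 1]
    divergence = np.clip(1.0 - entropy, 0.0, None)
    if divergence.sum() == 0:
        raise ValueError("all columns are uniform; weights are undefined")
    return divergence / divergence.sum()
```

A combined score per candidate could then be formed as `scores @ entropy_weights(scores)`, so that a constraint which assigns nearly the same probability to every candidate contributes little to the final filtering decision.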

2606.09204 2026-06-09 cs.LG cs.CL cs.CR 新提交

The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

注入悖论：通过RAG上下文注入在安全训练的LLM推荐中实现品牌级压制

Hyunseok Paeng

发表机构 * University of California, Berkeley（加州大学伯克利分校）

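AI summary: Finds that in RAG-based LLM recommendation, safety training causes injected prompts to suppress the target brand's recommendation rate instead of raising it, revealing the risk that safety mechanisms could be exploited in reverse.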

Comments 16 pages, 1 figure, 15 tables. Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN), a non-archival venue

AI中文摘要

我们提出了一种在基于RAG的LLM推荐中安全训练的可复现失败模式——注入悖论，其中嵌入在检索文档中的提示注入反而对攻击者不利，将目标品牌压制到低于无注入基线的水平。在安全训练的Claude模型中，包含提示注入的文档推荐率急剧下降，且这种压制会传播到同一品牌的其他未修改文档。在Claude Opus 4.6中，目标品牌从54%的基线降至所有50次试验中零次前二推荐，尽管语料库中4个品牌文档只有1个包含注入。该方向模式在反事实实验和三个品牌中均得到复现。在测试的GPT模型中观察到相反结果，相同的注入反而增加了推荐，表明注入类上下文影响推荐行为的模型族差异。这些发现提出了逆向攻击场景的技术可能性，即攻击者将注入嵌入竞争对手文档，通过安全敏感模型行为压制竞争对手品牌。

英文摘要

We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.

2606.09198 2026-06-09 cs.AI 新提交

MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation

MASS：基于记忆增强社会模拟的深度社会科学研究

Yongrui Liu, Deyi Xiong

发表机构 * The International Joint Institute of Tianjin University, Fuzhou, Tianjin University, China（天津大学福州国际联合学院）； TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China（天津大学计算机科学与技术学院TJUNLP实验室）

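AI summary: Proposes the MASS paradigm, which improves the realism of social simulation through dynamic goal-path planning, a multi-disciplinary behavior dataset, and an Ebbinghaus-inspired forgetting mechanism, enhancing the insight and innovativeness of LLM-generated research; overall quality improves by 6.81% and insight by 17.19%.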

AI中文摘要

由大型语言模型（LLM）驱动的深度研究代理在自动论文写作任务中展现出非凡潜力。然而，现有系统严重依赖通过互联网和本地知识库进行文献检索与综合，导致社会科学研究缺乏洞察力和创造力。为解决这一问题，我们提出“记忆增强社会模拟（MASS）”，一种创新范式，利用高度逼真且面向研究的社会模拟来增强LLM生成研究的创造力和实证基础。具体而言，MASS集成了三个核心组件：具有多级社会规范约束的动态目标路径规划以引导模拟、用于代理记忆冷启动的多学科行为数据集，以及受艾宾浩斯曲线启发的结构化遗忘机制。这些共同确保了模拟的真实性，并为生成创新学术论文提供了坚实的实证基础。实验结果表明了我们方法的有效性，在生成整体质量上比基础LLM提高了6.81%，在洞察力上比强基线提高了17.19%。

英文摘要

Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. However, existing systems rely heavily on literature retrieval and synthesis through the internet and local knowledge bases, often resulting in social science research that lacks insight and creativity. To address this issue, we propose "Memory-Augmented Social Simulation (MASS)", an innovative paradigm that leverages highly realistic and research-oriented social simulations to enhance the creativity and empirical grounding of LLM-generated research. Specifically, MASS integrates three core components: dynamic goal-path planning with multi-level social norm restraint to guide the simulation, a multi-disciplinary behavior dataset for agent memory cold-start, and a structured forgetting mechanism inspired by the Ebbinghaus curve. Together, these ensure simulation authenticity and provide a robust empirical foundation for generating innovative scholarly papers. Experimental results demonstrate the effectiveness of our method, showing a 6.81% improvement in overall generation quality over foundation LLMs and a 17.19% gain in Insight over strong baselines.
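To make the Ebbinghaus-inspired forgetting idea concrete, here is a minimal sketch that uses the common exponential form of the forgetting curve, R = exp(-t / S). The `Memory` fields, the stability parameter, and the pruning threshold are hypothetical; the abstract does not describe how MASS structures its memories or which curve parameters it uses.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    text: str
    age_hours: float       # time since the memory was last reinforced
    strength_hours: float  # hypothetical stability S; larger means slower forgetting


def retention(memory: Memory) -> float:
    """Exponential forgetting curve R = exp(-t / S), a value in (0, 1]."""
    return math.exp(-memory.age_hours / memory.strength_hours)


def forget(memories: List[Memory], threshold: float = 0.3) -> List[Memory]:
    """Keep only memories whose retention is still at or above the threshold."""
    return [m for m in memories if retention(m) >= threshold]
```

In an agent loop, reinforcing a memory would reset `age_hours` and possibly raise `strength_hours`, so frequently recalled experiences survive pruning while stale ones fade.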

2606.09195 2026-06-09 cs.CL 新提交

Symbolic and Abstractive Reasoning with Complex Visual Queries

复杂视觉查询的符号与抽象推理

Yichi Zhang, Jingdian Lu, Zhuo Chen, Lingbing Guo, Jun Xu, Wen Zhang, Huajun Chen

发表机构 * Zhejiang University（浙江大学）； Nanjing University（南京大学）； Ant Group（蚂蚁集团）

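AI summary: Introduces the concept of complex visual queries (CVQs), synthesizes a dataset from multi-modal knowledge graphs, and designs a two-stage training framework to improve the symbolic and abstractive reasoning abilities of multi-modal large language models.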

Comments Work in progress

AI中文摘要

理解和推理抽象视觉内容仍然是当前多模态大语言模型（MLLMs）面临的挑战。本文探索了一种新颖的抽象数据类型，称为复杂视觉查询（CVQ），旨在探测符号和抽象推理，这是MLLMs类人神经符号推理中关键但尚未充分探索的维度。我们从三个角度进行了全面研究：数据 × 范式 × 探索。具体而言，我们提出了一种可扩展的流水线，用于合成基于大规模多模态知识图谱的CVQ，通过一阶逻辑算子的系统组合生成了一个包含14种不同查询类型的多样化数据集。我们进一步引入了一个两阶段训练框架，逐步赋予MLLMs强大的视觉推理能力。我们进行了大量实验，从多个维度严格评估MLLMs，包括在CVQ上的推理性能，以及跨任务和跨场景的泛化能力。我们相信，我们的工作为推进MLLMs的推理前沿开辟了新的视角和途径。

英文摘要

Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: Data × Paradigm × Exploration. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.

2606.09191 2026-06-09 cs.LG stat.ML 新提交

Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards

风险厌恶型多臂赌博机中汤普森采样的渐近最优性（次高斯奖励）

Joel Q. L. Chang

发表机构 * University of California, Berkeley（加州大学伯克利分校）

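AI summary: Proves that an anchor-free nonparametric Thompson Sampling algorithm achieves an instance-dependent, asymptotically optimal regret bound for risk-averse multi-armed bandits; the result holds for any continuous risk functional, requires only a continuity condition, and improves on prior parametric approaches.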

Comments 10 pages, 4 figures

AI中文摘要

我们证明 $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$，一种用于风险厌恶型多臂赌博机的无锚非参数汤普森采样算法，其遗憾值在 $\log n$ 的主阶上匹配实例依赖下界，从而确立了它在具有有界密度和次高斯尾部（包括高斯臂）的分布类上对任意连续风险泛函 $\rho$（CVaR、均值-方差、夏普比率、扭曲风险度量等）的渐近最优性。该结果及其有界支撑版本仅要求 $\rho$ 的连续性：严格弱于先前参数汤普森采样结果的支配条件，也严格弱于UCB类算法的Lipschitz条件，从而在无参数奖励假设下首次为夏普比率等非Lipschitz泛函提供了实例最优保证。有界支撑情形作为具有相同证明结构的垫脚石首先被发展。关键技术贡献是一个离散化引理（有界支撑）和一个截断离散化引理（次高斯尾部），每个引理通过Dirichlet聚合性质将增长字母表的Dirichlet后验投影到固定网格上，保持所有多项式前因子在固定次数且独立于样本量，打破了先前证明中阻碍的超指数障碍。

英文摘要

We prove that $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$, an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits, achieves regret matching the instance-dependent lower bound to leading order in $\log n$, establishing it as asymptotically optimal for any continuous risk functional $\rho$ (CVaR, mean-variance, Sharpe ratio, distortion risk measures, and more) on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of $\rho$: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The bounded-support case is developed first as a stepping stone sharing the same proof structure. The key technical contributions are a discretisation lemma (bounded support) and a truncated discretisation lemma (sub-Gaussian tails), each projecting the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property, holding all polynomial prefactors at fixed degree independent of sample size and breaking the super-exponential barrier that blocked prior proofs.
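To illustrate the ingredients named above, the sketch below evaluates one risk functional (CVaR) on a Dirichlet-reweighted empirical reward distribution, which is the general idea behind nonparametric Thompson Sampling. It is a loose illustration only: the flat Dirichlet prior, the lower-tail CVaR convention for rewards, and the function names are assumptions, and the exact sampling scheme of the analysed algorithm is not given in the abstract.

```python
import numpy as np


def weighted_cvar(values, weights, alpha: float) -> float:
    """Mean of the worst alpha-fraction of a weighted discrete reward distribution."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    w = w / w.sum()
    # Probability mass lying strictly below each sorted value.
    mass_before = np.concatenate(([0.0], np.cumsum(w)[:-1]))
    # Mass taken from each atom so that the total taken equals alpha.
    take = np.clip(alpha - mass_before, 0.0, w)
    return float(np.dot(take, v) / alpha)


def npts_style_index(rewards, alpha: float, rng: np.random.Generator) -> float:
    """One randomized index: CVaR under Dirichlet(1, ..., 1) weights on observed rewards."""
    weights = rng.dirichlet(np.ones(len(rewards)))
    return weighted_cvar(rewards, weights, alpha)
```

A bandit loop in this style would draw one index per arm each round and pull the arm with the largest index; swapping `weighted_cvar` for another continuous functional changes the risk criterion without changing the sampling step.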

2606.09188 2026-06-09 cs.RO cs.CV New submission

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

Zhijian Xiao, Huayu Huang, Bin Li, Yang Shang, Banglei Guan

Affiliations * College of Aerospace Science and Engineering, National University of Defense Technology; Hunan Key Laboratory of Image Measurement and Visual Navigation

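AI summary: Proposes a trajectory optimization method based on the Fisher Information Matrix. A spectrally weighted objective function and an intersection angle sine term improve the observation geometry, and an improved particle swarm algorithm is used for the search, significantly reducing localization error.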

Comments 16 pages, 13 figures and 6 tables. Submitted to Measurement

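AI Chinese abstract (translated)

Bearing-only target localization is a fundamental problem in optical measurement and is widely used in UAV technology. Effective trajectory planning establishes favorable observation geometry and thereby improves the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. Using the Fisher Information Matrix (FIM), the method dynamically integrates geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally weighted FIM objective function that provides better gradient dynamics near degenerate configurations, allowing the planner to escape poor observation conditions quickly. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize the triangulation geometry by improving the line-of-sight intersection angle, which prevents the trajectories from clustering. In addition, we propose an improved particle swarm optimization algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectories and improve compatibility with the objective function. Simulation results show that, compared with conventional FIM-based methods, the proposed method reduces the median localization error by 99.21% in single-UAV scenarios and achieves a 69.70% improvement in dual-UAV configurations, showing superior performance in long-duration bearing-only localization of distant maneuvering targets.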

English abstract

Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios, achieves a 69.70% improvement for dual-UAV configurations, and exhibits superior performance in long-duration bearing-only localization of maneuvering targets at extended ranges.
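
Why observation geometry matters can be seen from the standard Fisher information matrix for 2D bearing measurements with independent Gaussian angle noise. The sketch below is that textbook model only, not the paper's spectrally weighted objective or its PSO planner; the function names, the planar setting, and the static target are assumptions made for illustration.

```python
import math


def bearing_fim(sensors, target, sigma):
    """Entries (j11, j12, j22) of the 2x2 FIM for a 2D target position.

    Each sensor contributes one bearing atan2(dy, dx) with noise std sigma (radians).
    """
    j11 = j12 = j22 = 0.0
    for sx, sy in sensors:
        dx, dy = target[0] - sx, target[1] - sy
        r2 = dx * dx + dy * dy
        # Gradient of atan2(dy, dx) with respect to the target position.
        gx, gy = -dy / r2, dx / r2
        j11 += gx * gx / sigma**2
        j12 += gx * gy / sigma**2
        j22 += gy * gy / sigma**2
    return j11, j12, j22


def fim_determinant(sensors, target, sigma):
    """Determinant of the bearing-only FIM; larger means a smaller error ellipse."""
    j11, j12, j22 = bearing_fim(sensors, target, sigma)
    return j11 * j22 - j12 * j12


target = (0.0, 0.0)
perpendicular = [(-1.0, 0.0), (0.0, -1.0)]  # lines of sight cross at 90 degrees
collinear = [(-1.0, 0.0), (-2.0, 0.0)]      # both sensors on the same line of sight
print(fim_determinant(perpendicular, target, 1.0))
print(fim_determinant(collinear, target, 1.0))
```

For two sensors at ranges $r_1$, $r_2$ whose lines of sight differ by $\Delta\theta$, this determinant equals $\sin^2(\Delta\theta) / (\sigma^4 r_1^2 r_2^2)$. It vanishes when the lines of sight are collinear and peaks at a 90 degree intersection, which is the kind of geometry an intersection angle sine term rewards.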
