arXivDaily: daily arXiv digest, updated Monday through Friday
2408.13002 2026-05-22 cs.LG

Measuring Variable Importance in Heterogeneous Treatment Effects with Confidence

Joseph Paillard, Angel Reyero Lobo, Vitaliy Kolodyazhniy, Bertrand Thirion, Denis A. Engemann

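Affiliations Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland; Université Paris-Saclay, Inria, CEA, Palaiseau, France

AI summary This paper proposes PermuCATE, an algorithm for statistically rigorous global variable importance assessment in conditional average treatment effect estimation. Theoretical analysis and empirical studies show that it has lower variance than the LOCO method, which increases statistical power and suits the limited-data regimes common in biomedical applications.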

Journal ref Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47456-47477, 2025

Abstract

Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.
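To make the contrast between the two importance measures concrete, here is a minimal Python sketch on a plain regression problem rather than on CATE estimation. It is not the PermuCATE implementation: the linear models, the synthetic data, and the helper names (fit_linear, cpi_importance, loco_importance) are all assumptions chosen for illustration. CPI perturbs a covariate by shuffling only the part not explained by the other covariates and reuses the fitted model, whereas LOCO refits the model without that covariate.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefs...]."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def cpi_importance(coef, X, y, j, rng, n_perm=50):
    """Loss increase when column j is conditionally permuted (no refit)."""
    others = np.delete(X, j, axis=1)
    # Predict X_j from the remaining covariates, then shuffle only the residuals.
    cond = fit_linear(others, X[:, j])
    x_j_hat = predict_linear(cond, others)
    resid = X[:, j] - x_j_hat
    base = mse(y, predict_linear(coef, X))
    losses = []
    for _ in range(n_perm):
        X_pert = X.copy()
        X_pert[:, j] = x_j_hat + rng.permutation(resid)
        losses.append(mse(y, predict_linear(coef, X_pert)))
    return float(np.mean(losses)) - base

def loco_importance(X_tr, y_tr, X_te, y_te, j):
    """Loss increase when the model is refit without column j."""
    full = fit_linear(X_tr, y_tr)
    reduced = fit_linear(np.delete(X_tr, j, axis=1), y_tr)
    loss_reduced = mse(y_te, predict_linear(reduced, np.delete(X_te, j, axis=1)))
    return loss_reduced - mse(y_te, predict_linear(full, X_te))

# Synthetic data (hypothetical): x0 drives y, x1 is correlated with x0, x2 is noise.
rng = np.random.default_rng(0)
n = 800
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + 0.6 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + rng.normal(size=n)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

model = fit_linear(X_tr, y_tr)
cpi = [cpi_importance(model, X_te, y_te, j, rng) for j in range(3)]
loco = [loco_importance(X_tr, y_tr, X_te, y_te, j) for j in range(3)]
print("CPI :", np.round(cpi, 3))
print("LOCO:", np.round(loco, 3))
```

Both measures should rank x0 above the pure-noise covariate x2 here. In a CATE setting the regression target would be a treatment-effect quantity rather than the raw outcome; that step is outside this sketch.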

2404.05307 2026-05-22 cs.CV cs.RO

4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

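Affiliations Center for Advanced Autonomous Sensor Systems (AASS)

AI summary This paper proposes the TMVA4D networks, which use 4D radar data for semantic segmentation of people. The models separate background from person classes using multi-view projections and reach a Dice score of 75.9% and an IoU of 61.2% in low-visibility conditions.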

Abstract

Reliable people detection is crucial for the safe autonomy of mobile robots and heavy vehicles, both on roads and in industrial settings like mining and construction. However, common sensors like cameras or lidars are prone to failure in adverse conditions such as dust, fog, or smoke, which limits their use in real-world robotic systems. Radar, on the other hand, delivers robust measurements in a wide range of environmental conditions. In particular, modern high-resolution 4D imaging radars provide 4D point clouds across range, azimuth, and elevation, as well as per-point Doppler velocity data, well suited for robot perception. We propose TMVA4D, a family of artificial neural network architectures based on CNN and ConvLSTM encoders that leverage the 4D radar modality for semantic segmentation. The architectures are trained to distinguish between background and person classes using a series of 2D projections of the 4D radar data, encompassing elevation, azimuth, range, and Doppler velocity dimensions. Evaluated across several operational sites, our models achieve promising performance (Dice 75.9%, IoU 61.2% for class person) even in low-visibility conditions. The data and code will be made publicly available upon publication.
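The two reported segmentation metrics are closely related, and a short Python sketch shows how each is computed from binary masks. The toy masks and the function names are assumptions for illustration. For a single pair of masks (or pooled pixel counts), Dice = 2 * IoU / (1 + IoU); how the paper aggregates over frames is not stated in the abstract, so the identity is only a sanity check.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice coefficient and intersection-over-union for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

def dice_from_iou(iou):
    """Identity that holds when both metrics use the same pixel counts."""
    return 2.0 * iou / (1.0 + iou)

# Toy 2x3 masks (hypothetical): 2 overlapping pixels, 4 pixels in the union.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
print(dice_from_iou(0.612))
```

Plugging the reported IoU of 61.2% into the identity gives a Dice of about 75.9%, which matches the reported Dice score.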

2308.04371 2026-05-22 cs.AI

Cumulative Reasoning with Large Language Models

Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao

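Affiliations IIIS, Tsinghua University; Shanghai Qi Zhi Institute

AI summary This paper proposes Cumulative Reasoning (CR), a framework that strengthens the problem-solving ability of large language models (LLMs) by emulating human iterative and cumulative thinking. CR decomposes tasks, generates and verifies intermediate reasoning steps, and composes a solution by building a dynamic directed acyclic graph (DAG), yielding marked gains on logical inference, the Game of 24, and math problems.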

Comments Published in Transactions on Machine Learning Research (TMLR). Project Page: https://github.com/iiis-ai/cumulative-reasoning

Abstract

Recent advancements in large language models (LLMs) have shown remarkable progress, yet their ability to solve complex problems remains limited. In this work, we introduce Cumulative Reasoning (CR), a structured framework that enhances LLM problem-solving by emulating human-like iterative and cumulative thought processes. CR orchestrates LLMs in three distinct roles: Proposer, Verifier(s), and Reporter, to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution by building a dynamic Directed Acyclic Graph (DAG) of verified propositions. This approach substantially enhances problem-solving capabilities. We demonstrate CR's advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging level 5 problems. When incorporating a code environment with CR, we further harness LLMs' reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%.

1709.03806 2026-05-22 cs.CV

Do Vision Models Encode Object-Level Semantic Relatedness? A Cognitive Psychology-Inspired Benchmark

Hansang Lee, Haeil Lee, Junmo Kim

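Affiliations Department of Computer Science, Seoul Women’s University; LG Energy Solution; School of Electrical Engineering, KAIST

AI summary Using a benchmark inspired by cognitive psychology, this paper examines whether vision models encode object-level semantic relatedness. It studies two image-only test beds and exposes representational properties beyond classification accuracy.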

Abstract

Modern vision models have achieved strong object-recognition performance, yet it remains unclear whether their representations encode object-level semantic relatedness, the meaningful connection between object concepts that supports human visual cognition. Existing benchmarks predominantly target category prediction or rely on image-text matching, leaving the visual representation itself underexamined. Drawing on cognitive psychology, we recast semantic relatedness as a triplet-ranking task and study two image-only test beds: POPORO, an existing 400-triplet psychological stimulus set repurposed for representation evaluation, and PoporoIN, a newly constructed and manually curated 1,000-triplet ImageNet-validation extension. Each triplet is annotated along two orthogonal axes: a related-target axis distinguishing Categorical Relatedness (CR, taxonomic) from conTextual Relatedness (TR, thematic), and a distractor axis distinguishing Color-matched Distractors (CD) from Shape-matched Distractors (SD). Twenty pretrained models spanning supervised, self-supervised, vision-language, and generative paradigms were evaluated by cosine similarity in an inference-only protocol. Transformer-based representations exceeded convolutional counterparts by up to 18.30 percentage points on PoporoIN at comparable ImageNet accuracy, and vision-language encoders exceeded vision-only counterparts by up to 22.50 percentage points under matched ImageNet accuracy on POPORO. Across paradigms, models recognized taxonomic targets more reliably than thematic ones and were more easily misled by shape-matched than by color-matched distractors. The benchmarks expose representational properties that classification accuracy alone does not fully predict, bridging cognitive psychology and visual representation evaluation.
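A triplet-ranking evaluation by cosine similarity can be sketched in a few lines of Python. The scoring rule below (a triplet counts as correct when the reference embedding is closer to the related target than to the distractor) is an assumption about the protocol, and the 2-D embeddings are made up; real embeddings would come from a frozen pretrained encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_accuracy(references, targets, distractors):
    """Fraction of triplets where the related target outranks the distractor."""
    correct = 0
    for r, t, d in zip(references, targets, distractors):
        if cosine(r, t) > cosine(r, d):
            correct += 1
    return correct / len(references)

# Hypothetical embeddings for three triplets; the third is built to fail.
references = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
targets = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]])
distractors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.9]])

accuracy = triplet_accuracy(references, targets, distractors)
print(accuracy)
```

Because the encoder is only used for inference, the same function can be applied unchanged to every model and to each annotated subset (for example CR versus TR triplets).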

2605.22086 2026-05-22 cs.CV

GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

Zhiqing Hong, Zelong Li, Xiubin Fan, Guang Yang, Baoshen Guo, Haotian Wang, Tian He, Desheng Zhang

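Affiliations Beijing Normal–Hong Kong Baptist University; Rutgers University; SMART Center, MIT

AI summary This paper proposes GenHAR, a framework that addresses distribution shift in cross-domain human activity recognition by learning domain-invariant sensor representations. It improves generalization to target domains and delivers efficient, accurate real-time activity detection in a real-world deployment.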

Abstract

Human Activity Recognition (HAR) has shown remarkable effectiveness in various applications, such as smart healthcare and intelligent manufacturing. However, a major challenge faced by HAR is the distribution shift across different sensor data domains, which often leads to decreased performance when deployed for real-world applications. To address this issue, this paper introduces GenHAR, a novel framework designed to mitigate the domain gap by learning domain-invariant sensor representations. GenHAR aims to enhance the generalization capabilities of HAR on target domains purely with data from the source domain. The key novelty of GenHAR lies in two aspects. Firstly, GenHAR tokenizes sensor data and learns correlations among frequency sensor channel dimensions to improve the robustness of HAR models. Secondly, GenHAR improves the efficiency via selective masking and an efficient attention mechanism. We conduct a systematic analysis of GenHAR by comparing it with state-of-the-art HAR methods on real-world human activity datasets. Results show that GenHAR outperforms state-of-the-art methods by 9.97% in accuracy, and reduces Floating Point Operations by 6.4 times. Moreover, we deploy GenHAR at a leading logistics company in 4 cities, and have detected 2.15 billion real-time activities. We release our code at: https://github.com/Sensor-FoundationModel/GenHAR.

2605.22083 2026-05-22 cs.SD cs.LG eess.AS

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee

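Affiliations Supertone Inc; Independent Researcher

AI summary This paper proposes RobustSpeechFlow, a training strategy that improves alignment robustness by introducing length-preserving repeat and skip latent augmentations. It directly penalizes realistic failure modes without external aligners or preference data and integrates readily into existing pipelines. Experiments show improved intelligibility and robustness in text-to-speech.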

Comments Submitted to INTERSPEECH 2026

Abstract

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/

2605.22081 2026-05-22 cs.CL

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

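Affiliations Northwestern University in Qatar; Carnegie Mellon University in Qatar

AI summary This paper presents ArabDiscrim, a decade-long (2014-2024) lexical resource and corpus of 293K public Arabic Facebook posts for studying racism and discrimination. It integrates platform-native engagement signals such as reactions, shares, comments, and page metadata, supporting joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) and 20 discrimination axes capturing identity-based grounds for unequal treatment, along with explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it lays a foundation for fairness-oriented, platform-aware Arabic NLP.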

Comments Accepted at LREC 2026 Main Conference

Abstract

We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014-2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.

2605.22078 2026-05-22 cs.AI cs.CV

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding

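Affiliations Tsinghua University; Shenzhen University; Xidian University

AI summary This paper proposes ST-GridPool, a training-free spatial-temporal pooling and gridding method that improves visual token representations for video large language models. It combines multi-grained spatiotemporal interaction with norm-based spatial pooling to raise performance without retraining.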

Comments Accepted by ICLR 2026

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved video understanding, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, use simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient, plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.
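The idea of pooling by token norm can be illustrated with a small NumPy sketch. This is one plausible reading, not the ST-GridPool code: it assumes non-overlapping k x k windows and keeps, in each window, the single token with the largest L2 norm. The actual NSP rule may instead weight or mix tokens, and the function name norm_based_pool is hypothetical.

```python
import numpy as np

def norm_based_pool(tokens, k):
    """Keep the highest-norm token in each non-overlapping k x k window.

    tokens: array of shape (H, W, D) holding one frame's visual tokens.
    Returns an array of shape (H // k, W // k, D).
    """
    H, W, D = tokens.shape
    assert H % k == 0 and W % k == 0, "grid must divide evenly into windows"
    # Rearrange into (H/k, W/k, k*k, D): one row of candidates per window.
    windows = tokens.reshape(H // k, k, W // k, k, D).transpose(0, 2, 1, 3, 4)
    windows = windows.reshape(H // k, W // k, k * k, D)
    norms = np.linalg.norm(windows, axis=-1)   # (H/k, W/k, k*k)
    best = norms.argmax(axis=-1)               # index of the winner per window
    rows, cols = np.indices(best.shape)
    return windows[rows, cols, best]           # (H/k, W/k, D)

# Hypothetical 4x4 grid of 2-D tokens with a few high-norm entries.
tokens = np.zeros((4, 4, 2))
tokens[0, 1] = [3.0, 4.0]
tokens[2, 2] = [0.0, 1.0]
tokens[3, 3] = [0.0, 2.0]
pooled = norm_based_pool(tokens, k=2)
print(pooled.shape)
```

Compared with average pooling, this selection keeps a high-norm token intact instead of blending it with its neighbors, and it needs no trainable parameters.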

2605.22075 2026-05-22 cs.LG q-bio.QM

Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes

Varsha Sharma, Prasanta K. Guha, Avik Ghose

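Affiliations TCS Research; Department of E&ECE, IIT Kharagpur

AI summary This study uses a non-invasive, data-driven framework based on volatile organic compounds (VOCs) and lifestyle variables to identify individuals at high risk of diabetes. It applies causal inference to estimate the effect of VOCs such as acetone, isopropanol, isoprene, and ethanol on blood glucose levels, designs a classifier to distinguish diabetics from non-diabetics, builds a risk-based ranking system, and identifies natural clusters with a Gaussian mixture model.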

Journal ref Proceedings of the IJCAI workshop on Advanced Neural Systems for Next-Generation Biomedical Intelligence, 2025

Abstract

Diabetes is a global health burden, and early detection is critical for timely intervention. This study explores a non-invasive, data-driven framework to identify individuals at risk of diabetes using Volatile Organic Compounds (VOCs) and lifestyle variables. We use causal inference techniques to estimate the impact of VOCs such as acetone, isopropanol, isoprene, and ethanol on blood glucose levels. Additionally, we designed a classifier to distinguish diabetics from non-diabetics using non-invasive markers. We created a risk-based ranking system for individuals in the "gray zone," and identified natural clusters in the population using a Gaussian Mixture Model. Our results suggest that specific VOCs exhibit a strong causal influence on glucose levels and that machine learning models can reliably classify and stratify individuals at high risk. This integrated causal-explainable analysis can support the development of tools for non-invasive early screening of diabetes.

2605.22074 2026-05-22 cs.LG cs.AI cs.CL

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang

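Affiliations LeapLab, Tsinghua University; Qiuzhen College, Tsinghua University

AI summary This study proposes the SCRL framework, which derives verifiable subproblems from reference reasoning chains to address credit assignment in LLM reasoning, improving performance on mathematical reasoning tasks.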

Abstract

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
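Subproblem-level normalization can be sketched as a per-column version of group-relative advantage computation. The Python below is an illustrative reading of the abstract, not the SCRL code: the reward matrix, the token spans, and both function names are hypothetical, and the z-score form of the normalization is an assumption. Rows are rollouts for one prompt, columns are subproblem positions, and the last column is the original problem.

```python
import numpy as np

def subproblem_advantages(rewards, eps=1e-8):
    """Normalize rewards independently at each subproblem position (column)."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)

def spread_over_spans(advantages_row, spans, n_tokens):
    """Assign each subproblem's advantage to the tokens of its answer span."""
    token_adv = np.zeros(n_tokens)
    for adv, (start, end) in zip(advantages_row, spans):
        token_adv[start:end] = adv
    return token_adv

# Four rollouts, three subproblems; no rollout solves the final (original) problem.
rewards = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
])
adv = subproblem_advantages(rewards)

# Hypothetical answer spans (token index ranges) for the first rollout.
spans = [(0, 3), (3, 5), (5, 9)]
token_adv = spread_over_spans(adv[0], spans, n_tokens=9)
print(np.round(adv, 3))
print(np.round(token_adv, 3))
```

With outcome-only rewards every rollout in this example would receive zero advantage, because the last column is all zeros. The first two columns still vary across rollouts, so they yield a non-zero learning signal, which is the effect the abstract describes as lifting hard problems out of gradient dead zones.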

2605.22072 2026-05-22 cs.CL cs.CV

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

Affiliations AMAP, Alibaba Group; University of Chinese Academy of Sciences; Tsinghua University; Nanyang Technological University

AI summary This paper proposes the Faithful-MR1 framework, which anchors and reinforces visual attention to address faithfulness in multimodal reasoning, improving model performance on multimodal benchmarks.

Comments 20 pages, 7 figures, 3 tables. Preprint

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.

2605.22068 2026-05-22 cs.CV

COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

Junhyub Lee, Seunghun Chae, Hyosu Kim

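Affiliations Chung-Ang University

AI summary This paper presents the COCOTree dataset and benchmark, which captures the long-tail distribution of complex physical assemblies through an automated generation pipeline and an open-vocabulary label space, and proposes the Open Tree Quality (OTQ) evaluation metric.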

Abstract

We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.

2605.22066 2026-05-22 cs.CV cs.AI

Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

Yanan Liu, Qinya Li, Hao Zhang, Kangjian He, Xuan Yang, Hao Li, Dan Xu, Lei Li

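Affiliations Department of Biomedical Engineering, National University of Singapore, Singapore; School of Information Science and Engineering, Yunnan University, Kunming, China

AI summary This paper proposes Echo4DIR, a framework that reconstructs 4D cardiac geometry from sparse 2D echocardiography videos via implicit reconstruction. It addresses geometric ambiguity and temporal discontinuity and achieves high clinical overlap.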

Abstract

Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentally challenged by geometric ambiguity and temporal discontinuity. To tackle these issues, we propose Echo4DIR, a novel test-time 4D implicit reconstruction framework. Specifically, we learn robust 3D shape priors from statistical shape models (SSMs) via a cardiac conditional SDF, constructing an Epipolar Mask Encoder module with epipolar cross attention to effectively fuse multi-view features. To bridge the synthetic-to-real domain gap, we introduce a self-supervised SDF-tailored differentiable rendering strategy for patient-specific 3D shape adaptation using uncalibrated clinical masks without requiring 3D ground truth. Crucially, the inherent continuity of implicit representation overcomes sparse observations, enabling anatomically reliable geometry at arbitrary resolutions. Furthermore, to empower our framework with physically continuous 4D extension, we introduce a Radial SDF Alignment strategy that strictly locks shape evolution to the predicted velocity field, fundamentally eliminating mesh drift. Extensive experiments on synthetic benchmarks and real clinical datasets demonstrate that Echo4DIR achieves state-of-the-art 4D cardiac mesh reconstruction, notably yielding an impressive clinical overlap of up to 98.35% Dice and 96.75% IoU.

2605.22061 2026-05-22 cs.CV

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou

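Affiliations School of Computer Science and Artificial Intelligence, Wuhan University of Technology

AI summary This paper proposes a multimodal distributed image compression framework (MDIC) that exploits multimodal side information to reconstruct high-quality images at extremely low bitrates. Its core components are a text-to-image diffusion decoder and a feature-mask generator, which improve global perceptual quality and the preservation of local detail.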

Comments Accepted by CVPR2026

Abstract

Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.

2605.22057 2026-05-22 cs.CL

FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing

Rongjun Li, Ziyu Zhou, Yihang Wu

Affiliations IT Innovation and Research Center, Huawei Technologies

AI summary This paper proposes FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic to improve adaptive task routing.

Comments 13 pages, 5 figures, 5 tables

Abstract

Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.
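The retrieval step mentioned above relies on BM25, which is easy to sketch in pure Python. The function below is a standard Okapi BM25 scorer with a non-negative IDF variant; it is not FlyRoute's code, and the whitespace tokenizer, the parameter defaults, and the example "success store" entries are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical past successes, one per expert agent.
success_store = [
    "reset password for build server account",
    "docker image build fails with missing layer",
    "request license for static analysis tool",
]
scores = bm25_scores("docker build fails", success_store)
best = scores.index(max(scores))
print(best, [round(s, 3) for s in scores])
```

In a router of the kind described, the top-scoring past successes would be placed in the LLM router's prompt next to each agent's learned capability description. The reported gains are internally consistent: 89.83 - 72.57 = 17.26 and 89.83 - 78.04 = 11.79 percentage points.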

2605.22055 2026-05-22 cs.LG cs.AI

Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series

基于原型的分类子任务解耦框架:提升多变量时间序列的泛化能力与可解释性

Xianhao Song, Yuang Zhang, Yuqi She, Liping Wang, Xuemin Lin

发表机构 * East China Normal University(华东师范大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出PDFTime框架,通过多阶段决策过程解耦时间序列分类任务,提升模型的泛化能力和可解释性,实现了在UEA和UCR基准测试中的最优性能。

详情
AI中文摘要

时间序列分类(TSC)是一个长期存在的研究问题,近年来随着大规模时间数据的快速增长而受到越来越多的关注。尽管深度学习带来了显著进展,但设计出既准确又可解释的TSC模型仍然是一个具有挑战性的任务。许多现有方法采用直接的特征到标签分类范式,通过单一线性投影(通常在全局池化后)将高维时间嵌入压缩为类别logits,从而将特征提取和决策逻辑混为不可分割的映射。为了解决这些限制,我们提出了PDFTime,一个基于原型的框架,将时间序列分类重新表述为多阶段决策过程。不同于直接的特征到标签映射,PDFTime利用学习到的原型来近似潜在空间中的类别条件特征分布,通过不同粒度的分类子任务实现逐步辨别。据我们所知,PDFTime是第一个将时间序列分类重新表述为解耦、多阶段、基于相似性的推理过程的框架,打破了长期以来直接、黑箱的特征到标签映射范式。广泛的评估表明,PDFTime在UEA和UCR基准测试中实现了最先进的性能。值得注意的是,它在UCR档案的128个数据集中有80个取得top-1准确率,在一致性和泛化性上显著优于近期的强基线方法。

英文摘要

Time Series Classification (TSC) is a long-standing research problem that has gained increasing attention in recent years with the rapid growth of large-scale temporal data. Despite substantial progress enabled by deep learning, designing TSC models that are both accurate and interpretable remains a challenging task. Many existing approaches adopt a direct feature-to-label classification paradigm: by collapsing high-dimensional temporal embeddings into class logits via a single linear projection (often after global pooling), this paradigm conflates feature extraction and decision logic into an inseparable mapping. To address these limitations, we propose PDFTime, a prototype-guided framework that reformulates time series classification as a multi-stage decision process. Instead of direct feature-to-label mapping, PDFTime leverages learned prototypes to approximate class-conditional feature distributions in the latent space, enabling progressive discrimination through classification sub-tasks of varying granularity. To our knowledge, PDFTime is the first framework to reformulate time series classification as a decoupled, multi-stage similarity-based reasoning process, breaking the long-standing paradigm of direct, black-box feature-to-label mapping. Extensive evaluations demonstrate that PDFTime achieves state-of-the-art (SOTA) performance across UEA and UCR benchmarks. Notably, it secures the top-1 accuracy on 80 out of 128 datasets in the UCR archive, significantly outperforming recent strong baselines in both consistency and generalization.
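As a rough illustration of prototype-based, multi-stage classification, the sketch below first picks a coarse family and then a class within it, both by cosine similarity to prototypes. It is only a toy under stated assumptions: the two-level grouping, the prototype values, and the helper names (`cosine`, `nearest`, `classify`) are hypothetical and are not taken from PDFTime, whose actual sub-task design is not described in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(embedding, prototypes):
    """Return the key whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda name: cosine(embedding, prototypes[name]))

def mean(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical learned class prototypes, grouped into two coarse families.
class_prototypes = {
    "A": {"a1": [1.0, 0.0, 0.2], "a2": [1.0, 0.0, -0.2]},
    "B": {"b1": [0.0, 1.0, 0.2], "b2": [0.0, 1.0, -0.2]},
}
# Assumption: a coarse prototype is the mean of its family's class prototypes.
group_prototypes = {g: mean(list(c.values())) for g, c in class_prototypes.items()}

def classify(embedding):
    group = nearest(embedding, group_prototypes)         # stage 1: coarse sub-task
    label = nearest(embedding, class_prototypes[group])  # stage 2: fine sub-task
    return group, label

result = classify([0.1, 0.9, -0.3])
print(result)
```

Because each stage is a similarity lookup, the decision can be inspected as "closest family, then closest class", which is one way such a decoupled design could aid interpretability.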

2605.22054 2026-05-22 cs.LG cs.AI

LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation

LABO: 通过广泛探索和选择性实验实现的LLM加速贝叶斯优化

Zhuo Chen, Xinzhe Yuan, Jianshu Zhang, Jinzong Dong, Ruichen Zhou, Yingchun Niu, Tianhang Zhou, Yu Yang Fredrik Liu, Yuqiang Li, Nanyang Ye, Qinying Gu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Mechanical Engineering, Shanghai Jiao Tong University(上海交通大学机械工程学院) Institute for Advanced Study in Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学研究所) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) School of Automation, Central South University(中南大学自动化学院) College of New Energy and Materials, China University of Petroleum, Beijing(中国石油大学(北京)新能源与材料学院) College of Carbon Neutrality Future Technology, China University of Petroleum, Beijing(中国石油大学(北京)碳中和未来技术学院) DeepVerse PTE. LTD.

AI总结 本文提出LABO框架,通过结合LLM预测与实验观测,在贝叶斯优化中实现更高效的样本优化,理论分析和实验结果表明其在科学任务中优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

科学探索中的高成本和数据稀缺性推动了将大型语言模型(LLMs)作为知识驱动组件应用于贝叶斯优化(BO)的研究。然而,现有方法通常将LLMs直接嵌入到采样或替代建模流程中,未能充分利用其显著低于现实实验的评估成本。为了解决这一限制,我们提出了LLM加速贝叶斯优化(LABO)框架,该框架在单个BO循环中结合LLM预测与实验观测。LABO采用门控标准来动态平衡对LLM预测和实际实验的依赖。通过利用低成本的LLM评估进行广泛探索搜索空间,并仅在高不确定性区域保留昂贵的现实实验,LABO实现了更高效的样本优化。我们提供了理论分析,通过累积遗憾界正式化这一效率增益。在多样化的科学任务中,实验结果表明LABO在相同实验预算下一致优于现有方法。我们的结果表明,LABO为将LLMs整合到科学发现流程中提供了一种实用且理论严谨的方法。

英文摘要

The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledge-driven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directly into the sampling or surrogate modeling pipeline, without fully leveraging their significantly lower evaluation cost compared to real-world experiments. To address this limitation, we propose LLM-Accelerated Bayesian Optimization (LABO), a framework that combines LLM predictions with experimental observations within a single BO loop. LABO employs a gating criterion to dynamically balance the reliance on LLM predictions versus actual experiments. By leveraging inexpensive LLM evaluations to broadly explore the search space and reserving costly real experiments only for regions with high uncertainty, LABO achieves more sample-efficient optimization. We provide a theoretical analysis with a cumulative regret bound that formalizes this efficiency gain. Empirical results across diverse scientific tasks demonstrate that LABO consistently outperforms existing methods under identical experimental budgets. Our results suggest that LABO offers a practical and theoretically grounded approach for integrating LLMs into scientific discovery workflows.

2605.22051 2026-05-22 cs.CV

EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

EasyVFX: 用于资源高效视觉效果生成的频率驱动解耦

Yue Ma, Xu Ye, Qinghe Wang, Yucheng Wang, Hongyu Liu, Yinhan Zhang, Xinyu Wang, Yuanpeng Che, Shanhui Mo, Paul Liang, Fangneng Zhan, Qifeng Chen

发表机构 * HKUST(香港科技大学) DUT(大连理工大学) THU(清华大学) MIT(麻省理工学院)

AI总结 本文提出EasyVFX框架,通过频率域分解解耦高频和低频成分,降低视觉效果生成的计算和数据依赖性,实现高效且高质量的视觉效果合成。

Comments Accepted by SIGGRAPH 2026. Project page: https://easy-vfx.github.io/

详情
AI中文摘要

生成高保真视觉效果(VFX)通常需要大量数据集和高昂的计算资源,因为空间纹理和时间动态之间存在复杂的耦合。本文介绍了EasyVFX,一个资源高效的框架,能够在严格约束下实现逼真的VFX合成。我们的核心理念在于频域分解:我们观察到通过解耦高频成分(代表复杂的空间外观)和低频成分(代表全局运动动态),可以显著降低VFX的复杂性。这种频域解耦将高维学习问题转化为可管理的子任务,从而降低优化障碍并减少数据依赖性。基于这一见解,我们提出了一种双阶段训练范式。首先,我们设计了一种频率感知的专家混合(Freq-MoE)架构。通过利用软路由机制,我们的模型将专门的专家分配到不同的频谱带,使它们能够培养稳健的先验知识用于外观和运动动态。这种专业化使模型能够以更少的GPU资源获取基础的VFX知识。其次,我们引入了一种由新型频率约束损失驱动的测试时训练策略。这使预训练模型能够通过局部优化快速适应特定的、未见过的效果,仅需在单个GPU上进行约100步的训练。实验结果表明,EasyVFX生成的结构一致且视觉震撼的效果,证明了频率感知学习是使专业级VFX民主化的重要催化剂。

英文摘要

Generating high-fidelity visual effects (VFX) typically demands massive datasets and prohibitive computational power due to the intricate coupling of spatial textures and temporal dynamics. In this paper, we introduce EasyVFX, a resource-efficient framework that achieves realistic VFX synthesis under stringent constraints. Our core philosophy lies in frequency-domain decomposition: we observe that the complexity of VFX can be significantly mitigated by decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics. This spectral disentanglement transforms a high-dimensional learning problem into manageable sub-tasks, thereby lowering the optimization barrier and reducing data dependency. Building upon this insight, we propose a two-stage training paradigm. First, we design a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture. By utilizing a soft routing mechanism, our model assigns specialized experts to distinct spectral bands, enabling them to cultivate robust priors for appearance and motion dynamics. This specialization allows the model to acquire foundational VFX knowledge with fewer GPU resources. Second, we introduce a Test-Time Training strategy powered by a novel Frequency-constraint Loss. This allows the pre-trained model to swiftly adapt to specific, unseen effects through localized optimizations, requiring only about 100 steps on a single GPU. Experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects, proving that frequency-aware learning is a key catalyst for democratizing professional-grade VFX.

2605.22047 2026-05-22 cs.AI

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

大语言模型在临床决策支持中的主动证据获取与诊断推理

Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu, Lu Gan

发表机构 * Tencent Youtu Lab(腾讯优图实验室) Case Western Reserve University(凯斯西储大学)

AI总结 研究探讨了大语言模型在临床决策支持中的主动证据获取与诊断推理问题,提出了一种基于OSCE的标准化患者模拟器和可控可复现的基准测试,发现多轮证据获取会降低诊断准确性并降低支持证据质量,表明静态全上下文基准可能高估交互证据获取场景中的性能,需引入互补的交互评估以提高临床决策安全性。

详情
AI中文摘要

大语言模型在静态医学考试中表现良好,但临床诊断往往需要在不确定性下迭代地收集证据。基于先前的交互式评估工作,我们引入了受OSCE启发的标准化患者模拟器,以及一个受控、可复现的主动诊断问询基准。在我们的协议下,针对468个病例和15个模型的测试表明,相对于全上下文评估,多轮证据获取使诊断准确率降低12.75%,并使支持证据质量降低24.36%;错误分析将这些下降与过早的诊断结论和低效的提问联系起来。这些结果表明,静态全上下文基准可能高估模型在交互式证据获取场景中的性能,因此需要补充交互式评估,以实现更安全的临床决策支持。

英文摘要

Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.

2605.22044 2026-05-22 cs.CV

Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin

面向心脏数字孪生的生理与解剖感知心肌梗死逆向推断

Mengxiao Wang, Yilin Lyu, Julia Camps, Ching Hui Sia, Mark Yan-Yee Chan, Yanrui Jin, Shuzhi Sam Ge, Chengliang Liu, Lei Li

发表机构 * Shanghai Jiao Tong University, Shanghai, China(上海交通大学) National University of Singapore, Singapore(新加坡国立大学) Universitat Pompeu Fabra, Barcelona, Spain(庞培法布拉大学)

AI总结 本文提出了一种基于心脏数字孪生的非侵入性心肌梗死定位方法,结合电影MRI(cine MRI)与心电图,利用生理与解剖感知网络(PAA-Net)更准确地推断梗死区域的形态和位置,从而提高逆向推断的精度和可解释性。

Comments Early-accepted by MICCAI 2026. This version corresponds to the submitted version. The final version will be available on Springer Link

详情
AI中文摘要

心肌梗死(MI)的准确定位对于风险分层至关重要。虽然LGE-MRI仍是金标准,但其资源消耗大。将电影MRI(cine MRI)与ECG结合可以更细致地表征梗死特性。现有的MI逆向推断方法忽略了真实的瘢痕形态和心脏复极化过程,降低了对ECG细微变化的敏感性,也削弱了对梗死所致电生理变化的可解释性。本文提出了一种利用心脏数字孪生进行非侵入性MI定位的新框架。为弥合仿真与现实之间的域差距,我们引入了一种解剖感知的随机梗死合成策略,用于合成带有边缘区的真实、不规则瘢痕,模拟缺血的透壁进展。随后,我们构建了一个虚拟队列来模拟QRS-T波形,同时捕捉去极化和复极化动态。此外,我们设计了生理与解剖感知网络(PAA-Net),联合编码3D心肌几何和多导联ECG,以推断具有不同位置、大小、空间范围和透壁程度的梗死区域。实验结果表明,我们的框架在逆向推断中显著优于现有方法,瘢痕和边缘区分割的Dice分数分别达到0.7391和0.5503,同时进一步提高了ECG与梗死关系的可解释性。我们的代码将在论文被接收后发布。

英文摘要

Accurate localization of myocardial infarction is essential for risk stratification. While LGE-MRI remains the gold standard, it is resource-intensive. Integrating cine MRI with ECG enables a more detailed representation of infarct properties. Existing inverse MI inference methods overlook realistic scar morphology and cardiac repolarization, reducing sensitivity to subtle ECG variations and interpretability of infarct-induced electrophysiological changes. In this paper, we propose a novel framework for noninvasive MI localization using cardiac digital twins. To bridge the domain gap between simulation and reality, we introduce an anatomy-aware stochastic infarct synthesis strategy to synthesize realistic, irregular scars with border zones, mimicking ischemic transmural progression. We then construct a virtual cohort to simulate QRS-T waveforms, capturing both depolarization and repolarization dynamics. Furthermore, we design a Physiology and Anatomy Aware Network (PAA-Net) that jointly encodes 3D myocardial geometry and multi-lead ECGs to infer infarct areas with varying localizations, sizes, spatial extents, and transmuralities. Experimental results demonstrate that our framework significantly outperforms existing methods in inverse inference, achieving Dice scores of 0.7391 and 0.5503 for scar and border zone segmentation, respectively, while further enhancing the interpretability of the ECG-infarct relationship. Our code will be released upon acceptance.
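For readers unfamiliar with the Dice score reported above, the following minimal sketch computes the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on flat binary masks. The mask values and the function name `dice_score` are illustrative assumptions; the paper's evaluation code is not available here and may differ in details such as how empty masks are handled.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy masks: 1 marks a voxel labeled as scar, 0 marks healthy tissue.
pred_scar = [1, 1, 0, 0, 1, 0]
true_scar = [1, 0, 0, 1, 1, 0]
score = dice_score(pred_scar, true_scar)
print(round(score, 4))
```

A score of 1 means the predicted and reference regions coincide exactly, while 0 means they do not overlap at all.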

2605.22043 2026-05-22 cs.LG

CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification

CASE-NET:通过因果注意力和通道重校准进行多变量时间序列分类的深度时空表示学习

Fan Zhang, Yating Cui, Hua Wang

发表机构 * Shandong Technology and Business University(山东工商学院) Ludong University(鲁东大学)

AI总结 本文提出CASE-NET,通过因果注意力和通道重校准模块,解决多变量时间序列分类中时空表示不准确的问题,实现在四个任务上达到新的最先进基准,最高准确率达98.6%。

Comments 9 pages, 6 figures, 2 tables

详情
AI中文摘要

多变量时间序列(MTS)分类是普适计算和金融分析的基础,但现有多尺度方法常受限于表示保真度不足。我们识别出两个关键瓶颈:标准编码器中的时间非因果性导致非平稳动态中的时间混淆,以及缺乏显式通道重要性机制导致噪声污染潜在空间。为解决这些挑战,我们提出因果注意力和时空编码器网络(CASE-NET),一种用于结构流形预条件的架构。CASE-NET结合了因果时间编码器,通过掩码自注意力和因果卷积强制物理时间箭头约束,以及适应性通道重校准模块,作为信息瓶颈以抑制有害噪声。在六个异质领域上的全面评估表明,CASE-NET在四个任务上建立了新的最先进基准,达到AWR数据集上的最高准确率98.6%,并在非平稳环境中表现出卓越的鲁棒性。

英文摘要

Multivariate time series (MTS) classification is foundational to pervasive computing and financial analysis, yet existing multi-scale paradigms are often constrained by suboptimal representation fidelity. We identify two critical bottlenecks: temporal non-causality in standard encoders that induces temporal confounding in non-stationary dynamics, and the absence of explicit channel saliency mechanisms that allows noise to contaminate the latent space. To address these challenges, we propose the Causal Attention and Spatio-temporal Encoder Network (CASE-NET), an architecture designed for structural manifold pre-conditioning. CASE-NET synergizes a Causal Temporal Encoder, which enforces physical arrow-of-time constraints via masked self-attention and causal convolutions, with an Adaptive Channel Recalibration module functioning as an information bottleneck to suppress detrimental noise. Comprehensive evaluations across six heterogeneous domains demonstrate that CASE-NET establishes new state-of-the-art benchmarks on four tasks, achieving a peak accuracy of 98.6% on the AWR dataset and superior robustness in non-stationary regimes.
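The arrow-of-time constraint in masked self-attention can be shown with a small sketch: each time step may only attend to itself and earlier steps. This is a generic causal-mask softmax, not CASE-NET's encoder; the function name and the toy score matrix are assumptions, and causal convolutions and channel recalibration are not covered.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]          # future positions are masked out
        m = max(visible)                # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / z for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy attention scores for three time steps (rows are queries, columns are keys).
scores = [
    [0.0, 5.0, 5.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

Large scores at future positions (such as the 9.0 in the second row) have no effect on the output, which is the point of the mask.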

2605.22036 2026-05-22 cs.CV cs.AI

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

GA-VLN: 用于高效视觉-语言导航的几何感知鸟瞰图表示

Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(人工智能安全国家重点实验室,计算技术研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Robbyant School of Computing, National University of Singapore(新加坡国立大学计算机学院) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出GA-VLN框架,通过引入几何感知的鸟瞰图表示(GA-BEV),整合显式和隐式几何信息,提升视觉-语言导航的效率和性能,实验表明其在仅使用导航数据的情况下取得了最先进的结果。

详情
AI中文摘要

尽管视觉-语言导航(VLN)领域已取得显著进展,现有方法仍依赖密集的RGB视频,产生过多的图像块token且缺乏显式的空间结构,导致计算开销大、空间推理能力有限。为了解决这些问题,我们引入了几何感知鸟瞰图(GA-BEV),一种紧凑且以3D为基础的特征表示,将显式和隐式的几何线索整合到基于多模态大语言模型(MLLM)的导航系统中。我们从RGB-D输入构建BEV空间地图:将视觉特征投影到3D空间,并聚合为以智能体为中心的布局,在保持几何一致性的同时减少token冗余。为了进一步丰富几何理解,我们将预训练3D基础模型的特征融入BEV空间,注入从大规模3D重建任务中学到的结构先验。基于深度的显式投影与隐式学习的先验这两类互补线索共同产生紧凑而空间表达能力强的表示,显著提高了导航效率和性能。实验表明,我们的方法仅使用导航数据即可取得最先进的结果,无需DAgger增强或混合VQA训练,证明了所提GA-VLN框架的鲁棒性和数据效率。

英文摘要

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.

2605.22035 2026-05-22 cs.CV cs.CL

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

HyLoVQA: 动态超网络生成低秩适应用于连续视觉问答

Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li

发表机构 * School of Computer Science, Hubei University, Wuhan 430062, China(湖北大学计算机学院,武汉430062,中国) Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China(湖北省大数据智能分析与应用重点实验室(湖北大学),武汉430062,中国) Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China(智能感知系统与安全重点实验室(湖北大学),教育部,武汉430062,中国)

AI总结 HyLoVQA通过动态超网络生成低秩适应,解决连续视觉问答中任务干扰问题,提升模型对当前任务和对象的适应能力。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

连续视觉问答(VQA)需要在非稳态的视觉输入和问题流中学习,同时保持过去知识。大多数先前方法通过更新大量共享参数集来适应,这通常导致跨层任务干扰,阻碍对当前任务和对象的准确适应。为了解决这一限制,我们提出了HyLoVQA。它维护一个具有漂移鲁棒性的锚点记忆库。该库存储视觉对象的内容和文本任务的内容,并使用当前输入特征进行更新。基于检索到的锚点,超网络生成轻量级低秩适应(LoRA)适配器。这确保了参数效率,使模型能够动态适应每个任务和对象。此外,我们提出了一个对齐损失,将特征空间中的语义差异与参数空间中的功能变化对齐,从而约束LoRA适配器保持专注于当前任务和对象。在VQA v2和NExT-QA上广泛实验表明,HyLoVQA在标准和组合设置下优于先前最先进的方法。

英文摘要

Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.

2605.22034 2026-05-22 cs.CV cs.AI

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

AgroVG:面向农业视觉定位的大规模多源基准

Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang

发表机构 * China Agricultural University(中国农业大学) Sun Yat-sen University(中山大学) Tianjin University(天津大学) Tsinghua University(清华大学) Southwest Jiaotong University(西南交通大学) National Supercomputing Center in Shenzhen(深圳国家超算中心)

AI总结 本文提出AgroVG基准,用于评估农业视觉定位(visual grounding)能力,通过多源数据集和任务专用协议,评估模型在单目标、多目标和目标缺失场景下的性能,揭示了现有模型在农业视觉定位任务中的不足。

Comments 45 pages,12 figures

详情
AI中文摘要

视觉定位(visual grounding),即根据自然语言描述定位物体的任务,是农业人工智能系统的基础能力,可支持选择性除草、病害监测和定向采收等应用。农业视觉定位的可靠评估仍具挑战性,因为农业目标往往尺寸小、重复出现、被遮挡或形状不规则,且指令可能指向图像中的一个或多个物体,也可能不指向任何物体。因此,评估该能力需要同时检验定位精度、目标集合完整性和感知目标是否存在的弃答能力。为应对这些挑战,我们提出AgroVG,一个多源基准,将农业视觉定位表述为广义集合预测:给定一张图像和一个指称表达,模型必须返回所有匹配的目标实例,或在目标不存在时弃答。AgroVG包含10,071个以标注为依据的图像-查询对,来自十个源数据集,涵盖六个目标类别:作物/杂草、水果、麦穗、害虫、植物病害和树冠。它在全部六个类别上支持边界框定位(T1),并在具有可靠实例级像素标注的数据源上支持实例掩码定位(T2),查询涵盖单目标、多目标和目标缺失三种情形。AgroVG还提供了用于框集合匹配和查询级掩码覆盖的任务专用协议。对涵盖闭源MLLM、开源VLM和专用定位系统的26种模型配置进行的零样本评估揭示了持续存在的差距:最佳多目标Set-F1仅为0.35,IoU@0.75下最佳的正例查询掩码成功率仍低于0.17。数据和代码见 https://anonymous.4open.science/r/AgroVG-5172/ 。

英文摘要

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-$F_1$ reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/ .
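To make the idea of set-level scoring with abstention concrete, here is a hedged sketch of one plausible Set-F1 computation for box sets. The greedy one-to-one matching, the IoU threshold of 0.5, and the rule that an empty prediction for an empty gold set scores 1.0 are all assumptions made for illustration; AgroVG's actual box-set matching protocol is not specified in the abstract and may differ.

```python
def box_area(r):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def set_f1(pred, gold, thr=0.5):
    """F1 over greedily matched box pairs; rewards abstaining on empty gold sets."""
    if not pred and not gold:
        return 1.0  # assumed convention: correct abstention
    if not pred or not gold:
        return 0.0  # missed all targets, or hallucinated targets
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold, tp = set(), set(), 0
    for score, i, j in pairs:
        if score < thr:
            break
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 10)]
partial = set_f1(pred_boxes, gold_boxes)
print(round(partial, 4), set_f1([], []), set_f1([(0, 0, 1, 1)], []))
```

Under these assumptions, finding one of two targets yields perfect precision but half recall, so the score penalizes incomplete target sets even when every returned box is correct.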

2605.22031 2026-05-22 cs.CV

SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

SO-Mamba:用于展开式MRI重建的状态所有权Mamba

Pengcheng Fang, Hongli Chen, Fangfang Tang, Feng Liu, Xiaohao Cai, Shanshan Shan

发表机构 * University of Southampton(南安普顿大学) University of Queensland(昆士兰大学) Soochow University(苏州大学)

AI总结 本文提出SO-Mamba,一种用于展开式MRI重建的状态所有权Mamba正则化器,将每个Mamba阶段内的重建证据分配给循环驻留、状态接口访问和非状态输出校正,以提升重建质量与效率。

详情
AI中文摘要

加速MRI重建需要在恢复缺失细节的同时,在大范围空间区域内保持解剖结构的连贯性。Mamba等状态空间模型可提供高效的长程建模,因而适合作为展开式重建中的可学习正则化器。然而,在与数据一致性耦合的展开式求解器中,不同阶段作用于不同的重建迭代结果:驻留载体应在各阶段之间保持连贯的重建内容,而依赖于阶段的非驻留证据则与当前更新相关。对这些角色一视同仁,会使持久的驻留载体证据与依赖于更新的非驻留证据进入同一条循环内容通路。为此,我们提出SO-Mamba,一种状态所有权Mamba正则化器,将每个Mamba阶段内的重建证据分配给循环驻留、状态接口访问和非状态输出校正。SO-Mamba通过状态所有权路由器(State-Ownership Router,SOR)实现这一所有权规则:SOR为循环内容构建驻留载体,并将非驻留证据路由至B/C状态接口的仿射调制以及输出校正出口。驻留载体为Mamba内容通路提供输入,而非驻留证据流则调整状态接口并经由输出出口发挥作用,不进入循环内容通路。我们还提出一种两级外频带泄漏诊断,通过测量选择性扫描状态轨迹和扫描后Mamba读出中的外频带能量,将隐藏状态存储与读出表达区分开来。在五个涵盖不同解剖部位、采样模式和线圈配置的公开MRI重建基准上的实验表明,SO-Mamba相对于基于CNN、Transformer和Mamba的基线取得了一致的提升,且计算效率具有竞争力。

英文摘要

Accelerated MRI reconstruction requires recovering missing details while preserving anatomically coherent structures across large spatial regions. State-space models such as Mamba provide efficient long-range modeling, making them attractive learned regularizers for unrolled reconstruction. However, in a data-consistency-coupled unrolled solver, different stages operate on different reconstruction iterates, where the resident carrier should preserve coherent reconstruction content across stages while stage-dependent non-resident evidence is tied to the current update. Treating these roles uniformly can place persistent resident-carrier evidence and update-dependent non-resident evidence into the same recurrent content route. We therefore propose SO-Mamba, a state-ownership Mamba regularizer that assigns reconstruction evidence within each Mamba stage to recurrent residency, state-interface access, and non-state output correction. SO-Mamba implements this ownership rule with a State-Ownership Router (SOR), which constructs a resident carrier for recurrent content and routes non-resident evidence to affine modulation of the B/C state interfaces and an output correction outlet. The resident carrier supplies the Mamba content route, while the non-resident evidence stream adapts the state interfaces and contributes through the output outlet without entering the recurrent content route. We further introduce a two-level outer-band leakage diagnostic that separates hidden-state storage from readout expression by measuring outer-band energy in the selective-scan state trajectory and the post-scan Mamba readout. Experiments on five public MRI reconstruction benchmarks spanning diverse anatomies, sampling patterns, and coil configurations show that SO-Mamba consistently improves over CNN-, Transformer-, and Mamba-based baselines with competitive computational efficiency.

2605.22021 2026-05-22 cs.RO

Industrial Dual-Arm Box Handling via Online Inertial Estimation and Convex Wrench Optimization

基于在线惯性估计和凸力旋量优化的工业双臂箱体搬运

Kenzhi Iskandar Wong, Lin Yang, Qian Ying Lee, Domenico Campolo

发表机构 * School of Mechanical and Aerospace Engineering, Nanyang Technological University(机械与航空航天工程学院,南洋理工大学)

AI总结 本文提出了一种摩擦感知的双臂箱体搬运框架,用于处理惯性特性未知的物体。该框架在线估计物体质量和质心,并在椭球形摩擦极限面约束下通过二阶锥规划计算满足摩擦可行性的接触力和扭转力矩,从而实现稳定搬运。

Comments 14 pages, submitted to Robotics and Computer-Integrated Manufacturing (RCIM) Journal

详情
AI中文摘要

工业机器人物体搬运常常涉及质量和质心事先未知的箱子和包裹。这些不确定性会影响稳定提升所需的力-力矩平衡,接触力旋量(wrench)调节不当可能导致滑动、物体掉落、姿态偏差或过度挤压。本文提出了一种面向惯性特性未知物体的摩擦感知双臂箱体搬运框架。该方法根据测得的接触力旋量在线估计物体质量和质心,并在椭球形摩擦极限面约束下,通过二阶锥规划(SOCP)计算满足摩擦可行性的接触力和扭转力矩。框架还包含一个离线轨迹细化阶段,用于在存在几何约束时减少不期望的物体-环境接触。通过将摩擦可行性作为硬约束,并在可行域内最小化接触作用力,该框架无需把避免滑动和避免过度挤压作为分别调节的目标即可实现稳定提升。在真实双臂机器人系统上针对不同质心配置的实验表明,该方法能够在保持稳定摩擦接触的同时提升惯性特性未知的物体。

英文摘要

Industrial robotic object handling often involves boxes and packages whose mass and center of mass are not known in advance. These uncertainties affect the force-moment balance required for stable lifting, and improper regulation of contact wrenches can lead to slip, object drop, orientation deviation, or excessive squeezing. This paper presents a friction-aware dual-arm box-handling framework for objects with unknown inertial properties. The proposed approach estimates the object mass and center of mass online from measured contact wrenches, and computes friction-feasible contact forces and torsional moments through a second-order cone program (SOCP) under ellipsoidal friction-limit-surface constraints. An offline trajectory refinement stage is also included to reduce undesired object-environment contact when geometric constraints are present. By enforcing friction feasibility as a hard constraint and minimizing contact effort within the feasible region, the framework achieves stable lifting without treating slip avoidance and excessive squeezing as separately tuned objectives. Experiments on a real dual-arm robotic system under different center-of-mass configurations demonstrate that the method lifts objects with unknown inertial properties while maintaining stable frictional contact.

2605.22017 2026-05-22 cs.CV

Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

多样且一致:基于能量的联合细化的上下文引导扩散用于多智能体运动预测

Lei Chu, Yuhuan Zhao

发表机构 * University of Southern California(南加州大学)

AI总结 本文提出了一种基于扩散的框架,通过利用历史轨迹中的丰富上下文信息来改进多智能体运动预测,通过引导机制增强预测动作的多样性和表达性,并引入基于能量的公式来细化联合轨迹分布,同时保持个体轨迹的合理性,实验表明该方法在多个基准数据集上均优于现有方法。

Comments MEIS-- CVPR

详情
AI中文摘要

深度生成模型能够捕捉多模态分布并表示多样化的人类行为,已成为人体运动预测中一种有前景的方法。然而,在相互作用的智能体之间生成既多样又联合一致的预测仍然具有挑战性。此外,大多数现有方法主要使用单智能体(边缘)指标进行评估,无法充分反映多智能体交互的联合动态。我们提出了一种基于扩散的框架,利用历史轨迹中丰富的上下文信息来改进多智能体运动预测。这些信息通过引导机制融入,以增强预测运动的多样性和表达力。为进一步保证交互一致性,我们引入了一种基于能量的形式化方法,在保持个体轨迹合理性的同时细化联合轨迹分布。在四个基准数据集上的大量实验表明,我们的方法始终优于现有方法。值得注意的是,在ETH/UCY上,我们的方法相对于强边缘基线显著改善了边缘(ADE/FDE)和联合(JADE/JFDE)指标;与先前的联合预测方法相比,它在保持有竞争力的联合性能的同时显著改善了边缘指标。

英文摘要

Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.
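The difference between marginal and joint metrics can be seen in a small sketch. It assumes the commonly used best-of-K definitions: the marginal minimum ADE picks the best sample separately for each agent, while the joint variant (JADE) picks a single sample for the whole scene. The function names and toy trajectories are illustrative assumptions, and the paper's exact evaluation protocol may differ; FDE and JFDE would follow the same pattern using only the final time step.

```python
import math

def ade(pred, gt):
    """Average displacement error between two trajectories of (x, y) points."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def marginal_min_ade(samples, gt):
    """Best sample chosen independently per agent, then averaged over agents."""
    n_agents = len(gt)
    return sum(min(ade(s[a], gt[a]) for s in samples) for a in range(n_agents)) / n_agents

def joint_min_ade(samples, gt):
    """One sample chosen for the whole scene: min over samples of the agent average."""
    n_agents = len(gt)
    return min(sum(ade(s[a], gt[a]) for a in range(n_agents)) / n_agents for s in samples)

# Ground truth for two agents; samples[k][a] is agent a's trajectory in joint sample k.
gt = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
samples = [
    [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 3.0), (1.0, 3.0)]],  # agent 0 exact, agent 1 off by 2
    [[(0.0, 2.0), (1.0, 2.0)], [(0.0, 1.0), (1.0, 1.0)]],  # agent 0 off by 2, agent 1 exact
]
marginal = marginal_min_ade(samples, gt)
joint = joint_min_ade(samples, gt)
print(marginal, joint)
```

In this toy case each agent has a perfect prediction in some sample, so the marginal error is zero, yet no single sample is right for both agents at once, which is exactly the gap that joint metrics expose.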

2605.22015 2026-05-22 cs.CV cs.AR

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

ORBIS: 通过分布感知匹配的输出引导标记减少以加速视频扩散

Hangyeol Lee, Joo-Young Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出ORBIS,一种针对视频扩散Transformer的SW-HW协同设计加速器,通过利用前一时间步的输出激活获得更准确的token相似性,从而提高匹配质量并实现更高的标记减少比例,同时引入分布感知标记匹配算法和专用硬件设计,实现比现有方法更高的标记减少率、更快的速度和更低的能耗。

详情
AI中文摘要

扩散Transformer(DiT)已发展为生成高质量图像和视频的强大模型架构。在视频DiT中,3D空间时间注意力使token长度与帧数成正比,显著增加计算成本。标记减少方法通过利用空间冗余来缓解这一成本,但现有方法依赖于不准确的相似性估计和轻量级匹配算法,导致匹配质量差且仅带来微小的加速效果。为克服这些限制,我们提出了ORBIS,一种为视频DiT设计的SW-HW协同加速器。ORBIS利用前一时间步的输出激活以获得更准确的token间相似性,显著提高匹配质量并实现更高的token减少比例。我们进一步引入了分布感知标记匹配(DATM)算法,该算法捕捉全局token分布并显式最小化token对损失以获得额外收益。为了完全隐藏DATM延迟,我们设计了专用、深度流水线化的硬件并通过量化来最小化其硬件成本,仅占用总面积的2.4%,且精度损失可忽略不计。大量实验表明,ORBIS的token减少比例比最先进的方法AsymRnR高约2倍,相比NVIDIA A100 GPU实现了高达4.5倍的速度提升和79.3%的能耗降低。

英文摘要

Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.

2605.22013 2026-05-22 cs.CV cs.GR cs.LG

PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

PointLLM-R: 通过链式推理增强3D点云推理

Chaoqi Chen, Qile Xu, Wenjun Zhou, Hui Huang

发表机构 * Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE), Shenzhen University, China(深圳大学计算机与软件学院可视计算研究中心)

AI总结 本文提出了一种数据驱动的框架,用于构建大规模链式推理监督,以改进3D点云理解。通过两阶段流程优化点文本指令数据,并合成高质量推理路径,构建了包含55K样本的PoCoTI数据集,训练PointLLM-R实现3D多模态语言模型的推理能力,实验表明其在生成3D分类和描述任务中达到最先进的性能。

详情
AI中文摘要

通过语言理解3D点云仍然是计算机图形学和视觉计算中的一项基本挑战,原因在于点云数据结构不规则,且现有3D多模态模型缺乏显式推理。尽管思维链(CoT)推理在LLM和基于图像的MLLM中已表现出很强的有效性,但其向3D理解的扩展在很大程度上仍未得到充分探索。本文提出了一种以数据为中心的框架,用于构建面向3D点云理解的大规模CoT监督。该框架由两阶段流程组成:首先通过基于视觉语言模型的质量评估和参考引导的细化来优化点云-文本指令数据,然后通过人在回路的提示优化(HiLPO)合成高质量的推理路径。利用这一方法,我们构建了PoCoTI,一个包含55K个带显式推理路径样本的CoT增强点云-文本指令遵循数据集。在PoCoTI上微调PointLLM得到PointLLM-R,一个具备推理能力的3D多模态语言模型。在生成式3D分类和描述任务上的大量实验表明,PointLLM-R达到了最先进的性能,并能稳健地泛化到真实世界扫描点云和多轮对话场景。

Abstract

Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.

2605.22012 2026-05-22 cs.CL cs.CV

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

Affiliations: School of AI, Shanghai Jiao Tong University; Kling Team, Kuaishou Technology; Peking University; HKUST; CASIA; Nanjing University; Renmin University of China; Tsinghua University

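AI summary: This paper proposes LatentOmni, a framework that performs cross-modal reasoning in a unified audio-visual latent space, using feature-level supervision and Omni-Sync Position Embedding (OSPE) to maintain temporal consistency. It achieves the best performance among the evaluated open-source models on multiple audio-visual reasoning benchmarks.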

Comments 21 pages, 15 figures

Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.