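arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.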

2606.07591 2026-06-18 cs.LG cs.AI cs.CL Updated version

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: Introduces ResearchClawBench, a benchmark of 40 tasks across 10 domains that evaluates autonomous research capability with multimodal rubrics. The strongest agent scores only 21.5, exposing weaknesses in experimental protocol, evidence matching, and scientific core.

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
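The weighted-criteria scoring described in the abstract can be pictured with a minimal sketch. The function name `rubric_score`, the (weight, score) layout, and the 0-100 scale are assumptions made for illustration; the benchmark's actual rubric format and aggregation may differ.

```python
def rubric_score(criteria):
    """Weighted mean of per-criterion scores.

    `criteria` is a list of (weight, score) pairs, with each score assumed
    to lie in [0, 100]. This layout is hypothetical, not the benchmark's.
    """
    total_weight = sum(weight for weight, _ in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * score for weight, score in criteria) / total_weight


# Two made-up criteria: a heavily weighted core result and a lighter check.
example = rubric_score([(2.0, 50.0), (1.0, 20.0)])
```

A weighted mean keeps a missed high-weight criterion from being hidden by several easy low-weight ones.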


2606.06361 2026-06-18 cs.CV Updated version

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

Affiliations: National Institute of Standards and Technology

AI summary: Finds that image-to-video diffusion models show better physical consistency in 2-step generation than in many-step generation, and uses spectral analysis to attribute this to phase erosion during denoising. Proposes PhaseLock, a training-free framework that extracts a motion prior from 2-step inference and enforces it on high-fidelity generation via Latent Delta Guidance, mitigating phase degradation and improving physical consistency by 6.2 points on average while preserving visual fidelity at minimal overhead.

Comments: ICML 2026

Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time). Project Page: https://dnwjddl.github.io/phaselock
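The link between motion and spectral phase follows from the Fourier shift theorem: translating a signal leaves its magnitude spectrum unchanged and alters only the phase. The sketch below illustrates that general fact on a toy 1-D signal. It is not the paper's spectral analysis, and the signal values and variable names are made up for illustration.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


frame = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = frame[-2:] + frame[:-2]  # same pattern moved two samples to the right

spec_a, spec_b = dft(frame), dft(shifted)

# Largest change in magnitude across frequency bins (numerically zero).
magnitude_gap = max(abs(abs(a) - abs(b)) for a, b in zip(spec_a, spec_b))

# Largest change in phase across bins with non-negligible energy.
phase_gap = max(
    abs(cmath.phase(b / a)) for a, b in zip(spec_a, spec_b) if abs(a) > 1e-9
)
```

Here the displacement is visible only in `phase_gap`, which is one way to see why a method concerned with motion would track phase separately from magnitude.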


2606.05883 2026-06-18 cs.CV Updated version

Geometry-Aware Dataset Condensation for Diffusion Model Training

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li


AI summary: For diffusion model training, proposes a real-subset selection method based on geometry-aware distribution alignment. It uses one-sided partial optimal transport to preserve geometric structure, adds lightweight feature-statistics and semantic-consistency regularization, and achieves efficient condensation through two-stage discrete optimization.

Comments: ICML 2026

Abstract

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, ensuring the preserved geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Codes are available at https://github.com/2018cx/GADC.


2606.05739 2026-06-18 cs.SD eess.AS Updated version

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

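Affiliations: Keio University, Japan; The University of Tokyo, Japan

AI summary: Compares speaker embeddings from more than 40 speech foundation models against human subjective similarity ratings to test whether model distances agree with human perception, and identifies the configuration factors that most affect model-human agreement.

Comments: Accepted by INTERSPEECH 2026. Camera-ready version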

2606.05409 2026-06-18 cs.CV cs.CL Updated version

Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

Affiliations: McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

AI summary: Introduces the Novel Visual References Dataset (NVRD) and compares how VLMs and humans generalize novel visual concepts, finding that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.

Abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.


2606.05368 2026-06-18 cs.CV Updated version

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

Affiliations: Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich; School of Engineering and Natural Sciences (SENS), University of Iceland; Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences

AI summary: Existing methods do not learn forest vertical structure as an ordered profile. To address this, the paper introduces Biomazon, a multimodal benchmark dataset that pairs GEDI RH and AGBD targets with multi-sensor predictors, runs ablation studies with a shared encoder-decoder framework, and establishes a reference benchmark for structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.

Comments: 32 pages, 21 figures, 8 tables

Abstract

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
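One common way to guarantee a physically consistent ordering across percentiles is to predict a base value plus non-negative increments. The sketch below shows that idea under stated assumptions: the helper names and the softplus parameterization are illustrative choices, not the Biomazon baseline.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordered_profile(raw_outputs):
    """Map unconstrained outputs to a non-decreasing height profile.

    The first value is used as the lowest percentile as-is (it may be
    negative); every later value is turned into a non-negative step.
    """
    base, *steps = raw_outputs
    profile = [base]
    for step in steps:
        profile.append(profile[-1] + softplus(step))
    return profile


# Arbitrary raw network outputs for four percentiles (made-up numbers).
profile = ordered_profile([-1.0, -3.0, 0.5, 2.0])
```

Because each step is non-negative by construction, no predicted percentile can fall below the one before it, whatever the raw outputs are.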


2606.03827 2026-06-18 cs.CV cs.AI Updated version

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

Affiliations: Centre for Computational Imaging and Modelling in Medicine (CIMIM); University of Manchester; Christabel Pankhurst Institute; Department of Computer Science; Division of Informatics, Imaging & Data Sciences; Department of Electrical & Electronic Engineering; NIHR Manchester Biomedical Research Centre, Manchester Academic Health Sciences Centre, University of Manchester

AI summary: Proposes 4D F-MeshLDM, a framework that combines a convolutional mesh VAE, a truncated-Fourier-series motion parameterization, and a conditional diffusion prior to generate controllable 3D+t cardiac mesh sequences, outperforming baselines on UK Biobank data.

Comments: This work has been early accepted by the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

Abstract

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
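A truncated Fourier series is periodic by construction, which is why a trajectory synthesized from its coefficients closes on itself after one cycle. The sketch below shows this for a single scalar latent dimension. The function name and coefficient values are hypothetical, and the paper's latent space and decoder are not modeled here.

```python
import math


def fourier_trajectory(mean, cos_coeffs, sin_coeffs, phase):
    """Evaluate a truncated Fourier series at a cycle phase in [0, 1]."""
    value = mean
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        angle = 2.0 * math.pi * k * phase
        value += a * math.cos(angle) + b * math.sin(angle)
    return value


# Made-up coefficients for two harmonics.
coeffs = (0.3, [1.0, -0.5], [0.2, 0.1])

start = fourier_trajectory(*coeffs, 0.0)
end = fourier_trajectory(*coeffs, 1.0)
closure_error = abs(end - start)  # zero up to floating-point rounding
```

Sampling the coefficients instead of the frames means every generated sequence inherits this closure property without an extra loss term.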


2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Updated version

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

Affiliations: NVIDIA

AI summary: Introduces Cosmos 3, a family of omnimodal world models built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable, general-purpose backbone for embodied agents.

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.


2606.02045 2026-06-18 cs.CV cs.AI Updated version

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa, José Salvador Rubio-Asensio, Miguel A. Zamora Izquierdo, Juan Antonio Martínez Navarro, Antonio F. Skarmeta

Affiliations: Department of Information and Communication Engineering; University of Murcia; Department of Irrigation, Centro de Edafología y Biología Aplicada del Segura CEBAS-CSIC

AI summary: Proposes a peach leaf damage classification method based on attention mechanisms and transfer learning. A CBAM-enhanced EfficientNet reaches 93.3% accuracy on a public dataset, and transfer learning yields a 93% macro F1-score on a local dataset, effectively handling domain shift.

Abstract

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions.
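The abstract reports both accuracy and macro F1-score, which can diverge when classes are imbalanced. The sketch below computes macro F1 from scratch as the unweighted mean of per-class F1 values. The labels are invented for illustration and are not the paper's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: three "healthy" leaves and one "pest" leaf.
y_true = ["healthy", "healthy", "healthy", "pest"]
y_pred = ["healthy", "healthy", "pest", "pest"]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Each class contributes equally to the mean, so a minority class with poor precision pulls macro F1 below accuracy, which is why the metric suits the minority-class focus of the study.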


2606.01711 2026-06-18 cs.CV Updated version

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

Hyeonwoo Cho, Donghyeon Baek, Yewon Kim, Bumsub Ham

Affiliations: KAIST

AI summary: Proposes RESTORE, a framework that improves visual token reduction by calibrating positional and attentional distortions, boosting multimodal LLM performance while maintaining efficiency.

Comments: Accepted to ICML 2026

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency. Project page is available at https://cvlab.yonsei.ac.kr/projects/RESTORE


2606.01697 2026-06-18 cs.CL Updated version

RCEM: Robust Conversational Search EMbedder in Distributional Shift

Kilho Son, Paul Hsu, Cha Zhang, Dinei Florencio

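Affiliations: Microsoft

AI summary: Proposes RCEM, which distills an LLM's query-rewriting ability into an embedding model, enabling context-aware retrieval without explicit rewriting and improving robustness under distribution shift.

AI Chinese abstract (translated)

Conversational search is becoming increasingly important in retrieval-augmented generation (RAG) systems, where users interact with AI assistants through multi-turn conversations containing context-dependent queries. We propose RCEM, a conversational dense retrieval model that distills an LLM's query rewriting capability into an embedding model, enabling context-aware retrieval without explicit query rewriting at inference time. Unlike prior conversational dense retrieval methods that learn direct conversation-to-document matching, RCEM aligns conversational query embeddings with rewritten query embeddings, improving robustness under distribution shift. RCEM does not require conversational query-to-document relevance mappings for training, which are typically expensive and hard to obtain at high quality. Extensive experiments on QReCC, TopiOCQA, and TREC CAsT show that RCEM consistently outperforms strong conversational retrieval baselines, with especially large gains under distribution shift, including up to a 20% improvement in Recall@10. RCEM further extends the base embedding model with conversational query rewriting capability while retaining its original retrieval functionality, allowing a single model to encode both standalone and conversational queries and search against existing document indexes without rebuilding the retrieval database.

English abstract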

We propose RCEM, a Robust Conversational search EMbedder that is additionally equipped with an LLM's query reformulation capability without losing the base model's generalization. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-passage matching, RCEM aligns conversations, prepended with a special token, to LLM-rewritten queries while preserving the original embedding space. The unchanged embedding space automatically maps the rewritten query to the relevant passages. As a result, RCEM (1) reduces overfitting by simplifying the alignment task from long passages to shorter rewritten queries, (2) eliminates the need for conversation-to-passage relevance labels for training, and (3) maintains its original embedding space, which allows conversational queries to be run against indexes built by the original embedder without rebuilding them. Extensive experiments show that RCEM consistently outperforms prior approaches, achieving up to 30% improvement under distributional shift.
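The alignment idea, pulling a conversation embedding toward the embedding of its rewritten query, can be sketched with a cosine-distance objective. The abstract does not state the actual training loss, so the choice of cosine distance, the function names, and the toy vectors below are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(conversation_embedding, rewritten_query_embedding):
    """Hypothetical objective: 0 when aligned, up to 2 when opposite."""
    return 1.0 - cosine_similarity(conversation_embedding, rewritten_query_embedding)


# Toy 2-D embeddings; the target would come from the unchanged base embedder.
aligned = alignment_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = alignment_loss([1.0, 0.0], [0.0, 1.0])
```

Since the target lives in the original embedding space, a conversation embedding that matches it can be searched against an index built by the base embedder, which is the property the abstract highlights.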


2606.01605 2026-06-18 cs.RO Updated version

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

Dawei Zhang, Nuo Chen, Shuo Liu, Roberto Tron, Zhiwen Fan

Affiliations: Division of Systems Engineering, Boston University; Department of Mechanical Engineering, Boston University; Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes an online monocular perception-to-control framework that embeds semantic risk directly into the Euclidean Signed Distance Field (ESDF), encoding risk before control optimization, to enable semantic-aware safe navigation and teleoperation based on Control Barrier Functions (CBFs).

Abstract

We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10--20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
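The effect of class-dependent inflation on a CBF safety filter can be shown in one dimension. The sketch below assumes a robot with velocity control moving along a line toward an obstacle ahead of it, and uses the barrier h = distance - margin. The margin table, function name, and gain are hypothetical; the paper's ESDF construction and controller are not reproduced.

```python
# Hypothetical class-dependent inflation radii in metres.
CLASS_MARGIN = {"chair": 0.2, "person": 0.8}


def safe_velocity(position, obstacle_position, obstacle_class,
                  nominal_velocity, alpha=1.0):
    """Filter a forward velocity command with a 1-D control barrier function.

    Assumes the obstacle lies ahead (obstacle_position > position), so the
    distance shrinks at exactly the commanded velocity.
    """
    distance = obstacle_position - position
    h = distance - CLASS_MARGIN[obstacle_class]  # barrier value after inflation
    # CBF condition dh/dt >= -alpha * h with dh/dt = -velocity
    # reduces to velocity <= alpha * h.
    return min(nominal_velocity, alpha * h)


# Same geometry, different semantics: the person forces a slower approach.
near_chair = safe_velocity(0.0, 1.0, "chair", nominal_velocity=1.0)
near_person = safe_velocity(0.0, 1.0, "person", nominal_velocity=1.0)
```

Putting the class margin inside the barrier means the semantic risk shapes the constraint itself, while `alpha` remains available as a separate knob for how sharply the controller reacts.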


2606.01249 2026-06-18 cs.LG cs.CL Updated version

Trust Region On-Policy Distillation

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

Affiliations: Samsung Research; University of Oxford; Peking University

AI summary: Proposes Trust Region On-Policy Distillation (TrOPD), which uses credit assignment strategies and trust-region learning to address the training instability caused by teacher-student distribution mismatch, outperforming existing methods on mathematical reasoning, code generation, and general-domain benchmarks.

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
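The K1 reverse-KL estimator scores each student-sampled token by the gap between student and teacher log-probabilities, and a trust region can be imitated by ignoring tokens where that gap is too large. The sketch below is a rough illustration only: the threshold rule `max_gap` is a hypothetical stand-in, since the abstract does not specify how TrOPD decides which regions are reliable.

```python
def masked_k1_reverse_kl(student_logprobs, teacher_logprobs, max_gap=2.0):
    """Mean K1 estimate of KL(student || teacher) over trusted tokens.

    Both inputs are per-token log-probabilities of tokens sampled from the
    student. Tokens whose absolute log-ratio exceeds `max_gap` are masked
    out (an assumed trust-region rule, not the paper's).
    """
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    trusted = [g for g in gaps if abs(g) <= max_gap]
    return sum(trusted) / len(trusted) if trusted else 0.0


# Made-up log-probabilities; the third token is one the teacher finds very unlikely.
student = [-1.0, -0.5, -0.2]
teacher = [-1.5, -0.4, -6.0]

estimate = masked_k1_reverse_kl(student, teacher)
unmasked = masked_k1_reverse_kl(student, teacher, max_gap=float("inf"))
```

Without the mask, the single outlier token dominates the estimate, which mirrors the instability the abstract attributes to large teacher-student mismatch.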

2606.01139 2026-06-18 cs.AI updated version

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

SkillRevise: 通过轨迹条件技能修订改进LLM撰写的智能体技能

Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang, Qing Zong, Jiahe Guo, Zhongwei Xie, Yiyan Ji, Yauwai Yim, Hongyu Luo, Xiyu Ren, Ruan Chenyu, Haoran Li, Yangqiu Song

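Affiliations: The Hong Kong University of Science and Technology; Harbin Institute of Technology; Harbin Institute of Technology, Shenzhen; Nanjing University; The University of Hong Kong

AI summary: Proposes the SkillRevise framework, which iteratively refines an initial skill through execution-evidence diagnosis, retrieval of repair principles, and execution-anchored edits. It raises the base agent's success rate on SkillsBench from 36.05% to 61.63% and shows cross-model transferability.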

Comments 15 pages, 4 figures

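AI-translated abstract (from the Chinese)

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories, but they perform poorly in cold-start settings, where only an initial, imperfect skill is available. Skill construction therefore defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not match how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidate skills and measuring empirical utility, it systematically retains the best skill version. Evaluations on three benchmarks and five LLMs show that SkillRevise substantially outperforms one-shot baselines, raising the base agent's success rate on SkillsBench from 36.05% to 61.63%. In addition, the revised skills show strong cross-model transferability, capturing general procedural knowledge beyond model-specific artifacts.

English abstract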

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates, it retains the first verifier-passing skill within the revision budget and falls back to empirical utility only when no candidate succeeds. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills transfer across both executors and task environments, suggesting that SkillRevise captures reusable procedural knowledge beyond any single executor.
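
The retention rule described in the English abstract (keep the first verifier-passing candidate within the revision budget, otherwise fall back to empirical utility) can be sketched as follows. The function name, the callback signatures, and the toy data are hypothetical; this is a reading of the abstract, not the authors' implementation.

```python
def select_skill(candidates, verifier, utility, budget):
    """Return the first verifier-passing candidate within the budget,
    otherwise the tried candidate with the highest empirical utility."""
    tried = candidates[:budget]
    if not tried:
        raise ValueError("no candidates within the revision budget")
    for skill in tried:
        if verifier(skill):
            return skill
    return max(tried, key=utility)

# Toy stand-ins: skills are version labels, the verifier accepts only one
# version, and utility comes from a made-up score table.
scores = {"v0": 0.2, "v1": 0.5, "v2": 0.9}
picked = select_skill(["v0", "v1", "v2"], lambda s: s == "v2", scores.get, budget=3)
fallback = select_skill(["v0", "v1", "v2"], lambda s: False, scores.get, budget=2)
```

In the second call no candidate passes and the budget covers only the first two versions, so the choice falls back to the better-scoring of those two.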

2606.00729 2026-06-18 cs.AI updated version

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

AI主权作为国家学习能力：基于人本学习机制视角看法国、美国与中国

Kim Phuc Tran

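Affiliations: Univ. Lille, ENSAIT, ULR 2461 – GEMTEX

AI summary: Proposes viewing national AI development as a dynamic learning system governed by a controlled balance between information injection and entropy dissipation, and argues that AI sovereignty stems from a country's capacity to regulate its own information dynamics rather than from scale alone.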

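AI-translated abstract (from the Chinese)

In France, artificial intelligence is often discussed in terms of investment, compute, regulation, employment, sovereignty, and education, and these dimensions are usually treated separately. This paper proposes a unified reading: France should be understood as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), recently formalized as a framework of entropy-regulated representation learning dynamics, we interpret national AI development as a controlled balance between information injection and entropy dissipation. Information injection corresponds to compute, data, talent, research, capital, industrial deployment, and institutional experimentation; entropy dissipation corresponds to organizational complexity, coordination friction, energy constraints, regulatory uncertainty, talent mobility pressures, and opportunities to strengthen industrial absorption. The central claim is that AI sovereignty does not stem from scale alone, but from a country's capacity to regulate its own information dynamics. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, and game theory, and argues that the French AI debate should move beyond the opposition between techno-optimism and regulation-first positions. A competitive, human-centered AI strategy requires a controlled regime in which information injection grows faster than institutional dissipation while avoiding unstable, unequal, or energy-intensive expansion. We provide a mathematical model, measurable policy indicators, game-theoretic propositions, illustrative simulations of national AI regimes, and concrete policy implications for France. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system.

English abstract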

Artificial intelligence in France is often discussed through separate dimensions such as investment, compute, regulation, employment, sovereignty, and education. This viewpoint paper proposes a unified interpretation: France can be analyzed as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), we use HCLM not as a validated econometric model, but as a conceptual and diagnostic lens for interpreting national AI development as a balance between information injection, absorptive capacity, and institutional dissipation. Information injection includes compute, data, talent, research, capital, industrial deployment, and policy experimentation. Institutional dissipation refers to avoidable frictions such as administrative overload, coordination failures, energy constraints, regulatory uncertainty, talent mobility pressures, and weak industrial absorption. Regulation is not treated as mere friction: adaptive governance, trusted data spaces, and safety-oriented standards may increase long-term learning capacity by improving legitimacy, interoperability, and social trust. The central claim is not that a country follows neural-network equations, but that AI sovereignty depends on how effectively it converts distributed information into absorbed, coordinated, and socially legitimate capability. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, absorptive capacity, and coordination mechanisms. It offers a formal heuristic, policy indicators, illustrative scenarios, and implications for France. The numerical results are diagnostic scenarios, not econometric estimates or official rankings. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system that should be tested with historical and cross-country data.

2606.00491 2026-06-18 cs.CV cs.AI updated version

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

CT分割系统的部署前鲁棒性压力测试：使用临床驱动的多损坏增强

CholMin Kanga, Jonghyun Chung, Amanpreet Kaur, Nagesh Gulkotwar, Aarthi Sivasankaran

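Affiliations: Seoul National University; Google Inc.

AI summary: Proposes the RAMP framework, which uses multi-corruption augmentation to make CT segmentation models more robust under heterogeneous clinical imaging conditions and substantially narrows the performance gap between clean and corrupted images.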

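AI-translated abstract (from the Chinese)

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition, exposing models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP raised mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. On Abdomen1K, RAMP raised mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.

English abstract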

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.
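
The reported numbers can be cross-checked with a small sketch. It assumes the robustness gap is defined as clean Dice minus mean corrupted Dice, which the abstract implies ("clean-to-corrupted") but does not state; the helper names are illustrative.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom

def implied_clean_dice(corrupted_dice, gap):
    """Clean Dice implied by gap = clean - corrupted (assumed definition)."""
    return corrupted_dice + gap

# Five-organ noisy benchmark figures quoted in the abstract.
baseline_clean = implied_clean_dice(0.610, 0.264)
ramp_clean = implied_clean_dice(0.753, 0.064)
```

Under that assumed definition the baseline's implied clean Dice is higher than RAMP's, which is consistent with the abstract's remark that RAMP did not achieve the highest clean-image Dice.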

2605.30920 2026-06-18 cs.LG updated version

Unsupervised Diffusion Solver for Combinatorial Optimization via Combinatorial Adjoint Matching

通过组合伴随匹配实现组合优化的无监督扩散求解器

Shengyu Feng, Tarun Suresh, Yiming Yang

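Affiliations: Language Technologies Institute, Carnegie Mellon University; University of Illinois Urbana-Champaign

AI summary: Proposes the Combinatorial Adjoint Matching (CAM) framework, which uses discrete adjoint dynamics and a stochastic control formulation to train discrete diffusion solvers without supervision, reaching performance competitive with supervised methods on a range of combinatorial optimization problems.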

Comments ICML26

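AI-translated abstract (from the Chinese)

Diffusion-based neural solvers have shown strong promise for combinatorial optimization (CO), but existing methods typically rely on supervised training with large collections of near-optimal solutions. In this work, we extend adjoint-based trajectory optimization methods to discrete combinatorial domains. We formulate diffusion-based CO as a stochastic control problem over continuous-time Markov chains and introduce discrete adjoint dynamics for propagating optimization signals through discrete generative trajectories. Building on this formulation, we propose Combinatorial Adjoint Matching (CAM), an unsupervised training framework for discrete diffusion solvers with structured, low-variance trajectory-level optimization signals. Empirically, CAM consistently outperforms existing unsupervised diffusion baselines across diverse combinatorial optimization problems and is competitive with strong supervised diffusion solvers and even traditional solvers. Our code is available at https://github.com/Shengyu-Feng/CAM.

English abstract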

Diffusion-based neural solvers have shown strong promise for combinatorial optimization (CO), but existing methods typically rely on supervised training with large collections of near-optimal solutions. In this work, we extend adjoint-based trajectory optimization methods to discrete combinatorial domains. We formulate diffusion-based CO as a stochastic control problem over Continuous-Time Markov Chains and introduce discrete adjoint dynamics for propagating optimization signals through discrete generative trajectories. Building on this formulation, we propose Combinatorial Adjoint Matching (CAM), an unsupervised training framework for discrete diffusion solvers with structured and low-variance trajectory-level optimization signals. Empirically, CAM consistently outperforms existing unsupervised diffusion baselines and achieves performance competitive with strong supervised diffusion solvers and even traditional solvers across diverse combinatorial optimization problems. Our code is available at https://github.com/Shengyu-Feng/CAM.

2605.30880 2026-06-18 cs.CL cs.AI updated version

PatchWorld: Gradient-Free Optimization of Executable World Models

PatchWorld：可执行世界模型的免梯度优化

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, yielding symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success in AgentGym environments.

Comments 40 pages

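AI-translated abstract (from the Chinese)

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among the evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models: improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.

English abstract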

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.
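
One-step lookahead over an executable world model can be illustrated with a toy example. Everything here is hypothetical: the belief state is a position on a line, `predict` stands in for an induced belief-state program, and the distance-to-goal score is an assumed objective. It is not taken from the PatchWorld code base.

```python
def predict(belief, action):
    """Toy executable world model: belief is the agent's position on a line."""
    moves = {"left": -1, "right": 1, "stay": 0}
    return belief + moves[action]

def one_step_lookahead(belief, actions, goal):
    """Choose the action whose predicted next belief is closest to the goal."""
    return min(actions, key=lambda a: abs(predict(belief, a) - goal))

best = one_step_lookahead(belief=2, actions=["left", "right", "stay"], goal=5)
```

Because `predict` is ordinary code, a wrong transition can be inspected and patched locally, which is the property the abstract emphasizes.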

2605.29676 2026-06-18 cs.AI cs.CL updated version

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

符号至关重要：智能体AI系统中令牌优化格式的基准研究

Lorenz Kutschka, Bernhard Geiger

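Affiliations: Know Center Research GmbH; Graz University of Technology; Graz Center for Machine Learning

AI summary: Evaluates two token-optimized formats, TOON and TRON, on four agentic benchmarks. TRON cuts tokens by up to 27% while keeping accuracy within 14 percentage points of the JSON baseline; TOON cuts up to 18% at a 9-percentage-point accuracy cost but suffers from cascading multi-turn parsing failures and collapsed parallel tool-call output.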

Comments 16 pages, 6 figures, 4 tables

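AI-translated abstract (from the Chinese)

Large language models in agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. JSON, the default language for this exchange, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14 percentage points of the JSON baseline. TOON achieves up to 18% reduction at a similar accuracy cost of 9 percentage points, but additionally cascades on multi-turn parsing failures and, for most models, collapses parallel tool-call output.

English abstract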

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
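
The source of JSON's overhead can be seen in a small sketch that compares a JSON encoding with a header-plus-rows encoding of the same records. The compact encoding below is a hypothetical stand-in, not the TOON or TRON syntax, and character counts are used only as a rough proxy for token counts.

```python
import json

# Made-up tool-result records for illustration.
records = [
    {"id": 1, "name": "search", "ok": True},
    {"id": 2, "name": "fetch", "ok": False},
]

def compact(rows):
    """Hypothetical header-plus-rows encoding; not the TOON or TRON syntax."""
    keys = list(rows[0])
    lines = [",".join(keys)]
    lines += [",".join(str(r[k]) for k in keys) for r in rows]
    return "\n".join(lines)

as_json = json.dumps(records)
as_compact = compact(records)
# Fraction of characters saved relative to JSON.
saving = 1 - len(as_compact) / len(as_json)
```

The saving comes from stating the keys once instead of repeating them, with quotes and braces, in every record. The cost is that a reader must now track column positions, which is the kind of parsing burden the paper measures end to end.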

2605.16385 2026-06-18 cs.CV cs.AI cs.CL updated version

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo：通过神经符号推理解决立体几何问题

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

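Affiliations: Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank for rigorous reasoning on solid geometry problems, reaching state-of-the-art performance on SolidFGeo2k and MathVerse-Solid.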

Comments Computer Vision and Pattern Recognition (CVPR), 2026

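AI-translated abstract (from the Chinese)

Geometric problem solving, a typical multimodal reasoning problem, has attracted wide attention and made great progress in recent years. However, most work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which has two steps: parsing, then reasoning. In the parsing step, we use a conditional description language (CDL), a formal language composed of predicates designed specifically to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we use the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which come with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading multimodal large language models such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets will be made publicly available.

English abstract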

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most works focus on plane geometry and usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2602.08355 2026-06-18 cs.CV updated version

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

E-VAds：面向多模态大语言模型的电商短视频理解基准

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

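Affiliations: Alimama Tech, Taobao & Tmall Group of Alibaba; Huazhong University of Science; Vin University

AI summary: Proposes E-VAds, a benchmark for e-commerce short video understanding. It quantifies domain complexity with a multi-modal information density assessment framework, builds a Q&A dataset generated by a multi-agent system, and develops E-VAds-R1, a reinforcement-learning-based reasoning model that achieves a 109.2% performance gain on commercial intent reasoning.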

Comments Accepted by ICML2026

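AI-translated abstract (from the Chinese)

E-commerce short videos represent a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus mainly on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation shows that, compared with mainstream datasets, e-commerce content has substantially higher density across visual, audio, and textual modalities, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark designed specifically for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. The questions are organized into two main dimensions, perception, and cognition and reasoning, comprising five distinct tasks. Finally, we develop E-VAds-R1, a reinforcement-learning-based reasoning model with a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results show that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

English abstract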

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

2605.03460 2026-06-18 cs.AI cs.LG updated version

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

FinSTaR：面向时间序列推理模型的金融推理

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

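Affiliations: LG AI Research

AI summary: To address the failure of time series reasoning models in finance, proposes FinSTaR, built on a 2x2 capability taxonomy. Using Compute-in-CoT and Scenario-Aware CoT strategies, it reaches 78.9% average accuracy on the FinTSR-Bench benchmark.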

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

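AI-translated abstract (from the Chinese)

Time series reasoning models (TSRMs) perform well in general domains but consistently fail in the financial domain, which has unique characteristics. We propose a general 2x2 capability taxonomy that partitions TSRM capabilities by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, and build the FinTSR-Bench benchmark on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with a distinct chain-of-thought (CoT) strategy for each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that lets the model derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. We further show that the four capability categories are complementary and mutually reinforcing under joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

English abstract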

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.
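
The "deterministic assessment" side of the taxonomy can be made concrete with a small sketch of quantities that are fully computable from observed prices. The two indicators, the function names, and the price series are illustrative assumptions; the abstract does not list which computations Compute-in-CoT actually performs.

```python
def simple_return(prices):
    """Return over the whole window: last price relative to the first."""
    return prices[-1] / prices[0] - 1.0

def moving_average(prices, window):
    """Mean of the most recent `window` prices."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return sum(prices[-window:]) / window

# Made-up closing prices for illustration.
prices = [100.0, 102.0, 101.0, 105.0]
window_return = simple_return(prices)
recent_mean = moving_average(prices, window=2)
```

Answers of this kind have a single correct value given the data, which is why the paper treats assessment differently from prediction, where unobserved factors make the outcome stochastic.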

2605.21528 2026-06-18 cs.LG cs.AI updated version

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

可重复的基于日志的自动机器学习框架用于医疗风险预测中的可解释流水线优化

Rui Huang, Lican Huang

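Affiliations: School of Basic Medicine, Hangzhou Normal University; Research Department, Hangzhou Domain Zones Technology Co., Ltd.

AI summary: Proposes a reproducible, log-driven AutoML framework for interpretable pipeline optimization in healthcare risk prediction. By analyzing component attribution, interactions, and redundancy, it improves model performance and stability.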

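AI-translated abstract (from the Chinese)

Accurate and reproducible disease risk prediction remains challenging because of heterogeneous features, limited samples, and severe class imbalance. This study introduces yvsoucom-iterkit, a deterministic, log-driven automated machine learning framework that models pipeline optimization as a fully reproducible configuration-level system. Each pipeline is encoded as a traceable log entity, enabling analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured and partially redundant search space in which performance is determined by a small set of interacting components. Random forest importance analysis shows that augmentation (0.454), model selection (0.198), and imbalance handling (0.101) are the key drivers on the Pima dataset, while imbalance handling dominates on Stroke (0.406). Component similarity analysis shows strong redundancy: feature-selection variants (biMax-biMean) have a low RMS distance (0.0252), mixing-based augmentation closely matches no augmentation (0.0279), and TomekLinks aligns with no imbalance handling (0.0325), whereas Gaussian noise differs more from no augmentation (0.10). With ensemble models the framework achieves strong and stable performance (Weighted-F1 0.89 and Macro-F1 0.88 on Pima; Weighted-F1 0.94 on Stroke), while Macro-F1 on Stroke is lower (0.67) because of class imbalance. Cross-seed analysis reveals a performance-robustness tradeoff, with ensembles showing lower variability than SVM. These results indicate that effective AutoML optimization can focus on a small set of high-impact components.

English abstract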

Accurate disease risk prediction is challenged by heterogeneous features, limited data, and class imbalance. This study presents yvsoucom-iterkit, a deterministic AutoML framework that models pipeline optimization as a configuration-level system with full reproducibility and traceable execution logs, enabling systematic analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured yet partially redundant search space, where performance is dominated by a small subset of interacting components. Ensemble models achieve stable performance, reaching a Weighted-F1 of 0.89 on Pima and 0.94 on Stroke. Macro-F1 reaches approximately 0.88 on Pima but drops to 0.6560 on Stroke due to severe imbalance. Cross-seed experiments show that ensembles reduce variance compared to single models. Friedman testing ($p < 0.05$) confirms significant ranking differences across configurations. Based on analysis of component attribution, interaction, and similarity, optimal configuration design reveals dataset-dependent behavior. For the Pima dataset, computational efficiency benefits from simplified search spaces where redundant components can be removed, with split ratio playing a key role. In contrast, the Stroke dataset requires enhanced imbalance-aware strategies, where RandomOverSampler improves Macro-F1 from 0.6560 to 0.6766. These findings demonstrate that effective AutoML optimization is achieved through optimal configuration design, where carefully constraining the search space to high-impact components can improve performance, stability, and interpretability while reducing unnecessary search complexity.
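
The gap between Weighted-F1 and Macro-F1 on an imbalanced dataset can be reproduced in a few lines. The confusion counts below are made up to mimic a rare positive class; they are not taken from the Stroke dataset or the paper's results.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(per_class):
    """per_class: list of (tp, fp, fn); class support is tp + fn."""
    scores = [f1(*c) for c in per_class]
    supports = [tp + fn for tp, _, fn in per_class]
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted

# Made-up counts: 950 negatives and 50 positives.
# Majority class: tp=940, fp=40, fn=10; minority class: tp=10, fp=10, fn=40.
macro, weighted = macro_and_weighted_f1([(940, 40, 10), (10, 10, 40)])
```

Macro-F1 averages the per-class scores equally, so a poorly recognized minority class pulls it down. Weighted-F1 weights by class support, so the majority class dominates. That is why the two metrics can diverge sharply under severe imbalance, as the abstract reports for Stroke.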

2605.21431 2026-06-18 cs.CV updated version

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

iTryOn: 通过空间-语义引导掌握交互式视频虚拟试穿

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

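Affiliations: Shenzhen Campus of Sun Yat-sen University; Taobao & Tmall Group of Alibaba

AI summary: Proposes the iTryOn framework, which uses spatial-semantic guidance to resolve semantic ambiguity and complex garment deformation in interactive video virtual try-on, enabling a more dynamic and controllable try-on experience.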

Comments Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

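AI-translated abstract (from the Chinese)

Video Virtual Try-On (VVT) aims to seamlessly replace the garment worn by a person in a video. Although existing methods have made significant progress in maintaining temporal consistency, they are largely confined to non-interactive scenarios in which models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new and challenging task, Interactive Video Virtual Try-On (Interactive VVT), in which subjects in the video actively interact with their clothing. The task poses challenges beyond simple texture preservation, including (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built on a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior that provides fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn uses global captions for overall context and time-stamped action captions for localized interactions, synchronized through our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments show that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a clear lead in the new interactive setting, marking an important step toward more dynamic and controllable virtual try-on experiences.

English abstract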

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

2605.21028 2026-06-18 cs.CV cs.AI updated version

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

DySink：动态帧 sinks 用于自回归长视频生成

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

Affiliations: School of Computer Science and Engineering, Southeast University; Key Lab. of Computer Network and Information Integration, Southeast University; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Institute of Automation, CAS

AI summary: Proposes DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks, improving the dynamics and temporal quality of autoregressive long video generation.

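AI-translated abstract (from the Chinese)

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has diverged substantially from them, while discarding intermediate history that may be more relevant. As a result, the retained long-range context may become poorly adapted and biased toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, in which content regresses toward the sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over the retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently outperforms strong baselines in dynamic degree while also achieving higher temporal quality. Code and model weights will be released at https://github.com/yebo0216best/DySink.

English abstract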

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.
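
Retrieval of dynamic sinks from a memory bank can be sketched as a top-k similarity search. The use of cosine similarity, the two-dimensional toy features, and the function name are assumptions for illustration; the abstract says only that "visually relevant" frames are selected and does not specify the scoring rule or the anomaly gate.

```python
import numpy as np

def retrieve_sinks(memory, query, k=2):
    """Indices of the k memory-bank frames most similar to the query feature.

    Cosine similarity is an assumed relevance score, not the paper's rule.
    """
    memory = np.asarray(memory, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k].tolist()

# Toy frame features; the last frame points away from the current state.
bank = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
sinks = retrieve_sinks(bank, query=[1.0, 0.2], k=2)
```

Unlike a static early-frame sink, the selected indices change as the query feature drifts, so the long-range anchors can follow the current visual state.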

2603.10718 2026-06-18 cs.LG updated version

Riemannian MeanFlow for One-Step Generation on Manifolds

Riemannian MeanFlow用于流形上的单步生成

Zichen Zhong, Haoliang Sun, Yukun Zhao, Yongshun Gong, Yilong Yin

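Affiliations: School of Software, Shandong University, Jinan, China

AI summary: Proposes Riemannian MeanFlow (RMF), which defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity linking average and instantaneous velocities. This enables one-step generation on manifolds, where velocities lie in location-dependent tangent spaces, with improved quality-efficiency trade-offs and lower sampling cost.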

Comments ICML 2026

Abstract

Flow Matching enables simulation-free training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, SO(3), and SE(3) demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost.
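
To make the "log-map tangent representation" concrete, the sketch below implements the standard exponential and logarithm maps on the unit sphere, one of the manifolds named in the abstract. It is only a geometric illustration: the function names are hypothetical, and treating a single tangent vector `u` as the output of a learned average-velocity model is an assumption, not the RMF training or sampling procedure.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def exp_map(x, v):
    """Exponential map on the unit sphere: move from x along tangent vector v."""
    n = norm(v)
    if n < 1e-12:
        return list(x)
    return [math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v)]


def log_map(x, y):
    """Logarithm map on the unit sphere: tangent vector at x pointing toward y."""
    c = max(-1.0, min(1.0, dot(x, y)))
    theta = math.acos(c)
    # Component of y orthogonal to x; it lies in the tangent space at x.
    w = [yi - c * xi for xi, yi in zip(x, y)]
    n = norm(w)
    if n < 1e-12:
        # y coincides with x. The antipodal case, where the log map is not
        # unique, also lands here and is not handled by this sketch.
        return [0.0] * len(x)
    return [theta * wi / n for wi in w]


x0 = [0.0, 0.0, 1.0]   # start point (north pole)
u = [0.5, 0.0, 0.0]    # assumed tangent vector at x0, standing in for a model output
x1 = exp_map(x0, u)    # a single step along the geodesic
recovered = log_map(x0, x1)
```

The round trip `log_map(x0, exp_map(x0, u))` returns `u` for tangent vectors shorter than pi, which is what lets a tangent vector at `x0` serve as a coordinate for a nearby point on the sphere.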

2605.17232 2026-06-18 cs.LG math.ST stat.ML stat.TH Updated version

Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space

Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

Affiliations: Department of Mathematics; Oden Institute School of Data Science and Society; UCLA; University of Texas at Austin; UNC Chapel Hill; Computational and Applied Sciences Group; Department of Mathematics and Statistics; SRI International; University of Massachusetts Amherst

AI summary: This paper proposes a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM), overcoming the limitations of conventional KL- and TV-based analyses on large state spaces.

Details

Abstract

Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size $S$ and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of $S$ and applicable to both masked and uniform priors. Importantly, our theory relies only on a single standard rate-matrix regularity assumption and applies to general priors. Five novel techniques drive our improvements: working in the space of observables via adjoint equations rather than directly with probability measures, a regularity analysis that yields bounds on any IPM, a coupling argument that removes $S$-dependence under uniform transitions, and score-marginal cancellation and exit-routing techniques that remove $S$-dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models, including principled choices of loss functions and dimension-free step complexity.
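
For readers unfamiliar with integral probability metrics, an IPM over a function class F is the largest gap in expectation, sup over f in F of |E_p[f] - E_q[f]|. The sketch below checks a standard special case on a small finite state space: when F is the set of functions bounded by 1 in absolute value, the supremum is attained at f = sign(p - q) and equals twice the total variation distance (using the convention TV = half the L1 distance). This illustrates only the definition, not the paper's adjoint-equation analysis; the function names are hypothetical.

```python
def expectation(f, p):
    """Expectation of f under a probability vector p on a finite state space."""
    return sum(fi * pi for fi, pi in zip(f, p))


def total_variation(p, q):
    """Total variation distance, using the half-L1 convention."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def ipm_bounded(p, q):
    """IPM over functions with |f| <= 1, evaluated at the maximizer f = sign(p - q)."""
    f = [1.0 if pi >= qi else -1.0 for pi, qi in zip(p, q)]
    return abs(expectation(f, p) - expectation(f, q))


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
tv = total_variation(p, q)
gap = ipm_bounded(p, q)
# gap agrees with 2 * tv up to floating-point rounding.
```

Other choices of F give other metrics, for example 1-Lipschitz functions give the Wasserstein-1 distance, which is why a guarantee stated for any IPM covers several familiar distances at once.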

2605.17131 2026-06-18 cs.CV cs.AI cs.LG Updated version

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

Affiliations: State University of New York at Albany

AI summary: This survey systematically examines deep learning architectures for point cloud classification and segmentation. It analyzes the structural properties of point cloud data, categorizes existing work by architecture, evaluates performance on popular benchmarks, and identifies open challenges and future directions.

Comments: We review a decade of advancements in point cloud processing: we trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

Journal ref: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

Details

DOI: 10.1145/3815180

Abstract

Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.
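
The permutation-invariant processing mentioned in the abstract can be illustrated with a minimal sketch: apply the same function to every point, then aggregate with a symmetric operation such as a channel-wise maximum. The toy weights and the function names below are assumptions for illustration and do not reproduce any specific architecture from the survey.

```python
import random


def point_feature(point, weights):
    """Shared per-point map: one ReLU unit per weight row (toy stand-in for an MLP)."""
    return [max(0.0, sum(w * c for w, c in zip(row, point))) for row in weights]


def global_feature(points, weights):
    """Symmetric aggregation: channel-wise max over all per-point features."""
    feats = [point_feature(p, weights) for p in points]
    return [max(channel) for channel in zip(*feats)]


weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0]]
cloud = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5], [0.0, 0.4, -0.2], [-0.3, 0.9, 0.7]]

# Reorder the points with a fixed seed; the set of points is unchanged.
shuffled = cloud[:]
random.Random(0).shuffle(shuffled)

same = global_feature(cloud, weights) == global_feature(shuffled, weights)
```

Because the maximum does not depend on the order of its inputs, `same` holds for any reordering of the cloud, which is the property an unordered point set requires of a classifier's global descriptor.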

2605.07022 2026-06-18 cs.LG Updated version

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner

Affiliations: Department of Computer and Information Science, University of Pennsylvania; Department of Genetics, University of Pennsylvania; Departments of Bioengineering and Chemical and Biomolecular Engineering, University of Pennsylvania

AI summary: This paper proposes automatically generating structured datasets from PubMed to obtain biomedical knowledge that is larger in scale, more nuanced, and more accurate, and shows that the Starling system produces large-scale datasets across multiple tasks with improved accuracy.

Details

Abstract

Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
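
As a toy illustration of what an entity-filtered hybrid query might look like, the sketch below first drops documents that lack the required entity tags, then ranks the rest by a weighted mix of a sparse (term-overlap) score and a dense (cosine) score. The scoring formula, the `alpha` weight, the entity labels, and all function and field names are assumptions made for this example; the abstract does not specify how Starling's retrieval combines the two signals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    d = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return d / (na * nb) if na > 0 and nb > 0 else 0.0


def sparse_score(query_terms, doc_terms):
    """Fraction of query terms present in the document (toy stand-in for a lexical ranker)."""
    q = set(query_terms)
    return len(q & set(doc_terms)) / len(q) if q else 0.0


def hybrid_search(query_terms, query_vec, required_entities, docs, alpha=0.5, k=2):
    """Rank entity-matching documents by alpha * dense + (1 - alpha) * sparse."""
    hits = []
    for doc in docs:
        if not required_entities <= doc["entities"]:
            continue  # entity filter: skip documents missing a required tag
        score = (alpha * cosine(query_vec, doc["vec"])
                 + (1 - alpha) * sparse_score(query_terms, doc["terms"]))
        hits.append((score, doc["id"]))
    hits.sort(reverse=True)
    return [doc_id for _, doc_id in hits[:k]]


docs = [
    {"id": "a", "terms": ["oral", "bioavailability", "rat"],
     "vec": [1.0, 0.0], "entities": {"CHEMICAL"}},
    {"id": "b", "terms": ["bioavailability", "fasted"],
     "vec": [0.6, 0.8], "entities": {"CHEMICAL", "SPECIES"}},
    {"id": "c", "terms": ["oral", "bioavailability"],
     "vec": [1.0, 0.0], "entities": {"GENE"}},
]
result = hybrid_search(["oral", "bioavailability"], [1.0, 0.0], {"CHEMICAL"}, docs)
```

Document `c` matches the query text and embedding well but is excluded by the entity filter, which is the kind of precision control that tagging the corpus in advance makes possible.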

2605.15824 2026-06-18 cs.CV Updated version

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

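Affiliations: Xiamen University; Alibaba Group

AI summary: This paper proposes FashionChameleon, a framework that achieves interactive multi-garment video customization from single-garment video data while preserving motion coherence. It generates video in real time at 23.8 FPS, 30-180× faster than existing methods.

Comments: Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

Details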

Abstract

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garment during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.
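
The reported throughput can be turned into more tangible numbers with a short calculation. The sketch below converts 23.8 FPS into a per-frame time budget and derives the baseline throughput implied by the 30-180× figure. It assumes the speedup is a ratio of frames per second, which the abstract does not state explicitly; the function names are hypothetical.

```python
def frame_time_ms(fps):
    """Average time available per generated frame, in milliseconds."""
    return 1000.0 / fps


def implied_baseline_fps(fps, speedup):
    """Baseline throughput implied by a speedup, assuming it is an FPS ratio."""
    return fps / speedup


fps = 23.8
budget = frame_time_ms(fps)                # roughly 42 ms per frame
fastest = implied_baseline_fps(fps, 30)    # baseline at the 30x end of the range
slowest = implied_baseline_fps(fps, 180)   # baseline at the 180x end of the range
```

Under that assumption the baselines would produce well under one frame per second, which is why they cannot support switching garments interactively during generation.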
