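arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

2026-06-19 to 2026-06-19. 11 papers indexed. Sources: cs.CV, cs.GR, cs.MM

1. Image Editing (2 papers)

2603.07236 2026-06-19 cs.CV Version update 85%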

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

Affiliations: Tencent HY Team

Topic match (Image Editing): Proposes the HY-WU framework for text-guided image editing.

AI summary: Proposes the HY-WU framework, in which a functional neural memory module generates instance-specific weight updates on the fly. This avoids the interference caused by overwriting shared weights and addresses catastrophic forgetting in continual learning and personalization.

Abstract

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.
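To illustrate the idea of synthesizing instance-specific operators from a condition while leaving shared weights untouched, here is a minimal sketch. It is not the HY-WU architecture: the linear `generator`, the helper names, and the toy 2 x 2 shapes are all hypothetical choices made only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def generate_update(generator: Matrix, condition: List[float], rows: int, cols: int) -> Matrix:
    """Map a condition vector to a rows x cols weight update (hypothetical linear generator)."""
    flat = matvec(generator, condition)  # length rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def apply_instance_operator(base: Matrix, update: Matrix, x: List[float]) -> List[float]:
    """Apply (base + update) to x without modifying the shared base weights."""
    adapted = [[b + u for b, u in zip(b_row, u_row)] for b_row, u_row in zip(base, update)]
    return matvec(adapted, x)


base = [[1.0, 0.0], [0.0, 1.0]]  # shared weights, never overwritten
# 4 x 2 generator: maps a 2-dim condition to a flattened 2 x 2 update (assumed values).
generator = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
x = [1.0, 1.0]

y_a = apply_instance_operator(base, generate_update(generator, [1.0, 0.0], 2, 2), x)
y_b = apply_instance_operator(base, generate_update(generator, [0.0, 1.0], 2, 2), x)
```

Two different conditions yield two different operators on the same input, and `base` is the same before and after both calls, which is the contrast with repeatedly overwriting a single shared parameter point.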

2602.01391 2026-06-19 cs.CV Version update 70%

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

Affiliations: UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands; The University of Chicago, Chicago, USA; Johns Hopkins University, Baltimore, USA

Topic match (Image Editing): Relighting falls within the scope of image editing.

AI summary: Proposes Augmented Latent Intrinsics (ALI), which fuses dense, pixel-aligned visual features into a latent-intrinsic relighting model to balance semantic and photometric fidelity and improve relighting quality on complex materials.

Comments: Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

Abstract

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic--photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

2. Diffusion Models (3 papers)

2601.21542 2026-06-19 cs.CV cs.AI Version update 85%

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

Affiliations: The Hong Kong University of Science and Technology

Topic match (Diffusion Models): Accelerates generative modeling with a bi-anchor interpolation solver.

AI summary: Proposes BA-solver, in which a lightweight SideNet (1-2% of the backbone size) learns bidirectional temporal perception and bi-anchor velocity integration. Without retraining the backbone and at very low training cost, it reaches the quality of a 100+ step Euler solver within 10 steps and works plug-and-play.

Abstract

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet 256×256 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
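To make the NFE budget concrete, the following is a minimal sketch of the fixed-step Euler baseline that the abstract compares against. It is not BA-solver. The names `euler_sample` and `toy_velocity` are hypothetical, and the scalar field dx/dt = -x merely stands in for a learned velocity network so that the error can be checked against a closed-form solution.

```python
import math
from typing import Callable


def euler_sample(velocity: Callable[[float, float], float], x0: float, num_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one function evaluation (NFE) per step
    return x


def toy_velocity(x: float, t: float) -> float:
    """Hypothetical stand-in for a learned velocity network: dx/dt = -x."""
    return -x


exact = math.exp(-1.0)  # closed-form solution at t = 1 for x0 = 1
err_10 = abs(euler_sample(toy_velocity, 1.0, 10) - exact)
err_100 = abs(euler_sample(toy_velocity, 1.0, 100) - exact)
```

On this toy problem the 100-step run is closer to the exact solution than the 10-step run, which is the accuracy-versus-NFE trade-off that few-step solvers try to avoid.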

2603.20455 2026-06-19 math.OC Version update 80%

Time-Reversed BSDEs for Accurate Gradient Estimation in Diffusion Models

Yuhang Mei, Amirhossein Taghvaei

Topic match (Diffusion Models): A BSDE method for gradient estimation in diffusion models.

AI summary: To address unstable gradient estimates when fine-tuning diffusion models, proposes an adapted adjoint process based on time-reversed BSDEs, lowering variance and improving stability.

Comments: 10 pages, 3 figures

Abstract

There is a growing literature adopting a stochastic optimal control (SOC) perspective to fine-tune diffusion models and related generative policies. A prominent class of methods, known as iterative diffusion optimization, solves the SOC problem by simulating the diffusion process, evaluating a loss function, and applying stochastic optimization algorithms, with adjoint matching emerging as a state-of-the-art approach. However, the adjoint process used in these methods is not adapted to the forward diffusion filtration, which can lead to unstable or high-variance gradient estimates. In this paper, we revisit gradient estimation in diffusion models through the lens of backward stochastic differential equations (BSDEs). We propose an alternative estimator based on a time-reversed BSDE formulation introduced in our prior work, which produces an adjoint process adapted to the underlying filtration. This adapted structure leads to more stable gradient estimates with potentially lower variance. We analyze the accuracy of the proposed estimator and compare it with adjoint matching. Numerical experiments on fine-tuning toy diffusion models demonstrate improved gradient stability and competitive performance.

2601.03112 2026-06-19 eess.IV cs.CV Version update 80%

DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations

Kailin Tan, Jincheng Dai, Sixian Wang, Guo Lu, Shuo Shao, Kai Niu, Wenjun Zhang, Ping Zhang

Affiliations: Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; University of Shanghai for Science and Technology

Topic match (Diffusion Models): Uses a diffusion transformer as the generative decoder.

AI summary: Proposes the DiT-JSCC framework, which jointly learns a semantics-prioritized representation encoder and a diffusion transformer generative decoder. Coarse-to-fine conditional decoding and adaptive bandwidth allocation inspired by Kolmogorov complexity improve semantic consistency and transmission efficiency under extreme channel conditions.

Comments: 14 pages, 14 figures, 2 tables

Abstract

Generative joint source-channel coding (GJSCC) has emerged as a new Deep JSCC paradigm for achieving high-fidelity and robust image transmission under extreme wireless channel conditions, such as ultra-low bandwidth and low signal-to-noise ratio. Recent studies commonly adopt diffusion models as generative decoders, but they frequently produce visually realistic results with limited semantic consistency. This limitation stems from a fundamental mismatch between reconstruction-oriented JSCC encoders and generative decoders, as the former lack explicit semantic discriminability and fail to provide reliable conditional cues. In this paper, we propose DiT-JSCC, a novel GJSCC backbone that jointly learns a semantics-prioritized representation encoder and a diffusion transformer (DiT) based generative decoder. Our open-source project aims to promote future research in GJSCC. Specifically, we design a semantics-detail dual-branch encoder that aligns naturally with a coarse-to-fine conditional DiT decoder, prioritizing semantic consistency under extreme channel conditions. Moreover, a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity is introduced to further improve transmission efficiency, thereby redefining the notion of information value in the era of generative decoding. Extensive experiments demonstrate that DiT-JSCC consistently outperforms existing JSCC methods in both semantic consistency and visual quality, particularly in extreme regimes.

3. Controllable Generation (3 papers)

2601.12870 2026-06-19 cs.CE Version update 75%

Text2Structure3D: Graph-Based Generative Modeling of Equilibrium Structures with Diffusion Transformers

Lazlo Bleker, Zifeng Guo, Kaleb E. Smith, Kam-Ming Mark Tam, Karla Saldaña Ochoa, Pierluigi D'Acunto

Topic match (Controllable Generation): Generates equilibrium structure graphs from text, a form of controllable structure generation.

AI summary: Proposes Text2Structure3D, which combines latent diffusion, a variational graph auto-encoder, and graph transformers to generate structural graphs close to equilibrium from natural language prompts, then applies residual force optimization so that the result fully satisfies static equilibrium.

Journal ref: Results in Engineering 31 (2026) 111375

Abstract

This paper presents Text2Structure3D, a graph-based Machine Learning (ML) model that generates equilibrium structures from natural language prompts. Text2Structure3D is designed to support new intuitive ways of design exploration and iteration in the conceptual structural design process. The approach combines latent diffusion with a Variational Graph Auto-Encoder (VGAE) and graph transformers to generate structural graphs that are close to an equilibrium state. Text2Structure3D integrates a residual force optimization post-processing step that ensures generated structures fully satisfy static equilibrium. The model was trained and validated using a cross-typological dataset of funicular form-found and statically determinate bridge structures, paired with text descriptions that capture the formal and structural features of each bridge. Results demonstrate that Text2Structure3D generates equilibrium structures with strong adherence to text-based specifications and greatly improves generalization capabilities compared to parametric model-based approaches. Text2Structure3D represents an early step toward a general-purpose foundation model for structural design, enabling the integration of generative AI into conceptual design workflows.
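As an illustration of what a residual force is, the sketch below sums external loads and member forces at each node of a 2D pin-jointed bar structure. A structure is in static equilibrium when the residual at every free node is zero. This only evaluates residuals and is not the paper's optimization step. The function name `node_residuals`, the data layout, and the tension-positive sign convention are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vec = Tuple[float, float]


def node_residuals(
    nodes: Dict[str, Vec],
    members: List[Tuple[str, str, float]],
    loads: Dict[str, Vec],
) -> Dict[str, Vec]:
    """Sum external loads and member forces at every node of a 2D bar structure.

    Each member is (node_i, node_j, axial_force), with tension positive.
    """
    residual = {name: list(loads.get(name, (0.0, 0.0))) for name in nodes}
    for i, j, force in members:
        (xi, yi), (xj, yj) = nodes[i], nodes[j]
        length = math.hypot(xj - xi, yj - yi)
        ux, uy = (xj - xi) / length, (yj - yi) / length  # unit vector from i to j
        # A member in tension pulls node i toward j and node j toward i.
        residual[i][0] += force * ux
        residual[i][1] += force * uy
        residual[j][0] -= force * ux
        residual[j][1] -= force * uy
    return {name: (fx, fy) for name, (fx, fy) in residual.items()}


# A unit weight hanging from a support by one vertical bar carrying unit tension.
nodes = {"support": (0.0, 1.0), "weight": (0.0, 0.0)}
members = [("weight", "support", 1.0)]
loads = {"weight": (0.0, -1.0)}
residuals = node_residuals(nodes, members, loads)
```

In this example the free node `weight` has zero residual, while the value at `support` is the force that the support reaction has to balance.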

2601.21081 2026-06-19 cs.CV Version update 70%

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Sun Yat-sen University; The Hong Kong University of Science and Technology, Guangzhou; Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen); Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHK(SZ)

Topic match (Controllable Generation): Compositional structural constraints in text-to-image generation.

AI summary: Proposes the Shape-of-Thought (SoT) framework, which assembles shapes step by step in the rendered 2D domain through a visual chain of thought. It targets compositional structural constraints in text-to-image generation and clearly outperforms direct generation on component numeracy and structural topology.

Comments: ICML 2026

Abstract

Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework for process-supervised progressive shape assembly in the rendered 2D domain, without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. Unlike text-only CoT, each decision is grounded in a rendered state, making counts, attachments, topology, and intermediate part-addition errors inspectable across the trajectory. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming direct generation by +24.2 points on component numeracy and +19.3 points on structural topology. SoT establishes a transparent testbed for rendered-domain structure-aware generation. The code is available at https://github.com/yuhuo03/Shape-of-Thought.

2503.01425 2026-06-19 cs.GR cs.CV Version update 70%

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner

Affiliations: Technical University of Munich; AUDI AG

Topic match (Controllable Generation): Sketch-conditioned 3D mesh generation, a form of controllable generation.

AI summary: Proposes MeshPad, an interactive method for generating and editing 3D meshes from sketch input. Edits are decomposed into deletion and addition of mesh regions, and a Transformer with a vertex-aligned speculative prediction strategy enables fast iterative editing. It improves mesh quality by more than 22% in Chamfer distance and is preferred by 90% of users.

Comments: Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E

Abstract

We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
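For readers unfamiliar with the Chamfer distance quoted above, here is a minimal brute-force sketch between two 3D point sets. Conventions vary (squared or unsquared distances, sums or means), and the abstract does not state which variant the authors use, so the symmetric mean-of-squared-distances form below is an assumption. The function names are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance in both directions."""
    a_to_b = sum(min(_sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


same = chamfer_distance([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)])
shifted = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
uneven = chamfer_distance([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)])
```

In mesh evaluation the point sets are typically sampled from the generated and reference surfaces. Lower values mean the two shapes are closer.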

4. Text-to-Image (2 papers)

2605.10898 2026-06-19 cs.HC Version update 70%

How Creatives Approach GenAI Image Generation: Tensions Between Structured Guidance, Self-Experimentation, and Creative Autonomy

Haidan Liu, Isabelle Kwan, Taiga Okuma, Jeffrey Loverock, Nicholas Vincent, Parmit K Chilana

Topic match (Text-to-Image): Studies how creatives use GenAI image generation tools.

AI summary: Examines how creatives balance structured guidance against self-experimentation when using generative AI image tools. Although guidance helps them understand AI, many still prefer to explore on their own to preserve creative freedom.

Comments: Accepted at ACM Creativity & Cognition 2026

Abstract

As generative AI tools increasingly influence creative practice, they raise longstanding HCI questions about how creatives learn complex software and how they can be better supported. We conducted an interview study with artists and hobbyists (n=8) and a follow-up survey (n=159) to understand how this population approaches and seeks guidance for GenAI image tools. We found that creatives commonly use either self-experimentation or tutorials to explore GenAI tools, yet many struggle with confusing AI terminology. To gain further insight into creatives' learning experiences, we developed a research probe to elicit creatives' perceptions of structured guidance. Our user study with 17 creatives revealed that, even when creatives described the guidance as helpful for understanding AI, many still preferred self-experimentation, feeling that guidance could limit their creativity. Our findings highlight a central tension in supporting AI literacy for creatives: balancing guidance and promoting literacy while preserving creative freedom.

2506.06952 2026-06-19 cs.CV Version update 70%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Affiliations: University of Illinois Urbana-Champaign; University of Maryland; Nvidia; Salesforce AI Research; Intuit AI Research

Topic match (Text-to-Image): Proposes an efficient architecture for image generation with a 6x speedup.

AI summary: Proposes LaTtE-Flow, an efficient unified architecture built on pretrained vision-language models. Layerwise timestep experts and a timestep-conditioned residual attention mechanism support both image understanding and generation, with generation about 6x faster.

Comments: Unified multimodal model, Flow-matching

Abstract

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
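The routing idea of assigning each timestep to one group of layers can be sketched as follows. This is only an illustration, not the LaTtE-Flow implementation. The equal-sized contiguous split, the [0, 1] timestep range, and the names `make_layer_groups` and `active_layers` are all assumptions.

```python
from typing import List


def make_layer_groups(num_layers: int, num_groups: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into equally sized contiguous groups."""
    if num_layers % num_groups != 0:
        raise ValueError("num_layers must be divisible by num_groups")
    size = num_layers // num_groups
    return [range(g * size, (g + 1) * size) for g in range(num_groups)]


def active_layers(t: float, groups: List[range]) -> range:
    """Return the layer group assigned to timestep t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    index = min(int(t * len(groups)), len(groups) - 1)  # t = 1.0 falls in the last group
    return groups[index]


groups = make_layer_groups(num_layers=24, num_groups=4)
early = active_layers(0.0, groups)
middle = active_layers(0.5, groups)
late = active_layers(1.0, groups)
```

With 24 layers in 4 groups, each sampling step in this sketch touches 6 layers instead of 24, which is the kind of per-step saving the abstract attributes to activating only a subset of layers.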

5. Other Image Generation (1 paper)

2507.04081 2026-06-19 cs.NI Version update 60%

Graph Diffusion-Based AeBS Deployment and Resource Allocation in RSMA-Enabled URLLC Low-Altitude Wireless Networks

Xudong Wang, Lei Feng, Jiacheng Wang, Hongyang Du, Changyuan Zhao, Wenjing Li, Ping Zhang

Topic match (Other Image Generation): Uses a graph diffusion model for resource allocation; weakly related.

AI summary: To address limited spectrum and co-channel interference in low-altitude wireless networks, proposes a transmission design based on rate-splitting multiple access (RSMA) and uses a generative graph diffusion model to jointly optimize AeBS deployment, user association, and resource allocation, maximizing sum rate and coverage.

Comments: 13 pages, 9 figures

Abstract

As a key component of low-altitude wireless networks, aerial base stations (AeBSs) provide flexible and reliable wireless coverage to support 6G ultra-reliable and low-latency communication (URLLC) services. However, limited spectrum resources and severe co-channel interference pose significant challenges to the deployment and resource allocation of AeBSs. To address these limitations, this paper proposes a novel rate-splitting multiple access (RSMA)-enabled transmission design to manage interference and enhance URLLC services in spectrum-constrained multi-AeBS networks. We formulate a joint optimization problem involving AeBS deployment, user association, and resource allocation to maximize the sum rate and coverage of the system. Given the NP-hard nature of the problem, we propose a novel alternating optimization framework based on generative graph diffusion models. Specifically, we model AeBSs and ground users as graph nodes and employ a discrete graph generation process solved via denoising diffusion to explore the combinatorial space of deployment and association strategies. Moreover, successive convex approximation (SCA) is adopted to optimize AeBS beamforming and RSMA rate allocation under finite blocklength constraints. Extensive simulations demonstrate that the proposed algorithm outperforms existing methods in terms of convergence speed, sum rate, and coverage, while also exhibiting robust performance under varying network densities and interference levels.