arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07179 2026-06-08 cs.CV cs.MM eess.IV New submission

EvoGS: Constructing Continuous-Layered Gaussian Splatting with Evolution Tree for Scalable 3D Streaming

Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi

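Affiliations: National University of Singapore; IRIT - University of Toulouse; IPAL, IRL2955

AI summary: Proposes EvoGS, the first continuous-layering Gaussian splatting representation. An Evolution Tree structure provides parent-child refinement, removing redundancy and supporting scalable 3D streaming; transmission payload and GPU VRAM footprint drop by up to 2.4× and 5.5×, respectively.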

Comments Project page: https://yuang-ian.github.io/evogs/

Abstract

Streaming 3D Gaussian Splatting requires highly scalable, progressive representations. Existing progressive methods rely on discrete layering, accumulating separate splat sets for each level of detail. This structural independence between layers inherently leads to error accumulation, severe splat redundancy, and uncontrolled quality transitions. We propose EvoGS, the first continuous-layering representation. Organized as an Evolution Tree, EvoGS generates finer details via an explicit, wavelet-inspired parent-child refinement. This lets child nodes structurally correct ancestral errors, yielding inherently sparse and highly compressible inter-layer signals. Extensive experiments show that EvoGS reduces splat redundancy from over 65% to under 25%. Compared to state-of-the-art baselines, it reduces transmission payload and GPU VRAM footprint by up to 2.4× and 5.5×, respectively, and achieves smooth quality transitions optimal for real-time adaptive streaming. Project page: https://yuang-ian.github.io/evogs/

2606.07175 2026-06-08 cs.CV New submission

Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

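Affiliations: City University of Hong Kong; Hon Hai Research Institute; Lingnan University

AI summary: Targets the privacy challenge of unpredictable sensitive-information leakage that multimodal large language models face in open-world settings. Proposes APD, a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues, and pairs it with the new AdaptShield benchmark to achieve balanced gains in privacy protection and context preservation.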

Abstract

Multimodal large language models (MLLMs) have raised new privacy challenges. On the data side, user-provided inputs often include unpredictable sensitive information; while on the downstream task side, model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods, however, rely on predefined sensitive categories and fixed obfuscation strategies, struggling to tackle such challenges in MLLMs. To address this dilemma, we propose Anchored Privacy Drifting (APD), a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image. To systematically evaluate this dual objective of privacy protection and contextual preservation, we introduce AdaptShield, a comprehensive benchmark covering 22 privacy categories, which combines conventional privacy metrics with MLLM-based assessments of contextual utility. Extensive experiments show that our method achieves balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation across four MLLM series, i.e., Qwen2.5, Qwen3, InternVL3, and InternVL3.5.

2606.07172 2026-06-08 cs.CV cs.AI cs.CL cs.LG New submission

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

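Affiliations: University of São Paulo; National University of Singapore

AI summary: Studies the geospatial representation ability of vision-only, vision-language, and multimodal models, and finds that textual supervision effectively improves spatial encoding, advancing geospatial AI.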

Comments Accepted at ICML 2026

Abstract

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

2606.07171 2026-06-08 cs.CV New submission

When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

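Affiliations: City University of Hong Kong; Hon Hai Research Institute; Lingnan University

AI summary: To address privacy risks in MLLM-based image editing, proposes SPPE, the first recovery-oriented benchmark for surrogate-based privacy-preserving editing, covering 36 fine-grained privacy categories and 65 editing instructions. It defines two tasks, editability assessment and surrogate-to-source edit recovery, each with a dedicated method.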

Abstract

Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet, the resulting output is often an edited surrogate rather than the desired edited source image, neglecting local recovery in both design and evaluation scope. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.

2606.07170 2026-06-08 cs.RO New submission

Test-Time Trajectory Optimization for Autonomous Driving

Yihong Xu, Eloi Zablocki, Yuan Yin, Elias Ramzi, Ellington Kirby, Alexandre Boulch, Matthieu Cord

Affiliations: valeo.ai; Sorbonne Université; CNRS; ISIR

AI summary: Proposes TOAD, which optimizes trajectories at test time with the Cross-Entropy Method and improves a range of planners without retraining.

Abstract

End-to-end planners for autonomous driving typically generate a set of candidate trajectories, score each one, and return the highest-scoring candidate. However, the scorer is applied only after the proposals are generated and cannot influence the set of trajectories: a weak set of candidates limits planning performance regardless of the scorer's quality. We instead treat the scorer as a learned trajectory-level reward function and search for trajectories that maximize it. Our method, TOAD, runs the Cross-Entropy Method at test time, warm-started from the planner's proposals. It requires no retraining and is plug-and-play for existing planners. Across six base planners, TOAD improves results on NAVSIM-v1 (94.7 PDMS), NAVSIM-v2 (56.3 EPDMS), and the closed-loop HUGSIM benchmark. The code will be made publicly available via the project page: https://valeoai.github.io/TOAD/.
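
As a rough illustration of the test-time search described in the abstract above, the sketch below runs a Cross-Entropy Method loop warm-started from a set of proposals. It is not TOAD's implementation: the flat trajectory encoding, the diagonal Gaussian sampler, the hyperparameters, and the `toy_score` reward are all assumptions made for this example.

```python
import random
import statistics


def toy_score(traj):
    """Hypothetical stand-in for a learned scorer: higher is better."""
    target = [2.0, -1.0, 0.5, 3.0]
    return -sum((x - t) ** 2 for x, t in zip(traj, target))


def cem_refine(proposals, score, iters=20, samples=64,
               elite_frac=0.125, min_std=1e-3, seed=0):
    """Search for a high-scoring trajectory, starting from the proposals.

    proposals: list of candidate trajectories, each a flat list of floats.
    score:     callable mapping a trajectory to a scalar reward.
    """
    rng = random.Random(seed)
    dim = len(proposals[0])
    # Warm start: fit a diagonal Gaussian to the planner's proposals.
    mean = [statistics.fmean(p[d] for p in proposals) for d in range(dim)]
    std = [max(statistics.pstdev([p[d] for p in proposals]), min_std)
           for d in range(dim)]
    # Track the best trajectory seen so far, beginning with the best proposal.
    best = max(proposals, key=score)
    n_elite = max(2, int(samples * elite_frac))
    for _ in range(iters):
        batch = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                 for _ in range(samples)]
        batch.sort(key=score, reverse=True)
        elites = batch[:n_elite]
        if score(elites[0]) > score(best):
            best = elites[0]
        # Refit the sampling distribution to the elite samples.
        mean = [statistics.fmean(e[d] for e in elites) for d in range(dim)]
        std = [max(statistics.pstdev([e[d] for e in elites]), min_std)
               for d in range(dim)]
    return best
```

Because the search starts from the best proposal and only replaces it with a strictly higher-scoring sample, the returned trajectory never scores below the original candidate set under the given scorer.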

2606.07167 2026-06-08 cs.CL cs.AI New submission

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov

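Affiliations: MBZUAI

AI summary: Urdu lacks an MMLU-style benchmark built from native educational sources. Proposes UrduMMLU, with 26,431 multiple-choice questions across 26 subjects; an evaluation of 30 LLMs finds that Gemini-3.5-Flash performs best and that most models do poorly on humanities subjects.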

Comments 27 pages, 18 figures, 17 tables, Submitted to ARR May 2026

Abstract

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

2606.07161 2026-06-08 cs.CV New submission

TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Duc Tri Tran, Trung Thanh Nguyen, Vijay John, Phi Le Nguyen, Yasutomo Kawanishi

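Affiliations: RIKEN; Hanoi University of Science and Technology; Nagoya University; Lawrence Technological University; Ritsumeikan University

AI summary: Proposes TraRA, which aggregates text recognition at the trajectory level using temporal and multimodal consistency to resolve inconsistent frame-level recognition caused by motion blur, occlusion, and similar factors in surveillance video, improving tracking and recognition on multiple benchmarks.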

Comments 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems

Abstract

Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) the Temporal Clustering and (2) the Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation-enhanced Vision-Language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at https://github.com/trid2912/TraRA.

2606.07151 2026-06-08 cs.LG New submission

Geodesics of Dynamic Graphs for Regime Change Detection

William Cappelletti, Étienne Voutaz, Pascal Frossard

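Affiliations: LTS4, EPFL; Cyber-Defence Campus, Armasuisse

AI summary: Defines regimes in dynamic networks as trajectories of temporal graphs along geodesics, uses graph regression to measure the cumulative distance of observed graphs from the geodesic, and combines this with change point detection algorithms to identify regime changes, outperforming existing methods on synthetic and real data.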

Abstract

Traditional change point detection in dynamic networks assumes abrupt transitions between stationary states, overlooking scenarios of continuous evolution, which arise in most real-world applications, such as social networks or physical systems. We address this gap by formally defining regimes as periods of coherent dynamics in temporal graphs, which we characterize as trajectories along geodesics in a suitably defined graph space. This original perspective allows us to define regime changes as significant drifts in dynamics, either toward new trajectories or with pace changes. We leverage graph regression methods to measure the cumulative distance of sequences of observed graphs from the estimated geodesics between their endpoints, in the relevant graph space, which we can combine with change point detection algorithms. We present experiments on dynamic networks, with changing trajectories and varying speeds, in which we outperform state-of-the-art change point detection models. Then, we analyse mobility data during the Covid-19 pandemic, and show that our assumptions on regular network evolution lead to change points that are better aligned with external events than the outcomes of baseline methods. Our work is the first to model and detect changes between evolving regimes in graph space, providing a realistic and powerful tool for analyzing complex temporal graph data.
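
To make the "cumulative distance from the geodesic between endpoints" idea concrete, here is a minimal sketch under a strong simplifying assumption: graphs are weighted adjacency matrices in a Euclidean space with the Frobenius norm, so the geodesic between the first and last graph is a straight line sampled at uniform times. The abstract instead estimates geodesics with graph regression in a suitably defined graph space; the function names below are hypothetical.

```python
import math


def frobenius(a, b):
    """Frobenius distance between two equally sized matrices (lists of rows)."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def geodesic_deviation(graphs):
    """Sum of distances between each observed graph and the point at the
    same relative time on the straight line joining the two endpoints."""
    if len(graphs) < 2:
        raise ValueError("need at least two graphs")
    first, last = graphs[0], graphs[-1]
    steps = len(graphs) - 1
    total = 0.0
    for t, observed in enumerate(graphs):
        w = t / steps  # relative position along the segment, in [0, 1]
        on_line = [[(1 - w) * x + w * y for x, y in zip(row_f, row_l)]
                   for row_f, row_l in zip(first, last)]
        total += frobenius(observed, on_line)
    return total
```

A window of graphs that evolves steadily between its endpoints gives a deviation near zero, while a detour or a change of pace inside the window increases it. A score of this kind could then be passed to a change point detector.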

2606.07146 2026-06-08 cs.LG cs.CE New submission

Decision-Aware Evaluation of Physics-Informed Surrogates

Daniel Cieślak, Andrzej Czyżewski

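Affiliations: Gdańsk University of Technology

AI summary: For evaluating physics-informed machine learning in engineering decisions, proposes the pinn-gym benchmark. Using metrics that span curve error, physical admissibility, top-k retrieval, and regret, it shows that low nRMSE is insufficient to identify useful designs and that physics-informed losses alter trade-offs rather than monotonically improving all metrics.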

Comments 12 pages, 5 figures, 9 tables. Code and data available at https://github.com/Dyniel/pinn-gym

Abstract

Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open benchmark for material-conditioned lattice design that couples a transparent reduced-order crush-and-impact oracle with five printable polymer cards, dimensionless force-response targets and a protocol spanning curve fidelity, physical admissibility, top-k retrieval and mass regret. Across per-material, pooled and cross-material settings, low nRMSE is frequently insufficient to identify useful design selections. Physics-informed losses alter trade-offs rather than monotonically improving all metrics, and dimensionless conditioning improves comparability without making transfer symmetric. The benchmark is not a certified material model; within the released oracle, candidate generator and material cards, pinn-gym provides a reproducible testbed for evaluating PIML surrogates as decision systems rather than curve predictors alone.
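
The claim that low curve error need not identify useful designs can be seen in a toy calculation. The sketch below is not pinn-gym code: the metric definitions are simplified guesses at what "top-k retrieval" and "regret" could mean for a minimization objective, and all numbers are made up for illustration.

```python
import math


def rmse(pred, true):
    """Root-mean-square error between predicted and true objective values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def top_k_overlap(pred, true, k):
    """Fraction of the true k best (lowest) designs also ranked in the
    surrogate's k best."""
    best_pred = set(sorted(range(len(pred)), key=pred.__getitem__)[:k])
    best_true = set(sorted(range(len(true)), key=true.__getitem__)[:k])
    return len(best_pred & best_true) / k


def regret(pred, true):
    """True cost of the design the surrogate would pick, minus the true optimum."""
    chosen = min(range(len(pred)), key=pred.__getitem__)
    return true[chosen] - min(true)


# Hypothetical true objective (e.g. mass) for four candidate designs.
TRUE_MASS = [1.0, 1.1, 2.0, 3.0]
# Surrogate A: small errors, but it swaps the order of the two best designs.
SURROGATE_A = [1.15, 1.05, 2.0, 3.0]
# Surrogate B: large constant bias, but the ranking is preserved.
SURROGATE_B = [1.5, 1.6, 2.5, 3.5]
```

Surrogate A has the lower RMSE yet selects the second-best design and so incurs positive regret, while surrogate B has the higher RMSE and zero regret.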

2606.07145 2026-06-08 cs.CV New submission

Consistent-Inversion: Reverse Consistency Guidance for Structure-Preserving Visual Editing

Xiaocheng Lu, Jingcai Guo, Song Guo

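Affiliations: Hong Kong University of Science and Technology; The Hong Kong Polytechnic University

AI summary: Proposes Consistent-Inversion, a training-free reverse consistency guidance framework that checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt and uses the reverse consistency discrepancy to correct early denoising steps, improving edits while preserving structure.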

Comments Submitted to IEEE Transactions on Multimedia; 10 pages, 4 figures

Abstract

Text-guided diffusion models have become effective tools for real-image visual editing, where the edited image must follow a target instruction while preserving editing-irrelevant structure. Most training-free editors rely on inversion: a source image is mapped to a noisy latent trajectory and the terminal latent is reused for target-prompt denoising. This reuse is useful for preservation, but it also couples source reconstruction and target editing. The resulting trajectory mismatch may either damage background/layout details or over-constrain the intended edit. This paper presents Consistent-Inversion, a training-free reverse consistency guidance framework for structure-preserving visual editing. Instead of treating the inverted source latent as a fixed initialization, Consistent-Inversion checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt. To make this check well-defined, we construct an auxiliary target-side noise representation, perform source-guided reverse denoising, and use the resulting reverse consistency discrepancy as a correction signal for selected early target denoising steps. The method does not update model parameters, is compatible with inversion-based editors, and introduces only a small inference overhead when applied sparsely. Experiments on PIE-Bench show that Consistent-Inversion improves background and structural fidelity under a unified SD3.5 protocol while maintaining target-prompt alignment, and compatibility experiments further verify the same correction principle on classical Stable-Diffusion inversion pipelines.

2606.07141 2026-06-08 cs.LG cs.AI New submission

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

Anurag Sharma, Sai Teja Chunchu, Prasenjit Mitra, Sandipan Sikdar, Koustav Rudra

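Affiliations: IIT Kharagpur; Carnegie Mellon University; L3S Research Center, Leibniz University Hannover

AI summary: Proposes the REMEDI benchmark for machine unlearning in multi-label clinical disease inference. Built on the MIMIC-III database, it evaluates the trade-off between utility and unlearning performance in existing methods and finds them largely unsuited to multi-label tasks.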

Comments Under review

Abstract

Language models for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining with minor data removal is resource-intensive. While several machine unlearning methods exist, their utility is generally restricted to non-medical domains. Moreover, the existing benchmarks for evaluating such unlearning methods primarily utilize synthetically curated datasets, which are not truly representative of real-world systems. Hence, the effectiveness of these unlearning methods in the medical domain is largely unclear. To this end, we introduce REMEDI, an extensive benchmark for machine unlearning tailored to multi-label and multi-class clinical disease inference, where label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging. Unlike the existing benchmarks, REMEDI considers: (1) a relevant application domain (medical), (2) comprehensive unlearning setups involving diverse sets of forget instances, (3) challenging unlearning scenarios including multi-label and multi-class classification tasks, and (4) evaluation metrics involving performance both in terms of utility and extent of unlearning achieved. REMEDI is developed using the MIMIC-III clinical database, which contains comprehensive clinical data of patients. Experiments with existing unlearning methods indicate that there is a trade-off between utility and unlearning performance. These methods are also largely unsuited to multi-label classification tasks. To facilitate reproducibility, we make our benchmark publicly available.

2606.07134 2026-06-08 cs.LG New submission

$α$-PFN: Fast Entropy Search via In-Context Learning

Herilalaina Rakotoarison, Steven Adriaensen, Tom Viering, Carl Hvarfner, Samuel Müller, Frank Hutter, Eytan Bakshy

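Affiliations: University of Freiburg; University of Tübingen; University of Amsterdam; Lund University; Meta

AI summary: Proposes a two-stage amortization strategy that uses Prior-data Fitted Networks (PFNs) to approximate entropy search acquisition functions in a single forward pass, achieving speedups of over 50x while performing on par with state-of-the-art methods on synthetic and real-world benchmarks.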

Comments Published at ICML 2026

Abstract

Information-theoretic acquisition functions such as Entropy Search (ES) offer a principled exploration-exploitation framework for Bayesian optimization (BO). However, their practical implementation relies on complicated and slow approximations, i.e., a Monte Carlo estimation of the information gain. This complexity can introduce numerical errors and requires specialized, hand-crafted implementations. We propose a two-stage amortization strategy that learns to approximate entropy search-based acquisition functions using Prior-data Fitted Networks (PFNs) in a single forward pass. A first PFN is trained to be conditioned on information about the optima; second, the $α$-PFN is trained to predict the expected information gain by training on information gains measured with the first PFN. The $α$-PFN offers a flexible learned approximation, which replaces the complex heuristic approximations with a single forward pass per candidate, enabling rapid and extensible acquisition evaluation. Empirically, our approach is competitive with state-of-the-art entropy search implementations on synthetic and real-world benchmarks, while accelerating the different entropy search variants across all our experiments, with speed ups over 50x. Source code: https://github.com/automl/AlphaPFN.

2606.07130 2026-06-08 cs.CL New submission

Explicit Evidence Grounding via Structured Inline Citation Generation

Anar Yeginbergen, Amelie Wührl, Anna Rogers, Rodrigo Agerri

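Affiliations: University of the Basque Country (UPV/EHU); IT University of Copenhagen

AI summary: Proposes the FullCite framework, which generates structured inline citations through three strategies: prompt-based generation, constrained decoding, and post hoc span alignment. Evaluating citation quality and faithfulness on three QA benchmarks, it finds that LLMs can identify relevant documents but struggle to pinpoint the supporting evidence spans.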

Abstract

As AI systems become more widely adopted, the demand for factual and faithful generation grows. Properly attributing information through citations therefore becomes crucial. This work introduces FullCite, a framework that, in contrast to most previous works, generates structured inline citations linking each claim to both its source document and supporting evidence. FullCite proposes three strategies for inline citation generation: prompt-based generation, constrained decoding over a citation grammar, and post hoc span alignment. Using three question answering benchmarks, namely, ASQA, BioASQ, and ExpertQA, we assess citation quality and faithfulness along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. Our evaluation shows that while LLMs are generally effective at identifying relevant documents, they struggle to identify the precise supporting spans within them. This gap suggests that achieving faithful attributed QA will require research to place greater emphasis on precise evidence span identification.

2606.07128 2026-06-08 cs.LG New submission

A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

Zhuphua Cao

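Affiliations: Key Laboratory of Natural Medicines of the Changbai Mountain, Ministry of Education, College of Pharmacy, Yanbian University

AI summary: Proposes the FDRS framework, which combines statistical and machine-learning methods to detect non-random digit patterns in numerical data. Validated on an enzymatic absorbance dataset and simulated irregular data, it can effectively grade risk.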

Abstract

Raw numerical datasets remain less systematically examined in integrity screening than images, plagiarism, or summary-statistic inconsistencies. We developed the Fabrication-risk Digit Randomness Screening model (FDRS), a statistical and machine-learning framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramer's V, entropy metrics, Kullback-Leibler divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring. It was evaluated using an instrument-derived enzymatic absorbance dataset (RawData, n=253) and a blinded manually simulated irregular dataset (ErrData, n=255). RawData showed no significant deviation in single third-decimal-digit analysis, whereas ErrData showed a significant deviation. In joint third-fourth decimal digit analysis, ErrData showed higher Cramer's V, lower normalized entropy, higher KL divergence, and a more persistent progressive-subsampling deviation signal. In internal validation, Elastic-net Logistic Regression achieved the highest AUC (0.98395) and lowest Brier score (0.048439), while Random Forest achieved the highest accuracy (0.926667) and balanced accuracy (0.935). RawData received a low ensemble risk score of 0.124627 and was classified as Grade 0; ErrData received a score of 0.740760 and was classified as Grade 3. External real-world benchmarks supported graded risk stratification: three datasets without identified public post-publication concerns were classified as Grade 0 or 1, whereas two datasets from publicly questioned or institutionally handled articles were classified as Grade 2 or 3. FDRS can prioritize raw numerical datasets for further review by integrating interpretable statistical and machine-learning features. It is an auxiliary digit-structure screening tool, not standalone evidence of fabrication or misconduct.
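
Two of the digit-level statistics named in the abstract, normalized entropy and Kullback-Leibler divergence, can be sketched as follows. This is an illustrative reconstruction rather than the FDRS code: it assumes readings are stored as decimal strings with enough decimal places, compares against a uniform reference distribution over the ten digits, and uses hypothetical function names.

```python
import math
from collections import Counter


def decimal_digit(value: str, position: int) -> int:
    """Return the digit at the given decimal place (1 = first decimal)
    of a decimal string such as '0.1234'."""
    fraction = value.split(".")[1]
    return int(fraction[position - 1])


def digit_distribution(values, position):
    """Empirical probability of each digit 0-9 at one decimal place."""
    counts = Counter(decimal_digit(v, position) for v in values)
    n = len(values)
    return [counts.get(d, 0) / n for d in range(10)]


def normalized_entropy(p):
    """Shannon entropy divided by its maximum, log(10): 1 means uniform digits."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))


def kl_from_uniform(p):
    """KL divergence of p from the uniform distribution over digits."""
    u = 1 / len(p)
    return sum(x * math.log(x / u) for x in p if x > 0)
```

Strings are used instead of floats so that trailing digits are read exactly as recorded, without binary rounding. Lower normalized entropy and higher KL divergence both indicate a digit distribution further from uniform, which matches the direction of the ErrData findings reported above.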

2606.07127 2026-06-08 cs.LG New submission

Learning Explicit Behavioral Models with Adaptive Questions and World-Model Probes

Hikaru Shindo, Yu Deng, Teng Cao, Quentin Delfosse, Christopher Tauchmann, Jannis Blüml, Gopika Sudhakaran, Kristian Kersting

Affiliations: Artificial Intelligence and Machine Learning Lab; Technical University of Darmstadt; Hessian Center for Artificial Intelligence (hessian.AI); German Research Center for Artificial Intelligence (DFKI); Department of Computer Science; Centre for Cognitive Science

AI summary: Proposes the Explicit Symbolic Behavioral Model (ESBM), which couples task performance with interpretable mechanisms through adaptive questions and world-model probes, learning high-scoring policies in Atari tasks while producing explicit answers and mechanism predictions.

英文摘要

Interactive agents trained only against task return can achieve high scores while failing to represent the mechanisms that make their actions succeed. This makes brittle behavior difficult to diagnose and limits adaptation when environment dynamics change. Existing LLM reflection and policy-code repair can revise behavior from failed trajectories, but questions and world-understanding tests are usually used only after training. We introduce an Explicit Symbolic Behavioral Model (ESBM), a trainable behavioral model that couples task performance with evidence-grounded question answering and executable mechanism prediction. An ESBM represents behavior through typed predicates, weighted rules, bounded options and mechanism memory; the mechanism layer predicts symbolic events, object changes, rewards and terminal consequences under action interventions. After each rollout, adaptive questions and active world-model probes convert score failures, QA errors and transition-prediction errors into constraints for local ESBM edits. Candidate models are selected by a multi-criterion rule that jointly evaluates task score, answerability and active world-model consistency. Under the tested Atari-style protocols, ESBM learns high-scoring policies while producing explicit answers and executable mechanism predictions, indicating that adaptive questions can serve as both training pressure and reusable benchmarks for mechanistic policy learning in this setting.

2606.07123 2026-06-08 cs.CL New submission

Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings

Amanda Cercas Curry, Lucio La Cava, Luca Maria Aiello, Gianmarco De Francisci Morales

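Affiliations * Independent Researcher; University of Calabria; IT University of Copenhagen; CENTAI

AI summary Proposes fusion embeddings that integrate textual and demographic information to model perspective-dependent variation in the interpretation of social meaning on a dataset of 28k human annotations, improving relative macro PR-AUC by 5.9-6.5% over text-only baselines.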

Details
English abstract

Social meaning in language is inherently perspectival, varying across annotator backgrounds, demographics, and ideological positions. However, most NLP systems collapse this variation into a single ground-truth label, ignoring the diversity of interpretations. In this work, we model social dimensions along a perspectivist spectrum, capturing how interpretations vary across demographic groups on a dataset consisting of 28k human annotations. We benchmark multiple modeling paradigms, including zero-shot, few-shot, and fine-tuned approaches, and propose fusion embeddings that integrate textual and demographic representations. Our fusion models yield consistent and statistically significant improvements over text-only baselines across all fusion strategies (+5.9-6.5% relative macro PR-AUC), with shuffle ablations confirming that demographic profiles carry genuine predictive signal rather than spurious correlations.

2606.07120 2026-06-08 cs.LG New submission

Beyond Linear and Overcomplete Regimes: A Mean-Field Analysis of Bottleneck Autoencoders

Santanu Das, Ramyak Bilas, Pascal Esser, Satyaki Mukherjee

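Affiliations * STCS department, Tata Institute of Fundamental Research; Department of Mathematics, Indiana University; Department of Mathematics, Ludwig-Maximilians-Universität München; Department of Mathematics, National University of Singapore

AI summary Studies the learning dynamics of nonlinear bottleneck autoencoders in the mean-field regime, derives explicit dynamics for the encoder and decoder, and proves that the empirical risk of finite-width networks tracks the mean-field risk trajectory with high probability and that the finite-width risk converges to the mean-field optimum.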

Details
English abstract

Autoencoders (AEs) learn low-dimensional representations by mapping data into a latent space while minimizing reconstruction error. Despite their empirical success, theoretical understanding remains limited and largely restricted to linear models or settings without a bottleneck. In this work, we study nonlinear AEs with a fixed finite-dimensional bottleneck in the mean-field (MF) regime. We derive explicit MF learning dynamics for both encoder and decoder, providing a tractable characterization of training in the nonlinear setting. We show that, over finite time horizons, the empirical risk of finite-width networks trained with stochastic gradient descent closely tracks the MF risk trajectory with high probability. At optimality, we further establish that the finite-width risk converges to the MF optimum, demonstrating that finite networks are sufficiently expressive to approximate the infinite-width solution.

2606.07117 2026-06-08 cs.CV cs.AI New submission

Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

Yibo Liu, Ziwei Zhang, Haozhou Pang, Menghao Li, Lanshan He, Gan Qi

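Affiliations * Kuaishou GameMind Lab

AI summary Proposes Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations, using a unified mesh-texture joint representation and a 3D representation alignment loss to address geometric structural distortion and texture detail degradation.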

Details
English abstract

This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations. Traditional approaches typically require adapting 3D representations to the 2D domain to leverage pre-trained diffusion models, which inevitably introduces domain adaptation issues including geometric structural distortion and texture detail degradation. To address these limitations, we design a unified mesh-texture joint representation that simultaneously models both geometric structures and texture features through a Transformer-based scene encoder, effectively maintaining spatial relationships and visual consistency among objects within scenes. We further propose the 3D Representation Alignment Loss (3D REPA Loss), which employs an improved contrastive learning mechanism to align multi-level semantic representations in the latent space, significantly enhancing geometric and textural fidelity. Experimental results demonstrate that Native3D outperforms existing methods in both generation quality and editing flexibility, providing a novel solution for 3D scene editing.

2606.07116 2026-06-08 cs.LG cs.AI cs.CL New submission

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

Haoqi Wang, Lorenz K. Mueller, Jiawei Zhuang, Mathieu Salzmann, Lukas Cavigelli

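Affiliations * School of Computer and Communication Sciences, EPFL, Switzerland; Huawei, Switzerland; Swiss Data Science Center, ETHZ & EPFL, Switzerland

AI summary Proposes OffQ, which identifies an outlier subspace with top-1 PCA, concentrates the outliers into a single channel via rotation, and converts that channel into a shared offset, enabling low-bit uniform quantization of LLMs and improving accuracy under W4A4KV4.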

Details
English abstract

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.
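The rotate-then-offset idea described in the abstract can be sketched on toy data. This is only an illustration under stated assumptions, not OffQ itself: the dominant direction is found by power iteration on the uncentered second-moment matrix (a stand-in for the paper's top-1 PCA), the rotation is a Householder reflection, and the shared offset is taken to be the mean of the concentrated channel. All function names and the toy activations are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top1_direction(activations, iterations=200):
    """Power iteration on X^T X: the dominant direction of the activations."""
    dim = len(activations[0])
    second_moment = [
        [sum(row[i] * row[j] for row in activations) for j in range(dim)]
        for i in range(dim)
    ]
    direction = [1.0] * dim  # assumed start; must not be orthogonal to the outlier
    for _ in range(iterations):
        product = matvec(second_moment, direction)
        norm = math.sqrt(sum(c * c for c in product))
        direction = [c / norm for c in product]
    return direction


def householder_to_first_axis(direction):
    """Orthogonal reflection H that maps the unit vector `direction` to e_0."""
    dim = len(direction)
    u = list(direction)
    u[0] -= 1.0
    uu = sum(c * c for c in u)
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    if uu < 1e-12:  # direction already lies on the first axis
        return identity
    return [
        [identity[i][j] - 2.0 * u[i] * u[j] / uu for j in range(dim)]
        for i in range(dim)
    ]


def rotate_and_offset(activations):
    """Concentrate the outlier into channel 0, then subtract a shared offset."""
    rotation = householder_to_first_axis(top1_direction(activations))
    rotated = [matvec(rotation, row) for row in activations]  # H is symmetric
    offset = sum(row[0] for row in rotated) / len(rotated)
    shifted = [[row[0] - offset] + row[1:] for row in rotated]
    return rotation, offset, shifted


# Toy activations: small values plus a large, consistent outlier on channels 0 and 1.
toy = []
for i in range(8):
    row = [0.1 * (((3 * i + 7 * j) % 5) - 2) for j in range(4)]
    row[0] += 50.0
    row[1] += 50.0
    toy.append(row)

rotation, offset, shifted = rotate_and_offset(toy)
max_before = max(abs(v) for row in toy for v in row)
max_after = max(abs(v) for row in shifted for v in row)
print(max_before, max_after, offset)
```

On this toy input the dynamic range that a uniform quantization grid has to cover shrinks sharply once the outlier is absorbed into the offset. That is the effect the abstract attributes to offsetting. How OffQ folds the rotation and offset into the model is not covered here.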

2606.07115 2026-06-08 cs.CV cs.GR New submission

3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing

Tobias Preintner, Yunfei Deng, Phillip Müller, Sebastian Illing, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

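Affiliations * ETH Zürich

AI summary Proposes 3DMorph, a training-free framework that automatically localizes the relevant 3D region from a single edited image and transfers the 2D modification to it, supports intermediate shape generation, and outperforms existing methods on the Delta3D benchmark.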

Comments Accepted to IJCNN 2026

Details
English abstract

Despite recent progress in 3D generation, intuitive editing of existing shapes remains limited. Unlike images, which benefit from well-established inpainting tools, general 3D objects such as meshes still lack simple and effective methods for local shape editing. Existing approaches are often global, domain-specific, require complex user interaction, or focus on appearance (color and texture) rather than geometry. We introduce 3DMorph, a training-free framework for single-image-guided local 3D shape editing and morphing. Given an edited image showing a desired shape modification, our method automatically localizes the relevant 3D region and transfers 2D modifications to 3D while preserving unmodified areas. 3DMorph also enables intermediate shape generation between the original and edited objects, facilitating design exploration. To benchmark editing quality, we introduce Delta3D, an image-guided local 3D editing benchmark with paired ground-truth edits. Experimental results show that 3DMorph translates intuitive 2D edits into 3D, outperforming state-of-the-art generative and editing methods.

2606.07113 2026-06-08 cs.AI New submission

Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation

Manuele Leonelli

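Affiliations * Manuele Leonelli

AI summary Addresses the opacity of large language models in critical domains by proposing the Glassbox Framework, which uses Bayesian networks as an ante-hoc mediation layer to enable auditable reasoning, uncertainty quantification, and contestable outputs.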

Details
English abstract

Large language models are rapidly becoming infrastructural components in high-stakes institutional settings, including public administration, legal reasoning, and healthcare, where opacity is not merely inconvenient but institutionally and legally untenable. Existing approaches to explainability are predominantly post-hoc, offering unstable, non-contestable accounts that have no formal relationship to the reasoning process that produced the output. We argue that the problem is not the absence of explanation but the absence of structured reasoning in the first place. This paper makes the case for a fundamentally different architecture, which we call the Glassbox Framework, in which Bayesian networks serve as transparent, ante-hoc mediation layers for generative models. Bayesian networks encode domain knowledge, causal assumptions, and probabilistic dependencies before inference occurs, enabling auditable reasoning traces, uncertainty quantification, and contestable outputs. We characterise the architecture of this framework and ground it in a benefit eligibility scenario, identifying the foundational challenges spanning semantic alignment, dynamic model construction, probabilistic grounding, and human governance that must be solved to realise it at scale. By shifting from post-hoc explanation to ante-hoc probabilistic mediation, this work outlines a principled path toward AI systems that are not only powerful but fundamentally accountable.

2606.07107 2026-06-08 cs.RO New submission

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

Jinhao Wu, Shiduo Zhang, Yicheng Liu, Xiaopeng Yu, Sixian Li, Siyin Wang, Hang Zhao, Jing Huo, Yang Gao, Jingjing Gong, Xipeng Qiu, Yu-Gang Jiang

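Affiliations * Nanjing University; Shanghai Innovation Institute; Fudan University; Tsinghua University

AI summary Proposes Coarse-to-Control, a framework that introduces planning natively in the action-token space by first predicting a coarse action-token sequence and then generating executable actions, improving performance on long-horizon tasks.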

Details
English abstract

Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens that summarize the intended future trajectory, and then generate executable action tokens conditioned on this plan. Because both planning and execution share a unified discrete action vocabulary, the plan stays close to the control manifold and provides directly actionable guidance rather than an abstract hint that must be translated back to motor commands. Experiments on LIBERO, SimplerEnv-WidowX, and real-world manipulation tasks show that action-token planning consistently improves over direct action generation, with the largest gains on long-horizon multi-stage tasks.

2606.07103 2026-06-08 cs.CL New submission

Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

Zhuo Liu, Haozheng Du, Xiangxiang Xu, Hangfeng He

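Affiliations * University of Rochester

AI summary Proposes an evaluation method with controlled content overlap, parameterized by α and built on parallel Bible translations, and finds that low-overlap models rely on content cues while high-overlap models are more robust, providing a diagnostic for separating style learning from content shortcuts.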

Comments 9 pages

Details
English abstract

Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter $α$ as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content ($α=0$) to fully shared content ($α=1$). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as $α$ increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.
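One way to make the overlap parameter concrete is sketched below. The abstract does not state the normalizer, so the sketch assumes α = 1 − I(C; S) / H(S), where C is the content identity and S the style label. This choice is an assumption; it gives α = 0 when content fully determines style and α = 1 when every content item appears in every style. The function names and toy pairs are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def mutual_information_bits(pairs):
    """Mutual information between content id and style label, from (content, style) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    content = Counter(c for c, _ in pairs)
    style = Counter(s for _, s in pairs)
    return sum(
        (k / n) * math.log2((k / n) / ((content[c] / n) * (style[s] / n)))
        for (c, s), k in joint.items()
    )


def overlap_alpha(pairs):
    """Assumed form: alpha = 1 - I(content; style) / H(style)."""
    style_entropy = entropy_bits(list(Counter(s for _, s in pairs).values()))
    return 1.0 - mutual_information_bits(pairs) / style_entropy


# Hypothetical toy corpora: verse ids as content, two style classes.
disjoint = [("v1", "style_a"), ("v2", "style_a"), ("v3", "style_b"), ("v4", "style_b")]
shared = [("v1", "style_a"), ("v1", "style_b"), ("v2", "style_a"), ("v2", "style_b")]

print(overlap_alpha(disjoint))  # no shared content
print(overlap_alpha(shared))    # fully shared content
```

Intermediate values arise when only some content items appear under both styles. In that case content identity still carries partial information about the style label.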

2606.07100 2026-06-08 cs.CV cs.RO New submission

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

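Affiliations * University of Science and Technology of China

AI summary Proposes LARA, a framework that jointly optimizes latent action models and vision-language-action models via representation alignment, using human video data to improve robot manipulation, with average gains of about 10%, 5%, and 15% on simulation and real-world benchmarks.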

Details
English abstract

Vision-language-action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAMs and VLAs are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA model constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes the LAM and the VLA via representation alignment. This enables reciprocal benefits: LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of ~10%, ~5%, and ~15% across three simulation benchmarks and one meticulously designed real-world robotic manipulation benchmark.

2606.07098 2026-06-08 cs.CL cs.LG New submission

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini

Affiliations * Department of Computer Science, Aalborg University Copenhagen; MaLGa-DIBRIS, University of Genoa; INFN, Sezione di Genova; European Organization for Nuclear Research (CERN); Ceva, Inc.

AI summary Proposes SigmaScale, which learns auxiliary scaling matrices to improve truncated-SVD compression of LLMs by lowering the effective rank of weight matrices, achieving competitive performance on Llama 3.1 8B Instruct and Qwen3-8B.

Details
English abstract

We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.
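The notion of an entropy-based effective rank mentioned in the abstract can be illustrated with a short sketch. It assumes the common definition in which singular values are normalized to sum to one and the effective rank is the exponential of their Shannon entropy. The paper's exact definition may differ, and the function name and example spectra are hypothetical.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    spectrum_entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(spectrum_entropy)


# Hypothetical spectra of a 4x4 weight matrix before and after rescaling.
flat_spectrum = [1.0, 1.0, 1.0, 1.0]      # energy spread evenly over all directions
decaying_spectrum = [10.0, 1.0, 0.1, 0.01]  # energy concentrated in one direction

print(effective_rank(flat_spectrum))
print(effective_rank(decaying_spectrum))
```

A spectrum with lower effective rank loses less energy when truncated to a few singular values. This is consistent with the correlation between effective-rank reduction and compression loss that the abstract reports.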

2606.07093 2026-06-08 cs.LG New submission

The discovery of the effects of women employment participation on the fertility of developing countries: A panel data approach

Thi Kim Ngan Nguyen

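Affiliations * Tokyo International University

AI summary Using a panel data approach, this paper divides 115 developing countries into four continental groups and finds that the effect of female labor force participation on fertility varies by region, with a significant negative association only in the Americas.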

Details
English abstract

Fertility in developing countries has declined significantly over the last few decades; at the same time, the role of women in the workplace has improved. To gain better insight into the causal effect of women's labor market participation rate on the total fertility rate in the developing world, this paper divides a dataset of 115 developing countries over the period 1991-2018 into four continental groups (Africa, North/South America, Asia/Pacific, Europe) and then applies a data-driven panel data econometric procedure to mitigate omitted-variable bias. The results suggest that the fertility behavior of women in North and South America is influenced by their career choices, whereas in other regions other factors may matter more to women when they consider having children. In conclusion, policymakers can draw on the paper to formulate policies that provide stronger incentives in reproductive decisions, and further research in the field needs to consider family policies and patrilocality in developing countries as important data.

2606.07090 2026-06-08 cs.CV New submission

Detecting Temporally Localized Manipulations in Authentic Video Streams

Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu

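Affiliations * Okan Umur Ali Emre Güşlü Ibrahim Delibasoglu

AI summary Addresses the difficulty of detecting short, realistic manipulated segments inserted into authentic videos by motivating a new dataset and evaluating two methods, a linear probe on DINOv3 features and a consecutive-frame similarity method, to establish an initial benchmark.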

Details
English abstract

The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at https://github.com/OkanUmur/temporally-localized-video-manipulation-detection.
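The consecutive-frame similarity idea can be sketched as follows. The sketch assumes per-frame feature vectors are already available (in the paper these come from DINOv3, which is not invoked here). It also uses a fixed cosine-similarity threshold, which is a simplifying assumption, since the abstract itself argues for content-adaptive thresholding. Function names and the toy features are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def boundary_frames(features, threshold):
    """Indices i where frame i differs sharply from frame i - 1."""
    return [
        i
        for i in range(1, len(features))
        if cosine_similarity(features[i - 1], features[i]) < threshold
    ]


def manipulated_intervals(boundaries):
    """Pair boundaries as (start, end), end exclusive; assumes the clip starts authentic."""
    return list(zip(boundaries[0::2], boundaries[1::2]))


# Toy features: 4 authentic frames, 2 inserted frames, then 3 authentic frames.
authentic = [1.0, 0.0, 0.1]
inserted = [0.0, 1.0, 0.1]
features = [authentic] * 4 + [inserted] * 2 + [authentic] * 3

boundaries = boundary_frames(features, threshold=0.5)
intervals = manipulated_intervals(boundaries)
print(boundaries, intervals)
```

In real footage, ordinary scene cuts would also produce similarity drops, so a fixed threshold like this one would likely need to adapt to the content of each video.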

2606.07089 2026-06-08 cs.RO New submission

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

Yinzhou Tang, Jingbo Xu, Yu Shang, Zihao Song, Chen Gao, Wei Wu, Yong Li

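Affiliations * Tsinghua University; Manifold AI

AI summary Proposes AdaWAM, which uses a lightweight dynamic router to adaptively trigger textual or visual reasoning, improving inference efficiency and performance on long-horizon, complex tasks.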

Details
English abstract

World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long-horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose AdaWAM, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Code and demos are available at: https://adawam.github.io/.

2606.07083 2026-06-08 cs.RO New submission

Predictive Style Matching: Natural and Robust Humanoid Locomotion

Simeon Nedelchev, Ekaterina Chaikovskaia, Egor Davydenko, Eduard Zaliaev, Roman Gorbachev

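Affiliations * Moscow Institute of Physics and Technology (MIPT); Innopolis University; Sber Robotics Center

AI summary Proposes Predictive Style Matching (PSM), in which an offline predictor maps the robot's lower-body state to upper-body joint and gait targets, substantially reducing style error while preserving the robustness of task-only rewards.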

Details
English abstract

Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind: task-only rewards often converge to stiff, asymmetric gaits, while motion imitation methods improve appearance but become more sensitive to external disturbances because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching, in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.

2606.07080 2026-06-08 cs.SD cs.AI eess.AS New submission

dots.tts Technical Report

Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu

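Affiliations * ByteDance

AI summary Proposes a 2B-parameter continuous autoregressive TTS foundation model that, through a multi-objective AudioVAE, full-history conditioned flow matching, and reward-free self-corrective post-training, achieves the best average performance on Seed-TTS-Eval and supports low-latency inference.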

Details
English abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.