arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.14368 2026-06-15 cs.LG cs.CL New submission

Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Woohyeon Byeon, Jiwon Jeon, Jeonghye Kim, Youngchul Sung

Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: Proposes On-Policy Co-Distillation (OPCoD), in which two models that are each strong in a different domain tutor each other through cognizance-based gating and feedback anchoring, achieving Pareto improvement.

Abstract

We study multi-domain LLM training in which two models, each stronger in a different domain, co-evolve by tutoring each other through on-policy feedback. Unlike one-way distillation or single-model fine-tuning, our goal is mutual Pareto improvement: each model improves across domains without losing its original strength. To this end, we propose On-Policy Co-Distillation (OPCoD), where each student's self-distillation is conditioned on its own correct rollout and feedback from its peer. To make feedback exchange effective, OPCoD uses cognizance-based gating to decide when to give feedback and feedback anchoring to ground feedback in the problem. On Science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students.
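
The mutual Pareto improvement goal stated in this abstract can be made concrete with a small check. The sketch below is illustrative only: the function name `is_pareto_improvement`, the domain names, and all accuracy values are hypothetical and do not come from the paper. It treats a model as Pareto-improved when no per-domain score drops and at least one rises.

```python
def is_pareto_improvement(before, after):
    """Return True if no domain score decreases and at least one increases."""
    if before.keys() != after.keys():
        raise ValueError("both score dicts must cover the same domains")
    no_loss = all(after[d] >= before[d] for d in before)
    some_gain = any(after[d] > before[d] for d in before)
    return no_loss and some_gain


# Hypothetical per-domain accuracies for two students (not from the paper).
model_a_before = {"physics": 0.70, "biology": 0.50}
model_a_after = {"physics": 0.71, "biology": 0.58}
model_b_before = {"physics": 0.48, "biology": 0.72}
model_b_after = {"physics": 0.57, "biology": 0.70}  # lost ground in its strong domain

a_improved = is_pareto_improvement(model_a_before, model_a_after)
b_improved = is_pareto_improvement(model_b_before, model_b_after)
mutual = a_improved and b_improved
```

Under this reading, the improvement is mutual only when the check passes for both students; in the made-up numbers above, the second student fails because its score in its originally strong domain drops.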

2606.14361 2026-06-15 cs.LG cs.DB New submission

SemPiper: Interactive Code Synthesis for Semantic Operators in Machine Learning Pipelines

Olga Ovcharenko, Luciano Duarte, Sebastian Schelter

Affiliations: BIFOLD & TU Berlin

AI summary: Proposes the SemPipes programming model, which combines declarative semantic operators, whose specialized implementations are synthesized by an LLM, with Python code for controllable, optimizable ML pipeline development.

Comments: Accepted at VLDB 2026 (Demonstrations track)

Abstract

Machine learning (ML) pipelines require extensive data preparation, feature engineering, and integration across heterogeneous sources, making them tedious and error-prone to develop. While large language models (LLMs) have recently shown promise for assisting programming tasks, chat-based interfaces provide limited control over pipeline behavior and often produce code that is difficult to optimize or integrate into production systems. We demonstrate SemPipes, a novel programming model that extends ML pipelines with declarative, LLM-powered semantic data operators. SemPipes allows developers to specify high-level natural language instructions for data-centric operations, while seamlessly combining these operators with arbitrary Python code from standard data science libraries. For the semantic operators, it synthesizes specialized implementations at pipeline training time, conditioned on dataset characteristics and pipeline context, enabling the flexible yet controlled integration of LLM capabilities. We demonstrate SemPipes through SemPiper, an interactive interface that visualizes computational graphs of the pipelines, synthesized operator implementations, and optimization trajectories produced by an evolutionary search procedure. Attendees can explore three end-to-end scenarios, modify pipelines, inspect generated code, and observe how semantic operators are synthesized and iteratively optimized. The demonstration highlights how declarative semantic operators enable controllable, optimizable, and practical integration of LLMs into ML pipeline development.

2606.14355 2026-06-15 cs.CV eess.SP New submission

Point Cloud Upsampling through Patch-based Frequency Superposition

Marina Ritthaler, Azhar Hussian, Vasileios Belagiannis, André Kaup

Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary: Proposes PUtPFS, an optimization-based method built on patch-based frequency superposition. It selects point subsets, estimates their surface by superposing spatial frequencies, and places new points in sparse regions for uniform upsampling. It needs no training data and surpasses existing methods in point-to-surface distance.

Journal ref: European Conference on Signal Processing (EUSIPCO) 2026

Abstract

In recent years, neural networks have become the dominant models in most point cloud upsampling methods. Although these approaches achieve good results, they have drawbacks, such as a lack of interpretability and data dependency. Moreover, they have to be trained on a dataset that is similar to the test data in order to perform well. To avoid these disadvantages, we propose Point Cloud Upsampling through Patch-based Frequency Superposition (PUtPFS), an optimization-based approach that selects subsets of points and estimates the surface of each subset by superposing spatial frequencies. Then, new points are placed on this surface. By successively selecting points in the least dense regions of the point cloud, a uniform upsampling can be reached. With this method, we surpass the current best upsampling results in the commonly considered point-to-surface distance. Furthermore, we achieve the best Chamfer and Hausdorff distances among the optimization-based approaches. As an additional advantage, our method does not need any training data and is mathematically interpretable.

2606.14354 2026-06-15 cs.LG New submission

MUFFLe: Efficient Model Update Compression via Generalized Deduplication for Federated Learning

Xiaobo Zhao, Daniel E. Lucani

Affiliations: Innovation Foundation Denmark

AI summary: Proposes MUFFLe, which integrates generalized deduplication (GD) into FedAvg and deduplicates repeated patterns in the update vector for fixed-rate, variable-count compression. On IID MNIST it reaches the 92.93% target accuracy with 38 MB of cumulative uplink communication.

Comments: Accepted at IEEE EDGE 2026 (Work-in-Progress track)

Abstract

Federated learning is well suited to edge environments but is often limited by the uplink cost of transmitting model updates. This Work-in-Progress paper presents MUFFLe, a communication-efficient update compression scheme that integrates generalized deduplication (GD) into the FedAvg pipeline. MUFFLe deduplicates repeated patterns across the update vector, yielding a fixed-rate, variable-count compression scheme. Preliminary experiments on IID MNIST with 20 clients show that MUFFLe reaches the target accuracy of 92.93% with 38 MB cumulative uplink communication, compared with 75 MB for 8-bit quantization, 86 MB for Top-$k$ sparsification, and 310 MB for uncompressed FedAvg. These results demonstrate the feasibility of applying GD to communication-efficient federated learning.
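
To give a feel for what deduplicating repeated patterns can look like, here is a toy sketch of the general idea behind generalized deduplication: each value is split into a base and a small deviation, each distinct base is stored once, and every value is then described by a base ID plus its deviation. This is not the MUFFLe scheme itself. The abstract does not say how bases and deviations are defined, so the bit-level split, the function names `gd_compress` and `gd_decompress`, and the sample data are all assumptions made for illustration.

```python
def gd_compress(values, deviation_bits=2):
    """Split each non-negative int into (base, deviation); keep each distinct base once."""
    mask = (1 << deviation_bits) - 1
    bases = []        # unique bases in first-seen order
    base_index = {}   # base -> position in `bases`
    ids, deviations = [], []
    for v in values:
        base, dev = v >> deviation_bits, v & mask
        if base not in base_index:
            base_index[base] = len(bases)
            bases.append(base)
        ids.append(base_index[base])
        deviations.append(dev)
    return bases, ids, deviations


def gd_decompress(bases, ids, deviations, deviation_bits=2):
    """Rebuild the original values losslessly from base IDs and deviations."""
    return [(bases[i] << deviation_bits) | d for i, d in zip(ids, deviations)]


# Hypothetical 8-bit quantized update vector (not data from the paper).
update = [17, 18, 16, 19, 200, 201, 17, 203]
bases, ids, devs = gd_compress(update)
restored = gd_decompress(bases, ids, devs)
```

In this toy input, eight values collapse onto two shared bases, so each element needs only a short base ID and a two-bit deviation. For scale, the figures reported in the abstract correspond to roughly 8.2 times less uplink traffic than uncompressed FedAvg (310 MB versus 38 MB).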

2606.14353 2026-06-15 cs.LG New submission

Can Deep Neural Networks Improve Compression of Very Large Scientific Data?

Muhannad Alhumaidi, Guozhong Li, Spiros Skiadopoulos, Panos Kalnis

Affiliations: King Abdullah University of Science and Technology; University of the Peloponnese

AI summary: Integrates deep learning predictors into a conventional error-bounded compression framework. Experiments on climate data show that although ML predictors improve prediction accuracy and reconstruction quality, they do not raise the overall compression ratio, because the spatial structure of the residuals affects entropy coding efficiency.

Abstract

Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. Most state-of-the-art compressors follow a prediction-residual paradigm, where compression effectiveness depends on the quality of the predictor: more accurate predictions generate smaller residuals that are easier to compress. This observation raises a question: can modern machine learning models serve as superior predictors for scientific data compression? Answering this question directly is challenging because developing compression-specific ML predictors requires substantial resources. Instead, we leverage the climate domain where highly accurate pretrained weather forecasting foundation models already exist, making them an ideal testbed. We present a framework that integrates spatial and temporal deep learning models into a conventional error-bounded compression pipeline. The framework supports auto-regressive forecasting models and avoids error accumulation. Using ERA5 climate data as a representative large-scale scientific dataset, we evaluate three distinct ML predictors: a VAEformer-based codec (CRA5), a graph neural network forecaster (GraphCast), and a vision-transformer forecaster (Aurora), against the state-of-the-art compressor SZ3.1 under identical quantization and entropy-coding backends. Our evaluation over approximately 1.7 TB of data reveals a surprising result: although ML predictors generate more accurate predictions and can improve reconstruction quality by up to 91% while achieving up to 9.6x higher compression ratios for highly predictable variables, they do not improve overall dataset-level compression ratio. We show that prediction accuracy alone is insufficient: the spatial structure of the resulting residuals plays a decisive role in entropy coding efficiency.
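
The prediction-residual paradigm described in this abstract can be sketched in a few lines. The example below is a deliberately minimal illustration, not the paper's framework: it assumes a trivial previous-value predictor and uniform quantization of the residual, and the function names and sample values are hypothetical. Production compressors and the ML forecasters evaluated in the paper use far stronger predictors and add an entropy coder after quantization. The sketch predicts from reconstructed values rather than original ones, which is one common way to keep quantization errors from accumulating; the abstract does not say which mechanism the paper uses.

```python
def compress(data, error_bound):
    """Previous-value predictor plus uniform residual quantization (toy example)."""
    step = 2.0 * error_bound
    codes = []
    prev_recon = 0.0  # the predictor only ever sees reconstructed values
    for x in data:
        prediction = prev_recon
        code = round((x - prediction) / step)  # integer symbol an entropy coder would store
        codes.append(code)
        prev_recon = prediction + code * step  # exactly what the decompressor will compute
    return codes


def decompress(codes, error_bound):
    """Replay the same predictor and add back the dequantized residuals."""
    step = 2.0 * error_bound
    recon, prev = [], 0.0
    for code in codes:
        prev = prev + code * step
        recon.append(prev)
    return recon


signal = [0.0, 0.11, 0.19, 0.32, 0.41, 0.48, 0.61]  # hypothetical samples
eb = 0.05
codes = compress(signal, eb)
restored = decompress(codes, eb)
max_err = max(abs(a - b) for a, b in zip(signal, restored))
```

A better predictor makes the integer codes cluster more tightly around zero. The abstract's finding is that this alone does not guarantee a better compression ratio, because the spatial structure of the residuals also determines how well the entropy coder performs.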

2606.14351 2026-06-15 cs.CV New submission

ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models

Dong Han, Yong Li

AI summary: To address unsafe content generated by text-to-image models, proposes optimizing a concept erasing reward with reinforcement learning and regulating text embeddings through a Safe Adapter, removing unsafe content while preserving the model's ability to render safe semantics.

Comments: Accepted to ICML 2026

Abstract

With the advance of generative AI, text-to-image (T2I) models can generate diverse content. However, T2I models can still generate unsafe content. To alleviate this issue, various concept erasing methods have been proposed. However, existing methods tend to excessively erase unsafe concepts and suppress benign concepts contained in harmful prompts, which can negatively affect model utility. In this paper, we focus on eliminating unsafe content while maintaining the model's capability to interpret safe semantic meaning by optimizing the concept erasing reward (CER) with reinforcement learning. To avoid excessive content erasure, we introduce the Safe Adapter to project partial text embeddings for efficient concept regulation in cross-attention layers. Extensive experiments conducted on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the high fidelity of benign images compared with existing state-of-the-art (SOTA) concept erasing methods. In terms of robustness, our method outperforms counterparts against red-teaming tools. Moreover, we show that the proposed approach is more effective than others in emerging image-to-image (I2I) scenarios. Lastly, we extend our method to erase general concepts, such as artistic styles and objects. Disclaimer: This paper includes discussions of sexually explicit content that may be offensive to certain readers. All images used in this work are synthesized or from public datasets.

2606.14347 2026-06-15 cs.LG New submission

When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs

Boris Marinov, Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu

Affiliations: University of Oxford; Imperial College London; University of York

AI summary: Applies causal-geometric analysis to study the linear separability of language representations and cross-lingual structural dependencies in multilingual LLMs, finding that language concepts are separable under a covariance-adjusted inner product and that languages in the same family exhibit a simplex-like geometry.

Comments: Trustworthy AI for Good (AI4Good) Workshop @ ICML 2026

Abstract

Large language models exhibit strong multilingual capabilities; however, their internal representations are difficult to interpret. Understanding how language representations interact is important for ensuring reliable behavior in multilingual systems. Recent work has shown that causal-geometric structure can explain how certain concepts are encoded as approximately linear and separable directions, but whether this framework extends to multilingual models, where language identity is correlated and hierarchical, is underexplored. We apply causal-geometric analysis to multilingual LLMs, studying 28 bilingual contrasts across three models, allowing us to analyze when languages behave as approximately independent factors and when structured dependencies persist. We find evidence that language concepts admit stable linear representations that are largely separable under a covariance-adjusted (causal) inner product, with structured deviations reflecting linguistic similarity. Moreover, languages within the same family (such as Germanic or Romance) exhibit a simplex-like geometric structure, suggesting hierarchical organization. These results extend causal-geometric interpretability to multilingual settings and provide insight into how separability and similarity may exist in multilingual LLM representations, motivating interpretability analyses that diagnose when and how structured dependencies between concepts can be anticipated. This has implications for trustworthy deployment, as residual structure between languages may lead to unintended cross-lingual effects when models are monitored or intervened upon.

2606.14346 2026-06-15 cs.LG cs.AI New submission

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Roman Denkin, Ida Akerholm, Prashant Singh, Ida-Maria Sintorn

Affiliations: Uppsala University

AI summary: Proposes the Squeeze-Release cycle, which uses an exact structural rewrite to turn a masked network into a smaller dense network, and introduces CompensatedLayerNorm to extend the rewrite to residual streams, achieving up to 39x compression.

Abstract

Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release compresses the deployable network to 39x smaller than the unpruned model on a fully connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition, we prove that the rewrite can be extended to transformer architectures.
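
The idea of turning a masked network into a smaller dense one can be illustrated on a two-layer ReLU network. The sketch below is a simplified stand-in, not the paper's minimization procedure: it only removes hidden units that are provably inert (all incoming weights and the bias are zero, or all outgoing weights are zero), and the function names and weight values are hypothetical. The abstract does not detail which structural cases the actual rewrite covers.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def forward(W1, b1, W2, b2, x):
    """y = W2 * relu(W1 * x + b1) + b2, with matrices stored as lists of rows."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)])
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]


def minimize(W1, b1, W2, b2):
    """Drop hidden units that cannot affect the output; return smaller dense tensors."""
    keep = []
    for j in range(len(W1)):
        always_zero = all(w == 0.0 for w in W1[j]) and b1[j] == 0.0  # unit outputs 0
        never_read = all(row[j] == 0.0 for row in W2)                # output ignores unit
        if not (always_zero or never_read):
            keep.append(j)
    return ([W1[j] for j in keep], [b1[j] for j in keep],
            [[row[j] for j in keep] for row in W2], b2)


# Hypothetical pruned weights: 3 inputs, 4 hidden units, 2 outputs.
W1 = [[0.5, 0.0, -1.0], [0.0, 0.0, 0.0], [1.5, 2.0, 0.0], [0.0, -0.7, 0.3]]
b1 = [0.1, 0.0, -0.2, 0.4]
W2 = [[1.0, 0.3, 0.0, -2.0], [0.0, -1.1, 0.0, 0.5]]
b2 = [0.05, -0.05]

small = minimize(W1, b1, W2, b2)
x = [0.2, -0.4, 0.9]
same_output = forward(W1, b1, W2, b2, x) == forward(*small, x)
```

Here the hidden layer shrinks from four units to two while the forward function is unchanged, which is the property that makes the pruned model smaller at deployment time rather than merely sparse.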

2606.14344 2026-06-15 cs.LG New submission

More with LESS -- Local Scene Representations for Tactile Imaging

Zohar Rimon, Elisei Shafer, Tal Tepper, Daniel Kozin, Alon Malka, Roy Holland, Aviv Tamar

Affiliations: Technion - Israel Institute of Technology; Rambam Health Care Campus

AI summary: Proposes Local Encoder for Spatial Sensing (LESS), which models the tactile scene as a grid of recurrent encoders with local receptive fields to reconstruct internal structure in 2D or 3D, with advances in generalization, uncertainty estimation, and hand-held imaging.

Comments: RSS 2026

Abstract

Tactile imaging seeks to reconstruct the internal structure of soft objects through touch sensing, with applications in medical diagnosis and robotic manipulation. Recent self-supervised learning approaches have shown promising results, but rely on global, unstructured representations and robot-controlled sensing, limiting generalization and practical use. We propose Local Encoder for Spatial Sensing (LESS), an object-centric tactile representation that exploits the local nature of touch. The tactile scene is modeled as a grid of recurrent encoders with local receptive fields, whose states are fused to reconstruct 2D or 3D images of internal structure. This compositional design enables strong generalization: models trained on single-inclusion phantoms accurately image objects with multiple inclusions and varying sizes. The local structure further supports spatial uncertainty estimation. In addition, we enable hand-held tactile imaging via external pose tracking and human-like palpation data, and extend tactile imaging to full 3D reconstruction.

2606.14334 2026-06-15 cs.LG math.DG New submission

Riemannian Metric Matching for Scalable Geometric Modeling of Distributions

Jacob Bamberger, Adam Gosztolai, Pierre Vandergheynst, Michael Bronstein, Iolo Jones

Affiliations: University of Cambridge

AI summary: Proposes Riemannian metric matching, a framework in which a neural network learns the Riemannian geometry of data. A conditional-expectation form of the carré du champ operator enables sample-wise training and constant-cost inference, running up to 400x faster at comparable or better accuracy.

Comments: ICML 2026 (Oral)

Abstract

High-dimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose Riemannian metric matching: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the carré du champ operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant-cost amortized inference without explicit kernel construction. Empirically, metric matching rivals or improves the accuracy of $k$-NN-based diffusion geometry estimators, while enabling amortized inference that is up to $400\times$ faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.
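
For readers unfamiliar with the carré du champ operator, the standard textbook definition is reproduced below, together with the limit formula that expresses it as a conditional expectation over a diffusion started at $x$. This is background material, not the paper's estimator: the symbols $L$ (the generator of the diffusion), $X_t$, and $\Delta_g$ are the usual notation and are assumptions here, since the abstract does not state the exact formulation used for training.

```latex
% Carré du champ of a diffusion generator L
\Gamma(f, g) = \tfrac{1}{2}\bigl( L(fg) - f\,Lg - g\,Lf \bigr)

% Equivalent short-time conditional expectation, with X_t the diffusion generated by L
\Gamma(f, g)(x) = \lim_{t \to 0} \frac{1}{2t}\,
  \mathbb{E}\bigl[ (f(X_t) - f(x))\,(g(X_t) - g(x)) \,\big|\, X_0 = x \bigr]

% When L is the Laplace-Beltrami operator \Delta_g of a Riemannian metric g
\Gamma(f, g) = \langle \nabla f, \nabla g \rangle_g
```

The last line is why learning $\Gamma$ gives access to Riemannian quantities: applied to coordinate functions, it recovers the inner products that define the metric.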

2606.14325 2026-06-15 cs.CL cs.AI New submission

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

Francesco Cazzaro, Jessica Lennon, Ariadna Quattoni

Affiliations: Universitat Politècnica de Catalunya

AI summary: Proposes an automatic synthetic data generation method for fine-tuning small LLMs on Text2Cypher, making locally deployed models competitive with large proprietary ones while preserving data sovereignty.

Abstract

Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained in them we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that with our synthetic data generation approach we can significantly increase the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings in which models must be locally deployed we can ensure data-sovereignty without sacrificing accuracy and without costly annotation campaigns.

2606.14324 2026-06-15 cs.SD New submission

Instantaneous Pitch Estimation via Wave-U-Net-Based Fundamental Waveform Enhancement

Junya Koguchi, Tomoki Koriyama

Affiliations: CyberAgent, Japan

AI summary: Treats fundamental waveform filtering as a speech enhancement problem: a Wave-U-Net extracts the fundamental waveform, whose instantaneous frequency is then computed, giving accurate and robust instantaneous pitch estimation.

Comments: Accepted to Interspeech 2026

Abstract

Instantaneous pitch estimation plays an important role in analyzing steep pitch variations such as speech prosody and singing techniques. Conventional approaches estimate instantaneous frequency after isolating the fundamental waveform from signals that contain harmonics and noise, which makes the accuracy sensitive to imperfect fundamental filtering. In this study, we formulate fundamental waveform filtering as a speech enhancement problem. Specifically, we train a Wave-U-Net model to extract a fundamental waveform from an input speech signal. The instantaneous pitch is then obtained by computing the instantaneous frequency from the analytic signal of the estimated fundamental waveform. Experimental results show that the proposed method outperforms conventional deterministic approaches and provides accurate and robust instantaneous pitch estimation across diverse domains, including speech, singing voice, musical instruments, and degraded speech signals.
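
The final step of this pipeline, computing instantaneous frequency from the analytic signal, can be sketched without any learned model. In the example below a clean cosine stands in for the fundamental waveform that the Wave-U-Net would output; the network itself is not modeled. The function names, the sample rate, and the 500 Hz tone are assumptions for illustration, and the naive $O(n^2)$ DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal via a naive DFT: double positive frequencies, zero negative ones."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2   # positive frequencies
        elif k > n / 2:
            spectrum[k] = 0    # negative frequencies
        # k == 0 (DC) and k == n / 2 (Nyquist) are left unchanged
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def instantaneous_frequency(x, sample_rate):
    """Frequency in Hz from the phase advance between consecutive analytic samples."""
    z = analytic_signal(x)
    return [cmath.phase(z[t + 1] * z[t].conjugate()) * sample_rate / (2 * math.pi)
            for t in range(len(z) - 1)]


sample_rate = 8000   # Hz, hypothetical
f0 = 500.0           # Hz, hypothetical fundamental
n = 256              # 16 full periods, so the tone falls exactly on a DFT bin
tone = [math.cos(2 * math.pi * f0 * t / sample_rate) for t in range(n)]
pitch = instantaneous_frequency(tone, sample_rate)
```

With a clean single-component input the estimate is constant at the tone frequency. Harmonics or noise left in the waveform would distort the phase, which is the sensitivity to imperfect fundamental filtering that the abstract describes.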

2606.14321 2026-06-15 cs.SD cs.MM New submission

MaskedFOP: Polyglot Speaker Identification under Missing Visual Modality via Cascaded Graph Label Propagation

Ayoub Elkhouzari, Youssef Iraqi, Loubna Mekouar

Affiliations: College of Computing, University Mohammed VI Polytechnic

AI summary: Proposes MaskedFOP for closed-set polyglot speaker identification when the face modality is entirely missing at test time and the speech is in an unseen language (Urdu), combining a modality-dropout dual-head network, averaging of complementary embeddings, and two-stage cascaded inference. It placed first in the POLY-SIM 2026 challenge.

Abstract

We present MaskedFOP, a system for closed-set polyglot speaker identification under two simultaneous challenges: the face modality is entirely absent at test time, and speech comes from Urdu, a language unseen during face-supervised training. The system integrates three complementary mechanisms. First, a modality-dropout dual-head network built on the Fusion and Orthogonal Projection (FOP) backbone forces the audio branch to develop independent discriminative power via per-sample face masking, ensuring that the audio encoder remains capable when face is absent. Second, two MaskedFOP instances trained on Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) features with different random seeds produce complementary audio embeddings whose element-wise average yields a more robust 512-dimensional representation than any single model. Third, a two-stage cascaded inference procedure first refines multimodal labels through a fused Graph Label Propagation (GLP) pass (Stage 1), then assigns audio-only labels by cosine nearest-centroid (Stage 2), replacing the 70 sparse training prototypes with ~1,500 in-domain test-set centroids from Stage 1. Submitted to the POLY-SIM 2026 Grand Challenge, the system achieves a mean P-accuracy of 0.9989, placing first among all submissions evaluated on the challenge server. An ablation identifies cascaded seeding as the single largest gain (>8 pp on P4/P6). The code is available at https://github.com/Ayoub-Elkhouzari/POLY-SIM2026.
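
Two of the building blocks named in this abstract, element-wise averaging of embeddings from two model instances and cosine nearest-centroid assignment, are simple enough to sketch directly. The example below is illustrative only: the function names, the three-dimensional vectors (the system uses 512 dimensions), and the speaker labels are hypothetical, and the graph label propagation stage that produces the centroids in the paper is not reproduced.

```python
import math


def average_embeddings(a, b):
    """Element-wise mean of two embeddings of equal length."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def nearest_centroid(embedding, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))


# Hypothetical audio embeddings of one utterance from two differently seeded models.
emb_seed1 = [0.9, 0.1, 0.0]
emb_seed2 = [0.7, 0.3, 0.1]
# Hypothetical per-speaker centroids.
centroids = {"speaker_a": [1.0, 0.0, 0.0], "speaker_b": [0.0, 1.0, 0.0]}

fused = average_embeddings(emb_seed1, emb_seed2)
label = nearest_centroid(fused, centroids)
```

In the described system the centroids used at this stage come from the first-stage label propagation over test data rather than from the sparse training prototypes, which the ablation identifies as the largest single gain.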

2606.14317 2026-06-15 cs.CV New submission

CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

Sihan Zhuang, Xinyuan Chen, Tianfan Xue, Yaohui Wang

Affiliations: Shanghai Artificial Intelligence Laboratory; ShanghaiTech University; The Chinese University of Hong Kong

AI summary: Proposes CausalMotion, a framework that uses a vision-language model to decompose a text prompt into causally consistent keyframes and object motion trajectories, which guide a pretrained video diffusion model as soft constraints and improve physical plausibility and temporal coherence without additional training.

Comments: Project Page: https://zhuangsh0713.github.io/CausalMotion/

Abstract

Recent advances in diffusion-based video generation have significantly improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. This limitation arises from the fact that video diffusion models primarily learn physical consistency implicitly, while vision-language models can directly model physical laws. Based on this idea, we propose CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation by leveraging a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints to guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without requiring additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.

2606.14314 2026-06-15 cs.AI New submission

Communication Policy Evolution for Proactive LLM Agents

Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

Affiliations: Shanghai Jiao Tong University; OPPO Research Institute

AI summary: Addresses the information asymmetry between users and agents by formalizing communication policies, proposing textual and UI-based policies, and introducing the self-evolution framework CPE, which improves task success through prompt refinement.

Abstract

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asymmetry for proactive agents, we set up two complementary settings, User-Agent and Planner-Executor. Experimental results reveal complementary strengths between interaction channels: text-based interaction often facilitates task performance, while structured UI improves agents' response quality and persona compliance. Motivated by that, a hybrid method combines these advantages. We further propose Communication Policy Evolution (CPE), a self-evolution framework for refining communication policies through rollout and prompt-level evolving. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.

2606.14307 2026-06-15 cs.CV New submission

Pano3D: Unified 3D Reconstruction and Panoptic Segmentation

Victor Barberteguy, Ahmet Iscen, Mathilde Caron, Alireza Fathi, Gül Varol, Cordelia Schmid

Affiliations: Google DeepMind; LIGM, École des Ponts, IP Paris, Univ Gustave Eiffel, CNRS

AI summary: Proposes a unified framework that integrates panoptic segmentation into 3D feedforward reconstruction networks. Joint training with geometric and semantic losses benefits both tasks, and the method achieves state-of-the-art performance on multiple datasets.

Comments: Project page: https://victorbbt.github.io/Pano3D/

Abstract

Recent advances in 3D feedforward reconstruction neural networks have achieved remarkable success in dense reconstruction from images without any camera parameters. Yet, equipping these models with robust semantic understanding remains an open problem. Here we introduce an approach that performs 3D reconstruction and 3D panoptic segmentation in a unified framework. We build on existing 3D reconstruction models and augment them with a set-based mask decoder. The approach is jointly trained with geometric and semantic losses, which are shown to be mutually beneficial. More precisely, the features are initialized from the geometric information and then finetuned to jointly capture geometry and semantics. We demonstrate the generality of our approach by successfully applying our framework to both online and all-to-all attention reconstruction backbones. Our method achieves state-of-the-art performance in 3D panoptic segmentation across the ScanNet, ScanNet200, and ScanNet++ datasets. Ablation studies show that such joint training of a unified model equips 3D feedforward reconstruction neural networks with panoptic segmentation and yields mutually beneficial improvements.

2606.14302 2026-06-15 cs.CL 新提交

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

回顾性进度感知的LLM智能体训练自我精炼

Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

发表机构 * Shanghai Jiao Tong University(上海交通大学) OPPO Research Institute(OPPO研究院)

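AI summary Proposes RePro, a framework that trains agents to self-generate progress signals through a forward-then-reflect rollout paradigm without continuous external supervision, improving Qwen-family models by up to 12% absolute success rate on WebShop, ALFWorld, and Sokoban.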

详情
AI中文摘要

基于强化学习训练的LLM智能体优化逐步动作预测,但缺乏对任务进度的元认知意识,导致阻碍长期扩展的差距。一项初步研究表明,在线进度提示会损害性能,而回顾性演示有帮助,但这种能力无法仅从结果奖励训练中涌现。我们提出RePro,即回顾性进度感知训练,一种通过前向-反思滚动范式训练智能体自我生成进度信号的框架:智能体在线执行动作,然后根据完成的轨迹和已知结果回顾性地重新评估其逐步进度。RePro通过回顾性热身初始化,从最少的外部演示中教授反思格式,然后通过RePro-PO进一步训练,使用复合奖励产生自我生成的信号,无需持续的外部监督。在WebShop、ALFWorld和Sokoban上的实验表明,RePro提升了Qwen系列的性能,绝对成功率提升高达12%。

英文摘要

LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family's performance, with up to $12\%$ absolute success rate gains.

2606.14299 2026-06-15 cs.CV cs.LG 新提交

What Drives Test-Time Adaptation for CLIP? A Controlled Empirical Study from an Update Perspective

什么驱动了CLIP的测试时适应?从更新视角进行的受控实证研究

Jiazhen Huang, Xiao Chen, Zhiming Liu, Yaru Sun, Jingyan Jiang, Zhi Wang

发表机构 * Tsinghua University(清华大学) Shenzhen Technology University(深圳技术大学)

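AI summary Through a controlled empirical study from an update perspective, analyzes what drives test-time adaptation methods for CLIP, showing that adaptation gains come mainly from test-time evidence and reliable proxies rather than heavy optimization, and that no single paradigm is universally optimal.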

详情
AI中文摘要

视觉语言模型(如CLIP)已成为开放词汇识别的标准骨干,但其零样本预测在部署时仍易受分布偏移影响。测试时适应(TTA)最近被扩展到CLIP作为轻量级解决方案,导致TTA4CLIP方法迅速增长。然而,该领域的实证进展在很大程度上超过了我们对真正驱动适应因素、其增益来源以及哪些偏移下保持可靠的理解。本文从追求最先进准确率中退一步,对TTA4CLIP进行了系统性的受控研究。我们首先根据测试时更新的内容,将现有方法组织为三个统一范式。然后,我们引入TTABC,一个开源的CLIP TTA基准,它标准化了评估协议并集成了20多种代表性方法。我们的受控实证分析集中在三个关键领域。首先,我们确定了基于参数方法的驱动因素,揭示适应增益主要由测试时证据和可靠代理驱动,而非繁重优化。其次,我们探索了超越繁重参数调整的证据利用,表明通过跨样本或当前样本证据以及轻量级原型更新可以实现竞争性和高效的性能。最后,我们证明TTA没有银弹:没有单一的适应范式普遍最优,首选范式取决于偏移的性质。我们希望我们的基准和研究能提供对当前TTA4CLIP格局的更清晰理解,并为进一步研究奠定基础。

英文摘要

Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where their gains originate, and under which shifts they remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.

2606.14297 2026-06-15 cs.CV cs.AI 新提交

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Pix2Pix-Hybrid: 结构引导的多通道条件与弱属性监督的朝觐人群图像条件合成

Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi

发表机构 * King Abdulaziz University(阿卜杜勒阿齐兹国王大学) Jouf University(焦夫大学)

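AI summary Proposes Pix2Pix-Hybrid, a conditional GAN that synthesizes Hajj crowd images from multi-channel structural cues and contextual attributes for data augmentation, improving synthesis quality while reducing manual labeling, and shows that the selected synthetic data lowers MAE for crowd-counting models.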

详情
AI中文摘要

开发准确的朝觐场景人群计数模型仍然具有挑战性,因为领域特定的标注图像稀缺,且大型集会期间的数据收集引发隐私问题。为解决这些限制,本文提出Pix2Pix-Hybrid (P2P-H),一种用于结构引导的朝觐人群图像合成和数据增强的混合条件GAN。P2P-H基于Pix2Pix,采用U-Net生成器,以八个输入通道为条件,这些通道联合编码结构线索(边缘和灰度)和上下文属性(人群密度和一天中的时间)。为了捕捉密集场景中的详细纹理,该框架集成了两个在不同分辨率下运行的多尺度PatchGAN判别器。训练过程结合了对抗、感知和特征匹配目标,并采用自适应数据增强和稳定化策略。该模型在从60个公开视频源收集的993个真实朝觐帧上训练,条件属性自动推导以减少人工标注工作量。利用该框架,我们构建了CrowdH,一个包含10,000张高分辨率朝觐人群图像的合成数据集。实验结果表明,与Pix2Pix和StyleGAN2-ADA基线相比,P2P-H提高了结构保持的条件合成质量,并显示出对其他人群数据集的良好迁移性。为了评估下游实用性,我们进一步构建了CrowdH-Mix-469,一个包含384张真实朝觐图像和85张精选合成图像的标注混合真实-合成数据集,并在仅真实和真实加合成训练下评估了五个计数模型。精选的合成数据在所有五个模型上均降低了MAE,其中CSRNet的提升最为显著。

英文摘要

Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

2606.14292 2026-06-15 cs.CV 新提交

A Robust Point Cloud Analysis Framework Inspired By Primary Visual Cortex

一种受初级视觉皮层启发的鲁棒点云分析框架

Jisheng Dang, Dengyue Pan, Delin Deng, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

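AI summary Inspired by the primary visual cortex, proposes the DC-CCNN++ framework, which replaces MLPs with brain-inspired neural networks and adds the NRMR module and the CPVT training strategy to improve robustness in point cloud classification and part segmentation, with performance comparable to state-of-the-art methods.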

Comments 12 pages, 2 figures, 7 tables

详情
AI中文摘要

尽管点云分析取得了显著进展,但降低能耗和提高鲁棒性仍然研究不足,这主要归因于卷积神经网络(CNN)的固有局限性。为解决这一问题,我们从初级视觉皮层中汲取灵感,提出了一种树突连接连续耦合神经网络(DC-CCNN),这是一种用于点云分析的新型类脑神经网络(BINN)架构。通过结合离散和连续编码,我们的设计用更高效、更鲁棒的BINN取代了传统的多层感知机(MLP)。在此框架基础上,我们进一步提出了扩展模型DC-CCNN++,以提高在复杂损坏条件下的鲁棒性。具体来说,我们引入了一个神经启发的鲁棒调制与读出模块(NRMR),通过全局上下文增益调制和双码证据整合来增强特征稳定性和决策鲁棒性。我们还设计了一种皮层启发的渐进变异性训练(CPVT)策略,该策略在训练过程中逐步将模型暴露于结构化的环境变异性,同时保持稳定的干净样本锚点。实验结果表明,DC-CCNN++在点云分析上提升了类脑网络的性能,同时保持了与最先进方法相当的性能。与原始DC-CCNN相比,它在分类和部分分割上均取得了更强的结果,并且对稀疏性、遮挡、高斯噪声、椒盐噪声和空间变换表现出增强的鲁棒性。凭借其高效性、鲁棒性和生物学基础的设计,DC-CCNN++为点云分析提供了一种有前景的传统深度学习方法替代方案。代码可在 https://anonymous.4open.science/r/DC-CCNNpp-44E3 获取。

英文摘要

Despite significant advancements in point cloud analysis, reducing energy consumption and improving robustness remain understudied, largely due to the inherent limitations of Convolutional Neural Networks (CNNs). To address this issue, we draw inspiration from the primary visual cortex and propose a Dendritic-Connected Continuous-Coupled Neural Network (DC-CCNN), a novel Brain-Inspired Neural Network (BINN) architecture for point cloud analysis. By combining discrete and continuous encoding, our design replaces traditional Multilayer Perceptrons (MLPs) with more efficient and robust BINNs. Building upon this framework, we further propose an extended model, DC-CCNN++, to improve robustness under complex corruption conditions. Specifically, we introduce a Neuro-Inspired Robust Modulation-and-Readout Module (NRMR) to enhance feature stability and decision robustness through global-context gain modulation and dual-code evidence integration. We also design a Cortically Inspired Progressive Variability Training (CPVT) strategy, which progressively exposes the model to structured environmental variability while preserving stable clean-sample anchors during training. Experimental results show that DC-CCNN++ improves the performance of brain-inspired networks on point cloud analysis while maintaining performance comparable to state-of-the-art methods. Compared with the original DC-CCNN, it achieves stronger results on both classification and part segmentation, and exhibits enhanced robustness against sparsity, occlusion, Gaussian noise, salt-and-pepper noise, and spatial transformations. With its efficiency, robustness, and biologically grounded design, DC-CCNN++ provides a promising alternative to traditional deep learning methods for point cloud analysis. Code is available at https://anonymous.4open.science/r/DC-CCNNpp-44E3.

2606.14284 2026-06-15 cs.LG cs.AI 新提交

Hierarchical ODE: Learning Continuous-Time Physical Prototypes for Early Link Failure Detection

层次化常微分方程:学习连续时间物理原型用于早期链路故障检测

Jiaen Lv, Leran Qi, Shaowei Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

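AI summary Proposes a hierarchical ODE clustering network that uses neural ODEs to model continuous latent state evolution, disentangles stochastic noise from smooth trends, adaptively determines the number of prototypes, and extracts physical prototypes for early link failure detection on irregularly sampled time series.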

Comments International Conference on Machine Learning 2026

详情
AI中文摘要

时间序列原型学习从根本上受到观测模糊性的挑战。离散架构无法解决这一问题,因为它们缺乏将随机噪声与连续动态解耦的能力。此外,僵化的封闭集假设无法捕捉未见过的多样性。为了解决这些局限性,我们提出了一种层次化常微分方程聚类网络,该网络利用神经常微分方程将潜状态演化建模为连续积分曲线。这种形式强制时间连续性,从而有效将平滑特征趋势与随机噪声分离,同时我们的自适应层次机制无需严格的先验约束即可自主确定合适的原型数量。在具有不规则采样时间序列的早期链路故障检测任务上验证,所提方法有效提取了底层物理原型,从而实现了鲁棒的故障检测。我们的代码可在 https://github.com/NJ-LNN/Hierarchical-ODE 获取。

英文摘要

Time series prototype learning is fundamentally challenged by observational ambiguity. Discrete architectures fail to resolve this, as they lack the capacity to decouple stochastic noise from continuous dynamics. Furthermore, rigid closed-set assumptions fail to capture unseen diversity. To address these limitations, we propose a hierarchical ordinary differential equation clustering network, which utilizes neural ordinary differential equations to model latent state evolution as a continuous integral curve. This formulation enforces temporal continuity to effectively disentangle smooth feature trends from stochastic noise, while our adaptive hierarchical mechanism autonomously determines the appropriate number of prototypes without rigid prior constraints. Validated on the early link failure detection task with irregularly sampled time series, the proposed method effectively extracts underlying physical prototypes, thereby enabling robust failure detection. Our code is available at https://github.com/NJ-LNN/Hierarchical-ODE.
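The idea of modeling latent state evolution as a continuous integral curve over irregularly sampled timestamps can be sketched as below. This is a minimal illustration, not the authors' implementation: the vector field `latent_dynamics`, the weight matrix `W`, the explicit Euler solver, and the `max_step` parameter are all assumptions chosen for brevity.

```python
import numpy as np

def latent_dynamics(z, W):
    """Hypothetical stand-in for a learned vector field f(z): one tanh layer."""
    return np.tanh(W @ z)

def integrate_latent(z0, times, W, max_step=0.01):
    """Integrate dz/dt = f(z) with explicit Euler and return z at each timestamp.

    `times` may be irregularly spaced. Each gap is split into sub-steps no
    larger than `max_step`, so the state is defined between observations.
    """
    z = np.asarray(z0, dtype=float)
    states = [z.copy()]
    for t_prev, t_next in zip(times[:-1], times[1:]):
        gap = t_next - t_prev
        n_sub = max(1, int(np.ceil(gap / max_step)))
        h = gap / n_sub
        for _ in range(n_sub):
            z = z + h * latent_dynamics(z, W)
        states.append(z.copy())
    return np.stack(states)

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((4, 4))
times = np.array([0.0, 0.13, 0.2, 0.95, 1.0])  # irregular sampling
trajectory = integrate_latent(rng.standard_normal(4), times, W)
```

Because the state is propagated by a differential equation rather than a per-step update, the gap between two observations only changes how long the solver integrates, not the model itself.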

2606.14283 2026-06-15 cs.LG cs.AI 新提交

DIFF-ERO: A Conformance-Aware Loss for Deep Learning in Process Mining

DIFF-ERO:一种面向过程挖掘的深度学习一致性感知损失函数

Johannes De Smedt, Jari Peeperkorn, Artem Polyvyanyy, Jochen De Weerdt

发表机构 * KU Leuven(鲁汶大学) The University of Melbourne(墨尔本大学) Information Systems Engineering Research Group (LIRIS), KU Leuven(鲁汶大学信息系统工程研究组(LIRIS))

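AI summary Proposes DIFF-ERO, a differentiable stochastic conformance loss that builds batch-level stochastic transition matrices with soft edge memberships to inject control-flow information during training, improving the structural prediction performance of deep learning models on process data.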

Comments Accepted at the 24th International Conference on Business Process Management

详情
AI中文摘要

深度学习推动了过程分析领域的许多最新进展,尤其是在预测性和规范性监控方面。然而,诸如交叉熵之类的标准目标函数优化的是局部下一步似然,仅隐式地捕获控制流结构。因此,模型在实现高令牌级准确率的同时,可能允许不精确的全局行为。我们提出了DIFF-ERO,一种用于过程数据深度学习模型的一致性感知损失函数。DIFF-ERO是基于熵的随机一致性的可微形式,在训练过程中融入控制流信息。我们的方法构建了具有软边成员资格的批次级随机转移矩阵,使得结构精度和召回率信号能够直接指导反向传播。该损失函数是模型无关的,只要最终表示参数化随机转移,就可以应用。我们将DIFF-ERO实例化到用于下一活动预测的Transformer编码器-解码器流水线中,并与交叉熵联合使用,分析其理论组件在收敛方面的表现。在比较其他损失函数和目标的基准测试中,DIFF-ERO在结构至关重要的地方显示出改进的预测性能,同时在其它地方保持同等水平。同时,学习到的随机自动机向结构真实值收敛,表明网络内化了过程模型结构。

英文摘要

Deep learning has driven many recent advances in process analytics, especially for predictive and prescriptive monitoring. However, standard objectives such as cross-entropy optimize local next-step likelihoods and only implicitly capture control-flow structure. As a result, models can achieve high token-level accuracy while permitting imprecise global behaviour. We introduce DIFF-ERO, a conformance-aware loss function for deep learning models on process data. DIFF-ERO is a differentiable formulation of entropy-based stochastic conformance that incorporates control-flow information during training. Our approach constructs batch-level stochastic transition matrices with soft edge memberships, allowing structural precision and recall signals to directly inform backpropagation. The loss is model-agnostic and can be applied whenever the final representation parametrizes stochastic transitions. We instantiate DIFF-ERO in transformer encoder-decoder pipelines for next-activity prediction and use it jointly with cross-entropy to analyse its theoretical components with respect to convergence. Across benchmarks comparing other loss functions and targets, DIFF-ERO shows improved predictive performance where structure matters most while maintaining parity elsewhere. At the same time, the learned stochastic automaton converges towards the structural ground truth, indicating that the network internalizes process model structure.
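One way to picture a batch-level stochastic transition matrix with soft edge memberships is to accumulate predicted next-activity probabilities per current activity and then normalize each row. The sketch below assumes that reading; the function name, the argument layout, and the row normalization are illustrative choices, not the construction defined in the paper.

```python
import numpy as np

def soft_transition_matrix(current_ids, next_probs, n_activities, eps=1e-12):
    """Accumulate predicted next-activity probabilities per current activity.

    current_ids: (N,) integer id of the activity at each prefix position.
    next_probs:  (N, A) predicted distribution over the next activity.
    Returns an (A, A) matrix whose rows sum to 1 for every activity that
    occurs in the batch; rows for unseen activities stay zero.
    """
    counts = np.zeros((n_activities, n_activities))
    # Soft edge mass: add whole probability vectors instead of hard counts.
    np.add.at(counts, current_ids, next_probs)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > eps)

current_ids = np.array([0, 0, 1])
next_probs = np.array([[0.1, 0.9, 0.0],
                       [0.3, 0.7, 0.0],
                       [0.0, 0.2, 0.8]])
T = soft_transition_matrix(current_ids, next_probs, n_activities=3)
```

Every operation here is differentiable with respect to `next_probs`, which is what would let a structural score computed from `T` feed gradients back into the model.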

2606.14278 2026-06-15 cs.CL 新提交

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

法官偏爱英语吗?评估LLM作为法官时的语言切换不变性

Shaojie Yin

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

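AI summary Proposes the Judge-LS protocol, which uses language-switched variants to assess the reliability of LLMs as automatic judges; Chinese and language-switched presentations cause 10.7-14.4% preference flips relative to English, but translation-equivalent tie probes show no systematic English preference.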

详情
AI中文摘要

大型语言模型(LLM)现在被广泛用作开放式指令遵循评估的自动法官。这种做法方便、可扩展,且通常比基于参考的指标更具语义意识,但它也引入了一个新的可靠性问题:法官评估的是答案的质量,还是也对比较呈现的语言做出反应?我们提出了Judge-LS,一个轻量级元评估协议,将LLMBar响应对项目转换为英语、中文和中英文语言切换变体。一个可靠的法官应在保持标签的语言转换下保持其偏好,并且当两个答案翻译等价时不应偏好某种语言。我们在完整的419项LLMBar基准上评估了四个API可访问的法官,产生了13,408个成功的成对判断。跨模型,中文和语言切换呈现相对于英语导致10.7-14.4%的偏好翻转,所有法官在英语中达到最高准确率。然而,翻译等价的平局探测并未揭示系统的英语偏好:大多数探测被判断为平局,非平局决策更常偏向中文。我们添加了置信区间、配对显著性检验和自动转换审计,并进行了敏感性分析,排除了机械标记的高风险变体。该实验不需要模型训练,仅使用API调用,并且在适度的本地硬件上可行。

英文摘要

Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7--14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
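A preference flip rate of the kind reported above can be computed by comparing a judge's verdicts on the same items under two presentations. The sketch below is a hypothetical illustration: the label scheme ("A", "B", "tie") and the dictionary layout are assumptions, not the Judge-LS data format.

```python
def flip_rate(reference, variant):
    """Fraction of items whose preferred response differs between two presentations.

    Both arguments map item id -> preferred label ("A", "B", or "tie").
    Only items judged successfully in both presentations are compared.
    """
    shared = reference.keys() & variant.keys()
    if not shared:
        raise ValueError("no items in common")
    flips = sum(reference[i] != variant[i] for i in shared)
    return flips / len(shared)

english = {1: "A", 2: "B", 3: "A", 4: "B"}
chinese = {1: "A", 2: "A", 3: "A", 4: "B"}
rate = flip_rate(english, chinese)  # item 2 changes, so one of four flips
```

Restricting the comparison to shared items keeps failed API calls from being counted as flips.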

2606.14277 2026-06-15 cs.CV 新提交

One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

一层的垃圾是另一层的宝藏:LVLMs中自适应逐层视觉标记选择

Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan, Ye Ren, Jilin Hu

发表机构 * Hikvision Research Institute(海康威视研究院) Peking University(北京大学) East China Normal University(华东师范大学)

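AI summary Proposes Adaptive Layer-wise Visual Token Selection (ALVTS), in which a lightweight selector routes important tokens through each layer for efficient compression, retaining 96.7% of the original accuracy at an 89% token compression ratio.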

Comments Accepted by CVPR 2026 (highlight)

详情
AI中文摘要

大型视觉语言模型(LVLMs)在多样化的多模态任务中取得了显著成功,但其实际部署仍受限于长视觉标记带来的计算负担。虽然视觉标记剪枝已成为一种有前景的解决方案,但现有方法存在一个根本性局限:一旦标记在特定层被剪枝,它们将无法被所有后续层访问,导致过早的信息丢失,从而损害模型性能。通过实证研究,我们观察到不同层表现出不同的视觉区域关注点,表明各层存在不同的最优标记子集。受此启发,我们提出自适应逐层视觉标记选择(ALVTS),这是一种突破传统静态标记剪枝范式的新框架。ALVTS包含一个轻量级标记选择器,用于识别重要标记并将其路由到后续处理,同时允许较不重要的标记跳过该层,从而最小化计算冗余。这两类标记在输入后续层之前无缝重新整合,促进整个模型的自适应压缩。基于我们提出的重要性一致性约束低秩近似,所提出的标记选择模块紧密模拟了完整注意力机制,有效捕捉其关键模式,而无需模型重新训练。在LLaVA-1.5、LLaVA-NeXT和Qwen2.5-VL上的大量实验验证了我们方法的有效性。在89%的标记压缩率下,ALVTS保留了原始模型96.7%的准确率,实现了LVLM推理中优越的效率-准确率权衡。

英文摘要

Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
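The contrast with static pruning can be sketched as follows: selected tokens are processed by the layer, the rest skip it, and both streams are merged back in the original order so that later layers can still pick a different subset. The scoring input, `keep_ratio`, and `layer_fn` are placeholders; the paper's selector is a learned low-rank module that is not reproduced here.

```python
import numpy as np

def route_through_layer(tokens, scores, layer_fn, keep_ratio):
    """Apply `layer_fn` to the top-scoring tokens; the others skip the layer.

    tokens: (N, D) visual tokens entering the layer.
    scores: (N,) importance scores (hypothetical input for this sketch).
    Returns an (N, D) array in the original token order, so no token is
    permanently discarded.
    """
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.argsort(scores)[-n_keep:]   # indices of the selected tokens
    out = tokens.copy()                   # skipped tokens pass through unchanged
    out[keep] = layer_fn(tokens[keep])    # selected tokens are processed
    return out

tokens = np.arange(8, dtype=float).reshape(4, 2)
scores = np.array([0.1, 0.9, 0.2, 0.8])
routed = route_through_layer(tokens, scores, lambda x: x + 100.0, keep_ratio=0.5)
```

In this toy run, tokens 1 and 3 are processed while tokens 0 and 2 are carried forward untouched and remain available to the next layer.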

2606.14270 2026-06-15 cs.RO cs.AI 新提交

Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning

无臂双轮足机器人的鲁棒摔倒恢复:基于力引导的学习方法

Haidong Hou, Zhangguo Yu, Tao Han, Hengbo Qi, Khaleel Ghazal, Yu Zhang, Yidong Du, Xuechao Chen, Fei Meng

发表机构 * Beijing Institute of Technology(北京理工大学)

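AI summary For armless bipedal-wheeled robots, which have no arms or extra legs to push against when standing back up, proposes FTSR, a force-guided teacher-student framework that uses constrained reinforcement learning to gradually reduce reliance on an auxiliary external force, achieving robust recovery from a fall to stable locomotion.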

Comments 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)

详情
Journal ref
IEEE Robotics and Automation Letters, 2026
AI中文摘要

摔倒恢复对于自主腿式运动至关重要。现有方法已证明,某些腿式机器人(如人形机器人和四足机器人)能够通过利用手臂或协调多腿产生支撑力,从各种姿态恢复。没有手臂或其他腿提供支撑辅助,双轮足机器人必须完全依赖其腿部的驱动,这使得恢复特别困难。为解决这一问题,我们引入了FTSR(力引导的教师-学生框架与阶段奖励)。力引导方法在模拟训练期间构建一个与机器人实时高度直接相关的外部辅助力,明确地将该力公式化为可优化约束。通过约束强化学习,策略被引导逐步减少力依赖并增加身体高度,尽管没有手臂支撑,仍能发展内部恢复策略。高度渐进式阶段奖励在恢复过程中逐步构建姿态稳定,并过渡到持续运动,与教师-学生架构集成,蒸馏出力效应和恢复动态的特权知识。经过模拟训练,该策略被部署在物理无臂双轮足机器人上并进行了广泛评估。实验证实了在多种挑战性条件下鲁棒可靠的摔倒恢复,展示了强大的环境适应性和运动鲁棒性,同时保持恢复后的完整运动能力。该框架也有效泛化到高自由度人形机器人,证实了其实用泛化性。项目页面:https://2350575870.github.io/force-guided.github.io/

英文摘要

Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by utilizing arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward gradually reducing force dependency and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at https://2350575870.github.io/force-guided.github.io/

2606.14267 2026-06-15 cs.RO 新提交

FloVerse: Floor Plan-Guided Multi-Modal Navigation

FloVerse:基于楼层平面图的多模态导航

Weiqi Huang, Shuangyi Dong, Jiaxin Li, Yifei Guo, Zan Wang, Wei Liang

发表机构 * School of Computer Science & Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院)

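AI summary Proposes the FloVerse task, which unifies PointNav, ObjectNav, and ImageNav, builds the FloVerse-1.6K dataset, and designs ThreeDiff, a two-stage imitation learning policy that uses floor-plan priors to improve navigation performance.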

Comments Accepted at CVPR 2026

详情
AI中文摘要

楼层平面图包含了紧凑的空间先验信息,使智能体能够更高效地导航未见过的场景。虽然先前的工作已经探索了基于楼层平面图的导航,但主要集中在PointNav和有限的环境上。为了弥补这一差距,我们引入了FloVerse,一个基于楼层平面图的具身导航新任务,统一了PointNav、ObjectNav和ImageNav。为了支持FloVerse,我们构建了FloVerse-1.6K,一个大规模数据集,包含来自HM3D和Gibson 4+的1600个场景及其对应的楼层平面图,共计24万条专家轨迹和1200万帧RGBD图像。我们进一步提出了ThreeDiff,一种两阶段模仿学习策略,包括一个规划器、一个基于扩散的多模态目标推理模块(通过掩码模态建模训练)和一个精炼器(基于深度的轨迹精炼模块,用于安全执行)。大量实验表明:(1) 楼层平面图先验提高了所有目标模态的导航性能;(2) ThreeDiff隐式地从楼层平面图中捕获空间信息。这些结果强调了空间先验的有效性,并验证了我们提出的基于楼层平面图的具身导航统一方法的有效性。

英文摘要

Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan-guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan-guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy comprising a planner, a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling, and a refiner, a depth-based trajectory-refinement module for safe execution. Extensive experiments demonstrate that (1) floor-plan priors improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly captures spatial information from floor plans. These results underscore the effectiveness of spatial priors and validate our proposed unified approach for floor plan-guided embodied navigation.

2606.14259 2026-06-15 cs.LG 新提交

Beyond a Single Explanation of the Adam--SGD Gap

超越对Adam与SGD差距的单一解释

Chenxiang Zhang, Rustem Islamov, Enea Monzio Compagnoni, Jun Pang, Aurelien Lucchi, Antonio Orvieto

发表机构 * University of Luxembourg(卢森堡大学) University of Basel(巴塞尔大学) MPI for Intelligent Systems(马克斯·普朗克智能系统研究所) ELLIS Tübingen(ELLIS蒂宾根) Tübingen AI Center(蒂宾根人工智能中心)

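AI summary Through controlled experiments across vision, language, genomics, and graph tasks, finds that the performance gap between Adam and SGD stems from nontrivial interactions between data and architecture rather than a single factor, and observes a crossover batch size at which the relative advantage shifts from SGD to Adam.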

Comments preprint

详情
AI中文摘要

先前的工作已经确定了几个可能导致Adam和SGD之间性能差距的因素,涵盖数据方面、架构设计和优化属性。然而,这些解释通常是孤立研究的,它们的相对重要性尚不清楚。在这项工作中,我们通过跨视觉、语言、基因组和图任务、涵盖现代和经典架构以及精心设计的训练设置的受控实证研究重新审视了这些假设。我们的结果表明,没有单一因素能够一致地解释Adam-SGD差距。例如,Adam的优势可以(1)在均匀词汇分布下持续存在,但在重尾分布下几乎消失;(2)在softmax注意力模型中逆转,有利于SGD;(3)在软架构修改下变得更大,例如当ReLU被GeLU非线性替换时。这表明差距源于非平凡的数据和架构交互,而不是单一的共同因素。然而,我们在我们的设置中观察到一个模式:一个\emph{交叉批量大小},在该大小下,随着批量大小的缩放,相对优势从SGD转移到Adam。这些实证结果被我们的理论差距模型所捕获,该模型预测了这种依赖于批量大小的交叉。我们的视角有助于调和几个现有的假设,同时提供跨领域的实际见解。

英文摘要

Prior work has identified several factors that can contribute to the performance gap between Adam and SGD, spanning data aspects, architecture design, and optimization properties. Yet these explanations are often studied in isolation, leaving their relative importance unclear. In this work, we revisit these hypotheses through a controlled empirical study across vision, language, genomics, and graph tasks, spanning modern and classical architectures, and carefully designed training setups. Our results suggest that no single factor consistently explains the Adam--SGD gap. For instance, the Adam advantage can (1) persist under a uniform vocabulary distribution yet nearly disappear under a heavy-tailed one; (2) reverse in favor of SGD in softmax-attention models; and (3) become larger under soft architectural modifications, e.g., when ReLU is replaced by a GeLU nonlinearity. This suggests that the gap arises from nontrivial data and architecture interactions, rather than from a single common factor. Yet, we observe a pattern across our settings: a \emph{crossover batch size} at which the relative advantage shifts from SGD to Adam as the batch size scales. These empirical results are captured by our theoretical gap model, which predicts this batch-size-dependent crossover. Our perspective helps reconcile several existing hypotheses while offering practical insights across domains.
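For readers who want the two update rules being compared side by side, the sketch below gives the standard textbook forms of an SGD step and an Adam step. The hyperparameter defaults are the commonly used ones and are not taken from the paper; the paper's theoretical gap model is not reproduced here.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Plain SGD: the step is proportional to the raw gradient."""
    return w - lr * grad

def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: the step is a bias-corrected momentum divided by an RMS estimate."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w0 = np.zeros(2)
grad = np.array([100.0, -0.01])  # coordinates with very different scales
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
w_sgd = sgd_step(w0, grad, lr=0.1)
w_adam = adam_step(w0, grad, state, lr=0.1)
```

On the first step Adam moves each coordinate by roughly `lr` regardless of gradient magnitude, while SGD moves each coordinate in proportion to its gradient. That per-coordinate normalization is one reason the two optimizers can respond differently to data and architecture choices.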

2606.14255 2026-06-15 cs.RO 新提交

ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation

ReactVLA: 通过改进的平均流动作生成实现快速轻量级反应式机器人操作

Yanzhao Guo, Wenkai Chen, Jianwei Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Technical Aspects of Multimodal Systems (TAMS), Department of Informatics, Universität Hamburg(汉堡大学信息学系多模态系统技术方面(TAMS))

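AI summary Proposes the ReactVLA framework, which combines an improved Mean Flow action generator with an Attention Residuals mechanism for lightweight, low-latency real-time robot manipulation, achieving up to a 1.65x improvement in task performance and more than a 4x increase in inference speed over leading VLA models.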

详情
AI中文摘要

基于扩散的视觉-语言-动作(VLA)策略在建模表达性和多模态动作分布方面表现出强大的能力。然而,它们对迭代采样的依赖引入了显著的推理延迟,限制了其在反应式闭环机器人操作中的应用。为了解决这一限制,我们提出了\texttt{ReactVLA},一个用于实时机器人操作的轻量级低延迟VLA框架。\texttt{ReactVLA}结合了两种互补设计:(1)改进的平均流(iMF)动作生成器,将昂贵的多步扩散采样减少到一步到几步的动作生成;(2)注意力残差(AttnRes),一种动态的深度特征路由机制,取代均匀残差累积,以更好地保留任务相关的多模态表示。我们在大规模模拟基准(包括LIBERO和RoboIMI)以及真实世界机器人操作任务上评估了\texttt{ReactVLA}。实验结果表明,\texttt{ReactVLA}始终优于同等规模的VLA基线,包括SmolVLA和$\pi_0$。在具有挑战性的精密操作任务中,与领先的VLA模型相比,\texttt{ReactVLA}在任务性能上实现了高达1.65倍的提升,同时推理速度提高了4倍以上。最后,它将真实世界策略延迟降低到38.6毫秒以下,从而在物理机器人平台上实现快速反应控制。请访问我们的项目网站:https://game-loader.github.io/ReactVLA/

英文摘要

Diffusion-based Vision-Language-Action (VLA) policies have demonstrated strong capability in modeling expressive and multimodal action distributions. However, their reliance on iterative sampling introduces substantial inference latency, which limits their applicability to reactive closed-loop robot manipulation. To address this limitation, we propose \texttt{ReactVLA}, a lightweight and low-latency VLA framework for real-time robotic manipulation. \texttt{ReactVLA} combines two complementary designs: (1) an improved Mean Flow (iMF) action generator that reduces expensive multi-step diffusion sampling to one-to-few-step action generation, and (2) Attention Residuals (AttnRes), a dynamic depth-wise feature routing mechanism that replaces uniform residual accumulation to better preserve task-relevant multimodal representations. We evaluate \texttt{ReactVLA} on large-scale simulation benchmarks, including LIBERO and RoboIMI, as well as real-world robotic manipulation tasks. Experimental results show that \texttt{ReactVLA} consistently outperforms similarly sized VLA baselines, including SmolVLA and $\pi_0$. On challenging precision manipulation tasks, \texttt{ReactVLA} achieves up to a 1.65$\times$ improvement in task performance while providing more than a 4$\times$ increase in inference speed compared with leading VLA models. Finally, it reduces real-world policy latency to below 38.6 ms, enabling fast reactive control on physical robot platforms. Please check out our project website at: https://game-loader.github.io/ReactVLA/.

2606.14252 2026-06-15 cs.RO 新提交

Optimality-Preserving Decomposition for Scalable QAOA in Natural-Language-Guided Multi-Drone Assignment

面向自然语言引导的多无人机分配中可扩展QAOA的最优性保持分解

Junyeop Bang, Byongho Lee, Dohyun An, Hwangnam Kim

发表机构 * Korea University(高丽大学)

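AI summary Proposes an end-to-end framework that integrates a fine-tuned large language model with a quantum-classical backend and uses constraint-preserving graph partitioning and a dynamic programming merge to scale quantum optimization for natural-language-guided multi-drone task assignment.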

Comments 10 pages, 2 figures, 3 tables, preprint

详情
AI中文摘要

随着多无人机机群的扩展,区域分配迅速演变为一个难以处理的NP-hard组合问题,使经典穷举搜索不堪重负。虽然量子优化有望打破这些经典瓶颈,但将人类意图中的复杂空间任务映射到受限的量子硬件上仍然是一个严峻挑战。为弥合这一差距,我们提出了一个端到端框架,集成了微调的大语言模型前端和高度可扩展的领域特定量子-经典后端。前端利用监督微调和直接偏好优化,将自由形式的自然语言指令转换为结构稳健的二次无约束二元优化约束,且无假阴性。为克服近期量子设备的严格量子比特限制,我们的框架采用了一种新颖的约束保持图分割器和基于压缩分隔符的动态规划合并。通过W态初始化和XY混频器在条件风险价值量子近似优化中结构性地编码约束,流水线保持高度紧凑。实验结果表明,该架构规避了经典扩展墙,在100%的理想化预言机案例和96.3%的实际QAOA采样下恢复了全局最优解,使得自然语言引导的任务分配在以前难以处理的规模上成为可能。

英文摘要

As multi-drone fleets scale, zone assignment rapidly evolves into an intractable NP-hard combinatorial problem that overwhelms classical exhaustive search. While quantum optimization promises to shatter these classical bottlenecks, mapping complex spatial tasks from human intent to restricted quantum hardware remains a severe challenge. To bridge this gap, we present an end-to-end framework integrating a fine-tuned Large Language Model (LLM) front-end with a highly scalable, domain-specific quantum-classical backend. The front-end utilizes Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to translate free-form natural language instructions into structurally robust Quadratic Unconstrained Binary Optimization (QUBO) constraints without false negatives. To overcome the strict qubit limits of near-term quantum devices, our framework features a novel constraint-preserving graph partitioner and a compressed separator-based dynamic programming (DP) merge. By structurally encoding constraints via W-state initialization and XY-mixers in Conditional Value-at-Risk Quantum Approximate Optimization (CVaR-QAOA), the pipeline stays highly compact. Empirical results demonstrate that this architecture circumvents classical scaling walls, recovering the global optimum on 100% of idealized oracle cases and 96.3% under real QAOA sampling, enabling natural-language-guided task allocation at previously intractable scales.
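The CVaR objective used in CVaR-QAOA replaces the plain mean of sampled energies with the mean over only the best fraction of samples. A minimal sketch under a minimization convention is shown below; the `alpha` value and the sample energies are made up for illustration and do not come from the paper.

```python
import math

def cvar(energies, alpha):
    """Mean of the lowest ceil(alpha * K) of K sampled energies, 0 < alpha <= 1."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(energies)
    k = math.ceil(alpha * len(ordered))
    return sum(ordered[:k]) / k

samples = [3.0, 1.0, 4.0, 2.0]   # hypothetical QUBO energies of measured bitstrings
tail_mean = cvar(samples, 0.5)   # averages the two lowest energies
full_mean = cvar(samples, 1.0)   # alpha = 1 recovers the ordinary expectation
```

With `alpha < 1` the optimizer is rewarded for making good bitstrings more likely rather than for lowering the average over all outcomes, which suits problems where only the best sampled assignment is used.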

2606.14251 2026-06-15 cs.CV 新提交

HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

HiST:用于跨模态空间转录组建模的分层稀疏Transformer

Weiyi Wu, Xinwen Xu, Xingjian Diao, Siting Li, Zhi Wei, Alma Andersson, Jiang Gui

发表机构 * New Jersey Institute of Technology(新泽西理工学院) Stevens Institute of Technology(史蒂文斯理工学院) Karolinska Institutet(卡罗林斯卡学院) Dartmouth College(达特茅斯学院)

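AI summary Proposes HiST, a hierarchical sparse transformer that uses sparse window attention and resolution-changing operators for efficient multiscale spatial transcriptomics inference, reducing runtime and peak memory while improving predictive performance over recent baselines.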

详情
Journal ref
ICML 2026
AI中文摘要

空间转录组学(ST)将基因表达与组织形态联系起来,但成本高且通量低,因此需要从常规组织学推断表达的替代方法。全切片H&E到ST推断将千兆像素图像与稀疏、不规则位置上的基因测量配对,使得多尺度建模在不产生密集网格开销或二次令牌混合的情况下具有挑战性。我们提出HiST,一种分层稀疏Transformer,将测量位置视为晶格索引的稀疏场,并直接在活跃组织足迹上构建二元编码器-解码器。HiST结合了稀疏窗口注意力用于局部几何对应,以及分辨率变换算子用于快速多尺度上下文整合。对于固定窗口大小,主要运行时间和内存随观测位置数量而非密集切片面积扩展。为缓解切片特定的采集变异,HiST通过一个瓶颈全局条件路径添加了\emph{切片校准令牌},该令牌总结切片级上下文并调节局部表示。在涵盖不同组织和采集源的多器官基准测试中,HiST在降低运行时间和峰值内存的同时,提升了相对于近期基线的预测性能。

英文摘要

Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and low-throughput, motivating surrogates that infer expression from routine histology. Whole-slide H&E-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder--decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a \emph{slide calibration token} that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.
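The claim that cost scales with the number of observed locations rather than the dense slide area can be illustrated by how sparse lattice coordinates might be bucketed into attention windows. The sketch below is an assumption about one simple way to do this (integer division of lattice indices); it is not HiST's actual data structure.

```python
from collections import defaultdict

def group_into_windows(coords, window):
    """Bucket sparse (row, col) lattice indices into square windows.

    Returns a dict mapping window id -> list of positions in `coords`.
    Only windows that contain at least one observed location are created,
    so the work is proportional to len(coords), not to the slide area.
    """
    groups = defaultdict(list)
    for idx, (row, col) in enumerate(coords):
        groups[(row // window, col // window)].append(idx)
    return dict(groups)

coords = [(0, 0), (1, 3), (4, 4), (5, 7), (100, 100)]
windows = group_into_windows(coords, window=4)
```

Attention would then be computed within each bucket, and the empty region between the near and far spots never appears in the computation.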