arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30832 2026-06-01 cs.AI

SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning

Jian Yao, Xiongcai Luo, Ran Cheng, Kay Chen Tan

Affiliations: Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University; The Hong Kong Polytechnic University-Daya Bay Technology and Innovation Research Institute, Huizhou, Guangdong Province, China; The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China

AI summary: Proposes SLAT, a segment-level adaptive trimming framework that uses reinforcement learning to selectively suppress high-probability redundant segments with low marginal utility, cutting reasoning length by 50% while maintaining competitive accuracy.

Abstract

Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., overthinking), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose SLAT (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that SLAT establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by 50% relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.

2605.30829 2026-06-01 cs.CV

LegSegNet: A Public Deep Learning System for Lower Extremity CT Tissue Segmentation and Quantification

Yuwen Chen, Yaqian Chen, Roy Colglazier, Haoyu Dong, Hanxue Gu, Maciej A. Mazurowski, Kevin W. Southerland

Affiliations: Department of Electrical and Computer Engineering, Duke University; Department of Biostatistics & Bioinformatics, Duke University; Department of Radiology, Duke University; Department of Computer Science, Duke University; Department of Surgery, Duke University

AI summary: Proposes LegSegNet, a deep learning system that automatically segments and quantifies bone, muscle, subcutaneous fat, and inter/intramuscular fat in lower extremity CT. It reaches an average Dice score of 89.31 on the held-out test set and is described as the first publicly available end-to-end system of its kind.

Comments 9 pages

Abstract

Lower extremity computed tomography (CT) contains clinically relevant information for body composition analysis, sarcopenia assessment, and musculoskeletal disease monitoring, but extracting these measurements at scale requires accurate tissue segmentation and an automated quantification workflow. Existing public segmentation tools are not designed for comprehensive lower extremity CT analysis, particularly for clinically important inter/intramuscular adipose tissue, and most public methods only provide mask prediction rather than an end-to-end quantification system. To address this problem, we present LegSegNet, a deep learning system for lower extremity CT tissue segmentation and body composition quantification. Given an input CT scan, LegSegNet segments bone, skeletal muscle, subcutaneous adipose tissue, and inter/intramuscular adipose tissue. It then computes quantitative tissue measurements for downstream analysis. We developed the segmentation model using 1,302 manually annotated CT slices and evaluated it on 900 held-out test slices, with all annotations reviewed by radiologists. We benchmark LegSegNet against a broad set of 2D segmentation methods, including CNN-based models, transformer-based models, and finetuned foundation models, and further evaluate its generalization on an external public CT dataset. LegSegNet achieves the best overall segmentation performance, with an average Dice score of 89.31 on the held-out test set. To our knowledge, LegSegNet is the first publicly available end-to-end system for lower extremity CT tissue segmentation and quantification, providing a practical evaluation tool for future computer vision research in medical image analysis. The code and model weights are available at: https://github.com/mazurowski-lab/LegSegNet
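The Dice score quoted above can be illustrated with a minimal sketch. This is a generic per-label Dice computation on small integer label maps, not code from the LegSegNet repository; the function names, the toy masks, and the convention for labels absent from both masks are all assumptions made for illustration. The abstract's 89.31 appears to be on a 0-100 scale, whereas this sketch returns values in [0, 1].

```python
def dice_score(pred, target, label):
    """Dice coefficient for one label in two same-shaped 2D label maps."""
    p = [v == label for row in pred for v in row]
    t = [v == label for row in target for v in row]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    if denom == 0:
        # Assumed convention: a label missing from both maps counts as agreement.
        return 1.0
    return 2.0 * intersection / denom


def mean_dice(pred, target, labels):
    """Unweighted average of per-label Dice scores."""
    return sum(dice_score(pred, target, lab) for lab in labels) / len(labels)


# Toy 2x3 label maps (0 = background, 1 and 2 = hypothetical tissue classes).
pred = [[0, 1, 1], [2, 2, 0]]
target = [[0, 1, 0], [2, 2, 2]]
print(mean_dice(pred, target, labels=[1, 2]))
```

Averaging per label rather than over all pixels keeps a small class, such as a thin fat layer, from being swamped by large ones; whether the paper averages this way is not stated in the abstract.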

2605.30826 2026-06-01 cs.CL cs.AI

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan

Affiliations: University of California, San Diego; University of Michigan, Ann Arbor; The Hong Kong University of Science and Technology, Guangzhou; ShanghaiTech University; University of North Carolina, Chapel Hill

AI summary: Proposes BioConCal, a scorer that uses gold-free agreement, mention, surface-availability, and document features to score candidate entities surfaced by a multi-LLM panel, markedly improving the precision and recall of candidate selection.

Abstract

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.
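The idea of a validation-selected precision target can be sketched as follows. This is a generic threshold search, assuming binary correctness labels and one score per candidate; it is not BioConCal's actual procedure, and the function name and toy data are hypothetical.

```python
def threshold_for_precision(scores, labels, target):
    """Return the lowest score cutoff whose validation precision meets the target.

    Candidates with score >= cutoff are selected. Returns None if no cutoff
    reaches the target precision.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    best = None
    true_positives = 0
    for i, (score, label) in enumerate(pairs, start=1):
        true_positives += label
        # Only evaluate at boundaries between distinct scores, so that tied
        # candidates are always selected or rejected together.
        if i < len(pairs) and pairs[i][0] == score:
            continue
        if true_positives / i >= target:
            best = score
    return best


# Hypothetical validation data: scorer outputs and 0/1 correctness labels.
val_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
val_labels = [1, 1, 0, 1, 0]
print(threshold_for_precision(val_scores, val_labels, target=0.75))
```

Because precision is not monotone in the cutoff, the loop keeps scanning and retains the lowest qualifying cutoff, which maximizes the number of selected candidates. A cutoff chosen this way only guarantees the target on validation data, which is consistent with the abstract reporting 0.939 empirical test precision against a 0.95 target.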

2605.30825 2026-06-01 cs.LG cs.AI math.OC stat.ML

Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints

Shervin Khalafi, Alejandro Ribeiro, Dongsheng Ding

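Affiliations: University of Pennsylvania; University of Tennessee, Knoxville

AI summary: Proposes a constrained optimization framework for concept and data unlearning in diffusion models that minimizes deviation from the pretrained model subject to separation constraints from the unlearning distributions, and derives optimal solutions and primal-dual algorithms based on KL divergence and likelihood constraints.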

Comments 27 pages, 6 figures, 4 tables; Accepted by ICML 2026

Abstract

Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained models -- two fundamentally conflicting objectives. We propose a principled constrained optimization framework that formulates unlearning as minimizing the deviation from a pretrained model, subject to explicit separation constraints from the unlearning distributions. Specifically, we formulate three constrained optimization problems based on reverse and forward KL divergences, and likelihood constraints. The first two generalize existing approaches for concept and data unlearning, while the third offers a novel and natural formulation for unlearning. Despite the nonconvexity of the KL constraints, we establish strong duality for all three problems, enabling us to explicitly characterize their optimal solutions as unlearning targets and develop primal-dual algorithms for each formulation. Experimental results demonstrate that our KL-constrained approach achieves superior retention-unlearning tradeoffs compared to weight-based baselines for concept and data unlearning, and that our likelihood-based approach matches unlearning effectiveness while better preserving retained concepts compared to baselines.

2605.30824 2026-06-01 cs.AI

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

Mustafa Anis Hussain, Xinle Wu, Yao Lu

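Affiliations: National University of Singapore

AI summary: Proposes DecomposeR, a framework that represents research plans as directed acyclic graphs (DAGs) and trains in two stages (planner RL, then answerer RL) to obtain explicit structured planning and fine-grained reward assignment, improving by 5.1-8.0 points on long-form benchmarks.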

Abstract

Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.
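To make the notion of a typed DAG plan concrete, here is a minimal sketch using Python's standard `graphlib` module. The node names, node types, and dictionary layout are hypothetical and are not taken from DecomposeR; the sketch only shows how a plan expressed as a DAG yields a valid branch execution order and how a cyclic (invalid) plan can be detected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plan: node -> (node type, set of prerequisite nodes).
plan = {
    "q_root": ("question", set()),
    "s_background": ("search", {"q_root"}),
    "s_evidence": ("search", {"q_root"}),
    "synthesis": ("synthesis", {"s_background", "s_evidence"}),
}


def execution_order(plan):
    """Return the nodes in an order that respects every prerequisite edge."""
    deps = {node: prereqs for node, (_node_type, prereqs) in plan.items()}
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(plan):
    """True if the plan is a valid DAG, False if it contains a cycle."""
    try:
        execution_order(plan)
        return True
    except CycleError:
        return False


print(execution_order(plan))
print(is_acyclic(plan))
```

An explicit structure like this is what makes a plan checkable piece by piece: acyclicity, node types, and per-branch outcomes can each be inspected separately, which is the kind of structured signal the abstract contrasts with a flat trajectory.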

2605.30813 2026-06-01 cs.CL cs.DS

Incremental BPE Tokenization

Shenghu Jiang, Ruihao Gong

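Affiliations: Beihang University; SenseTime Research

AI summary: Proposes an incremental BPE tokenization algorithm that processes each byte in O(log² t) time, for a total complexity of O(n log² t). It supports streaming settings, is up to about 3x faster than Hugging Face's tokenizers, and reduces latency relative to OpenAI's tiktoken on pathological inputs.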

Comments Accepted to ICML 2026 (Spotlight)

Abstract

We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc-inc-bpe
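For reference, the standard BPE merge procedure that the incremental algorithm is said to reproduce can be sketched naively as below. This is a quadratic-time baseline written for illustration, not the paper's algorithm and not the code in the linked repository; the merge list and the function name are hypothetical.

```python
def bpe_encode(data, merges):
    """Tokenize bytes by repeatedly applying the highest-priority merge rule.

    `merges` is an ordered list of (left, right) byte-string pairs; earlier
    entries have higher priority. Ties are broken by leftmost position.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    parts = [bytes([b]) for b in data]
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no applicable merge rule remains
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return parts


# Hypothetical merge rules, highest priority first.
merges = [(b"a", b"b"), (b"ab", b"c"), (b"b", b"c")]
print(bpe_encode(b"abcbc", merges))
```

Each pass rescans the whole sequence, so appending one byte and re-encoding from scratch repeats most of the work. Avoiding that repeated work while keeping the output identical for every prefix is the problem the abstract describes.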

2605.30812 2026-06-01 cs.LG physics.comp-ph

Learning Permutation-invariant Macroscopic Dynamics

Zhichao Han, Mengyi Chen, Qianxiao Li

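Affiliations: Institute for Functional Intelligent Materials, National University of Singapore; Department of Mathematics, National University of Singapore

AI summary: Proposes a permutation-invariant autoencoder framework that learns the macroscopic dynamics of unordered microscopic systems by reconstructing a mass distribution rather than individual points, validated on interacting particle systems, Lennard-Jones fluids, and polymer stretching dynamics.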

Comments ICML 2026 submission

Abstract

Accurately modeling the macroscopic dynamics of high-dimensional microscopic systems is of broad interest across the sciences. Many data-driven approaches learn a low-dimensional latent state through an autoencoder trained for pointwise input reconstruction. These methods typically assume a fixed ordering of microscopic degrees of freedom in the input. However, in many settings, such as particle systems, the microscopic state is inherently unordered. This motivates an autoencoder framework that learns permutation-invariant latent representations. To this end, we adopt a permutation-invariant encoder and design the decoder to reconstruct the mass distribution centered at the observed points rather than per-sample reconstruction. We then jointly learn the macroscopic dynamics of the observables together with the latent states. We demonstrate the effectiveness and robustness of the proposed method across a range of microscopic settings, including learning the energy dynamics in interacting particle systems, predicting mixing dynamics in Lennard-Jones fluids, and modeling the stretching dynamics from video data of polymers moving in an elongational force field.
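A minimal sketch of what permutation invariance means for an encoder: apply the same feature map to every particle and pool with a symmetric operation such as the mean. The hand-written feature map `phi` below stands in for a learned network and is purely an assumption; the paper's actual encoder and its mass-distribution decoder are not reproduced here.

```python
import math
import random


def phi(point):
    """Per-particle feature map (a stand-in for a learned network)."""
    x, y = point
    return (x, y, x * x + y * y, math.sin(x), math.cos(y))


def encode(points):
    """Mean-pool per-particle features; the result ignores particle order."""
    feats = [phi(p) for p in points]
    n = len(feats)
    return tuple(sum(f[k] for f in feats) / n for k in range(len(feats[0])))


rng = random.Random(0)
particles = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]
shuffled = particles[:]
rng.shuffle(shuffled)

z1, z2 = encode(particles), encode(shuffled)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(z1, z2)))
```

The comparison uses a tolerance because floating-point addition is not exactly associative, so reordering the particles can change the last few bits of the pooled sum.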

2605.30811 2026-06-01 cs.LG

Non-destructive Identification of Oyster Species is possible from Hyperspectral Images with Machine Learning

Ethan Kane Waters, Max Wingfield, Aiden Mellor, Paul Stewart, Iman Tahmasbian

Affiliations: Department of Primary Industries; QUT Centre for Data Science, School of Mathematical Sciences, Queensland University of Technology; School of Environment and Science, Griffith University, Nathan

AI summary: Combines hyperspectral imaging with partial least squares discriminant analysis and convolutional neural networks to identify Black-Lip rock and Sydney rock oysters non-destructively and with high accuracy.

Comments 13 pages, 9 figures

Abstract

Differentiating between oyster species is important for developing new commercial oyster species suited to production systems and is critical for traceability in seafood supply chains. Common methods, such as DNA profiling, are destructive and time-consuming. The possibility of using hyperspectral imaging (HSI) for discriminating between Black-Lip rock (BL) and Sydney rock (SR) oysters was investigated. Live BL and SR samples (N = 156) were scanned with an HSI camera (950-2515 nm). Partial Least Squares Discriminant Analysis (PLS-DA) and Convolutional Neural Networks (CNN) were trained with Monte Carlo cross-validation to distinguish BL and SR oysters from the spectral reflectance of their left and right valves. The PLS-DA model successfully distinguished between the species from both the left and right valves with a median test set classification accuracy of 100%, outperforming the CNN, which reached 83% and 96% respectively. Elemental and mineralogical composition in the surface and cross-section of oyster valves were measured with electron microscopy. Analysis of the right valve revealed a greater number of layers in BL compared to SR (4 vs 2). The concentrations of carbon and oxygen varied in the outer layer of the right valves, with BL being rich in carbon and SR being rich in oxygen. The variation in carbon and oxygen concentrations observed between BL and SR right valves may reflect differences in the relative abundance or composition of chitin and glycoproteins. This is supported by model-derived wavelength importance corresponding to vibrational modes of functional groups characteristic of these compounds. Transmittance analysis revealed that light was transmitted through the valves and around the valve edges, indicating that the spectral signatures may have been influenced by the other valve or the meat. Ultimately, the findings highlight an effective, rapid, non-destructive methodology for oyster species identification.

2605.30810 2026-06-01 cs.LG

IRIS: time-structured manifold projections

Brian Ondov, Chia-Hsuan Chang, Weipeng Zhou, Xingjian Zhang, Xueqing Peng, Yutong Xie, Huan He, Qiaozhu Mei, Hua Xu

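Affiliations: Department of Biomedical Informatics and Data Science, Yale School of Medicine; School of Information, University of Michigan

AI summary: Proposes IRIS, an algorithm that combines time ordering with manifold topology to address the inability of t-SNE and UMAP to show temporal dynamics, for visualizing dynamic biomedical data such as scRNA-seq and comparative metagenomics.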

Abstract

High-dimensional biomedical data, such as cell-by-gene matrices, are increasingly generated temporally. However, Manifold Learning algorithms, like t-SNE and UMAP, cannot incorporate time-ordering in their layouts, obfuscating the dynamics of cell types or other classes. As a solution, we present IRIS, a new Manifold Learning algorithm that structures layouts both chronologically and by manifold topology. IRIS can visualize a wide range of dynamic biomedical data, including scRNA-seq, comparative metagenomics, and literature.

2605.30807 2026-06-01 cs.LG

Conformal Reliability: A New Evaluation Metric for Conditional Generation

Yachen Gao, Xinwei Sun, Yikai Wang, Ye Shi, Jingya Wang, Jianfeng Feng, Yanwei Fu

Affiliations: Institute of Science and Technology for Brain-Inspired Intelligence; Shanghai Innovation Institute; School of Data Science; Nanyang Technological University; School of Information Science and Technology

AI summary: Proposes a conformal-prediction-based reliability score as a new evaluation metric for conditional generative models and develops the CReL framework to compute it efficiently; experiments show its validity and interpretability.

Comments Accepted at ICML 2026

Abstract

Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is still lacking. Existing metrics, which typically assess a single output, may fail to capture the variability or potential risks in generation. In this paper, we propose a novel evaluation metric called reliability score based on conformal prediction, which measures the worst-case performance within the prediction set at a pre-specified confidence level. However, computing this score is challenging due to the high-dimensional nature of the output space and the nonconvexity of both the metric function and the prediction set. To efficiently compute this score, we introduce Conformal ReLiability (CReL), a framework that can (i) construct the prediction set with desired coverage; and (ii) accurately optimize the reliability score within the constructed prediction set. We provide theoretical results on coverage and demonstrate empirically that our method produces more informative prediction sets than existing approaches. Experiments on synthetic data and the image-to-text and text-to-image tasks further demonstrate the interpretability of our new metric, and the validity and effectiveness of our computational framework. Source code can be found at https://ggc29.github.io/CReL/.
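The two ingredients named above, a prediction set with a coverage guarantee and a worst case taken over that set, can be sketched with standard split conformal prediction on a small finite candidate pool. This is only an illustration under strong simplifying assumptions: the paper works in a high-dimensional output space where the set cannot be enumerated, and the scores, candidates, and function names here are hypothetical rather than CReL's.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Split-conformal cutoff: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this level
    return sorted(cal_scores)[k - 1]


def prediction_set(candidates, score_fn, cutoff):
    """Keep candidates whose nonconformity score is within the cutoff."""
    return [y for y in candidates if score_fn(y) <= cutoff]


def worst_case_quality(candidates, score_fn, quality_fn, cutoff):
    """Lowest quality among prediction-set members, or None if the set is empty."""
    members = prediction_set(candidates, score_fn, cutoff)
    return min(quality_fn(y) for y in members) if members else None


# Hypothetical calibration scores and candidate outputs.
cal_scores = [5, 1, 4, 2, 3, 9, 7, 8, 6]
cutoff = conformal_quantile(cal_scores, alpha=0.5)
nonconformity = {"a": 1, "b": 6, "c": 5}
quality = {"a": 0.9, "b": 0.1, "c": 0.4}
print(worst_case_quality(["a", "b", "c"], nonconformity.get, quality.get, cutoff))
```

Brute-force enumeration works only because the candidate pool is tiny; the abstract's point is that for real generative outputs this minimization is a nonconvex optimization problem in its own right.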

2605.30804 2026-06-01 cs.CL

Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit

Jiwoo Choi, Seonwoo Ahn, Tongxin Zhang, Seohyon Jung

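Affiliations: School of Digital Humanities and Computational Social Sciences, KAIST, South Korea

AI summary: Using the HEXACO-100 personality inventory, audits six large language models for gender stereotyping across English, Korean, Chinese, and Japanese, finding that their stereotyping spans a range roughly 2.5 times wider than the human cross-country range, and introduces a four-pattern framework (concordance, suppression, reorganization, amplification) to describe cross-lingual behavior.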

Abstract

We audit six large language models (LLMs) for gender stereotyping across English, Korean, Chinese, and Japanese. Three were developed primarily for English-language use (Claude, GPT, Gemini) and three for East Asian use (DeepSeek, Syn-Pro, HyperCLOVA X). We adopt the HEXACO-100 personality inventory and anchor each model against a cross-cultural human dataset spanning 48 countries to ask not whether LLMs are biased, but how far their gender attributions drift from the populations they are deployed among. Our findings show that their stereotyping spans a range roughly 2.5 times wider than the entire cross-country range found in humans, and the effect can compound across languages. One English-centric model, prompted in Korean, reached 5 times the local baseline, even when the prompt stated the candidate had already been hired, which often dampens human stereotyping. To characterize such behaviors without ranking them, we introduce a four-pattern framework -- concordance, suppression, reorganization, and amplification -- across 24 (model x language) cells. Item-level analysis reveals that translation does not just rescale stereotypes, but changes the attributes tied to it, hiding significant rearrangement under the surface while appearing well-calibrated. Our results ultimately suggest that no single debiasing pipeline is likely to address bias evenly across linguistic boundaries.

2605.30803 2026-06-01 cs.AI

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama

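Affiliations: Amazon AGI

AI summary: Proposes PReMISE, a framework that discovers a policy-level rubric set from human-preference data and audits rubrics along four axes (structural adequacy, reliability, preference fit, and adversarial robustness), with preference-rank selection and reliability-constrained refinement as repair operations that raise judge accuracy and reduce exploitability.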

Abstract

LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be "helpful and factual" can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust, and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from 46.4% to 36.0% with little change in inter-judge agreement (α = .531 → .519).

2605.30795 2026-06-01 cs.RO

Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning

Junyang Shu, Zhiwei Lin, Bingqing Wei, Yongtao Wang

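Affiliations: Wangxuan Institute of Computer Technology, Peking University, China

AI summary: Proposes Feat2Go, which extracts patch-level subgoal similarity from a pretrained visual world model and clusters episodes into semantic stages, then trains an embodied value model to predict structural progress and reshape terminal rewards, substantially improving reinforcement learning for VLA models on single-arm and bimanual manipulation tasks.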

Abstract

Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch-level similarity to subgoal states and partitioning episodes into semantic stages with trend-based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single-arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLAOFT from 17.5% to 82.9% average out-of-distribution success while retaining 96.9% in-distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain-randomized task settings, outperforming prior reinforcement learning methods.

2605.30794 2026-06-01 cs.CV cs.AI

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Qian Kou, Xiaofeng Shi, Yulin Li, Xiaosong Qiu, Xinyang Wang, Hua Zhou, Cao Dongxing

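Affiliations: Beijing Academy of Artificial Intelligence (BAAI), China; Institute of Information Engineering, Chinese Academy of Sciences, China; Beijing University of Technology, China

AI summary: To address the weakness of multimodal LLMs on mechanical engineering drawings, introduces MechVQA, described as the first comprehensive mechanical drawing understanding dataset, and develops the MechVL model, which improves performance markedly through multi-stage training.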

Comments Accepted by ICML 2026

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understanding dataset, MechVQA, created through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3k high-density pictures with 21K question-answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLM understanding on real-world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain-specialized baseline. Extensive experimental results demonstrate that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.

2605.30788 2026-06-01 cs.CL cs.AI cs.LG

XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

XLGoBench: 用算法任务检测跨语言技能差距

Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju

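Affiliations: Google DeepMind; Indian Institute of Technology Bombay; International Centre for Theoretical Sciences, Tata Institute of Fundamental Research

AI summary: Proposes a benchmark of synthetic algorithmic tasks that detects cross-lingual capability gaps in large language models by posing the same task in different languages; experiments reveal persistent gaps in multiple state-of-the-art models.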

Comments 8+37 pages

详情
AI中文摘要

我们引入一套合成算法任务,用于检测大语言模型在跨语言能力上的差距。我们的基准在语言间具有可比性,因为它要求模型在不同语言中执行相同的底层任务;可扩展,因为每个任务可以在不同复杂度级别生成,从而适应不同能力的模型;可量化,因为每个任务都承认客观的正确性概念;且透明,因为任务是从简单模板生成的,可以轻松审计翻译错误。由于我们的基准专注于算法任务,性能差异是跨语言差距的充分但不必要条件。尽管如此,我们通过大量实验表明,我们的基准暴露了多个最先进模型中存在的持续跨语言差距。

英文摘要

We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
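The four properties listed in the abstract (same task across languages, adjustable complexity, objective correctness, auditable templates) can be illustrated with a toy generator. This is only a sketch: the sorting task, the three templates, and the names `TEMPLATES`, `make_task`, and `is_correct` are hypothetical and are not taken from XLGoBench.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical templates: one short sentence per language, easy to audit by eye.
TEMPLATES: Dict[str, str] = {
    "en": "Sort the following numbers in ascending order: {nums}",
    "es": "Ordena los siguientes números en orden ascendente: {nums}",
    "fr": "Trie les nombres suivants par ordre croissant : {nums}",
}


def make_task(length: int, seed: int) -> Tuple[Dict[str, str], List[int]]:
    """Build one task instance; `length` is the complexity knob."""
    rng = random.Random(seed)
    nums = [rng.randint(0, 999) for _ in range(length)]
    rendered = ", ".join(str(n) for n in nums)
    # The same numbers go into every language, so the underlying task is identical.
    prompts = {lang: tpl.format(nums=rendered) for lang, tpl in TEMPLATES.items()}
    return prompts, sorted(nums)


def is_correct(model_output: str, answer: List[int]) -> bool:
    """Objective check: the output must parse to exactly the sorted list."""
    try:
        parsed = [int(tok) for tok in model_output.replace(",", " ").split()]
    except ValueError:
        return False
    return parsed == answer
```

Because every language shares one answer key, a model that succeeds on the English prompt but fails on another language's prompt for the same seed exhibits a gap that cannot be blamed on task difficulty.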

2605.30786 2026-06-01 cs.LG

AbstainGNN: Teaching Graph Neural Networks to Abstain for Graph Classification

AbstainGNN:教会图神经网络在图分类中弃权

Xixun Lin, Zhiheng Zhou, Zhengyin Zhang, Yancheng Chen, Shuai Zhang, Ge Zhang, Shichao Zhu, Lixin Zou, Chuan Zhou, Peng Zhang, Shirui Pan, Yanan Cao

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Mathematics and Statistics, Shandong University; Academy of Mathematics and Systems Science, Chinese Academy of Sciences; School of Information and Intelligent Science, Donghua University; TikTok, ByteDance Inc.; School of Cyber Science and Engineering, Wuhan University; Cyberspace Institute of Advanced Technology, Guangzhou University; Griffith University

AI summary: Proposes AbstainGNN, a framework whose theory-driven abstention mechanism lets GNNs decline to predict when uncertain rather than make wrong decisions, and optimizes the trade-off between classification and abstention using PAC-Bayes theory.

Comments Accepted at KDD 2026

详情
AI中文摘要

图分类是图数据挖掘中的核心任务,具有广泛的现实应用。图神经网络的最新进展显著提升了图分类的性能。然而,现有的GNN即使在高度不确定性或未知条件下也通常被迫做出预测,导致不可靠的决策,特别是在安全关键场景中会严重影响下游任务。为了解决这一关键限制,我们提出了AbstainGNN,一种新颖且理论驱动的带弃权图分类框架,使GNN能够拒绝不确定的预测,而不是产生错误的决策。具体来说,AbstainGNN显式地建模了预测函数和弃权函数,从而有效利用图结构信息。此外,与现有的启发式弃权方法不同,我们从PAC-Bayesian泛化角度理论刻画了分类错误与拒绝成本之间的权衡,并推导出用于模型优化的统一学习目标。在此理论洞察的指导下,我们进一步开发了一种高效的两阶段训练策略,包括预测函数预热和弃权函数校准。在五个基准数据集上的大量实验表明,AbstainGNN优于现有的弃权方法,在相同拒绝率下实现了更优的分类性能。

英文摘要

Graph classification is a core task in graph data mining with widespread real-world applications. Recent advances in graph neural networks (GNNs) have led to substantial performance improvements for graph classification. However, existing GNNs are typically forced to make predictions even under high uncertainty or unknown conditions, resulting in unreliable decisions that can severely impact downstream tasks, particularly in safety-critical scenarios. To address this critical limitation, we propose AbstainGNN, a novel and theory-driven framework for graph classification with abstention, which enables GNNs to reject uncertain predictions instead of producing incorrect decisions. Specifically, AbstainGNN explicitly models both the predictive function and the abstention function, allowing for effective utilization of graph structural information. Moreover, unlike existing heuristic abstention methods, we theoretically characterize the trade-off between classification errors and rejection costs from a PAC-Bayesian generalization perspective, and derive a unified learning objective for model optimization. Guided by this theoretical insight, we further develop an efficient two-stage training strategy consisting of predictive function warm-start and abstention function calibration. Extensive experiments on five benchmark datasets show that AbstainGNN outperforms existing abstention methods, achieving superior classification performance under the same rejection rates.
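The closing claim, better classification performance "under the same rejection rates", refers to a standard way of evaluating abstaining classifiers. A minimal sketch of that metric follows, assuming each test graph comes with a scalar confidence and a correctness flag. It uses a simple confidence ranking as the rejection rule, which is a stand-in and not AbstainGNN's learned abstention function; the function name is illustrative.

```python
from typing import Sequence


def accuracy_at_rejection_rate(
    confidences: Sequence[float],
    correct: Sequence[bool],
    rejection_rate: float,
) -> float:
    """Accuracy on the samples kept after rejecting the least confident fraction."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 <= rejection_rate < 1.0:
        raise ValueError("rejection_rate must lie in [0, 1)")
    n_reject = int(rejection_rate * n)  # floor: never reject more than requested
    # Sort sample indices from least to most confident, then drop the first n_reject.
    order = sorted(range(n), key=lambda i: confidences[i])
    kept = order[n_reject:]
    return sum(1 for i in kept if correct[i]) / len(kept)
```

Sweeping `rejection_rate` from 0 upward traces an accuracy-versus-rejection curve, and two abstention methods can be compared by reading both curves at the same rate.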

2605.30785 2026-06-01 cs.AI

Learning Agent-Compatible Context Management for Long-Horizon Tasks

面向长时任务的学习智能体兼容上下文管理

Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie, Yuyang Li, Wenhao Zhang, Zhewei Wei, Yaliang Li, Jian-Yun Nie

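Affiliations: Renmin University of China; Tongyi Lab, Alibaba Group; Beijing University of Posts and Telecommunications; Université de Montréal

AI summary: Proposes AdaCoM, which trains an external LLM with end-to-end reinforcement learning to manage the context of a frozen agent, improving performance on long-horizon tasks and revealing a fidelity-reliability trade-off.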

详情
AI中文摘要

LLM智能体在现实应用中越来越多地面临长时任务,如网络搜索和深度研究,累积的上下文可能导致长上下文退化和推理失败。先前的工作通过智能体端上下文控制或固定策略(如摘要)来缓解这一问题,这需要训练智能体本身进行适应——这使得它对于闭源智能体不切实际,并且忽略了不同智能体可能需要不同策略。我们引入了自适应上下文管理(AdaCoM),它训练一个外部LLM通过灵活的修改动作和端到端强化学习来管理冻结智能体的上下文。在多种智能体上进行的网络搜索和深度研究基准测试中,AdaCoM通过保留任务约束和进展同时修剪过时内容,显著提升了性能。学习到的策略揭示了保真度-可靠性权衡:具有更高原始ReAct性能的智能体受益于更高保真度的上下文保留,而性能较低的智能体则需要更激进的压缩以保持在可靠的推理范围内。迁移实验表明,AdaCoM在能力相似(以原始ReAct性能衡量)的智能体之间最有效地泛化,这为智能体系统的可复用上下文管理器提供了一条实用路径。

英文摘要

LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation - making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

2605.30784 2026-06-01 cs.CV

Text-guided Feature Disentanglement for Cross-modal Gait Recognition

文本引导的跨模态步态识别特征解耦

Zhiyang Lu, Ming Cheng

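Affiliations: Fujian Key Laboratory of Urban Intelligent Sensing and Computing, Xiamen University; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary: To bridge the modality gap between LiDAR and RGB cameras, proposes TCFDNet, which uses textual priors to guide the disentanglement of modality-shared features and performs cross-modal gait recognition through CLIP-based alignment, feature disentanglement, and stability enhancement, achieving state-of-the-art results on the SUSTech1K and FreeGait datasets.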

Comments Accepted by CVPR 2026

详情
AI中文摘要

步态识别是一种基于行走模式识别个体的生物特征技术,在远距离、非侵入场景中具有优势。然而,现实场景通常涉及异构传感模态,如LiDAR和RGB相机,由于2D视频和3D点云序列之间存在显著的模态差距,LiDAR-相机跨模态步态识别(LCCGR)成为一项关键但具有挑战性的任务。为应对这一挑战,我们提出了TCFDNet,一种文本引导的跨模态特征解耦网络,该网络利用模态感知的文本先验作为语义锚点,指导学习解耦的模态共享表示。具体而言,我们使用大型语言模型构建步态模态文本字典(GMTD),以生成跨模态和视角的丰富步态语义描述。然后,基于CLIP的多粒度特征编码器将视觉和文本特征对齐到统一的视觉-语言空间中。此外,文本引导的特征解耦(TFD)模块选择topk匹配的文本描述来重建模态特定表示,并通过残差分解和正交性约束推导出模态共享特征。为缓解解耦共享特征的脆弱性,我们提出特征稳定性增强(FSE)模块,该模块建模空间和通道相关性以提高特征鲁棒性。此外,引入跨模态补丁交换策略以进一步提升泛化能力。在SUSTech1K和FreeGait数据集上的大量实验表明,TCFDNet取得了新的最优结果,并验证了所提模块的有效性。

英文摘要

Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top-k matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.

2605.30780 2026-06-01 cs.RO

Two Degree-of-Freedom Vibratory Transport in a Grasp

抓取中的两自由度振动运输

C. L. Yako, Shenli Yuan, Kenneth Salisbury

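Affiliations: Department of Mechanical Engineering, Stanford University; Department of Computer Science, Stanford University

AI summary: Uses asymmetric vibrations to achieve two degree-of-freedom (DoF) in-hand manipulation of grasped parts; closed-loop position control generates a periodic stick-slip waveform, and the effect of the waveform parameters on average part velocity is analyzed and validated experimentally.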

详情
AI中文摘要

在本文中,我们利用非对称振动演示了抓取零件的两自由度(DoF)手内操作。非对称振动通过移动表面的闭环位置控制实现,该表面向待操作零件施加周期性粘滑波形。我们从理论上分析了两个振动波形参数——粘附加速度和滑动加速度——如何影响零件在对抗重力运动时的平均速度。然后使用实验装置验证理论趋势,其中挤压力受控,零件运动由高分辨率编码器记录。我们还开发了一个2-DoF振动表面,能够在一个方向平移并绕表面法线旋转。在平行爪夹持器配置中使用两个这样的2-DoF表面,我们双向平移和旋转各种抓取零件,并证明相同的平移波形趋势也适用于面内旋转。

英文摘要

In this paper, we use asymmetric vibrations to demonstrate two degree-of-freedom (DoF) in-hand manipulation of grasped parts. The asymmetric vibrations are achieved through closed-loop position control of a moving surface, which applies a periodic stick-slip waveform to the part to be manipulated. We show analytically how two vibratory waveform parameters, the sticking acceleration and the slipping acceleration, affect average part velocity when moving against gravity. The theoretical trends are then validated using an experimental setup where the squeeze force is controlled and part motion is recorded by a high-resolution encoder. We also develop a 2-DoF vibratory surface capable of translation in one direction and rotation about the surface normal. Using two of these 2-DoF surfaces in a parallel jaw gripper configuration, we bidirectionally translate and rotate a variety of grasped parts, as well as demonstrate that the same waveform trends for translation also persist for in-plane rotation.

2605.30778 2026-06-01 cs.RO

Object-Informed Model Predictive Path Integral Control for Non-Prehensile Robot Manipulation

面向非抓取式机器人操作的对象感知模型预测路径积分控制

Nikola Raicevic, Bharath Raam Radhakrishnan, Chenbin Yu, Ki Myung Brian Lee, Nikolay Atanasov

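Affiliations: Department of Electrical and Computer Engineering, University of California San Diego

AI summary: Proposes a hierarchical model predictive path integral (MPPI) control framework in which an object-level plan guides robot-level planning, enabling efficient long-horizon planning for non-prehensile manipulation.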

详情
AI中文摘要

由于欠驱动和不连续交互,非抓取式机器人操作的长时域规划具有挑战性。我们提出一种分层模型预测路径积分(MPPI)控制框架,通过单独计算的对象级规划引导机器人级规划,实现高效的长时域预测。我们首先求解一个简化的仅对象问题,假设对象可以直接被驱动,并将规划的对象轨迹作为参考来求解联合机器人-对象规划问题。我们在仿真和硬件上使用6自由度xArm6机械臂执行对象推动任务来评估我们的方法,其中目标对象必须到达目标点同时避开静态障碍物,这需要非短视的推理。我们的对象感知MPPI在仿真中将任务成功率提高了40%,控制频率提高了26%,在实际实验中提高了20%,且计算量与常规MPPI相当。

英文摘要

Long-horizon planning for non-prehensile robot manipulation is challenging due to underactuated and discontinuous interactions. We propose a hierarchical formulation of model predictive path integral (MPPI) control that guides robot-level planning with a separately computed object-level plan to achieve efficient long-horizon prediction. We first solve a simplified object-only problem, assuming the object can be actuated directly, and use the planned object trajectory as a reference in solving the joint robot-object planning problem. We evaluate our method in both simulation and hardware using a 6-DoF xArm6 manipulator to perform object pushing tasks in which the target object must reach a goal while avoiding static obstacles, necessitating non-myopic reasoning. Our object-informed MPPI increases task success by 40% with a 26% faster control frequency in simulation, and by 20% in real experiments with similar computation as regular MPPI.
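For readers unfamiliar with the "regular MPPI" baseline, its core step weights sampled control perturbations by the exponentiated negative rollout cost and averages them into the nominal control sequence. The sketch below shows only that standard update, assuming scalar controls and rollout costs that have already been computed. It is not the hierarchical variant proposed in the paper, and the name `mppi_update` and its parameters are illustrative. How the object-level reference trajectory enters the cost is not specified in the abstract.

```python
import math
from typing import List, Sequence


def mppi_update(
    nominal: Sequence[float],
    perturbations: Sequence[Sequence[float]],
    costs: Sequence[float],
    temperature: float,
) -> List[float]:
    """One standard MPPI step: nominal[t] + sum_k w_k * perturbations[k][t]."""
    if len(perturbations) != len(costs) or not costs:
        raise ValueError("need one cost per sampled perturbation")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    # Subtract the minimum cost before exponentiating for numerical stability.
    best = min(costs)
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]
    return [
        u_t + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u_t in enumerate(nominal)
    ]
```

A low temperature concentrates the weight on the cheapest rollouts, while a high temperature approaches a plain average of the sampled perturbations.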

2605.30776 2026-06-01 cs.LG

Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning

高效且不确定性感知的离线到在线强化学习扩散框架

Ha Manh Bui, Metod Jazbec, Eric Nalisnick, Anqi Liu

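Affiliations: Department of Computer Science, Johns Hopkins University, Baltimore, MD, U.S.A.; AMLab, University of Amsterdam, Amsterdam, Netherlands

AI summary: Proposes DUAL, which uses the prior knowledge of a diffusion model to distill a fast-sampling diffusion policy and transition model, and quantifies uncertainty with a Laplace approximation and distance transition-state-shift detection to better balance exploration and exploitation in the online phase.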

Comments International Conference on Machine Learning, 2026

详情
AI中文摘要

离线到在线强化学习(O2O-RL)利用离线预训练策略来最小化昂贵的在线交互。尽管数据高效,但O2O-RL容易受到离线与在线分布之间偏移的影响。现有工作旨在通过对从扩散模型采样的轨迹数据微调策略来减轻这种偏移的危害。受此启发,我们提出了DUAL:一个用于离线到在线强化学习的高效不确定性感知扩散框架。DUAL利用扩散模型的先验知识,在离线阶段蒸馏出一个快速采样的扩散策略和转移模型。DUAL还采用拉普拉斯近似和距离转移状态偏移检测,从而通过不确定性量化来改进在线阶段的探索与利用平衡。我们正式证明,带有拉普拉斯近似的策略损失提供了认知不确定性原则性估计的代理。实验上,DUAL在多种设置和环境下的在线期望回报优于O2O-RL基线。

英文摘要

Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Existing work aims to mitigate the harm of this shift by finetuning the policy on trajectory data sampled from a diffusion model. Inspired by this line of work, we propose DUAL: an efficient Diffusion Uncertainty-Aware framework for offline-to-online reinforcement Learning. DUAL utilizes the prior knowledge of the diffusion model to distill a fast-sampling diffusion actor policy and transition model in the offline phase. DUAL also employs a Laplace approximation and distance transition-state-shift detection, thereby using uncertainty quantification to improve exploration versus exploitation in the online phase. We formally show that our actor loss with the Laplace approximation provides a proxy for a principled estimate of epistemic uncertainty. Empirically, DUAL improves the online expected return over O2O-RL baselines across multiple settings and environments.

2605.30774 2026-06-01 cs.CV

CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping

CameraNoise: 通过几何流引导的噪声扭曲实现视频扩散中的忠实相机控制

Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

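Affiliations: Fudan University; Tencent; Xiamen University

AI summary: Proposes CameraNoise, which uses geometry-flow-guided noise warping to encode camera motion into a temporally coherent stochastic representation, enabling faithful and geometrically consistent camera control in video diffusion.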

Comments 28 pages, 16 figures

Journal ref Proceedings of the Forty-third International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

精确的相机姿态控制对于视频扩散至关重要,但保持几何一致性仍然是一个挑战。现有方法直接将数值相机参数注入扩散骨干网络,往往无法弥合抽象坐标与视觉内容之间的差距,导致结构失真。为解决这一问题,我们提出CameraNoise,一种流到噪声的扭曲方法,将相机运动编码为时间一致的随机表示。与传统的条件控制不同,CameraNoise将相机姿态直接嵌入噪声空间。这将在忠实保留轨迹动态的同时,将运动与场景外观解耦。具体来说,我们引入了一种新颖的几何引导重投影流和噪声扭曲算法,共同保持扩散的高斯先验,并确保在相机变换下噪声传播的一致性。通过将CameraNoise集成到扩散过程中,我们的框架能够生成稳定、高保真的视频。大量实验表明,我们的方法在视觉质量和轨迹忠实度方面均显著优于先前方法。项目页面和代码可在 https://gulucaptain.github.io/CameraNoise/ 获取。

英文摘要

Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flow-to-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness. The project page and code are available at: https://gulucaptain.github.io/CameraNoise/.

2605.30771 2026-06-01 cs.CL

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

Eywa:基于溯源的人工智能智能体长期记忆

Resham Joshi

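Affiliations: Eywa

AI summary: Proposes Eywa, an architecture that achieves auditable long-term memory by storing evidence before deriving facts, validating extracted memories, and retrieving through a deterministic multi-route read path with zero LLM calls; it reports high accuracy on several benchmarks.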

Comments 29 pages, 3 figures, 16 tables. Benchmark artifacts available at https://eywa.to/research

详情
AI中文摘要

跨会话持久化的人工智能智能体需要能够检索、审计、更新和擦除的记忆。现有的记忆系统通常将源证据、提取的事实、检索到的上下文和答案策略合并为一个不透明的提示路径,使得故障难以诊断:错误答案可能源于缺失证据、不支持的提取、过时状态、检索损失或答案模型行为。我们提出Eywa,一种基于溯源的记忆架构,围绕“证据先于信念”构建。Eywa在推导规范事实之前存储不可变的源证据,根据类型化信号和源支持验证提取的记忆,并通过确定性多路径读取路径(检索内部零LLM调用)检索有界的记忆上下文。检索到的上下文与答案指令分开返回,使得相同的记忆基质可以在前沿、预算和本地答案模型上进行评估。在冻结的、工件记录的检索配置下,Eywa在LoCoMo C1-C4分割上使用Claude Sonnet 4.6写入和QA角色达到90.19%的裁判准确率。在LongMemEval-S上,达到88.2%的检索充分性准确率。在BEAM(一个700问题的技术记忆压力基准)上,达到81.45%的平均nugget分数和85.29%的pass@score >= 0.5。完整的每问题工件,包括问题、黄金答案、模型答案、检索到的上下文和标签,发布在https://eywa.to/research。

英文摘要

AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt path, making failures difficult to diagnose: a wrong answer may come from missing evidence, unsupported extraction, stale state, retrieval loss, or answer-model behavior. We present Eywa, a provenance-grounded memory architecture built around evidence before belief. Eywa stores immutable source evidence before deriving canonical facts, validates extracted memories against typed signals and source support, and retrieves bounded memory context through a deterministic multi-route read path with zero LLM calls inside retrieval. Retrieved context is returned separately from answer instructions, allowing the same memory substrate to be evaluated across frontier, budget, and local answer models. Under a frozen, artifact-recorded retrieval configuration, Eywa reaches 90.19% judge accuracy on the LoCoMo C1-C4 split with Claude Sonnet 4.6 write and QA roles. On LongMemEval-S, it reaches 88.2% retrieval-sufficiency accuracy. On BEAM, a 700-question technical-memory stress benchmark, it reaches 81.45% mean nugget score and 85.29% pass@score >= 0.5. Full per-question artifacts, including questions, gold answers, model answers, retrieved context, and labels, are published at https://eywa.to/research.
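The "evidence before belief" ordering can be pictured with a toy store in which a fact is admitted only if it cites evidence that was written first. Everything below is a hypothetical illustration: the class and field names are invented, the substring check merely stands in for the paper's validation against "typed signals and source support", and the single keyword lookup stands in for its multi-route read path.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen: evidence cannot be mutated after it is written
class Evidence:
    evidence_id: str
    text: str


@dataclass(frozen=True)
class Fact:
    statement: str
    support_quote: str  # span that must appear verbatim in the cited evidence
    evidence_id: str


class MemoryStore:
    def __init__(self) -> None:
        self._evidence: Dict[str, Evidence] = {}
        self._facts: List[Fact] = []

    def add_evidence(self, evidence: Evidence) -> None:
        # Append-only: an existing id is never overwritten.
        if evidence.evidence_id in self._evidence:
            raise ValueError("evidence ids cannot be reused")
        self._evidence[evidence.evidence_id] = evidence

    def add_fact(self, fact: Fact) -> bool:
        """Admit a fact only when its quote is found in already-stored evidence."""
        source = self._evidence.get(fact.evidence_id)
        if source is None or fact.support_quote not in source.text:
            return False
        self._facts.append(fact)
        return True

    def retrieve(self, keyword: str, limit: int = 5) -> List[Fact]:
        # Deterministic keyword match with a hard bound; no model calls involved.
        hits = [f for f in self._facts if keyword.lower() in f.statement.lower()]
        return hits[:limit]
```

Keeping the evidence id on every fact is what makes a wrong answer traceable: one can check whether the evidence was missing, the extraction was unsupported, or retrieval simply did not return the fact.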

2605.30770 2026-06-01 cs.RO

SSR: Scaling Surefooted and Symmetric Humanoid Traversal to the Open World

SSR:将稳健且对称的人形穿越扩展到开放世界

Ruiqi Yu, Yiwen Wang, Yuan Hao, Jun WU, Qiuguo Zhu

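Affiliations: Zhejiang University

AI summary: Proposes SSR, a framework that combines imagined foothold guidance, equivariant latent-space symmetry augmentation, and terrain-specific multi-discriminator motion priors to achieve safe, stable vision-based humanoid traversal in the open world.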

详情
AI中文摘要

将人形穿越扩展到开放世界是在人类环境中实际部署的关键,但仍然具有挑战性。机器人必须利用视觉在高度动态运动下确保在异质地面上安全可靠的落脚点,同时产生协调、自然的全身行为。我们提出SSR,一种高效的端到端框架,用于基于自我中心视觉的人形穿越,联合学习这些能力。SSR引入了想象落脚点引导,学习建模即将到来的摆动脚接触并评估其支撑,以指导触地前的摆动朝向稳定区域,减少边缘滑动。它进一步采用等变潜在空间对称增强,在高维视觉观察下有效诱导双边协调,并使用地形特定多判别器运动先验,鼓励跨场景的类人行为。大量实验表明,SSR在多种真实世界地形上实现了安全、稳定和高质量的运动,包括不同结构的楼梯以及宽间隙和高平台等极端挑战,同时在开放户外环境中实现了可靠的长距离穿越。

英文摘要

Extending humanoid traversal to the open world is key to practical deployment in human environments, but remains challenging. The robot must use vision to ensure safe and reliable foot placement on heterogeneous terrain under highly dynamic motion, while producing coordinated, natural whole-body behaviors. We propose SSR, an efficient end-to-end framework for egocentric vision-based humanoid traversal that jointly learns these capabilities. SSR introduces imagined foothold guidance, which learns to model forthcoming swing-foot contacts and evaluates their support to guide pre-touchdown swings toward stable regions, reducing edge slips. It further employs equivariant latent-space symmetry augmentation to efficiently induce bilateral coordination under high-dimensional visual observations, and uses terrain-specific multi-discriminator motion priors to encourage human-like behavior across scenes. Extensive experiments show that SSR achieves safe, stable, and high-quality locomotion on diverse real-world terrains, including stairs with varied structures and extreme challenges such as wide gaps and high platforms, while enabling reliable long-horizon traversal in open outdoor environments.

2605.30769 2026-06-01 cs.CV cs.RO

DisPlace: Discriminative Place Projections for Multi-Reference Visual Place Recognition

DisPlace: 面向多参考视觉地点识别的判别性地点投影

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

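Affiliations: QUT Centre for Robotics, School of Electrical Engineering and Robotics, Queensland University of Technology

AI summary: Proposes DisPlace, which fuses multi-reference descriptors by solving a generalized eigenvalue problem that maximizes between-place separability and suppresses within-place variation, improving the robustness of visual place recognition under varying conditions.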

Comments Under review

详情
AI中文摘要

视觉地点识别(VPR)的一个关键挑战是在不同环境条件和视角下,将查询图像与参考地图进行匹配。虽然多次参考遍历提高了鲁棒性,但现有的融合策略要么统一聚合参考,要么依赖启发式选择,无法区分保持稳定地点身份的描述符变化与由变化条件或视角引起的变化。在本文中,我们提出DisPlace,一种多参考VPR框架,将多个参考描述符融合为单个紧凑且具有判别性的地点表示。DisPlace将描述符融合表述为一个广义特征值问题,该问题最大化地点间可分性,同时抑制跨参考的地点内变化,而不是保留整体描述符方差。与现有的多参考融合方法不同,DisPlace利用跨参考遍历的变化来识别哪些描述符维度的线性组合保留了地点身份,哪些捕捉了条件或视角特定的变化。我们在Oxford RobotCar、Nordland、Pittsburgh30k和Google Landmarks v2上,使用六种最先进的VPR描述符评估了DisPlace。在54种外观变化条件下,DisPlace在49种中优于七种多参考基线,在视角和非结构化设置下持续改进描述符级融合性能,并且在推理期间比所有比较的融合方法需要更少的存储空间。

英文摘要

A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.
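The abstract does not give the formulation, but its description (maximize between-place separability, suppress within-place variation, solve a generalized eigenvalue problem) matches the classical Fisher discriminant form. The block below writes that classical form out as a reading aid; the scatter definitions and the notation are assumptions, and the paper's own construction may differ.

```latex
% Assumed notation: d_{p,r} is the descriptor of place p in reference traversal r,
% \mu_p is the mean descriptor of place p over its R_p traversals, \mu is the global mean.
\begin{align}
S_w &= \sum_{p}\sum_{r=1}^{R_p} (d_{p,r}-\mu_p)(d_{p,r}-\mu_p)^{\top}
      && \text{(within-place scatter)} \\
S_b &= \sum_{p} R_p\,(\mu_p-\mu)(\mu_p-\mu)^{\top}
      && \text{(between-place scatter)} \\
S_b\,w_i &= \lambda_i\,S_w\,w_i,
      \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k
      && \text{(generalized eigenvalue problem)}
\end{align}
% Assumed fused representation of place p: W^{\top}\mu_p with W = [w_1, \dots, w_k].
```

Directions with large eigenvalues vary strongly between places but little across traversals of the same place, which is the opposite criterion from PCA's preservation of overall descriptor variance.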

2605.30758 2026-06-01 cs.CL cs.LG

Pairwise Reference Alignment as a Model-Level Ordinal Observable

成对参考对齐作为模型级序数可观测量

Mujing Li

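Affiliations: Independent Researcher

AI summary: Defines pairwise reference alignment as an ordinal observable induced by a model scoring function, proposes a centered order-parameter-like statistic and a margin-based extension, gives finite-sample estimators and concentration bounds, and reports an initial empirical study with Qwen2.5 models on RewardBench.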

详情
AI中文摘要

成对偏好数据广泛用于语言模型评估和对齐,通常用于模型排名、奖励建模或偏好优化。本文提出了一个更基础的测量问题:给定成对偏好的参考分布,当我们测试模型是否将首选响应排在拒绝响应之上时,估计的是哪个模型级量?我们将成对参考对齐定义为由模型评分函数诱导的序数可观测量。给定三元组$(x,y^+,y^-)$上的参考对分布$P_{\mathrm{pair}}$和标量模型分数$S_M(x,y)$,我们将对齐可观测量定义为模型诱导的排序与参考偏好排序一致的概率。我们进一步定义了一个中心化的序参数类统计量,并讨论了基于边界的扩展。所得量在独立抽样假设下具有简单的有限样本估计量和浓度界。本文没有引入新的基准。它为成对参考对齐提供了概念和统计公式,阐明了参考对分布的作用,并将一般的序数可观测量与评分选择(如归一化对数概率或基于能量的分数)区分开来。我们还在Qwen2.5模型和RewardBench上进行了初步实证研究,其中所提出的统计量随模型大小和指令调优而增加,并根据公式在参考对子集之间变化。

英文摘要

Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses? We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^+,y^-)$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions. This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.
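The alignment observable is the probability that the model scores the preferred response above the rejected one, so its natural estimator is an agreement rate over sampled triples. The sketch below assumes i.i.d. triples and strict inequality for agreement. The centering `2 * a_hat - 1` and the Hoeffding radius are standard choices supplied here as assumptions, since the abstract mentions a centered statistic and concentration bounds without giving their form; all function names are illustrative.

```python
import math
from typing import Callable, Sequence, Tuple

Triple = Tuple[str, str, str]  # (prompt x, preferred y_plus, rejected y_minus)


def alignment_estimate(
    score: Callable[[str, str], float], triples: Sequence[Triple]
) -> float:
    """Fraction of triples where score(x, y_plus) > score(x, y_minus)."""
    if not triples:
        raise ValueError("need at least one triple")
    agree = sum(1 for x, y_plus, y_minus in triples if score(x, y_plus) > score(x, y_minus))
    return agree / len(triples)


def centered_statistic(a_hat: float) -> float:
    """Assumed centering: maps [0, 1] to [-1, 1], with 0 at chance level."""
    return 2.0 * a_hat - 1.0


def hoeffding_radius(n: int, delta: float) -> float:
    """Two-sided Hoeffding radius: |a_hat - A| <= radius with probability >= 1 - delta."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```

Swapping in a different `score`, such as a length-normalized log-probability, changes the estimate without changing the estimator, which mirrors the abstract's distinction between the ordinal observable and the scoring choice.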

2605.30757 2026-06-01 cs.LG

Chain-of-Thought and Compressed Looped Transformers: A Memory-Budget Separation

思维链与压缩循环Transformer:记忆预算分离

Haozhou Zhang

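Affiliations: Department of Mathematics and Statistics

AI summary: Compares three memory regimes (compressed latent loop, full sequence-state loop, and chain-of-thought scratchpad) and shows that the memory budget of a compressed looped Transformer limits its reasoning power, whereas chain-of-thought can solve harder problems by growing its context.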

详情
AI中文摘要

思维链提示和循环Transformer都赋予固定模型更多的测试时计算,但它们在记忆内容上有所不同。思维链将中间状态存储在生成的标记中,这些标记保留在上下文中,而循环Transformer通过循环隐藏激活传递状态。我们认为这种持久可变记忆是测试时推理的核心资源。我们比较了三种记忆机制:压缩潜在循环、全序列状态循环和思维链暂存区。我们的主要结果表明,压缩循环受其循环状态大小的限制。运行更长时间的循环增加了计算量,但本身不会创建增长的暂存区,因此即使运行多个步骤,具有小循环状态的循环仍然是小空间推理器。在标准复杂性假设下,这样的循环无法解决在logspace归约下P-complete的问题,而多项式长度的思维链可以。这种分离是压缩循环特有的,因为全序列状态循环在每个输入位置携带状态,并处于更接近显式暂存区的记忆丰富状态。受控的指针追逐和关联回忆扫描说明了这种记忆预算观点,其性能对持久状态预算是否匹配任务的工作记忆需求敏感。

英文摘要

Chain-of-thought prompting and looped Transformers both give a fixed model more test-time computation, but they differ in what they remember. Chain-of-thought stores intermediate state in generated tokens that remain in the context, whereas a looped Transformer carries state through recurrent hidden activations. We argue that this persistent mutable memory is a central resource for test-time reasoning. We compare three memory regimes, the compressed latent loop, the full sequence-state loop, and the chain-of-thought scratchpad. Our main result shows that a compressed loop is limited by the size of its recurrent state. Running the loop longer adds computation but does not by itself create a growing scratchpad, so a loop with a small recurrent state remains a small-space reasoner even when run for many steps. Under a standard complexity assumption, such loops cannot decide problems that are P-complete under logspace reductions, whereas polynomial-length chain-of-thought can. The separation is specific to compressed loops, as full sequence-state loops carry state at every input position and live in a memory-rich regime closer to explicit scratchpads. Controlled pointer-chasing and associative-recall sweeps illustrate this memory-budget view, with performance sensitive to whether the persistent-state budget matches the task's working-memory demand.
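The abstract does not define its pointer-chasing task, but a common version asks for the node reached after following a fixed number of pointers from a start node. A minimal generator for that common version is sketched below; the permutation table and the name `make_pointer_chasing` are assumptions, not the paper's setup.

```python
import random
from typing import List, Tuple


def make_pointer_chasing(n: int, hops: int, seed: int) -> Tuple[List[int], int, int]:
    """Return (table, start, answer) where answer is reached after `hops` lookups."""
    if n <= 0 or hops < 0:
        raise ValueError("need n > 0 and hops >= 0")
    rng = random.Random(seed)
    table = list(range(n))
    rng.shuffle(table)  # table[i] is the node that node i points to
    start = rng.randrange(n)
    node = start
    for _ in range(hops):
        node = table[node]  # the only state carried between steps is `node`
    return table, start, node
```

The solver loop carries a single node index between steps, so the task's working-memory demand is explicit and can be varied, for example by chasing several pointers in parallel, when sweeping it against a model's persistent-state budget.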

2605.30753 2026-06-01 cs.CL

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

通过时空并行解码和置信度外推的高效扩散大语言模型

Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu, Dong Li, Emad Barsoum

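Affiliations: Advanced Micro Devices, Inc. (AMD)

AI summary: Proposes two methods, Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE), which treat decoding as dynamic control over denoising trajectories to cut redundant iterations and accelerate inference in diffusion large language models.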

详情
AI中文摘要

基于扩散的大语言模型(dLLMs)通过迭代去噪支持并行文本生成,但由于许多步骤花费在冗余精炼和重复掩码那些最终值已确定的token上,推理仍然延迟严重。先前的加速方法主要依赖于步骤局部置信度启发式或固定调度,这些方法对提示和任务变化敏感,且忽略了序列内的强位置效应。我们将扩散解码视为一个动态控制问题,并表明逐token的去噪轨迹为可靠控制提供了关键信号。我们提出了一个具有两个组件的轨迹感知解码框架。首先,时空并行解码(TSPD)使用一个轻量级的时空控制器,该控制器消耗每个token的轨迹特征,包括置信度、熵和动量,以及token位置,以决定何时token已收敛并可以安全固定。其次,我们引入了置信度外推(CE),一个无训练的状态空间模块,它预测未来的logit趋势并带有不确定性,以支持主动决策,包括安全的前瞻和在轨迹振荡或置信度不足时的目标稳定。TSPD和CE共同减少了不必要的去噪迭代,同时保持了输出质量,并且它们与系统优化(如KV缓存)干净地组合。

英文摘要

Diffusion-based large language models (dLLMs) support parallel text generation via iterative denoising, yet inference remains latency-heavy because many steps are spent on redundant refinement and repeated remasking of tokens whose final values are already determined. Prior acceleration methods mainly depend on step-local confidence heuristics or fixed schedules, which are sensitive to prompt and task variation and ignore strong positional effects within a sequence. We cast diffusion decoding as a dynamic control problem and show that token-wise denoising trajectories provide the key signal for reliable control. We propose a trace-aware decoding framework with two components. First, Temporal-Spatial Parallel Decoding (TSPD) uses a lightweight temporal-spatial controller that consumes per-token trajectory features, including confidence, entropy, and momentum, together with token position, to decide when a token has converged and can be safely fixed. Second, we introduce Confidence Extrapolation (CE), a training-free state-space module that forecasts future logit trends with uncertainty to support proactive decisions, including safe look-ahead and targeted stabilization when trajectories are oscillatory or underconfident. Together, TSPD and CE reduce unnecessary denoising iterations while preserving output quality, and they compose cleanly with system optimizations such as KV caching.
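The per-token features named in the abstract (confidence, entropy, momentum, position) can be made concrete with a small sketch. The definitions below are assumptions: confidence is taken as the top probability, momentum as the change in confidence between the last two denoising steps, and position as a normalized index. The threshold rule `looks_converged` is a hand-set stand-in for the learned controller, not the TSPD method itself.

```python
import math
from typing import Dict, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_features(
    prob_history: Sequence[Sequence[float]], position: int, seq_len: int
) -> Dict[str, float]:
    """Features of one token from its distributions over successive denoising steps."""
    if not prob_history:
        raise ValueError("need at least one denoising step")
    confidences = [max(p) for p in prob_history]
    momentum = confidences[-1] - confidences[-2] if len(confidences) >= 2 else 0.0
    return {
        "confidence": confidences[-1],
        "entropy": entropy(prob_history[-1]),
        "momentum": momentum,
        "position": position / max(seq_len - 1, 1),
    }


def looks_converged(
    features: Dict[str, float],
    min_conf: float = 0.9,          # hypothetical threshold
    max_entropy: float = 0.5,       # hypothetical threshold
    max_abs_momentum: float = 0.05, # hypothetical threshold
) -> bool:
    """Hand-set rule: confident, low-entropy, and no longer changing."""
    return (
        features["confidence"] >= min_conf
        and features["entropy"] <= max_entropy
        and abs(features["momentum"]) <= max_abs_momentum
    )
```

A step-local heuristic would look only at the latest confidence, whereas the momentum term depends on the trajectory, which is the kind of signal the abstract argues is needed for reliable control.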

2605.30750 2026-06-01 cs.CV

SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling

SLAP: 用于变分视频-语言建模的语义最小作用原理

Xiang Fang, Wanlong Fang

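Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore

AI summary: Proposes the Semantic Least Action Principle (SLAP), which models video interpolation as a boundary value problem on a Riemannian manifold and preserves object persistence via the discrete Euler-Lagrange equations, addressing the temporal gap in large video-language models.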

Comments Accepted by ICML 2026

详情
AI中文摘要

在大视频语言模型(LVLMs)时代,稀疏帧采样的计算需求造成了根本性的“时间间隙”,使模型对关键的因果转换视而不见。现有的依赖于生成幻觉(如潜在扩散)或自回归外推的解决方案往往难以在长时间跨度内保持语义一致性,遭受对象消失和能量不稳定的问题。我们提出从概率生成到变分力学的范式转变,即语义最小作用原理(SLAP)。通过在经典力学和语义动力学之间建立严格的同构关系,我们将潜在视频轨迹建模为由语义拉格朗日量控制的黎曼流形上的路径。通过将插值任务表述为通过离散欧拉-拉格朗日方程求解的边界值问题(BVP),SLAP自然地强制对象持久性,而无需像素级渲染。大量实验证明了我们提出的SLAP的有效性。

英文摘要

In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental "temporal gap", rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the Semantic Least Action Principle (SLAP). Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments show the effectiveness of our proposed SLAP.
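To make "a boundary value problem solved via the discrete Euler-Lagrange equations" concrete, the block below works through the textbook flat-space case. It assumes a Euclidean latent space, a step size $h$, and a generic potential $V$; the paper's Riemannian metric and Semantic Lagrangian are not given in the abstract and are not reproduced here.

```latex
% Assumed setting: latent states z_0, ..., z_N with the two sampled frames z_0 and z_N fixed.
\begin{align}
L_d(z_k, z_{k+1}) &= h\left[\tfrac{1}{2}\left\|\frac{z_{k+1}-z_k}{h}\right\|^2 - V(z_k)\right],
\qquad S_d = \sum_{k=0}^{N-1} L_d(z_k, z_{k+1}) \\
0 &= D_2 L_d(z_{k-1}, z_k) + D_1 L_d(z_k, z_{k+1}) \\
  &= \frac{z_k - z_{k-1}}{h} - \frac{z_{k+1}-z_k}{h} - h\,\nabla V(z_k) \\
\Longrightarrow\quad \frac{z_{k+1} - 2z_k + z_{k-1}}{h^2} &= -\nabla V(z_k),
\qquad k = 1, \dots, N-1 .
\end{align}
```

With $V \equiv 0$ the solution is linear interpolation between $z_0$ and $z_N$. Any structure beyond that has to come from the potential and the metric, which is where a semantic Lagrangian would enter.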

2605.30749 2026-06-01 cs.LG cs.RO

FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance

FLAG: 通过潜在增强引导的流策略最大熵强化学习

Sungha Kim, Gawon Lee, Jusuk Lee, Jonghae Park, H. Jin Kim, Daesol Cho

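Affiliations: Seoul National University; Georgia Institute of Technology

AI summary: Proposes FLAG, which augments the state space with a latent variable and optimizes a proxy maximum-entropy objective to avoid importance weight collapse, enabling expressive policy optimization in high-dimensional control tasks.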

详情
AI中文摘要

最大熵强化学习(MaxEnt-RL)能够实现鲁棒的探索,然而实际实现通常将策略限制为简单的高斯分布。最近的方法通过重要性加权监督学习引入表达性生成策略,但容易受到重要性权重崩溃的影响,这限制了它们在高维动作空间中的可扩展性。我们的关键见解是通过局部化采样区域来缓解这一限制,避免在整个动作空间上进行重要性采样导致的权重退化。为了实例化这一见解,我们引入了FLAG(具有潜在增强引导的流策略)。FLAG通过流潜在变量增强状态空间,并优化一个可证明一致的代理MaxEnt-RL目标。我们经验证明,FLAG能够在有限的重要性样本下实现表达性策略优化,并扩展到高维控制任务。此外,FLAG在具有挑战性的基准测试中达到了最先进的性能。我们的项目网页:https://flag-rl.github.io/

英文摘要

Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent approaches incorporate expressive generative policies via importance-weighted supervised learning, they are prone to importance weight collapse, which limits their scalability in high-dimensional action spaces. Our key insight is to mitigate this limitation by localizing the sampling region, avoiding the weight degeneracy induced by importance sampling over the entire action space. To instantiate this insight, we introduce FLAG (Flow policy with Latent-Augmented Guidance). FLAG augments the state space with a flow latent variable and optimizes a provably consistent proxy MaxEnt-RL objective. We empirically demonstrate that FLAG enables expressive policy optimization with limited importance samples and scales to high-dimensional control tasks. Furthermore, FLAG achieves state-of-the-art performance across challenging benchmarks. Our project webpage: https://flag-rl.github.io/
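Importance weight collapse is commonly diagnosed with the effective sample size (ESS) of the self-normalized weights. The sketch below shows that generic diagnostic, assuming log-weights are available. It is not part of FLAG, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def normalized_weights(log_weights: Sequence[float]) -> List[float]:
    """Self-normalized importance weights, computed stably from log-weights."""
    if not log_weights:
        raise ValueError("need at least one log-weight")
    top = max(log_weights)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(lw - top) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


def effective_sample_size(log_weights: Sequence[float]) -> float:
    """ESS = 1 / sum_i w_i^2 for normalized weights w_i."""
    weights = normalized_weights(log_weights)
    return 1.0 / sum(w * w for w in weights)
```

Equal weights give an ESS equal to the number of samples, while an ESS close to 1 means a single sample dominates the update. The spread of log-weights tends to grow with action dimension when sampling over the whole action space, which is the degeneracy that localized sampling is meant to avoid.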