arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2605.11520 2026-05-13 cs.CV cs.AI

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

Yixiao Song, Qingyong Li, Wen Wang, Zhicheng Yan

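Affiliations: Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; Frontiers Science Center for Smart High-speed Railway System, Beijing Jiaotong University

AI summary: This paper proposes PointGS, an unsupervised 3D point cloud segmentation method that avoids the high cost of the dense annotations required by supervised methods. It uses 3D Gaussian Splatting as a unified intermediate representation to bridge the domain gap between discrete point clouds and continuous images, and combines multi-view reconstruction with semantic distillation to assign semantics consistently across views. Experiments show that PointGS outperforms existing unsupervised methods on multiple benchmark datasets (+0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS).

Comments: Accepted by Computer Vision and Pattern Recognition (CVPR) 2026

Abstract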

Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
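
The final step of the abstract, assigning point semantics by nearest-neighbor search on labeled Gaussians, can be sketched as follows. This is a minimal brute-force illustration, not the authors' implementation: the function name, the toy coordinates, and the string labels are assumptions, and a real pipeline would likely use a spatial index and run after the two-step registration.

```python
import math

def assign_point_labels(points, gaussian_centers, gaussian_labels):
    """Give each point the label of its nearest labeled Gaussian center (brute force)."""
    labels = []
    for p in points:
        nearest = min(
            range(len(gaussian_centers)),
            key=lambda i: math.dist(p, gaussian_centers[i]),
        )
        labels.append(gaussian_labels[nearest])
    return labels

# Hypothetical toy data: three input points, two labeled Gaussian centers.
points = [(0.1, 0.0, 0.0), (0.9, 1.0, 1.1), (0.4, 0.2, 0.1)]
centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
point_labels = assign_point_labels(points, centers, ["floor", "chair"])
print(point_labels)
```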

2605.11519 2026-05-13 cs.AI cs.CL cs.LG

Controllable User Simulation

Guy Tennenholtz, Ofer Meshi, Amir Globerson, Uri Shalit, Jihwan Jeong, Craig Boutilier

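Affiliations: Google Research; Tel Aviv University

AI summary: This paper studies how to build controllable user simulators for evaluating conversational agents more accurately. The authors formalize controllable simulation as a causal inference problem and show that the standard approach of supervised fine-tuning on post-hoc trajectory labels introduces a structural bias; under policy shift, this makes the variance of evaluation metrics grow geometrically, which they call "controllability collapse". They derive theoretical conditions for causal consistency and propose practical training mitigations. Experiments indicate that the resulting simulators eliminate look-ahead bias, preserve conversational diversity, and generalize zero-shot to unseen agent behaviors.

Abstract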

Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.
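
The geometric variance growth mentioned in the abstract has a familiar analogue in off-policy evaluation. The sketch below is not the paper's theorem; it assumes a textbook setting in which a trajectory weight is the product of independent per-turn importance ratios between a target and a behavior policy. In that setting the variance of the product is exactly (E[w^2])^T - 1, which grows geometrically in the horizon T. The two-action policies are hypothetical.

```python
def product_weight_variance(target, behavior, horizon):
    """Variance of a product of `horizon` i.i.d. importance ratios.

    Each ratio w = target(a) / behavior(a) with a ~ behavior has mean 1 and
    second moment sum_a target(a)^2 / behavior(a), so the product has
    variance second_moment ** horizon - 1.
    """
    second_moment = sum(p * p / q for p, q in zip(target, behavior))
    return second_moment ** horizon - 1.0

# Hypothetical two-action policies: a mild per-turn shift compounds quickly.
behavior = [0.5, 0.5]
target = [0.8, 0.2]
for turns in (1, 5, 10, 20):
    print(turns, product_weight_variance(target, behavior, turns))
```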

2605.11513 2026-05-13 cs.CL cs.AI

A Study on Hidden Layer Distillation for Large Language Model Pre-Training

Maxime Guigon, Lucas Dixon, Michaël E. Sander

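Affiliations: Google DeepMind

AI summary: This paper studies Hidden Layer Distillation (HLD) for large language model pre-training, noting that current knowledge distillation relies mainly on output logits and neglects the semantic information in the teacher's intermediate layers. In compute-controlled comparisons, HLD does not consistently outperform standard logit-based distillation on downstream tasks, but it yields a perplexity gain across all shared-hyperparameter configurations. This suggests that a latent signal can be extracted, although HLD does not yet play a major role in pre-training.

Abstract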

Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
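
To make the contrast between logit-based KD and hidden layer distillation concrete, here is a minimal single-token sketch. The abstract does not specify the losses, so everything here is an assumption: KL divergence on the output distribution for KD, mean squared error between the teacher hidden state and a linearly projected student hidden state for HLD, and a hypothetical mixing weight `hld_weight`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary for one token."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hidden_mse_loss(teacher_hidden, student_hidden, projection):
    """MSE between the teacher state and the projected student state.

    `projection` is a (teacher_dim x student_dim) matrix; it is needed
    because the student is usually narrower than the teacher.
    """
    projected = [sum(w * h for w, h in zip(row, student_hidden)) for row in projection]
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, projected)) / len(teacher_hidden)

def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden, projection, hld_weight=0.5):
    """Assumed combination: logit KD plus a weighted hidden-state term."""
    kd = logit_kd_loss(teacher_logits, student_logits)
    hld = hidden_mse_loss(teacher_hidden, student_hidden, projection)
    return kd + hld_weight * hld

# Hypothetical toy sizes: vocabulary of 3, teacher width 3, student width 2.
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8],
                         [0.2, -0.1, 0.4], [0.1, 0.0], projection)
print(loss)
```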

2605.11509 2026-05-13 cs.AI cs.LG cs.MA cs.SY eess.SY

Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

Zijiang Yan, Hao Zhou, Wael Jaafar, Jianhua Pei, Ping Wang, Halim Yanikomeroglu, Hina Tabassum

Affiliations: Department of Electrical Engineering and Computer Science, York University; Samsung Research America; Department of Software and IT Engineering, École de technologie supérieure (ÉTS), University of Quebec; Non-Terrestrial Networks (Carleton-NTN) Lab and the Department of Systems and Computer Engineering, Carleton University

AI summary: This paper studies the joint optimization of UAV flight control and communication connectivity in an integrated terrestrial and non-terrestrial network (ITNTN). To coordinate multiple UAVs under dynamic, partially observable conditions, the authors propose an LLM-driven hierarchical multi-rate control framework that couples global load-balancing and handover decisions with local UAV motion control. Experiments report gains over existing methods: 14% higher transportation efficiency, 25% higher communication throughput, and a 23% lower collision rate.

Comments: Submission for possible publication

Abstract

Uncrewed aerial vehicles (UAVs) are increasingly deployed in complex networked environments, yet the joint optimization of multi-UAV motion control and connectivity remains a fundamental challenge. In this paper, we study a multi-UAV system operating in an integrated terrestrial and non-terrestrial network (ITNTN) comprising terrestrial base stations and high-altitude platform stations (HAPS). We consider a three-dimensional (3D) aerial highway scenario where UAVs must adapt their motion to ensure collision avoidance, efficient traffic flow, and reliable communication under dynamic and partially observable conditions. We first model the problem as a hierarchical multi-objective partially observable Markov decision process (H-MO-POMDP), capturing the coupling between control and communication objectives. Based on this formulation, we propose a large language model (LLM)-driven hierarchical multi-rate control framework. At the global level, an LLM-based controller on the HAPS performs long-term planning for load balancing and handover decisions. At the local level, each UAV employs a hybrid controller that integrates a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for faster UAV-to-infrastructure (U2I) communication and motion control. We further develop a high-fidelity 3D simulation platform by integrating the gym-pybullet-drones environment with 3GPP-compliant RF/THz channel models. Numerical results demonstrate that the proposed framework significantly outperforms state-of-the-art baselines, achieving a 14% increase in transportation efficiency and a 25% improvement in telecommunication throughput. Additionally, it achieves a 23% reduction in physical collision rates, demonstrating strong handover stability and zero-shot generalization in dynamic scenarios.

2605.11508 2026-05-13 cs.CV

LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

Yongcong Wang, Chengchao Shen, Guangwei Gao, Wei Wang, Pengwen Dai, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

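Affiliations: Central South University; Nanjing University of Science and Technology; Sun Yat-sen University; Shandong Normal University; Qilu University of Technology

AI summary: Ultra-high-definition video dehazing currently lacks an evaluation benchmark, and existing methods struggle to process 4K video in real time on consumer-grade GPUs. This paper proposes LiBrA-Net, which casts dehazing as a per-pixel affine transform governed by a low-frequency depth field and encodes it efficiently in bilateral grids, reaching real-time 4K dehazing at 25 FPS on a single GPU. The paper also releases UHV-4K, the first 4K video dehazing benchmark with depth, transmission, and optical-flow annotations, and reports state-of-the-art results on several datasets.

Comments: 10 pages, 5 figures

Abstract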

Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3--5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial--color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at https://anonymous.4open.science/r/LiBrA-Net-42B8.
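
The key observation, that dehazing reduces to a per-pixel affine transform governed by depth, can be checked against the standard atmospheric scattering model I = J*t + A*(1 - t) with transmission t = exp(-beta * depth). Solving for the clear pixel gives J = (1/t)*I + A*(1 - 1/t), a gain and an offset per pixel. The sketch below uses a scalar gain, which is a special case of the 3x3 GL(3) transforms named in the abstract; the parameter names, `beta`, and the clamp `t_min` are assumptions and not part of LiBrA-Net.

```python
import math

def dehaze_affine(depth, airlight, beta=1.0, t_min=0.05):
    """Return (gain, offset) of the affine map that inverts the haze model."""
    t = max(math.exp(-beta * depth), t_min)  # clamp to avoid huge gains
    gain = 1.0 / t
    offset = [a * (1.0 - gain) for a in airlight]
    return gain, offset

def apply_affine(pixel, gain, offset):
    return [gain * c + o for c, o in zip(pixel, offset)]

# Hypothetical pixel: synthesize haze, then undo it with the affine map.
clear = [0.2, 0.4, 0.6]
airlight = [0.9, 0.9, 0.9]
depth, beta = 1.0, 0.5
t = math.exp(-beta * depth)
hazy = [j * t + a * (1.0 - t) for j, a in zip(clear, airlight)]
gain, offset = dehaze_affine(depth, airlight, beta)
restored = apply_affine(hazy, gain, offset)
print(restored)
```

Because depth varies slowly across most of an image, the gain and offset fields are low-frequency, which is what makes a coarse bilateral-grid encoding plausible.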

2605.11506 2026-05-13 cs.CV

Principled Design of Diffusion-based Optimizers for Inverse Problems

Julio Oscanoa, Irmak Sivgin, Cagan Alkan, Daniel Ennis, John Pauly, Mert Pilanci, Shreyas Vasanawala

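Affiliations: Department of Bioengineering; Department of Radiology, Stanford University, USA

AI summary: This paper studies the design of diffusion-based optimizers for inverse problems, targeting long inference times and cumbersome hyperparameter tuning. The authors propose principled reparameterizations that let the same hyperparameters be reused across tasks without re-tuning. Building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, they develop the OptDiff pipeline to accelerate inference and improve image quality. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and better image quality.

Comments: 22 pages, 8 figures, 6 tables

Abstract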

Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.

2605.11504 2026-05-13 cs.LG cs.CR

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Dongjun Lee, Ga-eun Bae, Insu Yun

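Affiliations: School of Electrical Engineering, KAIST

AI summary: As large language models are applied to increasingly complex tasks, cybersecurity has become an important application area. Existing Capture The Flag (CTF) benchmarks, however, are exposed to data contamination and cheating, which undermines their reliability. This paper presents CTFusion, a streaming evaluation framework built on live CTF events. It keeps agents independent under a single team account and reduces its impact on the competition by forwarding only the first correct flag per challenge. CTFusion is implemented as a Model Context Protocol (MCP) server on the CTFd platform. Experiments suggest it is more robust than existing benchmarks, and it is released as open source.

Comments: 14 pages, 8 figures

Abstract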

Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. To achieve this, CTFusion preserves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which offers broad applicability to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.
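
The idea of forwarding only the first correct flag per challenge while keeping agents independent can be sketched as a small gatekeeper in front of the scoreboard. This is an illustration under assumptions, not CTFusion's code: the class, the `submit_upstream` callback, and the policy of answering later submissions locally once a challenge is solved are all hypothetical, and no real CTFd or MCP API is used.

```python
class FlagForwarder:
    """Sits between several agents and one team account on a live CTF."""

    def __init__(self, submit_upstream):
        # submit_upstream(challenge_id, flag) -> bool is a stand-in for the
        # real scoreboard submission; its signature is an assumption.
        self._submit_upstream = submit_upstream
        self._solved = {}        # challenge_id -> accepted flag
        self.agent_results = {}  # (agent_id, challenge_id) -> bool

    def submit(self, agent_id, challenge_id, flag):
        if challenge_id in self._solved:
            # Already solved for the team: judge locally, forward nothing.
            correct = flag == self._solved[challenge_id]
        else:
            correct = self._submit_upstream(challenge_id, flag)
            if correct:
                self._solved[challenge_id] = flag
        key = (agent_id, challenge_id)
        # Each agent is credited independently of what other agents solved.
        self.agent_results[key] = self.agent_results.get(key, False) or correct
        return correct

# Hypothetical scoreboard that records every forwarded submission.
forwarded = []
def fake_scoreboard(challenge_id, flag):
    forwarded.append((challenge_id, flag))
    return flag == "flag{demo}"

forwarder = FlagForwarder(fake_scoreboard)
r1 = forwarder.submit("agent-a", "web1", "flag{wrong}")
r2 = forwarder.submit("agent-a", "web1", "flag{demo}")
r3 = forwarder.submit("agent-b", "web1", "flag{demo}")
print(r1, r2, r3, len(forwarded))
```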

2605.11497 2026-05-13 cs.CV

PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee

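Affiliations: School of Computer Science and Engineering

AI summary: This paper studies semantic alignment in zero-shot skeleton-based action recognition (ZSSAR), arguing that by the time alignment begins, key semantic information such as human-object interactions and pose-relative visual cues has already been lost. It proposes PoseBridge, a framework that extracts pose-anchored semantic cues from intermediate representations of the pose-estimation process and transfers them to skeleton-text alignment through skeleton-conditioned bridging and semantic prototype adaptation. Experiments show improvements on several datasets, most clearly on the Kinetics-200/400 PURLS benchmark.

Abstract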

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.

2605.11496 2026-05-13 cs.AI cs.CY cs.HC cs.LG

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais

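Affiliations: Anthropic; OpenAI; UK AI Security Institute

AI summary: This paper examines how frontier AI models behave differently when they recognise an evaluation context, which means behaviour under test may differ from behaviour in deployment and weakens the reliability of safety evaluations. It introduces the Evaluation Differential (ED), defines a normalised effect-size form (nED) for cross-property comparison, and specifies TRACE, an audit protocol for analysing and restricting the safety claims drawn from evaluations more rigorously. The work has implications for AI system evaluation and governance.

Abstract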

Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.
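
The claim that marginal evaluation scores cannot identify ED can be illustrated with a toy calculation. The abstract does not give the formal definition, so this sketch assumes the simplest instance: ED is the difference between the rate of a target behaviour when the model recognises the evaluation and the rate when it does not, and the marginal score is the recognition-weighted mixture of the two. All numbers are hypothetical.

```python
def marginal_score(p_recognised, rate_recognised, rate_unrecognised):
    """Score observed in an evaluation where only part of the items are recognised."""
    return p_recognised * rate_recognised + (1.0 - p_recognised) * rate_unrecognised

def evaluation_differential(rate_recognised, rate_unrecognised):
    """Assumed simple form of ED: difference of the two conditional rates."""
    return rate_recognised - rate_unrecognised

# Two hypothetical models with the same marginal score but different ED.
model_a = {"p": 0.5, "recognised": 0.9, "unrecognised": 0.9}
model_b = {"p": 0.5, "recognised": 1.0, "unrecognised": 0.8}

for name, m in (("A", model_a), ("B", model_b)):
    score = marginal_score(m["p"], m["recognised"], m["unrecognised"])
    ed = evaluation_differential(m["recognised"], m["unrecognised"])
    print(name, score, ed)
```

Both models report the same headline score, yet only one of them behaves differently once it recognises the test, which is why the marginal score alone cannot distinguish them.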

2605.11494 2026-05-13 cs.CV

STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

Ankit Yadav, Arpit Garg, Ta Duc Huy, Lingqiao Liu

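Affiliations: Australian Institute for Machine Learning, Adelaide University, Australia

AI summary: STRIDE is a training-free, optimization-free method for increasing diversity in single-step diffusion models. It injects noise into intermediate features, aligned with the principal components of the model's own activations, to obtain controllable diversity gains. Because the perturbation follows the model's own feature structure, samples become more diverse while keeping high quality. Experiments on several datasets show improved diversity with strong text alignment, outperforming existing training-free baselines.

Comments: 11 pages, 3 figures, 4 tables

Abstract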

Distilled one-step (T=1) or few-step (T$\leq$4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.
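
The central operation, projecting noise onto the principal components of the model's activations before adding it to a feature, can be sketched in a few lines. This is a simplified illustration and not the STRIDE implementation: it assumes the principal directions are already available as orthonormal vectors, and it omits both the PCA step and the spatially coherent (pink) noise generation. The function names and `strength` parameter are hypothetical.

```python
def project_onto_components(noise, components):
    """Keep only the part of `noise` that lies in the span of `components`.

    `components` must be orthonormal vectors (e.g. top PCA directions).
    """
    projected = [0.0] * len(noise)
    for u in components:
        coeff = sum(n * ui for n, ui in zip(noise, u))
        for i, ui in enumerate(u):
            projected[i] += coeff * ui
    return projected

def perturb_feature(feature, noise, components, strength=0.1):
    delta = project_onto_components(noise, components)
    return [f + strength * d for f, d in zip(feature, delta)]

# Hypothetical 3-dimensional feature with two principal directions.
components = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
noise = [0.3, -0.2, 0.7]
projected = project_onto_components(noise, components)
perturbed = perturb_feature([1.0, 1.0, 1.0], noise, components, strength=0.5)
print(projected, perturbed)
```

The component of the noise outside the principal subspace (the third coordinate here) is discarded, so the perturbed feature moves only along directions the model already uses.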

2605.11491 2026-05-13 cs.LG cs.AI

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Huimin Xu, Shuai Zhao, Xiaobao Wu, Anh Tuan Luu

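Affiliations: Nanyang Technological University; Shanghai Jiao Tong University; VinUniversity

AI summary: This paper studies the entropy collapse that is common in reinforcement learning with verifiable rewards (RLVR). The analysis attributes it to an imbalanced token-level entropy flow, in which entropy-decreasing tokens consistently outweigh entropy-increasing ones. The authors propose On-Policy Entropy Flow Optimization (OPEFO), which adaptively rescales entropy-increasing and entropy-decreasing updates to balance the flow. Experiments show improved training stability and final performance on mathematical reasoning tasks.

Abstract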

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strictly on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.
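
One way to picture "balancing entropy flow" is to rescale per-token updates so that the total entropy increase and the total entropy decrease cancel. The abstract does not give OPEFO's actual rescaling rule, so the sketch below is only an illustrative stand-in: it assumes each token comes with an estimated entropy change, and it picks one scale for increasing tokens and one for decreasing tokens so that the weighted flow sums to zero.

```python
def balance_weights(entropy_deltas):
    """Per-token weights that equalize total entropy inflow and outflow.

    `entropy_deltas[i]` is the estimated entropy change caused by the update
    on token i (positive = entropy-increasing). Illustrative rule only.
    """
    inflow = sum(d for d in entropy_deltas if d > 0)
    outflow = -sum(d for d in entropy_deltas if d < 0)
    if inflow == 0 or outflow == 0:
        return [1.0] * len(entropy_deltas)  # nothing to balance against
    target = 0.5 * (inflow + outflow)
    up, down = target / inflow, target / outflow
    return [up if d > 0 else down if d < 0 else 1.0 for d in entropy_deltas]

# Hypothetical batch where entropy-decreasing tokens dominate.
deltas = [0.2, -0.5, -0.3]
weights = balance_weights(deltas)
net_flow = sum(w * d for w, d in zip(weights, deltas))
print(weights, net_flow)
```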

2605.11483 2026-05-13 cs.CL

StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models

Ishmam Khan, Sindhuja Thogarrati, Shuo Zhang

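Affiliations: Tufts University; Bose Corporation

AI summary: This paper studies philosophical alignment in small language models, focusing on Stoicism's inward-facing virtues and outward-facing cosmopolitan duties. The authors specialize small models on micro-datasets using preference optimization (ORPO, AlphaPO) and find that just 300 high-fidelity examples induce strong alignment with inward-facing virtues, closely approaching few-shot prompting. All models, however, perform poorly on Stoicism's outward-facing duties, pointing to a representational limitation of small models that micro-dataset optimization alone cannot overcome.

Abstract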

While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism's outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.

2605.11479 2026-05-13 cs.RO cs.AI

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

Hao Wang, Joshua Bowden, Colton Crosby, Somil Bansal

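Affiliations: University of Southern California; Stanford University

AI summary: This paper studies offline policy evaluation for robotic manipulation policies under sparse rewards. To handle non-monotonic task progression and the truncation bias caused by finite-length evaluation rollouts, it proposes a framework based on a liveness-based Bellman operator. The method treats policy evaluation as a task-completion problem, yields a value function that is robust to finite-horizon truncation, and comes with theoretical guarantees including contraction. Experiments on two simulated manipulation tasks and a cloth folding task show that it reflects task progress more accurately and reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.

Comments: Published at RSS 2026

Abstract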

Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, the task progression of evaluation rollouts is often non-monotonic because the policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods relying on Bellman equations/principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and a cloth folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
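
The abstract does not state the liveness-based Bellman operator explicitly. The sketch below assumes a form borrowed from the discounted reachability literature, V(s_t) = (1 - gamma) * l(s_t) + gamma * max(l(s_t), V(s_{t+1})), where l is a sparse task-completion signal, and evaluates it backward along one finite rollout. Treat it as an illustration of the general idea, not as the paper's operator; the terminal rule (use the last observed signal) is also an assumption.

```python
def liveness_values(success_signal, gamma=0.9):
    """Backward pass of an assumed discounted liveness backup along one rollout.

    `success_signal[t]` is the sparse completion signal l(s_t), e.g. 1.0 when
    the task is complete at step t and 0.0 otherwise.
    """
    values = [0.0] * len(success_signal)
    values[-1] = success_signal[-1]  # truncated rollout: no bootstrap beyond the end
    for t in range(len(success_signal) - 2, -1, -1):
        l_t = success_signal[t]
        values[t] = (1.0 - gamma) * l_t + gamma * max(l_t, values[t + 1])
    return values

# Success at the end: the value decays geometrically with time-to-completion.
late_success = liveness_values([0.0, 0.0, 1.0])
# Success reached and then lost: earlier states still get credit for reaching it.
lost_success = liveness_values([0.0, 1.0, 0.0])
print(late_success, lost_success)
```

Because of the max, the value of a state depends on the best completion signal reachable later in the rollout, discounted by how long it takes, rather than on a discounted sum of rewards.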

2605.11478 2026-05-13 cs.AI cs.IT math.IT stat.ML

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

Namyoon Lee, Yongjune Kim

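Affiliations: POSTECH

AI summary: This paper proposes FibQuant, a universal fixed-rate vector quantizer for random-access key-value (KV) cache compression, addressing the memory-traffic bottleneck of long-context inference. It keeps the normalize-rotate-store interface while replacing scalar tables with a shared radial-angular codebook matched to the canonical source, preserving the geometry created by the normalization step and improving compression efficiency. Experiments show higher compression ratios at high attention similarity and better results than existing scalar quantizers on the evaluated models.

Comments: 15 pages

Abstract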

Long-context inference is increasingly a memory-traffic problem. The culprit is the key--value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of $k$ consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce \textsc{FibQuant}, a universal fixed-rate vector quantizer that keeps the same normalize--rotate--store interface while replacing scalar tables by a shared radial--angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci\,/\,Roberts--Kronecker quasi-uniform directions, and multi-restart Lloyd--Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, \textsc{FibQuant} traces a memory--fidelity frontier from $5\times$ compression at $0.99$ attention cosine similarity to $34\times$ at $0.95$. End-to-end on TinyLlama-1.1B, it is within $0.10$ perplexity of fp16 at $4\times$ compression and has $3.6\times$ lower perplexity than scalar \textsc{TurboQuant} at $b = 2$ ($8\times$ compression), where scalar random-access quantization begins to fail.
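
The "quasi-uniform directions" ingredient can be illustrated in three dimensions with the classical Fibonacci lattice on the unit sphere. This sketch is not FibQuant: it stores the raw norm instead of a Beta-quantile radius, skips the random rotation and Lloyd-Max refinement, and handles only 3-dimensional blocks. The function names are hypothetical.

```python
import math

def fibonacci_sphere(n):
    """n quasi-uniform unit vectors on the 2-sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n          # evenly spaced heights
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of the circle at height z
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points

def quantize_direction(v, codebook):
    """Return (norm, index of the codebook direction with the largest cosine)."""
    norm = math.sqrt(sum(x * x for x in v))
    unit = [x / norm for x in v]
    index = max(range(len(codebook)),
                key=lambda i: sum(a * b for a, b in zip(unit, codebook[i])))
    return norm, index

codebook = fibonacci_sphere(64)   # 64 directions = 6 bits per 3-D block
norm, index = quantize_direction((0.0, 0.0, 2.0), codebook)
reconstructed = [norm * c for c in codebook[index]]
print(index, reconstructed)
```

Every block shares the same codebook, so decoding one vector needs only its stored index and norm, which is what keeps the scheme random-access.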

2605.11477 2026-05-13 cs.CV

LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs

Jingfeng Chen, Jiawen Qian, Wendi Deng, Yinuo Guo, Jiaqi Yu, Sicong Leng, Raghuveer Thirukovalluru, Bhuwan Dhingra

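Affiliations: Carnegie Mellon University; Individual Researcher; National University of Singapore; Nanyang Technological University; Duke University

AI summary: Video understanding in multimodal large language models requires selecting informative frames from long videos under a limited visual-token budget. This paper proposes LDDR, a linear Determinantal Point Process (DPP) based dynamic-resolution frame sampling framework that performs query-aware frame selection in a task-conditioned feature space and runs 3x faster than standard DPP baselines. LDDR introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, and outperforms existing methods across several video benchmarks.

Comments: 21 pages, 4 figures

Abstract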

Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
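
Query-aware DPP frame selection can be illustrated with the standard greedy MAP procedure on a kernel L[i][j] = relevance[i] * relevance[j] * similarity(i, j). This is a generic baseline-style sketch, not LDDR's linear-time variant or its Group DPP metric; the feature vectors and relevance scores are hypothetical, and features are assumed to be unit-normalized.

```python
import math

def build_kernel(features, relevance):
    """DPP kernel combining query relevance with frame-to-frame similarity."""
    n = len(features)
    return [[relevance[i] * relevance[j] *
             sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(n)] for i in range(n)]

def greedy_dpp(kernel, k):
    """Greedy MAP inference: repeatedly add the frame with the largest
    marginal gain in determinant, using incremental Cholesky-style updates."""
    n = len(kernel)
    chol_rows = [[] for _ in range(n)]
    gains = [kernel[i][i] for i in range(n)]
    selected = []
    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        if not candidates:
            break
        j = max(candidates, key=lambda i: gains[i])
        if gains[j] <= 1e-12:
            break  # remaining frames add no new information
        selected.append(j)
        root = math.sqrt(gains[j])
        for i in candidates:
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(chol_rows[j], chol_rows[i]))
            e = (kernel[j][i] - dot) / root
            chol_rows[i].append(e)
            gains[i] -= e * e
    return selected

# Hypothetical frames: 0 and 1 look alike, 2 is different but less relevant.
features = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
relevance = [1.0, 0.9, 0.8]
chosen = greedy_dpp(build_kernel(features, relevance), k=2)
print(chosen)
```

Point-wise relevance ranking would keep frames 0 and 1 here; the determinant penalizes their redundancy, so the greedy DPP keeps frames 0 and 2 instead.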

2605.11475 2026-05-13 cs.CV

Deep Probabilistic Unfolding for Quantized Compressive Sensing

Gang Qu, Ping Wang, Siming Zheng, Xin Yuan

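Affiliations: Westlake University, School of Engineering, Hangzhou, Zhejiang, China; Vivo Mobile Communication Co., Ltd., Hangzhou, Zhejiang, China

AI summary: This paper proposes a deep probabilistic unfolding model for quantized compressive sensing, using an unfolding framework to improve reconstruction accuracy and efficiency. Unlike prior methods that apply an L2 projection, it derives a closed-form, numerically stable likelihood gradient projection that respects the true quantization physics, turning the hard quantization constraint into soft probabilistic guidance. An efficient dual-domain Mamba module dynamically captures and fuses multi-scale local and global features, strengthening interactions between distant but correlated regions. Experiments report state-of-the-art performance.

Abstract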

We propose a deep probabilistic unfolding model to address the classical quantized compressive sensing problem that leverages an unfolding framework to enhance the reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply L2 projection to measurements, we derive a closed-form, numerically stable likelihood gradient projection, which allows the model to respect the true quantization physics, turning the hard quantization constraint into a soft probabilistic guidance. Furthermore, an efficient, dual-domain Mamba module is specifically designed to dynamically capture and fuse the multi-scale local and global features, ensuring the interactions between the distant but correlated regions. Extensive experiments demonstrate the state-of-the-art performance of the proposed method over previous works, which is capable of promoting the application of quantized compressive sensing in real life.
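
To show what a likelihood gradient for a quantized measurement looks like, the sketch below assumes additive Gaussian noise of standard deviation `sigma` before a quantizer that reports only the bin [lower, upper]. The likelihood of a pre-quantization value z is then Phi((upper - z)/sigma) - Phi((lower - z)/sigma), and its log-gradient has a closed form. This is the textbook expression under that assumption; it is not the paper's numerically stable variant, and the naive CDF difference used here loses precision far from the bin.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantized_loglik_grad(z, lower, upper, sigma):
    """d/dz log P(measurement falls in [lower, upper] | noiseless value z)."""
    a = (lower - z) / sigma
    b = (upper - z) / sigma
    mass = normal_cdf(b) - normal_cdf(a)
    # Guard against division by zero when z is far outside the bin.
    return (normal_pdf(a) - normal_pdf(b)) / (sigma * max(mass, 1e-300))

# The gradient pulls z toward the observed bin and vanishes at its center.
for z in (-1.0, 0.5, 2.0):
    print(z, quantized_loglik_grad(z, lower=0.0, upper=1.0, sigma=1.0))
```

Unlike an L2 projection onto the dequantized value, this gradient is small anywhere inside the bin and grows only as z moves away from it, which is the "soft probabilistic guidance" the abstract describes.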

2605.11473 2026-05-13 cs.AI cs.LG cs.RO stat.ML

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao

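Affiliations: UC Irvine; George Washington University; UC Santa Barbara; UT Dallas

AI summary: This paper studies how to optimize the on-policy method PPO for multi-task reinforcement learning. It identifies ill-conditioned critic-side gradients in the multi-task setting, which can cause learning on some tasks to stall. The authors propose TOPPO, which adds Critic Balancing modules to improve gradient conditioning and balance learning across tasks. Experiments show that TOPPO outperforms published SAC-family and ARS-family baselines with fewer parameters and environment steps, with stronger mean and tail-task performance on the Meta-World+ benchmark, indicating that properly optimized on-policy methods can rival or exceed off-policy approaches.

Abstract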

Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing -- a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.

2605.11471 2026-05-13 cs.LG

On the Approximation Complexity of Matrix Product Operator Born Machines

Chao Li, Zerui Tao, Yuchen Cong, Jian Xu, Qibin Zhao

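Affiliations * RIKEN-AIP

AI summary This paper studies the approximation complexity of matrix product operator Born machines (MPO-BMs) for probabilistic modeling. The authors characterize the limits of MPO-BM approximation from both negative and positive directions. They prove that KL approximation is NP-hard in the continuous setting, which rules out universal efficient approximation in the worst case. Under locality and spectral-gap conditions, they show that structured targets such as path-graph Markov random fields admit MPO-BM approximations with polynomial bond dimension and provable KL guarantees, and that polynomially many score queries suffice to estimate the induced Hamiltonian and obtain those guarantees. Together the results give a theoretical account of when MPO-BMs are hard to approximate and when they are efficiently learnable.

English abstract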

Matrix product operator Born machines (MPO-BMs) are tractable tensor-network models for probabilistic modeling, but their efficient approximation capability remains unclear. We characterize this boundary from both negative and positive perspectives. First, we prove that KL approximation is NP-hard for MPO-BMs in the continuous setting, ruling out universal efficient approximation in the worst case. Second, for score-based variational inference, we show that, under locality and spectral-gap conditions on the loss-induced Hamiltonian, structured targets (e.g., path-graph Markov random fields) admit MPO-BM approximations with polynomial bond dimension and provable KL guarantees. Third, under the same locality structure, we prove that polynomially many score queries suffice to estimate the induced Hamiltonian and obtain such guarantees. Our results provide a theoretical characterization of when MPO-BMs are fundamentally hard to approximate and when they become efficiently learnable.

2605.11467 2026-05-13 cs.LG cs.AI

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

Swapnil Parekh

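Affiliations * Intuit

AI summary This work proposes ProFIL, a method for reducing "reasoning theater" in the chain-of-thought of large language models: reasoning steps generated after the model has already settled on an answer, which look deliberative but contribute nothing to correctness. ProFIL trains a multi-head attention probe on the frozen base model to detect these post-commitment steps and suppresses them during GRPO reinforcement learning, which improves chain faithfulness and shortens chains while preserving or improving task accuracy. Experiments report gains across several reasoning tasks and model architectures.

English abstract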

Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of *reasoning theater*: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce **ProFIL** (**Pro**be-**Fil**tered Reinforcement Learning) to *reduce theater, increase chain-of-thought faithfulness, and shrink chain length* in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained *once* on the *frozen* base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. *Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable signal that suppresses theater while resisting the RL-obfuscation failure mode predicted by prior work.* Across four reasoning domains (GSM8K, LiveCodeBench, ToolUse, MMLU-Redux) and two model architectures (Llama-8B, Qwen-7B), ProFIL reduces post-commitment theater by **11--100%**, raises faithful-fraction (e.g., +24pp on LiveCodeBench under an independent Claude 3.7 Sonnet judge), and shortens chains by 4--19%, all while preserving or improving task accuracy. ProFIL also beats a matched length-penalty GRPO baseline, isolating the gain as semantic commitment-detection rather than chain compression. Probe weights, training configurations, and rollouts are released across all four domains.
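
The filtering rule in the abstract (zero the advantage of rollouts whose probe score exceeds a threshold) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the 0.5 threshold, the use of the population standard deviation, and the toy rewards and probe scores are all hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group.

    Assumption: population standard deviation; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def probe_filtered_advantages(rewards, probe_scores, threshold=0.5):
    """Zero the advantage of every rollout whose probe score exceeds the threshold."""
    advantages = group_relative_advantages(rewards)
    return [0.0 if score > threshold else adv
            for adv, score in zip(advantages, probe_scores)]

# Toy group of four rollouts (hypothetical values).
rewards = [1.0, 0.0, 1.0, 0.0]
probe_scores = [0.9, 0.2, 0.1, 0.7]   # higher = more post-commitment "theater"
filtered = probe_filtered_advantages(rewards, probe_scores, threshold=0.5)
```

Rollouts 0 and 3 exceed the threshold, so they contribute no gradient signal regardless of reward; rollouts 1 and 2 keep their standardized advantages.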

2605.11463 2026-05-13 cs.CV

Encore: Conditioning Trajectory Forecasting via Biased Ego Rehearsals

Conghao Wong, Ziqian Zou, Xinge You

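Affiliations * Huazhong University of Science and Technology

AI summary This paper addresses how to learn and represent agents' subjectivities in trajectory prediction, a challenging but crucial problem. The proposed method, Encore, introduces a biased ego-rehearsal mechanism: from short-term observations, the model generates biased rehearsal trajectories for all participants in the scene and uses them as conditions to guide the final prediction, so that it better simulates the subjective behavior of different agents. Experiments show performance gains across datasets and offer a clear interpretation of subjectivity in trajectories.

English abstract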

Learning and representing the subjectivities of agents has become a challenging but crucial problem in the trajectory prediction task. Such subjectivities not only present specific spatial or temporal structures, but are also anisotropic for all interaction participants. Despite great efforts, it remains difficult to explicitly learn and forecast these subjectivities, let alone further modulate models' predictions through a specific ego's subjectivity. Inspired by prefactual thoughts in psychology and relevant theatrical concepts, we interpret such subjectivities in future trajectories as the continuous process from rehearsal to encore. In the rehearsal phase, the proposed ego predictor focuses on how each ego agent learns to derive and direct a set of explicitly biased rehearsal trajectories for all participants in the scene from the short-term observations. Then, these rehearsal trajectories serve as immediate controls to condition final predictions, providing direct yet distinct ego biases for the prediction network to simulate agents' various subjectivities. Experiments across datasets not only demonstrate a consistent improvement in the performance of the proposed Encore trajectory prediction model but also provide clear interpretability regarding subjectivities as biased ego rehearsals.

2605.11462 2026-05-13 cs.CV cs.AI

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

Zishan Liu, Ruoxi Zang, Yanglin Zhang, Wei Liu, Yin Zhang, Jian Yao, Jiayin Zheng, Zhengzhe Liu

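Affiliations * Lingnan University; XPENG Robotics

AI summary This work proposes SpatialForge, a scalable data synthesis pipeline that generates supervision for 3D spatial reasoning from open-world 2D images, targeting the weak spatial reasoning of current large vision-language models. It decomposes spatial reasoning into perception and relation and builds structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, so that high-quality spatial QA data can be produced automatically. Using this pipeline, the authors build SpatialForge-10M, a dataset of 10 million spatial QA pairs, and validate it on multiple spatial reasoning benchmarks, where it significantly improves the spatial reasoning ability of vision-language models.

English abstract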

Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.

2605.11460 2026-05-13 cs.LG cs.SY eess.SY

Beyond Prediction: Interval Neural Networks for Uncertainty-Aware System Identification

Mehmet Ali Ferah, Tufan Kumbasar

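Affiliations * Artificial Intelligence and Intelligent Systems Laboratory, Istanbul Technical University, Istanbul, Türkiye

AI summary This paper proposes an interval neural network (INN) framework for uncertainty-aware system identification, addressing the inability of traditional approaches to capture uncertainty when modeling nonlinear dynamical systems. By extending conventional neural networks into interval form, the authors develop Interval LSTM and NODE models that propagate uncertainty, and propose two training strategies: Cascade INN (C-INN), a two-stage approach, and Joint INN (J-INN), which optimizes prediction accuracy and interval precision jointly in one stage. Experiments on multiple system identification datasets show strong performance, and the paper introduces channel-wise elasticity to analyze how uncertainty is distributed across model parameters.

Comments Under review

English abstract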

System identification (SysID) is critical for modeling dynamical systems from experimental data, yet traditional approaches often fail to capture nonlinear behaviors. While deep learning offers powerful tools for modeling such dynamics, incorporating uncertainty quantification is essential to ensure reliable predictions. This paper presents a systematic framework for constructing and training interval Neural Networks (INNs) for uncertainty-aware SysID. By extending crisp neural networks into interval counterparts, we develop Interval LSTM and NODE models that propagate uncertainty through interval arithmetic without probabilistic assumptions. This design allows them to represent uncertainty and produce prediction intervals. For training, we propose two strategies: Cascade INN (C-INN), a two-stage approach converting a trained crisp NN into an INN, and Joint INN (J-INN), a one-stage framework jointly optimizing prediction accuracy and interval precision. Both strategies employ uncertainty-aware loss functions and parameterization tricks to ensure reliable learning. Comprehensive experiments on multiple SysID datasets demonstrate the effectiveness of both approaches and benchmark their performance against well-established uncertainty-aware baselines: C-INN achieves superior point prediction accuracy, whereas J-INN yields more accurate and better-calibrated prediction intervals. Furthermore, to reveal how uncertainty is represented across model parameters, the concept of channel-wise elasticity is introduced, which is used to identify distinct patterns across the two training strategies. The results of this study demonstrate that the proposed framework effectively integrates deep learning with uncertainty-aware modeling.
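
To make "propagating uncertainty through interval arithmetic" concrete, the following is a minimal sketch of a single neuron whose weights, inputs, and bias are all intervals. It illustrates standard interval arithmetic only; it is not the paper's Interval LSTM or NODE, and the function names and numeric values are hypothetical.

```python
import math

def interval_add(a, b):
    """Sum of two intervals given as (lo, hi)."""
    return (a[0] + b[0], a[1] + b[1])

def interval_mul(a, b):
    """Product of two intervals: min and max over the four endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def interval_neuron(weights, inputs, bias):
    """Pre-activation interval of one neuron with interval weights, inputs, and bias."""
    total = bias
    for w, x in zip(weights, inputs):
        total = interval_add(total, interval_mul(w, x))
    return total

def interval_tanh(a):
    """tanh is monotone increasing, so it maps endpoints to endpoints."""
    return (math.tanh(a[0]), math.tanh(a[1]))

# Hypothetical interval parameters and inputs.
weights = [(0.5, 1.0), (-1.0, -0.5)]
inputs = [(1.0, 2.0), (0.0, 1.0)]
bias = (0.0, 0.5)

pre = interval_neuron(weights, inputs, bias)   # (-0.5, 2.5)
out = interval_tanh(pre)
```

The output is a guaranteed enclosure of every value the crisp neuron could take for parameters and inputs inside the given intervals, which is how a prediction interval arises without any probabilistic assumption.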

2605.11448 2026-05-13 cs.LG cs.AI

Deep Minds and Shallow Probes

Su Hyeong Lee, Risi Kondor

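Affiliations * Department of Statistics, University of Chicago; Department of Statistics and Department of Computer Science, University of Chicago

AI summary This paper studies the symmetries of hidden coordinates in neural representations across equivalent realizations and argues that structure should be revealed with symmetry-stable shallow probes rather than probes tied to a particular basis. Working in the exact setting of the final readout layer, the authors identify a unique hierarchy of shallow probes whose degree-1 member is the linear probe. They also show that cross-model probe transfer should be based on the probe-visible quotient of the representation rather than the full hidden state; experiments on synthetic and real-world tasks support the approach.

English abstract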

Neural representations are not unique objects. Even when two systems realize the same downstream computation, their hidden coordinates may differ by reparameterization. A probe family intended to reveal structure already present in a representation should therefore be stable under the relevant representation symmetries rather than be tied to a particular basis. We study this group action in the tractable exact setting of the final readout layer, where equivalent realizations induce affine changes of hidden coordinates. The resulting symmetry principle singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. We also show that a natural object for cross-model probe transfer is a shared probe-visible quotient--the representation modulo directions invisible to the probe family--rather than the full hidden state. Experiments on synthetic and real-world tasks support both predictions, showing where degree-2 probes help beyond linear ones and how quotient-based transfer enables coverage-aware monitor portability across model families. These results point toward a broader geometric representation theory of neural probing, with coverage-aware monitor transfer as a concrete operational consequence.
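
As a small illustration of coordinate stability for the degree-1 case: if hidden coordinates change by an invertible affine map h' = A h + c, a probe w·h + b can be re-expressed as w'·h' + b' with w' = A^{-T} w and b' = b - w'·c, so the affine (linear-plus-bias) probe family is closed under such changes. The sketch below checks this numerically in two dimensions; the helper names and all numbers are hypothetical, and it does not cover the paper's higher-degree probes or quotient construction.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def transport_probe(w, b, A, c):
    """Re-express the probe w.h + b in coordinates h' = A h + c."""
    w_new = mat_vec(transpose(inverse_2x2(A)), w)   # w' = A^{-T} w
    b_new = b - dot(w_new, c)                       # b' = b - w'.c
    return w_new, b_new

# Hypothetical probe, reparameterization, and hidden state.
w, b = [1.0, 3.0], 0.5
A, c = [[2.0, 1.0], [0.0, 1.0]], [1.0, -1.0]
h = [3.0, 4.0]

h_new = [x + ci for x, ci in zip(mat_vec(A, h), c)]
w_new, b_new = transport_probe(w, b, A, c)

original_readout = dot(w, h) + b
transported_readout = dot(w_new, h_new) + b_new
```

Both readouts agree, so what the probe detects does not depend on which of the two equivalent coordinate systems is used.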

2605.11438 2026-05-13 cs.CV

Beyond Masks: The Case for Medical Image Parsing

Siddharth Gupta, Alan L. Yuille, Zongwei Zhou

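Affiliations * Johns Hopkins University; Northwestern University; Johns Hopkins Medicine

AI summary This paper proposes medical image parsing as the central output of medical imaging research, arguing that the field should move beyond per-voxel segmentation masks toward structured representations of entities, attributes, and relationships that describe image content more completely. It finds that current systems handle entities well but remain severely lacking in attributes, relationships, and closure. The authors argue for richer outputs and training signals that reward them, moving models from measuring toward explaining and closer to clinical needs.

English abstract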

Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.

2605.11436 2026-05-13 cs.CL cs.AI

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

Joykirat Singh, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Akshay Nambi, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

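Affiliations * UNC Chapel Hill; The University of Texas at Austin; Microsoft Research

AI summary This paper proposes Agent-BRACE, a method that addresses uncertainty management and context growth when large language models act in long-horizon, partially observable environments. It decouples the belief state from the policy and builds a structured belief representation from natural-language claims annotated with verbalized certainty labels, helping the model handle uncertainty more effectively when making decisions. Experiments show significant gains on multiple long-horizon tasks while remaining robust to context length.

Comments Code: https://github.com/joykirat18/Agent-BRACE

English abstract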

Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
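
The structured belief described above (atomic claims, each with an ordinal certainty label) maps naturally onto a small data structure. The sketch below is an assumption-laden illustration, not the repository's code: only the endpoints "certain" and "unknown" come from the abstract, while the intermediate labels, class names, rendering format, and example claims are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum

class Certainty(IntEnum):
    """Ordinal certainty scale; the two middle levels are hypothetical."""
    UNKNOWN = 0
    UNCERTAIN = 1
    LIKELY = 2
    CERTAIN = 3

@dataclass(frozen=True)
class Claim:
    text: str
    certainty: Certainty

def render_belief(claims):
    """Render claims as compact text for a policy model, most certain first."""
    ordered = sorted(claims, key=lambda c: c.certainty, reverse=True)
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in ordered)

belief = [
    Claim("the key is in the drawer", Certainty.LIKELY),
    Claim("the door is locked", Certainty.CERTAIN),
    Claim("where the lamp is", Certainty.UNKNOWN),
]
prompt = render_belief(belief)
```

Because the policy would condition on `prompt` rather than the full interaction history, its input size depends on the number of claims, not on how many steps the episode has run.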

2605.11435 2026-05-13 cs.CV

ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models

Hai Jiang, Zhen Liu, Yinjie Lei, Songchen Han, Bing Zeng, Shuaicheng Liu

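Affiliations * School of Aeronautics and Astronautics, Sichuan University; University of Electronic Science and Technology of China; College of Electronics and Information Engineering, Sichuan University

AI summary This paper proposes ZeroIDIR, a zero-reference diffusion-based framework for restoring images with illumination degradation. The method is trained only on low-quality degraded images and decouples illumination correction from diffusion-based reconstruction. An adaptive gamma correction module and a histogram-guided illumination correction loss improve illumination consistency and provide reliable inputs for the subsequent diffusion process. A perturbed diffusion consistency loss further improves the detail fidelity and stability of the restored images. Experiments on several public datasets show that the method outperforms existing unsupervised methods and generalizes well across scenes.

Comments Accepted by CVPR 2026

English abstract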

In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction, generating representations in which only the illumination has been corrected; these mitigate exposure bias and serve as reliable inputs for the subsequent diffusion process. A histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at https://github.com/JianghaiSCU/ZeroIDIR.

2605.11430 2026-05-13 cs.CV cs.AI cs.LG

Diabetic Retinopathy Classification using Downscaling Algorithms and Deep Learning

Nishi Doshi, Urvi Oza, Pankaj Kumar

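Affiliations * Dhirubhai Ambani Institute of Information and Communication Technology

AI summary This study addresses the large and varying image sizes encountered in diabetic retinopathy (DR) classification by preprocessing retinal images with several downscaling algorithms before feeding them to a deep learning network. It combines the Kaggle and Indian Diabetic Retinopathy Image datasets and runs classification experiments on a Multi Channel Inception V3 architecture, reporting accuracy, specificity, and sensitivity that exceed prior methods and offering a more effective approach to automated DR grading.

Journal ref 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN)

English abstract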

Diabetic Retinopathy (DR) is an art and science of recording and classifying the retinal images of a diabetic patient. DR classification deals with classifying retinal fundus images into five stages on the basis of the severity of diabetes. One of the major issues faced in the DR classification problem is the large and varying size of images. In this paper we propose and explore the use of several downscaling algorithms before feeding the image data to a Deep Learning Network for classification. To improve training and testing, we amalgamate two datasets: Kaggle and the Indian Diabetic Retinopathy Image Dataset. Our experiments have been performed on a novel Multi Channel Inception V3 architecture with a unique self-crafted preprocessing phase. We report the results of the proposed approach using accuracy, specificity and sensitivity, which outperform the previous state-of-the-art methods. Index Terms: Diabetic Retinopathy, Downscaling Algorithms, Multichannel CNN Architecture, Deep Learning

2605.11428 2026-05-13 cs.LG

FastUMAP: Scalable Dimensionality Reduction via Bipartite Landmark Sampling

Hongmin Li

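Affiliations * School of Life Science and Technology, Institute of Science Tokyo; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences

AI summary This paper proposes FastUMAP, a scalable dimensionality reduction method aimed at the poor computational efficiency of nonlinear dimensionality reduction in repeated-use settings. The method is based on landmark sampling over a bipartite graph: it builds a sparse point-landmark fuzzy graph, uses the Nystrom method for spectral initialization, and then optimizes a UMAP-style objective on the bipartite graph, substantially improving speed at some cost in accuracy. Experiments show faster runtimes than conventional methods on most of the benchmark datasets, making it suitable for workflows that rerun embeddings frequently.

Comments 17 pages, 5 figures

English abstract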

Exploratory analysis of high-dimensional data rarely stops at a single embedding. In practice, analysts rerun dimensionality reduction after changing preprocessing, subsets, or hyperparameters, and standard nonlinear methods can quickly become the bottleneck. We introduce FastUMAP (Bipartite Manifold Approximation and Projection), a landmark-based method designed for this repeated-use setting. FastUMAP builds a sparse point-landmark fuzzy graph, computes a Nystrom spectral warm start from the induced landmark affinity, and then refines all sample coordinates with a UMAP-style objective on the bipartite graph. The landmark ratio r = m/n provides a direct way to trade runtime against fidelity. On 9 benchmark datasets spanning 178 to 70,000 samples, FastUMAP has the lowest runtime on 7 datasets in our reported default-implementation comparison on one workstation. On MNIST and Fashion-MNIST (n=70000), it runs in about 4.6 seconds, compared with about 73--75 seconds for Barnes--Hut t-SNE, while reaching 91.4% mean kNN accuracy versus 94.6% for the strongest accuracy baseline. FastUMAP is therefore best viewed as a fast option for repeated exploratory embedding, rather than as a replacement for accuracy-first methods.
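
The role of the landmark ratio r = m/n can be illustrated with a brute-force sketch that picks m landmarks and connects every point to its k nearest landmarks. This is only an illustration of the bipartite structure: uniform random landmark selection, the function names, and the parameter values are assumptions, and the fuzzy edge weights, Nystrom warm start, and UMAP-style refinement from the abstract are omitted.

```python
import math
import random

def choose_landmarks(n, ratio, seed=0):
    """Pick m = round(ratio * n) landmark indices (assumption: uniform sampling)."""
    m = max(1, round(ratio * n))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), m))

def bipartite_knn(points, landmark_ids, k):
    """For each point, return its k nearest landmarks as (landmark_id, distance)."""
    edges = []
    for p in points:
        dists = sorted((math.dist(p, points[j]), j) for j in landmark_ids)
        edges.append([(j, d) for d, j in dists[:k]])
    return edges

# Hypothetical 2-D data set of n = 100 points.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(100)]

landmarks = choose_landmarks(len(points), ratio=0.1)   # m = 10
edges = bipartite_knn(points, landmarks, k=3)
```

In this brute-force version the graph costs n·m distance evaluations rather than the n² of an all-pairs search, so lowering r reduces work at the price of a coarser neighborhood structure.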

2605.11427 2026-05-13 cs.CV

PD-4DGS: Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming

Jiachen Li, Guangzhi Han, Jin Wan, Delong Han, Yuan Gao, Min Li, Mingle Zhou, Gang Li

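Affiliations * Qilu University of Technology

AI summary PD-4DGS is a progressive compression framework for 4D Gaussian Splatting aimed at dynamic scene streaming. It targets the long rendering waits of existing 4DGS models on bandwidth-limited devices and their incompatibility with adaptive-bitrate delivery. Through Hierarchical Deformation Decomposition (HDD), it splits the motion structure of 4DGS into three independently transmittable layers so that any prefix of the stream is renderable, enabling scalable streaming. Experiments show that PD-4DGS substantially reduces transmitted data and first-frame latency at matched rendering quality, offering a practical route to on-demand 4DGS streaming on mobile devices.

English abstract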

4D Gaussian Splatting (4DGS) enables high-quality dynamic novel view synthesis, yet current models remain monolithic bitstreams that clients must download in full before any frame can be rendered, causing black-screen waits of tens to hundreds of seconds on mobile bandwidth and leaving 4DGS incompatible with modern adaptive-bitrate delivery. Progressive 3DGS compression alleviates this for static scenes, but it acts only on spatial anchors and cannot partition the temporal deformation networks that dominate dynamic-scene size. We present PD-4DGS, the first framework for progressive compression and on-demand transmission of 4DGS. Hierarchical Deformation Decomposition (HDD) externalises the coarse-to-fine motion hierarchy already latent in 4DGS into three independently transmittable layers -- a static scaffold, a global deformation, and a local refinement -- so that any prefix of the bitstream is already renderable, turning a single training run into a scalable, DASH/HLS-compatible bitstream. A Gaussian-entropy attribute rate-distortion loss together with a temporal mask consistency regulariser shrink the base layer while suppressing low-bitrate flicker; a capacity-weighted rollout schedule, gated online by a learnt activation rate rho, then prevents deformation-network under-training without any per-scene hyperparameter. On the Dycheck iPhone benchmark, PD-4DGS cuts the streamed bitstream by >60% at matched rendering fidelity and reduces first-frame latency from 73--930 s to ~1.7 s on a 2 Mbps link, uniquely enabling true on-demand progressive streaming for 4DGS.
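
The latency argument is plain transfer arithmetic: time equals size in bits divided by link rate, and with a prefix-renderable stream the first frame only waits for the base layer. The sketch below works this through; the three layer names come from the abstract, but the layer sizes are hypothetical values chosen for illustration and are not reported in the paper.

```python
def download_seconds(size_bytes, link_bps):
    """Transfer time for a payload on a link of the given bit rate."""
    return size_bytes * 8 / link_bps

def bytes_for_wait(seconds, link_bps):
    """Payload size that would take the given time, if the wait were pure transfer."""
    return seconds * link_bps / 8

def prefix_ready_times(layers, link_bps):
    """Cumulative arrival time of each renderable prefix of the layered stream."""
    times, total = [], 0
    for name, size in layers:
        total += size
        times.append((name, download_seconds(total, link_bps)))
    return times

LINK_BPS = 2_000_000  # the 2 Mbps link from the abstract

# Hypothetical layer sizes in bytes (not from the paper).
layers = [
    ("static scaffold", 425_000),
    ("global deformation", 2_000_000),
    ("local refinement", 5_000_000),
]
times = prefix_ready_times(layers, LINK_BPS)
monolithic_bytes = bytes_for_wait(73, LINK_BPS)
```

With these assumed sizes the base layer arrives in 1.7 s and rendering can start, while the later layers refine the result as they land; a monolithic stream would have to wait for the full total before showing any frame.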

2605.11426 2026-05-13 cs.AI

A Mechanistic Investigation of Supervised Fine Tuning

Ruhaan Chopra

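Affiliations * Independent Researcher

AI summary This study examines how supervised fine-tuning (SFT) affects the activations of large language models. It finds that although the cosine similarity of hidden activations before and after fine-tuning is very high, the sparse latents obtained by projecting through a pretrained sparse autoencoder (SAE) differ significantly. The study proposes an SAE-based analysis pipeline that reveals task-specific and layer-specific changes in semantic features during fine-tuning and identifies a layer-wise update profile associated with safety alignment, providing a high-resolution diagnostic tool for understanding the mechanics of SFT.

English abstract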

The cosine similarity between a large language model's hidden activations before and after Supervised Fine-Tuning (SFT) remains very high. This, at first glance, suggests that SFT leaves the model's activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline which utilizes these pretrained SAEs as a high-resolution diagnostic tool to mechanistically investigate the drivers of this representational divergence. Through our analytical pipeline, we discover task-specific and layer-specific distributions of the precise semantic features that are systematically altered during supervised fine-tuning. We additionally identify a layer-wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at: https://github.com/ruhzi/sae-investigation.
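
The contrast between near-identical activations and diverging sparse latents can be reproduced in a toy setting. The sketch below uses a hand-picked ReLU encoder, not a trained SAE, and the activation vectors, weights, and the Jaccard overlap metric are all assumptions made for illustration; the linked repository is the reference for the actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def sae_encode(x, weights, biases):
    """Toy SAE encoder: ReLU(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def active_set(latents):
    """Indices of latents that fire."""
    return {i for i, z in enumerate(latents) if z > 0.0}

def jaccard(a, b):
    """Overlap between two sets of active latents."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical activations before and after fine-tuning.
base = [1.0, 1.0]
tuned = [1.05, 0.95]

# Hand-picked encoder whose features are sensitive to small directional shifts.
W = [[10.0, -10.0], [-10.0, 10.0], [1.0, 1.0], [-10.0, 10.0]]
b = [-0.5, -0.5, -1.0, 0.5]

similarity = cosine(base, tuned)
base_active = active_set(sae_encode(base, W, b))
tuned_active = active_set(sae_encode(tuned, W, b))
overlap = jaccard(base_active, tuned_active)
```

Here the cosine similarity stays above 0.99 while only one of the three latents that fire in either state is shared, which is the kind of gap a dense similarity metric hides and a sparse dictionary exposes.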