arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.05414 2026-05-18 cs.CV

TSBOW -- Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions

Ngoc Doan-Minh Huynh, Duong Nguyen-Ngoc Tran, Long Hoang Pham, Tai Huu-Phuong Tran, Hyung-Joon Jeon, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Son Hong Phan, Quoc Pham-Nam Ho, Chi Dai Tran, Trinh Le Ba Khanh, Jae Wook Jeon

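Affiliations Automation Lab, Department of Electrical and Computer Engineering

AI summary As global warming increases the frequency and intensity of extreme weather events, existing traffic surveillance datasets struggle to support occluded-vehicle detection under complex weather conditions. This study therefore introduces the TSBOW dataset, which comprises more than 32 hours of real-world urban traffic video covering diverse weather conditions and occlusion scenarios, with more than 48,000 manually annotated frames, and aims to improve the detection of traffic participants in adverse weather. TSBOW offers an important resource for research on intelligent transportation systems and advances CCTV-based traffic monitoring.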

Comments This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence. 40(2026). 5239-5247

Abstract

Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.

2602.04909 2026-05-18 cs.LG

Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment

Youngjae Cho, Jongsuk Kim, Ji-Hoon Kim

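Affiliations PYLER; KAIST

AI summary This paper studies how to make preference alignment of large language models more robust under noisy supervision. To address the failure of a fixed reference policy as the policy drifts, the authors propose Geometric Anchor Preference Optimization (GAPO), which generates an adversarial local perturbation of the current policy as a pessimistic baseline and uses it to adaptively reweight preference pairs. The method suppresses the influence of brittle samples and improves robustness across noise settings while matching or exceeding existing methods on standard alignment and reasoning tasks.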

Comments Under Review

Abstract

Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, a static reference, however, can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
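For orientation, the fixed-reference objective that the abstract contrasts GAPO with can be sketched as follows. This is the standard DPO loss for a single preference pair, not GAPO itself; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair (w = chosen response, l = rejected response).

    All arguments are sequence log-probabilities; the ref_* values come from
    the fixed reference policy. Names and beta are assumptions for illustration.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen response than the fixed reference policy does.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) = softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference, the margin is zero and the loss is log 2 for every pair, which is why a stale reference gives no per-pair signal about how fragile a preference is.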

2602.04163 2026-05-18 cs.LG

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong

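Affiliations Southwestern University of Finance and Economics; The University of Hong Kong; Artificial Intelligence and Digital Finance Key Laboratory of Sichuan Province; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong Polytechnic University; Sun Yat-sen University

AI summary In resource-constrained settings, large language model inference is often limited by memory and bandwidth, making quantization key to efficiency. Existing post-training quantization methods retain high accuracy at 4 bits but degrade markedly at 2-3 bits, mainly because a fixed quantization grid restricts error minimization. This paper proposes BPDQ, a variable-grid quantization method based on bit-plane decomposition, which builds the quantization grid from bit-planes and scalar coefficients and refines them iteratively with second-order information while progressively compensating for quantization error to minimize output discrepancy. Experiments show that BPDQ can still serve very large models with high accuracy at 2 bits, and theoretical analysis shows that the variable grid expands the feasible set for error minimization.

Abstract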

Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. In essence, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using second-order information while progressively compensating for quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we theoretically show that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. The code is available at https://github.com/KingdalfGoodman/BPDQ.
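A minimal sketch of why a grid built from bit-planes and scalar coefficients is more flexible than a fixed uniform grid. It only enumerates grid levels and rounds to the nearest one; the second-order refinement and error compensation described in the abstract are not modeled, and the coefficient values and helper names are hypothetical.

```python
from itertools import product

def grid_levels(offset, coeffs):
    """Levels offset + sum_i coeffs[i] * b_i over all bit patterns b in {0,1}^k."""
    return sorted(
        offset + sum(c * b for c, b in zip(coeffs, bits))
        for bits in product((0, 1), repeat=len(coeffs))
    )

def quantize(weights, levels):
    """Round each weight to its nearest grid level."""
    return [min(levels, key=lambda g: abs(g - w)) for w in weights]

def sq_error(weights, levels):
    """Total squared rounding error of the weights on the given grid."""
    return sum((w - q) ** 2 for w, q in zip(weights, quantize(weights, levels)))

# Coefficients in ratio 1:2 reproduce the uniform UINT2-style grid 0, 1, 2, 3.
uniform = grid_levels(0.0, [1.0, 2.0])
# Hypothetical free coefficients give a non-uniform 2-bit grid 0, 0.5, 3, 3.5.
variable = grid_levels(0.0, [0.5, 3.0])

# Weights clustered near 0 and near 3 fit the variable grid better.
weights = [0.1, 0.4, 3.1, 3.4]
print(sq_error(weights, uniform), sq_error(weights, variable))
```

The uniform grid is the special case where the coefficients are powers of two, so any grid it can express is also reachable with free coefficients; this is the sense in which the feasible set grows.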

2602.03922 2026-05-18 cs.LG

Online Vector Quantized Attention

Nick Alonso, Tomas Figliolia, Beren Millidge

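Affiliations Zyphra

AI summary This paper proposes online vector-quantized attention (OVQ-attention), a sequence mixing layer that aims for a better balance between compute and memory efficiency and long-context processing. While keeping linear compute and constant memory, the method uses a sparse memory update to greatly increase memory capacity, and so performs well on long-sequence tasks. Experiments show that OVQ-attention outperforms conventional linear attention and the original VQ-attention on several synthetic and language modeling tasks and reaches performance comparable to self-attention at longer sequence lengths.

Abstract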

Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, which inspired OVQ-attention. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.
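To make the "linear compute, constant memory" baseline concrete, here is a sketch of generic causal linear attention, the family the abstract compares against. It is not OVQ-attention: the sparse memory update is not shown, and the feature map `phi` is one common choice assumed here for illustration.

```python
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1, a common (assumed) choice."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Causal linear attention over T steps with a fixed-size state.

    Q and K have shape (T, d); V has shape (T, d_v). The state (S, z) has
    d * d_v + d entries regardless of T, which is the constant-memory property.
    """
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                 # running sum of phi(k_t), for normalization
    out = np.empty(V.shape, dtype=float)
    for t in range(T):
        k, q = phi(K[t]), phi(Q[t])
        S += np.outer(k, V[t])      # dense update: every state entry changes
        z += k
        out[t] = (q @ S) / (q @ z)
    return out
```

Because every step writes to the whole state, growing the state makes each update proportionally more expensive; the abstract's sparse memory update is aimed at that trade-off.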

2602.03812 2026-05-18 cs.LG cs.AI cs.CL

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter

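Affiliations Carnegie Mellon University, Pittsburgh, PA, USA; University of Maryland, College Park, MD, USA

AI summary This study proposes antidistillation fingerprinting (ADFP), a new method for detecting whether a third-party model has been trained on a teacher model's outputs through distillation. Unlike existing methods that rely on heuristic perturbations, ADFP aligns the fingerprinting objective with the student model's learning dynamics and uses a proxy model to select tokens that maximize the detectability of the fingerprint, improving detection while preserving generation quality. Experiments show that ADFP achieves a better trade-off between detection performance and utility than existing methods on mathematical reasoning, dialogue, and code generation tasks.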

Comments 28 pages, 13 figures, ICML 2026

Abstract

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K, OASST1, and MBPP demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility across mathematical reasoning, dialogue, and code generation, even when the student model's architecture is unknown.

2602.00841 2026-05-18 cs.CV

Beyond First-Order: Learning Riemannian Geometries for Invariant Visual Place Recognition

Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang

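Affiliations The Hong Kong University of Science and Technology, Hong Kong, China; University of Macau, Macau, China; University of Science and Technology Beijing, Beijing, China

AI summary This paper studies how to build feature representations for visual place recognition (VPR) that are robust to drastic changes in environment and viewpoint. To address the loss of structural correlations under extreme changes and the high adaptation cost of existing methods, it proposes RIA, an invariant aggregation framework based on Riemannian geometry that models second-order scene structure on the symmetric positive definite manifold, preserving invariant structural information while suppressing noise. Experiments show that RIA reaches performance comparable to supervised methods without extensive supervised training and achieves state-of-the-art recognition accuracy in unstructured environments.

Comments 14 pages, 5 figures

Abstract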

Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Existing aggregation paradigms either depend on extensive supervised training or rely on first-order pooling, often struggling to preserve structural correlations under extreme shifts or incurring high adaptation costs. In this work, we propose Riemannian Invariant Aggregation (RIA), a unified geometric framework that explicitly models second-order scene structure on the Symmetric Positive Definite (SPD) manifold. By treating perturbations as tractable congruence transformations, RIA leverages geometry-aware Riemannian mappings to project covariance descriptors into a linearized Euclidean space, effectively preserving invariant structural components while suppressing noise. Extensive evaluations demonstrate that RIA achieves zero-shot performance comparable to supervised methods, and establishes state-of-the-art accuracy with simple fine-tuning, particularly in unstructured environments. The source code will be released.
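The idea of mapping covariance descriptors from the SPD manifold into a flat Euclidean space can be sketched with the log-Euclidean map. This is one standard Riemannian mapping chosen here as an assumption; the abstract does not say which mapping RIA uses, and the helper names and the regularizer `eps` are hypothetical.

```python
import numpy as np

def covariance_descriptor(F, eps=1e-5):
    """Second-order descriptor of local features F (N x d), regularized to be SPD."""
    Fc = F - F.mean(axis=0, keepdims=True)
    C = Fc.T @ Fc / max(F.shape[0] - 1, 1)
    return C + eps * np.eye(F.shape[1])

def log_map(C):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.log(vals)) @ vecs.T

def spd_to_vector(C):
    """Flatten log(C) into a Euclidean vector using its upper triangle."""
    L = log_map(C)
    return L[np.triu_indices(L.shape[0])]

def congruence(C, A):
    """Congruence transformation A C A^T; SPD is preserved for invertible A."""
    return A @ C @ A.T
```

After the log map, descriptors can be compared with ordinary Euclidean distances, which is what makes the space "linearized" in the abstract's sense.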

2601.21702 2026-05-18 cs.LG cs.CL

Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

Tien Dang, The-Hai Nguyen, Dinh Mai Phuong, Nguyen Minh Phuong, Anh Bui, Hoang Thanh-Tung, Le-Minh Nguyen, Naoya Inoue

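Affiliations Japan Advanced Institute of Science and Technology; Quantum AI and Cyber Security Institute; FPT Corporation; Monash University; RIKEN

AI summary This paper studies Representation Misdirection (RM), a class of unlearning methods for large language models that achieve forgetting by steering the latent representations of forget-samples toward a target vector. Starting from the Linear Representation Hypothesis, the authors propose that RM not only achieves forgetting but also elicits controllable side behaviors and enhanced capabilities tied to high-level concepts. Experiments show that RM can be used to control behaviors such as a model's truthfulness and sentiment, or to improve its in-context learning ability, revealing both the potential value and the risks of the method for developing controllable models.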

Comments 36 pages, 19 tables, 9 figures

Abstract

We consider Representation Misdirection (RM), a class of large language model (LLM) unlearning methods that achieve forgetting by redirecting the forget-representations, that is, latent representations of forget-samples, toward a target vector. Despite its importance, the role of the target vector used in RM remains underexplored. Here, we approach and revisit RM through the lens of the Linear Representation Hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the Linear Representation Hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning via RM elicits controllable emergent side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models' truthfulness, sentiment, refusal, and language) and capability enhancement (e.g., improving unlearned models' in-context learning (ICL) capability). Our findings reveal that this phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing unlearned models that require stronger capabilities and controllable behaviors.

2601.21294 2026-05-18 cs.LG stat.ML

Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning

Anders Gjølbye, Ida Kargaard, Emma Kargaard, Lina Skerath, Lars Kai Hansen

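Affiliations Technical University of Denmark; Department of Applied Mathematics and Computer Science

AI summary This paper studies how missing data affects the performance of spectral partial least squares (PLS) in multimodal learning. Analyzing the effect of independent entry-wise missing-completely-at-random masking on the cross-covariance matrix in a high-dimensional spiked model, it finds that missing data attenuates the signal strength and induces a BBP-type phase transition: when the signal-to-noise ratio is below a critical threshold, the leading singular vectors fail to capture the latent shared structure, while above the threshold they achieve nontrivial alignment. The study also states a finite-rank extension as a conjecture and validates the predicted phase diagram and recovery curves through simulations and semi-synthetic experiments.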

Comments Preprint

Abstract

Partial Least Squares (PLS) learns shared structure from paired data via the top singular vectors of the empirical cross-covariance (PLS-SVD), but multimodal datasets often have missing entries in both views. We study PLS-SVD under independent entry-wise missing-completely-at-random masking in a proportional high-dimensional spiked model. After appropriate normalization, the masked cross-covariance behaves like a spiked rectangular random matrix whose effective signal strength is attenuated by $\sqrt{\rho}$, where $\rho$ is the joint entry retention probability. The replica-symmetric analysis predicts a sharp BBP-type phase transition: below a critical signal-to-noise threshold the leading singular vectors are asymptotically uninformative, while above it they achieve nontrivial alignment with the latent shared directions, with closed-form asymptotic overlap formulas. We also state a finite-rank extension as a conjecture, predicting that the same missingness-adjusted threshold applies componentwise when the latent spikes are separated. Simulations and semi-synthetic multimodal experiments agree with the predicted phase diagram and recovery curves across aspect ratios, signal strengths, and missingness levels.
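A small simulation sketch of PLS-SVD on masked data, under assumptions that are not taken from the paper: a rank-one shared signal of strength `theta`, unit-variance Gaussian noise, zero-filling of missing entries, and the same per-entry retention probability `keep` in both views. The normalization and the exact definition of $\rho$ used in the paper may differ.

```python
import numpy as np

def masked_pls_svd(X, Y, mask_x, mask_y):
    """Leading singular triplet of the zero-filled masked cross-covariance."""
    n = X.shape[0]
    C = (X * mask_x).T @ (Y * mask_y) / n
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, 0], s[0], Vt[0]

rng = np.random.default_rng(0)
n, p, q = 2000, 100, 80          # samples and the two view dimensions (assumed)
theta, keep = 3.0, 0.7           # signal strength and retention probability (assumed)

# Latent shared directions u, v and a shared latent factor z.
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(q); v /= np.linalg.norm(v)
z = rng.standard_normal(n)
X = np.sqrt(theta) * np.outer(z, u) + rng.standard_normal((n, p))
Y = np.sqrt(theta) * np.outer(z, v) + rng.standard_normal((n, q))

# Independent entry-wise MCAR masks in both views.
mask_x = rng.random((n, p)) < keep
mask_y = rng.random((n, q)) < keep

u_hat, s_top, v_hat = masked_pls_svd(X, Y, mask_x, mask_y)
overlap_u = abs(u_hat @ u)       # alignment with the true direction, in [0, 1]
overlap_v = abs(v_hat @ v)
print(overlap_u, overlap_v)
```

Lowering `keep` or `theta` in this sketch is a quick way to watch the overlaps fall toward the uninformative regime that the abstract describes.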

2601.19923 2026-05-18 cs.CL cs.AI

Structure-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation for Web Information Systems

Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin

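Affiliations Tele-Communication Technology Bureau, Xinhua News Agency

AI summary As large language models (LLMs) take on a central role in Web-based autonomous agents and complex Web Information Systems, their ability to translate natural language accurately into structured formats has become critical. This paper proposes Structure-BiEval, a self-supervised framework that needs no manual annotation and that decouples structure from content, using metrics such as Content Semantic Accuracy and Normalized Tree Edit Distance to quantify the structural fidelity of Web data. Experimental results show that LLMs of different sizes differ markedly on structured-output tasks and that deeply nested structures challenge models of every kind.

Abstract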

As Large Language Models (LLMs) evolve into the core of Web-based autonomous agents and complex Web Information Systems, their ability to faithfully translate natural language into rigorous structured formats has become paramount, as this capability is critical for Web API invocation and data exchange. However, evaluating this structural fidelity in Web-native payloads remains a challenge: traditional text metrics fail to capture topological consistency in semi-structured Web data, while manual evaluation is prohibitively costly. To address this, we propose Structure-BiEval, a novel self-supervised framework for quantitative, annotation-free assessment tailored for Web data engineering. By leveraging deterministic Intermediate Representations, our framework effectively decouples structure from content, utilizing Content Semantic Accuracy and Normalized Tree Edit Distance as precise metrics. We empirically benchmark 15 state-of-the-art LLMs across dual Web structural topologies, namely Hierarchical Data (Web backend payloads) and Tabular Data (Web frontend presentation). The results reveal substantial variability in structural performance, including cases where mid-sized models unexpectedly outperform larger counterparts in Web data formatting. Furthermore, our findings show that deep recursive nesting poses a consistent challenge for Web agents across varying parameter scales.
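The notion of decoupling structure from content can be illustrated on a JSON payload. The sketch below is not the paper's Intermediate Representation or its metrics; `skeleton` and `leaves` are hypothetical helpers that separate the shape of a document from its leaf values, and the comparison is a plain equality check rather than a graded score such as Normalized Tree Edit Distance.

```python
import json

def skeleton(node):
    """Structure only: keep keys and nesting, replace each leaf by its type name."""
    if isinstance(node, dict):
        return {k: skeleton(v) for k, v in sorted(node.items())}
    if isinstance(node, list):
        return [skeleton(v) for v in node]
    return type(node).__name__

def leaves(node, path=()):
    """Content only: yield (path, value) for every leaf in the document."""
    if isinstance(node, dict):
        for k, v in sorted(node.items()):
            yield from leaves(v, path + (k,))
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from leaves(v, path + (i,))
    else:
        yield path, node

reference = json.loads('{"user": {"name": "Ada", "tags": ["a", "b"]}}')
candidate = json.loads('{"user": {"name": "Ada L.", "tags": ["a", "b"]}}')

same_structure = skeleton(reference) == skeleton(candidate)            # True
same_content = list(leaves(reference)) == list(leaves(candidate))      # False
```

Here the candidate reproduces the nesting exactly but changes one value, so a structural score and a content score would disagree, which is the case a single text metric tends to blur.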

2601.13529 2026-05-18 cs.RO

The OncoReach Stylet for Brachytherapy: Design Evaluation and Pilot Study

Pejman Kheradmand, Kent K. Yamamoto, Emma Webster, Keith Sowards, Gianna Hatheway, Katharine L. Jackson, Sabino Zani, Julie A. Raffi, Diandra N. Ayala-Peacock, Scott R. Silva, Joanna Deaton Bertram, Yash Chitalia

Affiliations Telesurgery (HeaRT) Laboratory, J. B. Speed School of Engineering, University of Louisville, Louisville, Kentucky, USA; Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, USA; Surgical Education and Activities Lab (SEAL), Duke University Department of Surgery, Durham, NC, USA; Department of Radiation Oncology, University of Louisville School of Medicine, Louisville, KY, USA; Department of Radiation Oncology, Duke University School of Medicine, Durham, NC, USA

AI summary This paper presents OncoReach, a steerable stylet for brachytherapy of cervical cancer. The device uses a tendon-driven design, is compatible with standard 15- and 13-gauge needles, and achieves greater bending compliance while retaining axial stiffness. Its performance was characterized through experiments and modeling, and a pilot test on a patient-derived composite phantom demonstrated steering from less-invasive entry points to lateral targets, supporting the clinical potential of steerable stylets.

Abstract

Cervical cancer accounts for a significant portion of the global cancer burden among women. Interstitial brachytherapy (ISBT) is a standard procedure for treating cervical cancer; it involves placing a radioactive source through a straight hollow needle within or in close proximity to the tumor and surrounding tissue. However, the use of straight needles limits surgical planning to a linear needle path. We present the OncoReach stylet, a handheld, tendon-driven steerable stylet designed for compatibility with standard ISBT 15- and 13-gauge needles. Building upon our prior work, we evaluated design parameters such as needle gauge, spherical joint count, and spherical joint placement, including an asymmetric disk design, to identify a configuration that maximizes bending compliance while retaining axial stiffness. Free space experiments quantified tip deflection across configurations, and a two-tube Cosserat rod model accurately predicted the centerline shape of the needle for most trials. The best performing configuration was integrated into a reusable handheld prototype that enables manual actuation. A patient-derived, multi-composite phantom model of the uterus and pelvis was developed to conduct a pilot study of the OncoReach steerable stylet with one expert user. Results showed the ability to steer from less-invasive, medial entry points to reach the lateral-most targets, underscoring the significance of steerable stylets.

2601.00678 2026-05-18 cs.CV

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson

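Affiliations University of Glasgow

AI summary This paper proposes a new method for generating dynamic video from a single image, producing high-quality, temporally consistent video that follows a given camera trajectory. The core idea is to construct a dynamic 3D Gaussian scene representation and generate plausible object motion in a single forward pass, enabling fast camera-controlled video generation. The method performs well on several datasets, achieving state-of-the-art video quality and inference efficiency.

Abstract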

Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.

2512.21651 2026-05-18 cs.LG

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

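Affiliations Department of Data Science and AI; Monash University; Centre for Vision, Speech and Signal Processing; University of Surrey

AI summary Large language models (LLMs) perform well on natural language processing tasks, but their size limits deployment on resource-constrained devices. Addressing the difficulty of preserving output behavior in 1-bit post-training quantization (PTQ), this paper identifies two root causes, error accumulation and anisotropic distortion of the representation space, and proposes a new PTQ method that improves the performance of 1-bit quantized models. Experiments show that the method outperforms existing methods on multiple tasks.

Abstract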

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration, enabling low-cost deployment. Recent advances in post-training quantization have demonstrated that even near 4-bit methods can maintain most of the original model performance. However, 1-bit quantization remains particularly challenging. A common strategy in 1-bit quantization is to determine binary weights by matching full-precision parameters, following a weight-driven criterion. However, this objective is not directly aligned with the quantized model's objective, which is to preserve the model's output behavior under the impact of quantization. A natural alternative is to adopt output-driven criteria that minimize discrepancies in model outputs using calibration data. Surprisingly, naive output-driven approaches often perform even worse in the 1-bit regime. In this paper, we show that this failure arises from two fundamental issues: error accumulation across layers and, more critically, anisotropic distortion of the representation space. Based on these insights, we propose a novel PTQ method for 1-bit LLMs that explicitly addresses these issues while maintaining computational efficiency. Extensive experiments demonstrate that our approach consistently outperforms existing 1-bit PTQ methods.
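The difference between the weight-driven and output-driven criteria mentioned in the abstract can be shown on a single weight row. This sketch uses the classic sign-plus-scale binarization as the weight-driven baseline; it is not the paper's proposed method, and the toy weights and calibration inputs are made up for illustration.

```python
def binarize(w):
    """Weight-driven 1-bit quantization of one row: w is approximated by alpha * sign(w)."""
    signs = [1.0 if x >= 0 else -1.0 for x in w]
    # For fixed signs, alpha = mean(|w|) is the least-squares optimal scale.
    alpha = sum(abs(x) for x in w) / len(w)
    return [alpha * s for s in signs]

def weight_error(w, w_q):
    """Weight-driven criterion: squared distance between the weight rows."""
    return sum((a - b) ** 2 for a, b in zip(w, w_q))

def output_error(X, w, w_q):
    """Output-driven criterion: squared output gap on calibration inputs X."""
    def dot(x, v):
        return sum(a * b for a, b in zip(x, v))
    return sum((dot(x, w) - dot(x, w_q)) ** 2 for x in X)

w = [1.0, -3.0]
w_q = binarize(w)                                  # [2.0, -2.0]

print(weight_error(w, w_q))                        # same for any calibration data
print(output_error([[1.0, 1.0]], w, w_q))          # input aligned with the weight error
print(output_error([[1.0, -1.0]], w, w_q))         # input orthogonal to the weight error
```

The weight error is fixed, but the output error depends on how the calibration inputs line up with the quantization error, which is why the two criteria can rank the same binarization differently.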

2512.19701 2026-05-18 cs.LG cs.AI

LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation

Yuxuan Yin, Shengke Zhou, Yunjie Zhang, Ajay Mohindra, Boxun Xu, Peng Li

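Affiliations University of California, Santa Barbara

AI summary Accurately predicting the resource consumption and runtime of cloud workflow jobs is critical for scheduling efficiency, but it is challenging because job configurations are semi-structured. This paper proposes LASER, a framework that fine-tunes large language models on serialized workflow configurations for multi-target resource and runtime regression, introducing scientific notation output encoding and constrained decoding to improve the accuracy and efficiency of numerical prediction. Experiments show that LASER outperforms human experts and state-of-the-art tabular machine learning methods on large-scale chip design workloads and on the newly built GHARuntime dataset, establishing a new paradigm for LLM-based regression on semi-structured workflow data.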

Comments 20 pages, 7 figures

Abstract

Accurate prediction of resource consumption and runtime for cloud workflow jobs is critical for scheduling efficiency, yet remains challenging due to the semi-structured nature of job configurations -- comprising shell commands, tool-specific parameters, dependency graphs, and hierarchical metadata. Traditional ML approaches require brittle feature engineering to flatten this rich information into fixed-size vectors, losing critical semantic context. We present LASER, a framework that fine-tunes LLMs on serialized workflow job configurations for multi-target resource and runtime regression. To address the challenges of numerical regression via generation, we introduce scientific notation output encoding for targets spanning multiple orders of magnitude, and constrained decoding with prefix filling to enforce output validity while reducing inference latency by over 30%. We further show that full-attention fine-tuning improves accuracy over sliding-window LLMs on long job contexts. Validated on large-scale chip design workloads, and GHARuntime, a new public benchmark derived from 580,000+ GitHub Actions runs across 27,000+ repositories, LASER outperforms human experts and SOTA tabular ML baselines, with clear model- and data-scaling behavior, establishing a new paradigm for LLM-based regression on semi-structured workflow data.
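Scientific notation output encoding can be sketched as a fixed-width text format for regression targets, plus a validity check of the kind a constrained decoder would enforce. The number of mantissa digits, the regular expression, and the function names are assumptions for illustration; the abstract does not specify LASER's exact format.

```python
import re

# Assumed target format: one leading digit, three decimals, two-digit exponent.
SCI_PATTERN = re.compile(r"^[0-9]\.[0-9]{3}e[+-][0-9]{2}$")

def encode(value):
    """Render a non-negative target (e.g. seconds or bytes) as fixed-width text."""
    return f"{value:.3e}"

def decode(text):
    """Parse model output, rejecting anything outside the assumed format."""
    if not SCI_PATTERN.match(text):
        raise ValueError(f"not a valid encoded target: {text!r}")
    return float(text)

print(encode(12345.678))     # 1.235e+04
print(encode(0.00042))       # 4.200e-04
print(decode("3.600e+03"))   # 3600.0
```

Every target then has the same string length whether it is milliseconds or days, so the model predicts an exponent and a few mantissa digits instead of an arbitrarily long digit string.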

2512.15067 2026-05-18 cs.LG cs.AI cs.SY eess.SY

EMFusion: An Uncertainty-Aware Conditional Diffusion Framework for Frequency-Selective EMF Forecasting in Wireless Networks

Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio

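Affiliations Department of Electrical Engineering and Computer Science, York University; School of Electrical and Electronic Engineering, Huazhong University of Science and Technology; Central China Branch of State Grid Corporation of China; Department of Electronic Engineering, University of Rome Tor Vergata; Consorzio Nazionale Interuniversitario per le Telecomunicazioni (CNIT)

AI summary With the rapid growth of wireless infrastructure, accurately estimating and forecasting electromagnetic field (EMF) levels has become especially important for ensuring compliance, assessing health impacts, and optimizing network planning. This paper proposes EMFusion, an uncertainty-aware conditional diffusion framework for frequency-selective multivariate EMF forecasting in wireless networks. The method uses a residual U-Net with a cross-attention mechanism to integrate contextual information such as time of day, season, and holidays, provides explicit uncertainty estimates, and adopts an imputation-based sampling strategy to improve the temporal coherence of forecasts. Experiments show that EMFusion outperforms existing methods on several evaluation metrics, improving forecast accuracy and reliability.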

Comments Submission for possible publication

Abstract

The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors, such as time of day, season, and holidays, while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates empirical probabilistic prediction intervals from the learned conditional distribution, providing uncertainty-aware probabilistic forecasting rather than simple point estimation. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS) and by 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
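For readers unfamiliar with the metric, CRPS and an empirical prediction interval can both be computed directly from forecast samples. The sketch below uses the standard sample-based CRPS estimator and a simple order-statistic interval; it is a generic illustration, not EMFusion's evaluation code, and the function names are hypothetical.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    term1 = sum(abs(s - y) for s in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return term1 - term2

def prediction_interval(samples, level=0.9):
    """Central empirical interval taken from the sorted forecast samples."""
    s = sorted(samples)
    lo = s[round((1 - level) / 2 * (len(s) - 1))]
    hi = s[round((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

print(crps_from_samples([1.0, 1.0, 1.0], 1.0))      # 0.0 for a perfect, certain forecast
print(crps_from_samples([0.0, 2.0], 1.0))           # 0.5
print(prediction_interval(list(range(101)), 0.9))   # (5, 95)
```

For a forecast that collapses to a single point, the second term vanishes and CRPS reduces to absolute error, which is why it allows point and probabilistic forecasters to be compared on one scale.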

2512.14671 2026-05-18 cs.CV

ART: Articulated Reconstruction Transformer

Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong

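Affiliations Reality Labs Research, Meta; Stanford University

AI summary This paper proposes ART, a new model that reconstructs complete 3D articulated objects from sparse multi-state RGB images without relying on specific object categories or complex optimization. ART treats an articulated object as an assembly of rigid parts; its transformer architecture maps images to learnable part slots and jointly decodes each part's 3D geometry, texture, and articulation parameters, yielding reconstructions that are physically interpretable and directly usable in simulation. Experiments show that ART performs strongly on several benchmarks, clearly surpassing existing methods and establishing a new state of the art.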

Comments Project Page: https://kyleleey.github.io/ART/

Abstract

We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

2512.09673 2026-05-18 cs.LG cs.AI cs.NE stat.ML

Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao

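Affiliations University of Science and Technology of China; University of Edinburgh; Nanyang Technological University

AI summary This paper studies how enforcing equivariance affects the expressive power of neural networks and finds that the constraint can weaken it. By analyzing boundary hyperplanes and channel vectors, the authors demonstrate the problem constructively, show that it can be compensated for by enlarging the model, and prove upper bounds on the enlargement required. Surprisingly, the enlarged architectures have a lower-dimensional hypothesis space, which may bring better generalization.

Abstract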

Equivariant neural networks encode the intrinsic symmetry of data as an inductive bias and have achieved impressive performance across a wide range of domains. However, the understanding of their expressive power remains premature. Focusing on 2-layer ReLU networks, this paper investigates the impact of enforcing equivariance constraints on the expressive power. By examining the boundary hyperplanes and the channel vectors, we constructively demonstrate that enforcing equivariance constraints could undermine the expressive power. Naturally, this drawback can be compensated for by enlarging the model size -- we further prove upper bounds on the required enlargement for compensation. Surprisingly, we show that the enlarged neural architectures have reduced hypothesis space dimensionality, implying even better generalizability.

2512.04457 2026-05-18 cs.CL

RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning

Guoshenghui Zhao, Huawei Lin, Weijie Zhao

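Affiliations Golisano College of Computing and Information Sciences, Rochester Institute of Technology

AI summary This paper proposes RapidUn, an efficient unlearning framework for large language models that uses influence-driven parameter reweighting to remove the influence of specific data from a model. The method first quickly estimates the influence of each sample and then maps it to adaptive update weights, selectively updating parameters to forget harmful behavior without losing general knowledge. Experiments show that, across multiple models and datasets, RapidUn is up to 100 times more efficient than full retraining and is more stable and generalizes better than existing methods.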

Comments Code available at: https://github.com/eyerf/RapidUn

详情
英文摘要

Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates -- forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.

2512.01089 2026-05-18 cs.AI

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

Peter Jansen, Samiah Hassan, Pragnya Narasimha

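Affiliations University of Arizona; Allen Institute for Artificial Intelligence

AI summary CodeDistiller is a system that automatically distills scientific GitHub repositories into a high-quality code library, with the aim of strengthening the code generation of scientific coding agents. Evaluated with a combination of automatic and domain-expert assessment, it produces runnable code examples for domains such as materials science and improves the accuracy and scientific soundness of experiments generated by automated scientific discovery systems. Experiments show that agents using a CodeDistiller-generated library produce more complete and more reliable experiment code, and that inexpensive proxy metrics may be feasible for evaluating scientific discovery systems at scale.

Comments 8 pages, 3 figures, 3 tables. Accepted to ACL 2026 (Demo Track)

Details
English abstract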

Automated Scientific Discovery (ASD) systems can help automatically generate and run code-based experiments, but their capabilities are limited by the code they can reliably generate from parametric knowledge alone. As a result, current systems either mutate a small number of manually crafted experiment examples or operate solely from parametric knowledge, limiting quality and reach. We introduce CodeDistiller, a system that automatically distills large collections of scientific GitHub repositories into a vetted library of working domain-specific code examples, allowing ASD agents to expand their capabilities without manual effort. Using a combination of automatic and domain-expert evaluation on 250 materials science repositories, we find the best model is capable of producing functional examples for 74% of repositories, while our downstream evaluation shows an ASD agent augmented with a CodeDistiller-generated library produces more accurate, complete, and scientifically sound experiments than an agent with only general materials-science code examples. We also evaluate LLM-as-a-judge ratings against domain-expert ratings in an A/B testing paradigm, finding moderate agreement and suggesting that inexpensive proxy metrics may be feasible for evaluating scientific discovery systems at scale.

2512.00920 2026-05-18 cs.CL

Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Jianxiang Zang, Yongda Wei, Ruxue Bai, Shiyu Jiang, Nijia Mo, Binhong Li, Qiang Sun, Hui Liu

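Affiliations School of Computer Science and Artificial Intelligence, Fudan University; School of Statistics and Data Science, Shanghai University of International Business and Economics; Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)

AI summary This study proposes Reward Auditor, a framework for assessing the suitability of reward models (RMs) in real-world perturbed scenarios, that is, their ability to remain reliable under specific conditions. Unlike conventional evaluation, which looks only at preference perception accuracy, Reward Auditor applies statistical tests to the degradation of the RM's confidence distribution under different perturbations, inferring whether systematic vulnerabilities exist and how severe they are. The method lays a foundation for building safer, more reliable, and more trustworthy LLM alignment systems.

Details
English abstract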

Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracy in specific given scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?" Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
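
The idea of reporting both a significance level and an effect size for confidence degradation can be sketched with a generic paired test. The abstract does not say which test or effect-size measure Reward Auditor uses, so the sign-flip permutation test, the paired Cohen's d, the function name `audit`, and the confidence values below are all assumptions for illustration.

```python
import random
import statistics

def audit(clean, perturbed, n_perm=2000, seed=0):
    """Paired sign-flip permutation test plus a paired effect size.

    clean[i] and perturbed[i] are hypothetical RM confidences for the same
    preference pair before and after a perturbation.
    """
    diffs = [c - p for c, p in zip(clean, perturbed)]  # positive = degradation
    observed = statistics.mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if statistics.mean(flipped) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)  # one-sided, with add-one smoothing
    # Cohen's d for paired samples; requires the differences to vary.
    effect = observed / statistics.stdev(diffs)
    return p_value, effect

clean = [0.90, 0.85, 0.80, 0.95, 0.88, 0.92, 0.87, 0.90, 0.91, 0.86]
perturbed = [0.70, 0.80, 0.60, 0.90, 0.75, 0.85, 0.70, 0.80, 0.82, 0.70]
p_value, effect = audit(clean, perturbed)
```

The p-value speaks to the certainty that a degradation exists, while the effect size speaks to its severity, mirroring the two quantities named in the abstract.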

2512.00778 2026-05-18 cs.LG

What Is Preference Optimization Doing, and Why?

Yue Wang, Qizhou Wang, Zizhuo Zhang, Gang Niu, Bo Han, Masashi Sugiyama

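Affiliations TMLR Group, Department of Computer Science, Hong Kong Baptist University; RIKEN AIP; The University of Tokyo

AI summary Preference optimization (PO) is essential for large language models (LLMs), and methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) have been highly successful. By analyzing the optimization dynamics of the two methods, this paper reveals differences in their algorithmic behavior and target directions: DPO follows stable targets, whereas PPO balances exploration and exploitation. The study also examines three key but seldom discussed components (positive learning, negative learning, and loss reweighting), which play different roles in each method, and uses ablation studies to examine how these dynamics affect optimization efficiency and practical performance.

Details
English abstract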

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses of the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and explaining their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key yet seldom discussed components within PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets. However, loss reweighting in DPO acts less as a reward signal and more as a regularizer that mitigates overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
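
For readers who want the three components in concrete form, the sketch below evaluates the standard DPO loss for a single preference pair together with the sigmoid factor that scales its gradient. Reading that factor as the abstract's "loss reweighting", and the chosen and rejected log-probability terms as "positive" and "negative learning", is an interpretation made here, not something the abstract states. The function names and numeric inputs are hypothetical.

```python
import math

def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward gap between the chosen (w) and rejected (l) response,
    # each measured relative to the frozen reference policy.
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    margin = dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return math.log1p(math.exp(-margin))

def dpo_grad_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # sigmoid(-margin): the scalar that multiplies the gradient direction
    # (raise logp_w, lower logp_l). It shrinks as the pair gets fit better.
    margin = dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return 1.0 / (1.0 + math.exp(margin))

# Policy equal to the reference: zero margin, loss log(2), weight 0.5.
at_init = dpo_loss(-1.0, -1.0, -1.0, -1.0)
w_init = dpo_grad_weight(-1.0, -1.0, -1.0, -1.0)

# Policy already prefers the chosen response: lower loss, smaller weight.
fitted = dpo_loss(-0.5, -2.0, -1.0, -1.0)
w_fitted = dpo_grad_weight(-0.5, -2.0, -1.0, -1.0)
```

Because the weight decays toward zero on pairs the policy already ranks correctly, it limits further updates on well-fit examples, which is at least consistent with the regularizer-like role the abstract describes.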

2512.00242 2026-05-18 cs.LG cs.AI cs.ET stat.ML

Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves

Alessio Borgi, Fabrizio Silvestri, Pietro Liò

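Affiliations Department of Computer Science and Technology, University of Cambridge; Department of Computer, Control and Management Engineering, Sapienza University

AI summary This paper proposes Polynomial Neural Sheaf Diffusion (PolyNSD), a new method that improves the diffusion process of sheaf neural networks on graphs. Its propagation operator is a degree-K polynomial in the normalized sheaf Laplacian, which gives a K-hop receptive field independent of the stalk dimension, and its trainable spectral response is a convex mixture of orthogonal polynomial basis responses. Compared with conventional approaches, PolyNSD keeps the model stable while reducing compute and memory requirements, and it reaches new state-of-the-art results on homophilic and heterophilic graph benchmarks.

Details
English abstract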

Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure that assigns local vector spaces (stalks) and learnable linear restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, and reaches new state-of-the-art results on both homophilic and heterophilic benchmarks. It inverts the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension while reducing runtime and memory requirements.
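
The propagation step described above can be sketched with a Chebyshev basis, one common family of orthogonal polynomials with a three-term recurrence. The abstract does not name the basis, so Chebyshev is an assumption, as are the softmax parameterization of the convex mixture, the use of an ordinary normalized graph Laplacian as a stand-in for the sheaf Laplacian (stalk dimension 1), and the helper names.

```python
import math

def normalized_laplacian(adj):
    # I - D^{-1/2} A D^{-1/2} for a graph with no isolated nodes.
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def poly_filter(L, x, logits, lam_max=2.0):
    """Return sum_k w_k T_k(L_tilde) x, with w = softmax(logits)."""
    n = len(L)
    # Spectral rescaling: map the spectrum [0, lam_max] into [-1, 1].
    L_tilde = [[2.0 * L[i][j] / lam_max - (1.0 if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    w = softmax(logits)                    # convex mixture weights
    t_prev = list(x)                       # T_0(L_tilde) x = x
    out = [w[0] * v for v in t_prev]
    if len(w) == 1:
        return out
    t_curr = matvec(L_tilde, x)            # T_1(L_tilde) x = L_tilde x
    out = [o + w[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(w)):
        # Three-term recurrence: T_k = 2 L_tilde T_{k-1} - T_{k-2}.
        lt = matvec(L_tilde, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]
        out = [o + w[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out

# Path graph 0-1-2-3 with a unit impulse on node 0.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
L = normalized_laplacian(adj)
impulse = [1.0, 0.0, 0.0, 0.0]
one_hop = poly_filter(L, impulse, [0.0, 0.0])        # degree K = 1
two_hop = poly_filter(L, impulse, [0.0, 0.0, 0.0])   # degree K = 2
```

With K = 1 the impulse reaches only node 1, and with K = 2 it reaches node 2 but not node 3, which is the K-hop receptive field in miniature. Because each Chebyshev polynomial is bounded by 1 in magnitude on [-1, 1], a convex mixture of them is bounded as well, which is one way to see why convex weights help stability.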

2511.19399 2026-05-18 cs.CL cs.AI cs.LG

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

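Affiliations University of Washington; Allen Institute for AI; Carnegie Mellon University; Massachusetts Institute of Technology; Seattle Children's Hospital; University of California, Berkeley

AI summary This paper presents DR Tulu-8B, a deep research model that addresses the weak performance of existing open deep research agents on long-form, multi-step research tasks. It introduces Reinforcement Learning with Evolving Rubrics (RLER), in which the rubrics co-evolve with the policy model during training, improving fact checking and feedback quality. DR Tulu-8B is the first fully open model trained directly for open-ended, long-form deep research. On several benchmarks in science, healthcare, and general domains it substantially outperforms existing open models and matches or exceeds proprietary ones, at a much lower cost per query.

Comments ICML 2026

Details
English abstract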

Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).

2511.19115 2026-05-18 cs.AI cs.CY

AI Consciousness and Existential Risk

Rufin VanRullen

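Affiliations Frontier AI companies; independent foundations

AI summary This paper examines the relationship between AI consciousness and existential risk, noting that the two are often conflated even though consciousness and intelligence are empirically and theoretically distinct properties. It argues that intelligence is a direct predictor of an AI system's existential threat, whereas consciousness does not directly constitute one but could influence risk indirectly in some scenarios. Making the distinction clear helps AI safety researchers and policymakers identify and address the most pressing issues.

Comments Updated for clarity and completeness following peer-review

Details
English abstract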

In AI, the existential risk denotes the hypothetical threat posed by an artificial system that would possess both the capability and the objective, either directly or indirectly, to eradicate humanity. This issue is gaining prominence in scientific debate due to recent technical advancements and increased media coverage. In parallel, AI progress has sparked speculation and studies about the potential emergence of artificial consciousness. The two questions, AI consciousness and existential risk, are sometimes conflated, as if the former entailed the latter. Here, I explain that this view stems from a common confusion between consciousness and intelligence. Yet these two properties are empirically and theoretically distinct. Arguably, while intelligence is a direct predictor of an AI system's existential threat, consciousness is not. There are, however, certain incidental scenarios in which consciousness could influence existential risk, in either direction. Consciousness could be viewed as a means towards AI alignment, thereby lowering existential risk; or, it could be a precondition for reaching certain capabilities or levels of intelligence, and thus positively related to existential risk. Recognizing these distinctions can help AI safety researchers and public policymakers focus on the most pressing issues.

2511.18719 2026-05-18 cs.CV

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibin Huang, Chi Zhang, Xuelong Li

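Affiliations Southeast University; Institute of Artificial Intelligence (TeleAI), China Telecom; University of Science and Technology of China

AI summary This paper proposes Visual Preference Policy Optimization (ViPO), a method for improving the alignment of visual generative models with human preferences. Where existing approaches rely on a single scalar reward, ViPO introduces a Perceptual Structuring Module that turns the feedback into structured, pixel-level advantage maps, guiding optimization toward the key regions of the visual content at a finer granularity. The method performs well on both image and video generation, improving in-domain alignment with human-preference rewards and generalization on out-of-domain evaluations, and it is lightweight, general, and easy to integrate into existing training pipelines.

Details
English abstract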

Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
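
A rough sketch of the step from scalar to pixel-level advantages follows. The group-relative normalization is the usual GRPO form (reward minus group mean, divided by group standard deviation). The redistribution rule, in which a saliency map with mean-one weights spreads the scalar over pixels, is an assumption standing in for the Perceptual Structuring Module, whose actual construction the abstract does not give. The function names and the toy saliency values are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: standardize each reward within its sample group.
    # The population standard deviation is used here as a simplifying choice.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def pixel_advantages(advantage, saliency):
    # Spread one scalar advantage over a 2D map. Weights are rescaled to
    # average 1, so the mean of the map equals the original scalar.
    flat = [s for row in saliency for s in row]
    mean_s = sum(flat) / len(flat)
    return [[advantage * s / mean_s for s in row] for row in saliency]

rewards = [1.0, 0.0, 0.5, 0.5]          # one scalar reward per sample
adv = group_advantages(rewards)

# Hypothetical saliency: the right column matters three times as much.
saliency = [[1.0, 3.0], [1.0, 3.0]]
adv_map = pixel_advantages(adv[0], saliency)
map_mean = sum(v for row in adv_map for v in row) / 4
```

Keeping the map's mean equal to the scalar advantage is one simple way to shift optimization pressure between regions without changing the overall update scale, which is in the spirit of preserving the stability of standard GRPO.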

2511.18127 2026-05-18 cs.CV

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

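Affiliations The University of Tokyo

AI summary SFHand is a framework for language-guided, real-time 3D hand forecasting, intended to improve human-computer interaction in applications such as augmented reality and assistive robotics. From a continuous video stream and language instructions, it autoregressively predicts future hand states, including hand type, 2D bounding box, 3D pose, and trajectory, using an ROI-enhanced memory layer to capture temporal context and salient hand regions. The study also introduces the EgoHaFL dataset. Experiments show that SFHand clearly outperforms existing methods in 3D hand forecasting and improves success rates on downstream manipulation tasks.

Details
English abstract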

Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.

2511.17426 2026-05-18 cs.LG cs.CV stat.ML

Self-Supervised Learning by Curvature Alignment

Benyamin Ghojogh, M. Hadi Sepanj, Paul Fieguth

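Affiliations Vision and Image Processing Group, Systems Design Engineering, University of Waterloo, Ontario, Canada

AI summary This paper proposes CurvSSL, a self-supervised learning method based on curvature alignment, and its kernel-space extension, kernel CurvSSL, which aim to improve representation learning by explicitly modeling the local geometry of the data manifold. The method adds a curvature regularizer to a conventional non-contrastive framework: it computes the local curvature of the embedded features, then aligns and decorrelates it across augmented views, strengthening the invariance and geometric consistency of the representation. Experiments on MNIST and CIFAR-10 show linear evaluation performance that is competitive with or better than Barlow Twins and VICReg.

Comments A shorter version of this paper has been published in: Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, Special Issue: Proceedings of CVIS 2025

Journal ref Shorter version of this paper is published in Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, Special Issue: Proceedings of CVIS 2025

Details
English abstract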

Self-supervised learning (SSL) has recently advanced through non-contrastive methods that couple an invariance term with variance, covariance, or redundancy-reduction penalties. While such objectives shape first- and second-order statistics of the representation, they largely ignore the local geometry of the underlying data manifold. In this paper, we introduce CurvSSL, a curvature-regularized self-supervised learning framework, and its RKHS extension, kernel CurvSSL. Our approach retains a standard two-view encoder-projector architecture with a Barlow Twins-style redundancy-reduction loss on projected features, but augments it with a curvature-based regularizer. Each embedding is treated as a vertex whose $k$ nearest neighbors define a discrete curvature score via cosine interactions on the unit hypersphere; in the kernel variant, curvature is computed from a normalized local Gram matrix in an RKHS. These scores are aligned and decorrelated across augmentations by a Barlow-style loss on a curvature-derived matrix, encouraging both view invariance and consistency of local manifold bending. Experiments on MNIST and CIFAR-10 datasets with a ResNet-18 backbone show that curvature-regularized SSL yields competitive or improved linear evaluation performance compared to Barlow Twins and VICReg. Our results indicate that explicitly shaping local geometry is a simple and effective complement to purely statistical SSL regularizers.

2511.14282 2026-05-18 cs.LG cs.AI

Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity

Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee

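Affiliations University of Southern California; Inha University

AI summary Deep neural networks perform well on vision and language tasks, but their large parameter counts limit deployment in resource-constrained settings. To address this, the study proposes a Weight Concentration Regularizer (WCR), which during training amplifies the magnitude of a small subset of parameters while driving the rest toward zero, so that pruning mostly removes parameters that contribute little to the model's function, improving robustness under high sparsity. Experiments show that the method improves pruning robustness across tasks and architectures and is compatible with existing pruning-robust optimizers.

Details
English abstract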

Deep neural networks achieve outstanding performance across vision and language tasks, yet their large parameter counts limit deployment in resource-constrained settings. One-shot pruning reduces model size without retraining, but models trained with standard objectives often suffer substantial accuracy drops under aggressive sparsity. Prior work mitigates this drop along two directions: regularizers such as $\ell_1$ and DeepHoyer that shape the weight distribution during training, and pruning-robust optimizers such as SAM, CrAM, and S$^2$SAM that flatten the loss landscape. However, existing regularizers either shrink all weights uniformly ($\ell_1$) or induce scale-invariant sparsity (DeepHoyer), without concentrating weight energy onto a small set of informative parameters. We propose a Weight Concentration Regularizer (WCR), a training-time regularizer that amplifies the magnitude of a small subset of parameters while driving the remainder toward zero, so that magnitude pruning predominantly removes parameters with negligible functional contribution. We provide a convergence analysis and evaluate WCR on LLM fine-tuning, image classification, and medical segmentation, demonstrating consistent improvements in pruning robustness across architectures and compatibility with existing pruning-robust optimizers.
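
The motivation for concentrating weight energy can be seen with plain one-shot magnitude pruning. The sketch below does not implement WCR itself, whose regularizer the abstract does not specify. It only compares how much squared-weight "energy" survives pruning for a concentrated versus an evenly spread weight vector. The helper names and both toy vectors are made up for illustration.

```python
def magnitude_prune(weights, sparsity):
    # One-shot magnitude pruning: zero the smallest-magnitude fraction.
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def retained_energy(weights, sparsity):
    # Fraction of the squared L2 norm that survives pruning.
    total = sum(w * w for w in weights)
    kept = sum(w * w for w in magnitude_prune(weights, sparsity))
    return kept / total

# Two large weights carry almost all the energy; the rest are near zero.
concentrated = [3.0, -2.5, 0.01, -0.02, 0.01, 0.0, 0.02, -0.01, 0.01, 0.0]
# Every weight has the same magnitude, so pruning must discard real signal.
spread = [1.0, -1.0] * 5

kept_concentrated = retained_energy(concentrated, 0.8)
kept_spread = retained_energy(spread, 0.8)
```

At 80% sparsity the concentrated vector keeps nearly all of its energy while the spread vector keeps one fifth, which is the property a concentration-style regularizer tries to induce before pruning.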

2511.09884 2026-05-18 cs.AI

Quantum Artificial Intelligence for Mission-Critical Systems: Foundations, Architectural Elements, and Future Directions

Siva Sai, Rajkumar Buyya

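Affiliations Quantum Cloud Computing and Distributed Systems (qCLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne

AI summary This paper explores the potential of Quantum Artificial Intelligence (QAI) in mission-critical systems such as defense, energy management, cybersecurity, and aerospace control, where classical AI falls short on reliability, timing, explainability, and safety. It systematically analyzes how well QAI methods can meet the requirements of mission-critical systems, proposes a conceptual framework for quantum cloud resource management and scheduling, and identifies the gaps between current QAI capabilities and practical requirements. It also discusses challenges such as trainability limits, data access, and component verification, and outlines future directions in interpretability, scalability, and hardware feasibility.

Comments 15 pages, 5 figures, revised and accepted version of the paper

Details
English abstract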

Mission-critical (MC) applications such as defense operations, energy management, cybersecurity, and aerospace control require reliable, deterministic, and low-latency decision making under uncertainty. Although classical Artificial Intelligence (AI) approaches are effective, they often struggle to meet the stringent constraints of robustness, timing, explainability, and safety in MC domains. Quantum Artificial Intelligence (QAI), the fusion of artificial intelligence and quantum computing (QC), can potentially provide transformative solutions to the challenges faced by classical ML models. QAI is a broader umbrella than Quantum Machine Learning (QML) and additionally includes quantum optimization, search, and reasoning; we use QAI throughout the paper for the field at large, and QML only for learning-specific subroutines. The principal contributions of this work are: (i) a systematic survey of QAI methods analyzed through the lens of MC requirements such as certification, robustness, and timing; (ii) a conceptual quantum cloud resource management and scheduling framework with deployment assumptions, complexity analysis, and failure-mode discussion; and (iii) an identification of the gaps between current QAI capabilities and MC system requirements. We also propose a conceptual model for the management of quantum resources and the scheduling of applications driven by timeliness constraints. We discuss multiple challenges, including trainability limits, data access and loading bottlenecks, verification of quantum components, and adversarial QAI. Finally, we outline future research directions toward achieving interpretable, scalable, and hardware-feasible QAI models for MC application deployment.

2511.07720 2026-05-18 cs.RO

Empowering Robot Teleoperation: Exploring the Synergies Between Devices and Manipulator Controllers in a Comparative Study

Yuxuan Zhao, Yuanchen Tang, Jindi Zhang, Hongyu Yu

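Affiliations School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, China; Department of Mechanical and Aerospace Engineering, Hong Kong University of Science and Technology; Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS), Shenzhen, Guangdong, China

AI summary This paper studies how teleoperation devices and controller strategies work together when collecting data for robot manipulation tasks. The authors compare position-based inverse kinematic control, torque-based inverse dynamic control, and optimization-based compliant control, and analyze how the pairing of device and controller affects performance on real tasks. The results point to the importance of matching devices and controllers.

Details
English abstract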

Robot learning gives a robot system human-like intelligence, letting it autonomously acquire and adapt skills through experience and improving its flexibility and adaptability across environments. For embodied intelligence to reach a level of capability similar to that of large language models (LLMs), data quality plays a crucial role in training a foundation model with diverse robot skills. In this study, we investigate the collection of data for manipulation tasks using teleoperation devices. Different devices yield varying effects when paired with corresponding controller strategies, including position-based inverse kinematic (IK) control, torque-based inverse dynamic (ID) control, and optimization-based compliant control. Analysis of experimental results suggests the importance of the relationship between teleoperation devices and controllers for real tasks.

2511.03260 2026-05-18 cs.CV

Enhancing Medical Image Segmentation via Heat Conduction Equation

Rong Wu, Yim-Sang Yu

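Affiliations Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA

AI summary Medical image segmentation struggles to achieve efficient global context modeling and long-range dependency reasoning under limited compute. This paper proposes a hybrid architecture that combines U-Mamba with the heat conduction equation: Heat Conduction Operators in the bottleneck layers simulate frequency-domain thermal diffusion to improve semantic abstraction. On the Abdomen CT dataset the model reaches a Dice score of 0.8719, supporting the effectiveness of the approach for medical image segmentation.

Details
English abstract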

Medical image segmentation models struggle to achieve efficient global context modeling and long-range dependency reasoning under practical computational budgets. In this work, we propose a hybrid architecture that couples U-Mamba with the heat conduction equation, combining state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers, which simulate frequency-domain thermal diffusion for enhanced semantic abstraction. Experimental results show that our model attains the highest DSC (0.8719) on the Abdomen CT dataset. This suggests that blending state-space dynamics with heat-based global diffusion offers a scalable solution for medical segmentation tasks.
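
Frequency-domain thermal diffusion has a simple closed form that a short sketch can illustrate. For the heat equation u_t = k u_xx with periodic boundaries, each Fourier coefficient decays as exp(-k ω² t). The code applies this to a 1D signal with a plain discrete Fourier transform. The paper's HCOs presumably act on 2D feature maps and may use a different transform or learned diffusivity, so the 1D setting, the periodic boundary, and the function names are assumptions made for illustration.

```python
import cmath
import math

def dft(values, inverse=False):
    # Naive O(n^2) discrete Fourier transform, sufficient for a short demo.
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = [sum(values[idx] * cmath.exp(sign * 2j * math.pi * idx * m / n)
               for idx in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

def heat_diffuse(signal, k=1.0, t=1.0):
    """Evolve a periodic 1D signal under u_t = k * u_xx for time t."""
    n = len(signal)
    spectrum = dft(signal)
    damped = []
    for m, coeff in enumerate(spectrum):
        freq = m if m <= n // 2 else m - n       # signed frequency index
        omega = 2.0 * math.pi * freq / n
        # Each mode decays independently; high frequencies decay fastest.
        damped.append(coeff * math.exp(-k * omega * omega * t))
    return [v.real for v in dft(damped, inverse=True)]

impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
smoothed = heat_diffuse(impulse, k=1.0, t=1.0)
unchanged = heat_diffuse(impulse, k=1.0, t=0.0)
```

The zero-frequency mode is untouched, so the total "heat" is conserved while the impulse spreads to every position in one step. That global reach from a single operator is the property that makes this kind of diffusion attractive for context modeling.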