arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3013
2604.24493 2026-04-28 cs.CV

CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

Md Shohel Rana, Tanoy Debnath

详情
英文摘要

Face swapping aims to optimize realistic facial image generation by leveraging the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision, with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, exceeding established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.

2604.24492 2026-04-28 cs.CV cs.AI cs.ET cs.LG cs.NE

Deployment-Aligned Low-Precision Neural Architecture Search for Spaceborne Edge AI

Parampuneet Kaur Thind, Vaibhav Katturu, Giacomo Zema, Roberto Del Prete

详情
英文摘要

Designing deep networks that meet strict latency and accuracy constraints on edge accelerators increasingly relies on hardware-aware optimization, including neural architecture search (NAS) guided by device-level metrics. Yet most hardware-aware NAS pipelines still optimize architectures under full-precision assumptions and apply low-precision adaptation only after the search, leading to a mismatch between optimization-time behavior and deployment-time execution on low-precision hardware that can substantially degrade accuracy. We address this limitation by integrating deployment-aligned low-precision training directly into hardware-aware NAS. Candidate architectures are exposed to FP16 numerical constraints during fine-tuning and evaluation, enabling joint optimization of architectural efficiency and numerical robustness without modifying the search space or evolutionary strategy. We evaluate the proposed framework on vessel segmentation for spaceborne maritime monitoring, targeting the Intel Movidius Myriad X Visual Processing Unit (VPU). While post-training precision conversion reduces on-device performance from 0.85 to 0.78 mIoU, deployment-aligned low-precision training achieves 0.826 mIoU on-device for the same architecture (95,791 parameters), recovering approximately two-thirds of deployment-induced accuracy gap without increasing model complexity. These results demonstrate that incorporating deployment-consistent numerical constraints into hardware-aware NAS substantially improves robustness and alignment between optimization and deployment for resource-constrained edge Artificial Intelligence (AI).

2604.24487 2026-04-28 cs.RO

Guiding Vector Field Generation via Score-based Diffusion Model

Zirui Chen, Shiliang Guo, Shiyu Zhao

Comments 8 pages, 6 figrues, ICRA2026

详情
英文摘要

Guiding Vector Fields (GVFs) are a powerful tool for robotic path following. However, classical methods assume smooth, ordered curves and fail when paths are unordered, multi-branch, or generated by probabilistic models. We propose a unified framework, termed the Score-Induced Guiding Vector Field (SGVF), which leverages score-based generative modeling to construct vector fields directly from data distributions. SGVF learns tangent fields from point clouds with unit-norm, orthogonality, and directional-consistency losses, ensuring geometric fidelity and control feasibility. This approach removes the reliance on ad-hoc path segmentation and enables guidance along complex topologies such as branching and pseudo-manifolds. The study establishes a correspondence between score vanishing in diffusion models and GVF singularities and highlights representational capacity near sharp path curvatures. Experiments on robotic navigation in planar environments demonstrate that SGVF achieves reliable path following in scenarios where classical GVFs fail, underscoring its potential as a bridge between generative modeling and geometric control. Code and experiment video are available at https://github.com/czr-gif/Guiding-Vector-Field-Generation-via-Score-based-Diffusion-Model.

2604.24479 2026-04-28 cs.CV

Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data

Mohammadmehdi Ataei, Farzaneh Askari, Kamal Rahimi Malekshan, Pradeep Kumar Jayaraman

详情
英文摘要

Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.

2604.24473 2026-04-28 cs.AI cs.CL

Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Johannes Moll, Jannik Lübberstedt, Christoph Nuernbergk, Jacob Stroh, Luisa Mertens, Anna Purcarea, Christopher Zirn, Zeineb Benchaaben, Fabian Drexel, Hartmut Häntze, Anirudh Narayanan, Friedrich Puttkammer, Andrei Zhukov, Jacqueline Lammert, Sebastian Ziegelmayer, Markus Graf, Marion Högner, Marcus Makowski, Florian Bassermann, Lisa C. Adams, Jiazhen Pan, Daniel Rueckert, Krischan Braitsch, Keno K. Bressem

详情
英文摘要

Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.

2604.24470 2026-04-28 cs.CL

Zero-shot Large Language Models for Automatic Readability Assessment

Riley Grossman, Yi Chen

Comments Accepted to ACL 2026 (Main Conference)

详情
英文摘要

Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.

2604.24459 2026-04-28 cs.CV

TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang

Comments aaai poster; Project page: https://dongxingmao.github.io/TextGround4M.github.io/

详情
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 10, pp. 7918-7926, Mar. 2026
英文摘要

Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout -- especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.

2604.24447 2026-04-28 cs.RO cs.AI

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

Kaijun Zhou, Qiwei Chen, Da Peng, Zhiyang Li, Xijun Li, Jinyu Gu

Comments 13 pages

详情
英文摘要

Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis for low-cost VLA deployment via model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time), showing that right-sized edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware inefficiency. Finally, guided by these insights, we propose DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation. The example leaderboard website is available at: https://vla-leaderboard-01.vercel.app/.

2604.24444 2026-04-28 cs.CL

Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style

Connor Baumler, Calvin Bao, Huy Nghiem, Xinchen Yang, Marine Carpuat, Hal Daumé

Comments ACL 2026

详情
英文摘要

Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ($n=81$) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants' unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text still remains stylistically closer in style to LLM text than to participants' unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity, with post-edited text often perceived as representative of participants' personal style despite remaining detectable LLM stylistic traces.

2604.24443 2026-04-28 cs.AI

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang

Comments 11 pages. Accepted by ICLR 2026 Workshop ES-Reasoning

详情
英文摘要

Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated "Knowledge Notes." PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.

2604.24441 2026-04-28 cs.CV

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Comments Technical Report

详情
英文摘要

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.

2604.24432 2026-04-28 cs.CL cs.AI cs.IR cs.LG

Kwai Summary Attention Technical Report

Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng, Hongtao Cheng, Jian Liang, Jiangxia Cao, Kun Gai, Lingzhi Zhou, Lu Ren, Qi Zhang, Ruiming Tang, Ruitao Wang, Xinchen Luo, Yi Su, Zhiyuan Liang, Ziqi Wang, Boyang Ding, Chengru Song, Dunju Zang, Hui Wang, Jiao Ou, Jiaxin Deng, Jijun Shi, Jinghao Zhang, Junmin Chen, Lejian Ren, Minxuan Lv, Qianqian Wang, Qigen Hu, Shiyao Wang, Siyang Mao, Tao Wang, Xingmei Wang, Zhixin Ling, Ziming Li, Zixing Zhang

Comments Work in progress

详情
英文摘要

Long-context ability, has become one of the most important iteration direction of next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence and recommendation system. However, the standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, leading the training and inference costs of extremely long sequences deteriorate rapidly. Existing solutions mitigate this issue through two technique routings: i) Reducing the KV cache per layer, such as from the head-level compression GQA, and the embedding dimension-level compression MLA, but the KV cache remains linearly dependent on the sequence length at a 1:1 ratio. ii) Interleaving with KV Cache friendly architecture, such as local attention SWA, linear kernel GDN, but often involve trade-offs among KV Cache and long-context modeling effectiveness. Besides the two technique routings, we argue that there exists an intermediate path not well explored: {Maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression through a specific ratio $k$}. This $O(n/k)$ path does not pursue a ``minimum KV cache'', but rather trades acceptable memory costs for complete, referential, and interpretable retention of long distant dependency. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.

2604.24429 2026-04-28 cs.CL

A Multi-Dimensional Audit of Politically Aligned Large Language Models

Lisa Korver, Mohamed Mostagir, Sherief Reda

详情
英文摘要

As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas' Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.

2604.24426 2026-04-28 cs.CV

DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation

Md Shohel Rana, Andrew H. Sung

详情
英文摘要

AI-generated media are advancing rapidly, raising pressing concerns for content authenticity and digital trust. We introduce DYMAPIA, a multi-domain Deepfake detection framework that fuses spatial, spectral, and temporal cues to capture subtle traces of manipulation in visual data. The system builds dynamic anomaly masks by combining evidence from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, which highlight tampered regions with fine spatial accuracy. These masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification. This joint design achieves state-of-the-art results, with accuracy and F1-scores exceeding 99\% on FF++, Celeb-DF, and VDFD benchmarks, while keeping the model compact enough for real-time use. Beyond outperforming existing full-frame and multidomain detectors, DYMAPIA demonstrates deployment readiness for time-critical forensic tasks, including media verification, misinformation defense, and secure content filtering.

2604.24419 2026-04-28 cs.CV

BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities

Akash Sharma, Chinmay Mhatre, Sankalp Gawali, Ruthvik Bokkasam, Brij Sharma, Vishwajeet Pattanaik, Punit Rathore, Raghu Krishnapuram, Vijay Gopal Kovvali, Anirban Chakraborty, Yogesh Simmhan

Comments Accepted at CVPR 2026 Findings Track. To appear in the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

详情
英文摘要

Robust vehicle detection from fixed CCTV cameras is critical for Intelligent Transportation Systems. Yet existing benchmarks predominantly feature relatively homogeneous, highly organized traffic patterns captured from ego-centric driving perspectives or controlled aerial views. This regional and sensor view bias creates a significant gap. Models trained on datasets such as UA-DETRAC and COCO struggle to generalize to the dense, heterogeneous, disorganized traffic conditions observed in rapidly developing urban centers in emerging economies. To address this limitation, we introduce BMD-45, a large-scale dataset comprising 480K bounding boxes annotated over 45K images captured from over 3.6K operational Safe City CCTV cameras. BMD-45 contains 14 fine-grained vehicle categories, including region-specific modes such as auto-rickshaws and tempo travellers, which are not present in existing benchmarks. The dataset captures real-world deployment challenges, including extreme viewpoint variation, occlusion, and vehicle density . We establish comprehensive baselines using state-of-the-art detectors and reveal a striking domain gap: models fine-tuned on UA-DETRAC achieve only 33.6% mAP@0.50:0.95, compared to 83.8% when trained in-domain on BMD-45, representing a 2.5x improvement that persists even when accounting for novel vehicle classes. This performance gap underscores the critical need for geographically diverse traffic benchmarks and establishes BMD-45 as a baseline for developing robust perception systems in underrepresented urban environments worldwide. The dataset is available at: https://huggingface.co/datasets/iisc-aim/BMD-45.

2604.24416 2026-04-28 cs.CL cs.AI cs.LG

Scaling Properties of Continuous Diffusion Spoken Language Models

Jason Ramapuram, Eeshan Gunesh Dhekane, Amitis Shidani, Dan Busbridge, Bogdan Mazoure, Zijin Gu, Russ Webb, Tatiana Likhomanenko, Navdeep Jaitly

详情
英文摘要

Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether continuous diffusion (CD) SLM is more viable. To quantify the SLMs linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.

2604.24407 2026-04-28 cs.CV

AD-Relight: Training-Free Banner Relighting via Illumination Translation with Diffusion Priors

Rameshwar Mishra, A V Subramanyam

详情
英文摘要

The recent surge in content consumption through streaming services has driven a growing demand for personalized content. Personalized advertisements (ads) play a crucial role in enhancing both user engagement and ad effectiveness. A key aspect of ad personalization involves replacing existing regions in a frame with custom, Photoshop-generated banners. However, existing ad-placement pipelines typically rely on simple geometric warping, ignoring the scene's underlying lighting conditions. Similarly, state-of-the-art diffusion-based object insertion and relighting models struggle to accurately relight these newly inserted banners, as they are not trained on ad-banner data, and training such a model for ad banners would require millions of images. This highlights the need for an effective relighting framework that enables seamless integration of custom banners into the original scene. Motivated by this, we present AD-Relight, a novel multi-stage training-free framework that adapts a diffusion-based relighting model at test time to relight newly added Photoshop-generated ad banners. Through extensive evaluation, we demonstrate that AD-Relight outperforms both relighting baselines and existing ad-placement methods based on simple warping. User studies further show that participants consistently prefer the outputs of AD-Relight over those of prior approaches.

2604.24403 2026-04-28 cs.LG cs.RO

An Automatic Ground Collision Avoidance System with Reinforcement Learning

Seyyid Osman Sevgili, Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Ümit Can Bekar

详情
英文摘要

This article evaluates an artificial intelligence (AI)-based Automatic Ground Collision Avoidance System (AGCAS) designed for advanced jet trainers to enhance operational effectiveness. In the continuously evolving field of aerospace engineering, the integration of AI is crucial for advancing operations with improved timing constraints and efficiency. Our study explores the design process of an AI-driven AGCAS, specifically tailored for advanced jet trainers, focusing on addressing the AGCAS problem within a limited observation space. The system utilizes line-of-sight queries on a terrain server to ensure precise and efficient collision avoidance. This approach aims to significantly improve the safety and operational capabilities of advanced jet trainers.

2604.24401 2026-04-28 cs.SD cs.AI cs.CL eess.AS

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li, Ke-Han Lu, Hung-yi Lee

Comments 6 pages, 3 figures, 5 tables

详情
英文摘要

Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.

2604.24396 2026-04-28 cs.CV cs.AI

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang

Comments 9 pages, 8 figures, Findings of ACL 2025

详情
英文摘要

Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.

2604.24395 2026-04-28 cs.AI

Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

Byeonggeuk Lim, JungMin Yun, Junehyoung Kwon, Kyeonghyun Kim, YoungBin Kim

Comments Accepted to ACL 2026

详情
英文摘要

Large Vision-Language Models (LVLMs) frequently suffer from hallucinations. Existing preference learning-based approaches largely rely on proprietary models to construct preference datasets. We identify that this reliance introduces a distributional mismatch between the proprietary and target models that hinders efficient alignment. To address this, we propose Alignment via VErified Self-correction DPO (AVES-DPO), a framework that aligns LVLMs using in-distribution data derived from the model's intrinsic knowledge. Our approach employs a consensus-based verification mechanism to diagnose diverse hallucinations and guides the model to self-correct, thereby generating preference pairs strictly compatible with its internal distribution. Extensive experiments demonstrate that AVES-DPO surpasses existing baselines in hallucination mitigation while requiring only 5.2k samples.

2604.24393 2026-04-28 cs.LG cs.CV

Complexity of Linear Regions in Self-supervised Deep ReLU Networks

Mufhumudzi Muthivhi, Terence L. van Zyl

Comments Accepted for publication in 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Findings Track (CVPRF)

详情
英文摘要

There has been growing interest in studying the complexity of Rectified Linear Unit (ReLU) based activation networks. Recent work investigates the evolution of the number of piecewise-linear partitions (linear regions) that are formed during training. However, current research is limited to examining the complexity of models trained in a supervised way. Self-Supervised Learning (SSL) differs in that it directly optimises the representation space using a loss function to enhance the model's performance across multiple downstream tasks. This study investigates the local distribution of linear regions produced by SSL models. We demonstrate that the evolution of linear regions correlates with the representation quality by utilising SplineCam to extract two-dimensional polytopes near the data distribution. We track the number, area, eccentricity, and boundaries of regions throughout training. The study compares supervised, contrastive, and self-distillation methods over two standard benchmark datasets, MNIST and FashionMNIST. The analysis of the experimental results shows that self-supervised methods create substantially fewer regions to achieve comparable accuracy to supervised models. Contrastive methods rapidly expand regions over time, whereas self-distillation methods tend to consolidate by merging neighbouring regions. Lastly, we can detect representation collapse early within the geometric space of linear regions. Our analysis suggests that polytopal metrics can serve as reliable indicators of representation quality and model performance.

2604.24391 2026-04-28 cs.RO

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

Zihao Zheng, Xingyue Zhou, Zhihao Mao, Songyu Sun, Lingyue Zhang, Yulong Ao, Yupu Feng, Qiongqiong Zhang, Yonghua Lin, Xiang Chen

详情
英文摘要

Vision-Language-Navigation (VLN) models exhibit excellent navigation accuracy but incur high computational overhead. Token caching has emerged as a promising training-free strategy to reduce this cost by reusing token computation results; however, existing token caching approaches rely on visual domain methods for cacheable token selection, leading to challenges when adapted to VLN models. 1) Visual domain methods become invalid when there is viewpoint migration. 2) Visual domain methods neglect critical edge information without the aid of additional algorithms. 3) Visual domain methods overlook the temporal variation of scenarios and lack adjustability in cache budgets. In this paper, we develop detailed analyses and find that the impacts of these challenges exhibit invariance and analyzability in the frequency domain. Based on these, we propose a frequency-guided token caching framework, called FreqCache. Utilizing the inherent properties of the frequency domain, FreqCache achieves optimal token cache establishment, refreshment, and adaptive adjustment. Experiments show that FreqCache achieves 1.59x speedup with ignorable overhead, showing the effect of integrating frequency domain methods in VLN token caching.

2604.24386 2026-04-28 cs.SD eess.AS

An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

Leekyung Kim, Jonghun Park

Comments accepted to ICASSP 2026

详情
英文摘要

Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are unpopular in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.

2604.24384 2026-04-28 cs.RO

Pedestrians play chicken with an autonomous vehicle

Rakshit Soni, Charles Fox

详情
英文摘要

Automated vehicles (AVs) are commonly programmed to yield unconditionally to pedestrians in the interest of safety. However, this design choice can give rise to the Freezing Robot Problem in which pedestrians learn to assert priority at every interaction, causing vehicles to stall and make no progress. The game theoretic Sequential Chicken model has shown that, like human drivers, AVs can resolve this problem by trading credible threats of very small risks of collision or larger risks of less severe invasion of personal space against the value of time due to yielding delays. This paper presents the first demonstration and evaluation of this approach using a real AV with human subjects and shows that pedestrian behavior under experimentally constrained safety conditions can be well fitted by Sequential Chicken, with a low time value of collision, suggestive of their planning to avoid proxemic personal space penalties as well as actual collisions.

2604.24380 2026-04-28 cs.CL

Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata

Comments Accepted at International Journal of Computer Vision (IJCV) 2026

详情
英文摘要

While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.

2604.24379 2026-04-28 cs.AI cs.LG cs.SC

Certified geometric robustness -- Super-DeepG

Noémie Cohen, Mélanie Ducoffe, Christophe Gabreau, Claire Pagetti, Xavier Pucel

Comments ICCPS / HSCC 2026, CPS IoT Week, May 2026, Saint Malo (Palais du Grand Large), France

详情
英文摘要

Safety-critical applications are required to perform as expected in normal operations. Image processing functions are often required to be insensitive to small geometric perturbations such as rotation, scaling, shearing or translation. This paper addresses the formal verification of neural networks against geometric perturbations on their image dataset. Our method Super-DeepG improves the reasoning used in linear relaxation techniques and Lipschitz optimization, and provides an implementation that leverages GPU hardware. By doing so, Super-DeepG achieves both precision and computational efficiency of robustness certification, to an extent that outperforms prior work. Super-DeepG is shared as an open-source tool on GitHub.

2604.24376 2026-04-28 cs.CL

Learning Evidence of Depression Symptoms via Prompt Induction

Eliseo Bao, Anxo Perez, David Otero, Javier Parapar

Comments Accepted at SIGIR 2026

详情
英文摘要

Depression places substantial pressure on mental health services, and many people describe their experiences outside clinical settings in high-volume user-generated text (e.g., online forums and social media). Automatically identifying clinical symptom evidence in such text can therefore complement limited clinical capacity and scale to large populations. We address this need through sentence-level classification of 21 depression symptoms from the BDI-II questionnaire, using BDI-Sen, a dataset annotated for symptom relevance. This task is fine-grained and highly imbalanced, and we find that common LLM approaches (zero-shot, in-context learning, and fine-tuning) struggle to apply consistent relevance criteria for most symptoms. We propose Symptom Induction (SI), a novel approach which compresses labeled examples into short, interpretable guidelines that specify what counts as evidence for each symptom and uses these guidelines to condition classification. Across four LLM families and eight models, SI achieves the best overall weighted F1 on BDI-Sen, with especially large gains for infrequent symptoms. Cross-domain evaluation on an external dataset further shows that induced guidelines generalize across other diseases shared symptomatology (bipolar and eating disorders).

2604.24371 2026-04-28 cs.LG cs.AI

PathMoG: A Pathway-Centric Modular Graph Neural Network for Multi-Omics Survival Prediction

Di Wang, Chupei Tang, Junxiao Kong, Jixiu Zhai, Moyu Tang, Tianchi Lu

Comments 9 pages, 5 figures, 3 tables. Source code available at https://github.com/wangzoyou/pathmog

详情
英文摘要

Cancer survival prediction from multi-omics data remains challenging because prognostic signals are high-dimensional, heterogeneous, and distributed across interacting genes and pathways. We propose PathMoG, a pathway-centric modular graph neural network for multi-omics survival prediction. PathMoG reorganizes genome-scale inputs into 354 KEGG-informed pathway modules, introduces a Hierarchical Omics Modulation module to condition gene-expression representations on mutation, copy number variation, pathway, and clinical context, and uses dual-level attention to capture both intra-pathway driver signals and inter-pathway clinical relevance. We evaluated PathMoG on 5,650 patients across 10 TCGA cancer types and observed consistent improvements over representative survival baselines. The framework further provides gene-level, pathway-level, and patient-level interpretability, supporting biologically grounded and clinically relevant risk stratification.

2604.24370 2026-04-28 cs.CV

Multispectral airborne laser scanning dataset for tree species classification: MS-ALS-SPECIES

Matti Hyyppä, Klaara Salolahti, Eric Hyyppä, Xiaowei Yu, Josef Taher, Leena Matikainen, Matti Lehtomäki, Paula Litkey, Teemu Hakala, Harri Kaartinen, Juha Hyyppä, Antero Kukko

详情
英文摘要

The shift from stand-level to individual-tree-level forest assessments supports improved biodiversity mapping, particularly in boreal ecosystems where tree species like aspen (Populus tremula L.) play a keystone role. While airborne laser scanning (ALS) is the standard for such inventories, a major limitation is the small number of publicly available ALS datasets containing high-quality, field-validated reference data. Furthermore, open multispectral ALS datasets with high-quality field reference data are completely lacking despite the potential of multispectral ALS data for tree species classification. This paper presents and details an open multispectral ALS dataset used in a recent international benchmarking study of machine learning and deep learning methods for tree species classification by Taher et al. (2026). The dataset comprises 6326 segment-level point clouds of individual trees representing nine species in Southern Finland. The point cloud data has been acquired using two multispectral laser scanning systems each operating at three laser wavelengths: a helicopter-borne system (HeliALS) with a point density exceeding 1000 points/m$^2$ and an Optech Titan system with approximately 35 points/m$^2$. We provide a detailed description of field data collection techniques developed in the study to facilitate the collection of high-quality ground truth data in an efficient and scalable manner. Additionally, our article presents new analyses on species classification using multispectral data building upon the initial findings of Taher et al. (2026). Furthermore, we study the relation between classification accuracy and tree height to highlight the versatility of the open dataset and to demonstrate the advantage of the point transformer model for small trees and minority species.