arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2605.00905 2026-05-05 cs.CL cs.AI cs.CV

DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA

Anirudh Iyengar Kaniyar Narayana Iyengar, Tampu Ravi Kumar, Manan Suri, Raviteja Bommireddy, Dinesh Manocha, Puneet Mathur, Vivek Gupta

Comments 10 pages, 4 figures

Abstract

Diagram question answering (Diagram QA) requires reasoning-level attribution that links each question-answer pair to all visual regions needed to derive the answer, rather than only the region containing the final response. Creating such structured evidence across diagrams, charts, maps, circuits, and infographics is time-consuming, and existing annotation tools tightly couple their interfaces to dataset-specific formats. We present DIAGRAMS, a lightweight, schema-driven review framework that decouples interface logic from dataset-specific JSON structures through an internal meta-schema and dataset adapters. Given an image and QA pair with optional candidate regions, the system performs QA-conditioned evidence selection and proposes the regions required for reasoning. When QA pairs or candidate regions are missing, it generates them and supports human verification and refinement. Across six Diagram QA datasets, model-suggested evidence achieves 85.39% precision and 75.30% recall against reviewer-final selections (micro-averaged). These results indicate that the review-first framework reduces manual region creation while maintaining high agreement with final reasoning-level attributions. We release a public demo and installable package to support dataset auditing, grounded supervision creation, and grounded evaluation.
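
The micro-averaged precision and recall quoted in the abstract above can be illustrated with a minimal sketch. It assumes that suggested and reviewer-final regions are compared by exact region ID, which is an assumption made here for illustration; the paper may match regions differently (for example by overlap). The function name and the sample IDs are hypothetical.

```python
def micro_precision_recall(pairs):
    """Micro-averaged precision/recall over (suggested, final) region-ID sets.

    Counts are pooled across all QA items before dividing, so items with
    many regions weigh more than items with few.
    Assumption: regions match only when their IDs are identical.
    """
    tp = fp = fn = 0
    for suggested, final in pairs:
        tp += len(suggested & final)   # suggested and kept by the reviewer
        fp += len(suggested - final)   # suggested but rejected
        fn += len(final - suggested)   # added by the reviewer, not suggested
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical example: two QA items with made-up region IDs.
items = [
    ({"r1", "r2", "r3"}, {"r1", "r2"}),
    ({"r4"}, {"r4", "r5"}),
]
precision, recall = micro_precision_recall(items)
```

Pooling counts before dividing is what distinguishes micro-averaging from macro-averaging, where per-item scores would be averaged instead.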

2605.00904 2026-05-05 cs.CV

Robustness of Transformer-Based Fluence Map Prediction Under Clinically Realistic Perturbations

Ujunwa Mgboh, Rafi Ibn Sultan, Joshua Kim, Kundan Thind, Dongxiao Zhu

Comments Accepted by The Artificial Intelligence in Medicine (AIME) 2026 Conference

Abstract

Learning-based fluence map prediction offers a fast alternative to iterative inverse planning in intensity-modulated radiation therapy (IMRT), but its robustness under realistic distribution shifts remains unclear. We study a two-stage transformer pipeline that maps anatomy (CT and contours) to dose and then to beamlet fluence maps. We compare fluence-stage transformer backbones with hierarchical, global, and hybrid attention, trained with a physics-informed loss enforcing energy consistency. Robustness is evaluated under geometric perturbations, radiometric noise, reduced training data, and domain shifts using a prostate IMRT dataset, with additional evaluation of the dose stage on public datasets. Results show smooth degradation under moderate perturbations but sharp failures under severe rotations and noise. Hierarchical transformers (e.g., SwinUNETR) exhibit slower growth in upper-quartile energy error, indicating improved robustness. We further show that SSIM alone fails to capture clinically relevant errors, highlighting the need for physics-informed evaluation.

2605.00903 2026-05-05 cs.CV

A Light Weight Multi-Features-View Convolution Neural Network For Plant Disease Identification

Muhammad Kaleem Ullah Khan

Abstract

Agriculture is a key sector of the economies of developing countries. It serves as a primary source of income and employment for rural populations. However, each year, a large portion of crops is lost to pests and diseases. Timely prediction of plant diseases is crucial to sustainable, high-quality agricultural production. Detection of plant diseases through conventional methods is both labour-intensive and time-consuming. Researchers have developed automated techniques based on image classification for this purpose. The most accurate methods are based on deep convolutional neural networks, which are computationally intensive, with many layers and millions of trainable parameters. In resource-constrained settings, especially in rural areas, it is difficult to deploy deep convolutional neural network models for efficient plant disease identification. To address these issues, an efficient and lightweight Multi-View Convolutional Neural Network is proposed. The additional feature views help the proposed model identify plant diseases accurately and efficiently with fewer parameters. The proposed model is tested on the benchmark PlantVillage dataset and achieves an improvement of $2.9\%$ in classification accuracy compared to the baseline convolutional neural network model, which was trained only on Red, Green, and Blue (RGB) plant images. Compared with state-of-the-art deep convolutional neural network models, the proposed model is less computationally expensive and achieves comparable accuracy for plant disease identification on the PlantVillage dataset.

2605.00902 2026-05-05 cs.CV cs.IR

Validation of Whole-Slide Foundation Models for Image Retrieval in TCGA Data

Tianhao Lei, Parsa Esmaeilkhani, Saghir Alfasly, Wataru Uegami, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H. R. Tizhoosh

Abstract

Foundation models are reshaping computational histopathology, yet their value for whole-slide image retrieval relative to strong patch-based and supervised aggregation baselines remains unclear. We benchmarked ten pipelines on 9,387 diagnostic slides spanning 17 organs and 60 diagnoses from The Cancer Genome Atlas (TCGA) using patient-level leave-one-patient-out evaluation. Methods included four pre-trained slide foundation models, a supervised attention-based multiple instance learning (ABMIL) aggregator on patch embeddings, and patch-level retrieval across five sampling densities. Performance varied more across organs and diagnoses than across architectures. Although the slide foundation model TITAN achieved the strongest overall results, its advantage was modest; ABMIL and patch-based methods reached comparable Top-1 and Top-3 accuracy, with no model consistently dominant. Morphologically distinctive entities approached ceiling performance, while rare, heterogeneous, and closely related subtypes remained challenging. Misclassifications aligned with organs exhibiting known inter-observer variability, suggesting an intrinsic ceiling for morphology-only retrieval. Performance was driven primarily by patch-level feature representations, with limited benefit from slide-level aggregation, indicating aggregation may be unnecessary in many settings. These findings argue against a universally optimal architecture and instead support organ-resolved benchmarking, diagnosis-aware or ensemble strategies, stronger feature representations, and multimodal retrieval frameworks. Notably, even the best model achieved only $\approx 68\% \pm 21\%$ retrieval accuracy on TCGA, and some subtypes showed $0\%$ accuracy across all methods, highlighting fundamental limitations of morphology-based representations and the need for substantial progress before reliable clinical deployment.

2605.00901 2026-05-05 cs.CV cs.AI

RA-CMF: Region-Adaptive Conditional MeanFlow for CT Image Reconstruction

Md Shifatul Ahsan Apurba, Md Selim, Jin Chen

Abstract

The use of CT imaging is important for screening, diagnosis, therapy planning, and prognosis of lung cancers. Unfortunately, due to differences in imaging protocols and scanner models, CT images acquired by different means may show large differences in noise statistics, contrast, and texture. In this study, we develop a novel conditional MeanFlow pipeline for CT image reconstruction. We introduce a conditional MeanFlow network that models the enhancement trajectory by predicting image-conditioned flow fields given intermediate image states. The image enhancement network is trained with a MeanFlow consistency loss along with the image reconstruction loss. In order to provide an adaptive refinement process in terms of spatial location of enhancements, we integrate a regional reinforcement learning-driven policy network into our approach. The policy network receives information about the MeanFlow rollouts and provides predictions in terms of tile-wise refinement budgets, stopping criteria, and total budget allocation of enhancement processes. Our policy network is trained through reinforcement learning in a policy gradient framework, where the goal of the training reward is to maximize improvement of enhancements while minimizing unnecessary computations and avoiding instabilities. In this way, our approach combines conditional flow-based enhancement with reinforcement learning-based spatial enhancement control. This allows our approach to focus more attention on enhancing difficult areas while stabilizing areas already showing sufficient quality. Our results show high accuracy in the tumor ROI, with the average radiomic feature CCC being 0.96, an average PSNR of 31.30 $\pm$ 4.16, and average SSIM of 0.94 $\pm$ 0.07. Moreover, there is an improvement in the overall quality of images, with an average PSNR of 34.23 $\pm$ 1.71 and average SSIM of 0.95 $\pm$ 0.01.
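
For readers unfamiliar with two of the metrics reported above, the following is a minimal sketch of PSNR and Lin's concordance correlation coefficient (CCC) using their standard textbook definitions. The flat-list inputs, the `data_range` default, and the function names are assumptions for illustration; the paper's exact evaluation protocol is not specified in the abstract.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length intensity lists.

    Assumption: intensities are scaled so the maximum possible value is
    `data_range`.
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson correlation, CCC also penalises shifts in mean and
    scale, so a constant offset between x and y lowers the score.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical feature values: identical lists agree perfectly,
# while a constant offset of 1 reduces the CCC.
perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```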

2605.00899 2026-05-05 cs.CV cs.LG

LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images

James Flora, Kowshik Thopalli, Akshay R. Kulkarni, Weng-Keen Wong, Shusen Liu

Comments 17 pages, 6 figures

Abstract

We present LatentDiff, a scalable framework for semantic dataset comparison that operates directly in the latent space of pretrained vision encoders. By combining sparse autoencoder-based divergence testing with density ratio estimation, LatentDiff identifies interpretable semantic differences between datasets at a fraction of the computational cost of caption-based alternatives. We also introduce Noisy-Diff, a benchmark capturing realistic sparse distribution shifts that cause existing methods to struggle. Experiments demonstrate that LatentDiff achieves superior accuracy while remaining robust to settings where an extremely small fraction of images (from 5% to <1%) differ semantically.

2605.00896 2026-05-05 cs.CV cs.AI cs.LG

When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping

Prabhjot Singh, Manmeet Singh

Comments 9 pages, 5 figures, 2 tables. Oral presentation, ML4RS Workshop @ ICLR 2026

Abstract

Operational phase unwrapping is the primary computational bottleneck in InSAR-based volcanic and seismic monitoring. We challenge the industry trend of adopting high-complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics-constrained geophysical regression. We present the first large-scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant "complexity penalty": a vanilla U-Net (7.76M parameters) achieves $R^2=0.834$ and RMSE $= 1.01$ cm, outperforming 11.37M-parameter attention-based models by 34% in $R^2$ and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high-frequency artifacts ($>0.3$ cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a $2.5\times$ speedup), the vanilla U-Net is the only candidate to comfortably meet the sub-100ms requirement for operational early-warning systems. This work bridges the "publication-to-practice" gap by proving that convolutional locality outperforms modern complexity for smooth-field regression, advocating for physics-informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When-Less-is-More-InSAR-Phase-Unwrapping
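
As background on the task itself, the sketch below shows what wrapping does to a phase signal and how the classical one-dimensional integration approach (often attributed to Itoh) undoes it. This is not the learned U-Net approach described in the abstract; it is a textbook illustration that assumes neighbouring samples differ by less than pi radians, and the function names are hypothetical.

```python
import math


def wrap(phase):
    """Wrap a phase value in radians into the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def unwrap_1d(wrapped):
    """Classical 1-D unwrapping: integrate wrapped neighbour differences.

    Assumption: the true phase changes by less than pi between adjacent
    samples, otherwise the reconstruction picks the wrong 2*pi multiple.
    """
    unwrapped = [wrapped[0]]
    for previous, current in zip(wrapped, wrapped[1:]):
        unwrapped.append(unwrapped[-1] + wrap(current - previous))
    return unwrapped


# Hypothetical smooth ramp that exceeds pi and therefore wraps.
true_phase = [0.5 * i for i in range(20)]
wrapped_phase = [wrap(p) for p in true_phase]
recovered = unwrap_1d(wrapped_phase)
```

In two dimensions with noise the integration path matters and errors propagate, which is one reason the problem is treated as a regression task in learned approaches.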

2605.00894 2026-05-05 cs.CV

Dino-NestedUNet: Unlocking Foundation Vision Encoders for Pathology Tumor Bulk Segmentation via Dense Decoding

Tianyang Wang, Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Usman Afzaal, Lina Gokhale, Charles Rabolli, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi

Abstract

Vision foundation models (VFMs), such as DINOv3, provide rich semantic representations that are promising for computational pathology. However, many current adaptations pair frozen VFMs with lightweight decoders, creating a capacity mismatch that often limits boundary fidelity for infiltrative tumor bulk segmentation. This paper presents Dino-NestedUNet, a framework that couples a pre-trained DINOv3 encoder with a Nested Dense Decoder. Instead of sparse skip connections and linear upsampling, the proposed decoder forms a dense grid of intermediate pathways to enable continuous feature reuse and multi-scale recalibration, aligning high-level semantics with low-level morphological textures during reconstruction. We evaluate Dino-NestedUNet on three histopathology cohorts (multi-center CHTN, institutional OSU, and CAMELYON16) and observe consistent improvements over UNet++ and standard Dino-UNet variants, particularly under cross-domain shift. To further assess external generalization, we perform zero-shot evaluation by training on CHTN and directly testing on unseen TIGER WSIBULK and OSU CRC cohorts without fine-tuning. These results suggest that dense decoding is a key ingredient for unlocking foundation encoders in boundary-sensitive pathology segmentation.

2605.00893 2026-05-05 cs.CV cs.AI cs.IR

Retrieval-Guided Generation for Safer Histopathology Image Captioning

Md. Enamul Hoq, Wataru Uegami, Saghir Alfasly, Ghazal Alabtah, Sahar Rahimi Malakshan, Armita Kazemi, Alex T. Schmitgen, Fred Prior, H. R. Tizhoosh

Abstract

Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency, all of which are serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On the ARCH histopathology dataset, RGG improves semantic alignment with ground truth, achieving cosine similarity of $\approx$0.60 versus $\approx$0.47 from MedGemma, with non-overlapping confidence intervals indicating a robust gain. A pathologist-led qualitative review shows better preservation of morphology-relevant terminology and fewer unsupported diagnoses, while revealing failure modes such as concept mixing and inherited over-specific labeling. Overall, retrieval-guided captioning offers a more transparent and reliable approach with clearer opportunities for auditing than fully generative methods.
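
The retrieval step described above can be sketched as a nearest-neighbour lookup by cosine similarity. This is a minimal illustration under stated assumptions: the embeddings, the archive layout, and the function names are hypothetical, and the summarization of the retrieved expert text (the second half of RGG) is deliberately left out.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_captions(query_embedding, archive, k=2):
    """Return the expert captions of the k most similar archived cases.

    `archive` is a list of (embedding, caption) pairs; this layout is an
    assumption made for the sketch.
    """
    ranked = sorted(
        archive,
        key=lambda item: cosine(query_embedding, item[0]),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]


# Hypothetical three-case archive with placeholder captions.
archive = [
    ([1.0, 0.0, 0.0], "caption A"),
    ([0.9, 0.1, 0.0], "caption B"),
    ([0.0, 1.0, 0.0], "caption C"),
]
top_captions = retrieve_captions([1.0, 0.0, 0.0], archive, k=2)
```

Because every output sentence traces back to a retrieved expert caption, the retrieved list itself serves as an audit trail, which is the transparency argument the abstract makes.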

2605.00892 2026-05-05 cs.CV

When To Adapt? Adapting the Model or Data in Federated Medical Imaging

Chamani Shiranthika, Parvaneh Saeedi

Comments 10 pages, Accepted for oral presentation and proceedings of 24th International Conference on Artificial Intelligence in Medicine, Ottawa, Canada, July 7-10, 2026

Abstract

Federated learning enables collaborative model training across medical institutions without sharing raw data, but its performance is often limited by domain heterogeneity across clients. Existing approaches to address this challenge fall into two main paradigms: model-side personalization, which adapts model parameters to each client, and data-side harmonization, which reduces inter-client variation at the input level. Despite their widespread use, these strategies have not been systematically compared. In this work, we conduct a comprehensive study across six medical imaging settings (colon polyp, skin lesion, and breast tumor segmentation, and tuberculosis CXR, brain tumor, and breast tumor classification), covering diverse types of domain shift. We evaluate a broad set of state-of-the-art harmonization and personalization methods under a unified framework. Our results reveal a conditional trade-off driven by the nature of heterogeneity: harmonization is more effective when variation is primarily appearance-based (e.g., CXR classification), while personalization performs better when differences are structural (e.g., colon polyp segmentation). When inter-client variation is limited, both strategies perform similarly. These findings demonstrate that the effectiveness of adaptation in federated medical imaging depends on the type and magnitude of domain shift rather than the strategy alone. We provide practical guidelines for selecting between harmonization and personalization and highlight directions for future hybrid approaches that combine both paradigms. Code is available at https://github.com/ChamaniS/WhenToAdapt.

2605.00891 2026-05-05 cs.CV cs.AI

X2SAM: Any Segmentation in Images and Videos

Hao Wang, Limeng Qiao, Chi Zhang, Lin Ma, Guanglu Wan, Xiangyuan Lan, Xiaodan Liang

Comments Technical Report

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.

2605.00890 2026-05-05 cs.CV cs.AI cs.LG

Skeleton-Based Posture Classification to Promote Safer Walker-Assisted Gait in Older Adults

Sergio D. Sierra M., Monica Sinha, Marcela Múnera, Carlos A. Cifuentes

Abstract

Falls among older adults are a significant public health concern, leading to severe injuries, loss of independence, and increased healthcare costs. This study evaluates the effectiveness of various models, including a Geometric approach, XGBoost, SVM, and several deep learning architectures, in classifying walker usage, standing vs. sitting, and posture for smart walkers. The Geometric approach and XGBoost were the top performers. XGBoost achieved near-perfect training accuracy in binary classification tasks, with 99.84% for walker choice and 99.69% for standing vs. sitting. For posture classification, the Geometric approach attained 89.9% accuracy for 8 postures, and XGBoost obtained 99.24% during training for 17 postures. Deep learning models such as the 4-layer CNN and Encoder-Decoder CNN also demonstrated strong performance in binary classification, with accuracies above 98%. This study underscores the potential of machine learning to enhance human-robot interaction in smart walkers, particularly for fall prevention.

2605.00889 2026-05-05 cs.CV cs.LG

On the explainability of max-plus neural networks

Ikhlas Enaieh, Olivier Fercoq, García Ángel

Comments IEEE International Symposium on Computer-Based Medical Systems (CBMS 2026), Jun 2026, Limassol, Cyprus

Abstract

We investigate the explainability properties of the recently proposed linear-min-max neural networks. At initialization, they can be interpreted as k-medoids with the infinity norm as a distance. Then, they are trained using subgradient descent to better fit the data. The model has been shown to be a universal approximator. Yet, we can trace the decision process because a single most activated neuron is responsible for the value of the output. Using this property, we designed a pixel fragility measure that determines whether changes to a single pixel may be responsible for a change in the classification output. Experiments on the PneumoniaMNIST dataset show that this explanation for the output of the neural network compares favorably to SHAP and Integrated Gradients.

2605.00888 2026-05-05 cs.CV cs.AI cs.LG eess.IV eess.SP

Selective Correlation Based Knowledge Distillation for Ground Reaction Force Estimation

Eun Som Jeon, Jisoo Lee, Huisu Lim, Omik M. Save, Hyunglae Lee, Pavan Turaga

Journal ref Measurement, 2026

Abstract

Wearable sensor-based human gait analysis holds great promise in healthcare, rehabilitation, clinical diagnosis and monitoring, and sports activities. Specifically, ground reaction force (GRF) provides essential insights into the body's interaction with the ground during movement and is typically measured using instrumented treadmills equipped with force plates. However, such equipment is expensive and restricted to laboratory environments. To enable a more portable solution, wearable insole sensors have been used to measure GRF. These sensors, however, are prone to noise and external interference, which reduces measurement accuracy. Deep learning methodologies could be adopted to address these issues, but they often require significant computing resources to achieve high accuracy, limiting their applicability for real-time analysis on portable devices. To overcome these limitations, we propose Selective Correlation Based Knowledge Distillation (SCKD) for estimating GRF from data collected by insole sensors. Our proposed method utilizes selected features considering temporal characteristics in the process of extracting correlation maps for knowledge transfer, enhancing interpretability and mitigating issues in high dimensional data processing. We demonstrate the effectiveness of the compact models generated by our distillation framework through comparison with existing methods. Various configurations of teacher-student architectures and training approaches are examined based on multiple evaluation criteria, utilizing data collected at different walking speeds and with different window sizes. Experimental results confirm that our approach outperforms existing methods in estimating GRF from wearable insole sensor data. Therefore, our approach offers a reliable and resource-efficient solution for human gait analysis.

2605.00887 2026-05-05 cs.CV

SparseContrast: Dynamic Sparse Attention for Efficient and Accurate Contrastive Learning in Medical Imaging

Paarth Prasad, Ruchika Malhotra

Abstract

We propose SparseContrast, a new framework that merges dynamic sparse attention with contrastive learning for medical imaging, with a focus on chest X-ray disease detection in low-data settings. Traditional contrastive learning methods rely on dense attention mechanisms, which are computationally expensive and often process redundant regions in medical images. To resolve this, SparseContrast introduces a sparse attention mechanism that selectively concentrates on diagnostically pertinent areas, markedly decreasing computational burden without compromising accuracy. The framework adaptively trims attention maps in the training phase, directed by a compact saliency predictor which concurrently optimizes sparsity and feature quality. This method not only speeds up training and inference by as much as 40% relative to dense attention benchmarks but also boosts diagnostic accuracy by focusing on areas of clinical importance. Moreover, the approach remains indifferent to the selection of backbone architecture, which permits its application to both convolutional and transformer-based models. Experiments show SparseContrast attains comparable or better performance in disease identification tasks with greater efficiency relative to current approaches. The proposed framework delivers a practical approach for implementing contrastive learning in medical imaging settings with limited resources, where computational efficiency and diagnostic accuracy are paramount.

2605.00886 2026-05-05 cs.CV

Selective Attention-Based Network for Robust Infrared Small Target Detection

Yingming Zhang, Wuqi Su, Qing Xiao, Yonggang Yang

Abstract

Infrared small target detection (IRSTD) plays a pivotal role in a broad spectrum of mission-critical applications, including maritime surveillance, military search and rescue, early warning systems, and precision-guided strikes, all of which demand the precise identification of dim, sub-pixel targets amid highly cluttered infrared backgrounds. Despite significant progress driven by deep learning methods, fundamental challenges persist: infrared small targets occupy extremely limited spatial extents (often only a few pixels), exhibit low signal-to-clutter ratios, and are easily confused with structurally complex backgrounds that frequently induce false alarms. Existing encoder-decoder architectures suffer from two key limitations: an information bottleneck in early convolutional stages that undermines fine-grained target perception, and static skip connections that lack the dynamic adaptability required to discriminate between genuine targets and pseudo-target regions. To address these challenges, we propose SANet, a Selective Attention-based Network built upon the classical U-Net framework and augmented with two novel components: (1) a Dual-path Semantic-aware Module (DSM) that integrates standard convolutions for local spatial detail preservation with pinwheel-shaped convolutions for expanded, direction-sensitive receptive fields, followed by a Convolutional Block Attention Module (CBAM) for fine-grained spatial-channel feature recalibration; and (2) a Selective Attention Fusion Module (SAFM) that replaces conventional static skip connections with a spatially adaptive, learnable weighting mechanism to perform context-aware, cross-scale feature fusion.

2605.00885 2026-05-05 cs.CV

Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion

Yingming Zhang, Wuqi Su, Qing Xiao, Yonggang Yang

Abstract

Existing single image dehazing methods have demonstrated satisfactory performance on homogeneous thin-haze images; however, they often struggle with non-homogeneous hazy images that exhibit spatially varying haze concentrations and abrupt density transitions across different regions. To address this fundamental limitation, we propose a novel multi-branch deep neural network framework, termed Concentration Partitioning and Image Fusion Network (CPIFNet), which decomposes the challenging non-homogeneous dehazing problem into a set of tractable homogeneous sub-problems. Our key insight is that a single non-homogeneous hazy image can be viewed as a composite of multiple local regions, each exhibiting approximately homogeneous haze characteristics. CPIFNet employs a two-stage architecture consisting of an Image Enhancement Network (IENet) stage and an Image Fusion Network (IFNet) stage. In the first stage, multiple IENet branches are independently trained on homogeneous haze datasets of different concentration levels, producing enhancement models that excel at restoring regions matching their respective haze densities. In the second stage, the IFNet intelligently aggregates the advantageous regions from all enhancement outputs through deep feature stacking and merging, yielding a unified high-quality dehazed result. Furthermore, we introduce a comprehensive loss function incorporating reconstruction, perceptual, structural, and color losses to jointly supervise both stages.

2605.00883 2026-05-05 cs.CV cs.AI

Towards High Fidelity Face Swapping: A Comprehensive Survey and New Benchmark

Qi Li, Weining Wang, Shuangjun Du, Bo Peng, Jing Dong, Kun Wang, Zhenan Sun, Ming-Hsuan Yang

Abstract

Face swapping has witnessed significant progress in recent years, largely driven by advances in deep generative models such as GANs and diffusion models. Despite these advances, existing methods remain fragmented across different paradigms, and their evaluation is highly inconsistent due to the lack of standardized datasets and protocols. Moreover, prior surveys primarily focus on broader deepfake generation or detection, leaving face swapping insufficiently studied as a standalone problem. In this paper, we present a comprehensive survey and benchmark for face swapping. We provide a structured review of existing methods, organizing them into five major paradigms and systematically analyzing their design principles, strengths, and limitations. To enable fair and controlled evaluation, we introduce CASIA FaceSwapping, a high-quality benchmark with balanced demographic distributions and explicit attribute variations, and establish standardized protocols to assess the robustness of different face swapping methods. Extensive experiments on representative approaches yield new insights into the performance characteristics and limitations of current techniques. Overall, our work provides a unified perspective and a principled evaluation framework to facilitate the development of more robust and controllable face swapping methods. More results can be found at https://github.com/CASIA-NLPRAI/face-swapping-survey.

2605.00882 2026-05-05 cs.CV

Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography

Zhiyi Niu, Xiaoguang Tu, Bo Zhao, Junzhe Cao, Dan Guo, Zitong Yu

Abstract

Remote Photoplethysmography (rPPG) enables convenient non-contact physiological measurement. Existing Self-Supervised Learning (SSL) methods commonly fall into a correlation trap: they tend to learn the most dominant periodic signals in the data, such as high-energy motion or illumination noise, rather than the faint, true rPPG signal, leading to poor model generalization. To address this, we propose a new SSL paradigm, Physiological Causal Probing (PCP), which treats the latent rPPG signal as the underlying physical source and the resulting pixel chrominance variations as its visual manifestation. Its core idea is to shift from passive correlation learning to active, precise intervention: it intervenes on the video based on a proposed rPPG hypothesis, and verifies whether the post-intervention changes match physical expectations. We propose the Interv-rPPG framework to implement PCP: an rPPG extractor named PhysMambaFormer hypothesizes the rPPG signal, while a Controllable Physiological Signal Editor conducts precise chrominance-domain interventions on videos based on this hypothesis. Interv-rPPG validates the physical realism of the hypothesis through "Falsifiability via Nulling" and "Axiomatic Equivariance". Our editor achieves precise editing of the rPPG signal by intervening in the low-frequency chrominance components of the video. Our method improves both in-domain and cross-domain performance on challenging datasets such as VIPL-HR and MMPD. Furthermore, it surpasses the supervised baseline in complex cross-dataset settings, while remaining competitive on clean datasets where the intervention mechanism may introduce slight residual chrominance noise. Extensive experiments, including diagnostic analysis of nuisance sensitivity, demonstrate that the PCP paradigm effectively resists motion and illumination artifacts.

2605.00880 2026-05-05 cs.CV cs.AI

Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving

Xinyu Zeng, Xiangkun He, Lei Tao, Chen Lv, Hong Cheng

Comments 16 pages, 11 figures

Abstract

Autonomous driving (AD) is evolving towards end-to-end (E2E) frameworks through two primary paradigms: monolithic models exemplified by Vision-Language-Action (VLA), and specialized modular architectures. Despite their divergent designs, both paradigms increasingly rely on Transformer backbones for complex reasoning, potentially causing a shared vulnerability: visually imperceptible perturbations can manipulate E2E AD models into hazardous maneuvers by targeting the Transformer module. Most existing adversarial attack approaches against AD systems operate under white-box or black-box settings; yet, they typically necessitate full model transparency, or suffer from either prohibitive query latency or limited attack transferability. In this paper, we propose Adversarial Flow Matching (AFM), a novel gray-box attack framework that exploits Transformer structural vulnerabilities in E2E AD models. AFM enables efficient one-step generation of adversarial examples via a neural average velocity field. Additionally, the proposed technique yields effective and visually imperceptible attacks by synergistically perturbing the generative latent space and the neural average velocity field. Extensive experiments demonstrate that AFM achieves a superior trade-off between attack effectiveness and imperceptibility: it substantially degrades the performance of both VLA and modular AD agents across various scenarios compared to baselines, while maintaining state-of-the-art visual imperceptibility. Furthermore, adversarial examples generated by AFM exhibit robust cross-model transferability, indicating that AFM closely approximates a black-box attack setting while requiring only the prior knowledge that the target AD model incorporates a Transformer-based module.

2605.00879 2026-05-05 cs.RO cs.SY eess.SY

LiDAR for Rehabilitation: A Comprehensive Survey of Applications, AI Techniques, and Future Directions

Soumia Siyoucef, Najmeddine Dhieb, Hakim Ghazzai, Eleonora Guanziroli, Franco Molteni, Gianluca Setti

Comments This paper is accepted for publication in IEEE Sensors Reviews, April, 2026

English abstract

Rehabilitation aims to help patients with limited mobility regain their physical abilities through targeted movements, exercises, stimulation, and other therapeutic methods. Recent advances in technology have introduced sensor-based systems into rehabilitation and clinical practices, enabling real-time monitoring and providing accurate feedback on movement accuracy. Among these sensors, LiDAR has demonstrated strong potential, offering key advantages over conventional techniques such as camera-based systems, which raise privacy concerns, and wearable sensors, which can be uncomfortable and prone to errors. In this work, we review the applications of LiDAR in rehabilitation, post-injury care, and hospital environments, focusing on studies published between 2019 and 2025. Studies across several areas have been explored: 3D body scanning and gait analysis with standalone LiDAR, LiDAR mounted on robotic systems for rehabilitation, real-time monitoring and environment scanning for safe navigation, and activity and position recognition. We also analyze processing techniques, particularly learning-based approaches, and support the discussion with statistical analysis, highlighting trends, gaps, and future research opportunities. To the best of our knowledge, this is the first comprehensive survey dedicated to LiDAR for rehabilitation applications, providing an overview of current methods, AI-based processing techniques, and open challenges.

2605.00878 2026-05-05 cs.CV

Single Image Defogging Using a Fourth-Order Telegraph PDE Guided by Physical Haze Modeling

Manish Kumar, Rajendra K. Ray

English abstract

In real-world scenarios, image defogging is an inverse problem due to unknown scene depth, atmospheric scattering, and the common absence of ground truth. To address this, we propose a hybrid defogging model that integrates a fourth-order nonlinear PDE with a physical haze formation model. We use the Dark Channel Prior to estimate atmospheric parameters and to generate a guidance image, while the final restoration is performed via a fourth-order PDE-based evolution. A fourth-order telegraph-type PDE is then evolved, incorporating an edge-adaptive diffusion coefficient and a fidelity term weighted by the transmission map. Fourth-order diffusion effectively suppresses haze while preserving structural details, and the hyperbolic formulation improves numerical stability and convergence behavior. We use a relative error norm criterion to determine convergence of the PDE evolution. The proposed method is compared with the Dark Channel Prior, a modified Dark Channel Prior, and variational-based single-image defogging techniques. When ground truth is available, we use MSE and SSIM for quantitative evaluation, whereas no-reference metrics, including FADE, Contrast Restoration Index, Average Gradient, and Entropy, are applied to real-world foggy images. Experimental results demonstrate that the proposed hybrid PDE-based method provides comparable visual quality and maintains structural details.

2605.00876 2026-05-05 cs.LG cs.CV

GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

Duaa Alim, Mogtaba Alim, Liam Chalcroft

English abstract

Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that scores captioning, diagnosis, and localisation from the image alone, without task-specific fine-tuning. Before any tool is used, structured prompting and schema-validated outputs already improve over the published Gemini 2.0 Flash baseline (20.2 to 29.4 mAP@0.3), so framework design is itself an experimental variable. Tool use helps rare pathologies disproportionately: the fraction of cases with IoU > 0.3 rises from 17% to 58% for diagnoses with three or fewer examples versus 25% to 68% for common conditions ($\geq$10 cases), with gains tracking engagement (Gemini 3 Flash: Cohen's d = 0.79, 11.8 tool calls per case; Gemini 2.0 Flash: tools used in 8.2% of cases, no significant benefit). Retrieval ablations additionally reveal a model-dependent trade-off in which gains in diagnosis can coincide with losses in localisation, reinforcing the case for joint evaluation of diagnosis, localisation, and captioning in medical VLMs.
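
The localisation figures above are thresholded on intersection-over-union. As a rough illustration only, the sketch below computes IoU for axis-aligned boxes and the fraction of cases above a threshold. The box format `(x_min, y_min, x_max, y_max)` and the helper names `box_iou` and `fraction_localised` are assumptions made for this example, not details taken from GAZE or NOVA.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The box format is an assumption for this sketch.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap extent along each axis, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def fraction_localised(pairs, threshold=0.3):
    """Fraction of (predicted, reference) box pairs with IoU above threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, ref in pairs if box_iou(pred, ref) > threshold)
    return hits / len(pairs)
```

A statistic such as "fraction of cases with IoU > 0.3" would then correspond to calling `fraction_localised` on one prediction per case; mean average precision additionally depends on confidence ranking, which this sketch does not cover.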

2605.00875 2026-05-05 cs.CV cs.AI

Visual Chart Representations for Cryptocurrency Regime Prediction: A Systematic Deep Learning Study

Dustin M. Haggett

Comments 9 pages, 8 figures, 9 tables. Stevens Institute of Technology course project, Fall 2025

English abstract

Technical traders have long relied on visual analysis of candlestick charts to identify market patterns and predict price movements. While deep learning has achieved remarkable success in image classification, its application to financial chart images remains underexplored. This paper presents a systematic study comparing different visual representations for cryptocurrency regime prediction. We evaluate three image encoding methods (raw candlestick charts, Gramian Angular Fields, and multi-channel GAF), five chart component configurations, four neural network architectures (CNN, ResNet18, EfficientNet-B0, and Vision Transformer), and the impact of ImageNet transfer learning. Through eight controlled experiments on Bitcoin, Ethereum, and S&P 500 data spanning 2018-2024, we identify optimal configurations for visual regime classification. Our results show that a simple 4-layer CNN on raw candlestick charts achieves 0.892 AUC-ROC, outperforming larger pretrained models. Surprisingly, simpler representations (price-only charts, 128x128 resolution) consistently outperform more complex alternatives. We provide interpretability analysis using GradCAM and demonstrate that transfer learning improves performance by 4-16% despite the domain gap between natural images and financial charts.
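
For readers unfamiliar with Gramian Angular Fields, the following is a minimal sketch of the summation variant: the series is rescaled to [-1, 1], each value is mapped to an angle via arccos, and the image entry (i, j) is the cosine of the summed angles. The abstract does not say which GAF variant or rescaling the study uses, so both choices here, along with the function name `gramian_angular_field`, are assumptions.

```python
import math


def gramian_angular_field(series):
    """Summation-type Gramian Angular Field of a 1D series (list of floats).

    Returns an n x n nested list with entries cos(phi_i + phi_j),
    where phi = arccos of the series rescaled to [-1, 1].
    """
    lo, hi = min(series), max(series)
    span = hi - lo
    if span == 0:
        # Constant series: map every point to 0 (an arbitrary convention).
        scaled = [0.0] * len(series)
    else:
        scaled = [2.0 * (x - lo) / span - 1.0 for x in series]
    # Clamp to guard against rounding slightly outside the arccos domain.
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    return [[math.cos(p_i + p_j) for p_j in phi] for p_i in phi]
```

Applied to a window of closing prices, the resulting matrix can be rendered as a single-channel image; a multi-channel version would stack one such matrix per input series.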

2605.00874 2026-05-05 cs.CV cs.AI cs.LG cs.MM

Latent Space Probing for Adult Content Detection in Video Generative Models

Alizishaan Khatri, Chiquita Prabhu

Comments To be published in 2026 56th Annual IEEE International Conference on Dependable Systems and Networks Workshops (DSN-W)

English abstract

The rapid proliferation of AI-powered video generation systems has introduced significant challenges in content moderation, particularly with respect to adult and sexually explicit material. Existing detection methods operate on either prompts or decoded pixel-space outputs, so both approaches are blind to the rich internal representations formed during generation. In this paper, we propose a novel latent space probing framework that intercepts the denoised latent representations produced by the CogVideoX video diffusion model during inference and attaches lightweight classifiers to perform real-time adult content detection. To support this work, we construct a large-scale binary dataset of 11,039 ten-second video clips (5,086 violating, 5,953 non-violating) sourced from adult websites and YouTube, respectively. We introduce two lightweight probing classifier architectures and train and evaluate them on this dataset. Our work demonstrates that latent-space signals encode strong discriminative features for harmful content detection, achieving 97.29% F1 on our held-out test set with an overhead in the 4-6 ms range. Our results suggest that probing the latent space improves both detection performance and cost.

2605.00842 2026-05-05 cs.AI cs.LG

Understanding Emergent Misalignment via Feature Superposition Geometry

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Comments Accepted to ACL2026

English abstract

Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this phenomenon, we propose an account based on the geometry of feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in accordance with their similarity. We give a simple gradient-level derivation of this effect and empirically test it in multiple LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, GPT-OSS 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that they are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice). Finally, we show that a geometry-aware approach, filtering training samples closest to toxic features, reduces misalignment by 34.5%, substantially outperforming random removal and achieving comparable or slightly lower misalignment than LLM-as-a-judge-based filtering. Our study links emergent misalignment to feature superposition, providing a basis for understanding and mitigating this phenomenon.

2605.00841 2026-05-05 cs.AI econ.GN q-fin.EC

AI Agents for Sustainable SMEs: A Green ESG Assessment Framework

Viet Trinh, Tan Nguyen, Minh-Huyen Phan, Quan Luu

English abstract

This study presents a novel, AI-driven framework for assessing Environmental, Social, and Governance (ESG) performance in European small and medium-sized enterprises (SMEs). An initial phase established expert-validated ESG baseline scores from a subset of the Flash Eurobarometer FL549 survey data. In the second phase, a scalable AI agent system, built on the n8n automation platform, applied these baselines to perform automated ESG classification and generate contextual recommendations using large language models (LLMs). The results demonstrate the AI system's high consistency with human-derived outputs, thereby supporting more effective monitoring and intervention strategies aligned with the European Green Deal.

2605.00839 2026-05-05 cs.AI cs.LG

2026 Roadmap on Artificial Intelligence and Machine Learning for Smart Manufacturing

Jay Lee, Hanqi Su, Marco Macchi, Adalberto Polenghi, Wei Wu, Zhiheng Zhao, George Q. Huang, Kiva Allgood, Devendra Jain, Benedikt Gieger, Vibhor Pandhare, Soumyabrata Bhattacharjee, Ram Mohril, Lingbao Kong, Qiyuan Wang, Xinlan Tang, Sungjong Kim, Chan Hee Park, Byeng D. Youn, Guo Dong Goh, Xi Huang, Wai Yee Yeong, Yung C Shin, He Zhang, Zitong Wang, Fei Tao, Jagjit Singh Srai, Satyandra K. Gupta, Byung Gun Joung, Albin John, John W. Sutherland, Sang Won Lee, Olga Fink, Vinay Sharma, Faez Ahmed, Wei Chen, Mark Fuge, Arild Waaler, Martin G. Skjæveland, Dimitris Kyritsis, Wei Chen, Vispi Nevile Karkaria, Yi-Ping Chen, Ying-Kuan Tsai, Joseph Cohen, Xun Huan, Jing Lin, Liangwei Zhang, Gregory W. Vogl, Aaron W. Cornelius, Xiaodong Jia, Dai-Yan Ji, Takanobu Minami, Ruoxin Wang

Comments This paper has been accepted for publication in the Journal Machine Learning: Engineering

English abstract

The evolution of artificial intelligence (AI) and machine learning (ML) is reshaping smart manufacturing by providing new capabilities for efficiency, adaptability, and autonomy across industrial value chains. However, the deployment of AI and ML in industrial settings still faces critical challenges, including the complexity of industrial big data, effective data management, integration with heterogeneous sensing and control systems, and the demand for trustworthy, explainable, and reliable operation in high-stakes industrial environments. In this roadmap, we present a comprehensive perspective on the foundations, applications, and emerging directions of AI and ML in smart manufacturing. It is structured in three parts. The first highlights the foundations and trends that frame the evolution of AI in smart manufacturing. The second focuses on key topics where AI is already enabling advances, including industrial big data analytics, advanced sensing and perception, autonomous systems, additive and laser-based manufacturing, digital twins, robotics, supply chain and logistics optimization, and sustainable manufacturing. The third section explores non-traditional ML approaches that are opening new frontiers, such as physics-informed AI, generative AI, semantic AI, advanced digital twins, explainable AI, RAMS, data-centric metrology, LLMs, and foundation models for highly connected and complex manufacturing systems. By identifying both opportunities and remaining barriers across these areas, this roadmap outlines the advances needed in methods, integration strategies, and industrial adoption. We hope this roadmap will serve as a guide for researchers, engineers, and practitioners to accelerate innovation, align academic and industrial priorities, and ensure that AI-driven smart manufacturing delivers reliable, sustainable, and scalable impact for the future of manufacturing ecosystems.

2605.00837 2026-05-05 cs.LG

Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

Hao Xiao

Comments 14 pages, 7 figures, code at https://github.com/xiao98/Fast-Sinkhorn-CUDA

English abstract

Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log-domain, enabling robust computation for regularization parameters as small as epsilon = 10^{-4}, where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves a 12x speedup over the widely used POT library and a 5.9x speedup over GPU-accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image color transfer, 3D point cloud matching, and convergence analysis, demonstrating that native CUDA kernels with careful numerical treatment provide a practical and efficient foundation for large-scale optimal transport computation.
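
To make the log-domain idea concrete, here is a small pure-Python reference sketch of log-domain Sinkhorn iterations on a dense cost matrix. It is not the FastSinkhorn CUDA code and says nothing about its kernels; the function names `logsumexp` and `sinkhorn_log`, the default parameters, and the fixed iteration count are all assumptions chosen for illustration. The dual potentials `f` and `g` are updated with a max-shifted log-sum-exp, which is what keeps the computation finite when epsilon is small.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_log(cost, a, b, eps=1e-2, n_iters=200):
    """Log-domain Sinkhorn for marginals a (rows) and b (columns).

    cost is an n x m nested list; a and b must be strictly positive.
    Returns the transport plan P with P[i][j] = exp((f_i + g_j - C_ij) / eps).
    """
    n, m = len(a), len(b)
    log_a = [math.log(x) for x in a]
    log_b = [math.log(x) for x in b]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(n_iters):
        # Row update: enforce sum_j P[i][j] = a[i].
        f = [
            eps * (log_a[i] - logsumexp([(g[j] - cost[i][j]) / eps for j in range(m)]))
            for i in range(n)
        ]
        # Column update: enforce sum_i P[i][j] = b[j].
        g = [
            eps * (log_b[j] - logsumexp([(f[i] - cost[i][j]) / eps for i in range(n)]))
            for j in range(m)
        ]
    return [
        [math.exp((f[i] + g[j] - cost[i][j]) / eps) for j in range(m)]
        for i in range(n)
    ]
```

In this formulation the Gibbs kernel exp(-C/eps) is never materialised on its own, so entries that would underflow to zero in the standard domain are only exponentiated after the potentials have been added back.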

2605.00836 2026-05-05 cs.LG

From Euler to Dormand-Prince: ODE Solvers for Flow Matching Generative Models

Hao Xiao

Comments 14 pages, 10 figures, code at github.com/xiao98/ODE-Flow-Experiments

English abstract

Sampling from Flow Matching generative models requires solving an ordinary differential equation (ODE) whose computational cost is dominated by neural network forward passes. We derive four classical ODE solvers -- Euler, Explicit Midpoint, Classical Runge-Kutta (RK4), and Dormand-Prince 5(4) -- from first principles via Taylor expansion, implement them from scratch in PyTorch, and systematically benchmark their efficiency on Conditional Flow Matching tasks ranging from 2D toy distributions to MNIST digits. On the quantitative side, we use sliced Wasserstein distance to construct NFE-quality Pareto frontiers, finding that RK4 at 80 function evaluations achieves sample quality comparable to Euler at 200. Beyond reproducing known convergence rates, we report two empirical observations: (1) the Jacobian eigenvalue spectrum of the learned velocity field stiffens sharply near t=1, explaining why the adaptive Dormand-Prince solver automatically concentrates its step budget at the end of the trajectory; (2) the quality gap between low-order and high-order solvers widens for undertrained and smaller models, indicating that solver choice matters most when the model is imperfect. Code and all experiment scripts are publicly available.
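
The trade-off between solver order and number of function evaluations (NFE) can be seen on a toy problem. The sketch below is not the paper's PyTorch code: it applies Euler and classical RK4 steps to the scalar ODE dx/dt = -x, which stands in for a learned velocity field, and compares both at the same budget of 8 evaluations. The names `euler_step`, `rk4_step`, `integrate`, and `velocity` are assumptions made for this example.

```python
import math


def euler_step(v, x, t, h):
    """One explicit Euler step: 1 velocity evaluation."""
    return x + h * v(x, t)


def rk4_step(v, x, t, h):
    """One classical Runge-Kutta step: 4 velocity evaluations."""
    k1 = v(x, t)
    k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = v(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, v, x0, n_steps, t0=0.0, t1=1.0):
    """Integrate dx/dt = v(x, t) from t0 to t1 with n_steps uniform steps."""
    h = (t1 - t0) / n_steps
    x = x0
    for i in range(n_steps):
        x = step(v, x, t0 + i * h, h)
    return x


def velocity(x, t):
    """Toy velocity field with known solution x(t) = x0 * exp(-t)."""
    return -x


exact = math.exp(-1.0)
# Same budget of 8 function evaluations for both solvers.
euler_err = abs(integrate(euler_step, velocity, 1.0, n_steps=8) - exact)
rk4_err = abs(integrate(rk4_step, velocity, 1.0, n_steps=2) - exact)
```

On this smooth toy field the fourth-order method is considerably more accurate at equal NFE; how far that carries over to a learned, possibly stiff velocity field is exactly the empirical question the abstract addresses.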