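arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.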

2604.20357 2026-04-23 cs.CV cs.CL

SignDATA: Data Pipeline for Sign Language Translation

Kuanwei Chen, Tingyi Lin

Comments 7 pages, 1 figure

Details

Abstract

Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically comparable. Code is available at https://github.com/balaboom123/signdata-slt.
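
Illustrative sketch (not taken from the SignDATA repository): one way per-stage checkpointing keyed on config- and manifest-aware hashes could work. The function names `stage_key` and `run_stage`, the in-memory `completed` store, and the example config fields are all hypothetical; the abstract does not specify how the toolkit computes or stores its hashes.

```python
import hashlib
import json


def stage_key(stage_name, config, manifest_rows):
    """Hash the stage name, its config, and the manifest it will process.

    sort_keys=True makes the hash independent of dict key order.
    """
    payload = json.dumps(
        {"stage": stage_name, "config": config, "manifest": manifest_rows},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_stage(stage_name, config, manifest_rows, work_fn, completed):
    """Run work_fn unless an identical (stage, config, manifest) already ran.

    `completed` is a plain dict standing in for an on-disk checkpoint store
    (an assumption made for this sketch). Returns (result, was_cached).
    """
    key = stage_key(stage_name, config, manifest_rows)
    if key in completed:
        return completed[key], True
    result = work_fn(config, manifest_rows)
    completed[key] = result
    return result, False
```

Under this scheme, changing either the extractor settings or the list of clips yields a new key, so only the affected stage is recomputed, while an unchanged rerun is served from the checkpoint.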


2604.20354 2026-04-23 cs.CV

Hallucination Early Detection in Diffusion Models

Federico Betti, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

Comments 21 pages, 6 figures, 4 tables. Published in International Journal of Computer Vision (IJCV)

Details

DOI: 10.1007/s11263-025-02622-0
Journal ref: Int. J. Comput. Vis. 134, 35 (2026)

Abstract

Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is often underestimated. Trying various seeds over multiple iterations can improve results, but this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each containing prompts with up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the user) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.
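
Illustrative sketch (not the HEaD+ implementation): the control flow of "check early, then continue or restart with a new seed" can be written as a small loop. The callables `denoise_step` and `looks_complete` are hypothetical stand-ins for a diffusion sampler step and a trained detector, and the single `check_step` is an assumption made for this sketch.

```python
def generate_with_early_restart(seeds, total_steps, check_step,
                                denoise_step, looks_complete):
    """Try seeds in order; abandon a seed at check_step if the detector rejects it.

    denoise_step(state, step) -> new state; the state starts as the seed itself.
    looks_complete(state) -> bool, a placeholder for a learned detector.
    Returns (seed, final_state, steps_spent), or (None, None, steps_spent)
    when every seed is rejected.
    """
    steps_spent = 0
    for seed in seeds:
        state = seed
        rejected = False
        for step in range(total_steps):
            state = denoise_step(state, step)
            steps_spent += 1
            # Consult the detector once, after check_step denoising steps.
            if step + 1 == check_step and not looks_complete(state):
                rejected = True
                break
        if not rejected:
            return seed, state, steps_spent
    return None, None, steps_spent
```

With 50 total steps and a check after step 10, each rejected seed costs 10 steps instead of 50, which is where the time saving in such a scheme would come from.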


2604.20350 2026-04-23 cs.CV

X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

Gui Wang, Zehao Zhong, YongSong Zhou, Yudong Li, Ende Wu, Wooi Ping Cheah, Rong Qu, Jianfeng Ren, Linlin Shen

Comments Accepted by CVPR 2026

2604.20347 2026-04-23 cs.RO cs.AI

A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

Yuelin Zhang, Qingpeng Ding, Longxiang Tang, Chengyu Fang, Shing Shin Cheng

Comments Accepted by ICRA 2026

2604.20336 2026-04-23 cs.CV cs.GR

Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation

Jiahao Xu, Xiaohan Yuan, Xingchen Wu, Chongyang Xu, Kun Li, Buzhen Huang

Comments CVPR 2026

2604.20328 2026-04-23 cs.CV

Hybrid Latent Reasoning with Decoupled Policy Optimization

Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo, Zheng Wei

Comments Tech report

2604.20319 2026-04-23 cs.CV

SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

Gui Wang, YongSong Zhou, Kaijun Deng, Wooi Ping Cheah, Rong Qu, Jianfeng Ren, Linlin Shen

Comments Accepted by CVPR 2026

2604.20318 2026-04-23 cs.CV cs.MM

UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

Haokun Wen, Xuemeng Song, Haoyu Zhang, Xiangyu Zhao, Weili Guan, Liqiang Nie

2604.20317 2026-04-23 cs.CV

MD-Face: MoE-Enhanced Label-Free Disentangled Representation for Interactive Facial Attribute Editing

Xuan Cui, Yunfei Zhao, Bo Liu, Wei Duan, Xingrong Fan

2604.20313 2026-04-23 cs.LG cs.AI

Formalising the Logit Shift Induced by LoRA: A Technical Note

Xiang Shi, Shuaizhi Cheng, Mingwei Li

Comments 7 pages, technical note

2604.20307 2026-04-23 cs.CV

Improving Facial Emotion Recognition through Dataset Merging and Balanced Training Strategies

Serap Kırbız

2604.20306 2026-04-23 cs.CV cs.AI

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie, Yuting Su

2604.20305 2026-04-23 cs.RO

AdaTracker: Learning Adaptive In-Context Policy for Cross-Embodiment Active Visual Tracking

Kui Wu, Hao Chen, Jinzhu Han, Haijun Liu, Churan Wang, Yizhou Wang, Zhoujun Li, Si Liu, Fangwei Zhong

2604.20295 2026-04-23 cs.RO

ETac: A Lightweight and Efficient Tactile Simulation Framework for Learning Dexterous Manipulation

Zhe Xu, Feiyu Zhao, Xiyan Huang, Chenxi Xiao

2604.20291 2026-04-23 cs.CV

Efficient INT8 Single-Image Super-Resolution via Deployment-Aware Quantization and Teacher-Guided Training

Pham Phuong Nam Nguyen, Nam Tien Le, Thi Kim Trang Vo, Nhu Tinh Anh Nguyen

Comments 10 pages, 4 figures. Accepted at the Mobile AI (MAI) 2026 Workshop at CVPR 2026

2604.20290 2026-04-23 cs.RO

Onboard Wind Estimation for Small UAVs Equipped with Low-Cost Sensors: An Aerodynamic Model-Integrated Filtering Approach

Bingchen Cheng, Tielin Ma, Jingcheng Fu, Lulu Tao, Tianhui Guo

2604.20289 2026-04-23 cs.CV

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

Yixiao Zeng, Jianlei Zheng, Chaoda Zheng, Shijia Chen, Mingdian Liu, Tongping Liu, Tengwei Luo, Yu Zhang, Boyang Wang, Linkun Xu, Siyuan Lu, Bo Tian, Xianming Liu

Comments Technical Report

2604.20288 2026-04-23 cs.LG

Generative Augmentation of Imbalanced Flight Records for Flight Diversion Prediction: A Multi-objective Optimisation Framework

Karim Aly, Alexei Sharpanskykh, Jacco Hoekstra

Comments 12 pages, 18 figures, 21 files, paper under review

2604.20286 2026-04-23 cs.CV cs.AI

MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

Md Maklachur Rahman, Soon Ki Jung, Tracy Hammond

Comments Accepted at CVPR 2026 Main

2604.20283 2026-04-23 cs.CL

Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking

Mo Zhou, Jianwei Wang, Kai Wang, Helen Paik, Ying Zhang, Wenjie Zhang

Details

Abstract

Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that the human expert decision-making process relies on multi-perspective judgment, in this work, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information through graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of an LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.
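
Illustrative sketch (not from the MSR-MEL code): the abstract mentions lexical evidence based on a string overlap ratio without defining it. The example below assumes a token-level Jaccard ratio between a mention and a candidate entity name; the functions `token_overlap_ratio` and `rank_candidates` are hypothetical names introduced only for illustration.

```python
def token_overlap_ratio(mention, entity_name):
    """Jaccard overlap of lowercase whitespace tokens (an assumed definition)."""
    a = set(mention.lower().split())
    b = set(entity_name.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(mention, candidates):
    """Order candidate entity names by descending lexical overlap with the mention."""
    return sorted(
        candidates,
        key=lambda name: token_overlap_ratio(mention, name),
        reverse=True,
    )
```

A score like this is cheap to compute and needs no supervision, which is presumably why it can serve as one evidence source alongside the multimodal and graph-based signals.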


2604.20276 2026-04-23 cs.LG stat.ML

Rethinking Intrinsic Dimension Estimation in Neural Representations

Rickmer Schulte, David Rügamer

Comments Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

2604.20273 2026-04-23 cs.AI cs.CL

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

Jan-Philipp Schmidt

Comments 19 pages, 4 figures, 4 tables

2604.20268 2026-04-23 cs.CV

Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization

Zhaochen Li, Xinghao Yan, Runni Zhou, Xiaoyang Li, Chenjie Zhu, Gege Wang, Yu Shi, Lixin Zhang, Rongrong Fu, Liehao Yan, Yuan Chai

2604.20267 2026-04-23 cs.SD cs.AI

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

Tong Zhao, Chenghao Zhang, Yutao Zhu, Zhicheng Dou

2604.20261 2026-04-23 cs.AI

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

Fengxian Dong, Zhi Zheng, Xiao Han, Wei Chen, Jingqing Ruan, Tong Xu, Yong Chen, Enhong Chen

Comments 16 pages (including appendix), 4 main figures, 15 tables. Accepted to ACL 2026

2604.20259 2026-04-23 cs.LG

Causal-Transformer with Adaptive Mutation-Locking for Early Prediction of Acute Kidney Injury

Weizhi Nie, Haolin Chen

2604.20258 2026-04-23 cs.CV

Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

Jingxuan He, Xiyu Wang, Mengyu Zheng, Xiangyu Zeng, Yunke Wang, Chang Xu

2604.20256 2026-04-23 cs.CL cs.LG

RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

Wei Han, David Martinez, Anna Khanina, Lawrence Cavedon, Karin Verspoor

Comments Accepted at ACL 2026 Findings

2604.20255 2026-04-23 cs.LG cs.AI

uLEAD-TabPFN: Uncertainty-aware Dependency-based Anomaly Detection with TabPFN

Sha Lu, Jixue Liu, Stefan Peters, Thuc Duy Le, Craig Xie, Lin Liu, Jiuyong Li

2604.20254 2026-04-23 cs.AI cs.LG

Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

Wengyu Zhang, Xiao-Yong Wei, Qing Li