arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1864
专题追踪
2512.21948 2026-04-01 cs.CV

Interpretable Machine Learning-Derived Spectral Indices for Vegetation Monitoring

Ali Lotfi, Adam Carter, Thuan Ha, Mohammad Meysami, Kwabena Nketia, Steve Shirtliffe

Comments 20 pages, 5 figures, 6 tables

详情
英文摘要

Spectral indices such as NDVI have driven vegetation monitoring for decades, yet their design remains largely manual and ad hoc. Their usefulness stems not only from their empirical performance, but also from algebraic forms that remain compact and biologically interpretable. However, the space of possible algebraic expressions relating spectral bands is effectively infinite, making systematic search impractical without structural constraints. We introduce the Spectral Feature Polynomial (SFP) framework, a general pipeline that automatically discovers compact, interpretable spectral indices from labeled multispectral imagery. SFP constructs a library of ratio-based spectral features that inherit illumination invariance by construction. It then applies cross-validated feature selection and continuous coefficient optimization to produce a single closed-form equation per task, transparent to domain experts and deployable on any remote sensing platform without requiring standardization statistics. We validate the framework on two agricultural applications. For Kochia (Bassia scoparia) detection in Sentinel-2 imagery near Lucky Lake of Saskatchewan over three growing seasons, the same two-term equation emerged in 44 of 46 independent cross-validation folds, achieving 98.6% mean accuracy, more than 4 percentage points above the best established index under year-held-out evaluation. For wheat plant classification from UAV multispectral imagery, stage-specific indices achieved 99.5%, 97.2%, and 93.5% across three growth stages, compared to 78% or below for the best established index at late season when NIR-based contrasts lose discriminatory power as wheat senesces. In both applications, SFP yielded a single transparent equation that generalized across held-out regions and outperformed established indices.

2512.17320 2026-04-01 cs.CV

EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories

Lu Wei, Yuta Nakashima, Noa Garcia

Comments Accepted by CVPR 2026

详情
英文摘要

The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 13 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-target concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model. Code and prompts are available at https://github.com/lobsterlulu/EMMA.

2512.17257 2026-04-01 cs.LG

Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods

Iason Kyriakopoulos, Yannis Theodoridis

Comments 19 pages, 2 figures, 5 tables

详情
英文摘要

With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.

2512.15987 2026-04-01 cs.LG cs.AI cs.DS stat.ML

Provably Extracting the Features from a General Superposition

Allen Liu

详情
英文摘要

It is widely believed that complex machine learning models generally encode features through linear representations. This is the foundational hypothesis behind a vast body of work on interpretability. A key challenge toward extracting interpretable features, however, is that they exist in superposition. In this work, we study the question of extracting features in superposition from a learning theoretic perspective. We start with the following fundamental setting: we are given query access to a function \[ f(x)=\sum_{i=1}^n σ_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $σ_i:\R\to\R$ is an arbitrary response function and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the \emph{overcomplete regime}, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results. We allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and allowing for general response functions $σ_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.

2512.11423 2026-04-01 cs.CV

JoyStreamer-Flash: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He

详情
英文摘要

Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyStreamer-Flash, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

2512.10938 2026-04-01 cs.LG cs.AI cs.CL cs.CV

Stronger Normalization-Free Transformers

Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Comments Published in CVPR 2026

详情
英文摘要

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(αx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including visual recognition and generation, speech representation, and DNA sequence modeling. Our analysis also suggests that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.

2512.10805 2026-04-01 cs.LG cs.CV

Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla, Kowshik Thopalli

Comments CVPR 2026

详情
英文摘要

Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learned features to be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics for a systematic analysis of LVLM SAEs. This uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) user-desired concepts are often absent in the SAE, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks.

2512.10780 2026-04-01 cs.CL cs.LG

Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Romanized Scripts in a Real World Setting

Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi

详情
英文摘要

Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. Speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely quantifies or evaluates this orthographic variation in real world applications. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real world dataset of user-generated health queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with gap reaching up to 24 points across languages and models. We propose and evaluate an Uncertainty-based Selective Routing method to close this script gap. At our partner maternal health organization alone, this gap could cause nearly 2 million excess errors in triage. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.

2512.10334 2026-04-01 cs.CV

A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images

Yi Liu, Yichi Zhang

Comments Accepted at the International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA 2026)

详情
英文摘要

Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, there is a big challenge in acquiring high quality pixel-level annotated dataset for filamentous structures, as the dense distribution and geometric properties of filaments making manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structure similarity when generating synthetic images. Our experiments have demonstrated the effectiveness of our approach and outperformed existing model trained without synthetic data.

2512.05597 2026-04-01 cs.CV

Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction

Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers

Comments Accepted to CVPR 2026

详情
英文摘要

Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation and 3D object detection, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene understanding. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on synthetic and real-world benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only $\sim7.5\%$ additional parameters.

2512.05513 2026-04-01 cs.CV

Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

详情
英文摘要

Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K high-quality human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounded reasoning through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (e.g., Qwen, VideoR1, Gemini, and GPT-4o) reveal that existing models struggle to "show what they know" and vice versa. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We have released the dataset at https://github.com/LUNAProject22/Know-Show, and the code will be released in the same repository.

2512.05114 2026-04-01 cs.LG cs.CV eess.IV

Deep infant brain segmentation from multi-contrast MRI

Malte Hoffmann, Lilla Zöllei, Adrian V. Dalca

Comments 8 pages, 8 figures, 1 table, website at https://w3id.org/babyseg, presented at the 2025 IEEE Asilomar Conference on Signals, Systems, and Computers

Journal ref Asilomar Conf Signals Syst Comput, 2025, 974-981

详情
英文摘要

Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.

2512.03736 2026-04-01 cs.RO cs.LG cs.SY eess.SY

Crossing the Sim2Real Gap Between Simulation and Ground Testing to Space Deployment of Autonomous Free-flyer Control

Kenneth Stewart, Samantha Chapin, Roxana Leontie, Carl Glen Henshaw

Comments published at iSpaRo 2025

Journal ref 2025 International Conference on Space Robotics (iSpaRo)

详情
英文摘要

Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA's Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee's standard attitude and translation control, enabling it to navigate in microgravity. Our results validate a novel training pipeline that bridges the simulation-to-reality (Sim2Real) gap, utilizing a GPU-accelerated, scientific-grade simulation environment for efficient Monte Carlo RL training. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements.

2512.03729 2026-04-01 cs.RO cs.LG cs.SY eess.SY

Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) International Space Station Astrobee Testing

Samantha Chapin, Kenneth Stewart, Roxana Leontie, Carl Glen Henshaw

Comments iSpaRo 2025, Best Paper Award in Orbital Robotics

Journal ref 2025 International Conference on Space Robotics (iSpaRo)

详情
英文摘要

The US Naval Research Laboratory's (NRL's) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27th 2025 the APIARY team conducted the first ever, to our knowledge, RL control of a free-flyer in space using the NASA Astrobee robot on-board the International Space Station (ISS). A robust 6-degrees of freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.

2512.03590 2026-04-01 cs.CV

Beyond Boundary Frames: Context-Centric Video Interpolation with Audio-Visual Semantics

Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han

详情
英文摘要

Video frame interpolation has long been challenged by limited controllability and interactivity, especially in scenarios involving fast, highly non-linear, and fine-grained motion. Although recent interactive interpolation methods have made progress, they remain largely boundary-centric and ignore auxiliary contextual signals beyond the start and end frames, leading to outputs that deviate from user-intended objectives. To address this issue, we reformulate VFI from a boundary-centric task into a context-centric generation problem. Based on this, we propose BBF (Beyond Boundary Frames), a context-centric video frame interpolation framework with decoupled multimodal conditioning, which jointly exploits endpoint-adjacent visual context, text semantics, and audio-correlated temporal dynamics. To balance endpoint consistency with context-dependent temporal evolution, BBF further introduces a multi-stream context integration mechanism, consisting of endpoint-constraint integration, evolution-prior integration, and temporal-context integration. In addition, BBF adopts a progressive training strategy to stabilize multimodal learning and improve controllable interpolation. Extensive experiments show that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized generation tasks, establishing a unified framework for video frame interpolation under coordinated multimodal conditioning. The code, the model, and the interface will be released to facilitate further research.

2512.02902 2026-04-01 cs.RO cs.AI cs.LG

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang

详情
英文摘要

Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.

2512.01946 2026-04-01 cs.RO cs.CV

Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation

Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid

Comments Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/. The paper contains 8 pages, 7 figures, 7 tables

详情
英文摘要

Robust robotic manipulation requires reliable failure detection and recovery. Although recent Vision-Language Models (VLMs) show promise in robot failure detection, their generalization is severely limited by the scarcity and narrow coverage of failure data. To address this bottleneck, we propose an automatic framework for generating diverse robotic planning and execution failures across both simulated and real-world environments. Our approach perturbs successful manipulation trajectories to synthesize failures that reflect realistic failure distributions, and leverages VLMs to produce structured step-by-step reasoning traces. This yields FailCoT, a large-scale failure reasoning dataset built upon the RLBench simulator and the BridgeDataV2 real-robot dataset. Using FailCoT, we train Guardian, a multi-view reasoning VLM for unified planning and execution verification. Guardian achieves state-of-the-art performance on three unseen real-world benchmarks: RoboFail, RoboVQA, and our newly introduced UR5-Fail. When integrated with a state-of-the-art LLM-based manipulation policy, it consistently boosts task success rates in both simulation and real-world deployment. These results demonstrate that scaling high-quality failure reasoning data is critical for improving generalization in robotic failure detection. Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/.

2512.00818 2026-04-01 cs.AI cs.CV

Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning

Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, Jingjing Liu, Kai Wu, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, Xiaobin Hu, Hongwei Bran Li

详情
英文摘要

MLLMs MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR distinguishes from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.

2511.22715 2026-04-01 cs.CV cs.AI cs.CL cs.MM

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Comments CVPR 2026 - Project page: https://aimagelab.github.io/ReAG/

详情
英文摘要

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.

2511.22302 2026-04-01 cs.AI cs.DC cs.PF

When AI Bends Metal: AI-Assisted Optimization of Design Parameters in Sheet Metal Forming

Ahmad Tarraf, Koutaiba Kassem-Manthey, Seyed Ali Mohammadi, Philipp Martin, Lukas Moj, Semih Burak, Enju Park, Christian Terboven, Felix Wolf

Comments 20 pages

详情
英文摘要

Numerical simulations have revolutionized the industrial design process by reducing prototyping costs, design iterations, and enabling product engineers to explore the design space more efficiently. However, the growing scale of simulations demands substantial expert knowledge, computational resources, and time. A key challenge is identifying input parameters that yield optimal results, as iterative simulations are costly and can have a large environmental impact. This paper presents an AI-assisted workflow that reduces expert involvement in parameter optimization through the use of Bayesian optimization. Furthermore, we present an active learning variant of the approach, assisting the expert if desired. A deep learning model provides an initial parameter estimate, from which the optimization cycle iteratively refines the design until a termination condition (e.g.,energy budget or iteration limit) is met. We demonstrate our approach, based on a sheet metal forming process, and show how it enables us to accelerate the exploration of the design space while reducing the need for expert involvement.

2511.16625 2026-04-01 cs.AI

MedBayes-Lite: Bayesian Uncertainty Quantification for Safe Clinical Decision Support

Elias Hossain, Md Mehedi Hasan Nipu, Maleeha Sheikh, Rajib Rana, Subash Neupane, Niloofar Yousefi

详情
英文摘要

We propose MedBayes-Lite, a lightweight Bayesian enhancement for transformer-based clinical language models that improves reliability through uncertainty-aware prediction. The framework operates without retraining, architectural modification, or additional trainable parameters, and integrates three components: Bayesian Embedding Calibration via Monte Carlo dropout, Uncertainty-Weighted Attention for reliability-aware token aggregation, and Confidence-Guided Decision Shaping for abstention under uncertainty. Across MedQA, PubMedQA, and MIMIC-III, MedBayes-Lite improves calibration and trustworthiness, reducing overconfidence by 32--48\%. In simulated clinical settings, it further supports safer decision-making by flagging uncertain predictions for human review, particularly under distribution shift. For closed API models, the framework remains applicable through sampling-based predictive uncertainty and confidence-guided abstention, while full embedding- and attention-level uncertainty propagation is evaluated on open-weight transformer models.

2511.14565 2026-04-01 cs.RO cs.AI

Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language

Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu

Comments Accepted to ICRA 2026

详情
英文摘要

Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language. Project page: https://MIT-CLEAR-Lab.github.io/Masked-IRL and Code: https://github.com/MIT-CLEAR-Lab/Masked-IRL

2511.12882 2026-04-01 cs.RO cs.AI

Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos

Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, Yi Xu

Comments 12 pages, 5 figures

详情
英文摘要

Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi-view framework that compensates for spatial information loss and ensures high-consistency with physical world. MTV-World forecasts future frames based on multi-view trajectory videos as input and conditioning on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto-evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.

2511.10788 2026-04-01 cs.AI cs.CL

From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models

Chao Wu, Baoheng Li, Mingchen Gao, Yu Tian, Zhenyi Wang

详情
英文摘要

Recent advances in large language models (LLMs) have made reasoning a central benchmark for evaluating intelligence. While prior surveys focus on efficiency by examining how to shorten reasoning chains or reduce computation, this view overlooks a fundamental challenge: current LLMs apply uniform reasoning strategies regardless of task complexity, generating long traces for trivial problems while failing to extend reasoning for difficult tasks. This survey reframes reasoning through the lens of {adaptivity}: the capability to allocate reasoning effort based on input characteristics such as difficulty and uncertainty. We make three contributions. First, we formalize deductive, inductive, and abductive reasoning within the LLM context, connecting these classical cognitive paradigms with their algorithmic realizations. Second, we formalize adaptive reasoning as a control-augmented policy optimization problem balancing task performance with computational cost, distinguishing learned policies from inference-time control mechanisms. Third, we propose a systematic taxonomy organizing existing methods into training-based approaches that internalize adaptivity through reinforcement learning, supervised fine-tuning, and learned controllers, and training-free approaches that achieve adaptivity through prompt conditioning, feedback-driven halting, and modular composition. This framework clarifies how different mechanisms realize adaptive reasoning in practice and enables systematic comparison across diverse strategies. We conclude by identifying open challenges in self-evaluation, meta-reasoning, and human-aligned reasoning control.

2511.09232 2026-04-01 cs.CL cs.SD

POTSA: A Cross-Lingual Speech Alignment Framework for Speech-to-Text Translation

Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Yizhou Peng, Jin Li, Yuheng Lu, Yu Jiang, Nyima Tashi, Longbiao Wang, Jianwu Dang

详情
英文摘要

Speech Large Language Models have achieved breakthroughs in multilingual speech-to-text translation. However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport, designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations. Second, we impose token-level OT constraints on a Q-Former using parallel pairs to establish fine-grained representation consistency. Then, we apply a layer scheduling strategy to focus OT constraints on semantically beneficial layers. Experiments on FLEURS show our method achieves SOTA performance, with +1.29 BLEU over five common languages and +2.93 BLEU on zero-shot languages, using only 10 hours of parallel speech per language.

2511.07204 2026-04-01 cs.AI cs.CY cs.MA

Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations

Giacomo Fidone, Lucia Passaro, Riccardo Guidotti

Comments Accepted for publication at AAAI Conference on Artificial Intelligence 2026

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence 2026

详情
英文摘要

Online Social Networks (OSNs) widely adopt content moderation to mitigate the spread of abusive and toxic discourse. Nonetheless, the real effectiveness of moderation interventions remains unclear due to the high cost of data collection and limited experimental control. The latest developments in Natural Language Processing pave the way for a new evaluation approach. Large Language Models (LLMs) can be successfully leveraged to enhance Agent-Based Modeling and simulate human-like social behavior with unprecedented degree of believability. Yet, existing tools do not support simulation-based evaluation of moderation strategies. We fill this gap by designing a LLM-powered simulator of OSN conversations enabling a parallel, counterfactual simulation where toxic behavior is influenced by moderation interventions, keeping all else equal. We conduct extensive experiments, unveiling the psychological realism of OSN agents, the emergence of social contagion phenomena and the superior effectiveness of personalized moderation strategies.

2511.06686 2026-04-01 cs.LG

Balancing Multi-modal Sensor Learning via Multi-objective Optimization

Heshan Fernando, Quan Xiao, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen

详情
英文摘要

Learning-enabled control systems increasingly rely on multiple sensing modalities (e.g., vision, audio, language, etc.) for perception and decision support. A key challenge is that multi-modal sensor training dynamics are often imbalanced: fast-to-learn sensing channels dominate optimization, while slower channels remain underutilized, degrading reliability under sensing perturbations. Existing balancing strategies are largely heuristic and can require computationally intensive subroutines. In this paper, we reformulate multi-modal sensor learning as a multi-objective optimization (MOO) problem that explicitly prioritizes the worst-performing modality while retaining the nominal multi-modal sensor fusion objective. We then propose a simple gradient-based method, MIMO (multi-modal sensor learning via MOO), for the resulting formulation. We provide convergence guarantees and evaluate the method on standard multi-modal benchmarks. Results show improved balanced performance over state-of-the-art balanced multi-modal learning and MOO baselines, together with up to ~20x reduction in subroutine computation time, highlighting the suitability of MIMO for resource-constrained control pipelines.

2511.06458 2026-04-01 cs.SD cs.AI cs.LG eess.AS

EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response

Chenpei Huang, Lingfeng Yao, Kyu In Lee, Lan Emily Zhang, Xun Chen, Miao Pan

详情
英文摘要

Acoustic Environment Matching (AEM) is the task of transferring clean audio into a target acoustic environment, enabling engaging applications such as audio dubbing and auditory immersive virtual reality (VR). Recovering similar room impulse response (RIR) directly from reverberant speech offers more accessible and flexible AEM solution. However, this capability also introduces vulnerabilities of arbitrary ``relocation" if misused by malicious user, such as facilitating advanced voice spoofing attacks or undermining the authenticity of recorded evidence. To address this issue, we propose EchoMark, the first deep learning-based AEM framework that generates perceptually similar RIRs with embedded watermark. Our design tackle the challenges posed by variable RIR characteristics, such as different durations and energy decays, by operating in the latent domain. By jointly optimizing the model with a perceptual loss for RIR reconstruction and a loss for watermark detection, EchoMark achieves both high-quality environment transfer and reliable watermark recovery. Experiments on diverse datasets validate that EchoMark achieves room acoustic parameter matching performance comparable to FiNS, the state-of-the-art RIR estimator. Furthermore, a high Mean Opinion Score (MOS) of 4.22 out of 5, watermark detection accuracy exceeding 99\%, and bit error rates (BER) below 0.3\% collectively demonstrate the effectiveness of EchoMark in preserving perceptual quality while ensuring reliable watermark embedding.

2511.02015 2026-04-01 cs.RO

Stein-based Optimization of Sampling Distributions in Model Predictive Path Integral Control

Jace Aldrich, Odest Chadwicke Jenkins

Comments 8 pages, 6 figures

详情
英文摘要

This paper introduces a method for Model Predictive Path Integral (MPPI) control that optimizes sample generation towards an optimal trajectory through Stein Variational Gradient Descent (SVGD). MPPI relies upon predictive rollout of trajectories sampled from a distribution of possible actions. Traditionally, these action distributions are assumed to be unimodal and represented as Gaussian. The result can lead suboptimal rollout predictions due to sample deprivation and, in the case of differentiable simulation, sensitivity to noise in the cost gradients. Through introducing SVGD updates in between MPPI environment steps, we present Stein-Optimized Path-Integral Inference (SOPPI), an MPPI/SVGD algorithm that can dynamically update noise distributions at runtime to better capture action sampling distributions without an excessive increase in computational requirements. We demonstrate the efficacy of SOPPI through experiments on a planar cart-pole, 7-DOF robot arm, and a planar bipedal walker. These results indicate improved system performance compared to state-of-the-art MPPI algorithms across a range of hyper-parameters and demonstrate feasibility at lower particle counts.

2511.01250 2026-04-01 cs.CV

Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop

YoungJae Cheong, Jhonghyun An

Comments Accepted by ICRA 2026

详情
英文摘要

Adverse weather conditions, such as rain, snow, and fog, severely degrade LiDAR semantic segmentation by introducing refraction, scattering, and point dropouts that compromise geometric integrity. While prior approaches ranging from weather simulation and mixing-based augmentation to domain randomization and regularization enhance robustness, they frequently overlook structural vulnerabilities inherent to object boundaries, corners, and highly sparse regions. To address this limitation, we propose a Light Geometry-Aware Adapter. This module aligns azimuths and applies horizontal circular padding to preserve neighbor continuity across the 0 deg-360 deg wrap-around boundary. Using a local-window K-Nearest Neighbors (KNN) search, it aggregates nearby points and computes lightweight local statistics, compressing them into compact geometry-aware cues. During training, these cues facilitate region-aware regularization, which effectively stabilizes predictions in structurally fragile areas. The proposed adapter is designed to be plug-and-play, complements existing augmentation techniques, and operates exclusively during training, incurring negligible inference overhead. Operating under a rigorous source-only cross-weather paradigm wherein models are trained on SemanticKITTI and evaluated on SemanticSTF without target-domain labels or fine-tuning, our adapter achieves a +3.4 mIoU improvement over strong data-centric augmentation baselines. Furthermore, it demonstrates performance comparable to advanced class-centric regularization methods. These findings highlight that geometry-driven regularization constitutes a critical pathway toward achieving highly robust, all-weather LiDAR segmentation.