arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 3185
2604.09879 2026-04-14 cs.CV cs.CG

Topo-ADV: Generating Topology-Driven Imperceptible Adversarial Point Clouds

Gayathry Chandramana Krishnan Nampoothiry, Raghuram Venkatapuram, Anirban Ghosh, Ayan Dutta

Comments Under review

详情
英文摘要

Deep neural networks for 3D point cloud understanding have achieved remarkable success in object classification and recognition, yet recent work shows that these models remain highly vulnerable to adversarial perturbations. Existing 3D attacks predominantly manipulate geometric properties such as point locations, curvature, or surface structure, implicitly assuming that preserving global shape fidelity preserves semantic content. In this work, we challenge this assumption and introduce the first topology-driven adversarial attack for point cloud deep learning. Our key insight is that the homological structure of a 3D object constitutes a previously unexplored vulnerability surface. We propose Topo-ADV, an end-to-end differentiable framework that incorporates persistent homology as an explicit optimization objective, enabling gradient-based manipulation of topological features during adversarial example generation. By embedding persistence diagrams through differentiable topological representations, our method jointly optimizes (i) a topology divergence loss that alters persistence, (ii) a misclassification objective, and (iii) geometric imperceptibility constraints that preserve visual plausibility. Experiments demonstrate that subtle topology-driven perturbations consistently achieve up to 100% attack success rates on benchmark datasets such as ModelNet40, ShapeNet Part, and ScanObjectNN using PointNet and DGCNN classifiers, while remaining geometrically indistinguishable from the original point clouds, beating state-of-the-art methods on various perceptibility metrics.

2604.09876 2026-04-14 cs.LG cs.AI cs.CV cs.HC

Efficient Personalization of Generative User Interfaces

Yi-Hao Peng, Samarth Das, Jeffrey P. Bigham, Jason Wu

详情
英文摘要

Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.

2604.09874 2026-04-14 cs.CL cs.SI

Simulating Organized Group Behavior: New Framework, Benchmark, and Analysis

Xinkai Zou, Yiming Huang, Zhuohang Wu, Jian Sha, Nan Huang, Longfei Yun, Jingbo Shang, Letian Peng

详情
英文摘要

Simulating how organized groups (e.g., corporations) make decisions (e.g., responding to a competitor's move) is essential for understanding real-world dynamics and could benefit relevant applications (e.g., market prediction). In this paper, we formalize this problem as a concrete research platform for group behavior understanding, providing: (1) a task definition with benchmark and evaluation criteria, (2) a structured analytical framework with a corresponding algorithm, and (3) detailed temporal and cross-group analysis. Specifically, we propose Organized Group Behavior Simulation, a task that models organized groups as collective entities from a practical perspective: given a group facing a particular situation (e.g., AI Boom), predict the decision it would take. To support this task, we present GROVE (GRoup Organizational BehaVior Evaluation), a benchmark covering 44 entities with 8,052 real-world context-decision pairs collected from Wikipedia and TechCrunch across 9 domains, with an end-to-end evaluation protocol assessing consistency, initiative, scope, magnitude, and horizon. Beyond straightforward prompting pipelines, we propose a structured analytical framework that converts collective decision-making events into an interpretable, adaptive, and traceable behavioral model, achieving stronger performance than summarization- and retrieval-based baselines. It further introduces an adapter mechanism for time-aware evolution and group-aware transfer, and traceable evidence nodes grounding each decision rule in originating historical events. Our analysis reveals temporal behavioral drift within individual groups, which the time-aware adapter effectively captures for stronger prediction, and structured cross-group similarity that enables knowledge transfer for data-scarce organizations.

2604.09870 2026-04-14 cs.LG cs.AI

Relational Preference Encoding in Looped Transformer Internal States

Jan Kirin

详情
英文摘要

We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, the best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe, measuring how stably Ouro's own learned value system organizes its representations rather than how well it predicts noisy human annotations. We also document a systematic architecture search that established a genuine 70% ceiling for independent scoring, and show how the 50% argument-swap protocol required to prevent degenerate pairwise solutions deflated pairwise training metrics by about 31 points at peak, creating the false appearance that pairwise and pointwise evaluators shared the same ceiling. Finally, we show that a cosine learning-rate dead zone at epoch 2 accidentally acted as early stopping, preserving the generalization peak before overfitting degraded test accuracy from 95.2% to 62.4% by epoch 5. Cross-epoch flip-test analysis shows that antisymmetry correlation remains stable while strict sign-flip rate mainly tracks scorer bias. We propose the flip test as a mandatory diagnostic for pairwise preference evaluators.

2604.09863 2026-04-14 cs.CV cs.AI

PAS: Estimating the target accuracy before domain adaptation

Raphaella Diniz, Jackson de Faria, Martin Ester

Comments Published as a conference paper at ICLR 2026

详情
英文摘要

The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead. Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.

2604.09862 2026-04-14 cs.CV

FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views

Chaoyi Zhou, Run Wang, Feng Luo, Mert D. Pesé, Zhiwen Fan, Yiqi Zhong, Siyu Huang

Comments CVPR 2026 Findings. Project Page: https://chaoyizh.github.io/ff3r_project/

详情
英文摘要

Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R's superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.

2604.09861 2026-04-14 cs.AI cs.NE

Evolutionary Token-Level Prompt Optimization for Diffusion Models

Domício Pereira Neto, João Correia, Penousal Machado

Comments 17 pages, 3 figures, 2 tables, 6 appendix figures

详情
英文摘要

Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms the baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions, the limitations and prospects of which are discussed.

2604.09855 2026-04-14 cs.AI cs.CL cs.GT econ.GN q-fin.EC

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

Shuze Daniel Liu, Claire Chen, Jiabao Sean Xiao, Lei Lei, Yuheng Zhang, Yisong Yue, David Simchi-Levi

详情
英文摘要

The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.

2604.09854 2026-04-14 cs.CL

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

Peiqi Sui, Yutong Zhu, Tianyi Cheng, Peter West, Richard Jean So, Hoyt Long, Ari Holtzman

Comments 29 pages, 10 figures, 9 tables

详情
英文摘要

LLMs have so far failed both to generate consistently compelling stories and to recognize this failure--on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.

2604.09852 2026-04-14 cs.AI cs.LG

MEMENTO: Teaching LLMs to Manage Their Own Context

Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos

详情
英文摘要

Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B--32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ${\sim}2.5\times$ peak KV cache reduction. We extend vLLM to support our inference method, achieving ${\sim}1.75\times$ throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15\,pp on AIME24.

2604.09850 2026-04-14 cs.CV

Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

Yang Deng, David Mould, Paul L. Rosin, Yu-Kun Lai

详情
英文摘要

Existing text-to-image diffusion models, while excelling at subject synthesis, exhibit a persistent foreground bias that treats the background as a passive and under-optimized byproduct. This imbalance compromises global scene coherence and constrains compositional control. To address the limitation, we propose a training-free framework that restructures diffusion sampling to explicitly account for foreground-background interactions. Our approach consists of two key components. First, Dynamic Spatial Guidance introduces a soft, time step dependent gating mechanism that modulates foreground and background attention during the diffusion process, enabling spatially balanced generation. Second, Multi-Path Pruning performs multi-path latent exploration and dynamically filters candidate trajectories using both internal attention statistics and external semantic alignment signals, retaining trajectories that better satisfy object-background constraints. We further develop a benchmark specifically designed to evaluate object-background compositionality. Extensive evaluations across multiple diffusion backbones demonstrate consistent improvements in background coherence and object-background compositional alignment.

2604.09841 2026-04-14 cs.CV cs.AI

Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

Oliver McLaughlin, Daniel Shubin, Carsten Eickhoff, Ritambhara Singh, William Rudman, Michal Golovanevsky

详情
英文摘要

Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.

2604.09838 2026-04-14 cs.CV

Vector Field Synthesis with Sparse Streamlines Using Diffusion Model

Nguyen K. Phan, Ricardo Morales, Sebastian D. Espriella, Guoning Chen

Comments 5 pages, 4 figures; published at IEEE VIS 2025

Journal ref 2025 IEEE Visualization and Visual Analytics (VIS), pp. 296-300

详情
英文摘要

We present a novel diffusion-based framework for synthesizing 2D vector fields from sparse, coherent inputs (i.e., streamlines) while maintaining physical plausibility. Our method employs a conditional denoising diffusion probabilistic model with classifier-free guidance, enabling progressive reconstruction that preserves both geometric and physical constraints. Experimental results demonstrate our method's ability to synthesize plausible vector fields that adhere to physical laws while maintaining fidelity to sparse input observations, outperforming traditional optimization-based approaches in terms of flexibility and physical consistency.

2604.09835 2026-04-14 cs.CV cs.AI

F3G-Avatar : Face Focused Full-body Gaussian Avatar

Willem Menu, Erkut Akdag, Pedro Quesado, Yasaman Kashefbahrami, Egor Bondarev

Comments CVPRW 3DMV, 10 pages

详情
英文摘要

Existing full-body Gaussian avatar methods primarily optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression details. This challenge arises from limited facial representational capacity that causes difficulties in modeling high-frequency pose-dependent deformations. To address this, we propose F3G-Avatar, a full-body, face-aware avatar synthesis method that reconstructs animatable human representations from multi-view RGB video and regressed pose/shape parameters. Starting from a clothed Momentum Human Rig (MHR) template, front/back positional maps are rendered and decoded into 3D Gaussians through a two-branch architecture: a body branch that captures pose-dependent non-rigid deformations and a face-focused deformation branch that refines head geometry and appearance. The predicted Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual objectives with a face-specific adversarial loss to enhance realism in close-up views. Experiments demonstrate strong rendering quality, with face-view performance reaching PSNR/SSIM/LPIPS of 26.243/0.964/0.084 on the AvatarReX dataset. Ablations further highlight contributions of the MHR template and the face-focused deformation. F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis.

2604.09829 2026-04-14 cs.RO

Perception Is All You Need: A Neuroscience Framework for Low Cost Sensorless Gaze in HRI

Mason Kadem

详情
英文摘要

Gaze-following in child-robot interaction improves attention, recall, and learning, but requires expensive platforms (\$30,000+), sensors, algorithms, and raises privacy concerns. We propose a framework that avoids sensors and computation entirely, instead relying on the human visual system's assumption of convexity to produce perceptual gaze-following between a robot and its viewer. Specifically, we motivate sub-dollar cardboard robot design that directly implements the brain's own gaze computation pipeline in reverse, making the viewer's perceptual system the robot's "actuator", with no sensors, no power, and no privacy concerns. We ground this framework in three converging lines of theoretical and empirical neuroscience evidence. Namely, the distributed face processing network that computes gaze direction via the superior temporal sulcus, the high-precision convexity prior that causes the brain to perceive concave faces as convex, and the predictive processing hierarchy in which top-down face knowledge overrides bottom-up depth signals. These mechanisms explain why a concave eye socket with a painted pupil produces the perception of mutual gaze from any viewing angle. We derive design constraints from perceptual science, present a sub-dollar open-template robot with parameterized interchangeable eye inserts, and identify boundary conditions (developmental, clinical, and geometric) that predict where the framework will succeed and where it will fail. If leveraged, two decades of HRI gaze findings become deliverable at population scale.

2604.09824 2026-04-14 cs.RO cs.CL cs.CV

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

Nastaran Darabi, Amit Ranjan Trivedi

详情
英文摘要

Vision language action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding $g_t$, whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x-4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52), AUPR 0.79, and raises clarification on ambiguous inputs from 0.09 to 0.81 without harming unambiguous success. The verification bottleneck increases mutual information of language-actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.

2604.09819 2026-04-14 cs.CV cs.AI

ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

Lukas Picek, Michal Čermák, Marek Hanzl, Vojtěch Čermák

详情
英文摘要

We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. You can access the ACCIDENT at: https://accidentbench.github.io

2604.09818 2026-04-14 cs.LG cs.CE

Below-ground Fungal Biodiversity Can be Monitored Using Self-Supervised Learning Satellite Features

Robin Young, Michael E. Van Nuland, E. Toby Kiers, Tomáš Větrovský, Petr Kohout, Petr Baldrian, Srinivasan Keshav

详情
英文摘要

Mycorrhizal fungi are vital to terrestrial ecosystem functioning. Yet monitoring their biodiversity at landscape scales is often unfeasible due to time and cost constraints. Current predictions suggest that 90\% of mycorrhizal diversity hotspots remain unprotected, opening questions of how to broadly and effectively map underground fungal communities. Here, we show that self-supervised learning (SSL) applied to satellite imagery can predict below-ground ectomycorrhizal fungal richness across diverse environments. Our models explain over half the variance in species richness across ~12,000 field samples spanning Europe and Asia. SSL-derived features prove to be the single most informative predictor, subsuming the majority of information contained in climate, soil, and land cover datasets. Using this approach, we achieve a 10,000-fold increase in spatial resolution over existing techniques, moving from 1km landscape averages to 10m habitat-scale observations with nearly no systematic bias. As satellite observations are dynamic rather than static, this enables temporal monitoring of below-ground biodiversity at landscape scales for the first time. We analyze multi-year trends in predicted fungal richness across UK National Park woodlands, finding that ancient forests may be losing ectomycorrhizal diversity at disproportionate rates. These results establish SSL satellite features as a scalable tool for extending sparse field observations to continuous, high-resolution biodiversity maps for monitoring the invisible half of terrestrial ecosystems.

2604.09817 2026-04-14 cs.LG

NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity

Weijian Mai, Mu Nan, Yu Zhu, Jiahang Cao, Rui Zhang, Yuqin Dai, Chunfeng Song, Andrew F. Luo, Jiamin Wu

Comments Accepted to CVPR 2026. Project page: https://michaelmaiii.github.io/NeuroFlow-S

详情
英文摘要

Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (1) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (2) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding-decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain-computer interfaces.

2604.09815 2026-04-14 cs.AI

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

Tiantian He, Yihang Chen, Keyue Jiang, Ka Yiu Lee, Kaiwen Zhou, Kun Shao, Shuai Wang

详情
英文摘要

Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should balance these two modalities and how to enable iterative self-improvement across diverse applications. We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes - requiring application-aware mechanism selection. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic \textbf{cross-application analysis} across three desktop applications reveals that the optimal strategy depends on MCP-GUI composition: distillation achieves 77.8\% pass rate on MCP-dominant tasks (+17.8pp), while the experience bank excels on GUI-intensive tasks (+10.0pp).

2604.09814 2026-04-14 cs.CV

RobustMedSAM: Degradation-Resilient Medical Image Segmentation via Robust Foundation Model Adaptation

Jieru Li, Matthew Chen, Micky C. Nnamdi, J. Ben Tamo, Benoit L. Marteau, May D. Wang

Comments 14 pages, 9 figures

详情
英文摘要

Medical image segmentation models built on Segment Anything Model (SAM) achieve strong performance on clean benchmarks, yet their reliability often degrades under realistic image corruptions such as noise, blur, motion artifacts, and modality-specific distortions. Existing approaches address either medical-domain adaptation or corruption robustness, but not both jointly. In SAM, we find that these capabilities are concentrated in complementary modules: the image encoder preserves medical priors, while the mask decoder governs corruption robustness. Motivated by this observation, we propose RobustMedSAM, which adopts module-wise checkpoint fusion by initializing the image encoder from MedSAM and the mask decoder from RobustSAM under a shared ViT-B architecture. We then fine-tune only the mask decoder on 35 medical datasets from MedSegBench, spanning six imaging modalities and 12 corruption types, while freezing the remaining components to preserve pretrained medical representations. We additionally investigate an SVD-based parameter-efficient variant for limited encoder adaptation. Experiments on both in-distribution and out-of-distribution benchmarks show that RobustMedSAM improves degraded-image Dice from 0.613 to 0.719 (+0.106) over SAM, demonstrating that structured fusion of complementary pretrained models is an effective and practical approach for robust medical image segmentation.

2604.09813 2026-04-14 cs.AI

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, Tuo Zhao, Bing Yin

详情
英文摘要

Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.

2604.09803 2026-04-14 cs.SD

MAGE: Modality-Agnostic Music Generation and Editing

Muhammad Usama Saleem, Tejasvi Ravi, Tianyu Xu, Rajeev Nongpiur, Ishan Chatterjee, Mayur Jagdishbhai Patel, Pu Wang

详情
英文摘要

Multimodal music creation requires models that can both generate audio from high-level cues and edit existing mixtures in a targeted manner. Yet most multimodal music systems are built for a single task and a fixed prompting interface, making their conditioning brittle when guidance is ambiguous, temporally misaligned, or partially missing. Common additive fusion or feature concatenation further weakens cross-modal grounding, often causing prompt drift and spurious musical content during generation and editing. We propose MAGE, a modality-agnostic framework that unifies multimodal music generation and mixture-grounded editing within a single continuous latent formulation. At its core, MAGE uses a Controlled Multimodal FluxFormer, a flow-based Transformer that learns controllable latent trajectories for synthesis and editing under any available subset of conditions. To improve grounding, we introduce Audio-Visual Nexus Alignment to select temporally consistent visual evidence for the audio timeline, and a cross-gated modulation mechanism that applies multiplicative control from aligned visual and textual cues to the audio latents, suppressing unsupported components rather than injecting them. Finally, we train with a dynamic modality-masking curriculum that exposes the model to text-only, visual-only, joint multimodal, and mixture-guided settings, enabling robust inference under missing modalities without training separate models. Experiments on the MUSIC benchmark show that MAGE supports effective multimodal-guided music generation and targeted editing, achieving competitive quality while offering a lightweight and flexible interface tailored to practical music workflows.

2604.09800 2026-04-14 cs.RO cs.SY eess.SY

Kinematics of continuum planar grasping

Udit Halder, Nicolas Echeverria Zambrano, Xincheng Li

详情
英文摘要

This paper presents an analytical framework to study the geometry arising when a soft continuum arm grasps a planar object. Both the arm centerline and the object boundary are modeled as smooth curves. The grasping problem is formulated as a kinematic boundary following problem, in which the object boundary acts as the arm's 'shadow curve'. This formulation leads to a set of reduced kinematic equations expressed in terms of relative geometric shape variables, with the arm curvature serving as the control input. An optimal control problem is formulated to determine feasible arm shapes that achieve optimal grasping configurations, and its solution is obtained using Pontryagin's Maximum Principle. Based on the resulting optimal grasp kinematics, a class of continuum grasp quality metrics is proposed using the algebraic properties of the associated continuum grasp map. Feedback control aspects in the dynamic setting are also discussed. The proposed methodology is illustrated through systematic numerical simulations.

2604.09799 2026-04-14 cs.LG cs.AI

Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms

Mainak Kundu, Catherine Chen, Rifatul Islam, Ismail Uysal, Ria Kanjilal

详情
英文摘要

Human activity recognition (HAR) has become a key component of intelligent systems for healthcare monitoring, assistive living, smart environments, and human-computer interaction. Although deep learning has substantially improved HAR performance on multivariate sensor data, the resulting models often remain opaque, limiting trust, reliability, and real-world deployment. Explainable artificial intelligence (XAI) has therefore emerged as a critical direction for making HAR systems more transparent and human-centered. This paper presents a comprehensive review of explainable HAR methods across wearable, ambient, physiological, and multimodal sensing settings. We introduce a unified perspective that separates conceptual dimensions of explainability from algorithmic explanation mechanisms, reducing ambiguities in prior surveys. Building on this distinction, we present a mechanism-centric taxonomy of XAI-HAR methods covering major explanation paradigms. The review examines how these methods address the temporal, multimodal, and semantic complexities of HAR, and summarize their interpretability objectives, explanation targets, and limitations. In addition, we discuss current evaluation practices, highlight key challenges in achieving reliable and deployable XAI-HAR, and outline directions toward trustworthy activity recognition systems that better support human understanding and decision-making.

2604.09793 2026-04-14 cs.CL cs.AI

GIANTS: Generative Insight Anticipation from Scientific Literature

Joy He-Yueya, Anikait Singh, Ge Gao, Michael Y. Li, Sherry Yang, Chelsea Finn, Emma Brunskill, Noah D. Goodman

详情
英文摘要

Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper's core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.

2604.09791 2026-04-14 cs.AI cs.CL cs.LG cs.MA

Pioneer Agent: Continual Improvement of Small Language Models in Production

Dhruv Atreja, Julia White, Nikhil Nayak, Kelton Zhang, Henrijs Princis, George Hurn-Maloney, Ash Lewis, Urchade Zaratiana

Comments 43 pages, 10 figures, 14 tables

详情
英文摘要

Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.

2604.09782 2026-04-14 cs.CV

Biomarker-Based Pretraining for Chagas Disease Screening in Electrocardiograms

Elias Stenhede, Arian Ranjbar

Journal ref Computing in Cardiology 2025; Vol 52

详情
英文摘要

Chagas disease screening via ECGs is limited by scarce and noisy labels in existing datasets. We propose a biomarker-based pretraining approach, where an ECG feature extractor is first trained to predict percentile-binned blood biomarkers from the MIMIC-IV-ECG dataset. The pretrained model is then fine-tuned on Brazilian datasets for Chagas detection. Our 5-model ensemble, developed by the Ahus AIM team, achieved a challenge score of 0.269 on the hidden test set, ranking 5th in Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025. Source code and the model are shared on GitHub: github.com/Ahus-AIM/physionet-challenge-2025

2604.09781 2026-04-14 cs.CV

Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

Sangwon Baik, Gunhee Kim, Mingi Choi, Hanbyul Joo

详情
英文摘要

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.

2604.09780 2026-04-14 cs.AI

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

Xi Wang, Soufiane Hayou, Eric Nalisnick

详情
英文摘要

Mixture of Experts (MoEs) are now ubiquitous in large language models, yet the mechanisms behind their "expert specialization" remain poorly understood. We show that, since MoE routers are linear maps, hidden state similarity is both necessary and sufficient to explain expert usage similarity, and specialization is therefore an emergent property of the representation space, not of the routing architecture itself. We confirm this at both token and sequence level across five pre-trained models. We additionally prove that load-balancing loss suppresses shared hidden state directions to maintain routing diversity, which might provide a theoretical explanation for specialization collapse under less diverse data, e.g. small batch. Despite this clean mechanistic account, we find that specialization patterns in pre-trained MoEs resist human interpretation: expert overlap between different models answering the same question is no higher than between entirely different questions ($\sim$60\%); prompt-level routing does not predict rollout-level routing; and deeper layers exhibit near-identical expert activation across semantically unrelated inputs, especially in reasoning models. We conclude that, while the efficiency perspective of MoEs is well understood, understanding expert specialization is at least as hard as understanding LLM hidden state geometry, a long-standing open problem in the literature.