arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3191
2509.25667 2026-04-14 cs.LG cs.AI cs.HC

EEG-based AI-BCI Wheelchair Advancement: Hybrid Deep Learning with Motor Imagery for Brain Computer Interface

Bipul Thapa, Biplov Paneru, Bishwash Paneru, Khem Narayan Poudyal

详情
英文摘要

This paper presents an Artificial Intelligence (AI) integrated approach to Brain-Computer Interface (BCI)-based wheelchair development, utilizing a motor imagery right-left-hand movement mechanism for control. The system is designed to simulate wheelchair navigation based on motor imagery right and left-hand movements using electroencephalogram (EEG) data. A pre-filtered dataset, obtained from an open-source EEG repository, was segmented into arrays of 19x200 to capture the onset of hand movements. The data was acquired at a sampling frequency of 200Hz. The system integrates a Tkinter-based interface for simulating wheelchair movements, offering users a functional and intuitive control system. We propose a framework that uses Convolutional Neural Network-Transformer Hybrid Model, named CTHM, for motor imagery EEG classification. The model achieves a test accuracy of 91.73% compared with various machine learning baseline models, including XGBoost, EEGNet, and a transformer-based model. The CTHM achieved a mean accuracy of 90% through stratified cross-validation, showcasing the effectiveness of the CNN-Transformer hybrid architecture in BCI applications.

2509.22830 2026-04-14 cs.CL

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee

Comments ICLR 2026

详情
英文摘要

The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

2509.21879 2026-04-14 cs.LG math.OC

Learning Aligned Stability in Neural ODEs Reconciling Accuracy with Robustness

Chaoyang Luo, Yan Zou, Nanjing Huang

详情
英文摘要

Despite Neural Ordinary Differential Equations (Neural ODEs) exhibiting intrinsic robustness, existing methods often impose Lyapunov stability for formal guarantees. However, these methods still face a fundamental accuracy-robustness trade-off, which stems from a core limitation: their applied stability conditions are rigid and inappropriate, creating a mismatch between the model's regions of attraction (RoAs) and its decision boundaries. To resolve this, we propose Zubov-Net, a novel framework that unifies dynamics and decision-making. We first employ learnable Lyapunov functions directly as the multi-class classifier, ensuring the prescribed RoAs (PRoAs, defined by the Lyapunov functions) inherently align with a classification objective. Then, for aligning prescribed and true regions of attraction (PRoAs-RoAs), we establish a Zubov-driven stability region matching mechanism by reformulating Zubov's equation into a differentiable consistency loss. Building on this alignment, we introduce a new paradigm for actively controlling the geometry of RoAs by directly optimizing PRoAs to reconcile accuracy and robustness. Theoretically, we prove that minimizing the tripartite loss guarantees consistency alignment of PRoAs-RoAs, non-overlapping PRoAs, trajectory stability, and a certified robustness margin. Moreover, we establish stochastic convex separability with tighter probability bounds and lower dimensionality requirements to justify the convex design in Lyapunov functions.

2509.12694 2026-04-14 cs.LG cs.IT eess.SP math.IT

Soft Graph Transformer for MIMO Detection

Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang

Comments 5 pages with 3 figures and 2 tables, Accepted by IEEE ICASSP 2026

详情
英文摘要

We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.

2508.09533 2026-04-14 cs.CV cs.AI

COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li

详情
英文摘要

Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32\% mAP$_{50}$ improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.

2508.08574 2026-04-14 cs.RO cs.MA

DeepFleet: Multi-Agent Foundation Models for Mobile Robots

Ameya Agaskar, Sriram Siva, William Pickering, Kyle O'Brien, Charles Kekeh, Alexandre Ormiga Galvao Barbosa, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, Jiaming Di, Isaac Iyengar, Ramya Dharoor, Dino Kirouani, Jimmy Erskine, Tamir Hegazy, Scott Niekum, Usman A. Khan, Federico Pecora, Joseph W. Durham

Comments 27 pages, 10 figures, 2 tables

详情
英文摘要

We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and present our evaluation of the impact of these design choices on prediction task performance. We find that the robot-centric and graph-floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments that show that these two models can make effective use of larger warehouses operation datasets as the models are scaled up.

2508.04436 2026-04-14 cs.RO cs.SY eess.SY

Reliable and Real-Time Highway Trajectory Planning via Hybrid Learning-Optimization Frameworks

Yujia Lu, Chong Wei, Lu Ma, Lounis Adouane

详情
英文摘要

Autonomous highway driving involves high-speed safety risks due to limited reaction time, where rare but dangerous events may lead to severe consequences. This places stringent requirements on trajectory planning in terms of both reliability and computational efficiency. This paper proposes a hybrid highway trajectory planning (H-HTP) framework that integrates learning-based adaptability with optimization-based formal safety guarantees. The key design principle is a deliberate division of labor: a learning module generates a traffic-adaptive velocity profile, while all safety-critical decisions including collision avoidance and kinematic feasibility are delegated to a Mixed-Integer Quadratic Program (MIQP). This design ensures that formal safety constraints are always enforced, regardless of the complexity of multi-vehicle interactions. A linearization strategy for the vehicle geometry substantially reduces the number of integer variables, enabling real-time optimization without sacrificing formal safety guarantees. Experiments on the HighD dataset demonstrate that H-HTP achieves a scenario success rate above 97% with an average planning-cycle time of approximately 54 ms, reliably producing smooth, kinematically feasible, and collision-free trajectories in safety-critical highway scenarios.

2507.13292 2026-04-14 cs.CV

DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation

Ekta Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon

Comments Revised version with minor changes and code release

详情
英文摘要

Accurate age verification can protect underage users from unauthorized access to online platforms and e-commerce sites that provide age-restricted services. However, accurate age estimation can be confounded by several factors, including facial makeup that can induce changes to alter perceived identity and age to fool both humans and machines. In this work, we propose DiffClean which erases makeup traces using a text-guided diffusion model to defend against makeup attacks. DiffClean improves age estimation (minor vs. adult accuracy by 5.8%) and face verification (TMR by 5.1% at FMR=0.01%) compared to images with makeup. Our method is robust across digitally simulated and real-world makeup styles, and outperforms multiple baselines in terms of biometric and perceptual quality. Our codes are available at https://github.com/Ektagavas/DiffClean.

2506.14493 2026-04-14 cs.CL cs.CR

LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops

Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang

Comments Accepted to ICLR 2026. Code is available at: https://github.com/fuhaha824/LingoLoop-Attack

详情
英文摘要

Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments on models like Qwen2.5-VL-3B demonstrate LingoLoop's powerful ability to trap them in generative loops; it consistently drives them to their generation limits and, when those limits are relaxed, can induce outputs with up to 367x more tokens than clean inputs, triggering a commensurate surge in energy consumption. These findings expose significant MLLMs' vulnerabilities, posing challenges for their reliable deployment.

2506.14170 2026-04-14 cs.CV cs.AI cs.ET

Progressive Multimodal Interaction Network for Reliable Quantification of Fish Feeding Intensity in Aquaculture

Shulong Zhang, Mingyuan Yao, Jiayin Zhao, Daoliang Li, Yingyi Chen, Haihua Wang

详情
英文摘要

Accurate quantification of fish feeding intensity is crucial for precision feeding in aquaculture, as it directly affects feed utilization and farming efficiency. Although multimodal fusion has proven to be an effective solution, existing methods often overlook the inconsistencies in responses and decision conflicts between different modalities, thus limiting the reliability of the quantification results. To address this issue, this paper proposes a Progressive Multimodal Interaction Network (PMIN) that integrates image, audio, and water-wave data for fish feeding intensity quantification. Specifically, a unified feature extraction framework is first constructed to map inputs from different modalities into a structurally consistent feature space, thereby reducing representational discrepancies across modalities. Then, an auxiliary-modality reinforcement primary-modality mechanism is designed to facilitate the fusion of cross-modal information, which is achieved through channel aware recalibration and dual-stage attention interaction. Furthermore, a decision fusion strategy based on adaptive evidence reasoning is introduced to jointly model the confidence, reliability, and conflicts of modality-specific outputs, so as to improve the stability and robustness of the final judgment. Experiments are conducted on a multimodal fish feeding intensity dataset containing 7089 samples. The results show that PMIN has an accuracy of 96.76%, while maintaining relatively low parameter count and computational cost, and its overall performance outperforms both homogeneous and heterogeneous comparison models. Ablation studies, comparative experiments, and real-world application results further validate the effectiveness and superiority of the proposed method. It can provide reliable support for automated feeding monitoring and precise feeding decisions in smart aquaculture.

2506.06248 2026-04-14 cs.LG

Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary conditions & equivalence with Hamiltonian Echo Learning

Guillaume Pourcel, Debabrota Basu, Maxence Ernoult, Aditya Gilra

详情
英文摘要

Equilibrium Propagation (EP) is a learning algorithm for training Energy-based Models (EBMs) on static inputs which leverages the variational description of their fixed points. Extending EP to time-varying inputs is a challenging problem, as the variational description must apply to the entire system trajectory rather than just fixed points, and careful consideration of boundary conditions becomes essential. In this work, we present Generalized Lagrangian Equilibrium Propagation (GLEP), which extends the variational formulation of EP to time-varying inputs. We demonstrate that GLEP yields different learning algorithms depending on the boundary conditions of the system, many of which are impractical for implementation. We then show that Hamiltonian Echo Learning (HEL) -- which includes the recently proposed Recurrent HEL (RHEL) and the earlier known Hamiltonian Echo Backpropagation (HEB) algorithms -- can be derived as a special case of GLEP. Notably, HEL is the only instance of GLEP we found that inherits the properties that make EP a desirable alternative to backpropagation for hardware implementations: it operates in a "forward-only" manner (i.e. using the same system for both inference and learning), it scales efficiently (requiring only two or more passes through the system regardless of model size), and enables local learning.

2506.02387 2026-04-14 cs.AI

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang

Comments Published at CVPR 2026 (Oral)

详情
英文摘要

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human experiments, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.

2505.24665 2026-04-14 cs.LG

Learning Geometry and Topology via Multi-Chart Flows

Hanlin Yu, Søren Hauberg, Marcelo Hartmann, Arto Klami, Georgios Arvanitidis

详情
英文摘要

Real world data often lie on low-dimensional Riemannian manifolds embedded in high-dimensional spaces. This motivates learning degenerate normalizing flows that map between the ambient space and a low-dimensional latent space. However, if the manifold has a non-trivial topology, it can never be correctly learned using a single flow. Instead multiple flows must be `glued together'. In this paper, we first propose the general training scheme for learning such a collection of flows, and secondly we develop the first numerical algorithms for computing geodesics on such manifolds. Empirically, we demonstrate that this leads to highly significant improvements in topology estimation.

2505.17012 2026-04-14 cs.CV cs.AI

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie

Comments Accepted by CVPR 2026 (Highlight); Project Page: https://haoningwu3639.github.io/SpatialScore

详情
英文摘要

Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.

2505.15489 2026-04-14 cs.CV cs.CL cs.MM

Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi

Comments ICLR 2026

详情
英文摘要

The impact of multimodal misinformation arises not only from factual inaccuracies but also from the misleading narratives that creators deliberately embed. Interpreting such creator intent is therefore essential for multimodal misinformation detection (MMD) and effective information governance. To this end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, created using an intent-guided simulation framework that models both the desired influence and the execution plan of news creators. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities, and supports three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. We evaluate 14 state-of-the-art vision-language models (VLMs) and find that they struggle with intent reasoning, often relying on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals. To bridge this, our framework systematically synthesizes data that enables models to learn implication-level intent reasoning. Models trained on DeceptionDecoded demonstrate strong transferability to real-world MMD, validating our framework as both a benchmark to diagnose VLM fragility and a data synthesis engine that provides high-quality, intent-focused resources for enhancing robustness in real-world multimodal misinformation governance.

2505.09368 2026-04-14 cs.CV cs.LG

RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

Victor Oei, Jenny Schmalfuss, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andreas Bulling, Andrés Bruhn

详情
英文摘要

Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that robustness varies widely by corruption type, and experimentally show that evaluations on RobustSpring indicate real-world robustness. RobustSpring is a new computer vision benchmark to treat robustness as a first-class citizen, fostering models that are accurate and resilient. It is available at https://spring-benchmark.org.

2504.04099 2026-04-14 cs.CV cs.AI

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Lei Jiang, Chunzhao Xie, Tongxuan Liu, Yuting Zeng, jinrong Guo, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu

Comments 8 pages, 9 figures

详情
英文摘要

Large Vision-Language Models have demonstrated remarkable capabilities, yet they suffer from hallucinations that limit practical deployment. While various mitigation strategies exist, they often incur high computational overhead or require extensive retraining. In this paper, we address the issue of visual attention decay during generation, a key factor contributing to hallucinations. We propose Temporal Attention Real-time Accumulative Connection (TARAC), a novel training-free framework that dynamically accumulates and re-injects historical attention to sustain visual grounding. Inspired by cognitive reinforcement mechanisms, TARAC operates as a lightweight, plug-and-play module. Extensive experiments across diverse models (e.g., LLaVA, Qwen2-VL) and benchmarks demonstrate that TARAC significantly outperforms state-of-the-art methods. Remarkably, it achieves these gains with negligible inference overhead ($\sim$4\% TPOT increase), compared to the substantial costs of existing training-free baselines. Specifically, TARAC reduces hallucinated sentences by 25.2\% on CHAIR and improves Perception score by +10.65 on MME, validating its effectiveness and efficiency.

2503.23514 2026-04-14 cs.CL cs.AI

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang

详情
英文摘要

Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches. Experiments on models like Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that nonparametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.

2503.23001 2026-04-14 cs.LG cs.GT

Quotation-Based Data Retention Mechanism for Data Privacy in LLM-Empowered Network Services

Bin Han, Di Feng, Zexin Fang, Jie Wang, Hans D. Schotten

Comments Accepted by IEEE ICC 2026 WKSPS

详情
英文摘要

The deployment of large language models (LLMs) for next-generation network optimization introduces novel data governance challenges. mobile network operators (MNOs) increasingly leverage generative artificial intelligence (AI) for traffic prediction, anomaly detection, and service personalization, requiring access to users' sensitive network usage data-including mobility patterns, traffic types, and location histories. Under the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and similar regulations, users retain the right to withdraw consent and demand data deletion. However, extensive machine unlearning degrades model accuracy and incurs substantial computational costs, ultimately harming network performance for all users. We propose an iterative price discovery mechanism enabling MNOs to compensate users for data retention through sequential price quotations. The server progressively raises the unit price for retaining data while users independently determine their supply at each quoted price. This approach requires no prior knowledge of users' privacy preferences and efficiently maximizes social welfare across the network ecosystem.

2503.15481 2026-04-14 cs.RO cs.AI cs.LG

Learning to Play Piano in the Real World

Yves-Simon Zeulner, Simon Crämer, Sandeep Selvaraj, Roberto Calandra

详情
英文摘要

Towards the grand challenge of achieving human-level manipulation in robots, playing piano is a compelling testbed that requires strategic, precise, and flowing movements. Over the years, several works demonstrated hand-designed controllers on real world piano playing, while other works evaluated robot learning approaches on simulated piano playing. In this work, we develop the first piano playing robotic system that makes use of learning approaches while also being deployed on a real world dexterous robot. Specifically, we use a Sim2Real2Sim approach where we iteratively alternate between training policies in simulation, deploying the policies in the real world, and use the collected real world data to update the parameters of the simulator. Using this approach we demonstrate that the robot can learn to play several piano pieces (including Are You Sleeping, Happy Birthday, Ode To Joy, and Twinkle Twinkle Little Star) in the real world accurately, reaching an average F1-score of 0.881. By providing this proof-of-concept, we want to encourage the community to adopt piano playing as a compelling benchmark towards human-level manipulation in the real world. We open-source our code and show additional videos at www.lasr.org/research/learning-to-play-piano .

2502.19731 2026-04-14 cs.CL

Preference Learning Unlocks LLMs' Psycho-Counseling Skills

Mian Zhang, Shaun M. Eack, Zhiyu Zoey Chen

Comments ACL 2026 Camera-Ready

详情
英文摘要

Applying large language models (LLMs) to assist in psycho-counseling is an emerging and meaningful approach, driven by the significant gap between patient needs and the availability of mental health support. However, current LLMs struggle to consistently provide effective responses to client speeches, largely due to the lack of supervision from high-quality real psycho-counseling data, whose content is typically inaccessible due to client privacy concerns. Furthermore, the quality of therapists' responses in available sessions can vary significantly based on their professional training and experience. Assessing the quality of therapists' responses remains an open challenge. In this work, we address these challenges by first proposing a set of professional and comprehensive principles to evaluate therapists' responses to client speeches. Using these principles, we create a preference dataset, PsychoCounsel-Preference, which contains 36k high-quality preference comparison pairs. This dataset aligns with the preferences of professional psychotherapists, providing a robust foundation for evaluating and improving LLMs in psycho-counseling. Experiments on reward modeling and preference learning demonstrate that PsychoCounsel-Preference is an excellent resource for LLMs to acquire essential skills for responding to clients in a counseling session. Our best-aligned model, PsychoCounsel-Llama3-8B, achieves an impressive win rate of 87% against GPT-4o. We release PsychoCounsel-Preference, PsychoCounsel-Llama3-8B and the reward model PsychoCounsel Llama3-8B-Reward to facilitate the research of psycho-counseling with LLMs at: https://hf.co/Psychotherapy-LLM.

2502.18026 2026-04-14 cs.LG cs.AI

ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation

Rikuto Kotoge, Ziwei Yang, Zheng Chen, Yushun Dong, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai

Comments Accepted at AAAI 2026 (Main Technical Track)

详情
英文摘要

Retrieving targeted pathways in biological knowledge bases, particularly when incorporating wet-lab experimental data, remains a challenging task and often requires downstream analyses and specialized expertise. In this paper, we frame this challenge as a solvable graph learning and explaining task and propose a novel subgraph inference framework, ExPAth, that explicitly integrates experimental data to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute more to classification can be considered as targeted pathways. Our framework can seamlessly integrate biological foundation models to encode the experimental molecular data. We propose ML-oriented biological evaluations and a new metric. The experiments involving 301 bio-networks evaluations demonstrate that pathways inferred by ExPath are biologically meaningful, achieving up to 4.5x higher Fidelity+ (necessity) and 14x lower Fidelity- (sufficiency) than explainer baselines, while preserving signaling chains up to 4x longer.

2502.07432 2026-04-14 cs.LG

CapyMOA: Efficient Machine Learning for Data Streams and Online Continual Learning in Python

Heitor Murilo Gomes, Anton Lee, Nuwan Gunasekara, Yibin Sun, Guilherme Weigert Cassales, Justin Liu, Marco Heyden, Vitor Cerqueira, Maroua Bahri, Yun Sing Koh, Bernhard Pfahringer, Albert Bifet

详情
英文摘要

CapyMOA is an open-source Python library for efficient machine learning on data streams and online continual learning. It provides a structured framework for real-time learning, supporting adaptive models that evolve over time. CapyMOA's architecture allows integration with frameworks such as MOA, scikit-learn and PyTorch, enabling the combination of high-performance online algorithms with modern deep learning techniques. By emphasizing efficiency, scalability, and usability, CapyMOA allows researchers and practitioners to tackle dynamic learning challenges across various domains. Website: https://capymoa.org. GitHub: https://github.com/adaptive-machine-learning/CapyMOA.

2502.02189 2026-04-14 cs.LG

deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models

Frederik Lizak Johansen, Ulrik Friis-Jensen, Erik Bjørnager Dam, Kirsten Marie Ørnsbjerg Jensen, Rocío Mercado, Raghavendra Selvan

Comments 24 pages, 18 figures, 8 tables. v2: Figure 8 revision. v3: added benchmarks, text revisions. v4: accepted to TMLR (https://openreview.net/forum?id=LftFQ35l47)

详情
英文摘要

Novel materials drive advancements in fields ranging from energy storage to electronics, with crystal structure characterization forming a crucial yet challenging step in materials discovery. In this work, we introduce \emph{deCIFer}, an autoregressive language model designed for powder X-ray diffraction (PXRD)-conditioned crystal structure prediction (PXRD-CSP). Unlike traditional CSP methods that rely primarily on composition or symmetry constraints, deCIFer explicitly incorporates PXRD data, directly generating crystal structures in the widely adopted Crystallographic Information File (CIF) format. The model is trained on nearly 2.3 million crystal structures, with PXRD conditioning augmented by basic forms of synthetic experimental artifacts, specifically Gaussian noise and instrumental peak broadening, to reflect fundamental real-world conditions. Validated across diverse synthetic datasets representative of challenging inorganic materials, deCIFer achieves a 94\% structural match rate. The evaluation is based on metrics such as the residual weighted profile ($R_{wp}$) and structural match rate (MR), chosen explicitly for their practical relevance in this inherently underdetermined problem. deCIFer establishes a robust baseline for future expansion toward more complex experimental scenarios, bridging the gap between computational predictions and experimental crystal structure determination.

2501.19227 2026-04-14 cs.CV cs.AI

Integrating Semi-Supervised and Active Learning for Semantic Segmentation

Wanli Ma, Oktay Karakus, Paul L. Rosin

详情
英文摘要

In this paper, we propose a novel active learning approach integrated with an improved semi-supervised learning framework to reduce the cost of manual annotation and enhance model performance. Our proposed approach effectively leverages both the labelled data selected through active learning and the unlabelled data excluded from the selection process. The proposed active learning approach pinpoints areas where the pseudo-labels are likely to be inaccurate. Then, an automatic and efficient pseudo-label auto-refinement (PLAR) module is proposed to correct pixels with potentially erroneous pseudo-labels by comparing their feature representations with those of labelled regions. This approach operates without increasing the labelling budget and is based on the cluster assumption, which states that pixels belonging to the same class should exhibit similar representations in feature space. Furthermore, manual labelling is only applied to the most difficult and uncertain areas in unlabelled data, where insufficient information prevents the PLAR module from making a decision. We evaluated the proposed hybrid semi-supervised active learning framework on two benchmark datasets, one from natural and the other from remote sensing imagery domains. In both cases, it outperformed state-of-the-art methods in the semantic segmentation task.

2501.18490 2026-04-14 cs.RO cs.AI

Curriculum-based Sample Efficient Reinforcement Learning for Robust Stabilization of a Quadrotor

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Shruti Kotpaliwar, George Nikolakopoulos

Comments 8 pages, 7 figures

详情
英文摘要

This article introduces a novel sample-efficient curriculum learning (CL) approach for training an end-to-end reinforcement learning (RL) policy for robust stabilization of a Quadrotor. The learning objective is to simultaneously stabilize position and yaw-orientation from random initial conditions through direct control over motor RPMs (end-to-end), while adhering to pre-specified transient and steady-state specifications. This objective, relevant in aerial inspection applications, is challenging for conventional one-stage end-to-end RL, which requires substantial computational resources and lengthy training times. To address this challenge, this article draws inspiration from human-inspired curriculum learning and decomposes the learning objective into a three-stage curriculum that incrementally increases task complexity, while transferring knowledge from one stage to the next. In the proposed curriculum, the policy sequentially learns hovering, the coupling between translational and rotational degrees of freedom, and robustness to random non-zero initial velocities, utilizing a custom reward function and episode truncation conditions. The results demonstrate that the proposed CL approach achieves superior performance compared to a policy trained conventionally in one stage, with the same reward function and hyperparameters, while significantly reducing computational resource needs (samples) and convergence time. The CL-trained policy's performance and robustness are thoroughly validated in a simulation engine (Gym-PyBullet-Drones), under random initial conditions, and in an inspection pose-tracking scenario. A video presenting our results is available at https://youtu.be/9wv6T4eezAU.

2412.17574 2026-04-14 cs.CV cs.AI

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, Ying Shen

Comments Accepted as a conference paper at CVPR 2026

详情
英文摘要

Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable ``machine'' for creating nuanced evaluation suites. Our extensive evaluation of 30 leading MLLMs on HumanVBench reveals critical deficiencies, particularly in perceiving subtle emotions and aligning speech with visual cues, with even top proprietary models falling short of human performance. We open-source HumanVBench and our synthesis pipelines to catalyze the development of more socially intelligent and capable video MLLMs.

2412.10273 2026-04-14 cs.CV cs.LG

How to Spin an Object: First, Get the Shape Right

Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra

详情
英文摘要

Image-to-3D models increasingly rely on hierarchical generation to disentangle geometry and texture. However, the design choices underlying these two-stage models--particularly the optimal choice of intermediate geometric representations--remain largely understudied. To investigate this, we introduce unPIC (undo-a-Picture), a modular framework for empirical analysis of image-to-3D pipelines. By factorizing the generation process into a multiview-geometry prior followed by an appearance decoder, unPIC enables a rigorous comparison of intermediate geometry representations. Through this framework, we identify that a specific representation, Camera-Relative Object Coordinates (CROCS), significantly outperforms alternatives such as depth maps, pretrained visual features, and other pointmap-based representations. We demonstrate that CROCS is not only easier for the first-stage geometry prior to predict, but also serves as an effective conditioning signal for ensuring 360-degree consistency during appearance decoding. Another advantage is that CROCS enables fully feedforward, direct 3D point cloud generation without requiring a separate post-hoc reconstruction step. Our unPIC formulation utilizing CROCS achieves superior novel-view quality, geometric accuracy, and multiview consistency; it outperforms leading baselines, including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet, on datasets of real-world 3D captures like Google Scanned Objects and the Digital Twin Catalog.

2410.14383 2026-04-14 cs.RO

MARLIN: Multi-Agent Reinforcement Learning Guided by Language-Based Inter-Robot Negotiation

Toby Godfrey, William Hunt, Mohammad D. Soorati

Comments 15 pages, 8 figures, 1 table

详情
英文摘要

Multi-agent reinforcement learning is a key method for training multi-robot systems. Through rewarding or punishing robots over a series of episodes according to their performance, they can be trained and then deployed in the real world. However, poorly trained policies can lead to unsafe behaviour during early training stages. We introduce Multi-Agent Reinforcement Learning guided by language-based Inter-robot Negotiation (MARLIN), a hybrid framework in which large language models provide high-level planning before the reinforcement learning policy has learned effective behaviours. Robots use language models to negotiate actions and generate plans that guide policy learning. The system dynamically switches between reinforcement learning and language-model-based negotiation during training, enabling safer and more effective exploration. MARLIN is evaluated using both simulated and physical robots with local and remote language models. Results show that, compared to standard multi-agent reinforcement learning, the hybrid approach achieves higher performance in early training without reducing final performance. The code is available at https://github.com/SooratiLab/MARLIN.

2408.07587 2026-04-14 cs.LG cs.DC

FedQUIT: On-Device Federated Unlearning via a Quasi-Competent Virtual Teacher

Alessio Mora, Lorenzo Valerio, Paolo Bellavista, Andrea Passarella

详情
英文摘要

Federated Learning (FL) enables the collaborative training of machine learning models without requiring centralized collection of user data. To comply with the right to be forgotten, FL clients should be able to request the removal of their data contributions from the global model. In this paper, we propose FedQUIT, a novel unlearning algorithm that operates directly on client devices that request to remove its contribution. Our method leverages knowledge distillation to remove the influence of the target client's data from the global model while preserving its generalization ability. FedQUIT adopts a teacher-student framework, where a modified version of the current global model serves as a virtual teacher and the client's model acts as the student. The virtual teacher is obtained by manipulating the global model's outputs on forget data, penalizing the confidence assigned to the true class while preserving relationships among outputs of non-true classes, to simultaneously induce forgetting and retain useful knowledge. As a result, FedQUIT achieves unlearning without making any additional assumption over the standard FedAvg protocol. Evaluation across diverse datasets, data heterogeneity levels, and model architectures shows that FedQUIT achieves superior or comparable unlearning efficacy compared to six state-of-the-art methods, while significantly reducing cumulative communication and computational overhead relative to retraining from scratch.