arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 3420
专题追踪
2508.11281 2026-04-21 cs.CL cs.AI cs.CY

ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu

Comments 22 pages, 5 figures, 11 tables. This paper introduces TOXIFRENCH, a benchmark of 53,622 comments for French toxicity detection. It proposes a Chain-of-Thought fine-tuning method with a dynamic weighted loss. The fine-tuned 4B model (Qwen3-4B) achieves state-of-the-art performance, outperforming larger models like GPT-4o and DeepSeek-R1

详情
英文摘要

Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, human-annotated, large-scale datasets. In this work, we release ToxiFrench, a dataset of 53,622 French online comments together with a balanced benchmark split for systematic evaluation. The dataset is constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification, while ensuring statistical alignment with human-only annotation. We then benchmark a broad range of models and uncover a counterintuitive finding: Small Language Models (SLMs) often surpass larger models in robustness and generalization on this task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a Dynamic Weighted Loss (DWL) that progressively emphasizes the model's final decision and significantly improves faithfulness. Our fine-tuned 4B model (Qwen3-4B) achieves state-of-the-art performance on the benchmark. It improves its balanced accuracy by 10% over its baseline and achieves better performance than GPT-4o and DeepSeek-R1 on our benchmark, while successfully retaining cross-lingual capabilities.

2508.11184 2026-04-21 cs.CL

Tailoring Diagnostic Modeling to Individual Learners: Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

Tao Wu, Jingyuan Chen, Wang Lin, Jian Zhan, Mengze Li, Fangzhou Jin, Min Zhang, Kun Kuang, Fei Wu

详情
英文摘要

Distractors-incorrect yet plausible answer choices in multiple-choice questions (MCQs)-are vital in educational assessments, as they help identify student misconceptions by presenting potential reasoning errors. Current distractor generation methods typically produce shared distractors for all students, ignoring the individual variations in reasoning, which limits their diagnostic effectiveness. To tackle this challenge, we introduce the task of Personalized Distractor Generation, which tailors distractors to each student's specific cognitive flaws, inferred from their past question-answering (QA) history. While promising, this task is particularly demanding due to the limited number of QA records available for each student, which are insufficient for training, as well as the absence of their underlying reasoning process. To overcome this, we propose a novel, training-free two-stage framework. In the first stage, Monte Carlo Tree Search (MCTS) is used to reconstruct the student's reasoning process from past errors, creating a student-specific misconception prototype. In the second stage, this prototype guides the simulation of the student's reasoning on new questions, generating personalized distractors that resonate with their individual misconceptions. Our experiments, conducted on 1,361 students across 6 subjects, demonstrate that this approach outperforms existing methods in generating plausible, personalized distractors, and also effectively adapts to group-level settings, highlighting its robustness and versatility.

2508.08775 2026-04-21 cs.SD cs.GR cs.NA math.NA

SonicRadiation: A Hybrid Numerical Solution for Sound Radiation without Ghost Cells

Xutong Jin, Fei Zhu, Guoping Wang, Sheng Li

Comments 11 pages

详情
英文摘要

Interactive synthesis of physical sound effects is crucial in digital media production. Sound radiation simulation, a key component of physically based sound synthesis, has posed challenges in the context of complex object boundaries. Previous methods, such as ghost cell-based finite-difference time-domain (FDTD) wave solver, have struggled to address these challenges, leading to large errors and failures in complex boundaries because of the limitation of ghost cells. We present SonicRadiation, a hybrid numerical solution capable of handling complex and dynamic object boundaries in sound radiation simulation without relying on ghost cells. We derive a consistent formulation to connect the physical quantities on grid cells in FDTD with the boundary elements in the time-domain boundary element method (TDBEM). Hereby, we propose a boundary grid synchronization strategy to seamlessly integrate TDBEM with FDTD while maintaining high numerical accuracy. Our method holds both advantages from the accuracy of TDBEM for the near-field and the efficiency of FDTD for the far-field. Experimental results demonstrate the superiority of our method in sound radiation simulation over previous approaches in terms of accuracy and efficiency, particularly in complex scenes, further validating its effectiveness.

2508.02204 2026-04-21 cs.RO

TacMan-Turbo: Proactive Tactile Control for Robust and Efficient Articulated Object Manipulation

Zihang Zhao, Zhenghao Qi, Yuyang Li, Leiyao Cui, Zhi Han, Lecheng Ruan, Yixin Zhu

Comments Accepted for publication in the IEEE Transactions on Automation Science and Engineering (T-ASE)

详情
英文摘要

Adept manipulation of articulated objects is essential for robots to operate successfully in human environments. Such manipulation requires both effectiveness--reliable operation despite uncertain object structures--and efficiency--swift execution with minimal redundant steps and smooth actions. Existing approaches struggle to achieve both objectives simultaneously: methods relying on predefined kinematic models lack effectiveness when encountering structural variations, while tactile-informed approaches achieve robust manipulation without kinematic priors but compromise efficiency through reactive, step-by-step exploration-compensation cycles. This paper introduces TacMan-Turbo, a novel proactive tactile control framework for articulated object manipulation that mitigates this fundamental trade-off. Unlike previous approaches that treat tactile contact deviations merely as error signals requiring compensation, our method interprets these deviations as rich sources of local kinematic information. This new perspective enables our controller to predict optimal future interactions and make proactive adjustments, significantly enhancing manipulation efficiency. In comprehensive evaluations across 200 diverse simulated articulated objects and real-world experiments, our approach maintains a 100% success rate while significantly outperforming the previous tactile-informed method in time efficiency, action efficiency, and trajectory smoothness (all p-values < 0.0001). These results demonstrate that the long-standing trade-off between effectiveness and efficiency in articulated object manipulation can be successfully resolved without relying on prior kinematic knowledge.

2508.01486 2026-04-21 cs.CL

Human-Centered Supervision for Sentiment Analysis in Telugu: A Systematic Inquiry Beyond Accuracy

Vallabhaneni Raj Kumar, Ashwin S, Supriya Manna, Niladri Sett, Cheedella V S N M S Hema Harshitha, Kurakula Harshitha, Anand Kumar Sharma, Basina Deepakraj, Tanuj Sarkar, Bondada Navaneeth Krishna, Samanthapudi Shakeer

Comments Camera-ready version; ACL Findings, 2026

详情
英文摘要

Sentiment analysis for low-resource languages remains challenging in an era where interpretability, human alignment, and fairness are increasingly non-negotiable aspects of modern machine learning systems. These challenges stem both from the scarcity of annotated data and from the resulting difficulty of conducting reliable, human-interpretable analyses that go beyond predictive accuracy. Telugu, one of the primary Dravidian languages with over 96 million speakers, is not an exception. In this work, we first introduce TeSent, a large-scale Telugu sentiment classification dataset annotated with sentiment labels and human-selected rationales from multiple native speakers. This resource enables the study of rationale-based supervision for aligning models with human reasoning in this low-resource setting. We fine-tune five transformer-based models with and without rationale supervision and evaluate them on classification performance, explanation quality, and social bias. To facilitate controlled fairness evaluation, we additionally construct TeEEC, an evaluation corpus for Telugu sentiment analysis. Our results show that incorporating human rationales consistently improves alignment and often leads to holistic gains in predictive performance. We further provide extensive analysis of multi-facade explanation quality and fairness, offering insights into the broader effects of alignment-oriented supervision in resource-scarce language contexts.

2508.01330 2026-04-21 cs.AI

NaturalGAIA: A Verifiable Benchmark and Hierarchical Framework for Long-Horizon GUI Tasks

Zihan Zheng, Tianle Cui, Taoran Wang, Fengtao Wang, Jiahui Pan, Lewei He, Qianglong Chen

详情
英文摘要

Despite significant advances in LLM-driven GUI agents, the field remains constrained by the challenge of reconciling high-fidelity realism with verifiable evaluation accuracy. To address this, we introduce NaturalGAIA, a verifiable evaluation dataset grounded in real-world human GUI interaction intents. By decoupling logical causal pathways from linguistic narratives, it rigorously simulates natural human intent, characterized by cognitive non-linearity and contextual dependencies. Furthermore, we propose LightManus-Jarvis, a hierarchical collaborative framework where LightManus manages dynamic topological planning and context evolution, while Jarvis~ensures execution precision via hybrid visual-structural perception. Experiments demonstrate that our approach achieves a Weighted Pathway Success Rate of 45.6%, significantly outperforming the state-of-the-art baseline (21.1%), while reducing token consumption by 75% and execution time by 76%. These results validate the efficacy of the macro-planning and micro-execution paradigm in handling complex naturalized tasks. Our code is publicly available at: https://github.com/KeLes-Coding/NatureGAIA.

2508.00553 2026-04-21 cs.CV

HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen, Weili Guan, Yaowei Wang

Comments Accepted at ACL-2026 Findings

详情
英文摘要

Vision-Language Models (VLMs) encode images and videos into abundant tokens, which contain substantial redundancy and computation cost. While visual token pruning mitigates the issue, most existing methods lack insight into the intrinsic property of the vision encoder itself. In this work, we dive into the vision encoder and prove that the middle layers pay more attention to the main objects of the image qualitatively and quantitatively, while the deep layers to tokens with rich global information. Utilizing this Hierarchical attention pattern, we propose HiPrune, a training-free and model-agnostic token Pruning method. HiPrune identifies three types of visual tokens according to their attention in different phases of the vision encoder, which preserves different levels of information. By coupling with the similarity of text tokens, we propose a prompt-aware variance, HiPrune++, which further improves instruction following performance under a very low token budget. Extensive experiments across four representative VLMs show that HiPrune achieves up to 99.3% of task accuracy with only 1/3 of the tokens, while reducing inference FLOPs by 58.7%. HiPrune++ maintains up to 99.7% accuracy with 2/9 tokens, highlighting robustness under high-resolution. Our code is available at https://github.com/Danielement321/HiPrune.

2508.00040 2026-04-21 cs.LG math.PR stat.AP stat.ML

Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting

Abhinav Das, Stephan Schlüter

Journal ref Energy Economics 157 (2026) 109233

详情
英文摘要

This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology integrates regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to daily electricity prices. Each identified regime is subsequently modeled by an independent conditional neural process (CNP), trained to learn localized mappings from input contexts to 24-dimensional hourly price trajectories, with final predictions computed as regime-weighted mixtures of these CNP outputs. We rigorously evaluate R-NP against deep neural networks (DNN) and Lasso estimated auto-regressive (LEAR) models by integrating their forecasts into diverse battery storage optimization frameworks, including price arbitrage, risk management, grid services, and cost minimization. This operational utility assessment revealed complex performance trade-offs: LEAR often yielded superior absolute profits or lower costs, while DNN showed exceptional optimality in specific cost-minimization contexts. Recognizing that raw prediction accuracy doesn't always translate to optimal operational outcomes, we employed TOPSIS as a comprehensive multi-criteria evaluation layer. Our TOPSIS analysis identified LEAR as the top-ranked model for 2021, but crucially, our proposed R-NP model emerged as the most balanced and preferred solution for 2021, 2022 and 2023.

2507.21934 2026-04-21 cs.CL cs.AI cs.CY cs.IR cs.LG

Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

Tianyi Hu, Andrea Morales-Garzón, Jingyi Zheng, Maria Maistro, Daniel Hershcovich

Comments ACL 2026 (main)

详情
英文摘要

In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish's essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.

2507.20879 2026-04-21 cs.CV

DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking

Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao

Comments Accepted to ICLR 2026

详情
英文摘要

The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model's capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, the first autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence, thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to top closed model systems such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.

2507.06056 2026-04-21 cs.CL cs.AI

Data Compressibility Quantifies LLM Memorization

Yizhan Huang, Zhe Yang, Meifang Chen, Huang Nianchen, Jianping Zhang, Michael R. Lyu

Comments accepted by TMLR

详情
英文摘要

Large Language Models (LLMs) are known to memorize portions of their training data, sometimes even reproduce content verbatim when prompted appropriately. Despite substantial interest, existing LLM memorization research has offered limited insight into how training data influences memorization and largely lacks quantitative characterization. In this work, we build upon the line of research that seeks to quantify memorization through data compressibility. We analyze why prior attempts fail to yield a reliable quantitative measure and show that a surprisingly simple shift from instance-level to set-level metrics uncovers a robust phenomenon, which we term the \textit{Entropy--Memorization (EM) Linearity}. This law states that a set-level data entropy estimator exhibits a linear correlation with memorization scores.

2507.05920 2026-04-21 cs.CV

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu

Comments Accepted by ACL(Findings) 2026

详情
英文摘要

State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach highlights that LMMs can emerge robust grounding abilities during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to 5.4\% improvement on in-distribution MME-Realworld and 5.2\% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.

2506.18141 2026-04-21 cs.CL cs.AI

Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai

Comments ACL 2026

详情
英文摘要

We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on concept-relation prediction tasks, we show that ablating these components for concepts (e.g., countries and words) and relations (e.g., capital city and translation language) changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and concept components yields compound counterfactual outputs. Further analysis reveals that while most concept components emerge from the very first layer, more abstract relation components are concentrated in later layers. Lastly, we show that extracted components more comprehensively capture concepts and relations than individual features while maintaining specificity. Overall, our findings suggest a modular organization of knowledge and advance methods for efficient, targeted LLM manipulation.

2506.13674 2026-04-21 cs.CL cs.AI

PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention

Haonan Wang, Brian Chen, Siquan Li, Xinhe Liang, Hwee Kuan Lee, Kenji Kawaguchi, Tianyang Hu

Comments ICLR 2026

详情
英文摘要

Parameter-Efficient Fine-Tuning (PEFT) methods have become crucial for rapidly adapting large language models (LLMs) to downstream tasks. Prefix-Tuning, an early and effective PEFT technique, demonstrated the ability to achieve performance comparable to full fine-tuning with significantly reduced computational and memory overhead. However, despite its earlier success, its effectiveness in training modern state-of-the-art LLMs has been very limited. In this work, we demonstrate empirically that prefix-tuning underperforms on LLMs because of an inherent tradeoff between the contribution of the input prompt and the parameterized prefix within the attention head. This motivates us to introduce PrefixMemory-Tuning, an architecture that generalizes the principles of prefix-tuning while addressing its shortcomings by shifting the prefix module out of the attention head itself and improving its expressiveness. Our experiments show that, across diverse benchmarks, PrefixMemory-Tuning consistently outperforms existing prefix-tuning methods. Notably, it achieves competitive performance with modern PEFTs on several general benchmarks, highlighting a potential extension of prefix-tuning approaches to become state-of-the-art. Our findings suggest that by overcoming its inherent limitations, prefix-tuning can remain a competitive and relevant research direction in the landscape of parameter-efficient LLM adaptation.

2506.12622 2026-04-21 cs.LG cs.AI math.OC

DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty

Mingxuan Cui, Duo Zhou, Yuxuan Han, Grani A. Hanasusanto, Qiong Wang, Huan Zhang, Zhengyuan Zhou

Comments 31 Pages. Accepted to ICLR 2026

详情
英文摘要

Deep reinforcement learning (RL) has achieved remarkable success, yet its deployment in real-world scenarios is often limited by vulnerability to environmental uncertainties. Distributionally robust RL (DR-RL) algorithms have been proposed to resolve this challenge, but existing approaches are largely restricted to value-based methods in tabular settings. In this work, we introduce Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic based DR-RL algorithm for offline learning in continuous action spaces. DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within an KL-divergence constrained uncertainty set. We derive the distributionally robust version of the soft policy iteration with a convergence guarantee and incorporate a generative modeling approach to estimate the unknown nominal transition models. Experiment results on five continuous RL tasks demonstrate our algorithm achieves up to 9.8 times higher average reward than the SAC baseline under common perturbations. Additionally, DR-SAC significantly improves computing efficiency and applicability to large-scale problems compared with existing DR-RL algorithms. Code is publicly available at github.com/Lemutisme/DR-SAC.

2506.12176 2026-04-21 cs.LG stat.ML

"Faithful to What?" On the Limits of Fidelity-Based Explanations

Jackson Eshbaugh

Comments 6 pages, 3 figures, 3 tables. Accepted at the Workshop on Scientific Methods for Understanding Deep Learning (Sci4DL) at ICLR 2026. Code available at https://github.com/jacksoneshbaugh/lambda-linearity-score/tree/main

详情
英文摘要

In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network's predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score $λ(f)$, a diagnostic that quantifies the extent to which a regression network's input--output behavior is linearly decodable. $λ(f)$ is defined as an $R^2$ measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.

2506.10779 2026-04-21 cs.CL cs.AI

Improving Speech Recognition of Named Entities in Classroom Speech with LLM Revision and Phonetic-Semantic Context

Viet Anh Trinh, Xinlu He, Jacob Whitehill

详情
英文摘要

Classroom speech and lectures often contain named entities (NEs) such as names of people and special terminology. While automatic speech recognition (ASR) systems have achieved remarkable performance on general speech, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since NE are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision pipeline to revise incorrect NEs in ASR predictions by leveraging not only the LLM's world knowledge and reasoning ability but also the available phonetic and semantic context. We also introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30\% relative WER reduction for NEs.

2506.10137 2026-04-21 cs.LG cs.AI

Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

Daniel Lawson, Adriana Hugessen, Charlotte Cloutier, Glen Berseth, Khimya Khetarpal

详情
英文摘要

While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally correlated states are properly encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. We formalize this notion by demonstrating how encouraging long-range temporal consistency via successor representations (SR) can facilitate generalization. We then propose a simple yet effective representation learning objective, $\text{BYOL-}γ$ for GCBC, which theoretically approximates the successor representation in the finite MDP case through self-predictive representations, and achieves competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.

2506.07969 2026-04-21 cs.LG physics.flu-dyn

A Two-Phase Deep Learning Framework for Adaptive Time-Stepping in High-Speed Flow Modeling

Jacob Helwig, Sai Sreeharsha Adavi, Xuan Zhang, Yuchao Lin, Felix S. Chim, Luke Takeshi Vizzini, Haiyang Yu, Muhammad Hasnain, Saykat Kumar Biswas, John J. Holloway, Narendra Singh, N. K. Anand, Swagnik Guhathakurta, Shuiwang Ji

详情
英文摘要

We consider the problem of modeling high-speed flows using machine learning methods. While most prior studies focus on low-speed fluid flows in which uniform time-stepping is practical, flows approaching and exceeding the speed of sound exhibit sudden changes such as shock waves. In such cases, it is essential to use adaptive time-stepping methods to allow a temporal resolution sufficient to resolve these phenomena while simultaneously balancing computational costs. Here, we propose a two-phase machine learning method, known as ShockCast, to model high-speed flows with adaptive time-stepping. In the first phase, we propose to employ a machine learning model to predict the timestep size. In the second phase, the predicted timestep is used as an input along with the current fluid fields to advance the system state by the predicted timestep. We explore several physically-motivated components for timestep prediction and introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts. We evaluate our methods by generating three supersonic flow datasets, available at https://huggingface.co/divelab. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).

2506.07826 2026-04-21 cs.CV cs.LG cs.RO

R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan

详情
英文摘要

Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we release our code, see https://research.zenseact.com/publications/R3D2/.

2506.06485 2026-04-21 cs.CL cs.AI

Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict

Kaiser Sun, Fan Bai, Mark Dredze

Comments ACL2026 Camera Ready

详情
英文摘要

Large language models (LLMs) draw on both contextual information and parametric memory, yet these sources can conflict. Prior studies have largely examined this issue in contextual question answering, implicitly assuming that tasks should rely on the provided context, leaving unclear how LLMs behave when tasks require different types and degrees of knowledge utilization. We address this gap with a model-agnostic diagnostic framework that holds underlying knowledge constant while introducing controlled conflicts across tasks with varying knowledge demands. Experiments on representative open-weight and proprietary LLMs show that performance degradation under conflict is driven by both task-specific knowledge reliance and conflict plausibility; that strategies such as rationales or context reiteration increase context reliance, helping context-only tasks but harming those requiring parametric knowledge; and that these effects bias model-based evaluation, calling into question the reliability of LLMs as judges. Overall, our findings reveal that context-memory conflict is inherently task-dependent and motivate task-aware approaches to balancing context and memory in LLM deployment and evaluation.

2506.05760 2026-04-21 cs.CL

Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning

Xuanyu Lei, Chenliang Li, Yuning Wu, Kaiming Liu, Weizhou Shen, Peng Li, Ming Yan, Fei Huang, Ya-Qin Zhang, Yang Liu

Comments Code is released at https://github.com/Tongyi-Zhiwen/Writing-RL

详情
英文摘要

Recent advances in Large Language Models(LLMs) have enabled strong performance in long-form writing, but current training paradigms remain limited: Supervised Fine-Tuning (SFT) remains constrained by data saturation and performance ceilings, while Reinforcement Learning with Verifiable Reward (RLVR), though successful in verifiable domains like math and code, cannot be directly migrated to open-ended long-form writing due to a lack of ground-truths. To further advance long-form writing, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: Margin-aware Data Selection strategy that prioritizes samples with high learning potential, Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and Dynamic Reference Scheduling approach, which plays a critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that Writing-RL effectively improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.

2506.02718 2026-04-21 cs.LG cs.AI

End-to-End Optimization of LLM-Driven Multi-Agent Search Systems via Heterogeneous-Group-Based Reinforcement Learning

Guanzhong Chen, Shaoxiong Yang, Chao Li, Wei Liu, Jian Luan, Zenglin Xu

Comments Accepted to ACL 2026 Main Conference. 20 pages, 9 figures

详情
英文摘要

Large language models (LLMs) are versatile, yet their deployment in complex real-world settings is limited by static knowledge cutoffs and the difficulty of producing controllable behavior within a single inference. Multi-agent search systems (MASS), which coordinate specialized LLM agents equipped with search tools, mitigate these issues via task decomposition and retrieval-augmented problem solving. However, optimizing LLMs for agent-specific roles remains labor-intensive with prompt engineering or supervised fine-tuning, motivating automated end-to-end training. Existing multi-agent reinforcement learning (MARL) methods such as Multi-Agent Proximal Policy Optimization (MAPPO) typically depend on large critic networks to evaluate joint actions, leading to instability and high memory costs. We introduce Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which updates policies by estimating relative advantages across heterogeneous groups of multi-agent rollouts, shifting the optimization focus from local agent performance to global system success. We further study three group rollout sampling strategies to trade off sample efficiency and optimization quality. Experiments show that MHGPO captures implicit inter-agent dependencies and consistently outperforms strong baselines in both task performance and computational efficiency.

2506.02541 2026-04-21 cs.LG cs.AI cs.CV

Rethinking Post-Unlearning Behavior of Large Vision-Language Models

Minsung Kim, Nakyeong Yang, Kyomin Jung

Comments 11 pages

详情
英文摘要

Large Vision-Language Models (LVLMs) can recognize individuals in images and disclose sensitive personal information about them, raising critical privacy concerns. Machine unlearning aims to remove such knowledge from the model. However, existing methods rarely prescribe what the model should output in place of the forgotten content, leading to Unlearning Aftermaths: degenerate, hallucinated, or excessively refused responses. We argue that, especially for generative LVLMs, it is crucial to consider the quality and informativeness of post-unlearning responses rather than relying solely on naive suppression. To address this, we introduce a new unlearning task for LVLMs that requires models to provide privacy-preserving yet informative and visually grounded responses. We also propose PUBG, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution. Experiments show that, while existing methods suffer from Unlearning Aftermaths despite successfully preventing privacy violations, PUBG effectively mitigates these issues, generating visually grounded and informative responses without privacy leakage for forgotten targets.

2506.01942 2026-04-21 cs.CV

OD3: Optimization-free Dataset Distillation for Object Detection

Salwa K. Al Khatib, Ahmed ElHagry, Shitong Shao, Zhiqiang Shen

详情
英文摘要

Training large neural networks on large-scale datasets requires substantial computational resources, particularly for dense prediction tasks such as object detection. Although dataset distillation (DD) has been proposed to alleviate these demands by synthesizing compact datasets from larger ones, most existing work focuses solely on image classification, leaving the more complex detection setting largely unexplored. In this paper, we introduce OD3, a novel optimization-free data distillation framework specifically designed for object detection. Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images based on their suitable locations, and second, a candidate screening process using a pre-trained observer model to remove low-confidence objects. We perform our data synthesis framework on MS COCO and PASCAL VOC, two popular detection datasets, with compression ratios ranging from 0.25% to 5%. Compared to the prior solely existing dataset distillation method on detection and conventional core set selection methods, OD3 delivers superior accuracy, establishes new state-of-the-art results, surpassing prior best method by more than 14% on COCO mAP50 at a compression ratio of 1.0%. Code is available at: https://github.com/VILA-Lab/OD3.

2505.23941 2026-04-21 cs.LG cs.CV

Vision Language Models are Biased

An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim

Comments Code and qualitative examples are available at: vlmsarebiased.github.io

详情
英文摘要

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answers. In this work, we test how the knowledge about popular subjects hurt the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize the 4th stripe has been added to a 3-stripe Adidas logo) scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.

2505.22226 2026-04-21 cs.CV

Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products

Xuyang Zhang, Xi Zhang, Liang Chen, Hao Shi, Qingshan Guo

Comments Accepted by ICLR 2026. Camera-ready Version. 24 pages, 9 figures

详情
英文摘要

Recent theoretical advances reveal that the Hadamard product induces nonlinear representations and implicit high-dimensional mappings for the field of deep learning, yet their practical deployment in resource-constrained vision models remains largely unexplored. To address this gap, we introduce the Adaptive Cross-Hadamard (ACH) module, a novel operator that embeds learnability through differentiable discrete sampling and dynamic softsign normalization. This facilitates highly efficient feature reuse without incurring additional convolutional parameters, while ensuring stable gradient flow. Integrated into Hadaptive-Net (Hadamard Adaptive Network) via neural architecture search, our approach achieves unprecedented efficiency. Comprehensive experiments demonstrate state-of-the-art accuracy/speed trade-offs on image classification tasks, establishing Hadamard operations as specific building blocks for efficient vision models.

2505.21282 2026-04-21 cs.RO

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

Timur Akhtyamov, Mohamad Al Mdfaa, Javier Antonio Ramirez Benavides, Arthur Nigmatzyanov, Sergey Bakulin, German Devchich, Denis Fatykhov, Diego Ruiz Salinas, Alexander Mazurov, Kristina Zipa, Malik Mohrat, Pavel Kolesnik, Ivan Sosin, Gonzalo Ferrer

Comments This work has been submitted to the IEEE for possible publication

详情
英文摘要

Data-driven navigation algorithms are critically dependent on large-scale, high-quality real-world data collection for successful training and robust performance in realistic and uncontrolled conditions. To enhance the growing family of navigation-related real-world datasets, we introduce EgoWalk - a dataset of 50 hours of human navigation in a diverse set of indoor/outdoor, varied seasons, and location environments. Along with the raw and Imitation Learning-ready data, we introduce several pipelines to automatically create subsidiary datasets for other navigation-related tasks, namely natural language goal annotations and traversability segmentation masks. Diversity studies, use cases, and benchmarks for the proposed dataset are provided to demonstrate its practical applicability. We openly release all data processing pipelines and the description of the hardware platform used for data collection to support future research and development in robot navigation systems.

2505.20211 2026-04-21 cs.LG cs.AI

PiCa: Parameter-Efficient Fine-Tuning with Column Space Projection

Junseo Hwang, Wonguk Cho, Taesup Kim

详情
英文摘要

Fine-tuning large foundation models is essential for building expert models tailored to specialized tasks and domains, but fully updating billions of parameters is computationally prohibitive. Reducing the number of trainable parameters using Parameter-Efficient Fine-Tuning (PEFT), such as Low-Rank Adaptation (LoRA), is therefore crucial not only to reduce training costs but also to mitigate storage, caching, and serving overheads during deployment. Prior works, such as Singular Vectors-guided Fine-Tuning (SVFT), have shown that exploiting the geometry of pre-trained weights based on Singular Value Decomposition (SVD) can significantly improve parameter-efficiency, but they lack a solid theoretical foundation. In this paper, we introduce Parameter-Efficient Fine-Tuning with Column Space Projection (PiCa), a novel theoretically grounded PEFT method. We prove that projecting gradients onto the principal column space of pre-trained weights provides an effective inductive bias for adaptation and further enhance parameter efficiency through a novel weight-sharing strategy. Across diverse NLP and vision tasks, PiCa consistently outperforms state-of-the-art baselines under comparable or smaller parameter budgets, demonstrating both theoretical rigor and practical effectiveness.

2505.19237 2026-04-21 cs.AI cs.RO

Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots

Iñaki Dellibarda Varela, Pablo Romero-Sorozabal, Diego Torricelli, Gabriel Delgado-Oleas, Jose Ignacio Serrano, Maria Dolores del Castillo Sobrino, Eduardo Rocon, Manuel Cebrian

Comments 16 pages, 3 figures, 1 table

详情
英文摘要

Self-recognition -- the ability to maintain an internal representation of one's own body within the environment -- underpins intelligent, autonomous behavior. As a foundational component of the minimal self, self-recognition provides the initial substrate from which higher forms of self-awareness may eventually emerge. Recent advances in large language models achieve human-like performance in tasks integrating multimodal information, raising growing interest in the embodiment capabilities of AI agents deployed on nonhuman platforms such as robots. We investigate whether multimodal LLMs can develop self-recognition through sensorimotor experience by integrating an LLM into an autonomous mobile robot. The system exhibits robust environmental awareness, self-identification, and predictive awareness, enabling it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of the minimal self and their coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs demonstrate compensatory interactions among sensors and confirm the essential role of structured and episodic memory. Given appropriate sensory information about the world and itself, multimodal LLMs open the door to artificial selfhood in embodied cognitive systems.